Bottom-Up and Top-Down Attention for Visual Question Answering

An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.

The implementation follows the VQA system described in "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering" (https://arxiv.org/abs/1707.07998) and "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge" (https://arxiv.org/abs/1708.02711).

Results

Model               Validation Accuracy   Training Time
Reported Model      63.15                 12 - 18 hours (Tesla K40)
Implemented Model   63.58                 40 - 50 minutes (Titan Xp)

The accuracy was calculated using the VQA evaluation metric.
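
For reference, the metric gives partial credit based on agreement with the ten human-annotated answers. Below is a minimal sketch of the commonly used soft-accuracy formula; the official evaluation also normalizes answer strings and averages over subsets of annotators, which is omitted here:

```python
def vqa_accuracy(predicted_answer, human_answers):
    """Soft VQA accuracy: an answer counts as fully correct if at least
    3 of the 10 human annotators gave it, and partially correct otherwise."""
    matches = sum(1 for a in human_answers if a == predicted_answer)
    return min(matches / 3.0, 1.0)

# 2 of 10 annotators agree -> accuracy 2/3
print(vqa_accuracy("2", ["2", "2", "two", "3", "two", "3", "two", "2 dogs", "3", "two"]))
```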

About

This is part of a project done at CMU for the course 11-777 Advanced Multimodal Machine Learning, and is joint work by Hengyuan Hu, Alex Xiao, and Henry Huang.

As part of our project, we implemented bottom-up attention as a strong VQA baseline. We were planning to integrate object detection with VQA and were very glad to see that Peter Anderson and Damien Teney et al. had already done that beautifully. We hope this clean and efficient implementation can serve as a useful baseline for future VQA explorations.

Implementation Details

Our implementation follows the overall structure of the papers but with the following simplifications:

  1. We don't use extra data from Visual Genome.
  2. We use only a fixed number of objects per image (K=36).
  3. We use a simple, single-stream classifier without pre-training (a minimal sketch follows this list).
  4. We use the simple ReLU activation instead of gated tanh.
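
To illustrate points 3 and 4, here is a minimal sketch of a single-stream classifier with ReLU activations; layer sizes and the number of candidate answers are illustrative, not the exact configuration in this repo:

```python
import torch.nn as nn

class SimpleClassifier(nn.Module):
    """Single-stream classifier: fused question/image features -> answer scores."""
    def __init__(self, in_dim=1024, hid_dim=2048, num_answers=3000, dropout=0.5):
        super(SimpleClassifier, self).__init__()
        self.main = nn.Sequential(
            nn.Linear(in_dim, hid_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hid_dim, num_answers),
        )

    def forward(self, joint_repr):
        # joint_repr: (batch, in_dim) fused question/image embedding
        return self.main(joint_repr)  # (batch, num_answers) logits
```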

The first two points greatly reduce the training time. Our implementation takes around 200 seconds per epoch on a single Titan Xp while the one described in the paper takes 1 hour per epoch.

The third point is simply because we feel the two-stream classifier and pre-training in the original paper are over-complicated and unnecessary.

For the non-linear activation unit, we tried gated tanh but couldn't make it work. We also tried the gated linear unit (GLU), which works better than ReLU. Eventually we chose ReLU for its simplicity, since the gain from GLU is too small to justify the fact that GLU doubles the number of parameters.
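
For concreteness, here is a sketch of the non-linear units we compared. The gated tanh follows the formulation in the papers, y = tanh(Wx + b) * sigmoid(W'x + b'); dimensions are illustrative:

```python
import torch
import torch.nn as nn

class GatedTanh(nn.Module):
    """Gated tanh unit from the papers: y = tanh(Wx + b) * sigmoid(W'x + b')."""
    def __init__(self, in_dim, out_dim):
        super(GatedTanh, self).__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.gate = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return torch.tanh(self.fc(x)) * torch.sigmoid(self.gate(x))

class GLU(nn.Module):
    """Gated linear unit: project to 2*out_dim, then gate one half with the other."""
    def __init__(self, in_dim, out_dim):
        super(GLU, self).__init__()
        self.fc = nn.Linear(in_dim, 2 * out_dim)

    def forward(self, x):
        a, b = self.fc(x).chunk(2, dim=-1)
        return a * torch.sigmoid(b)

# The ReLU variant we settled on is simply:
# nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
```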

With these simplifications we would expect the performance to drop. For reference, the best result on the validation set reported in the paper is 63.15. The reported result without extra data from Visual Genome is 62.48, the result using only 36 objects per image is 62.82, the result using a two-stream classifier without pre-training is 62.28, and the result using ReLU is 61.63. These numbers are cited from Table 1 of the paper "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge". With all of the above simplifications combined, our first implementation got around 59-60 on the validation set.

To shrink the gap, we added some simple but powerful modifications, including (see the sketch after this list):

  1. Add dropout to alleviate overfitting
  2. Double the number of neurons
  3. Add weight normalization (batch normalization does not seem to work well here)
  4. Switch to the Adamax optimizer
  5. Clip gradients
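
A minimal sketch of how these five modifications might look in PyTorch; the layer sizes, dropout rate, and clipping threshold are illustrative, not the exact values used in this repo:

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

# 2. wider hidden layer; 3. weight normalization instead of batch norm;
# 1. dropout to fight overfitting
layer = nn.Sequential(
    weight_norm(nn.Linear(1024, 2048), dim=None),
    nn.ReLU(),
    nn.Dropout(0.5),
)

# 4. Adamax optimizer
optimizer = torch.optim.Adamax(layer.parameters())

# 5. gradient clipping, applied between loss.backward() and optimizer.step()
# (this is the PyTorch 0.3 call; newer versions rename it clip_grad_norm_)
nn.utils.clip_grad_norm(layer.parameters(), max_norm=0.25)
```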

These small modifications bring the number back to ~62.80. We further changed the concatenation-based attention module in the original paper to a projection-based module. This new attention module is inspired by the paper "Modeling Relationships in Referential Expressions with Compositional Modular Networks" (https://arxiv.org/pdf/1611.09978.pdf), but with some modifications (implemented in attention.NewAttention). With the help of this new attention module, we boost performance to ~63.58, surpassing the best reported result, with no extra data and lower computation cost.
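
A rough sketch of the idea behind the projection-based attention follows; the actual attention.NewAttention adds weight normalization and dropout, so treat this as a simplified illustration with illustrative dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionAttention(nn.Module):
    """Project image and question features into a common space, fuse them by
    element-wise product, and score each of the k object regions."""
    def __init__(self, v_dim=2048, q_dim=1024, hid_dim=1024):
        super(ProjectionAttention, self).__init__()
        self.v_proj = nn.Linear(v_dim, hid_dim)
        self.q_proj = nn.Linear(q_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, v, q):
        # v: (batch, k, v_dim) image region features; q: (batch, q_dim) question embedding
        k = v.size(1)
        v_proj = F.relu(self.v_proj(v))                                  # (batch, k, hid_dim)
        q_proj = F.relu(self.q_proj(q)).unsqueeze(1).expand(-1, k, -1)   # (batch, k, hid_dim)
        logits = self.score(v_proj * q_proj).squeeze(2)                  # (batch, k)
        return F.softmax(logits, dim=1)                                  # attention over the k objects
```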

Usage

Prerequisites

Make sure you are on a machine with an NVIDIA GPU and Python 2, with about 70 GB of free disk space.

  1. Install PyTorch v0.3 with CUDA and Python 2.7.
  2. Install h5py.

Data Setup

All data should be downloaded to a 'data/' directory in the root directory of this repository.

The easiest way to download the data is to run the provided script tools/download.sh from the repository root. The features are provided by and downloaded from the original authors' repo. If the script does not work, it should be easy to examine it and adapt the steps outlined in it to your needs. Then run tools/process.sh from the repository root to process the data into the correct format.
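
After processing, it is worth sanity-checking that the feature files load correctly. A quick sketch using h5py (the file name and dataset key here are hypothetical; check what tools/process.sh actually writes under data/):

```python
import h5py

# Hypothetical path and key; adjust to whatever tools/process.sh actually produces.
with h5py.File('data/train36.hdf5', 'r') as f:
    print(list(f.keys()))            # list the stored datasets
    feats = f['image_features']      # expected shape: (num_images, 36, 2048)
    print(feats.shape, feats.dtype)
```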

Training

Simply run python main.py to start training. The training and validation scores will be printed every epoch, and the best model will be saved under the directory "saved_models". The default flags should give you the result provided in the table above.
