Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Last update: Jan 09, 2023

Overview

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation (Salesforce Research)

This is the official PyTorch implementation of the ALBEF paper [Blog]. This repository supports pre-training on custom datasets, as well as finetuning on VQA, SNLI-VE, NLVR2, Image-Text Retrieval on MSCOCO and Flickr30k, and visual grounding on RefCOCO+. Pre-trained and finetuned checkpoints are released.

Requirements:

pytorch 1.8.0
transformers 4.8.1
timm 0.4.9

Download:

Visualization:

We provide code in visualize.ipynb to visualize the important areas in an image for each word in a text. Here is an example visualization using the visual grounding checkpoint.

Pre-training on custom datasets:

Prepare training json files where each json file contains a list. Each item in the list is a dictonary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image}.
In configs/Pretrain.yaml, set the paths for the json files.
Pre-train the model using 8 A100 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env Pretrain.py --config ./configs/Pretrain.yaml --output_dir output/Pretrain

Image-Text Retrieval:

Download MSCOCO or Flickr30k datasets from the original websites.
Download and extract the provided dataset json files.
In configs/Retrieval_coco.yaml or configs/Retrieval_flickr.yaml, set the paths for the json files and the image path.
Finetune the pre-trained checkpoint using 8 A100 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/Retrieval_flickr \
--checkpoint [Pretrained checkpoint]

VQA:

Download VQA v2 dataset and Visual Genome dataset from the original websites.
Download and extract the provided dataset json files.
In configs/VQA.yaml, set the paths for the json files and the image paths.
Finetune the pre-trained checkpoint using 8 A100 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env VQA.py \
--config ./configs/VQA.yaml \
--output_dir output/vqa \
--checkpoint [Pretrained checkpoint]

Evaluate the result using the official evaluation server.

Visual Entailment:

Download SNLI-VE dataset from the original website.
Download and extract the provided dataset json files.
In configs/VE.yaml, set the paths for the json files and the image path.
Finetune the pre-trained checkpoint using 8 A100 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env VE.py \
--config ./configs/VE.yaml \
--output_dir output/VE \
--checkpoint [Pretrained checkpoint]

Visual Grounding on RefCOCO+:

Download MSCOCO dataset from the original website.
Download and extract the provided dataset json files.
In configs/Grounding.yaml, set the paths for the json files and the image path.
Finetune the pre-trained checkpoint using 8 A100 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env Grounding.py \
--config ./configs/Grounding.yaml \
--output_dir output/RefCOCO \
--gradcam_mode itm \ 
--block_num 8 \
--checkpoint [Pretrained checkpoint]

NLVR2:

NLVR2 requires an additional pre-training step with text-assignment (TA) to adapt the model for image-pair inputs. In order to perform TA, first set the paths for the json training files in configs/NLVR_pretrain.yaml, then run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env Pretrain_nlvr.py \
--config ./configs/NLVR_pretrain.yaml \
--output_dir output/NLVR_pretrain \
--checkpoint [Pretrained checkpoint]

We provide the checkpoint after TA pre-training, which can be fine-tuned with the following steps.

Download NLVR2 dataset from the original website.
Download and extract the provided dataset json files.
In configs/NLVR.yaml, set the paths for the json files and the image path.
Finetune the pre-trained checkpoint using 8 A100 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env NLVR.py \
--config ./configs/NLVR.yaml \
--output_dir output/NLVR \
--checkpoint [TA pretrained checkpoint]

Citation

If you find this code to be useful for your research, please consider citing.

@article{ALBEF,
      title={Align before Fuse: Vision and Language Representation Learning with Momentum Distillation}, 
      author={Junnan Li and Ramprasaath R. Selvaraju and Akhilesh Deepak Gotmare and Shafiq Joty and Caiming Xiong and Steven Hoi},
      year={2021},
      journal={arXiv preprint arXiv:2107.07651},
}

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Related tags

Overview

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation (Salesforce Research)

Requirements:

Download:

Visualization:

Pre-training on custom datasets:

Image-Text Retrieval:

VQA:

Visual Entailment:

Visual Grounding on RefCOCO+:

NLVR2:

Citation

Owner

Salesforce

Poisson Surface Reconstruction for LiDAR Odometry and Mapping

Vector Quantization, in Pytorch

Shallow Convolutional Neural Networks for Human Activity Recognition using Wearable Sensors

A rule-based log analyzer & filter

Image restoration with neural networks but without learning.

Datasets, Transforms and Models specific to Computer Vision

Official code of CVPR 2021's PLOP: Learning without Forgetting for Continual Semantic Segmentation

Invasive Plant Species Identification

Weakly Supervised Segmentation by Tensorflow.

Spatiotemporal resampling methods for mlr3

Very simple NCHW and NHWC conversion tool for ONNX. Change to the specified input order for each and every input OP. Also, change the channel order of RGB and BGR. Simple Channel Converter for ONNX.

RINDNet: Edge Detection for Discontinuity in Reflectance, Illumination, Normal and Depth, in ICCV 2021 (oral)

Multiwavelets-based operator model

POCO: Point Convolution for Surface Reconstruction

Implementation of Pix2Seq in PyTorch

A project to make Amazon Echo respond to sign language using your webcam

Introducing neural networks to predict stock prices

Exploring Versatile Prior for Human Motion via Motion Frequency Guidance (3DV2021)

CMT: Convolutional Neural Networks Meet Vision Transformers

Music Generation using Neural Networks Streamlit App