Compact Bidirectional Transformer for Image Captioning

Last update: Dec 12, 2022

Related tags

Deep Learning CBTrans

Overview

Compact Bidirectional Transformer for Image Captioning

Requirements

Python 3.8
PyTorch 1.6
lmdb
h5py
tensorboardX
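
To confirm the packages above are importable before running anything, a small check along these lines may help. It is only a sketch: the import names in REQUIRED (for example torch for PyTorch) are assumptions, and check_environment is a hypothetical helper, not part of this repository.

```python
import importlib
import sys

# Import names assumed for the packages listed above (hypothetical mapping).
REQUIRED = ["torch", "lmdb", "h5py", "tensorboardX"]


def check_environment(modules=REQUIRED):
    """Return {module name: version string, or None if the import fails}."""
    report = {}
    for name in modules:
        try:
            mod = importlib.import_module(name)
            # Not every module exposes __version__, so fall back to "unknown".
            report[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            report[name] = None
    return report


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    for name, version in check_environment().items():
        print(name, version if version is not None else "MISSING")
```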

Prepare Data

Clone this repository with git clone --recurse-submodules, then follow the initialization steps in coco-caption/README.md.
Download the preprocessed dataset from this link and extract it to data/.
Download the converted VinVL features from this link and place them under data/mscoco_VinVL/. Optionally, follow this instruction to prepare the fixed or adaptive bottom-up features extracted by Anderson and place them under data/mscoco/ or data/mscoco_adaptive/.
Download part of the checkpoints from here and extract them to save/.
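
Once the downloads are extracted, the directory layout described above can be checked with a short script. This is a minimal sketch: missing_dirs is a hypothetical helper, and it only checks that the directories named in this section exist, not that their contents are complete.

```python
from pathlib import Path

# Directories named in the steps above.
REQUIRED_DIRS = ["data", "data/mscoco_VinVL", "save"]
# Only needed if you prepare the optional bottom-up features.
OPTIONAL_DIRS = ["data/mscoco", "data/mscoco_adaptive"]


def missing_dirs(root=".", dirs=REQUIRED_DIRS):
    """Return the entries of `dirs` that are not directories under `root`."""
    root = Path(root)
    return [d for d in dirs if not (root / d).is_dir()]


if __name__ == "__main__":
    for d in missing_dirs():
        print("missing required directory:", d)
    for d in missing_dirs(dirs=OPTIONAL_DIRS):
        print("optional directory not present:", d)
```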

Offline Evaluation

To reproduce the results of a single CBTIC model on the Karpathy test split, run

python eval.py --model save/nsc-transformer-cb-VinVL-feat/model-best.pth --infos_path save/nsc-transformer-cb-VinVL-feat/infos_nsc-transformer-cb-VinVL-feat-best.pkl --beam_size 2 --id nsc-transformer-cb-VinVL-feat --split test

To reproduce the results of an ensemble of CBTIC models on the Karpathy test split, run

python eval_ensemble.py --ids nsc-transformer-cb-VinVL-feat nsc-transformer-cb-VinVL-feat-seed1 nsc-transformer-cb-VinVL-feat-seed2 nsc-transformer-cb-VinVL-feat-seed3 --weights 1 1 1 1 --beam_size 2 --split test
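
The --weights 1 1 1 1 argument gives each of the four checkpoints the same weight. The sketch below illustrates what such weighting could look like, assuming (this README does not say) that the ensemble takes a weighted average of each model's next-word probabilities. ensemble_probs is a hypothetical helper, not a function from this repository.

```python
def ensemble_probs(model_probs, weights):
    """Weighted average of per-model probability vectors.

    model_probs: one probability list per model, all over the same vocabulary.
    weights: one non-negative weight per model (e.g. [1, 1, 1, 1]).
    """
    if len(model_probs) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    vocab_size = len(model_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, model_probs)) / total
        for i in range(vocab_size)
    ]


if __name__ == "__main__":
    # Two toy models over a two-word vocabulary, equal weights.
    print(ensemble_probs([[0.5, 0.5], [1.0, 0.0]], [1, 1]))
```

With equal weights this reduces to a plain mean, so no single checkpoint dominates the prediction.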

Online Evaluation

Please first run

python eval_ensemble.py --split test --language_eval 0 --ids nsc-transformer-cb-VinVL-feat nsc-transformer-cb-VinVL-feat-seed1 nsc-transformer-cb-VinVL-feat-seed2 nsc-transformer-cb-VinVL-feat-seed3 --weights 1 1 1 1 --input_json data/cocotest.json --input_fc_dir data/mscoco_VinVL/cocobu_test2014/cocobu_fc --input_att_dir data/mscoco_VinVL/cocobu_test2014/cocobu_att --input_label_h5 data/cocotalk_bw_label.h5 --batch_size 128 --beam_size 2 --id captions_test2014_cbtic_results

and then follow the instruction to upload the results.
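
Before uploading, it can be worth sanity-checking the generated file. The sketch below assumes the results are stored in the usual COCO caption format, a JSON list of objects with an integer image_id and a string caption. Where the evaluation script writes that file is not stated here, so any path passed to load_and_validate is up to you. Both function names are hypothetical.

```python
import json


def validate_results(entries):
    """Return a list of problems found in a COCO-style caption results list."""
    problems = []
    seen = set()
    for i, entry in enumerate(entries):
        image_id = entry.get("image_id")
        caption = entry.get("caption")
        if not isinstance(image_id, int):
            problems.append(f"entry {i}: image_id is not an integer")
        elif image_id in seen:
            problems.append(f"entry {i}: duplicate image_id {image_id}")
        else:
            seen.add(image_id)
        if not isinstance(caption, str) or not caption.strip():
            problems.append(f"entry {i}: caption is missing or empty")
    return problems


def load_and_validate(path):
    """Load a results JSON file from `path` and validate its entries."""
    with open(path, "r", encoding="utf-8") as f:
        return validate_results(json.load(f))
```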

Training

For the first training stage (here with VinVL features), run

python train.py --noamopt --noamopt_warmup 20000 --seq_per_img 5 --batch_size 10 --beam_size 1 --learning_rate 5e-4 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --save_checkpoint_every 3000 --language_eval 1 --val_images_use 5000 --max_epochs 15 --checkpoint_path save/transformer-cb-VinVL-feat --id transformer-cb-VinVL-feat --caption_model cbt --input_fc_dir data/mscoco_VinVL/cocobu_fc --input_att_dir data/mscoco_VinVL/cocobu_att --input_box_dir data/mscoco_VinVL/cocobu_box
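
The --noamopt and --noamopt_warmup 20000 flags select a warmup-based learning-rate schedule. The sketch below implements the standard "Noam" schedule from the original Transformer paper. It assumes this repository uses that formula with a model size of 512 (taken from --input_encoding_size) and a scale factor of 1; neither assumption is confirmed by this README.

```python
def noam_lr(step, d_model=512, warmup=20000, factor=1.0):
    """Standard Noam schedule: linear warmup, then inverse-square-root decay.

    d_model and factor are assumed values; warmup matches --noamopt_warmup.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


if __name__ == "__main__":
    for step in (1, 5000, 20000, 80000):
        print(step, noam_lr(step))
```

Under these assumptions the rate rises linearly for the first 20000 steps, peaks at 1/3200 = 3.125e-4, and then decays in proportion to the inverse square root of the step.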

The second training stage requires two GPUs with 12 GB of memory each. First copy the pretrained model from the first stage:

cd save
./copy_model.sh transformer-cb-VinVL-feat nsc-transformer-cb-VinVL-feat
cd ..

and then run

python train.py --seq_per_img 5 --batch_size 10 --beam_size 1 --learning_rate 1e-5 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --save_checkpoint_every 3000 --language_eval 1 --val_images_use 5000 --self_critical_after 14 --max_epochs 30 --start_from save/nsc-transformer-cb-VinVL-feat --checkpoint_path save/nsc-transformer-cb-VinVL-feat --id nsc-transformer-cb-VinVL-feat --caption_model cbt --input_fc_dir data/mscoco_VinVL/cocobu_fc --input_att_dir data/mscoco_VinVL/cocobu_att --input_box_dir data/mscoco_VinVL/cocobu_box
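
The --self_critical_after flag points to self-critical sequence training, in which each sampled caption is rewarded relative to a baseline. The sketch below shows one common variant, a leave-one-out baseline over the captions sampled for the same image. It assumes, without confirmation from this README, that the nsc prefix refers to that variant. leave_one_out_advantages is a hypothetical helper.

```python
def leave_one_out_advantages(rewards):
    """Subtract from each reward the mean reward of the *other* samples.

    rewards: scores (e.g. CIDEr) of the captions sampled for one image.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two sampled captions per image")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


if __name__ == "__main__":
    # Three sampled captions for one image with toy reward values.
    print(leave_one_out_advantages([1.0, 2.0, 3.0]))
```

Captions that score above their siblings get a positive advantage and are reinforced; those below get a negative one.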

Note

Even with all random seeds fixed, we find that the results of two runs still differ slightly when using DataParallel on two GPUs. However, the results can be reproduced exactly when using a single GPU.
If you are interested in the ablation studies, use git reflog to list all commits and git reset --hard commit_id to switch to the corresponding commit.
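
For the seed-related note above, a typical way to fix the random seeds in a PyTorch project is sketched below. This README does not say which seeds the training script sets, so seed_everything is a hypothetical helper rather than the repository's own code. NumPy and PyTorch are seeded only if they are installed.

```python
import random


def seed_everything(seed):
    """Seed Python's RNG and, when available, NumPy and PyTorch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
        # Prefer deterministic cuDNN kernels over auto-tuned ones.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


if __name__ == "__main__":
    seed_everything(0)
    print(random.random())
```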

Citation

@misc{zhou2022compact,
      title={Compact Bidirectional Transformer for Image Captioning}, 
      author={Yuanen Zhou and Zhenzhen Hu and Daqing Liu and Huixia Ben and Meng Wang},
      year={2022},
      eprint={2201.01984},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgements

This repository is built upon self-critical.pytorch. Thanks for the released code.

Owner

YE Zhou
