Moer Grounded Image Captioning by Distilling Image-Text Matching Model

Last update: Dec 16, 2022

Related tags

Deep Learning Grounded-Image-Captioning

Overview

Moer Grounded Image Captioning by Distilling Image-Text Matching Model

Requirements

Python 3.7
Pytorch 1.2

Prepare data

Please use git clone --recurse-submodules to clone this repository and remember to follow initialization steps in coco-caption/README.md. Then download and place the Flickr30k reference file under coco-caption/annotations. Also, download Stanford CoreNLP 3.9.1 for grounding evaluation and place the uncompressed folder under the tools/ directory.
Download the preprocessd dataset from this link and extract it to data/.
For Flickr30k-Entities, please download bottom-up visual feature extracted by Anderson's extractor (Zhou's extractor) from this link ( link) and place the uncompressed folders under data/flickrbu/. For MSCOCO, please follow this instruction to prepare the bottom-up features and place them under data/mscoco/.
Download the pretrained models from here and extract them to log/.
Download the pretrained SCAN models from this link and extract them to misc/SCAN/runs.

Evaluation

To reproduce the results reported in the paper, just simply run

bash eval_flickr.sh

fro Flickr30k-Entities and

bash eval_coco.sh

for MSCOCO.

Training

In the first training stage, run like

python train.py --id CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att  --input_box_dir data/flickrbu/flickrbu_box  --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log/CE-scan-sup-0.1kl --save_checkpoint_every 1000 --val_images_use -1 --max_epochs 30  --att_supervise  True   --att_supervise_weight 0.1

In the second training stage, run like

python train.py --id sc-ground-CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att  --input_box_dir data/flickrbu/flickrbu_box  --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-5 --start_from log/CE-scan-sup-0.1kl --checkpoint_path log/sc-ground-CE-scan-sup-0.1kl --save_checkpoint_every 1000 --language_eval 1 --val_images_use -1 --self_critical_after 30  --max_epochs  110      --cider_reward_weight  1
--ground_reward_weight   1

Citation

@inproceedings{zhou2020grounded,
  title={More Grounded Image Captioning by Distilling Image-Text Matching Model},
  author={Zhou, Yuanen and Wang, Meng and Liu, Daqing and  Hu, Zhenzhen and Zhang, Hanwang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2020}
}

Acknowledgements

This repository is built upon self-critical.pytorch, SCAN and grounded-video-description. Thanks for their released code.

Moer Grounded Image Captioning by Distilling Image-Text Matching Model

Related tags

Overview

Moer Grounded Image Captioning by Distilling Image-Text Matching Model

Requirements

Prepare data

Evaluation

Training

Citation

Acknowledgements

Owner

YE Zhou

Official implementation for Likelihood Regret: An Out-of-Distribution Detection Score For Variational Auto-encoder at NeurIPS 2020

Learning to Identify Top Elo Ratings with A Dueling Bandits Approach

Pytorch implementation of Feature Pyramid Network (FPN) for Object Detection

Sound-guided Semantic Image Manipulation - Official Pytorch Code (CVPR 2022)

The code for Expectation-Maximization Attention Networks for Semantic Segmentation (ICCV'2019 Oral)

DTCN SMP Challenge - Sequential prediction learning framework and algorithm

Code, Models and Datasets for OpenViDial Dataset

Autoregressive Predictive Coding: An unsupervised autoregressive model for speech representation learning

Code for "ATISS: Autoregressive Transformers for Indoor Scene Synthesis", NeurIPS 2021

[CVPR 2022] Pytorch implementation of "Templates for 3D Object Pose Estimation Revisited: Generalization to New objects and Robustness to Occlusions" paper

Official Implement of CVPR 2021 paper “Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting”

Repository for the "Gotta Go Fast When Generating Data with Score-Based Models" paper

render sprites into your desktop environment as shaped windows using GTK

Graph Robustness Benchmark: A scalable, unified, modular, and reproducible benchmark for evaluating the adversarial robustness of Graph Machine Learning.

Adversarial Attacks on Probabilistic Autoregressive Forecasting Models.

The full training script for Enformer (Tensorflow Sonnet) on TPU clusters

Ranking Models in Unlabeled New Environments （iccv21）

CryptoFrog - My First Strategy for freqtrade

A blender add-on that automatically re-aligns wrong axis objects.

NAS Benchmark in "Prioritized Architecture Sampling with Monto-Carlo Tree Search", CVPR2021