Unsupervised captioning - Code for Unsupervised Image Captioning

Last update: Dec 24, 2022

Unsupervised Image Captioning

by Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo

Introduction

Most image captioning models are trained using paired image-sentence data, which are expensive to collect. We propose unsupervised image captioning to relax the reliance on paired data. For more details, please refer to our paper.

Citation

@InProceedings{feng2019unsupervised,
  author = {Feng, Yang and Ma, Lin and Liu, Wei and Luo, Jiebo},
  title = {Unsupervised Image Captioning},
  booktitle = {CVPR},
  year = {2019}
}

Requirements

```
mkdir ~/workspace
cd ~/workspace
git clone https://github.com/tensorflow/models.git tf_models
git clone https://github.com/tylin/coco-caption.git
touch tf_models/research/im2txt/im2txt/__init__.py
touch tf_models/research/im2txt/im2txt/data/__init__.py
touch tf_models/research/im2txt/im2txt/inference_utils/__init__.py
wget http://download.tensorflow.org/models/inception_v4_2016_09_09.tar.gz
mkdir ckpt
tar zxvf inception_v4_2016_09_09.tar.gz -C ckpt
git clone https://github.com/fengyang0317/unsupervised_captioning.git
cd unsupervised_captioning
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:`pwd`
```
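
Before moving on, it can help to confirm that the setup commands produced the expected layout. The following is a minimal sketch and not part of the repository: the helper name `missing_paths` is hypothetical, and the path list is simply read off the commands above (the checkpoint file name `inception_v4.ckpt` is the one the training commands pass to `--inc_ckpt`).

```python
import os

# Paths the setup commands above are expected to create, relative to the
# workspace directory. This list is an assumption derived from those commands.
EXPECTED = [
    "tf_models/research/im2txt/im2txt/__init__.py",
    "tf_models/research/im2txt/im2txt/data/__init__.py",
    "tf_models/research/im2txt/im2txt/inference_utils/__init__.py",
    "coco-caption",
    "ckpt/inception_v4.ckpt",
    "unsupervised_captioning/requirements.txt",
]


def missing_paths(workspace):
    """Return the expected paths that do not exist under `workspace`."""
    return [p for p in EXPECTED
            if not os.path.exists(os.path.join(workspace, p))]


# Report anything that is absent from the default workspace location.
for path in missing_paths(os.path.expanduser("~/workspace")):
    print("missing:", path)
```

An empty output suggests the clone, download, and extraction steps all completed; it does not check that `PYTHONPATH` is set or that the Python requirements are installed.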

Dataset (Optional. The files generated below can be found at Gdrive).

If you do not have access to Google, the files are also available at One Drive.

Crawl image descriptions. The descriptions used when conducting the experiments in the paper are available at link. You may download the descriptions from the link and extract the files to data/coco.
```
pip3 install absl-py
python3 preprocessing/crawl_descriptions.py
```
Extract the descriptions. NLTK seems to change frequently, so the number of descriptions you obtain may differ.
```
python -c "import nltk; nltk.download('punkt')"
python preprocessing/extract_descriptions.py
```
Preprocess the descriptions. You may need to change the vocab_size, start_id, and end_id in config.py if you generate a new dictionary.
```
python preprocessing/process_descriptions.py --word_counts_output_file \
  data/word_counts.txt --new_dict
```
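
If you do generate a new dictionary, the values to update can be read off the word-counts file. The sketch below is an illustration under stated assumptions, not the repository's own logic: it assumes one `word count` pair per line, that a word's id is its zero-based line index, and that the start and end tokens are `<S>` and `</S>` (the im2txt convention). The helper name `vocab_settings` is hypothetical. Whether `vocab_size` in config.py should also count an extra slot for unknown words is something to check against config.py itself.

```python
import os


def vocab_settings(word_counts_file, start_word="<S>", end_word="</S>"):
    """Derive dictionary-dependent values from a word-counts file.

    Assumes one "word count" pair per line and that a word's id is its
    zero-based line index. Raises ValueError if a token is not present.
    """
    with open(word_counts_file, encoding="utf-8") as f:
        words = [line.split()[0] for line in f if line.strip()]
    return {
        "num_words": len(words),
        "start_id": words.index(start_word),
        "end_id": words.index(end_word),
    }


# Only runs when the file produced by the preprocessing step is present.
if os.path.exists("data/word_counts.txt"):
    print(vocab_settings("data/word_counts.txt"))
```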
Download the MSCOCO images from link and put all the images into ~/dataset/mscoco/all_images.
Run object detection on the training images. First download the detection model from here, then extract the model under tf_models/research/object_detection.
```
python preprocessing/detect_objects.py --image_path\
  ~/dataset/mscoco/all_images --num_proc 2 --num_gpus 1
```
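
The detection command expects every image in one flat directory. If your MSCOCO download leaves the images in separate split folders, a sketch like the following could collect them. The folder names `train2014` and `val2014` are assumptions about how the archives unpack, and `gather_images` is a hypothetical helper, not part of the repository.

```python
import os
import shutil


def gather_images(src_dirs, dest_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Copy image files from each directory in `src_dirs` into `dest_dir`.

    Files already present in `dest_dir` are skipped. Returns the number of
    files copied.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = 0
    for src in src_dirs:
        for name in sorted(os.listdir(src)):
            if not name.lower().endswith(extensions):
                continue
            target = os.path.join(dest_dir, name)
            if os.path.exists(target):
                continue
            shutil.copy2(os.path.join(src, name), target)
            copied += 1
    return copied


# Example call (directory names are assumptions):
# root = os.path.expanduser("~/dataset/mscoco")
# gather_images([os.path.join(root, "train2014"),
#                os.path.join(root, "val2014")],
#               os.path.join(root, "all_images"))
```

Copying keeps the original folders intact at the cost of disk space; moving or symlinking the files would be an equally reasonable choice.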

Generate tfrecord files for images.

```
python preprocessing/process_images.py --image_path\
  ~/dataset/mscoco/all_images
```

Training

Train the model without the initialization pipeline.

```
python im_caption_full.py --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt\
  --multi_gpu --batch_size 512 --save_checkpoint_steps 1000\
  --gen_lr 0.001 --dis_lr 0.001
```

Evaluate the model. The last element in the b34.json file is the best checkpoint.

```
CUDA_VISIBLE_DEVICES='0,1' python eval_all.py\
  --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt\
  --data_dir ~/dataset/mscoco/all_images
js-beautify saving/b34.json
```
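
If `js-beautify` is not installed, the last element can also be read with Python's standard `json` module. This is a minimal sketch that assumes the top level of b34.json is a JSON array, which the phrase "last element" suggests but the source does not spell out. The helper name `best_entry` is hypothetical.

```python
import json
import os


def best_entry(json_path):
    """Return the last element of the JSON array stored at `json_path`."""
    with open(json_path) as f:
        results = json.load(f)
    # Assumption: the file holds a non-empty top-level array.
    if not isinstance(results, list) or not results:
        raise ValueError("expected a non-empty JSON array in " + json_path)
    return results[-1]


# Only runs when the evaluation step has already written its results.
if os.path.exists("saving/b34.json"):
    print(json.dumps(best_entry("saving/b34.json"), indent=2))
```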

Evaluate the model on the test set. Suppose the best validation checkpoint is 20000.

```
python test_model.py --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt\
  --data_dir ~/dataset/mscoco/all_images --job_dir saving/model.ckpt-20000
```

Initialization (Optional. The files can be found here).

Train an object-to-sentence model, which is used to generate the pseudo-captions.
```
python initialization/obj2sen.py
```

Find the best obj2sen model.

```
python initialization/eval_obj2sen.py --threads 8
```

Generate pseudo-captions. Suppose the best validation checkpoint is 35000.

```
python initialization/gen_obj2sen_caption.py --num_proc 8\
  --job_dir obj2sen/model.ckpt-35000
```

Train a captioning model using pseudo-pairs.

```
python initialization/im_caption.py --o2s_ckpt obj2sen/model.ckpt-35000\
  --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt
```

Evaluate the model.

```
CUDA_VISIBLE_DEVICES='0,1' python eval_all.py\
  --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt\
  --data_dir ~/dataset/mscoco/all_images --job_dir saving_imcap
js-beautify saving_imcap/b34.json
```

Train the sentence auto-encoder, which is used to initialize the sentence GAN.
```
python initialization/sentence_ae.py
```
Train the sentence GAN.
```
python initialization/sentence_gan.py
```

Train the full model with initialization. Suppose the best imcap validation checkpoint is 18000.

```
python im_caption_full.py --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt\
  --imcap_ckpt saving_imcap/model.ckpt-18000\
  --sae_ckpt sen_gan/model.ckpt-30000 --multi_gpu --batch_size 512\
  --save_checkpoint_steps 1000 --gen_lr 0.001 --dis_lr 0.001
```

Credits

Part of the code is from coco-caption, im2txt, tfgan, resnet, TensorFlow Object Detection API, and maskgan.

Xinpeng told me the idea of self-critic, which is crucial to training.
