[arXiv22] Disentangled Representation Learning for Text-Video Retrieval

Last update: Dec 18, 2022

Overview

Disentangled Representation Learning for Text-Video Retrieval

This is a PyTorch implementation of the paper Disentangled Representation Learning for Text-Video Retrieval:

@Article{DRLTVR2022,
  author  = {Qiang Wang and Yanhao Zhang and Yun Zheng and Pan Pan and Xian-Sheng Hua},
  journal = {arXiv:2203.07111},
  title   = {Disentangled Representation Learning for Text-Video Retrieval},
  year    = {2022},
}

Catalog

Setup
Fine-tuning code
Visualization demo

Setup

Setup code environment

git clone https://github.com/foolwood/DRL.git
cd DRL
conda create -n drl python=3.9
conda activate drl
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html

Download CLIP Model (as pretraining)

cd tvr/models
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt

Download Datasets

cd data/MSR-VTT
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip ; unzip MSRVTT.zip
mv MSRVTT/videos/all ./videos ; mv MSRVTT/annotation/MSR_VTT.json ./anns/MSRVTT_data.json

Fine-tuning code

Train on MSR-VTT 1k.

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
main.py --do_train 1 --workers 8 --n_display 50 \
--epochs 5 --lr 1e-4 --coef_lr 1e-3 --batch_size 128 --batch_size_val 128 \
--anno_path data/MSR-VTT/anns --video_path data/MSR-VTT/videos --datatype msrvtt \
--max_words 32 --max_frames 12 --video_framerate 1 \
--base_encoder ViT-B/32 --agg_module seqTransf \
--interaction wti --wti_arch 2 --cdcr 3 --cdcr_alpha1 0.11 --cdcr_alpha2 0.0 --cdcr_lambda 0.001 \
--output_dir ckpts/ckpt_msrvtt_wti_cdcr

Reproduce the ablation experiments scripts

configs	feature	gpus	Text-Video					Video-Text					train time (h)
configs	feature	gpus	[email protected]	[email protected]	[email protected]	MdR	MnR	[email protected]	[email protected]	[email protected]	MdR	MnR	train time (h)
CLIP4Clip	ViT/B-32	4	42.8	72.1	81.4	2.0	16.3	44.1	70.5	80.5	2.0	11.8	10.5
zero-shot	ViT/B-32	4	31.1	53.7	63.4	4.0	41.6	26.5	50.1	61.7	5.0	39.9	-
Interaction
DP+None	ViT/B-32	4	42.9	70.6	81.4	2.0	15.4	43.0	71.1	81.1	2.0	11.8	2.5
DP+seqTransf	ViT/B-32	4	42.8	71.1	81.1	2.0	15.6	44.1	70.9	80.9	2.0	11.7	2.6
XTI+None	ViT/B-32	4	40.5	71.1	82.6	2.0	13.6	42.7	70.8	80.2	2.0	12.5	14.3
XTI+seqTransf	ViT/B-32	4	42.4	71.3	80.9	2.0	15.2	40.1	69.2	79.6	2.0	15.8	16.8
TI+seqTransf	ViT/B-32	4	44.8	73.0	82.2	2.0	13.4	42.6	72.7	82.8	2.0	9.1	2.6
WTI+seqTransf	ViT/B-32	4	46.6	73.4	83.5	2.0	13.0	45.4	73.4	81.9	2.0	9.2	2.6
Channel DeCorrelation Regularization
DP+seqTransf+CDCR	ViT/B-32	4	43.9	71.1	81.2	2.0	15.3	42.3	70.3	81.1	2.0	11.4	2.6
TI+seqTransf+CDCR	ViT/B-32	4	45.8	73.0	81.9	2.0	12.8	43.3	71.8	82.7	2.0	8.9	2.6
WTI+seqTransf+CDCR	ViT/B-32	4	47.6	73.4	83.3	2.0	12.8	45.1	72.9	83.5	2.0	9.2	2.6

Note: the performances are slight boosts due to new hyperparameters.

Visualization demo

Run our visualization demo using matplotlib (no GPU needed):

License

See LICENSE for details.

Acknowledgments

Our code is partly based on CLIP4Clip.

[arXiv22] Disentangled Representation Learning for Text-Video Retrieval

Related tags

Overview

Disentangled Representation Learning for Text-Video Retrieval

Catalog

Setup

Setup code environment

Download CLIP Model (as pretraining)

Download Datasets

Fine-tuning code

Visualization demo

License

Acknowledgments

Owner

Qiang Wang

Time-series-deep-learning - Developing Deep learning LSTM, BiLSTM models, and NeuralProphet for multi-step time-series forecasting of stock price.

Convnet transfer - Code for paper How transferable are features in deep neural networks?

Shitty gaze mouse controller

An official source code for "Augmentation-Free Self-Supervised Learning on Graphs"

Package for working with hypernetworks in PyTorch.

这是一个yolo3-tf2的源码，可以用于训练自己的模型。

RAMA: Rapid algorithm for multicut problem

Rax is a Learning-to-Rank library written in JAX

Annotate with anyone, anywhere.

The implementation for the SportsCap (IJCV 2021)

Breaking the Curse of Space Explosion: Towards Efficient NAS with Curriculum Search

ImageNet Adversarial Image Evaluation

[Preprint] "Bag of Tricks for Training Deeper Graph Neural Networks A Comprehensive Benchmark Study" by Tianlong Chen, Kaixiong Zhou, Keyu Duan, Wenqing Zheng, Peihao Wang, Xia Hu, Zhangyang Wang

SSD: A Unified Framework for Self-Supervised Outlier Detection [ICLR 2021]

Inverse Rendering for Complex Indoor Scenes: Shape, Spatially-Varying Lighting and SVBRDF From a Single Image

Code for the ICCV2021 paper "Personalized Image Semantic Segmentation"

KoCLIP: Korean port of OpenAI CLIP, in Flax

Customizable RecSys Simulator for OpenAI Gym

Artifacts for paper "MMO: Meta Multi-Objectivization for Software Configuration Tuning"

STBP is a way to train SNN with datasets by Backward propagation.

[arXiv22] Disentangled Representation Learning for Text-Video Retrieval

Related tags

Overview

Disentangled Representation Learning for Text-Video Retrieval

Catalog

Setup

Setup code environment

Download CLIP Model (as pretraining)

Download Datasets

Fine-tuning code

Visualization demo

License

Acknowledgments

Owner

Qiang Wang

Time-series-deep-learning - Developing Deep learning LSTM, BiLSTM models, and NeuralProphet for multi-step time-series forecasting of stock price.

Convnet transfer - Code for paper How transferable are features in deep neural networks?

Shitty gaze mouse controller

An official source code for "Augmentation-Free Self-Supervised Learning on Graphs"

Package for working with hypernetworks in PyTorch.

这是一个yolo3-tf2的源码，可以用于训练自己的模型。

RAMA: Rapid algorithm for multicut problem

Rax is a Learning-to-Rank library written in JAX

Annotate with anyone, anywhere.

The implementation for the SportsCap (IJCV 2021)

Breaking the Curse of Space Explosion: Towards Efficient NAS with Curriculum Search

ImageNet Adversarial Image Evaluation

[Preprint] "Bag of Tricks for Training Deeper Graph Neural Networks A Comprehensive Benchmark Study" by Tianlong Chen*, Kaixiong Zhou*, Keyu Duan, Wenqing Zheng, Peihao Wang, Xia Hu, Zhangyang Wang

SSD: A Unified Framework for Self-Supervised Outlier Detection [ICLR 2021]

Inverse Rendering for Complex Indoor Scenes: Shape, Spatially-Varying Lighting and SVBRDF From a Single Image

Code for the ICCV2021 paper "Personalized Image Semantic Segmentation"

KoCLIP: Korean port of OpenAI CLIP, in Flax

Customizable RecSys Simulator for OpenAI Gym

Artifacts for paper "MMO: Meta Multi-Objectivization for Software Configuration Tuning"

STBP is a way to train SNN with datasets by Backward propagation.

[Preprint] "Bag of Tricks for Training Deeper Graph Neural Networks A Comprehensive Benchmark Study" by Tianlong Chen, Kaixiong Zhou, Keyu Duan, Wenqing Zheng, Peihao Wang, Xia Hu, Zhangyang Wang