A PyTorch implementation of VIOLET

Last update: Dec 30, 2022

Overview

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

A PyTorch implementation of VIOLET

Overview

VIOLET is an implementation of
"VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling"
Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu

VIOLET contains 3 components: Video Swin Transformer (VT) computes video features; Language Embedder (LE) extracts word embeddings; Cross-modal Transformer (CT) performs cross-modal fusion. To benefit from large-scale data, we incorporate 3 pretraining tasks: Masked Language Modeling (MVM) predicts the masked word tokens; Masked Visual-token Modeling (MVM) recovers the masked video patches; Visual-Text Matching (VTM) learns the alignments between video and text modality.

Requirements

This code is implemented under Python 3.8, PyTorch 1.7, and Torchvision 0.8.

Usage

Data preprocessing

As using outer datasets (cannot be shared by us), we provide preprocessing tools to extract sparse-sampled video frames into our compressed format.

cd _tools

# We use 4 frames during pretraining and 5 frames for downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl

# We use DALL-E to extract VQ tokens for MVM pretraining
wget https://cdn.openai.com/dall-e/encoder.pkl # download trained dall-e encoder
python extract_vq.py --path=msrvtt --frame=224 # output: msrvtt_vq.pkl

# We adopt file.seek() instead of loading entire data to reduce the memory cost during distributed pretraining
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx

There are parital examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) to help formulate the input data.

Pretraining

Put pretrained VT in ./_snapshot. This script pretrains on both video (WebVid2.5M) and image (CC3M) data via single-node multi-gpu distributed training.

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=7122 main_pretrain.py

Here is our best pretrained checkpoint (YT180M+WebVid2.5M+CC3M).

Downstream

Multiple-Choice Question Answering (TGIF-Action, TGIF-Transition, MSRVTT-MC, and LSMDC-MC)

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qamc.py _data/args_tgif-action.json

Open-Ended Question Answering (TGIF-Frame, MSRVTT-QA, LSMDC-FiB, and MSVD-QA)

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qaoe.py _data/args_msvd-qa.json

Text-to-Video Retrieval (MSRVTT, DiDeMo, YouCook2, and LSMDC)

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_retrieval.py _data/args_msrvtt-retrieval.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval.py _data/args_msrvtt-retrieval.json

We also provide all trained downstream checkpoints.

Citation

@inproceedings{fu2021violet, 
  author = {Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu}, 
  title = {VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling}, 
  booktitle = {arXiv:2111.1268}, 
  year = {2021} 
}

A PyTorch implementation of VIOLET

Related tags

Overview

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

Overview

Requirements

Usage

Data preprocessing

Pretraining

Downstream

Citation

Owner

Tsu-Jui Fu

a chinese segment base on crf

Py65 65816 - Add support for the 65C816 to py65

The training code for the 4th place model at MDX 2021 leaderboard A.

Installation, test and evaluation of Scribosermo speech-to-text engine

This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

Open Source Neural Machine Translation in PyTorch

ADCS - Automatic Defect Classification System (ADCS) for SSMC

BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

使用pytorch+transformers复现了SimCSE论文中的有监督训练和无监督训练方法

KoBART model on huggingface transformers

Extract city and country mentions from Text like GeoText without regex, but FlashText, a Aho-Corasick implementation.

This repository contains Python scripts for extracting linguistic features from Filipino texts.

Sentiment-Analysis and EDA on the IMDB Movie Review Dataset

Download videos from YouTube/Twitch/Twitter right in the Windows Explorer, without installing any shady shareware apps

A library that integrates huggingface transformers with the world of fastai, giving fastai devs everything they need to train, evaluate, and deploy transformer specific models.

ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab

Perform sentiment analysis and keyword extraction on Craigslist listings

Train BPE with fastBPE, and load to Huggingface Tokenizer.

A high-level Python library for Quantum Natural Language Processing