This is the official implementation of "Video Swin Transformer".

Last update: Jan 03, 2023

Video Swin Transformer

By Ze Liu*, Jia Ning*, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin and Han Hu.

This repo is the official implementation of "Video Swin Transformer". It is based on mmaction2.

Updates

06/25/2021 Initial commits

Introduction

Video Swin Transformer was first described in "Video Swin Transformer", which advocates an inductive bias of locality in video Transformers. This leads to a better speed-accuracy trade-off than previous approaches, which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).
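
To make the locality argument concrete, the following is a rough back-of-the-envelope sketch, not code from this repo. It counts query-key pairs for global attention versus attention restricted to non-overlapping 3D windows. The patch size (2, 4, 4) and window size (8, 7, 7) are read off the config name `patch244_window877`; the 32-frame, 224x224 input and the helper `attention_pairs` are assumptions made for illustration. Only the first stage is counted, and shifted windows and later downsampling are ignored.

```python
from math import prod


def attention_pairs(tokens, window=None):
    """Count query-key pairs for one self-attention layer.

    tokens: (D, H, W) size of the token grid.
    window: (d, h, w) local window size, or None for global attention.
    """
    n = prod(tokens)
    if window is None:
        # Global attention: every token attends to every token.
        return n * n
    if any(t % w for t, w in zip(tokens, window)):
        raise ValueError("token grid must be divisible by the window size")
    # Windowed attention: every token attends only to its own window.
    return n * prod(window)


# Assumed input clip: 32 frames at 224x224 (not stated in this README).
frames, height, width = 32, 224, 224
patch = (2, 4, 4)   # from "patch244" in the config name
window = (8, 7, 7)  # from "window877" in the config name

tokens = (frames // patch[0], height // patch[1], width // patch[2])
global_pairs = attention_pairs(tokens)
local_pairs = attention_pairs(tokens, window)

print("token grid:", tokens)
print("global / windowed pair count:", global_pairs // local_pairs)
```

Under these assumptions the windowed layer evaluates two orders of magnitude fewer pairs than a global one, which is the kind of saving the speed-accuracy claim rests on.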

Results and Models

Kinetics 400

Backbone	Pretrain	Lr Schd	spatial crop	acc@1	acc@5	#params	FLOPs	config	model
Swin-T	ImageNet-1K	30ep	224	78.8	93.6	28M	87.9G	config	github/baidu
Swin-S	ImageNet-1K	30ep	224	80.6	94.5	50M	165.9G	config	github/baidu
Swin-B	ImageNet-1K	30ep	224	80.6	94.6	88M	281.6G	config	github/baidu
Swin-B	ImageNet-22K	30ep	224	82.7	95.5	88M	281.6G	config	github/baidu

Kinetics 600

Backbone	Pretrain	Lr Schd	spatial crop	acc@1	acc@5	#params	FLOPs	config	model
Swin-B	ImageNet-22K	30ep	224	84.0	96.5	88M	281.6G	config	github/baidu

Something-Something V2

Backbone	Pretrain	Lr Schd	spatial crop	acc@1	acc@5	#params	FLOPs	config	model
Swin-B	Kinetics 400	60ep	224	69.6	92.7	89M	320.6G	config	github/baidu

Notes:

Pre-trained image models can be downloaded from Swin Transformer for ImageNet Classification.
The pre-trained model for SSv2 can be downloaded at github/baidu.
Access code for baidu is swin.

Usage

Installation

Please refer to install.md for installation.

For convenience, we also provide Dockerfiles for cuda10.1 (image url) and cuda11.0 (image url).

Data Preparation

Please refer to data_preparation.md for an overview of data preparation. The supported datasets are listed in supported_datasets.md.

Inference

# single-gpu testing
python tools/test.py <CONFIG_FILE> <CHECKPOINT_FILE> --eval top_k_accuracy

# multi-gpu testing
bash tools/dist_test.sh <CONFIG_FILE> <CHECKPOINT_FILE> <GPU_NUM> --eval top_k_accuracy
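
The `top_k_accuracy` metric requested above corresponds to the acc@1 and acc@5 columns in the results tables. As a minimal sketch of what such a metric measures (this is an illustrative pure-Python version, not the implementation used by mmaction2; the function and the toy scores are assumptions):

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one per sample.
    labels: list of integer class indices, one per sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy example with 3 classes and 4 clips (values are made up).
scores = [
    [0.1, 0.7, 0.2],  # true class 1 is ranked first
    [0.5, 0.2, 0.3],  # true class 2 is ranked second
    [0.3, 0.6, 0.1],  # true class 2 is ranked last
    [0.2, 0.3, 0.5],  # true class 2 is ranked first
]
labels = [1, 2, 2, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```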

Training

To train a video recognition model with pre-trained image models (for the Kinetics-400 and Kinetics-600 datasets), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-T model on the Kinetics-400 dataset with 8 GPUs, run:

bash tools/dist_train.sh configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL>
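
The `--cfg-options` flag takes dotted keys such as `model.backbone.pretrained` and overrides the matching entry in the nested config. The sketch below illustrates that idea only; it is not the parser used by the training scripts, and `parse_cfg_options`, `apply_overrides`, and the checkpoint path are hypothetical names chosen for this example.

```python
import copy

# Minimal literal handling; a real parser would support more value types.
_LITERALS = {"True": True, "False": False, "None": None}


def parse_cfg_options(pairs):
    """Turn ['a.b=value', ...] into {'a.b': value, ...}."""
    options = {}
    for pair in pairs:
        key, raw = pair.split("=", 1)
        options[key] = _LITERALS.get(raw, raw)
    return options


def apply_overrides(config, options):
    """Return a copy of a nested dict with dotted-key overrides applied."""
    result = copy.deepcopy(config)
    for dotted_key, value in options.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk down the nesting, creating levels that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


# Simplified stand-in for a model config (not the repo's full config).
base = {"model": {"backbone": {"pretrained": None, "use_checkpoint": False}}}

options = parse_cfg_options([
    "model.backbone.pretrained=checkpoints/pretrained.pth",  # hypothetical path
    "model.backbone.use_checkpoint=True",
])
merged = apply_overrides(base, options)
print(merged)
```

Copying the config before applying overrides keeps the base settings reusable across runs with different options.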

To train a video recognition model with pre-trained video models (for the Something-Something v2 dataset), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-B model on the SSv2 dataset with 8 GPUs, run:

bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_sthv2.py 8 --cfg-options load_from=<PRETRAIN_MODEL>

Note: use_checkpoint enables gradient checkpointing, which saves GPU memory at the cost of extra computation. Please refer to this page for more details.

Apex (optional):

We use apex for mixed-precision training by default. To install apex, use the provided Docker image or run:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

If you would like to disable apex, comment out the following code block in the configuration files:

# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)

Citation

If you find our work useful in your research, please cite:

@article{liu2021video,
  title={Video Swin Transformer},
  author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
  journal={arXiv preprint arXiv:2106.13230},
  year={2021}
}

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}
