XViT - Space-time Mixing Attention for Video Transformer

Last update: Dec 23, 2022

Overview

This is the official implementation of the XViT paper:

@inproceedings{bulat2021space,
  title={Space-time Mixing Attention for Video Transformer},
  author={Bulat, Adrian and Perez-Rua, Juan-Manuel and Sudhakaran, Swathikiran and Martinez, Brais and Tzimiropoulos, Georgios},
  booktitle={NeurIPS},
  year={2021}
}

In XViT, we introduce a novel Video Transformer model whose complexity scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model. To achieve this, our model makes two approximations to the full space-time attention used in Video Transformers: (a) It restricts time attention to a local temporal window and capitalizes on the Transformer's depth to obtain full temporal coverage of the video sequence. (b) It uses efficient space-time mixing to jointly attend to spatial and temporal locations without inducing any additional cost on top of a spatial-only attention model. We also show how to integrate two very lightweight mechanisms for global temporal-only attention, which provide additional accuracy improvements at minimal computational cost. Our model achieves very high recognition accuracy on the most popular video recognition datasets while being significantly more efficient than other Video Transformer models.
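The two ideas above can be illustrated with a small, self-contained sketch. It is not the repository's attention code: the function names, the window radius of 1, and the clip sizes are assumptions chosen for illustration. The first helper shows how a local temporal window widens its reach as layers are stacked; the second counts query-key pairs to show why per-frame attention grows linearly with the number of frames while full space-time attention grows quadratically.

```python
def frames_seen(frame: int, num_frames: int, depth: int, radius: int = 1) -> range:
    """Frames that can influence `frame` after `depth` stacked layers.

    Assumption: each layer attends only to frames within `radius` of the
    current one, so the reachable span widens by `radius` per side per layer.
    """
    lo = max(0, frame - radius * depth)
    hi = min(num_frames - 1, frame + radius * depth)
    return range(lo, hi + 1)


def attention_pairs(num_frames: int, tokens_per_frame: int, joint: bool) -> int:
    """Number of query-key pairs scored in one attention layer."""
    if joint:
        # Full space-time attention: every token attends to every token.
        return (num_frames * tokens_per_frame) ** 2
    # Per-frame attention: each token attends only to its own frame's tokens.
    return num_frames * tokens_per_frame ** 2


if __name__ == "__main__":
    T, S = 8, 196  # illustrative clip length and tokens per frame (assumed values)
    for depth in (1, 2, 4, 7):
        span = frames_seen(frame=0, num_frames=T, depth=depth)
        print(f"depth {depth}: frame 0 sees frames {span.start}..{span.stop - 1}")
    # Cost ratio between full space-time and per-frame attention equals T.
    print(attention_pairs(T, S, joint=True) // attention_pairs(T, S, joint=False))
```

In this toy setting the ratio between the two costs equals the number of frames, which is the overhead that the space-time mixing scheme is designed to avoid.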

Model Zoo

We provide a series of models pre-trained on Kinetics-600 and Something-Something-v2.

Kinetics-600

Architecture	frames	views	Top-1	Top-5	url
XViT-B16	16	3x1	84.51%	96.26%	model
XViT-B16	16	3x2	84.71%	96.39%	model

Something-Something-V2

Architecture	frames	views	Top-1	Top-5	url
XViT-B16	16	32x2	67.19%	91.00%	model

Installation

Please make sure your setup satisfies the following requirements:

Requirements

The requirements largely follow those of the original SlowFast repo:

Python >= 3.8
Numpy
PyTorch >= 1.3
hdf5
fvcore: pip install 'git+https://github.com/facebookresearch/fvcore'
torchvision that matches the PyTorch installation. You can install them together at pytorch.org to make sure of this.
simplejson: pip install simplejson
GCC >= 4.9
PyAV: conda install av -c conda-forge
ffmpeg (4.0 is preferred, will be installed along with PyAV)
PyYaml: (will be installed along with fvcore)
tqdm: (will be installed along with fvcore)
iopath: pip install -U iopath or conda install -c iopath iopath
psutil: pip install psutil
OpenCV: pip install opencv-python
torchvision: pip install torchvision or conda install torchvision -c pytorch
tensorboard: pip install tensorboard
PyTorchVideo: pip install pytorchvideo
Detectron2:

    pip install -U torch torchvision cython
    pip install -U 'git+https://github.com/facebookresearch/fvcore.git' 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
    git clone https://github.com/facebookresearch/detectron2 detectron2_repo
    pip install -e detectron2_repo
    # You can find more details at https://github.com/facebookresearch/detectron2/blob/master/INSTALL.md

Datasets

1. Kinetics

You can download the Kinetics 400/600 datasets following the instructions provided by the cvdfoundation repo: https://github.com/cvdfoundation/kinetics-dataset

Afterwards, resize the videos so that the shorter edge is 256 pixels, and prepare the csv files for training, validation and testing: train.csv, val.csv, test.csv. The format of the csv file is:

path_to_video_1 label_1
path_to_video_2 label_2
...
path_to_video_N label_N
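As a rough illustration of the format above, the following sketch writes and reads such a file using only the standard library. The helper names `write_split_csv` and `read_split_csv` are hypothetical and not part of this repo, and the sketch assumes that labels are integer class indices separated from the path by a single space.

```python
from pathlib import Path


def write_split_csv(samples, csv_path, separator=" "):
    """Write one `path label` line per video (hypothetical helper)."""
    lines = [f"{path}{separator}{label}" for path, label in samples]
    Path(csv_path).write_text("\n".join(lines) + "\n")


def read_split_csv(csv_path, separator=" "):
    """Parse the file back into (path, label) tuples."""
    samples = []
    for line in Path(csv_path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last separator so the label is always the final field.
        path, label = line.rsplit(separator, 1)
        samples.append((path, int(label)))  # assumes integer class indices
    return samples


if __name__ == "__main__":
    # Placeholder paths and labels, for illustration only.
    demo = [("videos/clip_a.mp4", 0), ("videos/clip_b.mp4", 3)]
    write_split_csv(demo, "train.csv")
    print(read_split_csv("train.csv"))
```

Reading the file back after writing it is a cheap way to catch malformed lines before starting a long training run.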

Depending on your system, we recommend decoding the videos to frames and then packing each set of frames into a h5 file with the same name as the original video.

2. Something-Something v2

You can download the dataset from the authors' webpage: https://20bn.com/datasets/something-something

Perform the same packing procedure as for Kinetics.

Usage

Training

python tools/run_net.py \
  --cfg configs/Kinetics/xvit_B16_16x16_k600.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset

Evaluation

python tools/run_net.py \
  --cfg configs/Kinetics/xvit_B16_16x16_k600.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  TEST.CHECKPOINT_FILE_PATH path_to_your_checkpoint \
  TRAIN.ENABLE False

Acknowledgements

This repo is built using components from SlowFast and timm.

License

XViT code is released under the Apache 2.0 license.
