Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper)

Last update: Nov 12, 2022

Overview

An Image is Worth 16x16 Words, What is a Video Worth?

Official PyTorch Implementation

Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor
DAMO Academy, Alibaba Group

Abstract

Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach state-of-the-art (SotA) accuracy usually use 3D convolution layers to abstract the temporal information from video frames. Such convolutions require sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers only a small fraction of the input video, multiple clips are sampled at inference to cover the whole temporal length of the video. This increases the computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Our approach is therefore very input efficient, and can achieve SotA results (on the Kinetics dataset) with a fraction of the data (frames per video), computation, and latency. Specifically, on Kinetics-400 we reach 78.8 top-1 accuracy with 30× fewer frames per video and 40× faster inference than the current leading method.
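The contrast the abstract draws between dense clip sampling and covering a whole video with a few frames can be illustrated with a small sketch. It assumes the sparse frames are taken at evenly spaced positions across the video; the function names and the segment-midpoint rule are illustrative choices, not taken from this repository.

```python
from __future__ import annotations


def uniform_frame_indices(num_video_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly over the whole video.

    The video is split into num_samples equal-length segments and the frame
    at the midpoint of each segment is selected (an assumed sampling rule).
    """
    if num_video_frames <= 0 or num_samples <= 0:
        raise ValueError("frame counts must be positive")
    segment = num_video_frames / num_samples
    return [
        min(num_video_frames - 1, int(segment * (i + 0.5)))
        for i in range(num_samples)
    ]


def dense_clip_indices(start: int, clip_len: int, stride: int = 1) -> list[int]:
    """Indices of one short clip of closely spaced frames."""
    return [start + i * stride for i in range(clip_len)]


if __name__ == "__main__":
    # A 10-second video at 30 fps has 300 frames.
    print("sparse, whole video:", uniform_frame_indices(300, 16))
    print("one dense clip:     ", dense_clip_indices(start=0, clip_len=16))
```

With 16 evenly spaced indices, a single forward pass sees frames from the start to the end of the video, whereas one dense 16-frame clip covers only about half a second of it, which is why clip-based models need many views per video.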

Main Article Results

Accuracy and GPU throughput of STAM models on Kinetics400, compared to X3D. All measurements were taken on an Nvidia V100 GPU with mixed precision. All models were trained at an input resolution of 224.

Model	Top-1 Accuracy (%)	FLOPs × Views (10^9)	# Input Frames	Runtime (Videos/sec)
X3D-M	76.0	6.2 × 30	480	1.3
X3D-L	77.5	24.8 × 30	480	0.46
X3D-XL	79.1	48.4 × 30	480	N/A
STAM-16	77.8	270 × 1	16	20.0
STAM-64	79.2	1080 × 1	64	4.8
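The ratios implied by the table can be worked out directly. The sketch below copies the table's numbers into a dictionary and derives total compute per video, the input-frame ratio, and the throughput ratio; the helper names are illustrative, and no figures beyond those in the table are introduced.

```python
# (top-1 %, GFLOPs per view, views, input frames, videos/sec), copied from the table.
RESULTS = {
    "X3D-M": (76.0, 6.2, 30, 480, 1.3),
    "X3D-L": (77.5, 24.8, 30, 480, 0.46),
    "X3D-XL": (79.1, 48.4, 30, 480, None),  # runtime is listed as N/A
    "STAM-16": (77.8, 270, 1, 16, 20.0),
    "STAM-64": (79.2, 1080, 1, 64, 4.8),
}


def total_gflops(model):
    """Total compute per video: GFLOPs per view times number of views."""
    _, gflops_per_view, views, _, _ = RESULTS[model]
    return gflops_per_view * views


def frame_ratio(baseline, model):
    """How many times more input frames the baseline consumes."""
    return RESULTS[baseline][3] / RESULTS[model][3]


def speedup(baseline, model):
    """Throughput ratio (videos/sec); None when either runtime is missing."""
    base, new = RESULTS[baseline][4], RESULTS[model][4]
    if base is None or new is None:
        return None
    return new / base


if __name__ == "__main__":
    for name in RESULTS:
        print(f"{name}: {total_gflops(name):.0f} GFLOPs per video")
    print("frames, X3D-M vs STAM-16:", frame_ratio("X3D-M", "STAM-16"))
    print("speedup, STAM-16 over X3D-L:", round(speedup("X3D-L", "STAM-16"), 1))
```

The frame ratio of 480 / 16 = 30 is consistent with the frame reduction quoted in the abstract, and 20.0 / 0.46 gives a throughput ratio of roughly 43 for STAM-16 over X3D-L.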

Pretrained Models

We provide a collection of STAM models pre-trained on Kinetics400.

Model name	checkpoint
STAM_16	link
STAM_32	link
STAM_64	link

Reproduce Article Scores

We provide code for reproducing the validation top-1 score of STAM models on Kinetics400. First, download pretrained models from the links above.

Then, run the infer.py script. For example, for stam_16 (input size 224) run:

python -m infer \
--val_dir=/path/to/kinetics_val_folder \
--model_path=/model/path/to/stam_16.pth \
--model_name=stam_16 \
--input_size=224
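To evaluate several checkpoints in a row, the same command can be assembled programmatically. The following is a minimal sketch that uses only the flags shown above; the helper name, the checkpoint file names for the larger models, and the assumption that `--model_name` accepts `stam_32` and `stam_64` are hypothetical and should be checked against infer.py.

```python
import shlex


def build_infer_command(val_dir, model_path, model_name, input_size=224):
    """Return the argument list for one infer.py run, using only documented flags."""
    return [
        "python", "-m", "infer",
        f"--val_dir={val_dir}",
        f"--model_path={model_path}",
        f"--model_name={model_name}",
        f"--input_size={input_size}",
    ]


if __name__ == "__main__":
    # Hypothetical checkpoint locations; only stam_16 appears in the example above.
    for name in ("stam_16", "stam_32", "stam_64"):
        cmd = build_infer_command(
            val_dir="/path/to/kinetics_val_folder",
            model_path=f"/model/path/to/{name}.pth",
            model_name=name,
        )
        # Print the command for inspection; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.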

Citations

@misc{sharir2021image,
    title   = {An Image is Worth 16x16 Words, What is a Video Worth?}, 
    author  = {Gilad Sharir and Asaf Noy and Lihi Zelnik-Manor},
    year    = {2021},
    eprint  = {2103.13915},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Acknowledgements

We thank Tal Ridnik for discussions and comments.

Some components of this code implementation are adapted from the excellent repository of Ross Wightman. Check it out and give it a star while you are at it.
