Towards Long-Form Video Understanding

Last update: Dec 26, 2022

Chao-Yuan Wu, Philipp Krähenbühl, CVPR 2021

[Paper] [Project Page] [Dataset]

Citation

@inproceedings{lvu2021,
  Author    = {Chao-Yuan Wu and Philipp Kr\"{a}henb\"{u}hl},
  Title     = {{Towards Long-Form Video Understanding}},
  Booktitle = {{CVPR}},
  Year      = {2021}}

Overview

This repo implements Object Transformers for long-form video understanding.

Getting Started

Please organize data/ as follows:

data
|_ ava
|_ features
|_ instance_meta
|_ lvu_1.0

ava, features, and instance_meta can be found in this Google Drive folder. lvu_1.0 can be found here.

Please also download the pre-trained weights from this Google Drive folder and put them in pretrained_models/.
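Before launching any of the scripts, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repo: the helper name `missing_paths` is hypothetical, and it only checks that the directories named above exist, not that their contents are complete.

```python
from pathlib import Path

# Directory names taken from the data/ layout described above.
EXPECTED_DATA_DIRS = ("ava", "features", "instance_meta", "lvu_1.0")


def missing_paths(repo_root="."):
    """Return the expected directories that do not exist under repo_root.

    Hypothetical helper: it checks for data/<name> and pretrained_models/
    only, and does not inspect what is inside them.
    """
    root = Path(repo_root)
    expected = [root / "data" / name for name in EXPECTED_DATA_DIRS]
    expected.append(root / "pretrained_models")
    return [str(p) for p in expected if not p.is_dir()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing directories:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected directories are present.")
```

Run it from the repository root; an empty result means every expected directory was found.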

Pre-training

python3 -u run_pretrain.py

This pretrains on a small demo dataset, data/instance_meta/instance_meta_pretrain_demo.pkl, as an example. Please follow its file format if you'd like to pretrain on a larger dataset (e.g., the latest full version of MovieClips).
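The file format of the demo dataset is not documented here, so one way to learn it is to load the pickle and look at its top-level structure. The sketch below assumes nothing about the contents: `summarize_pickle` is a hypothetical helper that reports only the type, length, and a few keys or items. Unpickling may require running from the repository root if the file refers to classes defined in the repo, and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def summarize_pickle(path, max_items=3):
    """Load a pickle file and describe its top-level object.

    Hypothetical helper: it makes no assumption about the real structure
    of the demo file and reports only what it finds.
    """
    with open(path, "rb") as f:
        obj = pickle.load(f)
    lines = ["type: " + type(obj).__name__]
    if hasattr(obj, "__len__"):
        lines.append("length: " + str(len(obj)))
    if isinstance(obj, dict):
        # Show the value type behind the first few keys.
        for key in list(obj)[:max_items]:
            lines.append("key " + repr(key) + " -> " + type(obj[key]).__name__)
    elif isinstance(obj, (list, tuple)):
        # Show the type of the first few items.
        for item in obj[:max_items]:
            lines.append("item -> " + type(item).__name__)
    return lines


if __name__ == "__main__":
    demo = Path("data/instance_meta/instance_meta_pretrain_demo.pkl")
    if demo.is_file():
        print("\n".join(summarize_pickle(demo)))
    else:
        print("Demo file not found at", demo)
```

A larger dataset would then need to reproduce whatever structure this reports for the demo file.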

Training and evaluating on AVA v2.2

python3 -u run_ava.py

This should achieve 31.0 mAP.

Training and evaluating on LVU tasks

python3 -u run.py [1-9]

The argument selects a task to run on. Please see run.py for details.
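To run several tasks back to back, the command above can be wrapped in a small driver. This is a sketch under one assumption: that `[1-9]` means a single integer from 1 to 9 passed as the only argument. The helpers `task_command` and `run_tasks` are hypothetical and not part of the repo; check run.py for what each task number means.

```python
import subprocess


def task_command(task_id):
    """Build the command for one LVU task, mirroring `python3 -u run.py <task>`.

    Assumes task ids are the integers 1 through 9.
    """
    if not 1 <= task_id <= 9:
        raise ValueError("task_id must be between 1 and 9")
    return ["python3", "-u", "run.py", str(task_id)]


def run_tasks(task_ids, dry_run=True):
    """Run the selected tasks in order; with dry_run, only print the commands."""
    for task_id in task_ids:
        cmd = task_command(task_id)
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the loop if a task exits with an error.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: prints the nine commands without executing them.
    run_tasks(range(1, 10), dry_run=True)
```

Pass `dry_run=False` to actually launch the runs, one after another, from the repository root.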

Acknowledgment

This implementation largely borrows from Huggingface Transformers. Please consider citing it if you use this repo.
