End-to-End Referring Video Object Segmentation with Multimodal Transformers

Related tags

Deep LearningMTTR
Overview

End-to-End Referring Video Object Segmentation with Multimodal Transformers

License Framework

This repo contains the official implementation of the paper:


End-to-End Referring Video Object Segmentation with Multimodal Transformers

MTTR_preview.mp4

How to Run the Code

First, clone this repo to your local machine using:

git clone https://github.com/mttr2021/MTTR.git

Dataset Requirements

A2D-Sentences

Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── a2d_sentences/ 
    ├── Release/
    │   ├── videoset.csv  (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4     (video files)
    └── text_annotations/
        ├── a2d_annotation.txt  (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/ 
            └── */ (video folders)
                └── *.h5 (annotations files) 

###JHMDB-Sentences Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── jhmdb_sentences/ 
    ├── Rename_Images/  (frame images)
    │   └── */ (action dirs)
    ├── puppet_mask/  (mask annotations)
    │   └── */ (action dirs)
    └── jhmdb_annotation.txt  (text annotations)

Refer-YouTube-VOS

Download the dataset from the competition's website here.

Note that you may be required to sign up to the competition in order to get access to the dataset. This registration process is free and short.

Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── refer_youtube_vos/ 
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files) 
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files) 
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files) 
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json  (text annotations)
        └── valid/
            └── meta_expressions.json  (text annotations)

Environment Installation

The code was tested on a Conda environment installed on Ubuntu 18.04. Install Conda and then create an environment as follows:

conda create -n mttr python=3.9.7 pip -y

conda activate mttr

  • Pytorch 1.10:

conda install pytorch==1.10.0 torchvision==0.11.1 -c pytorch -c conda-forge

Note that you might have to change the cudatoolkit version above according to your system's CUDA version.

  • Hugging Face transformers 4.11.3:

pip install transformers==4.11.3

  • COCO API (for mAP calculations):

pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

  • Additional required packages:

pip install h5py wandb opencv-python protobuf av einops ruamel.yaml timm joblib

conda install -c conda-forge pandas matplotlib cython scipy cupy

Running Configuration

The following table lists the parameters which can be configured directly from the command line.

The rest of the running/model parameters for each dataset can be configured in configs/DATASET_NAME.yaml.

Note that in order to run the code the path of the relevant .yaml config file needs to be supplied using the -c parameter.

Command Description
-c path to dataset configuration file
-rm running mode (train/eval)
-ws window size
-bs training batch size per GPU
-ebs eval batch size per GPU (if not provided, training batch size is used)
-ng number of GPUs to run on

Evaluation

The following commands can be used to reproduce the main results of our paper using the supplied checkpoint files.

The commands were tested on RTX 3090 24GB GPUs, but it may be possible to run some of them using GPUs with less memory by decreasing the batch-size -bs parameter.

A2D-Sentences

Window Size Command Checkpoint File mAP Result
10 python main.py -rm eval -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 Link 46.1
8 python main.py -rm eval -c configs/a2d_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 Link 44.7

JHMDB-Sentences

The following commands evaluate our A2D-Sentences-pretrained model on JHMDB-Sentences without additional training.

For this purpose, as explained in our paper, we uniformly sample three frames from each video. To ensure proper reproduction of our results on other machines we include the metadata of the sampled frames under datasets/jhmdb_sentences/jhmdb_sentences_samples_metadata.json. This file is automatically loaded during the evaluation process with the commands below.

To avoid using this file and force sampling different frames, change the seed and generate_new_samples_metadata parameters under MTTR/configs/jhmdb_sentences.yaml.

Window Size Command Checkpoint File mAP Result
10 python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 Link 39.2
8 python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 Link 36.6

Refer-YouTube-VOS

The following command evaluates our model on the public validation subset of Refer-YouTube-VOS dataset. Since annotations are not publicly available for this subset, our code generates a zip file with the predicted masks under MTTR/runs/[RUN_DATE_TIME]/validation_outputs/submission_epoch_0.zip. This zip needs to be uploaded to the competition server for evaluation. For your convenience we supply this zip file here as well.

Window Size Command Checkpoint File Output Zip J&F Result
12 python main.py -rm eval -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ckpt CHECKPOINT_PATH -ng 8 Link Link 55.32

Training

First, download the Kinetics-400 pretrained weights of Video Swin Transformer from this link. Note that these weights were originally published in video swin's original repo here.

Place the downloaded file inside your cloned repo directory as MTTR/pretrained_swin_transformer/swin_tiny_patch244_window877_kinetics400_1k.pth.

Next, the following commands can be used to train MTTR as described in our paper.

Note that it may be possible to run some of these commands on GPUs with less memory than the ones mentioned below by decreasing the batch-size -bs or window-size -ws parameters. However, changing these parameters may also affect the final performance of the model.

A2D-Sentences

  • The command for the following configuration was tested on 2 A6000 48GB GPUs:
Window Size Command
10 python main.py -rm train -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ng 2
  • The command for the following configuration was tested on 3 RTX 3090 24GB GPUs:
Window Size Command
8 python main.py -rm train -c configs/a2d_sentences.yaml -ws 8 -bs 2 -ng 3

Refer-YouTube-VOS

  • The command for the following configuration was tested on 4 A6000 48GB GPUs:
Window Size Command
12 python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ng 4
  • The command for the following configuration was tested on 8 RTX 3090 24GB GPUs.
Window Size Command
8 python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 8 -bs 1 -ng 8

Note that this last configuration was not mentioned in our paper. However, it is more memory efficient than the original configuration (window size 12) while producing a model which is almost as good (J&F of 54.56 in our experiments).

JHMDB-Sentences

As explained in our paper JHMDB-Sentences is used exclusively for evaluation, so training is not supported at this time for this dataset.

An official implementation of MobileStyleGAN in PyTorch

MobileStyleGAN: A Lightweight Convolutional Neural Network for High-Fidelity Image Synthesis Official PyTorch Implementation The accompanying videos c

Sergei Belousov 602 Jan 07, 2023
Dynamica causal Bayesian optimisation

Dynamic Causal Bayesian Optimization This is a Python implementation of Dynamic Causal Bayesian Optimization as presented at NeurIPS 2021. Abstract Th

nd308 18 Nov 22, 2022
Dataset for the Research2Clinics @ NeurIPS 2021 Paper: What Do You See in this Patient? Behavioral Testing of Clinical NLP Models

Behavioral Testing of Clinical NLP Models This repository contains code for testing the behavior of clinical prediction models based on patient letter

Betty van Aken 2 Sep 20, 2022
ObjectDetNet is an easy, flexible, open-source object detection framework

Getting started with the ObjectDetNet ObjectDetNet is an easy, flexible, open-source object detection framework which allows you to easily train, resu

5 Aug 25, 2020
Code and data for ImageCoDe, a contextual vison-and-language benchmark

ImageCoDe This repository contains code and data for ImageCoDe: Image Retrieval from Contextual Descriptions. Data All collected descriptions for the

McGill NLP 27 Dec 02, 2022
Code for Paper Predicting Osteoarthritis Progression via Unsupervised Adversarial Representation Learning

Predicting Osteoarthritis Progression via Unsupervised Adversarial Representation Learning (c) Tianyu Han and Daniel Truhn, RWTH Aachen University, 20

Tianyu Han 7 Nov 22, 2022
Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners

DART Implementation for ICLR2022 paper Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners. Environment

ZJUNLP 83 Dec 27, 2022
Weighted K Nearest Neighbors (kNN) algorithm implemented on python from scratch.

kNN_From_Scratch I implemented the k nearest neighbors (kNN) classification algorithm on python. This algorithm is used to predict the classes of new

1 Dec 14, 2021
This repository is the offical Pytorch implementation of ContextPose: Context Modeling in 3D Human Pose Estimation: A Unified Perspective (CVPR 2021).

Context Modeling in 3D Human Pose Estimation: A Unified Perspective (CVPR 2021) Introduction This repository is the offical Pytorch implementation of

37 Nov 21, 2022
Reimplementation of Dynamic Multi-scale filters for Semantic Segmentation.

Paddle implementation of Dynamic Multi-scale filters for Semantic Segmentation.

Hongqiang.Wang 2 Nov 01, 2021
hySLAM is a hybrid SLAM/SfM system designed for mapping

HySLAM Overview hySLAM is a hybrid SLAM/SfM system designed for mapping. The system is based on ORB-SLAM2 with some modifications and refactoring. Raú

Brian Hopkinson 15 Oct 10, 2022
A Haskell kernel for IPython.

IHaskell You can now try IHaskell directly in your browser at CoCalc or mybinder.org. Alternatively, watch a talk and demo showing off IHaskell featur

Andrew Gibiansky 2.4k Dec 29, 2022
Decensoring Hentai with Deep Neural Networks. Formerly named DeepMindBreak.

DeepCreamPy Decensoring Hentai with Deep Neural Networks. Formerly named DeepMindBreak. A deep learning-based tool to automatically replace censored a

616 Jan 06, 2023
[ICML 2021] "Graph Contrastive Learning Automated" by Yuning You, Tianlong Chen, Yang Shen, Zhangyang Wang

Graph Contrastive Learning Automated PyTorch implementation for Graph Contrastive Learning Automated [talk] [poster] [appendix] Yuning You, Tianlong C

Shen Lab at Texas A&M University 80 Nov 23, 2022
NAVER BoostCamp Final Project

CV 14조 final project Super Resolution and Deblur module Inference code & Pretrained weight Repo SwinIR Deblur 실행 방법 streamlit run WebServer/Server_SRD

JiSeong Kim 5 Sep 06, 2022
StudioGAN is a Pytorch library providing implementations of representative Generative Adversarial Networks (GANs) for conditional/unconditional image generation.

StudioGAN is a Pytorch library providing implementations of representative Generative Adversarial Networks (GANs) for conditional/unconditional image generation.

3k Jan 08, 2023
TVNet: Temporal Voting Network for Action Localization

TVNet: Temporal Voting Network for Action Localization This repo holds the codes of paper: "TVNet: Temporal Voting Network for Action Localization". P

hywang 5 Jul 26, 2022
Generate pixel-style avatars with python.

face2pixel Generate pixel-style avatars with python. Run: Clone the project: git clone https://github.com/theodorecooper/face2pixel install requiremen

Theodore Cooper 2 May 11, 2022
Lip Reading - Cross Audio-Visual Recognition using 3D Convolutional Neural Networks

Lip Reading - Cross Audio-Visual Recognition using 3D Convolutional Neural Networks - Official Project Page This repository contains the code develope

Amirsina Torfi 1.7k Dec 18, 2022
QRec: A Python Framework for quick implementation of recommender systems (TensorFlow Based)

Introduction QRec is a Python framework for recommender systems (Supported by Python 3.7.4 and Tensorflow 1.14+) in which a number of influential and

Yu 1.4k Jan 01, 2023