X-modaler is a versatile and high-performance codebase for cross-modal analytics.

Overview

X-modaler

X-modaler is a versatile and high-performance codebase for cross-modal analytics. This codebase unifies comprehensive high-quality modules in state-of-the-art vision-language techniques, which are organized in a standardized and user-friendly fashion.

The original paper can be found here.

Installation

See installation instructions.

Requirements

  • Linux or macOS with Python ≥ 3.6
  • PyTorch ≥ 1.8 and a torchvision version that matches the PyTorch installation. Installing both together from pytorch.org ensures they match.
  • fvcore
  • pytorch_transformers
  • jsonlines
  • pycocotools
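
The requirements above can be checked before training with a short script. The following is a minimal sketch, not part of X-modaler: the helper names (`parse_version`, `missing_modules`, `check_environment`) are hypothetical, and the import names in `REQUIRED_MODULES` are assumed to match the package names listed above.

```python
import importlib.util
import re
import sys

# Assumption: each package listed above is importable under this name.
REQUIRED_MODULES = [
    "torch",
    "torchvision",
    "fvcore",
    "pytorch_transformers",
    "jsonlines",
    "pycocotools",
]


def parse_version(text):
    """Convert a version string such as '1.8.1+cu111' into a tuple of ints."""
    numbers = []
    # Drop any local build suffix after '+', then read the leading digits
    # of each dot-separated piece, stopping at the first non-numeric piece.
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def missing_modules(names):
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_environment():
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python >= 3.6 is required")
    missing = missing_modules(REQUIRED_MODULES)
    problems.extend("module not found: " + name for name in missing)
    if "torch" not in missing:
        import torch  # imported lazily so the check still runs without PyTorch

        if parse_version(torch.__version__) < (1, 8):
            problems.append("PyTorch >= 1.8 is required, found " + torch.__version__)
    return problems
```

Calling `check_environment()` and printing each returned string gives a quick report. The sketch does not verify that torchvision matches the installed PyTorch build.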

Getting Started

See Getting Started with X-modaler.

Training & Evaluation in Command Line

We provide a script, "train_net.py", that can train all the configs provided in X-modaler. You may want to use it as a reference when writing your own training script.

To train a model (e.g., UpDown) with "train_net.py", first set up the corresponding datasets following the datasets instructions, then run:

# Teacher forcing (cross-entropy training)
python train_net.py --num-gpus 4 \
    --config-file configs/image_caption/updown.yaml

# Reinforcement learning
python train_net.py --num-gpus 4 \
    --config-file configs/image_caption/updown_rl.yaml
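
When these runs are launched from another Python script (for example a job scheduler wrapper), the same command lines can be assembled programmatically. This is a minimal sketch under stated assumptions: `build_train_command` is a hypothetical helper, not part of X-modaler, and it uses only the two flags shown above (`--num-gpus` and `--config-file`).

```python
import shlex
import sys


def build_train_command(config_file, num_gpus=4, script="train_net.py"):
    """Build the argument list for one training run (hypothetical helper)."""
    return [
        sys.executable,  # run with the same interpreter as this script
        script,
        "--num-gpus", str(num_gpus),
        "--config-file", config_file,
    ]


# The two configs used in the commands above.
CONFIGS = [
    "configs/image_caption/updown.yaml",     # teacher forcing
    "configs/image_caption/updown_rl.yaml",  # reinforcement learning
]

for config in CONFIGS:
    command = build_train_command(config)
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in command))
    # To actually launch it: subprocess.run(command, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems if a config path contains spaces.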

Model Zoo and Baselines

A large set of baseline results and trained models are available here.

Image Captioning
Attention Show, attend and tell: Neural image caption generation with visual attention ICML 2015
LSTM-A3 Boosting image captioning with attributes ICCV 2017
Up-Down Bottom-up and top-down attention for image captioning and visual question answering CVPR 2018
GCN-LSTM Exploring visual relationship for image captioning ECCV 2018
Transformer Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning ACL 2018
Meshed-Memory Meshed-Memory Transformer for Image Captioning CVPR 2020
X-LAN X-Linear Attention Networks for Image Captioning CVPR 2020
Video Captioning
MP-LSTM Translating Videos to Natural Language Using Deep Recurrent Neural Networks NAACL HLT 2015
TA Describing Videos by Exploiting Temporal Structure ICCV 2015
Transformer Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning ACL 2018
TDConvED Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning AAAI 2019
Vision-Language Pretraining
Uniter UNITER: UNiversal Image-TExt Representation Learning ECCV 2020
TDEN Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network AAAI 2021

Image Captioning on MSCOCO (Cross-Entropy Loss)

Name Model BLEU@1 BLEU@2 BLEU@3 BLEU@4 METEOR ROUGE-L CIDEr-D SPICE
LSTM-A3 GoogleDrive 75.3 59.0 45.4 35.0 26.7 55.6 107.7 19.7
Attention GoogleDrive 76.4 60.6 46.9 36.1 27.6 56.6 113.0 20.4
Up-Down GoogleDrive 76.3 60.3 46.6 36.0 27.6 56.6 113.1 20.7
GCN-LSTM GoogleDrive 76.8 61.1 47.6 36.9 28.2 57.2 116.3 21.2
Transformer GoogleDrive 76.4 60.3 46.5 35.8 28.2 56.7 116.6 21.3
Meshed-Memory GoogleDrive 76.3 60.2 46.4 35.6 28.1 56.5 116.0 21.2
X-LAN GoogleDrive 77.5 61.9 48.3 37.5 28.6 57.6 120.7 21.9
TDEN GoogleDrive 75.5 59.4 45.7 34.9 28.7 56.7 116.3 22.0

Image Captioning on MSCOCO (CIDEr Score Optimization)

Name Model BLEU@1 BLEU@2 BLEU@3 BLEU@4 METEOR ROUGE-L CIDEr-D SPICE
LSTM-A3 GoogleDrive 77.9 61.5 46.7 35.0 27.1 56.3 117.0 20.5
Attention GoogleDrive 79.4 63.5 48.9 37.1 27.9 57.6 123.1 21.3
Up-Down GoogleDrive 80.1 64.3 49.7 37.7 28.0 58.0 124.7 21.5
GCN-LSTM GoogleDrive 80.2 64.7 50.3 38.5 28.5 58.4 127.2 22.1
Transformer GoogleDrive 80.5 65.4 51.1 39.2 29.1 58.7 130.0 23.0
Meshed-Memory GoogleDrive 80.7 65.5 51.4 39.6 29.2 58.9 131.1 22.9
X-LAN GoogleDrive 80.4 65.2 51.0 39.2 29.4 59.0 131.0 23.2
TDEN GoogleDrive 81.3 66.3 52.0 40.1 29.6 59.8 132.6 23.4

Video Captioning on MSVD

Name Model BLEU@1 BLEU@2 BLEU@3 BLEU@4 METEOR ROUGE-L CIDEr-D SPICE
MP-LSTM GoogleDrive 77.0 65.6 56.9 48.1 32.4 68.1 73.1 4.8
TA GoogleDrive 80.4 68.9 60.1 51.0 33.5 70.0 77.2 4.9
Transformer GoogleDrive 79.0 67.6 58.5 49.4 33.3 68.7 80.3 4.9
TDConvED GoogleDrive 81.6 70.4 61.3 51.7 34.1 70.4 77.8 5.0

Video Captioning on MSR-VTT

Name Model BLEU@1 BLEU@2 BLEU@3 BLEU@4 METEOR ROUGE-L CIDEr-D SPICE
MP-LSTM GoogleDrive 73.6 60.8 49.0 38.6 26.0 58.3 41.1 5.6
TA GoogleDrive 74.3 61.8 50.3 39.9 26.4 59.4 42.9 5.8
Transformer GoogleDrive 75.4 62.3 50.0 39.2 26.5 58.7 44.0 5.9
TDConvED GoogleDrive 76.4 62.3 49.9 38.9 26.3 59.0 40.7 5.7

Visual Question Answering

Name Model Overall Yes/No Number Other
Uniter GoogleDrive 70.1 86.8 53.7 59.6
TDEN GoogleDrive 71.9 88.3 54.3 62.0

Caption-based image retrieval on Flickr30k

Name Model R1 R5 R10
Uniter GoogleDrive 61.6 87.7 92.8
TDEN GoogleDrive 62.0 86.6 92.4

Visual commonsense reasoning

Name Model Q -> A QA -> R Q -> AR
Uniter GoogleDrive 73.0 75.3 55.4
TDEN GoogleDrive 75.0 76.5 57.7

License

X-modaler is released under the Apache License, Version 2.0.

Citing X-modaler

If you use X-modaler in your research, please use the following BibTeX entry.

@inproceedings{Xmodaler2021,
  author =       {Yehao Li and Yingwei Pan and Jingwen Chen and Ting Yao and Tao Mei},
  title =        {X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics},
  booktitle =    {Proceedings of the 29th ACM International Conference on Multimedia},
  year =         {2021}
}