Code for "Learning the Best Pooling Strategy for Visual Semantic Embedding", CVPR 2021

Overview

Learning the Best Pooling Strategy for Visual Semantic Embedding

License: MIT

Official PyTorch implementation of the paper Learning the Best Pooling Strategy for Visual Semantic Embedding (CVPR 2021 Oral).

Please use the following bib entry to cite this paper if you are using any resources from the repo.

@inproceedings{chen2021vseinfty,
     title={Learning the Best Pooling Strategy for Visual Semantic Embedding},
     author={Chen, Jiacheng and Hu, Hexiang and Wu, Hao and Jiang, Yuning and Wang, Changhu},
     booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
     year={2021}
} 

We referred to the implementations of VSE++ and SCAN to build up our codebase.

Introduction

Illustration of the standard Visual Semantic Embedding (VSE) framework with the proposed pooling-based aggregator, i.e., Generalized Pooling Operator (GPO). It is simple and effective, which automatically adapts to the appropriate pooling strategy given different data modality and feature extractor, and improves VSE models at negligible extra computation cost.

Image-text Matching Results

The following tables show partial results of image-to-text retrieval on COCO and Flickr30K datasets. In these experiments, we use BERT-base as the text encoder for our methods. This branch provides our code and pre-trained models for using BERT as the text backbone, please check out to the bigru branch for the code and pre-trained models for using BiGRU as the text backbone.

Note that the VSE++ entries in the following tables are the VSE++ model with the specified feature backbones, thus the results are different from the original VSE++ paper.

Results of 5-fold evaluation on COCO 1K Test Split

Visual Backbone Text Backbone R1 R5 R1 R5 Link
VSE++ BUTD region BERT-base 67.9 91.9 54.0 85.6 -
VSEInfty BUTD region BERT-base 79.7 96.4 64.8 91.4 Here
VSEInfty BUTD grid BERT-base 80.4 96.8 66.4 92.1 Here
VSEInfty WSL grid BERT-base 84.5 98.1 72.0 93.9 Here

Results on Flickr30K Test Split

Visual Backbone Text Backbone R1 R5 R1 R5 Link
VSE++ BUTD region BERT-base 63.4 87.2 45.6 76.4 -
VSEInfty BUTD region BERT-base 81.7 95.4 61.4 85.9 Here
VSEInfty BUTD grid BERT-base 81.5 97.1 63.7 88.3 Here
VSEInfty WSL grid BERT-base 88.4 98.3 74.2 93.7 Here

Result (in [email protected]) on Crisscrossed Caption benchmark (trained on COCO)

Visual Backbone Text Backbone I2T T2I T2T I2I
VSRN BUTD region BiGRU 52.4 40.1 41.0 44.2
DE EfficientNet-B4 grid BERT-base 55.9 41.7 42.6 38.5
VSEInfty BUTD grid BERT-base 60.6 46.2 45.9 44.4
VSEInfty WSL grid BERT-base 67.9 53.6 46.7 51.3

Preparation

Environment

We trained and evaluated our models with the following key dependencies:

  • Python 3.7.3

  • Pytorch 1.2.0

  • Transformers 2.1.0

Run pip install -r requirements.txt to install the exactly same dependencies as our experiments. However, we also verified that using the latest Pytorch 1.8.0 and Transformers 4.4.2 can also produce similar results.

Data

We organize all data used in the experiments in the following manner:

data
├── coco
│   ├── precomp  # pre-computed BUTD region features for COCO, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   ├── images   # raw coco images
│   │      ├── train2014
│   │      └── val2014
│   │
│   ├── cxc_annots # annotations for evaluating COCO-trained models on the CxC benchmark
│   │
│   └── id_mapping.json  # mapping from coco-id to image's file name
│   
│
├── f30k
│   ├── precomp  # pre-computed BUTD region features for Flickr30K, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   ├── flickr30k-images   # raw coco images
│   │      ├── xxx.jpg
│   │      └── ...
│   └── id_mapping.json  # mapping from f30k index to image's file name
│   
├── weights
│      └── original_updown_backbone.pth # the BUTD CNN weights
│
└── vocab  # vocab files provided by SCAN (only used when the text backbone is BiGRU)

The download links for original COCO/F30K images, precomputed BUTD features, and corresponding vocabularies are from the offical repo of SCAN. The precomp folders contain pre-computed BUTD region features, data/coco/images contains raw MS-COCO images, and data/f30k/flickr30k-images contains raw Flickr30K images.

The id_mapping.json files are the mapping from image index (ie, the COCO id for COCO images) to corresponding filenames, we generated these mappings to eliminate the need of the pycocotools package.

weights/original_updowmn_backbone.pth is the pre-trained ResNet-101 weights from Bottom-up Attention Model, we converted the original Caffe weights into Pytorch. Please download it from this link.

The data/coco/cxc_annots directory contains the necessary data files for running the Criscrossed Caption (CxC) evaluation. Since there is no official evaluation protocol in the CxC repo, we processed their raw data files and generated these data files to implement our own evaluation. We have verified our implementation by aligning the evaluation results of the official VSRN model with the ones reported by the CxC paper Please download the data files at this link.

Please download all necessary data files and organize them in the above manner, the path to the data directory will be the argument to the training script as shown below.

Training

Assuming the data root is /tmp/data, we provide example training scripts for:

  1. Grid feature with BUTD CNN for the image feature, BERT-base for the text feature. See train_grid.sh

  2. BUTD Region feature for the image feature, BERT-base for the text feature. See train_region.sh

To use other CNN initializations for the grid image feature, change the --backbone_source argument to different values:

  • (1). the default detector is to use the BUTD ResNet-101, we have adapted the original Caffe weights into Pytorch and provided the download link above;
  • (2). wsl is to use the backbones from large-scale weakly supervised learning;
  • (3). imagenet_res152 is to use the ResNet-152 pre-trained on ImageNet.

Evaluation

Run eval.py to evaluate specified models on either COCO and Flickr30K. For evaluting pre-trained models on COCO, use the following command (assuming there are 4 GPUs, and the local data path is /tmp/data):

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 eval.py --dataset coco --data_path /tmp/data/coco

For evaluting pre-trained models on Flickr-30K, use the command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 eval.py --dataset f30k --data_path /tmp/data/f30k

For evaluating pre-trained COCO models on the CxC dataset, use the command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 eval.py --dataset coco --data_path /tmp/data/coco --evaluate_cxc

For evaluating two-model ensemble, first run single-model evaluation commands above with the argument --save_results, and then use eval_ensemble.py to get the results (need to manually specify the paths to the saved results).

Owner
Jiacheng Chen
Jiacheng Chen
StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators

StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators [Project Website] [Replicate.ai Project] StyleGAN-NADA: CLIP-Guided Domain Adaptation

992 Dec 30, 2022
LaneDetectionAndLaneKeeping - Lane Detection And Lane Keeping

LaneDetectionAndLaneKeeping This project is part of my bachelor's thesis. The go

5 Jun 27, 2022
Code for "NeRS: Neural Reflectance Surfaces for Sparse-View 3D Reconstruction in the Wild," in NeurIPS 2021

Code for Neural Reflectance Surfaces (NeRS) [arXiv] [Project Page] [Colab Demo] [Bibtex] This repo contains the code for NeRS: Neural Reflectance Surf

Jason Y. Zhang 234 Dec 30, 2022
git《Beta R-CNN: Looking into Pedestrian Detection from Another Perspective》(NeurIPS 2020) GitHub:[fig3]

Beta R-CNN: Looking into Pedestrian Detection from Another Perspective This is the pytorch implementation of our paper "[Beta R-CNN: Looking into Pede

35 Sep 08, 2021
PyTorch implementation of saliency map-aided GAN for Auto-demosaic+denosing

Saiency Map-aided GAN for RAW2RGB Mapping The PyTorch implementations and guideline for Saiency Map-aided GAN for RAW2RGB Mapping. 1 Implementations B

Yuzhi ZHAO 20 Oct 24, 2022
Official PyTorch implementation of "Edge Rewiring Goes Neural: Boosting Network Resilience via Policy Gradient".

Edge Rewiring Goes Neural: Boosting Network Resilience via Policy Gradient This repository is the official PyTorch implementation of "Edge Rewiring Go

Shanchao Yang 4 Dec 12, 2022
Reproduces ResNet-V3 with pytorch

ResNeXt.pytorch Reproduces ResNet-V3 (Aggregated Residual Transformations for Deep Neural Networks) with pytorch. Tried on pytorch 1.6 Trains on Cifar

Pau Rodriguez 481 Dec 23, 2022
level1-image-classification-level1-recsys-09 created by GitHub Classroom

level1-image-classification-level1-recsys-09 ❗ 주제 설명 COVID-19 Pandemic 상황 속 마스크 착용 유무 판단 시스템 구축 마스크 착용 여부, 성별, 나이 총 세가지 기준에 따라 총 18개의 class로 구분하는 모델 ?

6 Mar 17, 2022
Implementation of Hierarchical Transformer Memory (HTM) for Pytorch

Hierarchical Transformer Memory (HTM) - Pytorch Implementation of Hierarchical Transformer Memory (HTM) for Pytorch. This Deepmind paper proposes a si

Phil Wang 63 Dec 29, 2022
SOLO and SOLOv2 for instance segmentation, ECCV 2020 & NeurIPS 2020.

SOLO: Segmenting Objects by Locations This project hosts the code for implementing the SOLO algorithms for instance segmentation. SOLO: Segmenting Obj

Xinlong Wang 1.5k Dec 31, 2022
An unofficial styleguide and best practices summary for PyTorch

A PyTorch Tools, best practices & Styleguide This is not an official style guide for PyTorch. This document summarizes best practices from more than a

IgorSusmelj 1.5k Jan 05, 2023
Implementation of Bidirectional Recurrent Independent Mechanisms (Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention over Modules)

BRIMs Bidirectional Recurrent Independent Mechanisms Implementation of the paper Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neura

Sarthak Mittal 26 May 26, 2022
Official implementation for "Symbolic Learning to Optimize: Towards Interpretability and Scalability"

Symbolic Learning to Optimize This is the official implementation for ICLR-2022 paper "Symbolic Learning to Optimize: Towards Interpretability and Sca

VITA 8 Dec 19, 2022
This repository contains the source code of Auto-Lambda and baselines from the paper, Auto-Lambda: Disentangling Dynamic Task Relationships.

Auto-Lambda This repository contains the source code of Auto-Lambda and baselines from the paper, Auto-Lambda: Disentangling Dynamic Task Relationship

Shikun Liu 76 Dec 20, 2022
Educational API for 3D Vision using pose to control carton.

Educational API for 3D Vision using pose to control carton.

41 Jul 10, 2022
Source Code For Template-Based Named Entity Recognition Using BART

Template-Based NER Source Code For Template-Based Named Entity Recognition Using BART Training Training train.py Inference inference.py Corpus ATIS (h

174 Dec 19, 2022
IDA file loader for UF2, created for the DEFCON 29 hardware badge

UF2 Loader for IDA The DEFCON 29 badge uses the UF2 bootloader, which conveniently allows you to dump and flash the firmware over USB as a mass storag

Kevin Colley 6 Feb 08, 2022
A Small and Easy approach to the BraTS2020 dataset (2D Segmentation)

BraTS2020 A Light & Scalable Solution to BraTS2020 | Medical Brain Tumor Segmentation (2D Segmentation) Developed the segmentation models for segregat

Gunjan Haldar 0 Jan 19, 2022
Multi Agent Reinforcement Learning for ROS in 2D Simulation Environments

IROS21 information To test the code and reproduce the experiments, follow the installation steps in Installation.md. Afterwards, follow the steps in E

11 Oct 29, 2022
A blender add-on that automatically re-aligns wrong axis objects.

Auto Align A blender add-on that automatically re-aligns wrong axis objects. Usage There are three options available in the 3D Viewport Sidebar It

29 Nov 25, 2022