An easy-to-use and efficient codebase for extracting OpenAI CLIP (global/grid) features from images and text.

Overview

Extracting OpenAI CLIP (Global/Grid) Features from Images and Text

This repo provides easy-to-use and efficient code for extracting image and text features with the official OpenAI CLIP models, and it is also optimized for multi-GPU feature extraction.

The official OpenAI CLIP repo only supports extracting global visual features, while the local grid features from CLIP visual models may contain more detailed semantic information that can benefit multiple vision-and-language downstream tasks [1][2]. As an alternative, this repo wraps minimally modified CLIP code to extract not only global visual features but also local grid visual features from different CLIP visual models. Moreover, the repo is designed in a user-friendly, object-oriented fashion, so users can easily add their own visual_extractor classes to support different input and output grid resolutions.
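To make the global/grid distinction concrete, the following minimal sketch (an illustration, not this repo's actual extractor) pulls both kinds of features out of an official CLIP ViT model with a forward hook. It assumes the openai/clip package and its internal model.visual layout; example.jpg is a placeholder path, and the repo's own extractors may additionally normalize or pool the patch tokens.

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# CLIP's VisionTransformer runs its transformer in (tokens, batch, width)
# layout, so a forward hook on it exposes the per-patch token embeddings.
captured = {}
def grab_tokens(module, inputs, output):
    captured["tokens"] = output  # (1 + 49, batch, 768) for ViT-B/32

handle = model.visual.transformer.register_forward_hook(grab_tokens)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    global_feat = model.encode_image(image)  # (1, 512) global feature

# Drop the class token (index 0) and reorder to (batch, patches, width).
grid_feat = captured["tokens"][1:].permute(1, 0, 2)  # (1, 49, 768)
handle.remove()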

To verify the semantic meaning of the extracted visual grid features, we also applied the grid features of MSCOCO images extracted from different official CLIP models to the standard image captioning task. We easily obtained comparable or superior results over the transformer baseline without heavy hyperparameter tuning, simply by replacing BUTD features with the extracted CLIP grid features. Surprisingly, we got a 116.9 CIDEr score in the teacher-forcing setting and 129.6 in the reinforcement-learning setting with the ViT-B/32 CLIP model, which conflicts with the results in the CLIP-ViL paper [1], where the authors observed a large performance degradation for CLIP-ViT-B grid features compared with other models (a 58.0 CIDEr score in the CLIP-ViT-B_Transformer setting on COCO Captioning).

We provide the supported CLIP models, results on MSCOCO image captioning, and other information below. We believe this repo can facilitate the use of the powerful CLIP models.

1. Supported CLIP Models

Currently this repo supports five visual extractor settings: three standard pipelines used in the official OpenAI CLIP repo and two additional customized pipelines that support a larger input resolution. You can refer to this file for more details on customizing your own visual backbone for a different input and output resolution. To improve training efficiency on the image captioning task, in some settings we apply AvgPool2d to the output feature map to reduce the grid feature size without a large performance drop (see the sketch after the table). We will support more CLIP models in the future.

|  | Visual Backbone | CLIP Model | Input Resolution | Output Resolution | Feature Map Downsample | Grid Feature Shape | Global Feature Shape |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Standard | RN101 | RN101 | 224 x 224 | 7 x 7 | None | 49 x 2048 | 1 x 512 |
|  | ViT-B/32 | ViT-B/32 | 224 x 224 | 7 x 7 | None | 49 x 768 | 1 x 512 |
|  | ViT-B/16 | ViT-B/16 | 224 x 224 | 14 x 14 | AvgPool2d(kernel_size=(2,2), stride=2) | 49 x 768 | 1 x 512 |
| Customized | RN101_448 | RN101 | 448 x 448 | 14 x 14 | AvgPool2d(kernel_size=(2,2), stride=2) | 49 x 2048 | 1 x 512 |
|  | ViT-B/32_448 | ViT-B/32 | 448 x 448 | 14 x 14 | AvgPool2d(kernel_size=(2,2), stride=2) | 49 x 768 | 1 x 512 |
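As a quick sanity check on the downsampling column above, this small PyTorch snippet (illustrative only) shows how AvgPool2d(kernel_size=(2,2), stride=2) turns a 14 x 14 ViT-B/16 feature map into the 49-token grid listed in the table:

import torch
import torch.nn as nn

pool = nn.AvgPool2d(kernel_size=(2, 2), stride=2)
fmap = torch.randn(1, 768, 14, 14)        # (batch, channels, H, W) from ViT-B/16
pooled = pool(fmap)                       # -> (1, 768, 7, 7)
grid = pooled.flatten(2).transpose(1, 2)  # -> (1, 49, 768) grid features
print(grid.shape)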

2. Results on MSCOCO Image Captioning (Karpathy's Splits)

We ran image captioning experiments on X-modaler with the extracted CLIP grid features. We easily obtained comparable or superior results over the transformer baseline using X-modaler's default hyperparameters, except for SOLVER.BASE_LR=2e-4 in the ViT-B/16 and ViT-B/32_448 teacher-forcing settings. The performance of the transformer baseline with BUTD features is taken from X-modaler's paper.

2.1 Teacher-forcing

| Name | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BUTD | 76.4 | 60.3 | 46.5 | 35.8 | 28.2 | 56.7 | 116.6 | 21.3 |
| RN101 | 77.3 | 61.3 | 47.7 | 36.9 | 28.7 | 57.5 | 120.6 | 21.8 |
| ViT-B/32 | 76.4 | 60.3 | 46.5 | 35.6 | 28.1 | 56.7 | 116.9 | 21.2 |
| ViT-B/16 | 78.0 | 62.1 | 48.2 | 37.2 | 28.8 | 57.6 | 122.3 | 22.1 |
| RN101_448 | 78.1 | 62.3 | 48.4 | 37.5 | 29.0 | 58.0 | 122.9 | 22.2 |
| ViT-B/32_448 | 75.8 | 59.6 | 45.9 | 35.1 | 27.8 | 56.3 | 114.2 | 21.0 |

2.2 Self-critical Reinforcement Learning

| Name | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BUTD | 80.5 | 65.4 | 51.1 | 39.2 | 29.1 | 58.7 | 130.0 | 23.0 |
| RN101 | - | - | - | - | - | - | - | - |
| ViT-B/32 | 79.9 | 64.6 | 50.4 | 38.5 | 29.0 | 58.6 | 129.6 | 22.8 |
| ViT-B/16 | 82.0 | 67.3 | 53.1 | 41.1 | 29.9 | 59.8 | 136.6 | 23.8 |
| RN101_448 | 81.7 | 66.9 | 52.6 | 40.5 | 29.9 | 59.7 | 136.1 | 23.9 |
| ViT-B/32_448 | - | - | - | - | - | - | - | - |

3. Get Started

Note: The extracted feature files are compatible with X-modaler, where you can conveniently set up your experiments on cross-modal analytics.

3.1 Requirements

  • PyTorch ≥ 1.9 and a torchvision version that matches the PyTorch installation; install them together via pytorch.org to ensure compatibility
  • timm ≥ 0.4.5

3.2 Examples

  1. Use CLIP ViT-B/32 model to extract global textual features of MSCOCO sentences from dataset_coco.json in Karpathy's released annotations.
CUDA_VISIBLE_DEVICES=0 python3 clip_textual_feats.py \
    --anno dataset_coco.json \
    --output_dir ${TXT_OUTPUT_DIR} \
    --model_type_or_path 'ViT-B/32'
  2. Use CLIP ViT-B/16 model to extract global and grid visual features of MSCOCO images.
CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'ViT-B/16' \
    --model_type_or_path 'ViT-B/16'
  3. Use CLIP RN101 model to extract global and grid visual features of MSCOCO images.
CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'RN101' \
    --model_type_or_path 'RN101'
  4. Use CLIP RN101 model to extract global and grid visual features of MSCOCO images with 448 x 448 resolution.
CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'RN101_448' \
    --model_type_or_path 'RN101'
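For reference, the core computation behind the textual example above is the standard CLIP text-encoding call. A minimal sketch (using the official clip package, not this repo's script, with made-up example captions):

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

captions = ["a man riding a horse", "two dogs playing in the snow"]
tokens = clip.tokenize(captions).to(device)
with torch.no_grad():
    text_feats = model.encode_text(tokens)  # (2, 512) global textual features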

3.3 Speeding Up Feature Extraction with Multiple GPUs

You can run the same script with the same input list (i.e. --image_list or --anno) on another GPU, which can even be on a different machine, provided that the disk the features are written to is shared between the machines. The new extraction process will only work on items that have not been processed yet, without overlapping with the extraction processes already running.
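The coordination is file-based: each process checks whether an item's output file already exists and skips it if so. A minimal sketch of the pattern (an illustration of the idea, not this repo's code; the per-item .npy naming is a hypothetical placeholder):

import os
import numpy as np

def extract_pending(image_ids, output_dir, extract_fn):
    """Extract features only for items whose output file does not exist yet."""
    for image_id in image_ids:
        out_path = os.path.join(output_dir, f"{image_id}.npy")
        if os.path.exists(out_path):
            continue  # another process (possibly on another machine) did it
        np.save(out_path, extract_fn(image_id))

In the worst case two processes race on the same item and extract it twice, which wastes a little compute but leaves the output intact.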

4. License

MIT

5. Acknowledgement

This repo uses resources from OpenAI CLIP, timm, CLIP-ViL, and X-modaler, and is implemented in PyTorch. We thank the authors for open-sourcing their awesome projects.

6. References

[1] How Much Can CLIP Benefit Vision-and-Language Tasks? Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer. arXiv, 2021.

[2] In Defense of Grid Features for Visual Question Answering. Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen. In CVPR, 2020.
