An easy to use, user-friendly and efficient code for extracting OpenAI CLIP (Global/Grid) features from image and text respectively.

Last update: Jan 06, 2023

Related tags

Text Data & NLP openai-clip

Overview

Extracting OpenAI CLIP (Global/Grid) Features from Image and Text

This repo aims at providing an easy to use and efficient code for extracting image & text features using the official OpenAI CLIP models, which is also optimized for multi processing GPU feature extraction.

The official OpenAI CLIP repo only supports extracting global visual features, while the local grid features from CLIP visual models may also contain more detailed semantic information which can benefit multi visual-and-language downstream tasks[1][2]. As an alternative, this repo encapsulates minor-modified CLIP code in order to extract not only global visual features but also local grid visual features from different CLIP visual models. What's more, this repo is designed in a user-friendly object-oriented fashion, allowing users to add their customized visual_extractor classes easily to customize different input and output grid resolution.

To verify the semantic meaning of the extracted visual grid features, we also applied the extracted visual grid features of MSCOCO images from different official CLIP models for standard image captioning task. We got comparable or superior results in transformer baseline easily without hard-tuning hyperparameters, via simply replacing BUTD features with the extracted CLIP gird features. Surprisingly, we got 116.9 CIDEr score in teacher-forcing setting and 129.6 in reinforcement learning setting when using ViT-B/32 CLIP model, which conflicts with the experiment results in CLIP-ViL paper [1] where the authors observed that CLIP-ViT-B with grid features has a large performance degradation compared with other models (58.0 CIDEr score in CLIP-ViT-B_Transformer setting in COCO Captioning).

We provide supported CLIP models, results on MSCOCO image captioning, and other information below. We believe this repo can facilitate the usage of powerful CLIP models.

1. Supported CLIP Models

Currently this repo supports five visual extractor settings, including three standard pipelines used in official OpenAI CLIP repo and two additional customized pipelines supporting larger input resolution. You can refer to this file for more details about customizing your own visual backbones for different input and output resolution. In order to imporve training efficiency in image captioning task, we apply AvgPool2d to the output feature map to reduce grid features size in some settings without large performance degradation. We will support more CLIP models in the future.

	Visual Backbone	CLIP Model	Input Resolution	Output Resolution	Feature Map Downsample	Grid Feature Shape	Global Feature Shape
Standard	RN101	RN101	224 x 224	7 x 7	None	49 x 2048	1 x 512
	ViT-B/32	ViT-B/32	224 x 224	7 x 7	None	49 x 768	1 x 512
	ViT-B/16	ViT-B/16	224 x 224	14 x 14	AvgPool2d(kernel_size=(2,2), stride=2)	49 x 768	1 x 512
Customized	RN101_448	RN101	448 x 448	14 x 14	AvgPool2d(kernel_size=(2,2), stride=2)	49 x 2048	1 x 512
Customized	ViT-B/32_448	ViT-B/32	448 x 448	14 x 14	AvgPool2d(kernel_size=(2,2), stride=2)	49 x 768	1 x 512

2. Results on MSCOCO Image Captioning (Karpathy's Splits)

We ran image captioning experiments on X-modaler with the extracted CLIP grid features. We easily got comparable or superior results in transformer baseline using the default hyperparameters in X-modaler's transformer baseline, except for SOLVER.BASE_LR=2e-4 in ViT-B/16 and ViT-B/32_448 teacher-forcing settings. The performance of transformer baseline using BUTD features is taken from X-modaler's paper.

2.1 Teacher-forcing

Name	[email protected]	[email protected]	[email protected]	[email protected]	METEOR	ROUGE-L	CIDEr-D	SPICE
BUTD	76.4	60.3	46.5	35.8	28.2	56.7	116.6	21.3
RN101	77.3	61.3	47.7	36.9	28.7	57.5	120.6	21.8
ViT-B/32	76.4	60.3	46.5	35.6	28.1	56.7	116.9	21.2
ViT-B/16	78.0	62.1	48.2	37.2	28.8	57.6	122.3	22.1
RN101_448	78.1	62.3	48.4	37.5	29.0	58.0	122.9	22.2
ViT-B/32_448	75.8	59.6	45.9	35.1	27.8	56.3	114.2	21.0

2.2 Self-critical Reinforcement Learning

Name	[email protected]	[email protected]	[email protected]	[email protected]	METEOR	ROUGE-L	CIDEr-D	SPICE
BUTD	80.5	65.4	51.1	39.2	29.1	58.7	130.0	23.0
RN101	-	-	-	-	-	-	-	-
ViT-B/32	79.9	64.6	50.4	38.5	29.0	58.6	129.6	22.8
ViT-B/16	82.0	67.3	53.1	41.1	29.9	59.8	136.6	23.8
RN101_448	81.7	66.9	52.6	40.5	29.9	59.7	136.1	23.9
ViT-B/32_448	-	-	-	-	-	-	-	-

3. Get Started

Note: The extracted feature files are compatible with X-modaler, where you can setup your experiments about cross-modal analytics conveniently.

3.1 Requirements

PyTorch ≥ 1.9 and torchvision that matches the PyTorch installation. Install them together at pytorch.org to make sure of this
timm ≥ 0.4.5

3.2 Examples

Use CLIP ViT-B/32 model to extract global textual features of MSCOCO sentences from dataset_coco.json in Karpathy's released annotations.

CUDA_VISIBLE_DEVICES=0 python3 clip_textual_feats.py \
    --anno dataset_coco.json \
    --output_dir ${TXT_OUTPUT_DIR} \
    --model_type_or_path 'ViT-B/32'

Use CLIP ViT-B/16 model to extract global and grid visual features of MSCOCO images.

CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'ViT-B/16' \
    --model_type_or_path 'ViT-B/16'

Use CLIP RN101 model to extract global and grid visual features of MSCOCO images.

CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'RN101' \
    --model_type_or_path 'RN101'

Use CLIP RN101 model to extract global and grid visual features of MSCOCO images with 448 x 448 resolution.

CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'RN101_448' \
    --model_type_or_path 'RN101'

3.3 Speeding up feature extraction with Multiple GPUs

You can run the same script with same input list (i.e. --image_list or --anno) on another GPU (that can be from a different machine, provided that the disk to output the features is shared between the machines). The script will create a new feature extraction process that will only focus on processing the items that have not been processed yet, without overlapping with the other extraction process already running.

4. License

MIT

5. Acknowledgement

This repo used resources from OpenAI CLIP, timm, CLIP-ViL, X-modaler. The repo is implemented using PyTorch. We thank the authors for open-sourcing their awesome projects.

6. References

[1] How Much Can CLIP Benefit Vision-and-Language Tasks? Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer. In Arxiv2021.

[2] In Defense of Grid Features for Visual Question Answering. Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen. In CVPR2020.

An easy to use, user-friendly and efficient code for extracting OpenAI CLIP (Global/Grid) features from image and text respectively.

Related tags

Overview

Extracting OpenAI CLIP (Global/Grid) Features from Image and Text

1. Supported CLIP Models

2. Results on MSCOCO Image Captioning (Karpathy's Splits)

2.1 Teacher-forcing

2.2 Self-critical Reinforcement Learning

3. Get Started

3.1 Requirements

3.2 Examples

3.3 Speeding up feature extraction with Multiple GPUs

4. License

5. Acknowledgement

6. References

Owner

Jianjie(JJ) Luo

Code for CVPR 2021 paper: Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning

Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE

Checking spelling of form elements

Simple NLP based project without any use of AI

Implementation of paper Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoBERTa.

Quick insights from Zoom meeting transcripts using Graph + NLP

Neural network sequence labeling model

a chinese segment base on crf

Open solution to the Toxic Comment Classification Challenge

Code Generation using a large neural network called GPT-J

Snips Python library to extract meaning from text

Chinese Pre-Trained Language Models (CPM-LM) Version-I

Exploring dimension-reduced embeddings

NeurIPS'21: Probabilistic Margins for Instance Reweighting in Adversarial Training (Pytorch implementation).

Rootski - Full codebase for rootski.io (without the data)

A website which allows you to play with the GPT-2 transformer

An assignment on creating a minimalist neural network toolkit for CS11-747

Text to speech converter with GUI made in Python.

A script that automatically creates a branch name using google translation api and jira api

Translation to python of Chris Sims' optimization function