Python package to generate image embeddings with CLIP without PyTorch/TensorFlow

Overview

imgbeddings

A Python package to generate embedding vectors from images, using OpenAI's robust CLIP model via Hugging Face transformers. These image embeddings, derived from an image model that has seen the entire internet up to mid-2020, can be used for many things: unsupervised clustering (e.g. via umap), embeddings search (e.g. via faiss), and using downstream for other framework-agnostic ML/AI tasks such as building a classifier or calculating image similarity.

  • The embeddings generation models are ONNX INT8-quantized, meaning they're 20-30% faster on the CPU, much smaller on disk, and don't require PyTorch or TensorFlow as dependencies!
  • Works for many different image domains thanks to CLIP's zero-shot performance.
  • Includes utilities for using principal component analysis (PCA) to reduce the dimensionality of generated embeddings without losing much info.
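
For example, a matrix of 768D embeddings can be reduced to a smaller space before clustering or search. The sketch below uses scikit-learn's PCA directly as an illustration; this is an assumption for demonstration purposes, not the package's own PCA helpers, whose exact API isn't shown here.

import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a matrix of real imgbeddings outputs (n_images x 768)
embeddings = np.random.rand(1000, 768).astype(np.float32)

# Reduce to 128 dimensions (an arbitrary illustrative choice)
pca = PCA(n_components=128)
reduced = pca.fit_transform(embeddings)
print(reduced.shape)  # (1000, 128)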

Real-World Demo Notebooks

You can read how to use imgbeddings for real-world use cases in these Jupyter Notebooks:

Installation

imgbeddings can be installed from PyPI:

pip3 install imgbeddings

Quick Example

Let's say you want to generate an image embedding for a cute cat photo. First you can download the photo:

import requests
from PIL import Image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

Then, you can load imgbeddings. By default, imgbeddings will load an 88MB model based on the patch32 variant of CLIP, which splits each image into 49 32x32 patches.

from imgbeddings import imgbeddings
ibed = imgbeddings()

You can also load the patch16 model by passing patch_size = 16 to imgbeddings() (more granular embeddings, but about 3x slower to run), or the "large" patch14 model with patch_size = 14 (3.5x the model size, and 3x slower than patch16).
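
For example:

ibed = imgbeddings(patch_size=16)  # patch16: more granular, ~3x slower than patch32
# ibed = imgbeddings(patch_size=14)  # "large" patch14: 3.5x model size, ~3x slower than patch16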

Then, to generate embeddings, all you have to do is pass the image to to_embeddings()!

embedding = ibed.to_embeddings(image)
embedding[0][0:5] # array([ 0.914541, 0.45988417, 0.0350069 , -0.9054574 , 0.08941309], dtype=float32)

This returns a 768D numpy vector for each input, which can be used for pretty much anything in the ML/AI world. You can also pass a list of filenames and/or PIL Images for batch embeddings generation.
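
For instance, here is a minimal sketch of batch generation followed by cosine similarity, one common way to calculate image similarity (the second input filename is hypothetical):

import numpy as np

embeddings = ibed.to_embeddings([image, "another_cat.jpg"])  # a PIL Image plus a hypothetical filename
a, b = embeddings[0], embeddings[1]
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity)  # closer to 1.0 means more similar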

See the Demo Notebooks above for more advanced parameters and real-world use cases. More formal documentation will be added soon.

Ethics

The official paper for CLIP explicitly notes that there are inherent biases in the finished model, and that CLIP shouldn't be used in production applications as a result. My perspective is that having better free-and-open-source tools to detect such issues and make them more transparent is an overall good for the future of AI, especially since less-public, less-accessible ways to create image embeddings already exist. At the least, this package doesn't do anything that wasn't already available when CLIP was open-sourced in January 2021.

If you do use imgbeddings for your own project, I recommend doing a strong QA pass across a diverse set of inputs for your application, which is something you should always be doing whenever you work with machine learning, biased models or not.

imgbeddings is not responsible for malicious misuse of image embeddings.

Design Notes

  • Note that CLIP was trained on square images only, and imgbeddings will pad and resize rectangular images into a square (imgbeddings deliberately does not center crop). As a result, images that are too wide or too tall (e.g. more than a 3:1 ratio of largest dimension to smallest) will not generate robust embeddings; see the sketch after this list.
  • This package intentionally works only with image data, as opposed to leveraging CLIP's ability to link images and text. For downstream tasks, using your own text in conjunction with an image will likely give better results. (e.g. if training a model on image embeddings + text embeddings, feed both and let the model determine the relative importance of each for your use case)
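
As a rough illustration of the pad-and-resize behavior described in the first note above, here's a sketch using Pillow's ImageOps. This is an approximation for intuition, not imgbeddings' actual internal code, and the input filename is hypothetical:

from PIL import Image, ImageOps

image = Image.open("tall_photo.jpg")  # hypothetical rectangular input
# Resize the longest side down to 224 (CLIP patch32's input resolution: 7x7 patches of 32x32)
# and pad the shorter side with black, rather than center cropping.
squared = ImageOps.pad(image, (224, 224), color=(0, 0, 0))
print(squared.size)  # (224, 224)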

For more miscellaneous design notes, see DESIGN.md.

Maintainer/Creator

Max Woolf (@minimaxir)

Max's open-source projects are supported by his Patreon and GitHub Sponsors. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.

License

MIT
