Vision transformers (ViTs) have found only limited practical use in processing images

Last update: Sep 10, 2022

CXV

Convolutional Xformers for Vision

Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks. The reasons for their limited use include their need for larger training datasets and more computational resources compared to convolutional neural networks (CNNs), owing to the quadratic complexity of their self-attention mechanism in the number of tokens. We propose a linear attention-convolution hybrid architecture, Convolutional X-formers for Vision (CXV), to overcome these limitations. We replace the quadratic attention with linear attention mechanisms, such as Performer, Nyströmformer, and Linear Transformer, to reduce GPU usage. An inductive prior for image data is provided by convolutional sub-layers, thereby eliminating the need for the class token and positional embeddings used by ViTs. For image classification in scenarios with limited data and GPU resources, CXV outperforms other architectures: token mixers (e.g. ConvMixer, FNet and MLP Mixer), transformer models (e.g. ViT, CCT, CvT and hybrid Xformers), and ResNets.
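
To illustrate why a linear attention mechanism costs less than quadratic attention, here is a minimal NumPy sketch of kernelized attention in the style of the Linear Transformer. It assumes the elu(x) + 1 feature map; the function names, token count, and embedding size are hypothetical and are not taken from the CXV code. The linear form never builds the n x n score matrix, yet matches a quadratic reference that uses the same kernel.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a positive feature map (assumed here, as in the Linear Transformer).
    # np.minimum guards exp() against overflow on the branch np.where discards.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v, eps=1e-6):
    """q, k: (n, d); v: (n, d_v). Cost grows linearly with n."""
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                    # (d, d_v) summary of keys and values
    z = q @ k.sum(axis=0)           # (n,) per-query normalizer
    return (q @ kv) / (z[:, None] + eps)

def quadratic_reference(q, k, v, eps=1e-6):
    """Same kernel, but materializes the (n, n) score matrix."""
    q, k = feature_map(q), feature_map(k)
    scores = q @ k.T                # (n, n): the quadratic term
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
n, d = 196, 32                      # hypothetical: 14 x 14 image tokens, 32-dim heads
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

out = linear_attention(q, k, v)
ref = quadratic_reference(q, k, v)
print(out.shape, np.allclose(out, ref))
```

The saving comes from reordering the matrix product: computing `k.T @ v` first takes memory proportional to d x d_v rather than n x n. Performer and Nyströmformer reach linear cost through different approximations that this sketch does not cover.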

Models:

CNV - Convolutional Nyströmformer for Vision
CPV - Convolutional Performer for Vision
CLTV - Convolutional Linear Transformer for Vision

Owner

Cloudwalker
