Official implementation of "Refiner: Refining Self-attention for Vision Transformers".

Last update: Dec 29, 2022

RefinerViT

This repo is the official implementation of "Refiner: Refining Self-attention for Vision Transformers". The repo is built on top of timm and includes the relabeling trick from TokenLabelling.

Introduction

Refined Vision Transformer was initially described on arXiv. The paper observes that vision transformers (ViTs) require much more data for model pre-training. Most recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs. Specifically, refiner explores attention expansion, which projects the multi-head attention maps to a higher-dimensional space to promote their diversity. Further, refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are aggregated locally with learnable kernels and then globally aggregated with self-attention. Extensive experiments demonstrate that refiner works surprisingly well. Significantly, it enables ViTs to achieve 86% top-1 classification accuracy on ImageNet with only 81M parameters.
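The two steps described above (attention expansion, then convolution over the attention maps) can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in this repo: the function name `refine_attention`, the argument names, the random weights, and the final projection back to the original number of heads are all assumptions made for the example. The repo itself is written in PyTorch; NumPy is used here only to keep the sketch short.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def refine_attention(attn, w_expand, kernels, w_reduce):
    """Hypothetical sketch of refining multi-head attention maps.

    attn:     (H, N, N) attention maps, one per head
    w_expand: (E, H)    linear map across heads, typically with E > H
    kernels:  (E, k, k) one k x k kernel per expanded map, k odd
    w_reduce: (H, E)    linear map back to H heads (an assumption here)
    """
    # Attention expansion: mix the H maps into E maps along the head axis.
    expanded = np.einsum("eh,hij->eij", w_expand, attn)

    # Per-map convolution (cross-correlation) with zero padding, so each
    # entry of a map is combined with its spatial neighbours in that map.
    k = kernels.shape[-1]
    pad = k // 2
    n = attn.shape[-1]
    padded = np.pad(expanded, ((0, 0), (pad, pad), (pad, pad)))
    conv = np.zeros_like(expanded)
    for di in range(k):
        for dj in range(k):
            conv += kernels[:, di, dj, None, None] * padded[:, di:di + n, dj:dj + n]

    # Project back to H maps so they can be applied to the H value tensors.
    return np.einsum("he,eij->hij", w_reduce, conv)


# Toy usage with random (untrained) weights.
rng = np.random.default_rng(0)
H, E, N, D = 4, 8, 6, 16
q = rng.standard_normal((H, N, D))
key = rng.standard_normal((H, N, D))
v = rng.standard_normal((H, N, D))

attn = softmax(q @ key.transpose(0, 2, 1) / np.sqrt(D))
refined = refine_attention(
    attn,
    w_expand=rng.standard_normal((E, H)),
    kernels=rng.standard_normal((E, 3, 3)),
    w_reduce=rng.standard_normal((H, E)),
)
out = refined @ v  # refined maps aggregate the values, shape (H, N, D)
```

With identity expansion and reduction matrices and a kernel that is 1 at its centre and 0 elsewhere, `refine_attention` returns the input maps unchanged, which is a convenient sanity check for the sketch.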

Please run git clone with --recursive to clone timm as a submodule, then install it with cd pytorch-image-models && pip install -e ./

Requirements

torch>=1.4.0
torchvision>=0.5.0
pyyaml
numpy
timm==0.4.5

A summary of the results is shown below for quick reference. Details can be found in the paper.

Model	Heads	Layers	Dim	Image resolution	Params	Top-1 (%)
Refiner-ViT-S	12	16	384	224	25M	83.6
Refiner-ViT-S	12	16	384	384	25M	84.6
Refiner-ViT-M	12	32	420	224	55M	84.6
Refiner-ViT-M	12	32	420	384	55M	85.6
Refiner-ViT-L	16	32	512	224	81M	84.9
Refiner-ViT-L	16	32	512	384	81M	85.8
Refiner-ViT-L	16	32	512	448	81M	86.0

Training

Train the Refiner-ViT-S from scratch:

bash run.sh scripts/refiner_s.yaml

To use the relabeling trick to improve accuracy, download the relabel_data based on NFNet, which is provided in the TokenLabelling repo. Then copy the relabeling data to the data folder.
