A curated list of efficient attention modules

Overview

awesome-fast-attention Awesome

A curated list of efficient attention modules (last update: Wed, 10 Mar 2021 23:52:22 +0000)

Table of Contents

Efficient Attention

Paper (citations) Implementation Computational Complexity AutoRegressive Main Idea
Generating Wikipedia by Summarizing Long Sequences (282) memory-compressed-attention formula ✔️
EXPAND

compresses key and value + blocked attention

CBAM: Convolutional Block Attention Module (999+) attention-module formula
EXPAND

combines the SE attention with a per pixel(local) weight

Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks (16) set_transformer formula
EXPAND

uses K relay nodes

CCNet: Criss-Cross Attention for Semantic Segmentation (296) CCNet formula
EXPAND

each pixel attends to its row and column simultaneously

Efficient Attention: Attention with Linear Complexities (16) efficient-attention formula
EXPAND

Softmax(Q)*(Softmax(K^T)*V)

Star-Transformer (40) fastNLP formula
EXPAND

uses a relay(global) node and attends to/from that node

GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond (199) GCNet formula
EXPAND

squeeze and excitation with an attention pooling (instead of a GAP)

Generating Long Sequences with Sparse Transformers (257) DeepSpeed formula ✔️
EXPAND

sparse block based attention

SCRAM: Spatially Coherent Randomized Attention Maps (1) - formula ✔️
EXPAND

uses PatchMatch to find close keys

Interlaced Sparse Self-Attention for Semantic Segmentation (24) IN_PAPER formula ✔️
EXPAND

combination of a short length and then long range(dilated) attention

Permutohedral Attention Module for Efficient Non-Local Neural Networks (3) Permutohedral_attention_module formula
EXPAND

uses permutohedral lattice approximation algorithm to approximate the attention output

Large Memory Layers with Product Keys (43) XLM formula ✔️
EXPAND

search for nearest neighbor keys

Expectation-Maximization Attention Networks for Semantic Segmentation (79) EMANet formula
EXPAND

applys expectation maximization to cluster keys into k clusters

BP-Transformer: Modelling Long-Range Context via Binary Partitioning (15) BPT formula ✔️
EXPAND

attends to distant tokens coarsely and attends to close tokens in a more fine-grained manner

Compressive Transformers for Long-Range Sequence Modelling (48) compressive-transformer-pytorch formula ✔️
EXPAND

compresses distant tokens instead of just stop_grad() ing them, more efficient version of transformerXL

Axial Attention in Multidimensional Transformers (36) axial-attention formula ✔️
EXPAND

apply attention on each axis separately

Reformer: The Efficient Transformer (216) trax formula ✔️
EXPAND

uses LSH to find close keys

Sparse Sinkhorn Attention (16) sinkhorn-transformer formula ✔️
EXPAND

uses a cost matrix to limit attention between buckets

Transformer on a Diet (2) transformer-on-diet formula ✔️
EXPAND

dilated transformer like wavenet

Time-aware Large Kernel Convolutions (9) TaLKConvolutions formula ✔️
EXPAND

calculate mean over a dynamic subsequence around each token with the help of summed-area table

SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection (2) - formula ✔️
EXPAND

learns the q, k connections == dynamically creates a sparse attention matrix

Efficient Content-Based Sparse Attention with Routing Transformers (38) routing-transformer formula ✔️
EXPAND

computes attention with same-cluster tokens (computed by online k-means)

Neural Architecture Search for Lightweight Non-Local Networks (11) AutoNL formula
EXPAND

computes Q(KV) and also down samples q, k, v both in spatial and channel dimensions

Longformer: The Long-Document Transformer (159) longformer formula ✔️
EXPAND

global + blocked attention

ETC: Encoding Long and Structured Inputs in Transformers (16) - formula
EXPAND

combines global attention (star transformer with multiple global tokens) with local attention

Multi-scale Transformer Language Models (2) IN_PAPER formula ✔️
EXPAND

UNet like + retina attetion is something close to BP-Transformer

Synthesizer: Rethinking Self-Attention in Transformer Models (26) Synthesizer-Rethinking-Self-Attention-Transformer-Models formula ✔️
EXPAND

does not compute pairwise interactions

Jukebox: A Generative Model for Music (45) jukebox formula ✔️
EXPAND

better attention patterns from Sparse Transformer

Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers (0) - formula ✔️
EXPAND

does not compute pairwise interactions and uses fixed mask patters

GMAT: Global Memory Augmentation for Transformers (2) gmat formula
EXPAND

adds global tokens

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (45) fast-transformers formula ✔️
EXPAND

uses phi(q)(phi(k)v) and also improves the sequential sampling step

Linformer: Self-Attention with Linear Complexity (47) linformer-pytorch formula
EXPAND

project key and value from nd to kd

Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers (8) google-research formula ✔️
EXPAND

calculate an unbiased stochastic approximation of the attention matrix

Kronecker Attention Networks (1) kronecker-attention-pytorch formula
EXPAND

uses horizontal and lateral average matrices

Real-time Semantic Segmentation with Fast Attention (5) - formula
EXPAND

l2_norm(q)*(l2_norm(k)*v)

Fast Transformers with Clustered Attention (6) fast-transformers formula
EXPAND

groups queries together with LSH

Big Bird: Transformers for Longer Sequences (60) DeepSpeed formula
EXPAND

ETC with random connections

Tensor Low-Rank Reconstruction for Semantic Segmentation (3) - formula
EXPAND

decompose the full attention tensor into rank one tensors (CP decomposition)

Looking for change? Roll the Dice and demand Attention (0) IN_PAPER formula
EXPAND

uses the fractal tanimoto similarity to compare queries with keys inside the attention module

Rethinking Attention with Performers (30) google-research formula ✔️
EXPAND

unbiased approximation of the attention matrix with softmax kernel

Memformer: The Memory-Augmented Transformer (0) memformer formula ✔️
EXPAND

attend to memory slots + Memory-Replay BackPropagation

SMYRF: Efficient Attention using Asymmetric Clustering (1) smyrf formula
EXPAND

LSH with balanced clusters

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting (0) Informer2020 formula ✔️
EXPAND

sparse attention + funnel like encoder

Sub-Linear Memory: How to Make Performers SLiM (0) google-research formula ✔️
EXPAND

Performer but with sublinear Memory usage

Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (0) Nystromformer formula
EXPAND

uses Nystrom method to approximate the attention matrix

Linear Transformers Are Secretly Fast Weight Memory Systems (0) fast-weight-transformers formula ✔️
EXPAND

show that linear transformers are basically fast weight networks + propose a new kernel function to linearise attention, balancing simplicity and effectiveness

LambdaNetworks: Modeling Long-Range Interactions Without Attention (6) lambda-networks formula ✔️
EXPAND

generates a linear layer based on context + decouple pos/context

Random Feature Attention (2) - formula ✔️
EXPAND

kernel approximation and also transformers are rnn

Articles/Surveys/Benchmarks

Owner
Sepehr Sameni
PhD Candidate at the University of Bern, Computer Vision Group
Sepehr Sameni
Contains descriptions and code of the mini-projects developed in various programming languages

TexttoSpeechAndLanguageTranslator-project introduction A pleasant application where the client will be given buttons like play,reset and exit. The cli

Adarsh Reddy 1 Dec 22, 2021
profile tools for pytorch nn models

nnprof Introduction nnprof is a profile tool for pytorch neural networks. Features multi profile mode: nnprof support 4 profile mode: Layer level, Ope

Feng Wang 42 Jul 09, 2022
[ICCV 2021] Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification

Counterfactual Attention Learning Created by Yongming Rao*, Guangyi Chen*, Jiwen Lu, Jie Zhou This repository contains PyTorch implementation for ICCV

Yongming Rao 89 Dec 18, 2022
Command Line Text-To-Speech using Google TTS

cli-tts Thanks to gTTS by @pndurette! This is an interactive command line text-to-speech tool using Google TTS. Just type text and the voice will be p

ReekyStive 3 Nov 11, 2022
Submit issues and feature requests for our API here.

AIx GPT API Submit issues and feature requests for our API here. See https://apps.aixsolutionsgroup.com for more info. Python Quick Start pip install

AIx Solutions 7 Mar 27, 2022
DeLighT: Very Deep and Light-Weight Transformers

DeLighT: Very Deep and Light-weight Transformers This repository contains the source code of our work on building efficient sequence models: DeFINE (I

Sachin Mehta 440 Dec 18, 2022
뉴스 도메인 질의응답 시스템 (21-1학기 졸업 프로젝트)

뉴스 도메인 질의응답 시스템 본 프로젝트는 뉴스기사에 대한 질의응답 서비스 를 제공하기 위해서 진행한 프로젝트입니다. 약 3개월간 ( 21. 03 ~ 21. 05 ) 진행하였으며 Transformer 아키텍쳐 기반의 Encoder를 사용하여 한국어 질의응답 데이터셋으로

TaegyeongEo 4 Jul 08, 2022
This program do translate english words to portuguese

Python-Dictionary This program is used to translate english words to portuguese. Web-Scraping This program use BeautifulSoap to make web scraping, so

João Assalim 1 Oct 10, 2022
Задания КЕГЭ по информатике 2021 на Python

КЕГЭ 2021 на Python В этом репозитории мои решения типовых заданий КЕГЭ по информатике в 2021 году, БЕСПЛАТНО! Задания Взяты с https://inf-ege.sdamgia

8 Oct 13, 2022
Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models.

Tevatron Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models. The toolkit has a modularized

texttron 193 Jan 04, 2023
Phrase-Based & Neural Unsupervised Machine Translation

Unsupervised Machine Translation This repository contains the original implementation of the unsupervised PBSMT and NMT models presented in Phrase-Bas

Facebook Research 1.5k Dec 28, 2022
Final Project for the Intel AI Readiness Boot Camp NLP (Jan)

NLP Boot Camp (Jan) Synopsis Full Name: Prameya Mohanty Name of your School: Delhi Public School, Rourkela Class: VIII Title of the Project: iTransect

TheCodingHub 1 Feb 01, 2022
GCRC: A Gaokao Chinese Reading Comprehension dataset for interpretable Evaluation

GCRC GCRC: A New Challenging MRC Dataset from Gaokao Chinese for Explainable Eva

Yunxiao Zhao 5 Nov 04, 2022
A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

Multilingual Latent Dirichlet Allocation (LDA) Pipeline This project is for text clustering using the Latent Dirichlet Allocation (LDA) algorithm. It

Artifici Online Services inc. 74 Oct 07, 2022
本项目是作者们根据个人面试和经验总结出的自然语言处理(NLP)面试准备的学习笔记与资料,该资料目前包含 自然语言处理各领域的 面试题积累。

【关于 NLP】那些你不知道的事 作者:杨夕、芙蕖、李玲、陈海顺、twilight、LeoLRH、JimmyDU、艾春辉、张永泰、金金金 介绍 本项目是作者们根据个人面试和经验总结出的自然语言处理(NLP)面试准备的学习笔记与资料,该资料目前包含 自然语言处理各领域的 面试题积累。 目录架构 一、【

1.4k Dec 30, 2022
Extract city and country mentions from Text like GeoText without regex, but FlashText, a Aho-Corasick implementation.

flashgeotext ⚡ 🌍 Extract and count countries and cities (+their synonyms) from text, like GeoText on steroids using FlashText, a Aho-Corasick impleme

Ben 57 Dec 16, 2022
Non-Autoregressive Predictive Coding

Non-Autoregressive Predictive Coding This repository contains the implementation of Non-Autoregressive Predictive Coding (NPC) as described in the pre

Alexander H. Liu 43 Nov 15, 2022
Code for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned Language Models in the wild .

🌳 Fingerprinting Fine-tuned Language Models in the wild This is the code and dataset for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned La

LCS2-IIITDelhi 5 Sep 13, 2022
TPlinker for NER 中文/英文命名实体识别

本项目是参考 TPLinker 中HandshakingTagging思想,将TPLinker由原来的关系抽取(RE)模型修改为命名实体识别(NER)模型。

GodK 113 Dec 28, 2022
RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

RoNER RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, hi

Stefan Dumitrescu 9 Nov 07, 2022