Implementation of Memory-Compressed Attention, from the paper "Generating Wikipedia By Summarizing Long Sequences"

Last update: Dec 23, 2022

Overview

Memory Compressed Attention

Implementation of the self-attention layer of the proposed Memory-Compressed Attention, in PyTorch. This repository offers both the causal and non-causal variants, and takes care of the padding when the sequence length is not divisible by the compression ratio.
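The padding arithmetic can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's code: the helper names `padding_needed` and `compressed_length` are hypothetical, and it assumes each compressed key summarizes exactly `compression_factor` consecutive positions.

```python
def padding_needed(seq_len: int, compression_factor: int) -> int:
    """Pad positions required so that seq_len becomes divisible by the factor."""
    # Python's modulo is non-negative for a positive divisor, so this is 0
    # when seq_len is already a multiple of compression_factor.
    return (-seq_len) % compression_factor


def compressed_length(seq_len: int, compression_factor: int) -> int:
    """Number of compressed keys/values after padding (hypothetical helper)."""
    padded = seq_len + padding_needed(seq_len, compression_factor)
    return padded // compression_factor


# Sequence length 1024 with compression factor 3, as in the Usage section.
print(padding_needed(1024, 3))     # 2
print(compressed_length(1024, 3))  # 342
print(padding_needed(1023, 3))     # 0 (already divisible)
```

Under these assumptions, 1024 queries would attend to roughly a third as many keys, which is where the memory saving comes from.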

The code also resolves an edge case where the very first query has no keys to attend to in the auto-regressive scenario. The solution is to use null key/values, attached to the final compressed set, so that there is always at least one key for every query to attend to.
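Why a null key helps can be seen with a small framework-free sketch. The names `softmax` and `attend_with_null` are hypothetical, and the null key's score of 0.0 is an assumption for illustration (in a real layer it would come from a learned parameter); this is not the repository's implementation.

```python
import math

NEG_INF = float("-inf")


def softmax(scores):
    """Numerically stable softmax that refuses a fully masked row."""
    top = max(scores)
    if top == NEG_INF:
        raise ValueError("every key is masked, so softmax is undefined")
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_null(scores, visible, null_score=0.0):
    """Mask hidden keys, then put an always-visible null key in front."""
    masked = [s if v else NEG_INF for s, v in zip(scores, visible)]
    return softmax([null_score] + masked)


# A first query in the causal case: assume none of the compressed keys
# may be attended to, so all attention mass falls on the null key.
print(attend_with_null([1.0, 2.0], [False, False]))  # [1.0, 0.0, 0.0]
```

Without the extra entry, the same row would be entirely `-inf` and the softmax would have nothing to normalize over.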

Install

$ pip install memory_compressed_attention

Usage

import torch
from memory_compressed_attention import MemoryCompressedAttention

attn = MemoryCompressedAttention(
    dim = 512,
    heads = 8,                 # number of heads
    causal = False,            # auto-regressive or not
    compression_factor = 3,    # compression ratio
    dropout = 0.1              # dropout post-attention
)

x = torch.randn(1, 1024, 512)
mask = torch.ones(1, 1024).bool()

attn(x, input_mask = mask) # (1, 1024, 512)

Citations

@misc{liu2018generating,
    title={Generating Wikipedia by Summarizing Long Sequences},
    author={Peter J. Liu and Mohammad Saleh and Etienne Pot and Ben Goodrich and Ryan Sepassi and Lukasz Kaiser and Noam Shazeer},
    year={2018},
    eprint={1801.10198},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}


Comments

The order of masking and softmax operation

Hi,

In memory_compressed_attention.py, I'm wondering whether the softmax should be applied after masking. Also, should the entry in the mask be float('-inf') instead of -float('-inf')? If I have got something wrong, please correct me.

Thanks!

opened by cfeng16 3
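The two points raised in this question can be checked with a small plain-Python experiment. It is only a sketch of the general behavior of softmax under masking, with a hypothetical `softmax` helper, and makes no claim about what memory_compressed_attention.py actually does.

```python
import math


def softmax(xs):
    """Plain softmax with max-subtraction (hypothetical helper)."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]
keep = [True, True, False]  # the last key must be hidden

# Mask with -inf BEFORE softmax: hidden key gets weight 0, row still sums to 1.
before = softmax([s if k else float("-inf") for s, k in zip(scores, keep)])

# Mask AFTER softmax: hidden key gets weight 0, but the row no longer sums to 1.
after = [w if k else 0.0 for w, k in zip(softmax(scores), keep)]

# Mask with -float('-inf'), i.e. +inf: inf - inf is NaN, which poisons the row.
wrong_sign = softmax([s if k else -float("-inf") for s, k in zip(scores, keep)])

print(before[2], round(sum(before), 6))        # 0.0 1.0
print(sum(after) < 1.0)                        # True
print(all(math.isnan(w) for w in wrong_sign))  # True
```

In this sketch, filling masked positions with negative infinity and then applying softmax is the only variant that yields a valid probability distribution over the visible keys.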

mask error in attention

Very grateful for your pioneering work! I want to use it in the standard Transformer released at http://nlp.seas.harvard.edu/2018/04/03/attention.html, but I ran into a mask error during training. More details follow. The code I use:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvCompress(nn.Module):
    def __init__(self, dim, ratio = 2, groups = 1):
        super(ConvCompress, self).__init__()
        self.conv = nn.Conv1d(dim, dim, ratio, stride = ratio, groups = groups)
        #self.linear = nn.Linear(dim, dim)

    def forward(self, mem):
        mem = mem.transpose(1, 2)
        compressed_mem = self.conv(mem)
        return compressed_mem.transpose(1, 2)

class MemoryCompressedAttention(nn.Module):
    def __init__(self, h, d_model, compression_factor = 2, dropout = 0.1):
        super(MemoryCompressedAttention, self).__init__()
        assert (d_model % h) == 0, 'dimension must be divisible by number of heads'
        self.h = h
        self.d_model = d_model
        self.d_k = d_model // h

        self.compression_factor = compression_factor
        self.compress_fn = ConvCompress(d_model, compression_factor, groups = h)

        #self.to_qkv = nn.Linear(dim, dim * 3, bias = False)
        self.wq = nn.Linear(d_model, d_model, bias = False)
        self.wk = nn.Linear(d_model, d_model, bias = False)
        self.wv = nn.Linear(d_model, d_model, bias = False)

        self.wo = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)

        #self.null_k = nn.Parameter(torch.zeros(1, 1, d_model))
        #self.null_v = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, query, key, value, mask = None):
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        t = query.size(1)
        cf = self.compression_factor

        query = self.wq(query)
        key = self.wk(key)
        value = self.wv(value)

        # make sure keys and values sequence lengths
        # are divisible by the compression factor
        padding = cf - (t % cf)
        if padding != 0:
            key, value = map(lambda t: F.pad(t, (0, 0, padding, 0)), (key, value))

        # compress keys and values
        key, value = map(self.compress_fn, (key, value))

        # attach a null key and value, in the case that the first query has no keys to pay attention to
        null_k = nn.Parameter(torch.zeros(key.size(0), 1, self.d_model)).cuda()
        null_v = nn.Parameter(torch.zeros(value.size(0), 1, self.d_model)).cuda()

        key = torch.cat((null_k, key), dim=1)
        value = torch.cat((null_v, value), dim=1)

        # merge heads
        #query, key, value = map(lambda t: t.reshape(*t.shape[:2], h, -1).transpose(1, 2), (query, key, value))
        # 1) Do all the linear projections in batch from d_model => h x d_k
        query = query.view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
        key = key.view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
        value = value.view(nbatches, -1, self.h, self.d_k).transpose(1, 2)

        # 2) Apply attention on all the projected vectors in batch.
        # (attention() is presumably the helper from the tutorial linked above)
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)

        # 3) "Concat" using a view and apply a final linear.   # split heads and combine
        x = x.contiguous().view(nbatches, -1, self.d_model)
        out = self.wo(x)

        return out

The error was shown as follows:

I want to know how to fix it, and how to build a mask for an N*M matrix.

opened by HN123-123 0
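One possible way to build an N*M mask (N queries, M compressed keys) is sketched below in plain Python. Everything here is an assumption for illustration rather than the repository's method: the helper names `compress_mask` and `build_mask` are hypothetical, padding is placed at the front of the sequence as in the `F.pad(t, (0, 0, padding, 0))` call quoted in this issue, a compressed key counts as visible if any position in its block is valid, and a null key in the first slot is always visible.

```python
def compress_mask(key_mask, cf):
    """Reduce a per-position key mask to one entry per compressed block."""
    pad = (-len(key_mask)) % cf
    padded = [False] * pad + list(key_mask)  # front padding (assumption)
    blocks = [any(padded[i:i + cf]) for i in range(0, len(padded), cf)]
    return [True] + blocks                   # slot 0 is the null key


def build_mask(query_mask, key_mask, cf):
    """N x M mask: entry [i][j] is True when query i may attend to key j."""
    km = compress_mask(key_mask, cf)
    return [[q and k for k in km] for q in query_mask]


# 7 key positions (last 3 are padding tokens), compression factor 3.
key_mask = [True, True, True, True, False, False, False]
query_mask = [True, False]

mask = build_mask(query_mask, key_mask, cf=3)
print(len(mask), len(mask[0]))  # 2 4
print(mask[0])                  # [True, True, True, False]
```

The key point in this sketch is that the mask's last dimension has to match the number of compressed keys (plus the null key), not the original sequence length. A mask built for the uncompressed N*N case would not line up with the attention scores.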

Releases (0.0.5)

0.0.5 (Aug 19, 2022)
0.0.4 (Aug 18, 2022)
0.0.3 (Feb 10, 2021)
0.0.2 (Jul 26, 2020)
0.0.1 (Jul 26, 2020)

Owner

Phil Wang

Working with Attention. It's all we need

