Implementation of a Transformer that Ponders, using the scheme from the PonderNet paper

Last update: Oct 04, 2022

Overview

Ponder(ing) Transformer

Implementation of a Transformer that learns to adapt the number of computational steps it takes depending on the difficulty of the input sequence, using the scheme from the PonderNet paper. Will also try to abstract out a pondering module that can be used with any block that returns an output with the halting probability.

This repository would not have been possible without repeated viewings of Yannic's educational video

Install

$ pip install ponder-transformer

Usage

import torch
from ponder_transformer import PonderTransformer

model = PonderTransformer(
    num_tokens = 20000,
    dim = 512,
    max_seq_len = 512
)

mask = torch.ones(1, 512).bool()

x = torch.randint(0, 20000, (1, 512))
y = torch.randint(0, 20000, (1, 512))

loss = model(x, labels = y, mask = mask)
loss.backward()

Now you can set the model to .eval() mode and it will terminate early when all samples of the batch have emitted a halting signal

import torch
from ponder_transformer import PonderTransformer

model = PonderTransformer(
    num_tokens = 20000,
    dim = 512,
    max_seq_len = 512,
    causal = True
)

x = torch.randint(0, 20000, (2, 512))
mask = torch.ones(2, 512).bool()

model.eval() # setting to eval makes it return the logits as well as the halting indices

logits, layer_indices = model(x,  mask = mask) # (2, 512, 20000), (2)

# layer indices will contain, for each batch element, which layer they exited

Citations

@misc{banino2021pondernet,
    title   = {PonderNet: Learning to Ponder}, 
    author  = {Andrea Banino and Jan Balaguer and Charles Blundell},
    year    = {2021},
    eprint  = {2107.05407},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}

You might also like...

Implementation of the Transformer variant proposed in "Transformer Quality in Linear Time"

FLASH - Pytorch Implementation of the Transformer variant proposed in the paper Transformer Quality in Linear Time Install $ pip install FLASH-pytorch

209 Dec 28, 2022

Third party Pytorch implement of Image Processing Transformer (Pre-Trained Image Processing Transformer arXiv:2012.00364v2)

ImageProcessingTransformer Third party Pytorch implement of Image Processing Transformer (Pre-Trained Image Processing Transformer arXiv:2012.00364v2)

61 Jan 1, 2023

Episodic Transformer (E.T.) is a novel attention-based architecture for vision-and-language navigation. E.T. is based on a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.

Episodic Transformers (E.T.) Episodic Transformer for Vision-and-Language Navigation Alexander Pashevich, Cordelia Schmid, Chen Sun Episodic Transform

62 Dec 24, 2022

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped

CSWin-Transformer This repo is the official implementation of "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows". Th

409 Jan 6, 2023

3D-Transformer: Molecular Representation with Transformer in 3D Space

55 Dec 19, 2022

This repository builds a basic vision transformer from scratch so that one beginner can understand the theory of vision transformer.

vision-transformer-from-scratch This repository includes several kinds of vision transformers from scratch so that one beginner can understand the the

1 Dec 24, 2021

Transformer - Transformer in PyTorch

Transformer 完成进度 Embeddings and PositionalEncoding with example. MultiHeadAttent

1 Jan 6, 2022

Transformer Huffman coding - Complete Huffman coding through transformer

Transformer_Huffman_coding Complete Huffman coding through transformer 2022/2/19

3 May 19, 2022

The official repository for our paper "The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers". We significantly improve the systematic generalization of transformer models on a variety of datasets using simple tricks and careful considerations.

Codebase for training transformers on systematic generalization datasets. The official repository for our EMNLP 2021 paper The Devil is in the Detail:

57 Nov 21, 2022

Comments

Evaluating ponder-net on more pondering-steps than trained on.

As the paper says,

In evaluation, and under known temporal or computational limitations, N can be set naively as a constant (or not set any limit, i.e. N → ∞). For training, we found that a more effective (and interpretable) way of parameterizing N is by defining a minimum cumulative probability of halting. N is then the smallest value of n such that sum( p_sub_ j > 1 − ε)over(j=1, n) , with the hyper-parameter ε positive near 0 (in our experiments 0.05).

from that I infer that pondering can be done to more steps than trained on. How can be done so with this implementation?

edit: I was going through the paper again,and I think what the paper means is that the max_num_pondering_steps:N should be re evaluated at every training-step, the model should be run till the condition is met or a pre-defined num of max steps is reached, and where the cumsum_probs condition will be met will be set as 'N', with the cumsum_probs normalised with one of the methods. Then that value of 'N' will be used to calc prior geom for the kl_div (and not normalising the prior geom term).

i.e. if the num of pondering steps are initially set to 'M', then the model will recur for 'k' steps - i.e. till the condition is met or for 'M' num of max steps; then 'N' will be calculated by first calculating the probabilities - p_0 to p_k - then normalizing through one of the methods, then calculate cumulative-sum of those probabilities, and checking where the sum is greater than threshold, and assigning it the value 'N'. After that, calculating prior geometric values with the defined hyper-parameter, for 'N' seq-len, and using this in the kl-div term against the halting probs truncated to 'N' steps.

λp is a hyper-parameter that defines a geometric prior distribution pG(λp) on the halting policy (truncated at N)

opened by Vbansal21 0
Can pondernet used for imagenet?

I plan to do a project on the complexity of tasks on image dataset like imagenet, cifar 100. If I use a vision transformer, then can I implement my project?

opened by fryegg 2

Releases(0.0.8)

0.0.8(Oct 30, 2021)

Source code(tar.gz)
Source code(zip)
0.0.7(Aug 30, 2021)

Source code(tar.gz)
Source code(zip)
0.0.5(Aug 26, 2021)

Source code(tar.gz)
Source code(zip)
0.0.4(Aug 26, 2021)

Source code(tar.gz)
Source code(zip)
0.0.3(Aug 26, 2021)

Source code(tar.gz)
Source code(zip)
0.0.2(Aug 26, 2021)

Source code(tar.gz)
Source code(zip)
0.0.1(Aug 26, 2021)

Source code(tar.gz)
Source code(zip)

Owner

Phil Wang

Working with Attention. It's all we need

GitHub Repository

Implementation of a Transformer that Ponders, using the scheme from the PonderNet paper

Related tags

Overview

Ponder(ing) Transformer

Install

Usage

Citations

You might also like...

Implementation of the Transformer variant proposed in "Transformer Quality in Linear Time"

Third party Pytorch implement of Image Processing Transformer (Pre-Trained Image Processing Transformer arXiv:2012.00364v2)

Episodic Transformer (E.T.) is a novel attention-based architecture for vision-and-language navigation. E.T. is based on a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped

3D-Transformer: Molecular Representation with Transformer in 3D Space

This repository builds a basic vision transformer from scratch so that one beginner can understand the theory of vision transformer.

Transformer - Transformer in PyTorch

Transformer Huffman coding - Complete Huffman coding through transformer

The official repository for our paper "The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers". We significantly improve the systematic generalization of transformer models on a variety of datasets using simple tricks and careful considerations.

Comments

Evaluating ponder-net on more pondering-steps than trained on.

Can pondernet used for imagenet?

Releases(0.0.8)

0.0.8(Oct 30, 2021)

0.0.7(Aug 30, 2021)

0.0.5(Aug 26, 2021)

0.0.4(Aug 26, 2021)

0.0.3(Aug 26, 2021)

0.0.2(Aug 26, 2021)

0.0.1(Aug 26, 2021)

Owner

Phil Wang

Council-GAN - Implementation for our paper Breaking the Cycle - Colleagues are all you need (CVPR 2020)

Layer 7 DDoS Panel with Cloudflare Bypass ( UAM, CAPTCHA, BFM, etc.. )

PyTorch wrappers for using your model in audacity!

An Implicit Function Theorem (IFT) optimizer for bi-level optimizations

PyTorch implementations of the paper: "DR.VIC: Decomposition and Reasoning for Video Individual Counting, CVPR, 2022"

Machine Learning Platform for Kubernetes

Process JSON files for neural recording sessions using Medtronic's BrainSense Percept PC neurostimulator

Imbalanced Gradients: A Subtle Cause of Overestimated Adversarial Robustness

This is a computer vision based implementation of the popular childhood game 'Hand Cricket/Odd or Even' in python

A Comparative Review of Recent Kinect-Based Action Recognition Algorithms (TIP2020, Matlab codes)

This repo tries to recognize faces in the dataset you created

[ICCV 2021] FaPN: Feature-aligned Pyramid Network for Dense Image Prediction

MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts (ICLR 2022)

Graph InfoClust: Leveraging cluster-level node information for unsupervised graph representation learning

Final project code: Implementing BicycleGAN, for CIS680 FA21 at University of Pennsylvania

Faster Convex Lipschitz Regression

Accelerated NLP pipelines for fast inference on CPU and GPU. Built with Transformers, Optimum and ONNX Runtime.

PyTorch implementation of Lip to Speech Synthesis with Visual Context Attentional GAN (NeurIPS2021)

MADE (Masked Autoencoder Density Estimation) implementation in PyTorch

Code for our ALiBi method for transformer language models.