Source code for NAACL 2021 paper "TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference"

Related tags

Deep LearningTR-BERT
Overview

TR-BERT

Source code and dataset for "TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference".

model

The code is based on huggaface's transformers. Thanks to them! We will release all the source code in the future.

Requirement

Install dependencies and apex:

pip3 install -r requirement.txt
pip3 install --editable transformers

Pretrained models

Download the DistilBERT-3layer and BERT-1024 from Google Drive/Tsinghua Cloud.

Classfication

Download the IMDB, Yelp, 20News datasets from Google Drive/Tsinghua Cloud.

Download the Hyperpartisan dataset, and randomly split it into train/dev/test set: python3 split_hyperpartisan.py

Train BERT/DistilBERT Model

Use flag --do train:

python3 run_classification.py  --task_name imdb  --model_type bert  --model_name_or_path bert-base-uncased --data_dir imdb --max_seq_length 512  --per_gpu_train_batch_size 8  --per_gpu_eval_batch_size 16 --gradient_accumulation_steps 4 --learning_rate 3e-5 --save_steps 2000  --num_train_epochs 5  --output_dir imdb_models/bert_base  --do_lower_case  --do_eval  --evaluate_during_training  --do_train

where task_name can be set as imdb/yelp_f/20news/hyperpartisan for different tasks and model type can be set as bert/distilbert for different models.

Compute Graident for Residual Strategy

Use flag --do_eval_grad.

python3 run_classification.py  --task_name imdb  --model_type bert  --model_name_or_path imdb_models/bert_base --data_dir imdb --max_seq_length 512  --per_gpu_train_batch_size 8  --per_gpu_eval_batch_size 8  --output_dir imdb_models/bert_base  --do_lower_case  --do_eval_grad

This step doesn't supoort data DataParallel or DistributedDataParallel currently and should be done in a single GPU.

Train the policy network solely

Start from the checkpoint from the task-specific fine-tuned model. Change model_type from bert to autobert, and run with flag --do_train --train_rl:

python3 run_classification.py  --task_name imdb  --model_type autobert  --model_name_or_path imdb_models/bert_base --data_dir imdb --max_seq_length 512  --per_gpu_train_batch_size 8  --per_gpu_eval_batch_size 8 --gradient_accumulation_steps 4 --learning_rate 3e-5 --save_steps 2000  --num_train_epochs 3  --output_dir imdb_models/auto_1  --do_lower_case  --do_train --train_rl --alpha 1 --guide_rate 0.5

where alpha is the harmonic coefficient for the length punishment and guide_rate is the proportion of imitation learning steps. model_type can be set as autobert/distilautobert for applying token reduction to BERT/DistilBERT.

Compute Logits for Knowledge Distilation

Use flag --do_eval_logits.

python3 run_classification.py  --task_name imdb  --model_type bert  --model_name_or_path imdb_models/bert_base --data_dir imdb --max_seq_length 512  --per_gpu_train_batch_size 8  --per_gpu_eval_batch_size 8  --output_dir imdb_models/bert_base  --do_lower_case  --do_eval_logits

This step doesn't supoort data DataParallel or DistributedDataParallel currently and should be done in a single GPU.

Train the whole network with both the task-specifc objective and RL objective

Start from the checkpoint from --train_rl model and run with flag --do_train --train_both --train_teacher:

python3 run_classification.py  --task_name imdb  --model_type autobert  --model_name_or_path imdb_models/auto_1 --data_dir imdb --max_seq_length 512  --per_gpu_train_batch_size 8  --per_gpu_eval_batch_size 1 --gradient_accumulation_steps 4 --learning_rate 3e-5 --save_steps 2000  --num_train_epochs 3  --output_dir imdb_models/auto_1_both  --do_lower_case  --do_train --train_both --train_teacher --alpha 1

Evaluate

Use flag --do_eval:

python3 run_classification.py  --task_name imdb  --model_type autobert  --model_name_or_path imdb_models/auto_1_both  --data_dir imdb --max_seq_length 512  --per_gpu_train_batch_size 8  --per_gpu_eval_batch_size 1  --output_dir imdb_models/auto_1_both  --do_lower_case  --do_eval --eval_all_checkpoints

When the batch size is more than 1 in evaluating, we will remain the same number of tokens for each instance in the same batch.

Initialize

For IMDB dataset, we find that when we directly initialize the selector with heuristic objective before train the policy network solely, we can get a bit better performance. For other datasets, this step makes little change. Run this step with flag --do_train --train_init:

python3 trans_imdb_rank.py
python3 run_classification.py  --task_name imdb  --model_type initbert  --model_name_or_path imdb_models/bert_base --data_dir imdb --max_seq_length 512  --per_gpu_train_batch_size 8  --per_gpu_eval_batch_size 8 --gradient_accumulation_steps 4 --learning_rate 3e-5 --save_steps 2000  --num_train_epochs 3  --output_dir imdb_models/bert_init  --do_lower_case  --do_train --train_init 

Question Answering

Download the SQuAD 2.0 dataset.

Download the MRQA dataset with our split] from Google Drive/Tsinghua Cloud.

Download the HotpotQA dataset from the Transformer-XH repository, where paragraphs are retrieved for each question according to TF-IDF, entity linking and hyperlink and re-ranked by BERT re-ranker.

Download the TriviaQA dataset, where paragraphs are re-rank by the linear passage re-ranker in DocQA.

Download the WikiHop dataset.

The whole training progress of question answer models is similiar to text classfication models, with flags --do_train, --do_train --train_rl, --do_train --train_both --train_teacher in turn. The codes of each dataset:

SQuAD: run_squad.py with flag version_2_with_negative

NewsQA / NaturalQA: run_mrqa.py

RACE: run_race_classify.py

HotpotQA: run_hotpotqa.py

TriviaQA: run_triviaqa.py

WikiHop: run_wikihop.py

Harmonic Coefficient Lambda

The example harmonic coefficients are shown as follows:

Dataset train_rl train_both
SQuAD 2.0 5 5
NewsQA 3 5
NaturalQA 2 2
RACE 0.5 0.1
YELP.F 2 0.5
20News 1 1
IMDB 1 1
HotpotQA 0.1 4
TriviaQA 0.5 1
Hyperparisan 0.01 0.01

Cite

If you use the code, please cite this paper:

@inproceedings{ye2021trbert,
  title={TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference},
  author={Deming Ye, Yankai Lin, Yufei Huang, Maosong Sun},
  booktitle={Proceedings of NAACL 2021},
  year={2021}
}
Owner
THUNLP
Natural Language Processing Lab at Tsinghua University
THUNLP
Fang Zhonghao 13 Nov 19, 2022
This is the repository for the NeurIPS-21 paper [Contrastive Graph Poisson Networks: Semi-Supervised Learning with Extremely Limited Labels].

CGPN This is the repository for the NeurIPS-21 paper [Contrastive Graph Poisson Networks: Semi-Supervised Learning with Extremely Limited Labels]. Req

10 Sep 12, 2022
GPU Accelerated Non-rigid ICP for surface registration

GPU Accelerated Non-rigid ICP for surface registration Introduction Preivous Non-rigid ICP algorithm is usually implemented on CPU, and needs to solve

Haozhe Wu 144 Jan 04, 2023
Intrusion Detection System using ensemble learning (machine learning)

IDS-ML implementation of an intrusion detection system using ensemble machine learning methods Data set This project is carried out using the UNSW-15

4 Nov 25, 2022
Code implementation for the paper 'Conditional Gaussian PAC-Bayes'.

CondGauss This repository contains PyTorch code for the paper Stochastic Gaussian PAC-Bayes. A novel PAC-Bayesian training method is implemented. Ther

0 Nov 01, 2021
Data Preparation, Processing, and Visualization for MoVi Data

MoVi-Toolbox Data Preparation, Processing, and Visualization for MoVi Data, https://www.biomotionlab.ca/movi/ MoVi is a large multipurpose dataset of

Saeed Ghorbani 51 Nov 27, 2022
A wrapper around SageMaker ML Lineage Tracking extending ML Lineage to end-to-end ML lifecycles, including additional capabilities around Feature Store groups, queries, and other relevant artifacts.

ML Lineage Helper This library is a wrapper around the SageMaker SDK to support ease of lineage tracking across the ML lifecycle. Lineage artifacts in

AWS Samples 12 Nov 01, 2022
This repository stores the code to reproduce the results published in "TiWS-iForest: Isolation Forest in Weakly Supervised and Tiny ML scenarios"

TinyWeaklyIsolationForest This repository stores the code to reproduce the results published in "TiWS-iForest: Isolation Forest in Weakly Supervised a

2 Mar 21, 2022
Code for our paper "Sematic Representation for Dialogue Modeling" in ACL2021

AMR-Dialogue An implementation for paper "Semantic Representation for Dialogue Modeling". You may find our paper here. Requirements python 3.6 pytorch

xfbai 45 Dec 26, 2022
Source code of D-HAN: Dynamic News Recommendation with Hierarchical Attention Network

D-HAN The source code of D-HAN This is the source code of D-HAN: Dynamic News Recommendation with Hierarchical Attention Network. However, only the co

30 Sep 22, 2022
FCN (Fully Convolutional Network) is deep fully convolutional neural network architecture for semantic pixel-wise segmentation

FCN_via_Keras FCN FCN (Fully Convolutional Network) is deep fully convolutional neural network architecture for semantic pixel-wise segmentation. This

Kento Watanabe 48 Aug 30, 2022
Code for Multinomial Diffusion

Code for Multinomial Diffusion Abstract Generative flows and diffusion models have been predominantly trained on ordinal data, for example natural ima

104 Jan 04, 2023
Code examples and benchmarks from the paper "Understanding Entropy Coding With Asymmetric Numeral Systems (ANS): a Statistician's Perspective"

Code For the Paper "Understanding Entropy Coding With Asymmetric Numeral Systems (ANS): a Statistician's Perspective" Author: Robert Bamler Date: 22 D

4 Nov 02, 2022
RLMeta is a light-weight flexible framework for Distributed Reinforcement Learning Research.

RLMeta rlmeta - a flexible lightweight research framework for Distributed Reinforcement Learning based on PyTorch and moolib Installation To build fro

Meta Research 281 Dec 22, 2022
The personal repository of the work: *DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer*.

DanceNet3D The personal repository of the work: DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer. Dataset and Results Pleas

南嘉Nanga 36 Dec 21, 2022
Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper)

An Image is Worth 16x16 Words, What is a Video Worth? paper Official PyTorch Implementation Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor DAMO Academy, Al

213 Nov 12, 2022
Code release of paper "Deep Multi-View Stereo gone wild"

Deep MVS gone wild Pytorch implementation of "Deep MVS gone wild" (Paper | website) This repository provides the code to reproduce the experiments of

François Darmon 53 Dec 24, 2022
Jupyter notebooks showing best practices for using cx_Oracle, the Python DB API for Oracle Database

Python cx_Oracle Notebooks, 2022 The repository contains Jupyter notebooks showing best practices for using cx_Oracle, the Python DB API for Oracle Da

Christopher Jones 13 Dec 15, 2022
Official PyTorch implementation of the paper: Improving Graph Neural Network Expressivity via Subgraph Isomorphism Counting.

Improving Graph Neural Network Expressivity via Subgraph Isomorphism Counting Official PyTorch implementation of the paper: Improving Graph Neural Net

Giorgos Bouritsas 58 Dec 31, 2022
The Codebase for Causal Distillation for Language Models.

Causal Distillation for Language Models Zhengxuan Wu*,Atticus Geiger*, Josh Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, Noah D.

Zen 20 Dec 31, 2022