When BERT Plays the Lottery, All Tickets Are Winning

Last update: Nov 10, 2022


Large Transformer-based models were shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. Strikingly, with structured pruning even the worst possible subnetworks remain highly trainable, indicating that most pre-trained BERT weights are potentially useful. We also study the "good" subnetworks to see if their success can be attributed to superior linguistic knowledge, but find them unstable, and not explained by meaningful self-attention patterns.
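The abstract contrasts magnitude pruning with structured pruning. As a rough illustration of the magnitude variant only, the sketch below builds a 0/1 mask that drops the smallest-magnitude entries of a single weight matrix. It is not the repository's implementation: the function name `magnitude_mask`, the `sparsity` parameter, and the per-matrix (rather than global) thresholding are assumptions made for this example.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of smallest-|w| entries.

    Hypothetical helper for illustration; pruning is done per matrix here,
    which is an assumption and may differ from the repository's scripts.
    """
    flat = np.abs(weights).ravel()
    n_prune = int(round(sparsity * flat.size))
    mask = np.ones(flat.size, dtype=weights.dtype)
    if n_prune > 0:
        # argsort is ascending, so the first n_prune indices have the smallest magnitudes
        prune_idx = np.argsort(flat, kind="stable")[:n_prune]
        mask[prune_idx] = 0
    return mask.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 6))          # stand-in for one weight matrix
mask = magnitude_mask(w, sparsity=0.5)
pruned = w * mask                    # masked weights: pruned entries become 0
print(int(mask.sum()), "of", mask.size, "weights kept")
```

Structured pruning, by contrast, removes whole components such as attention heads, so its mask would have one entry per component rather than one per weight.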

Environment

Install the requirements in a Python 3.7.7 virtual environment.

pip install -r requirements.txt

These experiments were run in a multi-GPU environment, where some experiments and benchmarks ran in parallel. You may need to change the bash scripts to make them work in your environment.

Dataset

Download the GLUE dataset using data/download_glue.py and data/download_mnli_data.py. For MRPC, follow the instructions in the data/download_glue.py docstring.
All data for the tasks should be organized in a data/glue/task_name/ structure.
Extract the labelled data for attention pattern classification.
```
cd data
tar -xvf head_classification_data.tar.gz
```
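Before training, it can help to confirm that the data/glue/task_name/ layout described above is in place. The following is a minimal sketch, assuming the task folders carry the usual GLUE names listed in `ASSUMED_TASKS`; the list and the helper `missing_task_dirs` are hypothetical and should be adjusted to whatever the download scripts actually create.

```python
from pathlib import Path

# Assumed folder names (the usual GLUE task names); adjust to match
# what the download scripts actually produce.
ASSUMED_TASKS = ["CoLA", "SST-2", "MRPC", "STS-B", "QQP", "MNLI", "QNLI", "RTE", "WNLI"]

def missing_task_dirs(glue_root, tasks=ASSUMED_TASKS):
    """Return the task names that have no directory under `glue_root`."""
    root = Path(glue_root)
    return [task for task in tasks if not (root / task).is_dir()]

# Run from the repository root so that data/glue resolves correctly.
missing = missing_task_dirs("data/glue")
if missing:
    print("Missing task directories:", ", ".join(missing))
else:
    print("All expected task directories are present.")
```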

Training, Masking, and Evaluation

Switch the working directory to src (cd src), as many paths are relative to that directory.

Fine-tune BERT on the GLUE tasks

./train.sh

Obtain the masks

./find_masks.sh

Train models with the masks applied in good, random and bad settings.

./train_with_masks.sh
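To make the three settings concrete, here is a hedged sketch of how equally sized "good", "random", and "bad" masks could be derived from one pruning mask. It is an illustration, not the logic in train_with_masks.sh. The function `make_masks` is hypothetical, and the rule for topping up the "bad" mask with surviving components when too few were pruned is an assumption; the overview only states that such subnetworks are sampled from the rest of the model.

```python
import numpy as np

def make_masks(good_mask, seed=0):
    """Build same-sized good / random / bad boolean masks (illustrative only).

    good_mask: array where nonzero marks components that survived pruning.
    """
    rng = np.random.default_rng(seed)
    shape = np.shape(good_mask)
    good = np.asarray(good_mask).astype(bool).ravel()
    n, k = good.size, int(good.sum())

    # random: k components sampled uniformly from the whole model
    random_mask = np.zeros(n, dtype=bool)
    random_mask[rng.choice(n, size=k, replace=False)] = True

    # bad: drawn from the components that did NOT survive pruning
    pruned_idx = np.flatnonzero(~good)
    bad_mask = np.zeros(n, dtype=bool)
    if pruned_idx.size >= k:
        bad_mask[rng.choice(pruned_idx, size=k, replace=False)] = True
    else:
        # Assumption: if too few components were pruned, take all of them and
        # top up with randomly chosen survivors to match the good mask's size.
        bad_mask[pruned_idx] = True
        survivors = np.flatnonzero(good)
        bad_mask[rng.choice(survivors, size=k - pruned_idx.size, replace=False)] = True

    return good.reshape(shape), random_mask.reshape(shape), bad_mask.reshape(shape)

# Toy example: 12 layers x 12 heads, with the first 100 heads surviving.
toy_good = (np.arange(144) < 100).reshape(12, 12)
good, rand, bad = make_masks(toy_good, seed=0)
print(good.sum(), rand.sum(), bad.sum())
```

Keeping all three masks the same size means that differences in downstream performance reflect which components were kept rather than how many.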

Evaluate the trained models

./evaluate.sh

Note: These experiments were run over a period of time and have since been stitched together into single scripts, so it may be better to run the training and evaluation commands in them one by one.

Train the CNN classifier on raw and normed attention patterns.

python classify_attention_patterns.py
python classify_normed_patterns.py

These only train the classifier.

Evaluation Analysis and Final Results

These analyses are done primarily in Jupyter notebooks in the experiment_analysis directory. Many of the notebooks there are experimental; the important ones, used to generate the results included in the paper, are listed below.

Importance pruning Heatmaps. Ignore the final "train_subset" and "hans" settings.
Magnitude pruning Heatmap
Overlap of surviving components
Generate the random baseline
Attention Classification Patterns
Evaluation Result Comparisons and table
Statistics on mask correlation across seeds
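As a rough idea of what comparing surviving components across seeds involves, the sketch below computes pairwise Jaccard overlap between binary masks. The choice of Jaccard overlap and the helper names `jaccard` and `pairwise_overlap` are assumptions for illustration; the notebooks may use a different statistic.

```python
import itertools
import numpy as np

def jaccard(a, b):
    """Jaccard overlap |a AND b| / |a OR b| between two binary masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty masks are treated as identical
    return float(np.logical_and(a, b).sum() / union)

def pairwise_overlap(masks):
    """Map each pair of mask indices (i, j), i < j, to its Jaccard overlap."""
    return {
        (i, j): jaccard(masks[i], masks[j])
        for i, j in itertools.combinations(range(len(masks)), 2)
    }

# Toy masks standing in for the surviving heads of three seeds.
seed_masks = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]
overlaps = pairwise_overlap(seed_masks)
for pair, value in overlaps.items():
    print(pair, round(value, 3))
```

Low overlap between seeds would be in line with the overview's observation that the "good" subnetworks are unstable.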

Owner

Sai
