Implementation of ICLR 2020 paper "Revisiting Self-Training for Neural Sequence Generation"

Last update: Dec 31, 2022

Overview

Self-Training for Neural Sequence Generation

This repo includes instructions for running noisy self-training algorithms from the following paper:

Revisiting Self-Training for Neural Sequence Generation
Junxian He*, Jiatao Gu*, Jiajun Shen, Marc'Aurelio Ranzato
ICLR 2020

Requirement

fairseq (please see the fairseq repo for other requirements on Python and PyTorch versions)

fairseq can be installed with:

pip install fairseq

Data

Download and preprocess the WMT'14 En-De dataset:

# Download and prepare the data
wget https://raw.githubusercontent.com/pytorch/fairseq/master/examples/translation/prepare-wmt14en2de.sh
bash prepare-wmt14en2de.sh --icml17

TEXT=wmt14_en_de
fairseq-preprocess --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir wmt14_en_de_bin --thresholdtgt 0 --thresholdsrc 0 \
    --joined-dictionary --workers 16

Then we mimic a semi-supervised setting: 100K training samples are randomly selected as the parallel corpus, and the English side of the remaining training samples is treated as an unannotated monolingual corpus:

bash extract_wmt100k.sh
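
For intuition, the split described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the contents of extract_wmt100k.sh; the function name `split_parallel` and the `seed` parameter are assumptions made for this example.

```python
import random


def split_parallel(src_lines, tgt_lines, n_parallel, seed=0):
    """Randomly keep n_parallel sentence pairs as the parallel corpus.

    The source side of every remaining pair is returned as the
    unannotated monolingual corpus (its target side is discarded).
    Hypothetical helper, not part of this repo.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    if n_parallel > len(src_lines):
        raise ValueError("n_parallel exceeds the corpus size")

    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(src_lines)), n_parallel))

    par_src, par_tgt, mono_src = [], [], []
    for i, (src, tgt) in enumerate(zip(src_lines, tgt_lines)):
        if i in chosen:
            par_src.append(src)
            par_tgt.append(tgt)
        else:
            mono_src.append(src)
    return par_src, par_tgt, mono_src
```

With the real data, `n_parallel` would be 100000 and the inputs would be the lines of the English and German training files.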

Preprocess WMT100K:

bash preprocess.sh 100ken 100kde

Add noise to the monolingual corpus for later use:

TEXT=wmt14_en_de
python paraphrase/paraphrase.py \
  --paraphraze-fn noise_bpe \
  --word-dropout 0.2 \
  --word-blank 0.2 \
  --word-shuffle 3 \
  --data-file ${TEXT}/train.mono_en \
  --output ${TEXT}/train.mono_en_noise \
  --bpe-type subword
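
To give a rough idea of what the three noise options do, here is a minimal sketch of token-level noise, assuming `--word-dropout` and `--word-blank` are per-token probabilities and `--word-shuffle` bounds how far a token may move. It is not the code in paraphrase/paraphrase.py: the function name `add_noise` and the `<blank>` placeholder are assumptions, and the sketch treats every whitespace-separated token independently rather than respecting BPE word boundaries.

```python
import random


def add_noise(tokens, dropout=0.2, blank=0.2, shuffle_k=3,
              blank_token="<blank>", seed=None):
    """Apply local shuffling, token dropout, and token blanking.

    Hypothetical illustration only; blank_token is an assumed placeholder.
    """
    rng = random.Random(seed)

    # 1) Local shuffle: sort tokens by (index + uniform offset in [0, shuffle_k]),
    #    so no token ends up more than shuffle_k positions from where it started.
    keys = [i + rng.uniform(0, shuffle_k) for i in range(len(tokens))]
    shuffled = [tok for _, tok in sorted(zip(keys, tokens), key=lambda p: p[0])]

    # 2) Dropout: remove each token with probability `dropout`,
    #    but never return an empty sentence for a non-empty input.
    kept = [tok for tok in shuffled if rng.random() >= dropout]
    if not kept and shuffled:
        kept = [rng.choice(shuffled)]

    # 3) Blanking: replace each surviving token with the placeholder
    #    with probability `blank`.
    return [blank_token if rng.random() < blank else tok for tok in kept]
```

For example, `add_noise("we revisit self-training".split(), seed=0)` returns a perturbed copy of the token list; the defaults mirror the values passed on the command line above.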

Train the base supervised model

Train the translation model for 30K updates:

bash supervised_train.sh 100ken 100kde 30000

Self-training as pseudo-training + fine-tuning

Translate the monolingual data into train.[suffix] to form a pseudo-parallel dataset:

bash translate.sh [model_path] [suffix]

Suppose the pseudo target language suffix is mono_de_iter1 (the default); then preprocess:

bash preprocess.sh mono_en_noise mono_de_iter1

Pseudo-training + fine-tuning:

bash self_train.sh mono_en_noise mono_de_iter1

The above command trains the model on the pseudo-parallel corpus formed by train.mono_en_noise and train.mono_de_iter1 and then fine-tunes it on the real parallel data.

This self-training process can be repeated for multiple iterations to improve performance.
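
The overall loop can be summarized schematically as follows. This is a sketch of the control flow only, under the assumption that each iteration re-translates the clean monolingual data with the latest model and reuses the same noisy source side. The four callables (`translate`, `add_noise`, `pseudo_train`, `fine_tune`) are hypothetical stand-ins for the shell scripts above, not functions provided by this repo.

```python
def self_train(base_model, mono_src, translate, add_noise,
               pseudo_train, fine_tune, n_iters=3):
    """Schematic noisy self-training loop (hypothetical helper).

    translate(model, sentence) -> pseudo target sentence
    add_noise(sentence)        -> perturbed source sentence
    pseudo_train(src, tgt)     -> model trained on the pseudo-parallel data
    fine_tune(model)           -> model fine-tuned on the real parallel data
    """
    # Noise is added to the monolingual source once and reused afterwards.
    noisy_src = [add_noise(s) for s in mono_src]

    model = base_model
    for _ in range(n_iters):
        # Label the clean monolingual data with the current model.
        pseudo_tgt = [translate(model, s) for s in mono_src]
        # Pseudo-training on (noisy source, pseudo target) pairs.
        model = pseudo_train(noisy_src, pseudo_tgt)
        # Fine-tuning on the real parallel corpus.
        model = fine_tune(model)
    return model
```

In terms of the commands above, `translate` corresponds to translate.sh, while `pseudo_train` followed by `fine_tune` corresponds to self_train.sh.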

Reference

@inproceedings{He2020Revisiting,
title={Revisiting Self-Training for Neural Sequence Generation},
author={Junxian He and Jiatao Gu and Jiajun Shen and Marc'Aurelio Ranzato},
booktitle={Proceedings of ICLR},
year={2020},
url={https://openreview.net/forum?id=SJgdnAVKDH}
}
