Learning to Rewrite for Non-Autoregressive Neural Machine Translation

Overview

RewriteNAT

This repo provides the code for reproducing our proposed RewriteNAT in EMNLP 2021 paper entitled "Learning to Rewrite for Non-Autoregressive Neural Machine Translation". RewriteNAT is a iterative NAT model which utilizes a locator component to explicitly learn to rewrite the erroneous translation pieces during iterative decoding.

Dependencies

Preprocessing

All the datasets are tokenized using the scripts from Moses except for Chinese with Jieba tokenizer, and splitted into subword units using BPE. The tokenized datasets are binaried using the script binaried.sh as follows:

python preprocess.py \
    --source-lang ${src} --target-lang ${tgt} \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/${dataset} --thresholdtgt 0 --thresholdsrc 0 \ 
    --workers 64 --joined-dictionary

Train

All the models are run on 8 Tesla V100 GPUs for 300,000 updates with an effective batch size of 128,000 tokens apart from En→Fr where we make 500,000 updates to account for the data size. The training scripts train.rewrite.nat.sh is configured as follows:

python train.py \
    data-bin/${dataset} \
    --source-lang ${src} --target-lang ${tgt} \
    --save-dir ${save_dir} \
    --ddp-backend=no_c10d \
    --task translation_lev \
    --criterion rewrite_nat_loss \
    --arch rewrite_nonautoregressive_transformer \
    --noise full_mask \
    ${share_all_embeddings} \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --length-loss-factor 0.1 \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \ 
    --max-tokens 4000 \
    --save-interval-updates 10000 \
    --max-update ${step} \
    --update-freq 4 \ 
    --fp16 \
    --save-interval ${save_interval} \
    --discriminator-layers 6 \ 
    --train-max-iter ${max_iter} \
    --roll-in-g sample \
    --roll-in-d oracle \
    --imitation-g \
    --imitation-d \
    --discriminator-loss-factor ${discriminator_weight} \
    --no-share-discriminator \
    --generator-scale ${generator_scale} \
    --discriminator-scale ${discriminator_scale} \

Evaluation

We evaluate performance with BLEU for all language pairs, except for En→>Zh, where we use SacreBLEU. The testing scripts test.rewrite.nat.sh is utilized to generate the translations, as follows:

python generate.py \                                            
    data-bin/${dataset} \                                          
    --source-lang ${src} --target-lang ${tgt} \                    
    --gen-subset ${subset} \                                       
    --task translation_lev \                                       
    --path ${save_dir}/${dataset}/checkpoint_average_${suffix}.pt \
    --iter-decode-max-iter ${max_iter} \                           
    --iter-decode-with-beam ${beam} \                              
    --iter-decode-p ${iter_p} \                                    
    --beam 1 --remove-bpe \                                        
    --batch-size 50\                                               
    --print-step \                                                 
    --quiet 

Citation

Please cite as:

@inproceedings{geng-etal-2021-learning,
    title = "Learning to Rewrite for Non-Autoregressive Neural Machine Translation",
    author = "Geng, Xinwei and Feng, Xiaocheng and Qin, Bing",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.265",
    pages = "3297--3308",
}
Owner
Xinwei Geng
Ph.D. student working on improving Neural Machine Translation with Reinforcement Learning @HIT-SCIR
Xinwei Geng
Pytorch-version BERT-flow: One can apply BERT-flow to any PLM within Pytorch framework.

Pytorch-version BERT-flow: One can apply BERT-flow to any PLM within Pytorch framework.

Ubiquitous Knowledge Processing Lab 59 Dec 01, 2022
Code for the paper TestRank: Bringing Order into Unlabeled Test Instances for Deep Learning Tasks

TestRank in Pytorch Code for the paper TestRank: Bringing Order into Unlabeled Test Instances for Deep Learning Tasks by Yu Li, Min Li, Qiuxia Lai, Ya

3 May 19, 2022
Textpipe: clean and extract metadata from text

textpipe: clean and extract metadata from text textpipe is a Python package for converting raw text in to clean, readable text and extracting metadata

Textpipe 298 Nov 21, 2022
texlive expressions for documents

tex2nix Generate Texlive environment containing all dependencies for your document rather than downloading gigabytes of texlive packages. Installation

Jörg Thalheim 70 Dec 26, 2022
An example project using OpenPrompt under pytorch-lightning for prompt-based SST2 sentiment analysis model

pl_prompt_sst An example project using OpenPrompt under the framework of pytorch-lightning for a training prompt-based text classification model on SS

Zhiling Zhang 5 Oct 21, 2022
A library for finding knowledge neurons in pretrained transformer models.

knowledge-neurons An open source repository replicating the 2021 paper Knowledge Neurons in Pretrained Transformers by Dai et al., and extending the t

EleutherAI 96 Dec 21, 2022
An easier way to build neural search on the cloud

An easier way to build neural search on the cloud Jina is a deep learning-powered search framework for building cross-/multi-modal search systems (e.g

Jina AI 17.1k Jan 09, 2023
Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, Explosion AI 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 French 1.2.3 German 1.2

Explosion 70 Dec 12, 2022
A library for end-to-end learning of embedding index and retrieval model

Poeem Poeem is a library for efficient approximate nearest neighbor (ANN) search, which has been widely adopted in industrial recommendation, advertis

54 Dec 21, 2022
An A-SOUL Text Generator Based on CPM-Distill.

ASOUL-Generator-Backend 本项目为 https://asoul.infedg.xyz/ 的后端。 模型为基于 CPM-Distill 的 transformers 转化版本 CPM-Generate-distill 训练而成。

infinityedge 46 Dec 11, 2022
Code for ACL 2020 paper "Rigid Formats Controlled Text Generation"

SongNet SongNet: SongCi + Song (Lyrics) + Sonnet + etc. @inproceedings{li-etal-2020-rigid, title = "Rigid Formats Controlled Text Generation",

Piji Li 212 Dec 17, 2022
A Python/Pytorch app for easily synthesising human voices

Voice Cloning App A Python/Pytorch app for easily synthesising human voices Documentation Discord Server Video guide Voice Sharing Hub FAQ's System Re

Ben Andrew 840 Jan 04, 2023
An official repository for tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a University of Edinburgh master's course.

PMR computer tutorials on HMMs (2021-2022) This is a repository for computer tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a Univer

Vaidotas Šimkus 10 Dec 06, 2022
gaiic2021-track3-小布助手对话短文本语义匹配复赛rank3、决赛rank4

决赛答辩已经过去一段时间了,我们队伍ac milan最终获得了复赛第3,决赛第4的成绩。在此首先感谢一些队友的carry~ 经过2个多月的比赛,学习收获了很多,也认识了很多大佬,在这里记录一下自己的参赛体验和学习收获。

102 Dec 19, 2022
Submit issues and feature requests for our API here.

AIx GPT API Submit issues and feature requests for our API here. See https://apps.aixsolutionsgroup.com for more info. Python Quick Start pip install

AIx Solutions 7 Mar 27, 2022
**NSFW** A chatbot based on GPT2-chitchat

DangBot -- 好怪哦,再来一句 卡群怪话bot,powered by GPT2 for Chinese chitchat Training Example: python train.py --lr 5e-2 --epochs 30 --max_len 300 --batch_size 8

Tommy Yang 11 Jul 21, 2022
Recognition of 38 speech commands in russian. Based on Yandex Cup 2021 ML Challenge: ASR

Speech_38_ru_commands Recognition of 38 speech commands in russian. Based on Yandex Cup 2021 ML Challenge: ASR Программа умеет распознавать 38 ключевы

Andrey 9 May 05, 2022
Yet another Python binding for fastText

pyfasttext Warning! pyfasttext is no longer maintained: use the official Python binding from the fastText repository: https://github.com/facebookresea

Vincent Rasneur 230 Nov 16, 2022
Precision Medicine Knowledge Graph (PrimeKG)

PrimeKG Website | bioRxiv Paper | Harvard Dataverse Precision Medicine Knowledge Graph (PrimeKG) presents a holistic view of diseases. PrimeKG integra

Machine Learning for Medicine and Science @ Harvard 103 Dec 10, 2022