Code for paper "Vocabulary Learning via Optimal Transport for Neural Machine Translation"

Last update: Jan 09, 2023

Overview

**The codebase and data are still being uploaded.**

VOLT(-py) is a vocabulary learning codebase that allows researchers and developers to automatically generate a vocabulary with suitable granularity for machine translation.

What's New:

July 2021: Support En-De translation, TED bilingual translation, and multilingual translation.
July 2021: Support subword-nmt tokenization.
July 2021: Support sentencepiece tokenization.

What's On-going:

Add translation training/evaluation code.
Support classification tasks.
Support pip usage.

Features:

Efficient: CPU learning on one machine.
Simple: The core code is no more than 200 lines.
Easy-to-use: Supports the widely used tokenization toolkits subword-nmt and sentencepiece.
Flexible: Users can customize their own tokenization rules.

Requirements and Installation

The required environments:

python 3.0
tqdm
mosesdecoder
subword-nmt

To use VOLT and develop locally:

git clone https://github.com/Jingjing-NLP/VOLT/
cd VOLT
git clone https://github.com/moses-smt/mosesdecoder
git clone https://github.com/rsennrich/subword-nmt
pip3 install sentencepiece
pip3 install tqdm

Usage

The first step is to get vocabulary candidates and tokenized texts. The sub-word vocabulary can be generated with either subword-nmt or sentencepiece. Here are two examples:


# Assume source_data is the file storing data in the source language
# Assume target_data is the file storing data in the target language
BPEROOT=subword-nmt
size=30000 # the size of BPE
cat source_data > training_data
cat target_data >> training_data

# subword-nmt style:
mkdir bpeoutput
BPE_CODE=code # the path to save the vocabulary candidates (BPE codes)
python3 $BPEROOT/learn_bpe.py -s $size < training_data > $BPE_CODE
python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < source_data > bpeoutput/source.file
python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < target_data > bpeoutput/target.file

# sentencepiece style:
mkdir spmoutput
python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$size --character_coverage=1.0 --model_type=bpe
# After this step, you will see spm.vocab and spm.model
python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmoutput/source.file --output_format piece
python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmoutput/target.file --output_format piece
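
To see what the token candidates from the subword-nmt route look like, here is a minimal Python sketch that reads a BPE codes file and joins each merge pair into one candidate token. It assumes the usual subword-nmt codes layout (a `#version` header followed by one space-separated merge pair per line). The helper name `read_bpe_candidates` and the toy codes are made up for illustration and are not part of VOLT.

```python
import os
import tempfile


def read_bpe_candidates(path):
    """Return one merged token per merge operation, in file order.

    Hypothetical helper for illustration; VOLT's own loader may differ.
    """
    candidates = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines and the "#version: ..." header.
            if not line or line.startswith("#version"):
                continue
            parts = line.split(" ")
            if len(parts) != 2:
                continue  # ignore anything that is not a merge pair
            candidates.append(parts[0] + parts[1])
    return candidates


# Toy codes file standing in for the file saved at $BPE_CODE.
demo_codes = "#version: 0.2\nt h\nth e</w>\ni n\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".codes", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(demo_codes)

candidates = read_bpe_candidates(tmp.name)
os.remove(tmp.name)
print(candidates)
```

The `</w>` end-of-word marker is kept as part of the token, so `the</w>` and a word-internal `the` would count as different candidates.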

The second step is to run the VOLT script, ot_run.py. It accepts the following parameters:
- --source_file: the file storing tokenized data in the source language.
- --target_file: the file storing tokenized data in the target language.
- --token_candidate_file: the file storing token candidates.
- --vocab_file: the file to store the vocabulary generated by VOLT.
- --max_number: the maximum size of the vocabulary generated by VOLT.
- --interval: the search granularity in VOLT.
- --loop_in_ot: the maximum number of iterations in the Sinkhorn solver.
- --tokenizer: the toolkit you used to get the vocabulary. Only subword-nmt and sentencepiece are supported.
- --size_file: the file to store the vocabulary size generated by VOLT.
- --threshold: the threshold that decides which tokens from the optimal matrix are added to the final vocabulary. A lower threshold means that fewer token candidates are dropped.
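
One way to read --max_number and --interval together is as a grid of candidate vocabulary sizes. The Python sketch below assumes the grid starts at one interval and includes --max_number itself. Both choices, and the function name `candidate_sizes`, are assumptions for illustration rather than a description of what ot_run.py does internally.

```python
def candidate_sizes(max_number, interval):
    """List the vocabulary sizes a search with this granularity would visit.

    Assumption: sizes run from `interval` up to and including `max_number`.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return list(range(interval, max_number + 1, interval))


# Values taken from the example commands: --max_number 10000 --interval 1000
sizes = candidate_sizes(10000, 1000)
print(len(sizes), sizes[0], sizes[-1])
```

Under this reading, a smaller interval gives a finer search at the cost of more candidate sizes to evaluate.
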
```
# subword-nmt style
python3 ../ot_run.py --source_file bpeoutput/source.file --target_file bpeoutput/target.file \
          --token_candidate_file $BPE_CODE \
          --vocab_file bpeoutput/vocab --max_number 10000 --interval 1000 --loop_in_ot 500 --tokenizer subword-nmt --size_file bpeoutput/size
# sentencepiece style (the token candidates are in spm.vocab, written by spm_train.py)
python3 ../ot_run.py --source_file spmoutput/source.file --target_file spmoutput/target.file \
          --token_candidate_file spm.vocab \
          --vocab_file spmoutput/vocab --max_number 10000 --interval 1000 --loop_in_ot 500 --tokenizer sentencepiece --size_file spmoutput/size
```
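
For readers unfamiliar with the Sinkhorn solver that --loop_in_ot bounds, the following is a generic, pure-Python sketch of Sinkhorn iterations for entropy-regularized optimal transport. It is not VOLT's implementation: the function name, `epsilon`, `tol`, and the toy cost matrix are all assumptions, and `max_loops` only plays the role that --loop_in_ot is described as playing.

```python
import math


def sinkhorn(cost, row_mass, col_mass, epsilon=0.5, max_loops=500, tol=1e-9):
    """Generic Sinkhorn sketch: return (transport plan, loops used)."""
    n, m = len(row_mass), len(col_mass)
    # Gibbs kernel: cheap pairs get large entries.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)]
              for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    loops_used = 0
    for loops_used in range(1, max_loops + 1):
        # Rescale rows so that each row of the plan sums to row_mass[i].
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        # Rescale columns so that each column sums to col_mass[j].
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
        # After the column update only the row sums can still be off.
        row_error = max(
            abs(u[i] * sum(kernel[i][j] * v[j] for j in range(m)) - row_mass[i])
            for i in range(n)
        )
        if row_error < tol:
            break
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return plan, loops_used


# Toy problem: two sources, two targets, moving mass "across" costs 1.
cost = [[0.0, 1.0], [1.0, 0.0]]
row_mass = [0.7, 0.3]
col_mass = [0.4, 0.6]
plan, loops_used = sinkhorn(cost, row_mass, col_mass, max_loops=500)
capped_plan, capped_loops = sinkhorn(cost, row_mass, col_mass, max_loops=3)
print(loops_used, [[round(x, 4) for x in row] for row in plan])
```

The loop cap trades accuracy for time: with a small `max_loops` the column sums are still exact after the last update, but the row sums may not yet match their targets.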

The third step is to use the generated vocabulary to tokenize your texts:

  # for the subword-nmt toolkit
  python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < source_data > bpeoutput/source.file
  python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < target_data > bpeoutput/target.file

  # for the sentencepiece toolkit; here we only keep the optimal size
  best_size=$(cat spmoutput/size)
  python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$best_size --character_coverage=1.0 --model_type=bpe

  # After this step, you will see spm.vocab and spm.model
  python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmoutput/source.file --output_format piece
  python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmoutput/target.file --output_format piece

Examples

We provide several examples under "examples/".

Datasets

The WMT-14 En-De translation data can be downloaded via the running scripts.

The TED data can be downloaded from TED.

Citation

Please cite as:

@inproceedings{volt,
  title = {Vocabulary Learning via Optimal Transport for Neural Machine Translation},
  author= {Jingjing Xu and
               Hao Zhou and
               Chun Gan and
               Zaixiang Zheng and
               Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year = {2021},
}
