Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts

Last update: Jan 07, 2023

Overview

DataSelection-NMT

Quick update: The paper was accepted on Dec 6, 2021! I will link the repository to the paper as soon as it is published.

Our Pre-trained models on Hugging Face

Systems	Link	Systems	Link
Top1	Download	Top1	Download
Top2+Top1	Download	Top2	Download
Top3+Top2+...	Download	Top3	Download
Top4+Top3+...	Download	Top4	Download
Top5+Top4+...	Download	Top5	Download
Top6+Top5+...	Download	Top6	Download

How to use

Note: We ported the best checkpoints of our trained models to Hugging Face (HF). Because the models were trained with OpenNMT-py, they cannot be used directly for inference on HF. To work around this, we use CTranslate2, an inference engine for Transformer models.

Follow the steps below to translate your sentences:

1. Install the Python package:

pip install --upgrade pip
pip install ctranslate2

2. Download the models from our HF repository. You can do this manually or with the following Python script:

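import requests

url = "Download Link"      # replace with the link of the model to download
model_path = "Model Path"  # replace with the local path to save the model to

r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop here if the download failed
with open(model_path, "wb") as f:
    f.write(r.content)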
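The script above keeps the whole response in memory before writing it, which can be costly for large checkpoints. A minimal sketch of a streaming alternative, using only the standard library, might look like this. The function name `download_file` is hypothetical and not part of this repository, and the URL is still whatever download link you copy from the table.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, destination: str) -> Path:
    """Stream `url` to `destination` without loading the whole file in memory."""
    target = Path(destination)
    # Create the parent directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        # copyfileobj copies in fixed-size chunks.
        shutil.copyfileobj(response, out)
    return target
```

Calling `download_file(url, model_path)` with the same two values as in the script above should leave the checkpoint at `model_path`.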

3. Convert the downloaded model:

ct2-opennmt-py-converter --model_path model_path --output_dir output_directory
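If you downloaded several checkpoints, the same command can be run once per file from Python. The sketch below only wraps the `ct2-opennmt-py-converter` call shown above with the two flags it already uses. The helper names and the one-directory-per-checkpoint layout are assumptions made for illustration, not something this repository prescribes.

```python
import subprocess
from pathlib import Path
from typing import List


def build_converter_command(model_path: str, output_dir: str) -> List[str]:
    """Build the ct2-opennmt-py-converter call for one checkpoint."""
    return [
        "ct2-opennmt-py-converter",
        "--model_path", model_path,
        "--output_dir", output_dir,
    ]


def convert_checkpoints(checkpoints: List[str], output_root: str) -> List[str]:
    """Convert each checkpoint into its own directory under `output_root`."""
    Path(output_root).mkdir(parents=True, exist_ok=True)
    output_dirs = []
    for checkpoint in checkpoints:
        # Assumed layout: one output directory per checkpoint, named after the file stem.
        output_dir = str(Path(output_root) / Path(checkpoint).stem)
        # check=True raises CalledProcessError if the converter exits with an error.
        subprocess.run(build_converter_command(checkpoint, output_dir), check=True)
        output_dirs.append(output_dir)
    return output_dirs
```

For example, `convert_checkpoints(["top1.pt", "top2.pt"], "converted")` would be expected to produce `converted/top1` and `converted/top2`; the file names here are placeholders.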

4. Translate tokenized inputs:

Note: The inputs must be tokenized with SentencePiece. You can also use the tokenized versions of the IWSLT test sets.
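As a rough sketch of the tokenization requirement, raw sentences could be turned into space-separated SentencePiece pieces with the `sentencepiece` Python package (`pip install sentencepiece`). The helper names and the model path are hypothetical; the README does not say where the SentencePiece model file lives, so you need to supply the one that matches the translation models.

```python
from typing import Callable, Iterable, List


def load_sentencepiece_encoder(model_file: str) -> Callable[[str], List[str]]:
    """Return a function that splits raw text into SentencePiece pieces."""
    # Imported lazily so the rest of the module works without the package.
    import sentencepiece as spm

    processor = spm.SentencePieceProcessor(model_file=model_file)
    return lambda text: processor.encode(text, out_type=str)


def to_tokenized_lines(
    sentences: Iterable[str], encode: Callable[[str], List[str]]
) -> List[str]:
    """Encode each sentence and join its pieces with spaces, one sentence per line."""
    return [" ".join(encode(sentence.strip())) for sentence in sentences]
```

With `encode = load_sentencepiece_encoder("spm.model")` (placeholder path), writing the result of `to_tokenized_lines(sentences, encode)` to a file, one entry per line, gives an input in the space-separated form that CTranslate2 expects when translating a file.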

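import ctranslate2

translator = ctranslate2.Translator("output_directory/")

# Translate a batch of tokenized sentences held in memory.
results = translator.translate_batch([["▁H", "ello", "▁world", "!"]])
print(results[0].hypotheses[0])  # best hypothesis, as a list of tokens

# Or translate a file of tokenized sentences, one sentence per line.
# input_file and output_file are paths; batch_type is "examples" or "tokens".
translator.translate_file(input_file, output_file, batch_type="examples")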

To customize these calls, see the CTranslate2 API documentation.

5. Detokenize the outputs:

Note: Detokenize the output with the same SentencePiece model that was used to tokenize the inputs.

tools/detokenize.perl -no-escape -l fr \
< output_file \
> output_file.detok
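If the translated output still contains SentencePiece pieces, they can also be merged back into text in Python. The sketch below is a plain-Python approximation that only handles the "▁" word-boundary marker; it is not the authors' script. With the `sentencepiece` package and the matching model loaded, `SentencePieceProcessor.decode` is the more faithful choice. The function names are hypothetical.

```python
from typing import List

WORD_BOUNDARY = "\u2581"  # the "▁" marker SentencePiece places at the start of a word


def merge_pieces(pieces: List[str]) -> str:
    """Join SentencePiece pieces into plain text."""
    # Concatenate the pieces, then turn each word-boundary marker into a space.
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


def merge_tokenized_line(line: str) -> str:
    """Same as merge_pieces, for one space-separated line of an output file."""
    return merge_pieces(line.split())
```

For the example sentence used in the translation step, `merge_pieces(["▁H", "ello", "▁world", "!"])` returns `"Hello world!"`.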

6. Remove the @@ tokens:

sed -E 's/(@@)|(@@ )|(@@ ?$)//g' \
< output_file.detok \
> output_file.detok.postprocessed

Use grep to check that the @@ tokens were removed; it prints nothing if none remain:

grep @@ output_file.detok.postprocessed
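The same cleanup and check can be sketched in Python. Python's `re` module tries alternatives in order rather than picking the longest one, so the pattern is written here as `@@ ?` ("@@" plus an optional following space), which is intended to match what the sed command removes. The function names are hypothetical.

```python
import re
from typing import Iterable, List

# "@@" plus one optional following space; the greedy "?" takes the space when present.
BPE_MARKER = re.compile(r"@@ ?")


def strip_bpe_markers(line: str) -> str:
    """Remove every @@ marker (and the space after it) from one line."""
    return BPE_MARKER.sub("", line)


def lines_with_markers(lines: Iterable[str]) -> List[str]:
    """Like `grep @@`: return the lines that still contain a marker."""
    return [line for line in lines if "@@" in line]
```

An empty list from `lines_with_markers` on the cleaned lines plays the same role as grep printing nothing.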

Authors

Javad Pourmostafa - Email, Website
Dimitar Shterionov - Email, Website
Pieter Spronck - Email, Website

Releases(1.1)

1.1(Oct 25, 2021)

Owner

Javad Pourmostafa
