Code for our paper "Mask-Align: Self-Supervised Neural Word Alignment" in ACL 2021

Overview

Mask-Align: Self-Supervised Neural Word Alignment

This is the implementation of our work Mask-Align: Self-Supervised Neural Word Alignment.

@inproceedings{chen2021maskalign,
   title={Mask-Align: Self-Supervised Neural Word Alignment},
   author={Chi Chen and Maosong Sun and Yang Liu},
   booktitle={Association for Computational Linguistics (ACL)},
   year={2021}
}

The implementation is built on top of THUMT.

Contents

Introduction

Mask-Align is a self-supervised neural word aligner. It parallelly masks out each target token and predicts it conditioned on both source and the remaining target tokens. The source token that contributes most to recovering a masked target token will be aligned to that target token.

Prerequisites

  • PyTorch
  • NLTK
  • remi *
  • pyecharts *
  • pandas *
  • matplotlib *
  • seaborn *

*: optional, only used for Visualization.

Usage

Data Preparation

To get the data used in our paper, you can follow the instructions in https://github.com/lilt/alignment-scripts.

To train an aligner with your own data, you should pre-process it yourself. Usually this includes tokenization, BPE, etc. You can find a simple guide here.

Now we have the pre-processed parallel training data (train.src, train.tgt), validation data (optional) (valid.src, valid.tgt) and test data (test.src, test.tgt). An example 3-sentence German–English parallel training corpus is:

# train.src
wiederaufnahme der sitzungsperiode
frau präsidentin , zur geschäfts @@ordnung .
ich bitte sie , sich zu einer schweigeminute zu erheben .

# train.tgt
resumption of the session
madam president , on a point of order .
please rise , then , for this minute ' s silence .

The next step is to shuffle the training set, which proves to be helpful for improving the results.

python thualign/scripts/shuffle_corpus.py --corpus train.src train.tgt

The resulting files train.src.shuf and train.tgt.shuf rearrange the sentence pairs randomly.

Then we need to generate vocabulary from the training set.

python thualign/scripts/build_vocab.py train.src.shuf vocab.train.src
python thualign/scripts/build_vocab.py train.tgt.shuf vocab.train.tgt

The resulting files vocab.train.src.txt and vocab.train.tgt.txt are final source and target vocabularies used for model training.

Training

All experiments are configured via config files in thualign/configs, see Configs for more details.. We provide an example config file thualign/configs/user/example.config. You can easily use it by making three changes:

  1. change device_list, update_cycle and batch_size to match your machine configuration;

  2. change exp_dir and output to your own experiment directory

  3. change train/valid/test_input and vocab to your data paths;

When properly configured, you can use the following command to train an alignment model described in the config file

bash thualign/bin/train.sh -s thualign/configs/user/example.config

or more simply

bash thualign/bin/train.sh -s example

The configuration file is an INI file and is parsed through configparser. By adding a new section, you can easily customize some configs while keep other configs unchanged.

[DEFAULT]
...

[small_budget]
batch_size = 4500
update_cycle = 8
device_list = [0]
half = False

Use -e option to run this small_budget section

bash thualign/bin/train.sh -s example -e small_budget

You can also monitor the training process through tensorboard

tensorboard --logdir=[output]

Test

After training, the following command can be used to generate attention weights (-g), generate data for attention visualization (-v), and test its AER (-t) if test_ref is provided.

bash thualign/bin/test.sh -s [CONFIG] -e [EXP] -gvt

For example, to test the model trained with the configs in example.config

bash thualign/bin/test.sh -s example -gvt

You might get the following output

alignment-soft.txt: 14.4% (87.7%/83.5%/9467)

The alignment results (alignment.txt) along with other test results are stored in [output]/test by default.

Configs

Most of the configuration of Mask-Align is done through configuration files in thualign/configs. The model reads the basic configs first, followed by the user-defined configs.

Basic Config

Predefined configs for experiments to use.

  • base.config: basic configs for training, validation and test

  • model.config: define different models with their hyperparameters

User Config

Customized configs that must describe the following configuration and maybe other experiment-specific parameters:

  • train/valid/test_input: paths of input parallel corpuses
  • vocab: paths of vocabulary files generated from thualign/scripts/build_vocab.py
  • output: path to save the model outputs
  • model: which model to use
  • batch_size: the batch size (number of tokens) used in the training stage.
  • update_cycle: the number of iterations for updating model parameters. The default value is 1. If you have only 1 GPU and want to obtain the same translation performance with using 4 GPUs, simply set this parameter to 4. Note that the training time will also be prolonged.
  • device_list: the list of GPUs to be used in training. Use the nvidia-smi command to find unused GPUs. If the unused GPUs are gpu0 and gpu1, set this parameter as device_list=[0,1].
  • half: set this to True if you wish to use half-precision training. This will speeds up the training procedure. Make sure that you have the GPUs with half-precision support.

Here is a minimal experiment config:

### thualign/configs/user/example.config
[DEFAULT]

train_input = ['train.src', 'train.tgt']
valid_input = ['valid.src', 'valid.tgt']
vocab = ['vocab.src.txt', 'vocab.tgt.txt']
test_input = ['test.src', 'test.tgt']
test_ref = test.talp

exp_dir = exp
label = agree_deen
output = ${exp_dir}/${label}

model = mask_align

batch_size = 9000
update_cycle = 1
device_list = [0,1,2,3]
half = True

Visualization

To better understand and analyze the model, Mask-Align supports the following two types of visulizations.

Training Visualization

Add eval_plot = True in your config file to turn on visualization during training. This will plot 5 attention maps from evaluation in the tensorboard.

These packages are required for training visualization:

  • pandas
  • matplotlib
  • seaborn

Attention Visualization

Use -v in the test command to generate alignment_vizdata.pt first. It is stored in [output]/test by default. To visualize it, using this script

python thualign/scripts/visualize.py [output]/test/alignment_vizdata.pt [--port PORT]

This will start a local service that plots the attention weights for all the test sentence pairs. You can access it through a web browser.

These packages are required for training visualization:

  • remi
  • pyecharts

Contact

If you have questions, suggestions and bug reports, please email [email protected].

Owner
THUNLP-MT
Machine Translation Group, Natural Language Processing Lab at Tsinghua University (THUNLP). Please refer to https://github.com/thunlp for more NLP resources.
THUNLP-MT
Crie tokens de autenticação íntegros e seguros com UToken.

UToken - Tokens seguros. UToken (ou Unhandleable Token) é uma bilioteca criada para ser utilizada na geração de tokens seguros e íntegros, ou seja, nã

Jaedson Silva 0 Nov 29, 2022
American Sign Language (ASL) to Text Converter

Signterpreter American Sign Language (ASL) to Text Converter Recommendations Although there is grayscale and gaussian blur, we recommend that you use

0 Feb 20, 2022
A simple recipe for training and inferencing Transformer architecture for Multi-Task Learning on custom datasets. You can find two approaches for achieving this in this repo.

multitask-learning-transformers A simple recipe for training and inferencing Transformer architecture for Multi-Task Learning on custom datasets. You

Shahrukh Khan 48 Jan 02, 2023
BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

Table of contents Introduction Using BARTpho with fairseq Using BARTpho with transformers Notes BARTpho: Pre-trained Sequence-to-Sequence Models for V

VinAI Research 58 Dec 23, 2022
NVDA, the free and open source Screen Reader for Microsoft Windows

NVDA NVDA (NonVisual Desktop Access) is a free, open source screen reader for Microsoft Windows. It is developed by NV Access in collaboration with a

NV Access 1.6k Jan 07, 2023
Finding Label and Model Errors in Perception Data With Learned Observation Assertions

Finding Label and Model Errors in Perception Data With Learned Observation Assertions This is the project page for Finding Label and Model Errors in P

Stanford Future Data Systems 17 Oct 14, 2022
초성 해석기 based on ko-BART

초성 해석기 개요 한국어 초성만으로 이루어진 문장을 입력하면, 완성된 문장을 예측하는 초성 해석기입니다. 초성: ㄴㄴ ㄴㄹ ㅈㅇㅎ 예측 문장: 나는 너를 좋아해 모델 모델은 SKT-AI에서 공개한 Ko-BART를 이용합니다. 데이터 문장 단위로 이루어진 아무 코퍼스나

Dawoon Jung 29 Oct 28, 2022
SimCSE: Simple Contrastive Learning of Sentence Embeddings

SimCSE: Simple Contrastive Learning of Sentence Embeddings This repository contains the code and pre-trained models for our paper SimCSE: Simple Contr

Princeton Natural Language Processing 2.5k Jan 07, 2023
Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"

UNITER: UNiversal Image-TExt Representation Learning This is the official repository of UNITER (ECCV 2020). This repository currently supports finetun

Yen-Chun Chen 680 Dec 24, 2022
Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images + text as inputs, and outputs task predictions. ClipB

Jie Lei 雷杰 612 Jan 04, 2023
FastFormers - highly efficient transformer models for NLU

FastFormers FastFormers provides a set of recipes and methods to achieve highly efficient inference of Transformer models for Natural Language Underst

Microsoft 678 Jan 05, 2023
In this project, we compared Spanish BERT and Multilingual BERT in the Sentiment Analysis task.

Applying BERT Fine Tuning to Sentiment Classification on Amazon Reviews Abstract Sentiment analysis has made great progress in recent years, due to th

Alexander Leonardo Lique Lamas 5 Jan 03, 2022
Include MelGAN, HifiGAN and Multiband-HifiGAN, maybe NHV in the future.

Fast (GAN Based Neural) Vocoder Chinese README Todo Submit demo Support NHV Discription Include MelGAN, HifiGAN and Multiband-HifiGAN, maybe include N

Zhengxi Liu (刘正曦) 134 Dec 16, 2022
Searching keywords in PDF file folders

keyword_searching Steps to use this Python scripts: (1)Paste this script into the file folder containing the PDF files you need to search from; (2)Thi

1 Nov 08, 2021
Pytorch code for ICRA'21 paper: "Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation"

Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation This repository is the pytorch implementation of our paper: Hierarchical Cr

44 Jan 06, 2023
skweak: A software toolkit for weak supervision applied to NLP tasks

Labelled data remains a scarce resource in many practical NLP scenarios. This is especially the case when working with resource-poor languages (or text domains), or when using task-specific labels wi

Norsk Regnesentral (Norwegian Computing Center) 850 Dec 28, 2022
The source code of HeCo

HeCo This repo is for source code of KDD 2021 paper "Self-supervised Heterogeneous Graph Neural Network with Co-contrastive Learning". Paper Link: htt

Nian Liu 106 Dec 27, 2022
A Telegram bot to add notes to Flomo.

flomo bot 使用 Telegram 机器人发送笔记到你的 Flomo. 你需要有一台可访问 Telegram 的服务器。 Steps @BotFather 新建机器人,获取 token Flomo 官网获取 API,链接 https://flomoapp.com/mine?source=in

Zhen 44 Dec 30, 2022
Ecommerce product title recognition package

revizor This package solves task of splitting product title string into components, like type, brand, model and article (or SKU or product code or you

Bureaucratic Labs 16 Mar 03, 2022
Voilà turns Jupyter notebooks into standalone web applications

Rendering of live Jupyter notebooks with interactive widgets. Introduction Voilà turns Jupyter notebooks into standalone web applications. Unlike the

Voilà Dashboards 4.5k Jan 03, 2023