Training and evaluation codes for the BertGen paper (ACL-IJCNLP 2021)

Overview

BERTGEN

This repository is the implementation of the paper "BERTGEN: Multi-task Generation through BERT" (https://arxiv.org/abs/2106.03484). The codebase is based on the VL-BERT official repository (https://github.com/jackroos/VL-BERT) presented in the paper VL-BERT: Pre-training of Generic Visual-Linguistic Representations.

Introduction

BERTGEN extends the VL-BERT model by making it multilingual, inheriting multilingual pretraining from multilingual BERT (https://github.com/google-research/bert/blob/master/multilingual.md. The BERTGEN model produces multilingual, multimodal embeddings usede for visual-linguistic generation tasks.

BERTGEN takes advantage of large-scale training of VL-BERT and M-BERT but is also further trained, in a generative setting as described in the paper.

drawing

Figure 1: Overview of the BERTGEN architecture

Special thanks to VL-BERT, PyTorch and its 3rd-party libraries and BERT. This codebase also uses the following features inherited from VL-BERT:

  • Distributed Training
  • Various Optimizers and Learning Rate Schedulers
  • Gradient Accumulation
  • Monitoring the Training Using TensorboardX

Prepare

Environment

  • Ubuntu 16.04, CUDA 9.0, GCC 4.9.4
  • Python 3.6.x
    # We recommend you to use Anaconda/Miniconda to create a conda environment
    conda create -n bertgen python=3.6 pip
    conda activate bertgen
  • PyTorch 1.0.0 or 1.1.0
    conda install pytorch=1.1.0 cudatoolkit=9.0 -c pytorch
  • Apex (optional, for speed-up and fp16 training)
    git clone https://github.com/jackroos/apex
    cd ./apex
    pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./  
  • Other requirements:
    pip install Cython
    pip install -r requirements.txt
  • Compile
    ./scripts/init.sh

Data

The datasets used for training and evaluating BERTGEN can be found in this zenodo link. After checking out the code repository, simply extract the .tar.gz file downloaded from zenodo to <github checkout folder>/data. A README file is included in the download with more information on the structure of the datasets.

Pre-trained Models

See PREPARE_PRETRAINED_MODELS.md.

Training

Distributed Training on Single-Machine

./scripts/dist_run_single.sh <num_gpus> <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>
  • <num_gpus>: number of gpus to use.
  • <task>: LanguageGeneration.
  • <path_to_cfg>: config yaml file under ./cfgs/<task>.
  • <dir_to_store_checkpoint>: root directory to store checkpoints.

Following is a more concrete example:

./scripts/dist_run_single.sh 4 LanguageGeneration/train_end2end.py ./cfgs/multitask_training/base_prec_multitask_train_global.yaml ./checkpoints

Distributed Training on Multi-Machine

For example, on 2 machines (A and B), each with 4 GPUs,

run following command on machine A:

./scripts/dist_run_multi.sh 2 0 <ip_addr_of_A> 4 <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>

run following command on machine B:

./scripts/dist_run_multi.sh 2 1 <ip_addr_of_A> 4 <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>
  • Training:
    • multitask training: "MODULE: BERTGENMultitaskTraining"

Non-Distributed Training

./scripts/nondist_run.sh <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>

Note:

  1. In yaml files under ./cfgs, we set batch size for GPUs with at least 32G memory, you may need to adapt the batch size and gradient accumulation steps according to your actual case, e.g., if you decrease the batch size, you should also increase the gradient accumulation steps accordingly to keep 'actual' batch size for SGD unchanged. Note that for the multitask training of 13 tasks the batch size is set to the minimum of 1 sample from each dataset per task. You would have to reduce the number of datasets to fit on a GPU with smaller memory than 32G.

  2. For efficiency, we recommend you to use distributed training even on single-machine.

Evaluation

Language Generation tasks (MT, MMT, IC)

  • Generate prediction results on selected test dataset (specified in yaml). The task is also specified in the .yaml file (MT, MMT, IC):
    python LanguageGeneration/test.py \
      --cfg <cfg_of_downstream_task> \
      --ckpt <checkpoint_of_pretrained_model> \
      --gpus <indexes_of_gpus_to_use> \
      --result-path <dir_to_save_result> --result-name <result_file_name>
    
  • Inference:
    • Machine Translation: "MODULE: BERTGENGenerateMMT"
    • Multimodal Machine Translation: "MODULE: BERTGENGenerateMT"
    • Image Captioning: "MODULE: BERTGENGenerateImageOnly"

Evaluation Metrics

  • After generating results, the generated text file can be compared with the ground truth in tokenised format. We have used the nmtpytoch tool for generating these metrics. An example is shown below
nmtpy-coco-metrics -l de "./checkpoints/generated/ENDEIMG.txt" -r "./data/ground_truths/ENDEIMG.txt.tok

Acknowledgements

Many thanks to following codebases that have been essential while building this codebase:

Owner
[email protected]
Natural Language Processing at Imperial College London
<a href=[email protected]">
DeepSpeech - Easy-to-use Speech Toolkit including SOTA ASR pipeline, influential TTS with text frontend and End-to-End Speech Simultaneous Translation.

(简体中文|English) Quick Start | Documents | Models List PaddleSpeech is an open-source toolkit on PaddlePaddle platform for a variety of critical tasks i

5.6k Jan 03, 2023
lightweight, fast and robust columnar dataframe for data analytics with online update

streamdf Streamdf is a lightweight data frame library built on top of the dictionary of numpy array, developed for Kaggle's time-series code competiti

23 May 19, 2022
SpikeX - SpaCy Pipes for Knowledge Extraction

SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline. It aims to help in building knowledge extraction tools with almost-zero effort.

Erre Quadro Srl 384 Dec 12, 2022
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Main features: Train new vocabularies and tok

Hugging Face 6.2k Dec 31, 2022
A notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository

We provide a notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository. The notebook also shows how to segment the corpus using BPE tokenizatio

Computation for Indian Language Technology (CFILT) 9 Oct 13, 2022
A pytorch implementation of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".

RE2 This is a pytorch implementation of the ACL 2019 paper "Simple and Effective Text Matching with Richer Alignment Features". The original Tensorflo

286 Jan 02, 2023
Awesome-NLP-Research (ANLP)

Awesome-NLP-Research (ANLP)

Language, Information, and Learning at Yale 72 Dec 19, 2022
XLNet: Generalized Autoregressive Pretraining for Language Understanding

Introduction XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective.

Zihang Dai 6k Jan 07, 2023
Paradigm Shift in NLP - "Paradigm Shift in Natural Language Processing".

Paradigm Shift in NLP Welcome to the webpage for "Paradigm Shift in Natural Language Processing". Some resources of the paper are constantly maintaine

Tianxiang Sun 41 Dec 30, 2022
👄 The most accurate natural language detection library for Python, suitable for long and short text alike

1. What does this library do? Its task is simple: It tells you which language some provided textual data is written in. This is very useful as a prepr

Peter M. Stahl 334 Dec 30, 2022
Help you discover excellent English projects and get rid of disturbing by other spoken language

GitHub English Top Charts 「Help you discover excellent English projects and get

GrowingGit 544 Jan 09, 2023
AllenNLP integration for Shiba: Japanese CANINE model

Allennlp Integration for Shiba allennlp-shiab-model is a Python library that provides AllenNLP integration for shiba-model. SHIBA is an approximate re

Shunsuke KITADA 12 Feb 16, 2022
Healthsea is a spaCy pipeline for analyzing user reviews of supplementary products for their effects on health.

Welcome to Healthsea ✨ Create better access to health with spaCy. Healthsea is a pipeline for analyzing user reviews to supplement products by extract

Explosion 75 Dec 19, 2022
null

CP-Cluster Confidence Propagation Cluster aims to replace NMS-based methods as a better box fusion framework in 2D/3D Object detection, Instance Segme

Yichun Shen 41 Dec 08, 2022
Predict the spans of toxic posts that were responsible for the toxic label of the posts

toxic-spans-detection An attempt at the SemEval 2021 Task 5: Toxic Spans Detection. The Toxic Spans Detection task of SemEval2021 required participant

Ilias Antonopoulos 3 Jul 24, 2022
Script to download some free japanese lessons in portuguse from NHK

Nihongo_nhk This is a script to download some free japanese lessons in portuguese from NHK. It can be executed by installing the packages with: pip in

Matheus Alves 2 Jan 06, 2022
Weird Sort-and-Compress Thing

Weird Sort-and-Compress Thing A weird integer sorting + compression algorithm inspired by a conversation with Luthingx (it probably already exists by

Douglas 1 Jan 03, 2022
Yodatranslator is a simple translator English to Yoda-language

yodatranslator Overview yodatranslator is a simple translator English to Yoda-language. Project is created for educational purposes. It is intended to

1 Nov 11, 2021
GraphNLI: A Graph-based Natural Language Inference Model for Polarity Prediction in Online Debates

GraphNLI: A Graph-based Natural Language Inference Model for Polarity Prediction in Online Debates Vibhor Agarwal, Sagar Joglekar, Anthony P. Young an

Vibhor Agarwal 2 Jun 30, 2022
This is a simple item2vec implementation using gensim for recbole

recbole-item2vec-model This is a simple item2vec implementation using gensim for recbole( https://recbole.io ) Usage When you want to run experiment f

Yusuke Fukasawa 2 Oct 06, 2022