Training and evaluation codes for the BertGen paper (ACL-IJCNLP 2021)

Overview

BERTGEN

This repository is the implementation of the paper "BERTGEN: Multi-task Generation through BERT" (https://arxiv.org/abs/2106.03484). The codebase is based on the VL-BERT official repository (https://github.com/jackroos/VL-BERT) presented in the paper VL-BERT: Pre-training of Generic Visual-Linguistic Representations.

Introduction

BERTGEN extends the VL-BERT model by making it multilingual, inheriting multilingual pretraining from multilingual BERT (https://github.com/google-research/bert/blob/master/multilingual.md. The BERTGEN model produces multilingual, multimodal embeddings usede for visual-linguistic generation tasks.

BERTGEN takes advantage of large-scale training of VL-BERT and M-BERT but is also further trained, in a generative setting as described in the paper.

drawing

Figure 1: Overview of the BERTGEN architecture

Special thanks to VL-BERT, PyTorch and its 3rd-party libraries and BERT. This codebase also uses the following features inherited from VL-BERT:

  • Distributed Training
  • Various Optimizers and Learning Rate Schedulers
  • Gradient Accumulation
  • Monitoring the Training Using TensorboardX

Prepare

Environment

  • Ubuntu 16.04, CUDA 9.0, GCC 4.9.4
  • Python 3.6.x
    # We recommend you to use Anaconda/Miniconda to create a conda environment
    conda create -n bertgen python=3.6 pip
    conda activate bertgen
  • PyTorch 1.0.0 or 1.1.0
    conda install pytorch=1.1.0 cudatoolkit=9.0 -c pytorch
  • Apex (optional, for speed-up and fp16 training)
    git clone https://github.com/jackroos/apex
    cd ./apex
    pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./  
  • Other requirements:
    pip install Cython
    pip install -r requirements.txt
  • Compile
    ./scripts/init.sh

Data

The datasets used for training and evaluating BERTGEN can be found in this zenodo link. After checking out the code repository, simply extract the .tar.gz file downloaded from zenodo to <github checkout folder>/data. A README file is included in the download with more information on the structure of the datasets.

Pre-trained Models

See PREPARE_PRETRAINED_MODELS.md.

Training

Distributed Training on Single-Machine

./scripts/dist_run_single.sh <num_gpus> <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>
  • <num_gpus>: number of gpus to use.
  • <task>: LanguageGeneration.
  • <path_to_cfg>: config yaml file under ./cfgs/<task>.
  • <dir_to_store_checkpoint>: root directory to store checkpoints.

Following is a more concrete example:

./scripts/dist_run_single.sh 4 LanguageGeneration/train_end2end.py ./cfgs/multitask_training/base_prec_multitask_train_global.yaml ./checkpoints

Distributed Training on Multi-Machine

For example, on 2 machines (A and B), each with 4 GPUs,

run following command on machine A:

./scripts/dist_run_multi.sh 2 0 <ip_addr_of_A> 4 <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>

run following command on machine B:

./scripts/dist_run_multi.sh 2 1 <ip_addr_of_A> 4 <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>
  • Training:
    • multitask training: "MODULE: BERTGENMultitaskTraining"

Non-Distributed Training

./scripts/nondist_run.sh <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>

Note:

  1. In yaml files under ./cfgs, we set batch size for GPUs with at least 32G memory, you may need to adapt the batch size and gradient accumulation steps according to your actual case, e.g., if you decrease the batch size, you should also increase the gradient accumulation steps accordingly to keep 'actual' batch size for SGD unchanged. Note that for the multitask training of 13 tasks the batch size is set to the minimum of 1 sample from each dataset per task. You would have to reduce the number of datasets to fit on a GPU with smaller memory than 32G.

  2. For efficiency, we recommend you to use distributed training even on single-machine.

Evaluation

Language Generation tasks (MT, MMT, IC)

  • Generate prediction results on selected test dataset (specified in yaml). The task is also specified in the .yaml file (MT, MMT, IC):
    python LanguageGeneration/test.py \
      --cfg <cfg_of_downstream_task> \
      --ckpt <checkpoint_of_pretrained_model> \
      --gpus <indexes_of_gpus_to_use> \
      --result-path <dir_to_save_result> --result-name <result_file_name>
    
  • Inference:
    • Machine Translation: "MODULE: BERTGENGenerateMMT"
    • Multimodal Machine Translation: "MODULE: BERTGENGenerateMT"
    • Image Captioning: "MODULE: BERTGENGenerateImageOnly"

Evaluation Metrics

  • After generating results, the generated text file can be compared with the ground truth in tokenised format. We have used the nmtpytoch tool for generating these metrics. An example is shown below
nmtpy-coco-metrics -l de "./checkpoints/generated/ENDEIMG.txt" -r "./data/ground_truths/ENDEIMG.txt.tok

Acknowledgements

Many thanks to following codebases that have been essential while building this codebase:

Owner
[email protected]
Natural Language Processing at Imperial College London
<a href=[email protected]">
Lightweight utility tools for the detection of multiple spellings, meanings, and language-specific terminology in British and American English

Breame ( British English and American English) Breame is a lightweight Python package with a number of utility tools to aid in the detection of words

Charles 8 Oct 10, 2022
Submit issues and feature requests for our API here.

AIx GPT API Submit issues and feature requests for our API here. See https://apps.aixsolutionsgroup.com for more info. Python Quick Start pip install

AIx Solutions 7 Mar 27, 2022
AI-Broad-casting - AI Broad casting with python

Basic Code 1. Use The Code Configuration Environment conda create -n code_base p

Repository for the paper "Optimal Subarchitecture Extraction for BERT"

Bort Companion code for the paper "Optimal Subarchitecture Extraction for BERT." Bort is an optimal subset of architectural parameters for the BERT ar

Alexa 461 Nov 21, 2022
TFIDF-based QA system for AIO2 competition

AIO2 TF-IDF Baseline This is a very simple question answering system, which is developed as a lightweight baseline for AIO2 competition. In the traini

Masatoshi Suzuki 4 Feb 19, 2022
A Python module made to simplify the usage of Text To Speech and Speech Recognition.

Nav Module The solution for voice related stuff in Python Nav is a Python module which simplifies voice related stuff in Python. Just import the Modul

Snm Logic 1 Dec 20, 2021
Dual languaged (rus+eng) tool for packing and unpacking archives of Silky Engine.

SilkyArcTool English Dual languaged (rus+eng) GUI tool for packing and unpacking archives of Silky Engine. It is not the same arc as used in Ai6WIN. I

Tester 5 Sep 15, 2022
A multi-voice TTS system trained with an emphasis on quality

TorToiSe Tortoise is a text-to-speech program built with the following priorities: Strong multi-voice capabilities. Highly realistic prosody and inton

James Betker 2.1k Jan 01, 2023
Code for "Finetuning Pretrained Transformers into Variational Autoencoders"

transformers-into-vaes Code for Finetuning Pretrained Transformers into Variational Autoencoders (our submission to NLP Insights Workshop 2021). Gathe

Seongmin Park 22 Nov 26, 2022
An easy to use, user-friendly and efficient code for extracting OpenAI CLIP (Global/Grid) features from image and text respectively.

Extracting OpenAI CLIP (Global/Grid) Features from Image and Text This repo aims at providing an easy to use and efficient code for extracting image &

Jianjie(JJ) Luo 13 Jan 06, 2023
Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

Flexible interface for high performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra. What is Lightning Tran

Pytorch Lightning 581 Dec 21, 2022
🌐 Translation microservice powered by AI

Dot Translate 🌐 A microservice for quick and local translation using A.I. This service starts a local webserver used for neural machine translation.

Dot HQ 48 Nov 22, 2022
SentimentArcs: a large ensemble of dozens of sentiment analysis models to analyze emotion in text over time

SentimentArcs - Emotion in Text An end-to-end pipeline based on Jupyter notebooks to detect, extract, process and anlayze emotion over time in text. E

jon_chun 14 Dec 19, 2022
Lattice methods in TensorFlow

TensorFlow Lattice TensorFlow Lattice is a library that implements constrained and interpretable lattice based models. It is an implementation of Mono

504 Dec 20, 2022
A Python script that compares files in directories

compare-files A Python script that compares files in different directories, this is similar to the command filecmp.cmp(f1, f2). I made this script in

Colvin 1 Oct 15, 2021
Exploring dimension-reduced embeddings

sleepwalk Exploring dimension-reduced embeddings This is the code repository. See here for the Sleepwalk web page. License and disclaimer This program

S. Anders's research group at ZMBH 91 Nov 29, 2022
CoNLL-English NER Task (NER in English)

CoNLL-English NER Task en | ch Motivation Course Project review the pytorch framework and sequence-labeling task practice using the transformers of Hu

Kevin 2 Jan 14, 2022
Sentence Embeddings with BERT & XLNet

Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch This framework provides an easy method t

Ubiquitous Knowledge Processing Lab 9.1k Jan 02, 2023
Use the state-of-the-art m2m100 to translate large data on CPU/GPU/TPU. Super Easy!

Easy-Translate is a script for translating large text files in your machine using the M2M100 models from Facebook/Meta AI. We also privide a script fo

Iker García-Ferrero 41 Dec 15, 2022
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language This repository contains UA-GEC data and an accompanying Python lib

Grammarly 227 Jan 02, 2023