Source code for the ACL 2021 paper 'BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data'

Last update: Dec 27, 2022

Overview

This repository provides the implementation of the ACL 2021 main conference paper:

BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data. [paper]

1. Data Preparation

In this work, we carried out persona-based dialogue generation experiments under a persona-dense scenario (English PersonaChat) and a persona-sparse scenario (Chinese PersonalDialog), with the assistance of a series of auxiliary inference datasets. The tables below summarize the key information about these datasets and link to the ones that are directly accessible.

For Persona-Dense Experiments

| Dataset | Type | Language | Usage | Where to Download | Files / Labels Used |
|---|---|---|---|---|---|
| ConvAI2 PersonaChat | Dialogue Generation | English | Training | https://www.aclweb.org/anthology/P18-1205.pdf | train_self_original_no_cands & valid_self_original_no_cands (7801 test dialogues) |
| MNLI | Non-dialogue Inference | English | Training | https://cims.nyu.edu/~sbowman/multinli/multinli_1.0.zip | entailment & contradiction |
| DNLI | Dialogue Inference | English | Evaluation | https://www.aclweb.org/anthology/P19-1363.pdf | |
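
The inference datasets are used with only the entailment and contradiction labels. A minimal sketch of such a filter is given below. It assumes the MultiNLI JSONL layout, in which each line is a JSON object with `gold_label`, `sentence1` and `sentence2` fields. The function name and the output keys are hypothetical and are not taken from preprocess.py.

```python
import json
from typing import Iterable, Iterator

# Labels kept for training, as listed for the inference datasets in this README.
KEPT_LABELS = {"entailment", "contradiction"}


def filter_nli_pairs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield premise/hypothesis pairs whose gold label is in KEPT_LABELS.

    Assumes MultiNLI-style JSONL input (one JSON object per line).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # "neutral" pairs and pairs without a consensus label ("-") are dropped.
        if record.get("gold_label") in KEPT_LABELS:
            yield {
                "premise": record["sentence1"],
                "hypothesis": record["sentence2"],
                "label": record["gold_label"],
            }


if __name__ == "__main__":
    # Tiny in-memory sample; replace with lines read from a real JSONL file.
    sample = [
        '{"gold_label": "entailment", "sentence1": "A man sleeps.", "sentence2": "A person rests."}',
        '{"gold_label": "neutral", "sentence1": "A man sleeps.", "sentence2": "He dreams."}',
    ]
    for pair in filter_nli_pairs(sample):
        print(pair)
```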

For Persona-Sparse Experiments

| Dataset | Type | Language | Usage | Where to Download | Files / Labels Used |
|---|---|---|---|---|---|
| ECDT2019 PersonalDialog | Dialogue Generation | Chinese | Training | https://arxiv.org/pdf/1901.09672.pdf | dialogues_train.json & test_data_random.json & test_data_biased.json |
| CMNLI | Non-dialogue Inference | Chinese | Training | https://github.com/CLUEbenchmark/CLUECorpus2020/ | entailment & contradiction |
| KvPI | Dialogue Inference | Chinese | Evaluation | https://github.com/songhaoyu/KvPI | |

Download Pre-trained BERT

The BoB model is initialized from public BERT checkpoints:
- English BERT: https://huggingface.co/bert-base-uncased/tree/main
- Chinese BERT: https://huggingface.co/bert-base-chinese/tree/main
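
As an alternative to downloading the files by hand, the checkpoints can be fetched with 🤗 Transformers and saved locally. The sketch below is not part of this repository. The helper names are hypothetical, and the target directory is an assumption chosen to match the `./pretrained_models/bert/...` paths used in the commands of this README. The weight file format that gets written depends on the installed Transformers version.

```python
from pathlib import Path

# Assumed local layout, mirroring the paths used in the commands of this README.
PRETRAINED_ROOT = Path("./pretrained_models/bert")


def local_checkpoint_dir(model_name: str, root: Path = PRETRAINED_ROOT) -> Path:
    """Return the directory in which a checkpoint is expected to live."""
    return root / model_name


def download_checkpoint(model_name: str, root: Path = PRETRAINED_ROOT) -> Path:
    """Download a BERT checkpoint from the Hugging Face Hub and save it locally."""
    # Imported lazily so that the path helper works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    target = local_checkpoint_dir(model_name, root)
    target.mkdir(parents=True, exist_ok=True)
    AutoTokenizer.from_pretrained(model_name).save_pretrained(str(target))
    AutoModel.from_pretrained(model_name).save_pretrained(str(target))
    return target


# Usage (requires network access):
#   download_checkpoint("bert-base-uncased")
#   download_checkpoint("bert-base-chinese")
```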

2. How to Run

The setup.sh script installs the dependencies needed to run this project; simply run ./setup.sh. Here we take the English PersonaChat dataset as an example to illustrate how to run the dialogue generation experiments. There are three steps: tokenization, training, and inference:

Preprocessing
```
 python preprocess.py --dataset_type convai2 \
 --trainset ./data/ConvAI2/train_self_original_no_cands.txt \
 --testset ./data/ConvAI2/valid_self_original_no_cands.txt \
 --nliset ./data/ConvAI2/ \
 --encoder_model_name_or_path ./pretrained_models/bert/bert-base-uncased/ \
 --max_source_length 64 \
 --max_target_length 32
```
We provide some data examples (dozens of lines) in the ./data directory to show the data format. preprocess.py reads the datasets and tokenizes the raw data into sequences of vocabulary IDs for model training. --dataset_type can be either convai2 (for English PersonaChat) or ecdt2019 (for Chinese PersonalDialog). The tokenized data is saved as a series of JSON files.
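
To make the preprocessing step concrete, here is a toy sketch of turning text into fixed-length ID sequences and serializing them as JSON. It is not the code in preprocess.py: the whitespace tokenizer, the tiny vocabulary, the padding scheme and the `persona`/`query`/`response` keys are all assumptions for illustration. The real script uses the BERT tokenizer given by `--encoder_model_name_or_path`. The default lengths mirror `--max_source_length 64` and `--max_target_length 32`.

```python
import json
from typing import Dict, List

# Toy vocabulary for illustration only; the real one comes from the BERT checkpoint.
TOY_VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "love": 5, "dogs": 6, "hello": 7,
}


def encode(text: str, max_length: int, vocab: Dict[str, int] = TOY_VOCAB) -> List[int]:
    """Map whitespace tokens to IDs, add [CLS]/[SEP], then truncate and pad."""
    ids = [vocab.get(token, vocab["[UNK]"]) for token in text.lower().split()]
    ids = ids[: max_length - 2]  # reserve two positions for [CLS] and [SEP]
    ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]
    return ids + [vocab["[PAD]"]] * (max_length - len(ids))


def build_example(persona: str, query: str, response: str,
                  max_source_length: int = 64,
                  max_target_length: int = 32) -> Dict[str, List[int]]:
    """Encode one dialogue sample; the key names are hypothetical."""
    return {
        "persona": encode(persona, max_source_length),
        "query": encode(query, max_source_length),
        "response": encode(response, max_target_length),
    }


if __name__ == "__main__":
    example = build_example("i love dogs", "hello", "i love cats",
                            max_source_length=8, max_target_length=6)
    print(json.dumps(example))
```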

Model Training

```
CUDA_VISIBLE_DEVICES=0 python bertoverbert.py --do_train \
--encoder_model ./pretrained_models/bert/bert-base-uncased/ \
--decoder_model ./pretrained_models/bert/bert-base-uncased/ \
--decoder2_model ./pretrained_models/bert/bert-base-uncased/ \
--save_model_path checkpoints/ConvAI2/bertoverbert --dataset_type convai2 \
--dumped_token ./data/ConvAI2/convai2_tokenized/ \
--learning_rate 7e-6 \
--batch_size 32
```

Here we initialize the encoder and both decoders from the same downloaded BERT checkpoint. More parameter settings can be found in bertoverbert.py.

Evaluations

```
CUDA_VISIBLE_DEVICES=0 python bertoverbert.py --dumped_token ./data/ConvAI2/convai2_tokenized/ \
--dataset_type convai2 \
--encoder_model ./pretrained_models/bert/bert-base-uncased/ \
--do_evaluation --do_predict \
--eval_epoch 7
```
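
Since the best checkpoint is not known in advance, it can help to evaluate several epochs in turn. The sketch below only builds and prints the command for each epoch in the range 5 to 9, which is the range this README reports for PersonaChat. It is not part of the repository. The function name is hypothetical, and it uses only the flags shown in the evaluation command above.

```python
import shlex
from typing import List


def build_eval_command(
    epoch: int,
    dumped_token: str = "./data/ConvAI2/convai2_tokenized/",
    encoder_model: str = "./pretrained_models/bert/bert-base-uncased/",
) -> List[str]:
    """Return the evaluation command for one checkpoint as an argument list."""
    return [
        "python", "bertoverbert.py",
        "--dumped_token", dumped_token,
        "--dataset_type", "convai2",
        "--encoder_model", encoder_model,
        "--do_evaluation", "--do_predict",
        "--eval_epoch", str(epoch),
    ]


if __name__ == "__main__":
    # Print one shell-ready command per epoch; run them with the GPU of your
    # choice, e.g. by prefixing CUDA_VISIBLE_DEVICES=0.
    for epoch in range(5, 10):
        print(shlex.join(build_eval_command(epoch)))
```

The argument list can also be passed directly to `subprocess.run` if the evaluations should be launched from Python.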

Empirically, in the PersonaChat experiment with default hyperparameter settings, the best-performing checkpoint is usually found between epoch 5 and epoch 9. If training went well, the evaluation should print a result similar to:

```
Perplexity on test set is 21.037 and 7.813.
```

Here, 21.037 is the perplexity (ppl) of the first decoder and 7.813 is the final perplexity of the second decoder. The generated results are written to test_result.tsv. Below is an example generated by the above checkpoint:

```
persona:i'm terrified of scorpions. i am employed by the us postal service. i've a german shepherd named barnaby. my father drove a car for nascar.
query:sorry to hear that. my dad is an army soldier.
gold:i thank him for his service.
response_from_d1:that's cool. i'm a train driver.
response_from_d2:that's cool. i'm a bit of a canadian who works for america.
```

where d1 and d2 are the two BERT decoders, respectively.
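
When comparing checkpoints, the two perplexities can be read back from the evaluation output. The sketch below assumes the output line has exactly the form shown above ("Perplexity on test set is X and Y."). The helper names are hypothetical, and picking the epoch with the lowest second-decoder perplexity is one possible selection rule, not necessarily the one used in the paper.

```python
import re
from typing import Dict, Optional, Tuple

# Matches the evaluation output line shown in this README.
PPL_PATTERN = re.compile(
    r"Perplexity on test set is ([0-9]+(?:\.[0-9]+)?) and ([0-9]+(?:\.[0-9]+)?)"
)


def parse_perplexities(log_line: str) -> Optional[Tuple[float, float]]:
    """Return (first-decoder ppl, second-decoder ppl), or None if absent."""
    match = PPL_PATTERN.search(log_line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def best_epoch(logs: Dict[int, str]) -> Optional[int]:
    """Return the epoch whose log line has the lowest second-decoder ppl."""
    scored: Dict[int, float] = {}
    for epoch, line in logs.items():
        parsed = parse_perplexities(line)
        if parsed is not None:
            scored[epoch] = parsed[1]
    if not scored:
        return None
    return min(scored, key=scored.get)


if __name__ == "__main__":
    line = "Perplexity on test set is 21.037 and 7.813."
    print(parse_perplexities(line))
```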

Computing Infrastructure:
- The released code was tested on NVIDIA Tesla V100 32G and NVIDIA PCIe A100 40G GPUs. Note that with batch_size=32, training the BoB model requires at least 20 GB of GPU memory.

MISC

Built upon 🤗 Transformers.

Bibtex:

  @inproceedings{song-etal-2021-bob,
      title = "BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data",
      author = "Haoyu Song and Yan Wang and Kaiyan Zhang and Wei-Nan Zhang and Ting Liu",
      booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL-2021)",
      month = "Aug",
      year = "2021",
      address = "Online",
      publisher = "Association for Computational Linguistics",
  }

Email: [email protected].
