An Easy-to-Use Dialogue Response Selection Toolkit for Researchers

Overview

An easy-to-use toolkit for retrieval-based chatbots.

Recent Activity

  1. Our released RRS corpus can be found here.
  2. Our released BERT-FP post-training checkpoint for the RRS corpus can be found here.
  3. Our related work (Exploring Dense Retrieval for Dialogue Response Selection) can be found here.

How to Use

  1. Init the repo

    Before using the repo, run the following commands to initialize it:

    # create the necessary folders
    python init.py
    
    # prepare the environment
    # if a package fails to install, search for an alternative way to install it
    pip install -r requirements.txt
  2. Train the model

    ./scripts/train.sh <dataset_name> <model_name> <cuda_ids>
  3. Test the model [rerank]

    ./scripts/test_rerank.sh <dataset_name> <model_name> <cuda_id>
  4. Test the model [recall]

    # two recall modes are available: q-q (query-query) and q-r (query-response)
    ./scripts/test_recall.sh <dataset_name> <model_name> <cuda_id>
  5. Run inference on the responses and save them into the faiss index

    Sometimes inference will miss data samples; please use a single GPU (faiss-gpu search on one GPU is fast enough).

    It should be noted that:

    1. For the writer dataset, use the extract_inference.py script to generate inference.txt.
    2. For the other datasets (douban, ecommerce, ubuntu), simply copy train.txt to inference.txt. The dataloader will automatically read test.txt to supply the corpus.

    # work_mode=response: run inference on the responses and save them into the faiss index (for q-r matching) [dual-bert/dual-bert-fusion]
    # work_mode=context: run inference on the contexts for q-q matching
    # work_mode=gray: run inference on the contexts, then read the faiss index (work_mode=response must have been run first) and search for the top-k hard negative samples; remember to set BERTDualInferenceContextDataloader in config/base.yaml
    ./scripts/inference.sh <dataset_name> <model_name> <cuda_ids>
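
    To make the response mode concrete, here is a minimal sketch of what "save into the faiss index" amounts to. It is not the repo's code; the embedding array, the 768 dimensionality, and the file name are assumptions:

    import faiss
    import numpy as np

    # hypothetical dual-bert response embeddings: one 768-d vector per response
    response_embeddings = np.random.rand(10000, 768).astype("float32")

    # inner-product index, matching the dot-product scoring of a dual encoder
    index = faiss.IndexFlatIP(768)
    index.add(response_embeddings)

    # persist the index so later work modes (e.g. gray) can read it back
    faiss.write_index(index, "response.faiss")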

    If you want to generate the gray (hard negative) dataset:

    # 1. set the mode to **response** to build the response faiss index; corresponding dataset name: BERTDualInferenceDataset
    ./scripts/inference.sh <dataset_name> response <cuda_ids>
    
    # 2. set the mode to **gray** to run inference on the contexts in train.txt and search the top-k candidates as gray (hard negative) samples; corresponding dataset name: BERTDualInferenceContextDataset
    ./scripts/inference.sh <dataset_name> gray <cuda_ids>
    
    # 3. set the mode to **gray-one2many** to generate extra positive samples for each context in the train set; this mode has the same requirements as the **gray** mode
    ./scripts/inference.sh <dataset_name> gray-one2many <cuda_ids>
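
    The gray mode is then the search half of the same loop: embed each training context and query the response index for its top-k neighbors. A hedged sketch follows; the arrays are hypothetical, and it assumes response i is the gold response for context i, which the actual dataloader may handle differently:

    import faiss
    import numpy as np

    index = faiss.read_index("response.faiss")

    # hypothetical context embeddings for train.txt, in the same 768-d space
    context_embeddings = np.random.rand(5000, 768).astype("float32")

    topk = 10
    scores, ids = index.search(context_embeddings, topk + 1)

    # drop the gold response if it appears among the neighbors, keeping topk hard negatives
    hard_negatives = [
        [j for j in row if j != gold][:topk]
        for gold, row in enumerate(ids.tolist())
    ]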

    If you want to generate pseudo positive pairs, run the following command:

    # make sure the dual-bert inference dataset name is BERTDualInferenceDataset
    ./scripts/inference.sh <dataset_name> unparallel <cuda_ids>
  6. Deploy the rerank and recall models

    # load the model on cuda:0 (this can be changed in the deploy.sh script)
    ./scripts/deploy.sh <cuda_id>

    At the same time, you can test the deployed model with:

    # test_mode: recall, rerank, pipeline
    ./scripts/test_api.sh <test_mode> <dataset>
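
    If you prefer to call the deployed service directly instead of going through test_api.sh, a request along these lines should be close; the port, route, and JSON schema here are pure assumptions, so check deploy.sh and scripts/test_api.sh for the real ones:

    import requests

    # hypothetical endpoint; the real host/port/route are configured in deploy.sh
    url = "http://localhost:22335/rerank"
    payload = {
        "context": "how are you",
        "candidates": ["fine, thanks", "the weather is nice today"],
    }
    response = requests.post(url, json=payload)
    print(response.json())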
  7. Test the recall performance of Elasticsearch

    Before testing the ES recall, make sure the ES index has been built:

    # recall_mode: q-q/q-r
    ./scripts/build_es_index.sh <dataset_name> <recall_mode>
    # recall_mode: q-q/q-r
    ./scripts/test_es_recall.sh <dataset_name> <recall_mode> 0
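
    Conceptually, q-q recall with Elasticsearch is plain full-text search over the indexed contexts. A rough sketch with the official Python client; the index name and field names are assumptions, and the build script defines the real mapping:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # hypothetical q-q document: each entry stores a context and its response
    es.index(index="douban_q-q", document={"context": "how are you", "response": "fine, thanks"})

    # recall: match the query context against the stored contexts
    results = es.search(index="douban_q-q", query={"match": {"context": "how are you today"}}, size=10)
    for hit in results["hits"]["hits"]:
        print(hit["_score"], hit["_source"]["response"])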
  8. Generate the gray responses with SimCSE

    # train the simcse model
    ./scripts/train.sh <dataset_name> simcse <cuda_ids>
    # generate the faiss index; dataset name: BERTSimCSEInferenceDataset
    ./scripts/inference_response.sh <dataset_name> simcse <cuda_ids>
    # generate the context index
    ./scripts/inference_simcse_response.sh <dataset_name> simcse <cuda_ids>
    # generate the test set for the unlikelyhood-gen dataset
    ./scripts/inference_simcse_unlikelyhood_response.sh <dataset_name> simcse <cuda_ids>
    # generate the gray responses
    ./scripts/inference_gray_simcse.sh <dataset_name> simcse <cuda_ids>
    # generate the test set for the unlikelyhood-gen dataset
    ./scripts/inference_gray_simcse_unlikelyhood.sh <dataset_name> simcse <cuda_ids>
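
    For reference, the gray-response pipeline above reduces to the same embed-and-search loop as in step 5, only with the trained SimCSE encoder. A minimal embedding sketch with HuggingFace transformers; the checkpoint path and [CLS] pooling are assumptions, not the repo's exact code:

    import torch
    from transformers import AutoModel, AutoTokenizer

    # hypothetical path to the SimCSE checkpoint produced by the train step above
    tokenizer = AutoTokenizer.from_pretrained("ckpt/douban/simcse")
    model = AutoModel.from_pretrained("ckpt/douban/simcse")

    def embed(sentences):
        batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            # SimCSE commonly takes the [CLS] vector as the sentence embedding
            return model(**batch).last_hidden_state[:, 0]

    vectors = embed(["how are you", "fine, thanks"])
    print(vectors.shape)  # (num_sentences, hidden_size)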