Code for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021

Last update: Dec 26, 2022


Overview

This repo provides the code for the following papers:

(GAR) "Generation-Augmented Retrieval for Open-Domain Question Answering", ACL 2021.

(RIDER) "Reader-Guided Passage Reranking for Open-Domain Question Answering", Findings of ACL 2021.

GAR augments a question with relevant contexts generated by seq2seq learning. The question is the input, and the target outputs are the answer, the sentence containing the answer, or the title of a passage that contains the answer. With the generated contexts appended to the original question, GAR achieves state-of-the-art OpenQA performance with a simple BM25 retriever.
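The query augmentation described above can be sketched as follows. This is a minimal illustration, not the repo's code: the function name `augment_query`, the space separator, and the choice to append all generated contexts to one query are assumptions made for the example, and the generated strings are hand-written stand-ins for seq2seq outputs.

```python
from typing import Iterable


def augment_query(question: str, generated_contexts: Iterable[str]) -> str:
    """Append generated contexts to the question to form one retrieval query."""
    parts = [question.strip()]
    for context in generated_contexts:
        context = context.strip()
        if context:  # skip empty generations
            parts.append(context)
    # Assumption: a plain space is used as the separator.
    return " ".join(parts)


question = "who wrote the origin of species"
# Hypothetical generator outputs: an answer, a sentence, and a passage title.
generated = [
    "Charles Darwin",
    "On the Origin of Species is a work by Charles Darwin.",
    "On the Origin of Species",
]
query = augment_query(question, generated)
print(query)
```

The resulting string would then be passed to the BM25 retriever in place of the original question.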

RIDER is a simple and effective passage reranker that reranks retrieved passages by reader predictions, without any training. RIDER achieves absolute gains of 10 to 20 in top-1 retrieval accuracy and 1 to 4 in Exact Match (EM), and even outperforms supervised transformer-based rerankers.
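A minimal sketch of the idea of reranking by reader predictions is given below. It is not the code in rider/rider.py: the function names, the normalization, and the substring matching rule are assumptions for illustration. Passages that contain any of the reader's predicted answers are moved to the front, and the original order is kept within each group.

```python
import re
import string
from typing import List

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def rerank_by_predictions(passages: List[str], predictions: List[str]) -> List[str]:
    """Stable partition: passages containing a predicted answer come first."""
    answers = [normalize(p) for p in predictions]
    answers = [a for a in answers if a]  # ignore empty predictions
    hits, misses = [], []
    for passage in passages:
        text = normalize(passage)
        # Assumption: simple substring match on normalized text.
        if any(answer in text for answer in answers):
            hits.append(passage)
        else:
            misses.append(passage)
    return hits + misses


passages = [
    "Sydney is the largest city in Australia.",
    "Canberra is the capital city of Australia.",
    "Australia is a country in Oceania.",
]
reranked = rerank_by_predictions(passages, ["Canberra"])
print(reranked[0])
```

Because the partition is stable, the retriever's ranking still decides the order among passages that match (or do not match) a prediction.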

Code

Generation

The seq2seq codebase is based on the examples from an older release of huggingface/transformers (version==2.11.0).

See train_gen.yml for the package requirements and example commands to run the models.

train_generator.py: training of seq2seq models.

conf.py: configurations for train_generator.py. Some parameters have defaults, but it may be easier to set options such as --data_dir and --output_dir directly on the command line.

test_generator.py: evaluation of seq2seq models (if not already done in train_generator.py).

Retrieval

We use pyserini for BM25 retrieval. Please refer to its documentation for indexing and searching wiki passages (wiki passages can be downloaded here). Alternatively, see pyserini's reproduction of the DPR results, which gives more detailed instructions and incorporates the passage-level span voting used in GAR.

Reranking

Please see the instructions in rider/rider.py.

Reading

We experiment with one extractive reader and one generative reader.

For the extractive reader, we use the one from Dense Passage Retrieval (DPR). Please refer to DPR for more details.

For the generative reader, we reuse the codebase from the generation stage above, with [question; top-retrieved passages] as the source input and one ground-truth answer as the target output. An example script is provided in train_gen.yml.
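The [question; top-retrieved passages] source input might be assembled as in the sketch below. The helper name `build_reader_source`, the `" [SEP] "` separator, and the default `top_k` are hypothetical choices for illustration and are not taken from the repo. Whitespace is collapsed so that each example stays on a single line.

```python
from typing import List


def build_reader_source(
    question: str,
    passages: List[str],
    top_k: int = 3,
    sep: str = " [SEP] ",
) -> str:
    """Concatenate the question with the top-k retrieved passages."""
    text = sep.join([question] + passages[:top_k])
    # Collapse newlines and repeated spaces so the example fits on one line.
    return " ".join(text.split())


source = build_reader_source(
    "who wrote the origin of species",
    ["On the Origin of Species\nis a work by Charles Darwin.", "Darwin was a naturalist."],
    top_k=2,
)
print(source)
```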

Data

Please refer to DPR for downloading the datasets.

For seq2seq learning, use {train/val/test}.source as the input and {train/val/test}.target as the output, where each line is one example.
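As a rough sketch of this layout, the snippet below writes one split to a pair of line-aligned files. The helper `write_split` and the temporary directory are illustrative assumptions; only the `{split}.source` / `{split}.target` naming and the one-example-per-line rule come from the description above.

```python
import os
import tempfile
from typing import List


def write_split(data_dir: str, split: str, sources: List[str], targets: List[str]) -> None:
    """Write line-aligned {split}.source and {split}.target files."""
    if len(sources) != len(targets):
        raise ValueError("sources and targets must have the same length")
    os.makedirs(data_dir, exist_ok=True)
    for suffix, lines in (("source", sources), ("target", targets)):
        path = os.path.join(data_dir, f"{split}.{suffix}")
        with open(path, "w", encoding="utf-8") as f:
            for line in lines:
                # Collapse internal newlines so each example is exactly one line.
                f.write(" ".join(line.split()) + "\n")


data_dir = tempfile.mkdtemp()  # stand-in for the directory passed as --data_dir
write_split(
    data_dir,
    "train",
    sources=["who wrote the origin of species"],
    targets=["Charles Darwin"],
)
print(sorted(os.listdir(data_dir)))
```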

In the same folder, save the list of ground-truth answers as {val/test}.target.json if you want to evaluate EM during training.
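The sketch below shows one way EM could be computed against such a file. The JSON layout (a list with one list of acceptable answers per example) is an assumption, since the exact schema is not stated here, and the SQuAD-style normalization is a common convention rather than a confirmed detail of this repo.

```python
import json
import os
import re
import string
import tempfile
from typing import List

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, answers: List[str]) -> bool:
    """True if the prediction equals any ground-truth answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(answer) for answer in answers)


def em_score(predictions: List[str], answers_path: str) -> float:
    """Percentage of predictions that exactly match a ground-truth answer."""
    with open(answers_path, encoding="utf-8") as f:
        all_answers = json.load(f)  # assumed: one list of answers per example
    if len(predictions) != len(all_answers):
        raise ValueError("number of predictions and answer lists differ")
    matches = sum(exact_match(p, a) for p, a in zip(predictions, all_answers))
    return 100.0 * matches / len(predictions)


# Demo with a temporary val.target.json (hypothetical contents).
answers_path = os.path.join(tempfile.mkdtemp(), "val.target.json")
with open(answers_path, "w", encoding="utf-8") as f:
    json.dump([["Charles Darwin", "Darwin"], ["Canberra"]], f)

score = em_score(["charles darwin.", "Sydney"], answers_path)
print(score)
```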

Cite

Please use the following BibTeX entries to cite our papers.

@article{mao2020generation,
  title={Generation-augmented retrieval for open-domain question answering},
  author={Mao, Yuning and He, Pengcheng and Liu, Xiaodong and Shen, Yelong and Gao, Jianfeng and Han, Jiawei and Chen, Weizhu},
  journal={arXiv preprint arXiv:2009.08553},
  year={2020}
}

@article{mao2021reader,
  title={Reader-Guided Passage Reranking for Open-Domain Question Answering},
  author={Mao, Yuning and He, Pengcheng and Liu, Xiaodong and Shen, Yelong and Gao, Jianfeng and Han, Jiawei and Chen, Weizhu},
  journal={arXiv preprint arXiv:2101.00294},
  year={2021}
}

Owner

morning