Build Text Rerankers with Deep Language Models

Overview

Reranker

Reranker is a lightweight, effective and efficient package for training and deploying deep languge model reranker in information retrieval (IR), question answering (QA) and many other natural language processing (NLP) pipelines. The training procedure follows our ECIR paper Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline using a localized constrastive esimation (LCE) loss.

Reranker speaks Huggingface 🤗 language! This means that you instantly get all state-of-the-art pre-trained models as soon as they are ported to HF transformers. You also get the familiar model and trainer interfaces.

Stae of the Art Performance.

Reranker has two submissions to MS MARCO document leaderboard. Each got 1st place, advancing the SOTA!

Date Submission Name Dev [email protected] Eval [email protected]
2021/01/20 LCE loss + HDCT (ensemble) 0.464 0.405
2020/09/09 HDCT top100 + BERT-base FirstP (single) 0.434 0.382

Features

  • Training rerankers from the state-of-the-art pre-trained language models like BERT, RoBERTa and ELECTRA.
  • The state-of-the-art reranking performance with our LCE loss based training pipeline.
  • GPU memory optimizations: Loss Parallelism and Gradient Cache which allow training of larger model.
  • Faster training
    • Distributed Data Parallel (DDP) for multi GPUs.
    • Automatic Mixed Precision (AMP) training and inference with up to 2x speedup!
  • Break CPU RAM limitation by memory mapping datasets with pyarrow through datasets package interface.
  • Checkpoint interoperability with Hugging Face transformers.

Design Philosophy

The library is designed to be dedicated for text reranking modeling, training and testing. This helps us keep the code concise and focus on a more specific task.

Under the hood, Reranker provides a thin layer of wrapper over Huggingface libraries. Our model wraps PreTrainedModel and our trainer sub-class Huggingface Trainer. You can then work with the familiar interfaces.

Installation and Dependencies

Reranker uses Pytorch, Huggingface Transformers and Datasets. Install with the following commands,

git clone https://github.com/luyug/Reranker.git
cd Reranker
pip install .

Reranker has been tested with torch==1.6.0, transformers==4.2.0, datasets==1.1.3.

For development, install as editable,

pip install -e .

Workflow

Inference (Reranking)

The easiest way to do inference is to use one of our uploaded trained checkpoints with RerankerForInference.

from reranker import RerankerForInference
rk = RerankerForInference.from_pretrained("Luyu/bert-base-mdoc-bm25")  # load checkpoint

inputs = rk.tokenize('weather in new york', 'it is cold today in new york', return_tensors='pt')
score = rk(inputs).logits

Training

For training, you will need a model, a dataset and a trainer. Say we have parsed arguments into model_args, data_args and training_args with reranker.arguments. First, initialize the reranker and tokenizer from one of pre-tained language models from Hugging Face. For example, let's use RoBERTa by loading roberta-base.

from reranker import Reranker 
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = Reranker.from_pretrained(model_args, data_args, training_args, 'roberta-base')

Then create the dataset,

from reranker.data import GroupedTrainDataset
train_dataset = GroupedTrainDataset(
    data_args, data_args.train_path, 
    tokenizer=tokenizer, train_args=training_args
)

Create a trainer and train,

from reranker import RerankerTrainer
trainer = RerankerTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=GroupCollator(tokenizer),
    )
trainer.train()

See full examples in our examples.

Examples

MS MARCO Document Ranking with Reranker

More to come

Large Models

Loss Paralellism

We support computing a query's LCE loss with multiple GPUs with flag --collaborative. Note that a group size (pos + neg) not divisible by number of GPUs may incur undefined behaviours. You will typically want to use it with gradient accumulation steps greater than one.

Detailed instruction ot be added.

Gradient Cache

Experimental We provide subclasses RerankerDC and RerankerDCTrainer. In the MS MARCO example, You can use them with --distance_cahce argument to activate gradient caching with respect to computed unnormalized distance. This allows potentially training with unlimited number of negatives beyond GPU memory limitation up to numerical precision. The method is described in our preprint Scaling Deep Contrastive Learning Batch Size with Almost Constant Peak Memory Usage.

Detailed instruction to be added.

Helpers

We provide a few helpers in the helper directory for data formatting,

Score Formatting

  • score_to_marco.py turns a raw score txt file into MS MARCO format.
  • score_to_tein.py turns a raw score txt file into trec eval format.

For example,

python score_to_tein.py --score_file {path to raw score txt}

This generates a trec eval format file in the same directory as the raw score file.

Data Format

Reranker core utilities (batch training, batch inference) expect processed and tokenized text in token id format. This means pre-processing should be done beforehand, e.g. with BERT tokenizer.

Training Data

Training data is grouped by query into a json file where each line has a query, its corresponding positives and sampled negatives.

{
    "qry": {
        "qid": str,
        "query": List[int],
    },
    "pos": List[
        {
            "pid": str,
            "passage": List[int],
        }
    ],
    "neg": List[
        {
            "pid": str,
            "passage": List[int]
        }
    ]
}

Training data is handled by class reranker.data.GroupedTrainDataset.

Inference (Reranking) Data

Inference data is grouped by query document(passage) pairs. Each line is a json entry to be rereanked (scored).

{
    "qid": str,
    "pid": str,
    "qry": List[int],
    "psg": List[int]
}

To speed up postprocessing, we currently take an additional tsv specifying text ids,

qid0     pid0
qid0     pid1
...

The ordering in the two files are expected to be the same.

Inference data is handled by class reranker.data.PredictionDataset.

Result Scores

Scores are stored in a tsv file with columns corresponding to qid, pid and score.

qid0     pid0     s0
qid0     pid1     s1
...

You can post-process it with our helper scirpt into MS MARCO format or TREC eval format.

Contribution

We welcome contribution to the package, either adding new dataset interface or new models.

Contact

You can reach me by email [email protected]. As a 2nd year master, I get busy days from time to time and may not reply very promptly. Feel free to ping me if you don't get replies.

Citation

If you use Reranker in your research, please consider citing our ECIR paper,

@inproceedings{gao2021lce,
               title={Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline}, 
               author={Luyu Gao and Zhuyun Dai and Jamie Callan},
               year={2021},
               booktitle={The 43rd European Conference On Information Retrieval (ECIR)},
      
}

For the gradient cache utility, consider citing our preprint,

@misc{gao2021scaling,
      title={Scaling Deep Contrastive Learning Batch Size with Almost Constant Peak Memory Usage}, 
      author={Luyu Gao and Yunyi Zhang},
      year={2021},
      eprint={2101.06983},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

Reranker is currently licensed under CC-BY-NC 4.0.

Owner
Luyu Gao
NLP Research [email protected], CMU
Luyu Gao
Let Xiao Ai speakers control third-party devices

A stupid way to extend miot/xiaoai. Demo for Panasonic Bath Bully FV-RB20VL1 逆向 Panasonic Smart China,获得控制浴霸的请求信息(HTTP 请求),详见 apps/panasonic.py; 2. 通过

bin 14 Jul 07, 2022
Creating a Feed of MISP Events from ThreatFox (by abuse.ch)

ThreatFox2Misp Creating a Feed of MISP Events from ThreatFox (by abuse.ch) What will it do? This will fetch IOCs from ThreatFox by Abuse.ch, convert t

17 Nov 22, 2022
Code for ACL 2021 main conference paper "Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances".

Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances This repository contains the code and pre-trained mode

ICTNLP 90 Dec 27, 2022
A curated list of efficient attention modules

awesome-fast-attention A curated list of efficient attention modules

Sepehr Sameni 891 Dec 22, 2022
Course project of [email protected]

NaiveMT Prepare Clone this repository git clone [email protected]:Poeroz/NaiveMT.git

Poeroz 2 Apr 24, 2022
Data loaders and abstractions for text and NLP

torchtext This repository consists of: torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vecto

3.2k Dec 30, 2022
Harvis is designed to automate your C2 Infrastructure.

Harvis Harvis is designed to automate your C2 Infrastructure, currently using Mythic C2. 📌 What is it? Harvis is a python tool to help you create mul

Thiago Mayllart 99 Oct 06, 2022
Chinese NER with albert/electra or other bert descendable model (keras)

Chinese NLP (albert/electra with Keras) Named Entity Recognization Project Structure ./ ├── NER │   ├── __init__.py │   ├── log

2 Nov 20, 2022
中文空间语义理解评测

中文空间语义理解评测 最新消息 2021-04-10 🚩 排行榜发布: Leaderboard 2021-04-05 基线系统发布: SpaCE2021-Baseline 2021-04-05 开放数据提交: 提交结果 2021-04-01 开放报名: 我要报名 2021-04-01 数据集 pa

40 Jan 04, 2023
Simple Python script to scrape youtube channles of "Parity Technologies and Web3 Foundation" and translate them to well-known braille language or any language

Simple Python script to scrape youtube channles of "Parity Technologies and Web3 Foundation" and translate them to well-known braille language or any

Little Endian 1 Apr 28, 2022
A natural language modeling framework based on PyTorch

Overview PyText is a deep-learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapi

Facebook Research 6.4k Dec 27, 2022
Pretty-doc - Composable text objects with python

pretty-doc from __future__ import annotations from dataclasses import dataclass

Taine Zhao 2 Jan 17, 2022
A Fast Sequence Transducer Implementation with PyTorch Bindings

transducer A Fast Sequence Transducer Implementation with PyTorch Bindings. The corresponding publication is Sequence Transduction with Recurrent Neur

Awni Hannun 184 Dec 18, 2022
Natural Language Processing at EDHEC, 2022

Natural Language Processing Here you will find the teaching materials for the "Natural Language Processing" course at EDHEC Business School, 2022 What

1 Feb 04, 2022
Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products

Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products.

Leah Pathan Khan 2 Jan 12, 2022
Generate text line images for training deep learning OCR model (e.g. CRNN)

Generate text line images for training deep learning OCR model (e.g. CRNN)

532 Jan 06, 2023
Creating a python chatbot that Starbucks users can text to place an order + help cut wait time of a normal coffee.

Creating a python chatbot that Starbucks users can text to place an order + help cut wait time of a normal coffee.

2 Jan 20, 2022
Modified GPT using average pooling to reduce the softmax attention memory constraints.

NLP-GPT-Upsampling This repository contains an implementation of Open AI's GPT Model. In particular, this implementation takes inspiration from the Ny

WD 1 Dec 03, 2021
Visual Automata is a Python 3 library built as a wrapper for Caleb Evans' Automata library to add more visualization features.

Visual Automata Copyright 2021 Lewi Lie Uberg Released under the MIT license Visual Automata is a Python 3 library built as a wrapper for Caleb Evans'

Lewi Uberg 55 Nov 17, 2022
NLP: SLU tagging

NLP: SLU tagging

北海若 3 Jan 14, 2022