SIGIR'22 paper: Axiomatically Regularized Pre-training for Ad hoc Search


Introduction

This repository contains the Python implementation (ARES) of our SIGIR 2022 paper, Axiomatically Regularized Pre-training for Ad hoc Search.

Requirements

  • python 3.7
  • torch==1.9.0
  • transformers==4.9.2
  • tqdm, nltk, numpy, boto3
  • trec_eval for evaluation on TREC DL 2019
  • anserini for generating "RANK" axiom scores
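
A minimal setup sketch based on the pinned versions above (trec_eval is built from the NIST sources; Anserini has its own Maven-based build, see its README):

pip install torch==1.9.0 transformers==4.9.2 tqdm nltk numpy boto3

git clone https://github.com/usnistgov/trec_eval.git
cd trec_eval && make && cd ..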

Why this repo?

In this repo, you can pre-train the ARESsimple and TransformerICT models, and fine-tune any pre-trained model that shares BERT's architecture.

You can download the pre-trained ARES checkpoint ARESsimple from Google Drive and extract it.

Pre-training Data

Download data

Download the MS MARCO corpus from the official website.
Download the ADORE+STAR Top100 Candidates files from this repo.
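
For reference, the document-ranking corpus files can be fetched from the official MS MARCO blob store (these paths were current at the time of writing and may move):

wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-doctrain-queries.tsv.gz
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-doctrain-qrels.tsv.gz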

Pre-process data

To save memory, we store most files using the numpy memmap or jsonl format in the ./preprocess directory.

Document files:

  • doc_token_ids.memmap: each line is the token ids for a document
  • docid2idx.json: {docid: memmap_line_id}
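
A sketch of how these two files fit together (the dtype and the fixed per-document length are assumptions here; match them to what the preprocessing step actually wrote):

import json
import numpy as np

max_doc_length = 512                   # assumed fixed token count per line

with open("./preprocess/docid2idx.json") as f:
    docid2idx = json.load(f)           # {docid: memmap_line_id}

doc_token_ids = np.memmap("./preprocess/doc_token_ids.memmap",
                          dtype=np.int32, mode="r",
                          shape=(len(docid2idx), max_doc_length))

# token ids for one document, looked up by its MS MARCO docid
tokens = doc_token_ids[docid2idx["D1555982"]]   # example docid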

Query files:

  • queries.doctrain.jsonl: MS MARCO training queries, one {"id": qid, "ids": token_ids} per line
  • queries.docdev.jsonl: MS MARCO validation queries, one {"id": qid, "ids": token_ids} per line
  • queries.dl2019.jsonl: TREC DL 2019 queries, one {"id": qid, "ids": token_ids} per line

Human label files:

  • msmarco-doctrain-qrels.tsv: each line is qid 0 docid 1, for the training set
  • dev-qrels.txt: each line is qid relevant_docid, for the validation set
  • 2019qrels-docs.txt: each line is qid relevant_docid, for the TREC DL 2019 set

Top 100 candidate files:

  • train.rank.tsv, dev.rank.tsv, test.rank.tsv: qid docid rank for each line
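
Both the human-label and the candidate files are plain whitespace-separated text; a parsing sketch (file names as listed above):

from collections import defaultdict

qrels = {}                                   # qid -> relevant docid
with open("./preprocess/dev-qrels.txt") as f:
    for line in f:
        qid, docid = line.split()[:2]
        qrels[qid] = docid

candidates = defaultdict(list)               # qid -> top-100 docids, rank order
with open("./preprocess/dev.rank.tsv") as f:
    for line in f:
        qid, docid, rank = line.split()
        candidates[qid].append(docid)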

Pseudo queries and axiomatic features:

  • doc2qs.jsonl: {"docid": docid, "queries": [qids]} for each line
  • sample_qs_token_ids.memmap: each line is the token ids for a pseudo query
  • sample_qid2id.json: {qid: memmap_line_id}
  • axiom.memmap: axiom is one of ['rank', 'prox-1', 'prox-2', 'rep-ql', 'rep-tfidf', 'reg', 'stm-1', 'stm-2', 'stm-3']; each line is an axiomatic score for a query
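
A sketch for reading one axiom's scores back, assuming the memmap file is named after the axiom (e.g. rank.memmap) and holds one float score per line; dtype and shape are assumptions, so match them to your preprocessing:

import json
import numpy as np

with open("./preprocess/sample_qid2id.json") as f:
    qid2id = json.load(f)                  # {qid: memmap_line_id}

rank_scores = np.memmap("./preprocess/rank.memmap", dtype=np.float32,
                        mode="r", shape=(len(qid2id),))

some_qid = next(iter(qid2id))              # any pseudo-query id
print(rank_scores[qid2id[some_qid]])       # its 'rank' axiom score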

Quick Start

Note that to accelerate the training process, we adopt the parallel training technique. The scripts for pre-training and fine-tuning are as follows:

Pre-training

export BERT_DIR=/path/to/bert-base/
export XGB_DIR=/path/to/xgboost.model

cd pretrain

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 NCCL_BLOCKING_WAIT=1 \
python -m torch.distributed.launch --nproc_per_node=6 --nnodes=1 train.py \
        --model_type ARES \
        --PRE_TRAINED_MODEL_NAME $BERT_DIR \
        --gpu_num 6 --world_size 6 \
        --MLM --axiom REP RANK REG PROX STM \
        --clf_model $XGB_DIR

Here --model_type can be ARES or ICT.
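
Note that torch.distributed.launch matches the pinned torch==1.9.0; on newer PyTorch releases it is deprecated in favor of torchrun, so adjust the launcher accordingly if you upgrade.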

Zero-shot evaluation (based on AS top100)

export MODEL_DIR=/path/to/ares-simple/
export CKPT_NAME=ares.ckpt

cd finetune

CUDA_VISIBLE_DEVICES=0 python train.py \
        --test \
        --PRE_TRAINED_MODEL_NAME $MODEL_DIR \
        --model_type ARES \
        --model_name ARES_simple \
        --load_ckpt \
        --model_path $CKPT_NAME

You can get:

#####################
<----- MS Dev ----->
MRR @10: 0.2991
MRR @100: 0.3130
QueriesRanked: 5193
#####################

on MS MARCO dev set and:

#############################
<--------- DL 2019 --------->
QueriesRanked: 43
nDCG @10: 0.5955
nDCG @100: 0.4863
#############################

on DL 2019 set.
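
If the test script writes a TREC-format run file, the DL 2019 numbers can also be reproduced with trec_eval directly (dl2019.run below is a hypothetical name; substitute the actual output path):

./trec_eval -m ndcg_cut.10,100 ./preprocess/2019qrels-docs.txt dl2019.run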

Fine-tuning

export MODEL_DIR=/path/to/ares-simple/

cd finetune

CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_BLOCKING_WAIT=1 \
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 train.py \
        --model_type ARES \
        --distributed_train \
        --PRE_TRAINED_MODEL_NAME $MODEL_DIR \
        --gpu_num 4 --world_size 4 \
        --model_name ARES_simple
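
After fine-tuning you can sanity-check dev-set numbers yourself. A minimal MRR sketch, reusing the qrels and candidates dictionaries from the parsing sketch above and treating the (re-ranked) candidate lists as the system ranking:

def mrr_at_k(ranking, qrels, k):
    """ranking: qid -> docids in rank order; qrels: qid -> relevant docid."""
    total = 0.0
    for qid, docids in ranking.items():
        for i, docid in enumerate(docids[:k]):
            if qrels.get(qid) == docid:
                total += 1.0 / (i + 1)   # reciprocal rank of the first hit
                break
    return total / len(ranking)

print("MRR @10:", mrr_at_k(candidates, qrels, 10))
print("MRR @100:", mrr_at_k(candidates, qrels, 100))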

Visualization

export MODEL_DIR=/path/to/ares-simple/
export SAVE_DIR=/path/to/output/
export CKPT_NAME=ares.ckpt

cd visualization

CUDA_VISIBLE_DEVICES=0 python visual.py \
    --PRE_TRAINED_MODEL_NAME $MODEL_DIR \
    --model_name ARES_simple \
    --visual_q_num 1 \
    --visual_d_num 5 \
    --save_path $SAVE_DIR \
    --model_path $CKPT_NAME

Results

Zero-shot performance:

Model Name   MS MARCO MRR@10   MS MARCO MRR@100   DL nDCG@10   DL nDCG@100   COVID    EQ
BM25         0.2962            0.3107             0.5776       0.4795        0.4857   0.6690
BERT         0.1820            0.2012             0.4059       0.4198        0.4314   0.6055
PROPwiki     0.2429            0.2596             0.5088       0.4525        0.4857   0.5991
PROPmarco    0.2763            0.2914             0.5317       0.4623        0.4829   0.6454
ARESstrict   0.2630            0.2785             0.4942       0.4504        0.4786   0.6923
AREShard     0.2627            0.2780             0.5189       0.4613        0.4943   0.6822
ARESsimple   0.2991            0.3130             0.5955       0.4863        0.4957   0.6916

Few-shot performance: (figure omitted)

Visualization (attribution values have been normalized within a document): (figure omitted)

Citation

If you find our work useful, please star this repo and cite our work:

@inproceedings{chen2022axiomatically,
  title={Axiomatically Regularized Pre-training for Ad hoc Search},
  author={Chen, Jia and Liu, Yiqun and Fang, Yan and Mao, Jiaxin and Fang, Hui and Yang, Shenghao and Xie, Xiaohui and Zhang, Min and Ma, Shaoping},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}

Notice

  • Please make sure that all the pre-trained model parameters have been loaded correctly, or the zero-shot and the fine-tuning performance will be greatly impacted.
  • We welcome anyone who would like to contribute to this repo. 🤗
  • If you have any other questions, please feel free to contact me via [email protected] or open an issue.
  • Code for data preprocessing will come soon. Please stay tuned~