Code and data of the ACL 2021 paper: Few-Shot Text Ranking with Meta Adapted Synthetic Weak Supervision

Overview

MetaAdaptRank

This repository provides the implementation of meta-learning to reweight synthetic weak supervision data described in the paper Few-Shot Text Ranking with Meta Adapted Synthetic Weak Supervision.

CONTACT

For any question, please contact Si Sun by email [email protected] (respond to emails more quickly), we will try our best to solve :)

QUICKSTART

python 3.7
Pytorch 1.5.0

0/ Data Preparation

First download and prepare the following data into the data folder:

1 Contrastive Supervision Synthesis

1.1 Source-domain NLG training

  • We train two query generators (QG & ContrastQG) with the MS MARCO dataset using train_nlg.sh in the run_shells folder:

    bash prepro_nlg_dataset.sh
    
  • Optional arguments:

    --generator_mode            choices=['qg', 'contrastqg']
    --pretrain_generator_type   choices=['t5-small', 't5-base']
    --train_file                The path to the source-domain nlg training dataset
    --save_dir                  The path to save the checkpoints data; default: ../results
    

1.2 Target-domain NLG inference

  • The whole nlg inference pipline contains five steps:

    • 1.2.1/ Data preprocess
    • 1.2.2/ Seed query generation
    • 1.2.3/ BM25 subset retrieval
    • 1.2.4/ Contrastive doc pairs sampling
    • 1.2.5/ Contrastive query generation
  • 1.2.1/ Data preprocess. convert target-domain documents into the nlg format using prepro_nlg_dataset.sh in the preprocess folder:

    bash prepro_nlg_dataset.sh
    
  • Optional arguments:

    --dataset_name          choices=['clueweb09', 'robust04', 'trec-covid']
    --input_path            The path to the target dataset
    --output_path           The path to save the preprocess data; default: ../data/prepro_target_data
    
  • 1.2.2/ Seed query generation. utilize the trained QG model to generate seed queries for each target documents using nlg_inference.sh in the run_shells folder:

    bash nlg_inference.sh
    
  • Optional arguments:

    --generator_mode            choices='qg'
    --pretrain_generator_type   choices=['t5-small', 't5-base']
    --target_dataset_name       choices=['clueweb09', 'robust04', 'trec-covid']
    --generator_load_dir        The path to the pretrained QG checkpoints.
    
  • 1.2.3/ BM25 subset retrieval. utilize BM25 to retrieve document subset according to the seed queries using do_subset_retrieve.sh in the bm25_retriever folder:

    bash do_subset_retrieve.sh
    
  • Optional arguments:

    --dataset_name          choices=['clueweb09', 'robust04', 'trec-covid']
    --generator_folder      choices=['t5-small', 't5-base']
    
  • 1.2.4/ Contrastive doc pairs sampling. pairwise sample contrastive doc pairs from the BM25 retrieved subset using sample_contrast_pairs.sh in the preprocess folder:

    bash sample_contrast_pairs.sh
    
  • Optional arguments:

    --dataset_name          choices=['clueweb09', 'robust04', 'trec-covid']
    --generator_folder      choices=['t5-small', 't5-base']
    
  • 1.2.5/ Contrastive query generation. utilize the trained ContrastQG model to generate new queries based on contrastive document pairs using nlg_inference.sh in the run_shells folder:

    bash nlg_inference.sh
    
  • Optional arguments:

    --generator_mode            choices='contrastqg'
    --pretrain_generator_type   choices=['t5-small', 't5-base']
    --target_dataset_name       choices=['clueweb09', 'robust04', 'trec-covid']
    --generator_load_dir        The path to the pretrained ContrastQG checkpoints.
    

2 Meta Learning to Reweight

2.1 Data Preprocess

  • Prepare the contrastive synthetic supervision data (CTSyncSup) into the data/synthetic_data folder.

    • CTSyncSup_clueweb09
    • CTSyncSup_robust04
    • CTSyncSup_trec-covid

    >> example data format

  • Preprocess the target-domain datasets into the 5-fold cross-validation format using run_cv_preprocess.sh in the preprocess folder:

    bash run_cv_preprocess.sh
    
  • Optional arguments:

    --dataset_class         choices=['clueweb09', 'robust04', 'trec-covid']
    --input_path            The path to the target dataset
    --output_path           The path to save the preprocess data; default: ../data/prepro_target_data
    

2.2 Train and Test Models

  • The whole process of training and testing MetaAdaptRank contains three steps:

    • 2.2.1/ Meta-pretraining. The model is trained on synthetic weak supervision data, where the synthetic data are reweighted using meta-learning. The training fold of the target dataset is considered as target data that guides meta-reweighting.

    • 2.2.2/ Fine-tuning. The meta-pretrained model is continuously fine-tuned on the training folds of the target dataset.

    • 2.2.3/ Ensemble and Coor-Ascent. Coordinate Ascent is used to combine the last representation layers of all fine-tuned models, as LeToR features, with the retrieval scores from the base retriever.

  • 2.2.1/ Meta-pretraining using train_meta_bert.sh in the run_shells folder:

    bash train_meta_bert.sh
    

    Optional arguments for meta-pretraining:

    --cv_number             choices=[0, 1, 2, 3, 4]
    --pretrain_model_type   choices=['bert-base-cased', 'BiomedNLP-PubMedBERT-base-uncased-abstract']
    --train_dir             The path to the synthetic weak supervision data
    --target_dir            The path to the target dataset
    --save_dir              The path to save the output files and checkpoints; default: ../results
    

    Complete optional arguments can be seen in config.py in the scripts folder.

  • 2.2.2/ Fine-tuning using train_metafine_bert.sh in the run_shells folder:

    bash train_metafine_bert.sh
    

    Optional arguments for fine-tuning:

    --cv_number             choices=[0, 1, 2, 3, 4]
    --pretrain_model_type   choices=['bert-base-cased', 'BiomedNLP-PubMedBERT-base-uncased-abstract']
    --train_dir             The path to the target dataset
    --checkpoint_folder     The path to the checkpoint of the meta-pretrained model
    --save_dir              The path to save output files and checkpoint; default: ../results
    
  • 2.2.3/ Testing the fine-tuned model to collect LeToR features through test.sh in the run_shells folder:

    bash test.sh
    

    Optional arguments for testing:

    --cv_number             choices=[0, 1, 2, 3, 4]
    --pretrain_model_type   choices=['bert-base-cased', 'BiomedNLP-PubMedBERT-base-uncased-abstract']
    --target_dir            The path to the target evaluation dataset
    --checkpoint_folder     The path to the checkpoint of the fine-tuned model
    --save_dir              The path to save output files and the **features** file; default: ../results
    
  • 2.2.4/ Ensemble. Train and test five models for each fold of the target dataset (5-fold cross-validation), and then ensemble and convert their output features to coor-ascent format using combine_features.sh in the ensemble folder:

    bash combine_features.sh
    

    Optional arguments for ensemble:

    --qrel_path             The path to the qrels of the target dataset
    --result_fold_1         The path to the testing result folder of the first fold model
    --result_fold_2         The path to the testing result folder of the second fold model
    --result_fold_3         The path to the testing result folder of the third fold model
    --result_fold_4         The path to the testing result folder of the fourth fold model
    --result_fold_5         The path to the testing result folder of the fifth fold model
    --save_dir              The path to save the ensembled `features.txt` file; default: ../combined_features
    
  • 2.2.5/ Coor-Ascent. Run coordinate ascent using run_ranklib.sh in the ensemble folder:

    bash run_ranklib.sh
    

    Optional arguments for coor-ascent:

    --qrel_path             The path to the qrels of the target dataset
    --ranklib_path          The path to the ensembled features.
    

    The final evaluation results will be output in the ranklib_path.

Results

All TREC files listed in this paper can be found in Tsinghua Cloud.

Owner
THUNLP
Natural Language Processing Lab at Tsinghua University
THUNLP
Elevation Mapping on GPU.

Elevation Mapping cupy Overview This is a ros package of elevation mapping on GPU. Code are written in python and uses cupy for GPU calculation. * pla

Robotic Systems Lab - Legged Robotics at ETH Zürich 183 Dec 19, 2022
TensorFlow port of PyTorch Image Models (timm) - image models with pretrained weights.

TensorFlow-Image-Models Introduction Usage Models Profiling License Introduction TensorfFlow-Image-Models (tfimm) is a collection of image models with

Martins Bruveris 227 Dec 20, 2022
Leibniz is a python package which provide facilities to express learnable partial differential equations with PyTorch

Leibniz is a python package which provide facilities to express learnable partial differential equations with PyTorch

Beijing ColorfulClouds Technology Co.,Ltd. 16 Aug 07, 2022
(ICCV 2021) ProHMR - Probabilistic Modeling for Human Mesh Recovery

ProHMR - Probabilistic Modeling for Human Mesh Recovery Code repository for the paper: Probabilistic Modeling for Human Mesh Recovery Nikos Kolotouros

Nikos Kolotouros 209 Dec 13, 2022
Unofficial PyTorch Implementation of "Augmenting Convolutional networks with attention-based aggregation"

Pytorch Implementation of Augmenting Convolutional networks with attention-based aggregation This is the unofficial PyTorch Implementation of "Augment

DK 20 Sep 09, 2022
Python suite to construct benchmark machine learning datasets from the MIMIC-III clinical database.

MIMIC-III Benchmarks Python suite to construct benchmark machine learning datasets from the MIMIC-III clinical database. Currently, the benchmark data

Chengxi Zang 6 Jan 02, 2023
Position detection system of mobile robot in the warehouse enviroment

Autonomous-Forklift-System About | GUI | Tests | Starting | License | Author | 🎯 About An application that run the autonomous forklift paletization a

Kamil Goś 1 Nov 24, 2021
PyTorch implementation for SDEdit: Image Synthesis and Editing with Stochastic Differential Equations

SDEdit: Image Synthesis and Editing with Stochastic Differential Equations Project | Paper | Colab PyTorch implementation of SDEdit: Image Synthesis a

536 Jan 05, 2023
Curated list of awesome GAN applications and demo

gans-awesome-applications Curated list of awesome GAN applications and demonstrations. Note: General GAN papers targeting simple image generation such

Minchul Shin 4.5k Jan 07, 2023
BraTs-VNet - BraTS(Brain Tumour Segmentation) using V-Net

BraTS(Brain Tumour Segmentation) using V-Net This project is an approach to dete

Rituraj Dutta 7 Nov 27, 2022
Code for "Neural 3D Scene Reconstruction with the Manhattan-world Assumption" CVPR 2022 Oral

News 05/10/2022 To make the comparison on ScanNet easier, we provide all quantitative and qualitative results of baselines here, including COLMAP, COL

ZJU3DV 365 Dec 30, 2022
Neural Articulated Radiance Field

Neural Articulated Radiance Field NARF Neural Articulated Radiance Field Atsuhiro Noguchi, Xiao Sun, Stephen Lin, Tatsuya Harada ICCV 2021 [Paper] [Co

Atsuhiro Noguchi 144 Jan 03, 2023
Transparent Transformer Segmentation

Transparent Transformer Segmentation Introduction This repository contains the data and code for IJCAI 2021 paper Segmenting transparent object in the

谢恩泽 140 Jan 02, 2023
Metadata-Extractor - Metadata Extractor Script can be used to read in exif metadata

Metadata Extractor The exifextract script can be used to read in exif metadata f

1 Feb 16, 2022
Introduction to AI assignment 1 HCM University of Technology, term 211

Sokoban Bot Introduction to AI assignment 1 HCM University of Technology, term 211 Abstract This is basically a solver for Sokoban game using Breadth-

Quang Minh 4 Dec 12, 2022
TensorFlow Metal Backend on Apple Silicon Experiments (just for fun)

tf-metal-experiments TensorFlow Metal Backend on Apple Silicon Experiments (just for fun) Setup This is tested on M1 series Apple Silicon SOC only. Te

Timothy Liu 161 Jan 03, 2023
Explaining neural decisions contrastively to alternative decisions.

Contrastive Explanations for Model Interpretability This is the repository for the paper "Contrastive Explanations for Model Interpretability", about

AI2 16 Oct 16, 2022
This repository contains code for the paper "Decoupling Representation and Classifier for Long-Tailed Recognition", published at ICLR 2020

Classifier-Balancing This repository contains code for the paper: Decoupling Representation and Classifier for Long-Tailed Recognition Bingyi Kang, Sa

Facebook Research 820 Dec 26, 2022
Prml - Repository of notes, code and notebooks in Python for the book Pattern Recognition and Machine Learning by Christopher Bishop

Pattern Recognition and Machine Learning (PRML) This project contains Jupyter notebooks of many the algorithms presented in Christopher Bishop's Patte

Gerardo Durán-Martín 1k Jan 07, 2023
This is a beginner-friendly repo to make a collection of some unique and awesome projects. Everyone in the community can benefit & get inspired by the amazing projects present over here.

Awesome-Projects-Collection Quality over Quantity :) What to do? Add some unique and amazing projects as per your favourite tech stack for the communi

Rohan Sharma 178 Jan 01, 2023