Establishing Strong Baselines for TripClick Health Retrieval; ECIR 2022

Overview

TripClick Baselines with Improved Training Data

Welcome 🙌 to the hub-repo of our paper:

Establishing Strong Baselines for TripClick Health Retrieval Sebastian Hofstätter, Sophia Althammer, Mete Sertkan and Allan Hanbury

https://arxiv.org/abs/2201.00365

tl;dr We create strong re-ranking and dense retrieval baselines (BERTCAT, BERTDOT, ColBERT, and TK) for TripClick (health ad-hoc retrieval). We improve the – originally too noisy – training data with a simple negative sampling policy. We achieve large gains over BM25 in the re-ranking and retrieval setting on TripClick, which were not achieved with the original baselines. We publish the improved training files for everyone to use.

If you have any questions, suggestions, or want to collaborate please don't hesitate to get in contact with us via Twitter or mail to [email protected]

Please cite our work as:

@misc{hofstaetter2022tripclick,
      title={Establishing Strong Baselines for TripClick Health Retrieval}, 
      author={Sebastian Hofst{\"a}tter and Sophia Althammer and Mete Sertkan and Allan Hanbury},
      year={2022},
      eprint={2201.00365},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

Training Files

We publish the improved training files without the text content instead using the ids from TripClick (with permission from the TripClick owners); for the text content please get the full TripClick dataset from the TripClick Github page.

Our training files have the format query_id pos_passage_id neg_passage_id (with tab separation) and are available as a HuggingFace dataset: https://huggingface.co/datasets/sebastian-hofstaetter/tripclick-training

Source Code

The full source-code for our paper is here, as part of our matchmaker library: https://github.com/sebastian-hofstaetter/matchmaker

We provide getting started guides for training re-ranking and retrieval models, as well as a range of evaluation setups.

Pre-Trained Models

Unfortunately, the license of TripClick does not allow us to publish the trained models.

TripClick Baselines Results

For more information and commentary on the results, please see our ECIR paper.

BM25 Top200 Re-Ranking

Model BERT Instance HEAD TORSO TAIL
nDCG MRR nDCG MRR nDCG MRR
Original Baselines
BM25 -- .140 .276 .206 .283 .267 .258
ConvKNRM -- .198 .420 .243 .347 .271 .265
TK -- .208 .434 .272 .381 .295 .280
Our Improved Baselines
TK -- .232 .472 .300 .390 .345 .319
ColBERT SciBERT .270 .556 .326 .426 .374 .347
PubMedBERT-Abstract .278 .557 .340 .431 .387 .361
BERT_CAT DistilBERT .272 .556 .333 .427 .381 .355
BERT-Base .287 .579 .349 .453 .396 .366
SciBERT .294 .595 .360 .459 .408 .377
PubMedBERT-Full .298 .582 .365 .462 .412 .381
PubMedBERT-Abstract .296 .587 .359 .456 .409 .380
Ensemble (Last 3 BERT_CAT) .303 .601 .370 .472 .420 .392

Dense Retrieval Results

Model BERT Instance Head(DCTR)
[email protected] [email protected] [email protected] [email protected] [email protected] [email protected]
Original Baselines
BM25 -- 31% .140 .276 .499 .621 .834
Our Improved Baselines
BERT_DOT DistilBERT 39% .236 .512 .550 .648 .813
SciBERT 41% .243 .530 .562 .640 .793
PubMedBERT 40% .235 .509 .582 .673 .828
Owner
Sebastian Hofstätter
PhD student; working on machine learning and information retrieval
Sebastian Hofstätter
Towards Representation Learning for Atmospheric Dynamics (AtmoDist)

Towards Representation Learning for Atmospheric Dynamics (AtmoDist) The prediction of future climate scenarios under anthropogenic forcing is critical

Sebastian Hoffmann 4 Dec 15, 2022
Repositório para arquivos sobre o Módulo 1 do curso Top Coders da Let's Code + Safra

850-Safra-DS-ModuloI Repositório para arquivos sobre o Módulo 1 do curso Top Coders da Let's Code + Safra Para aprender mais Git https://learngitbranc

Brian Nunes 7 Dec 10, 2022
Official code for the publication "HyFactor: Hydrogen-count labelled graph-based defactorization Autoencoder".

HyFactor Graph-based architectures are becoming increasingly popular as a tool for structure generation. Here, we introduce a novel open-source archit

Laboratoire-de-Chemoinformatique 11 Oct 10, 2022
A quantum game modeling of pandemic (QHack 2022)

Contributors: @JongheumJung, @YoonjaeChung, @GyunghunKim Abstract In the regime of a global pandemic, leaders around the world need to consider variou

Yoonjae Chung 8 Apr 03, 2022
Course about deep learning for computer vision and graphics co-developed by YSDA and Skoltech.

Deep Vision and Graphics This repo supplements course "Deep Vision and Graphics" taught at YSDA @fall'21. The course is the successor of "Deep Learnin

Yandex School of Data Analysis 160 Jan 02, 2023
This is a Pytorch implementation of paper: DropEdge: Towards Deep Graph Convolutional Networks on Node Classification

DropEdge: Towards Deep Graph Convolutional Networks on Node Classification This is a Pytorch implementation of paper: DropEdge: Towards Deep Graph Con

401 Dec 16, 2022
Direct LiDAR Odometry: Fast Localization with Dense Point Clouds

Direct LiDAR Odometry: Fast Localization with Dense Point Clouds DLO is a lightweight and computationally-efficient frontend LiDAR odometry solution w

VECTR at UCLA 369 Dec 30, 2022
Unofficial Implementation of Oboe (SIGCOMM'18').

Oboe-Reproduce This is the unofficial implementation of the paper "Oboe: Auto-tuning video ABR algorithms to network conditions, Zahaib Akhtar, Yun Se

Tianchi Huang 13 Nov 04, 2022
Repository for benchmarking graph neural networks

Benchmarking Graph Neural Networks Updates Nov 2, 2020 Project based on DGL 0.4.2. See the relevant dependencies defined in the environment yml files

NTU Graph Deep Learning Lab 2k Jan 03, 2023
Ontologysim: a Owlready2 library for applied production simulation

Ontologysim: a Owlready2 library for applied production simulation Ontologysim is an open-source deep production simulation framework, with an emphasi

10 Nov 30, 2022
CAUSE: Causality from AttribUtions on Sequence of Events

CAUSE: Causality from AttribUtions on Sequence of Events

Wei Zhang 21 Dec 01, 2022
Code for "On Memorization in Probabilistic Deep Generative Models"

On Memorization in Probabilistic Deep Generative Models This repository contains the code necessary to reproduce the experiments in On Memorization in

The Alan Turing Institute 3 Jun 09, 2022
Lacmus is a cross-platform application that helps to find people who are lost in the forest using computer vision and neural networks.

lacmus The program for searching through photos from the air of lost people in the forest using Retina Net neural nwtwork. The project is being develo

Lacmus Foundation 168 Dec 27, 2022
a minimal terminal with python 😎😉

Meterm a terminal with python 😎 How to use Clone Project: $ git clone https://github.com/motahharm/meterm.git Run: in Terminal: meterm.exe Or pip ins

Motahhar.Mokfi 5 Jan 28, 2022
Implementation of the paper "Fine-Tuning Transformers: Vocabulary Transfer"

Transformer-vocabulary-transfer Implementation of the paper "Fine-Tuning Transfo

LEYA 13 Nov 30, 2022
Delving into Localization Errors for Monocular 3D Object Detection, CVPR'2021

Delving into Localization Errors for Monocular 3D Detection By Xinzhu Ma, Yinmin Zhang, Dan Xu, Dongzhan Zhou, Shuai Yi, Haojie Li, Wanli Ouyang. Intr

XINZHU.MA 124 Jan 04, 2023
Official Repository for the ICCV 2021 paper "PixelSynth: Generating a 3D-Consistent Experience from a Single Image"

PixelSynth: Generating a 3D-Consistent Experience from a Single Image (ICCV 2021) Chris Rockwell, David F. Fouhey, and Justin Johnson [Project Website

Chris Rockwell 95 Nov 22, 2022
PyTorch implementation of PP-LCNet

PP-LCNet-Pytorch Pre-Trained Models Google Drive p018 Accuracy Models Top1 Top5 PPLCNet_x0_25 0.5186 0.7565 PPLCNet_x0_35 0.5809 0.8083 PPLCNet_x0_5 0

24 Dec 12, 2022
TransGAN: Two Transformers Can Make One Strong GAN

[Preprint] "TransGAN: Two Transformers Can Make One Strong GAN", Yifan Jiang, Shiyu Chang, Zhangyang Wang

VITA 1.5k Jan 07, 2023
DGCNN - Dynamic Graph CNN for Learning on Point Clouds

DGCNN is the author's re-implementation of Dynamic Graph CNN, which achieves state-of-the-art performance on point-cloud-related high-level tasks including category classification, semantic segmentat

Wang, Yue 1.3k Dec 26, 2022