A multilingual version of MS MARCO passage ranking dataset

Last update: Dec 27, 2022

mMARCO

This repository presents a neural machine translation-based method for translating the MS MARCO passage ranking dataset. The code available here is the same code used in our paper, mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset.

Translated Datasets

As described in our work, we made available 8 translated versions of the MS MARCO passage ranking dataset. The translated passage collections and the query sets (training and validation) are available at:

Released Model Checkpoints

Our available fine-tuned models are:

Model	Description	MRR@10*
ptT5-base-pt-msmarco	a PTT5 model fine-tuned on Portuguese MS MARCO	0.188
ptT5-base-en-pt-msmarco	a PTT5 model fine-tuned on both English and Portuguese MS MARCO	0.343
mT5-base-en-pt-msmarco	an mT5 model fine-tuned on both English and Portuguese MS MARCO	0.375
mT5-base-multi-msmarco	an mT5 model fine-tuned on mMARCO	0.366
mMiniLM-pt-msmarco	an mMiniLM model fine-tuned on Portuguese MS MARCO	-
mMiniLM-en-pt-msmarco	an mMiniLM model fine-tuned on both English and Portuguese MS MARCO	0.375
mMiniLM-multi-msmarco	an mMiniLM model fine-tuned on mMARCO	0.363

* MRR@10 on English MS MARCO
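
For reference, MRR@10 (mean reciprocal rank cut off at rank 10), the usual MS MARCO passage ranking metric, can be sketched as below. This is a minimal illustration and not the official evaluation script; the function name `mrr_at_k` and the toy query and passage IDs are assumptions made for the example.

```python
from typing import Dict, List, Set


def mrr_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for query_id, ranked_passages in rankings.items():
        gold = relevant.get(query_id, set())
        for rank, passage_id in enumerate(ranked_passages[:k], start=1):
            if passage_id in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    # Averaged over the queries that have a ranking (an assumption of this sketch).
    return total / len(rankings) if rankings else 0.0


# Toy data (hypothetical IDs, not taken from MS MARCO).
rankings = {"q1": ["p3", "p7", "p1"], "q2": ["p9", "p2"]}
relevant = {"q1": {"p7"}, "q2": {"p5"}}
score = mrr_at_k(rankings, relevant)
print(score)
```

In the toy data, the first query finds its relevant passage at rank 2 and the second finds none, so the two reciprocal ranks (0.5 and 0) are averaged.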

Dataset

We translate the MS MARCO passage ranking dataset, a large-scale IR dataset comprising more than half a million anonymized questions sampled from Bing's search query logs.

Translation Model

To translate the MS MARCO dataset, we use MarianNMT, an open-source neural machine translation framework originally written in C++ for fast training and translation. The Language Technology Research Group at the University of Helsinki has made available models for more than a thousand language pairs, all supported by the HuggingFace framework.
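
As a rough sketch of how one of these checkpoints can be loaded through the HuggingFace `transformers` library: the snippet below uses `MarianMTModel` and `MarianTokenizer`, and builds the checkpoint name from the `Helsinki-NLP/opus-mt-{src}-{tgt}` pattern. The helper names `opus_mt_model_name` and `translate_batch` are assumptions for illustration and are not part of this repository; not every language pair has a published checkpoint.

```python
from typing import List


def opus_mt_model_name(src: str, tgt: str) -> str:
    """Build a Helsinki-NLP checkpoint name from two language codes."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def translate_batch(texts: List[str], src: str = "en", tgt: str = "de") -> List[str]:
    """Translate a small batch of sentences with a MarianMT checkpoint."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    name = opus_mt_model_name(src, tgt)
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    # Pad to the longest sentence and truncate overly long inputs.
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


model_name = opus_mt_model_name("en", "de")
print(model_name)
```

Calling `translate_batch(["What is the capital of France?"])` downloads the checkpoint on first use, so it needs network access and PyTorch installed.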

How To Translate

To allow other users to translate the MS MARCO passage ranking dataset into other languages (or to translate a dataset of their own), we provide the translate.py script. This script expects a .tsv file in which each line follows the format document_id \t document_text.

python translate.py --model_name_or_path Helsinki-NLP/opus-mt-{src}-{tgt} --target_language tgt_code --input_file collection.tsv --output_dir translated_data/
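
If your own data is not yet in this layout, a minimal sketch for producing and reading back a `document_id \t document_text` file might look like the following. The helper names `write_collection` and `read_collection` are hypothetical, and collapsing internal whitespace is an assumption made here so that each document stays on one line.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple


def write_collection(rows: Iterable[Tuple[str, str]], handle: TextIO) -> None:
    """Write one 'document_id<TAB>document_text' line per document."""
    for doc_id, text in rows:
        # Tabs and newlines inside the text would break the one-line layout.
        clean = " ".join(text.split())
        handle.write(f"{doc_id}\t{clean}\n")


def read_collection(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (document_id, document_text) pairs from a TSV handle."""
    for line in handle:
        doc_id, text = line.rstrip("\n").split("\t", 1)
        yield doc_id, text


# In-memory round trip; swap io.StringIO for open("collection.tsv", ...) on disk.
buffer = io.StringIO()
write_collection([("0", "The first\tpassage."), ("1", "A second\npassage.")], buffer)
buffer.seek(0)
rows = list(read_collection(buffer))
print(rows)
```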

After translating, it is necessary to reassemble the file, as the documents were split into sentences.

python create_translated_collection.py --input_file translated_data/translated_file --output_file translated_{tgt}_collection
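
The idea behind the split-and-reassemble step can be illustrated with the sketch below. It is not the code of create_translated_collection.py: the intermediate format (one `(document_id, sentence)` pair per sentence, kept in order), the naive regex splitter, and the function names are all assumptions, and uppercasing stands in for the actual translation.

```python
import re
from typing import Dict, Iterable, List, Tuple


def split_into_sentences(doc_id: str, text: str) -> List[Tuple[str, str]]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [(doc_id, sentence) for sentence in sentences if sentence]


def reassemble(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Join sentences that share a document id, preserving their order."""
    grouped: Dict[str, List[str]] = {}
    for doc_id, sentence in pairs:
        grouped.setdefault(doc_id, []).append(sentence)
    return {doc_id: " ".join(parts) for doc_id, parts in grouped.items()}


pairs = split_into_sentences("7", "First sentence. Second one! Third?")
pairs += split_into_sentences("8", "Only one sentence here.")
# Stand-in for translation: each sentence is processed independently.
translated = [(doc_id, sentence.upper()) for doc_id, sentence in pairs]
collection = reassemble(translated)
print(collection)
```

Keeping the document id attached to every sentence is what makes it possible to rebuild each passage after the sentences have been translated separately.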

Translating the entire passages collection of MS MARCO took about 80 hours using a Tesla V100.

How to Cite

If you extend or use this work, please cite the paper where it was introduced:

@misc{bonifacio2021mmarco,
      title={mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset}, 
      author={Luiz Henrique Bonifacio and Israel Campiotti and Roberto Lotufo and Rodrigo Nogueira},
      year={2021},
      eprint={2108.13897},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
