A multilingual version of MS MARCO passage ranking dataset

Related tags

Deep LearningmMARCO
Overview

mMARCO

A multilingual version of MS MARCO passage ranking dataset

This repository presents a neural machine translation-based method for translating the MS MARCO passage ranking dataset. The code available here is the same used in our paper mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset.

Translated Datasets

As described in our work, we made available 8 translated versions of MS MARCO passage ranking dataset. The translated passages collection and the queries set (training and validation) are available at:

Released Model Checkpoints

Our available fine-tuned models are:

Model Description [email protected]*
ptT5-base-pt-msmarco a PTT5 model fine-tuned on Portuguese MS MARCO 0.188
ptT5-base-en-pt-msmarco a PTT5 model fine-tuned on English and Portuguese MS MARCO 0.343
mT5-base-en-pt-msmarco a mT5 model fine-tuned on both English and Portuguese MS MARCO 0.375
mT5-base-multi-msmarco a mT5 model fine-tuned on mMARCO 0.366
mMiniLM-pt-msmarco a mMiniLM model fine-tuned on Portuguese MS MARCO -
mMiniLM-en-pt-msmarco a mMiniLM model fine-tuned on both English and Portuguese MS MARCO 0.375
mMiniLM-multi-msmarco a mMiniLM model fine-tuned on mMARCO 0.363

* [email protected] on English MS MARCO

Dataset

We translate MS MARCO passage ranking dataset, a large-scale IR dataset comprising more than half million anonymized questions that were sampled from Bing's search query logs.

Translation Model

To translate the MS MARCO dataset, we use MarianNMT an open-source neural machine translation framework originally written in C++ for fast training and translation. The Language Technology Research Group at the University of Helsinki made available more than a thousand language pairs for translation, supported by HuggingFace framework.

How To Translate

In order to allow other users to translate the MS MARCO passage ranking dataset to other languages (or a dataset of your own will), we provide the translate.py script. This script expects a .tsv file, in which each line follows a document_id \t document_text format.

python translate.py --model_name_or_path Helsinki-NLP/opus-mt-{src}-{tgt} --target_language tgt_code--input_file collection.tsv --output_dir translated_data/

After translating, it is necessary to reassemble the file, as the documents were split into sentences.

python create_translated_collection.py --input_file translated_data/translated_file --output_file translated_{tgt}_collection

Translating the entire passages collection of MS MARCO took about 80 hours using a Tesla V100.

How to Cite

If you extend or use this work, please cite the paper where it was introduced:

@misc{bonifacio2021mmarco,
      title={mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset}, 
      author={Luiz Henrique Bonifacio and Israel Campiotti and Roberto Lotufo and Rodrigo Nogueira},
      year={2021},
      eprint={2108.13897},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Convert onnx models to pytorch.

onnx2torch onnx2torch is an ONNX to PyTorch converter. Our converter: Is easy to use – Convert the ONNX model with the function call convert; Is easy

ENOT 264 Dec 30, 2022
Agent-based model simulator for air quality and pandemic risk assessment in architectural spaces

Agent-based model simulation for air quality and pandemic risk assessment in architectural spaces. User Guide archABM is a fast and open source agent-

Vicomtech 10 Dec 05, 2022
Make your own game in a font!

Project structure. Included is a suite of tools to create font games. Tutorial: For a quick tutorial about how to make your own game go here For devel

Michael Mulet 125 Dec 04, 2022
Pytorch Implementation of DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis (TTS Extension)

DiffSinger - PyTorch Implementation PyTorch implementation of DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis (TTS Extension). Status

Keon Lee 152 Jan 02, 2023
Image to Image translation, image generataton, few shot learning

Semi-supervised Learning for Few-shot Image-to-Image Translation [paper] Abstract: In the last few years, unpaired image-to-image translation has witn

yaxingwang 49 Nov 18, 2022
Yolov5 deepsort inference,使用YOLOv5+Deepsort实现车辆行人追踪和计数,代码封装成一个Detector类,更容易嵌入到自己的项目中

使用YOLOv5+Deepsort实现车辆行人追踪和计数,代码封装成一个Detector类,更容易嵌入到自己的项目中。

813 Dec 31, 2022
PESTO: Switching Point based Dynamic and Relative Positional Encoding for Code-Mixed Languages

PESTO: Switching Point based Dynamic and Relative Positional Encoding for Code-Mixed Languages Abstract NLP applications for code-mixed (CM) or mix-li

Mohsin Ali, Mohammed 1 Nov 12, 2021
Graph Regularized Residual Subspace Clustering Network for hyperspectral image clustering

Graph Regularized Residual Subspace Clustering Network for hyperspectral image clustering

Yaoming Cai 5 Jul 18, 2022
Air Pollution Prediction System using Linear Regression and ANN

AirPollution Pollution Weather Prediction System: Smart Outdoor Pollution Monitoring and Prediction for Healthy Breathing and Living Publication Link:

Dr Sharnil Pandya, Associate Professor, Symbiosis International University 19 Feb 07, 2022
Hyperopt for solving CIFAR-100 with a convolutional neural network (CNN) built with Keras and TensorFlow, GPU backend

Hyperopt for solving CIFAR-100 with a convolutional neural network (CNN) built with Keras and TensorFlow, GPU backend This project acts as both a tuto

Guillaume Chevalier 103 Jul 22, 2022
RefineGNN - Iterative refinement graph neural network for antibody sequence-structure co-design (RefineGNN)

Iterative refinement graph neural network for antibody sequence-structure co-des

Wengong Jin 83 Dec 31, 2022
General purpose GPU compute framework for cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends)

General purpose GPU compute framework for cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabled, asynchronous and optimized for advanced GPU data processing usec

The Kompute Project 1k Jan 06, 2023
[CVPR2021] Domain Consensus Clustering for Universal Domain Adaptation

[CVPR2021] Domain Consensus Clustering for Universal Domain Adaptation [Paper] Prerequisites To install requirements: pip install -r requirements.txt

Guangrui Li 84 Dec 26, 2022
PyTorch code for our paper "Gated Multiple Feedback Network for Image Super-Resolution" (BMVC2019)

Gated Multiple Feedback Network for Image Super-Resolution This repository contains the PyTorch implementation for the proposed GMFN [arXiv]. The fram

Qilei Li 66 Nov 03, 2022
The project was to detect traffic signs, based on the Megengine framework.

trafficsign 赛题 旷视AI智慧交通开源赛道,初赛1/177,复赛1/12。 本赛题为复杂场景的交通标志检测,对五种交通标志进行识别。 框架 megengine 算法方案 网络框架 atss + resnext101_32x8d 训练阶段 图片尺寸 最终提交版本输入图片尺寸为(1500,2

20 Dec 02, 2022
Improving Factual Completeness and Consistency of Image-to-text Radiology Report Generation

Improving Factual Completeness and Consistency of Image-to-text Radiology Report Generation The reference code of Improving Factual Completeness and C

46 Dec 15, 2022
Repository for code and dataset for our EMNLP 2021 paper - “So You Think You’re Funny?”: Rating the Humour Quotient in Standup Comedy.

AI-OpenMic Dataset The dataset is available for download via the follwing link. Repository for code and dataset for our EMNLP 2021 paper - “So You Thi

6 Oct 26, 2022
Code for the paper "Implicit Representations of Meaning in Neural Language Models"

Implicit Representations of Meaning in Neural Language Models Preliminaries Create and set up a conda environment as follows: conda create -n state-pr

Belinda Li 39 Nov 03, 2022
Unsupervised Video Interpolation using Cycle Consistency

Unsupervised Video Interpolation using Cycle Consistency Project | Paper | YouTube Unsupervised Video Interpolation using Cycle Consistency Fitsum A.

NVIDIA Corporation 100 Nov 30, 2022
Pytorch Implementation of "Contrastive Representation Learning for Exemplar-Guided Paraphrase Generation"

CRL_EGPG Pytorch Implementation of Contrastive Representation Learning for Exemplar-Guided Paraphrase Generation We use contrastive loss implemented b

YHR 25 Nov 14, 2022