This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection"

Overview

Splinter

This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection", to appear at ACL 2021.

Our pretraining code is based on TensorFlow (checked on 1.15), while fine-tuning is based on PyTorch (1.7.1) and Transformers (2.9.0). Note each has its own requirement file: pretraining/requirements.txt and finetuning/requirements.txt.

Data

Downloading Few-Shot MRQA Splits

curl -L https://www.dropbox.com/sh/pfg8j6yfpjltwdx/AAC8Oky0w8ZS-S3S5zSSAuQma?dl=1 > mrqa-few-shot.zip
unzip mrqa-few-shot.zip -d mrqa-few-shot

Pretrained Model

Command for downloading Splinter
curl -L https://www.dropbox.com/sh/h63xx2l2fjq8bsz/AAC5_Z_F2zBkJgX87i3IlvGca?dl=1 > splinter.zip
unzip splinter.zip -d splinter 

Pretraining

Create a virtual environment and execute

cd pretraining
pip install -r requirements.txt  # or requirements-gpu.txt for a GPU version

Then download the raw data (our pretraining was based on Wikipedia and BookCorpus). We support two data formats:

  • For wiki, a tag starts a new article and a ends it.
  • For BookCorpus, we process an already-tokenized file where tokens are separated by whitespaces. Newlines stands for a new book.
Command for creating the pretraining data

This command takes as input a set of files ($INPUT_PATTERN) and creates a tensorized dataset for pretraining. It supports the following masking schemes:

Command for creating the data for Splinter (recurring span selection)
cd pretraining
python create_pretraining_data.py \
    --input_file=$INPUT_PATTERN \
    --output_dir=$OUTPUT_DIR \
    --vocab_file=vocabs/bert-cased-vocab.txt \
    --do_lower_case=False \
    --do_whole_word_mask=False \
    --max_seq_length=512 \
    --num_processes=63 \
    --dupe_factor=5 \
    --max_span_length=10 \
    --recurring_span_selection=True \
    --only_recurring_span_selection=True \
    --max_questions_per_seq=30

n-gram statistics are written to ngrams.txt in the output directory.

Command for pretraining Splinter
cd pretraining
python run_pretraining.py \
    --bert_config_file=configs/bert-base-cased-config.json \
    --input_file=$INPUT_FILE \
    --output_dir=$OUTPUT_DIR \
    --max_seq_length=512 \
    --recurring_span_selection=True \
    --only_recurring_span_selection=True \
    --max_questions_per_seq=30 \
    --do_train \
    --train_batch_size=256 \
    --learning_rate=1e-4 \
    --num_train_steps=2400000 \
    --num_warmup_steps=10000 \
    --save_checkpoints_steps=10000 \
    --keep_checkpoint_max=240 \
    --use_tpu \
    --num_tpu_cores=8 \
    --tpu_name=$TPU_NAME

This can be trained using GPUs by dropping the use_tpu flag (although it was tested mainly on TPUs).

Convert TensorFlow Model to PyTorch

In order to fine-tune the TF model you pretrained with run_pretraining.py, you will first need to convert it to PyTorch. You can do so by

cd model_conversion
pip install -r requirements.txt
python convert_tf_to_pytorch.py --tf_checkpoint_path $TF_MODEL_PATH --pytorch_dump_path $OUTPUT_PATH

Fine-tuning

Fine-tuning has different requirements than pretraining, as it uses HuggingFace's Transformers library. Create a virtual environment and execute

cd finetuning
pip install -r requirements.txt

Please Note: If you want to reproduce results from the paper or run with a QASS head in genral, questions need to be augmented with a [QUESTION] token. In order to do so, please run

cd finetuning
python qass_preprocess.py --path "../mrqa-few-shot/*/*.jsonl"

This will add a [MASK] token to each question in the training data, which will later be replaced by a [QUESTION] token automatically by the QASS layer implementation.

Then fine-tune Splinter by

cd finetuning
export MODEL="../splinter"
export OUTPUT_DIR="output"
python run_mrqa.py \
    --model_type=bert \
    --model_name_or_path=$MODEL \
    --qass_head=True \
    --tokenizer_name=$MODEL \
    --output_dir=$OUTPUT_DIR \
    --train_file="../mrqa-few-shot/squad/squad-train-seed-42-num-examples-16_qass.jsonl" \
    --predict_file="../mrqa-few-shot/squad/dev_qass.jsonl" \
    --do_train \
    --do_eval \
    --max_seq_length=384 \
    --doc_stride=128 \
    --threads=4 \
    --save_steps=50000 \
    --per_gpu_train_batch_size=12 \
    --per_gpu_eval_batch_size=16 \
    --learning_rate=3e-5 \
    --max_answer_length=10 \
    --warmup_ratio=0.1 \
    --min_steps=200 \
    --num_train_epochs=10 \
    --seed=42 \
    --use_cache=False \
    --evaluate_every_epoch=False 

In order to train with automatic mixed precision, install apex and add the --fp16 flag.

See an example script for fine-tuning SpanBERT (rather than Splinter) here.

Citation

If you find this work helpful, please cite us

@inproceedings{ram-etal-2021-shot,
    title = "Few-Shot Question Answering by Pretraining Span Selection",
    author = "Ram, Ori  and
      Kirstain, Yuval  and
      Berant, Jonathan  and
      Globerson, Amir  and
      Levy, Omer",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.239",
    pages = "3066--3079",
}

Acknowledgements

We would like to thank the European Research Council (ERC) for funding the project, and to Google’s TPU Research Cloud (TRC) for their support in providing TPUs.

Owner
Ori Ram
PhD Candidate at Tel Aviv University, focusing on NLP and Machine Learning
Ori Ram
Anomaly Detection 이상치 탐지 전처리 모듈

Anomaly Detection 시계열 데이터에 대한 이상치 탐지 1. Kernel Density Estimation을 활용한 이상치 탐지 train_data_path와 test_data_path에 존재하는 시점 정보를 포함하고 있는 csv 형태의 train data와

CLUST-consortium 43 Nov 28, 2022
FewCLUE: 为中文NLP定制的小样本学习测评基准

FewCLUE: 为中文NLP定制的小样本学习测评基准

CLUE benchmark 387 Jan 04, 2023
Simple, hackable offline speech to text - using the VOSK-API.

Simple, hackable offline speech to text - using the VOSK-API.

Campbell Barton 844 Jan 07, 2023
Research code for "What to Pre-Train on? Efficient Intermediate Task Selection", EMNLP 2021

efficient-task-transfer This repository contains code for the experiments in our paper "What to Pre-Train on? Efficient Intermediate Task Selection".

AdapterHub 26 Dec 24, 2022
Train 🤗-transformers model with Poutyne.

poutyne-transformers Train 🤗 -transformers models with Poutyne. Installation pip install poutyne-transformers Example import torch from transformers

Lennart Keller 2 Dec 18, 2022
초성 해석기 based on ko-BART

초성 해석기 개요 한국어 초성만으로 이루어진 문장을 입력하면, 완성된 문장을 예측하는 초성 해석기입니다. 초성: ㄴㄴ ㄴㄹ ㅈㅇㅎ 예측 문장: 나는 너를 좋아해 모델 모델은 SKT-AI에서 공개한 Ko-BART를 이용합니다. 데이터 문장 단위로 이루어진 아무 코퍼스나

Dawoon Jung 29 Oct 28, 2022
Translation for Trilium Notes. Trilium Notes 中文版.

Trilium Translation 中文说明 This repo provides a translation for the awesome Trilium Notes. Currently, I have translated Trilium Notes into Chinese. Test

743 Jan 08, 2023
HAIS_2GNN: 3D Visual Grounding with Graph and Attention

HAIS_2GNN: 3D Visual Grounding with Graph and Attention This repository is for the HAIS_2GNN research project. Tao Gu, Yue Chen Introduction The motiv

Yue Chen 1 Nov 26, 2022
Unofficial Python library for using the Polish Wordnet (plWordNet / Słowosieć)

Polish Wordnet Python library Simple, easy-to-use and reasonably fast library for using the Słowosieć (also known as PlWordNet) - a lexico-semantic da

Max Adamski 12 Dec 23, 2022
Repository for Project Insight: NLP as a Service

Project Insight NLP as a Service Contents Introduction Features Installation Setup and Documentation Project Details Demonstration Directory Details H

Abhishek Kumar Mishra 286 Dec 06, 2022
Auto_code_complete is a auto word-completetion program which allows you to customize it on your needs

auto_code_complete is a auto word-completetion program which allows you to customize it on your needs. the model for this program is one of the deep-learning NLP(Natural Language Process) model struc

RUO 2 Feb 22, 2022
Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow.

Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow. Documentation Proper documentation is available at

HUSEIN ZOLKEPLI 151 Jan 05, 2023
Higher quality textures for the Metal Gear Solid series.

Metal Gear Solid: HD Textures Higher quality textures for the Metal Gear Solid series. The goal is to maximize the quality of assets that the engine w

Samantha 6 Dec 06, 2022
TLA - Twitter Linguistic Analysis

TLA - Twitter Linguistic Analysis Tool for linguistic analysis of communities TLA is built using PyTorch, Transformers and several other State-of-the-

Tushar Sarkar 47 Aug 14, 2022
Utilities for preprocessing text for deep learning with Keras

Note: This utility is really old and is no longer maintained. You should use keras.layers.TextVectorization instead of this. Utilities for pre-process

Hamel Husain 180 Dec 09, 2022
Official PyTorch implementation of SegFormer

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers Figure 1: Performance of SegFormer-B0 to SegFormer-B5. Project page

NVIDIA Research Projects 1.4k Dec 29, 2022
CCF BDCI 2020 房产行业聊天问答匹配赛道 A榜47/2985

CCF BDCI 2020 房产行业聊天问答匹配 A榜47/2985 赛题描述详见:https://www.datafountain.cn/competitions/474 文件说明 data: 存放训练数据和测试数据以及预处理代码 model_bert.py: 网络模型结构定义 adv_train

shuo 40 Sep 28, 2022
A text file containing 479k English words for all your dictionary/word-based projects e.g: auto-completion / autosuggestion

List Of English Words A text file containing over 466k English words. While searching for a list of english words (for an auto-complete tutorial) I fo

dwyl 8.5k Jan 03, 2023
Submit issues and feature requests for our API here.

AIx GPT API Submit issues and feature requests for our API here. See https://apps.aixsolutionsgroup.com for more info. Python Quick Start pip install

AIx Solutions 7 Mar 27, 2022
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [Project Page] [Paper] [Video] Wenlong Huang1, Pieter Abbee

Wenlong Huang 114 Dec 29, 2022