Chinese question generator using the Delta Electronics reading comprehension dataset (DRCD)

Last update: Oct 22, 2021

Overview

Transformer QG on DRCD

The model input follows the highlighting scheme quoted below: given a context C and an answer A, we integrate C and A into a new C' in the following form.
C' = [c1, c2, ..., [HL], a1, ..., a|A|, [HL], ..., c|C|]

Proposed by Ying-Hong Chan & Yao-Chung Fan. (2019). A Recurrent BERT-based Model for Question Generation.
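
As a rough illustration of the C' format above, the sketch below wraps an answer span in [HL] markers at the character level. The function name `highlight_answer` and its arguments are hypothetical and not part of this repository; it assumes a SQuAD-style character offset for the answer, and the project's actual preprocessing may operate on tokens instead.

```python
def highlight_answer(context, answer_start, answer_text, hl_token="[HL]"):
    """Return the context with the answer span wrapped in highlight tokens.

    Hypothetical helper: answer_start is assumed to be a character offset
    into the context, as in SQuAD-style annotations.
    """
    answer_end = answer_start + len(answer_text)
    # Guard against an offset that does not point at the answer text.
    if context[answer_start:answer_end] != answer_text:
        raise ValueError("answer_text does not match the context at answer_start")
    return (
        context[:answer_start]
        + hl_token + answer_text + hl_token
        + context[answer_end:]
    )


context = "伊隆·里夫·馬斯克是一名企業家和商業大亨"
highlighted = highlight_answer(context, 0, "伊隆·里夫·馬斯克")
print(highlighted)  # [HL]伊隆·里夫·馬斯克[HL]是一名企業家和商業大亨
```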

We also have a separate English QG project: Transformer-QG-on-SQuAD

Features

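Complete pipeline, from fine-tuning to model scoring
Supports many state-of-the-art language models
Built-in Flask, so the model can quickly be served as an API server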

DRCD dataset

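The Delta Reading Comprehension Dataset (DRCD) is a general-domain Traditional Chinese machine reading comprehension dataset. It contains 10,014 paragraphs drawn from 2,108 Wikipedia articles, with more than 30,000 questions annotated from those paragraphs.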

Available models

BART (based on uer/bart-base-chinese-cluecorpussmall)

Use in Transformers

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
  
tokenizer = AutoTokenizer.from_pretrained("p208p2002/bart-drcd-qg-hl")

model = AutoModelForSeq2SeqLM.from_pretrained("p208p2002/bart-drcd-qg-hl")
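
Once the tokenizer and model are loaded, a question could be generated roughly as follows. This is a minimal sketch: the helper name `generate_question`, the `max_length` value, and the removal of spaces from the decoded text are assumptions, not settings taken from this repository. The input is expected to already contain the [HL] markers around the answer.

```python
def generate_question(tokenizer, model, highlighted_context, max_length=64):
    """Generate one question for a context whose answer is wrapped in [HL].

    Hypothetical helper; max_length=64 is an assumed cap, not a project default.
    """
    # Encode the highlighted context as PyTorch tensors.
    inputs = tokenizer(highlighted_context, return_tensors="pt")
    # Pass only the fields the seq2seq model needs, so extra tokenizer
    # outputs (for example token_type_ids) are not forwarded to generate().
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
    )
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Assumption: the tokenizer joins Chinese tokens with spaces on decode,
    # so they are dropped here. This would also remove spaces in Latin text.
    return text.replace(" ", "")


# Usage, with `tokenizer` and `model` loaded via from_pretrained:
# question = generate_question(
#     tokenizer, model, "[HL]伊隆·里夫·馬斯克[HL]是一名企業家和商業大亨"
# )
```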

Experiments

Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L
BART-HLSQG	34.25	27.70	22.43	18.13	23.58	36.88

Environment requirements

The whole project was developed on Ubuntu.

If you don't have PyTorch 1.6+, please install or update it first:

https://pytorch.org/get-started/locally/

Install packages: pip install -r requirements.txt
Set up the scorer: python setup_scorer.py
Download the dataset: python init_dataset.py

Training

Seq2Seq LM

usage: train_seq2seq_lm.py [-h]
                           [--base_model {bert-base-chinese,uer/bart-base-chinese-cluecorpussmall,p208p2002/bart-drcd-qg-hl}]
                           [-d {drcd}] [--batch_size BATCH_SIZE]
                           [--epoch EPOCH] [--lr LR] [--dev DEV] [--server]
                           [--run_test] [-fc FROM_CHECKPOINT]

optional arguments:
  -h, --help            show this help message and exit
  --base_model {bert-base-chinese,uer/bart-base-chinese-cluecorpussmall,p208p2002/bart-drcd-qg-hl}
  -d {drcd}, --dataset {drcd}
  --batch_size BATCH_SIZE
  --epoch EPOCH
  --lr LR
  --dev DEV
  --server
  --run_test
  -fc FROM_CHECKPOINT, --from_checkpoint FROM_CHECKPOINT
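
For readers who want to see how the flags above fit together, here is a sketch of an `argparse` parser that accepts the same options. It is reconstructed from the help text only and is not the script's actual source: the `int`/`float` types are assumptions, defaults are left unset because the help text does not show them, and the meaning of `--dev` is not documented, so it is kept as a plain string.

```python
import argparse

# Choices copied from the usage text above.
BASE_MODELS = [
    "bert-base-chinese",
    "uer/bart-base-chinese-cluecorpussmall",
    "p208p2002/bart-drcd-qg-hl",
]


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="train_seq2seq_lm.py")
    parser.add_argument("--base_model", choices=BASE_MODELS)
    parser.add_argument("-d", "--dataset", choices=["drcd"])
    parser.add_argument("--batch_size", type=int)   # type is an assumption
    parser.add_argument("--epoch", type=int)        # type is an assumption
    parser.add_argument("--lr", type=float)         # type is an assumption
    parser.add_argument("--dev")                    # meaning not documented
    parser.add_argument("--server", action="store_true")
    parser.add_argument("--run_test", action="store_true")
    parser.add_argument("-fc", "--from_checkpoint")
    return parser


# Parse the same flags used to start the API server from the pre-trained model.
args = build_parser().parse_args(
    ["--server", "--base_model", "p208p2002/bart-drcd-qg-hl"]
)
print(args.server, args.base_model)
```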

Run as API server

From pre-trained (recommended)

python train_seq2seq_lm.py --server --base_model p208p2002/bart-drcd-qg-hl

From your own checkpoint

python train_xxx_lm.py --server --base_model YOUR_BASE_MODEL --from_checkpoint FROM_CHECKPOINT

Request example

curl --location --request POST 'http://127.0.0.1:5000/' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'context=[HL]伊隆·里夫·馬斯克[HL]是一名企業家和商業大亨'

{"predict": "哪一個人是一名企業家和商業大亨?"}
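
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server is running at the address from the curl example and returns JSON with a `predict` field as shown above; the function names `build_request` and `ask_question` and the `timeout` value are illustrative choices.

```python
import json
import urllib.parse
import urllib.request

API_URL = "http://127.0.0.1:5000/"  # address taken from the curl example


def build_request(context, url=API_URL):
    """Build a form-encoded POST request carrying the highlighted context."""
    data = urllib.parse.urlencode({"context": context}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def ask_question(context, url=API_URL, timeout=30):
    """Send the request and return the generated question string."""
    with urllib.request.urlopen(build_request(context, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["predict"]


# Usage (requires the API server to be running):
# print(ask_question("[HL]伊隆·里夫·馬斯克[HL]是一名企業家和商業大亨"))
```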

Owner

Philip
