PG-19 Language Modelling Benchmark

Last update: Oct 30, 2022

This repository contains the PG-19 language modeling benchmark. It includes a set of books, extracted from the Project Gutenberg library [1], that were published before 1919. It also contains metadata of book titles and publication dates.

Full dataset download link

PG-19 is more than double the size of the Billion Word benchmark [2] and contains documents that are, on average, 20 times longer than those in the WikiText long-range language modelling benchmark [3].

Books are partitioned into a train, validation, and test set. Book metadata is stored in metadata.csv which contains (book_id, short_book_title, publication_date).
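As a rough illustration of working with the metadata, the sketch below parses rows of the form (book_id, short_book_title, publication_date). It assumes metadata.csv is comma-separated with no header row, which should be checked against the actual file. The `BookRecord` type, the `read_metadata` helper, and the sample rows are hypothetical and not part of this repository.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class BookRecord(NamedTuple):
    """One metadata row. Fields are kept as strings because the exact
    formats (e.g. of publication_date) are not specified here."""
    book_id: str
    short_book_title: str
    publication_date: str


def read_metadata(lines: Iterable[str]) -> List[BookRecord]:
    """Parse metadata rows from any iterable of text lines.

    Assumption: comma-separated values, no header row, three columns.
    Rows with fewer than three columns are skipped.
    """
    reader = csv.reader(lines)
    return [BookRecord(*row[:3]) for row in reader if len(row) >= 3]


# Made-up rows for demonstration only; real data lives in metadata.csv.
sample = io.StringIO(
    '101,"A Hypothetical Title, Vol. 1",1850\n'
    '102,Another Made-Up Book,1901\n'
)
for record in read_metadata(sample):
    print(record.book_id, record.short_book_title, record.publication_date)
```

To read the real file, one could pass an open file handle instead, for example `read_metadata(open("metadata.csv", newline="", encoding="utf-8"))`, again assuming the layout described above.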

Unlike prior benchmarks, we do not constrain the vocabulary size (for example, by mapping rare words to an UNK token) but instead release the data as an open-vocabulary benchmark. The only processing applied to the text is the removal of boilerplate license text and the mapping of offensive discriminatory words, as specified by Ofcom [4], to placeholder tokens. Users are free to model the data at the character level, at the subword level, or via any mechanism that can model an arbitrary string of text.

To compare models, we propose to continue measuring word-level perplexity: compute the total negative log-likelihood of the dataset (via any chosen subword vocabulary or character-based scheme), divide it by the number of word-level tokens specified in the dataset statistics table below, and exponentiate the result.
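The calculation can be sketched as follows, using the standard definition of perplexity as the exponential of the average negative log-likelihood per token. The function names are illustrative and not part of this repository. The key point is that the total is summed over whatever units the model predicts (characters or subwords) but is normalised by the word-level token count from the statistics table, so that models with different vocabularies remain comparable.

```python
import math


def word_level_perplexity(total_nll_nats: float, num_word_tokens: int) -> float:
    """Word-level perplexity from a total negative log-likelihood.

    total_nll_nats: sum of -log p(unit) in nats over every unit the model
        predicts (characters, subwords, ...) across the whole evaluation set.
    num_word_tokens: the word-level token count of that set, taken from the
        dataset statistics table (not the number of model units).
    """
    if num_word_tokens <= 0:
        raise ValueError("num_word_tokens must be positive")
    return math.exp(total_nll_nats / num_word_tokens)


def nats_from_bits(total_bits: float) -> float:
    """Convert a total loss measured in bits (log base 2) to nats."""
    return total_bits * math.log(2)


# Illustration with a made-up total loss; only the token count is real
# (it is the validation-set figure from the statistics table).
VALIDATION_WORD_TOKENS = 3_007_061
hypothetical_total_nll = 1.05e7  # nats, invented for this example
print(word_level_perplexity(hypothetical_total_nll, VALIDATION_WORD_TOKENS))
```

A model trained with a per-subword loss would therefore sum its per-subword losses over the full split before dividing, rather than averaging over the number of subwords.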

One could use this dataset for benchmarking long-range language models, or use it to pre-train for other natural language processing tasks which require long-range reasoning, such as LAMBADA [5] or NarrativeQA [6]. We would not recommend using this dataset to train a general-purpose language model, e.g. for applications to a production-system dialogue agent, due to the dated linguistic style of old texts and the inherent biases present in historical writing.

Dataset Statistics

	Train	Validation	Test
Books	28,602	50	100
Num. Tokens	1,973,136,207	3,007,061	6,966,499

Bibtex

@article{raecompressive2019,
author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
          Hillier, Chloe and Lillicrap, Timothy P},
title = {Compressive Transformers for Long-Range Sequence Modelling},
journal = {arXiv preprint},
url = {https://arxiv.org/abs/1911.05507},
year = {2019},
}

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property	value
name	The PG-19 Language Modeling Benchmark
alternateName	PG-19
url	https://github.com/deepmind/pg19
sameAs	https://github.com/deepmind/pg19
description	This repository contains the PG-19 dataset. It includes a set of books extracted from the Project Gutenberg books project (https://www.gutenberg.org), that were published before 1919. It also contains metadata of book titles and publication dates.

provider

property	value
name	`DeepMind`
sameAs	`https://en.wikipedia.org/wiki/DeepMind`

license

property	value
name	`Apache License, Version 2.0`
url	`https://www.apache.org/licenses/LICENSE-2.0.html`

citation	https://identifiers.org/arxiv:1911.05507

Contact

If you have any questions, please contact Jack Rae.

References

[1] https://www.gutenberg.org
[2] Chelba et al. "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling" (2013)
[3] Merity et al. "Pointer Sentinel Mixture Models" (2016)
[4] Ofcom offensive language guide
[5] Paperno et al. "The LAMBADA dataset: Word prediction requiring a broad discourse context" (2016)
[6] Kočiský et al. "The NarrativeQA Reading Comprehension Challenge" (2018)
