Installation:

pip install lm_dataloader

Design Philosophy

A library to unify language-model (LM) data loading at large scale.
Simple interface; any tokenizer can be integrated.
Minimal changes needed when moving from small to large scale (many GPU nodes).
Follows the 'mmap' data format used by fairseq and Megatron, with the following improvements:
- Easily combine multiple datasets
- Easily split a dataset into train / val / test splits
- Easily build a weighted dataset out of a list of existing ones along with weights.
- Unified into a single 'file' (which is actually a directory containing a .bin and an .idx file)
- Index files that are built on the fly are hidden files, leaving less clutter in the directory.
- More straightforward interface, better documentation.
- Inspectable with a command line tool
- Can load from URLs
- Can load from S3 buckets
- Can load from GCS buckets
- Can tokenize on the fly instead of preprocessing
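To make the ".bin / .idx" idea above more concrete, here is a minimal sketch of a memory-mapped token store using only the Python standard library. It is an illustration of the general technique, not the actual on-disk format of lm_dataloader, fairseq, or Megatron: the file names data.bin and data.idx, the 16-bit token width, and the helpers write_dataset and read_document are all assumptions made for this example.

```python
import mmap
import os
import struct
import tempfile
from array import array


def write_dataset(directory, documents):
    """Write token lists to <directory>/data.bin plus an offset table in data.idx."""
    os.makedirs(directory, exist_ok=True)
    offsets = [0]  # offsets[i] is the position of the first token of document i
    with open(os.path.join(directory, "data.bin"), "wb") as bin_file:
        for tokens in documents:
            # "H" = unsigned 16-bit integers (assumed to be enough for the vocab)
            array("H", tokens).tofile(bin_file)
            offsets.append(offsets[-1] + len(tokens))
    with open(os.path.join(directory, "data.idx"), "wb") as idx_file:
        idx_file.write(struct.pack(f"<{len(offsets)}Q", *offsets))


def read_document(directory, index):
    """Read one document back without loading the whole .bin file into memory."""
    with open(os.path.join(directory, "data.idx"), "rb") as idx_file:
        raw = idx_file.read()
    offsets = struct.unpack(f"<{len(raw) // 8}Q", raw)
    start, end = offsets[index], offsets[index + 1]
    item_size = array("H").itemsize
    with open(os.path.join(directory, "data.bin"), "rb") as bin_file:
        with mmap.mmap(bin_file.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            tokens = array("H")
            # Only the requested byte range is copied out of the mapping
            tokens.frombytes(mapped[start * item_size:end * item_size])
    return tokens.tolist()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_dataset.lmd")
    write_dataset(path, [[1, 2, 3], [4, 5], [6, 7, 8, 9]])
    second = read_document(path, 1)
print(second)  # [4, 5]
```

The point of the layout is that the offset table is small and can be read eagerly, while the token file is mapped and only the requested slice is touched, which is what keeps random access cheap for very large corpora.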

Misc. TODO:
- [ ] Option to set mpu globally (for distributed dataloading)

Example usage

To tokenize a dataset contained in a .jsonl file (where the text to be tokenized can be accessed under the 'text' key):

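import lm_dataloader as lmdl
from transformers import GPT2TokenizerFast

# Input: one JSON object per line, with the text under the 'text' key
jsonl_path = "test.jsonl"
# Output: path of the dataset to create
output = "my_dataset.lmd"
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

lmdl.encode(
    jsonl_path,
    output_prefix=output,
    tokenize_fn=tokenizer.encode,  # any callable mapping text to token ids
    tokenizer_vocab_size=len(tokenizer),
    eod_token=tokenizer.eos_token_id,  # GPT-2's end-of-text token marks document ends
)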

This will create a dataset at "my_dataset.lmd" which can be loaded as an indexed torch dataset like so:

from lm_dataloader import LMDataset
from transformers import GPT2TokenizerFast 

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
seq_length = tokenizer.model_max_length # or whatever the sequence length of your model is

dataset = LMDataset("my_dataset.lmd", seq_length=seq_length)

# peek at 0th index
print(dataset[0])
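To try the two snippets above end to end, a small input file in the expected shape (one JSON object per line, text under the 'text' key) can be generated with the standard library. This is only a sketch: write_jsonl and read_texts are hypothetical helpers written for this example and are not part of lm_dataloader, and the sample sentences are placeholders.

```python
import json
import os
import tempfile


def write_jsonl(path, texts):
    """Write one {"text": ...} object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")


def read_texts(path):
    """Read the 'text' field back from every non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]


# A temporary directory is used here so the example leaves no files behind;
# point jsonl_path at "test.jsonl" to produce the input used above.
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = os.path.join(tmp, "test.jsonl")
    write_jsonl(jsonl_path, ["Hello world.", "A second, slightly longer document."])
    texts = read_texts(jsonl_path)
print(texts)
```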

Command line utilities

Command line utilities are also provided to inspect and merge datasets, e.g.:

lm-dataloader inspect my_dataset.lmd

This launches an interactive terminal to inspect the data in my_dataset.lmd.

And:

lm-dataloader merge my_dataset.lmd,my_dataset_2.lmd new_dataset.lmd

Merges the datasets at "my_dataset.lmd" and "my_dataset_2.lmd" into a new file at "new_dataset.lmd".
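Conceptually, merging two indexed datasets means concatenating their token streams and shifting the second dataset's document offsets so they still point at the right tokens. The sketch below shows that idea on plain Python lists. How the merge command works internally is not described here, so this is an assumption about the general technique; merge_indexed and get_document are hypothetical names, and each dataset is represented as a (tokens, offsets) pair purely for illustration.

```python
def merge_indexed(datasets):
    """Merge (tokens, offsets) pairs into a single (tokens, offsets) pair.

    offsets[i] is the position of the first token of document i, and the
    final entry equals len(tokens).
    """
    merged_tokens = []
    merged_offsets = [0]
    for tokens, offsets in datasets:
        shift = len(merged_tokens)  # where this dataset starts in the merged stream
        merged_tokens.extend(tokens)
        # Skip the leading 0 of each offset table; it is already covered by
        # the last offset written so far.
        merged_offsets.extend(shift + offset for offset in offsets[1:])
    return merged_tokens, merged_offsets


def get_document(tokens, offsets, index):
    """Return the token ids of document `index`."""
    return tokens[offsets[index]:offsets[index + 1]]


first = ([1, 2, 3, 4, 5], [0, 3, 5])   # documents: [1, 2, 3] and [4, 5]
second = ([6, 7, 8, 9], [0, 1, 4])     # documents: [6] and [7, 8, 9]

tokens, offsets = merge_indexed([first, second])
print(offsets)                           # [0, 3, 5, 6, 9]
print(get_document(tokens, offsets, 3))  # [7, 8, 9]
```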

Dataloader tools for language modelling
