Russian GPT3 models.

Overview

ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small and ruGPT2Large

This repository contains bunch of autoregressive transformer language models trained on a huge dataset of russian language.

Russian GPT-3 models (ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small) trained with 2048 sequence length with sparse and dense attention blocks. We also provide Russian GPT-2 large model (ruGPT2Large) trained with 1024 sequence length.

We suggest using ruGPT2Large or ruGPT3XL because this models are well tested and achieve the best perplexity.

Usage examples are described in detail here.

Old version of code you can find here

Table of contents

Setup and usage

Models can be used for inference or finetuning with two ways: 🤗 HuggingFace interface or our code based on this implementation.

For both ways install transformers:

pip install transformers==3.5.0

HuggingFace interface

We support 🤗 HuggingFace interface only for ruGPT3Large, ruGPT3Medium, ruGPT3Small and ruGPT2Large models. For RuGPT3XL please use code in this repo because RuGPT3XL model was trained with sparse attention.

Here we can obtain examples of finetuning or generation.

Also this examples is adapted for google colab:

  • finetuning: finetuning
  • generation: generation

Basic usage:

from transformers import GPT2LMHeadModel, GPT2Tokenizer


model_name_or_path = "sberbank-ai/rugpt3large_based_on_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name_or_path)
model = GPT2LMHeadModel.from_pretrained(model_name_or_path).cuda()
text = "Александр Сергеевич Пушкин родился в "
input_ids = tokenizer.encode(text, return_tensors="pt").cuda()
out = model.generate(input_ids.cuda())
generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)
# Output should be like this:
# Александр Сергеевич Пушкин родился в \n1799 году. Его отец был крепостным крестьянином, а мать – крепостной крестьянкой. Детство и юность Пушкина прошли в деревне Михайловское под Петербургом. В 1820-х годах семья переехала

For more information about 🤗 HuggingFace interface please follow this documentation.

Data issues

For training pass single txt file.

Megatron interface

Without deepspeed

For using our code for finetuning without deepspeed (not recommended) we should install apex:

%%writefile setup.sh

export CUDA_HOME=/usr/local/cuda-10.1
git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

sh setup.sh

Example of finetuning, generating and loading/convert megatron checkpoints here or Open In Colab

Note! This way is valid for all RuGPTs models except RuGPT3XL.

Megatron with deepspeed

For using our code for finetuning with deepspeed (recommended) we should install apex (see previous section) and deepspeed:

pip install deepspeed==0.3.7

Example of finetuning, generating and loading/convert megatron checkpoints here or Open In Colab

Note! For using deepspeed we should specify environ variable before all your python scripts and run with torch.distributed or mpi:

USE_DEEPSPEED=1 python -m torch.distributed.launch --nproc_per_node 1 ru-gpts/pretrain_gpt3.py \
  --train-data-path "train.list" \
  --test-data-path "valid.list" \
  --max-files-per-process 100 \
  --save model \
  --load-huggingface sberbank-ai/rugpt3small_based_on_gpt2 \
  --model-parallel-size 1 \
  --num-layers 12 \
  --hidden-size 768 \
  --num-attention-heads 12 \
  --seq-length 2048 \
  --max-position-embeddings 2048 \
  --fp16 \
  --checkpoint-activations \
  --deepspeed-activation-checkpointing \
  --deepspeed \
  --deepspeed_config ru-gpts/src/deepspeed_config/gpt3_small_2048.json
Data issues

We use custom implementation of distributed dataset. For training and evaluating we should specify file file.list with list of paths to txt files. All files from file.list will be splitted between aviable GPUs. The logic of splitting is described by the following code:

shard_size = len(files) // world_size
shard_start = rank * shard_size
shard_end = (rank + 1) * shard_size
files = files[shard_start:shard_end]

For more details please see full code of dataset: src.dataset_rugpt3.RuGpt3TextDataset and example.

Note! This way is valid for all RuGPTs models except RuGPT3XL.

Megatron with deepspeed and sparsity

This section is used mostly for usage of RuGPT3XL model and training models with sparse attention.

apt-get install llvm-9-dev
pip install cpufeature
pip install triton==0.2.3
DS_BUILD_CPU_ADAM=1 DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.3.7

Test installation of deepspeed you can with the following command: ds_report.

Example of inference of RuGPT3XL here or Open In Colab

Example of finetune, load finetuned model and generate is here.

For using sparse layers in model use --sparse-mode and specify key "sparse_attention" at deepspeed_config (RuGPT3XL config example). Modes can be: fixed, bigbird, bslongformer, variable, dense.

More information about sparse attention here.

Pretraining details

All pretraining was done on Nvidia Tesla V100-SXM3 32 Gb GPUs on a Christofari Cluster. Following are the details of pretraining for each model.

Pretraining ruGPT3XL

Model was trained with 512 sequence length using Deepspeed and Megatron code by SberDevices team, on 80B tokens dataset for 4 epochs. After that model was finetuned 1 epoch with sequence length 2048.
Note! Model has sparse attention blocks.

Total training time was around 10 days on 256 GPUs.
Final perplexity on test set is 12.05.

🤗 HuggingFace model card link.

See more details for generation here or Open In Colab.

Example of finetune, load finetuned model and generate is here.

Our pretraining script here

Example of finetuning script here

Pretraining ruGPT3Large

Model was trained with sequence length 1024 using transformers lib by SberDevices team on 80B tokens for 3 epochs. After that model was finetuned 1 epoch with sequence length 2048.

Total training time was around 14 days on 128 GPUs for 1024 context and few days on 16 GPUs for 2048 context.
Final perplexity on test set is 13.6.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3large_based_on_gpt2.

🤗 HuggingFace model card link

Our pretraining script here

Pretraining ruGPT3Medium

Model was trained with sequence length 1024 using transformers lib by SberDevices team on 80B tokens for 3 epoch. After that model was finetuned on 2048 context.

Total training time was around 16 days on 64 GPUs.
Final perplexity on test set is 17.4.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3medium_based_on_gpt2.

🤗 HuggingFace model card link

Our pretraining script here

Pretraining ruGPT3Small

Model was trained with sequence length 1024 using transformers by SberDevices team on 80B tokens around 3 epoch. After that model was finetuned on 2048 context.

Total training time took around one week on 32 GPUs.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3small_based_on_gpt2.

🤗 HuggingFace model card link

Our pretraining script here

Pretraining ruGPT2Large

Model was trained with sequence length 1024 using transformers by SberDevices team on 170Gb data on 64 GPUs 3 weeks.

You can obtain this model by using transformers with model name sberbank-ai/rugpt2large.

🤗 HuggingFace model card link

Advanced

Pretrained scripts (advanced)

Also we add pretraining scripts for all models (except RuGPT2Large). See scripts dir.

Note! All training params (such as lr, wd, ...) may was different while real training. This is just for example.

Convert checkpoint to HuggingFace

For converting megatron checkpoint to HuggingFace format use the following script (example for RuGPT3Small):

python convert2huggingface.py \
  --load /path/to/save/dir/ \
  --model-parallel-size 1 \
  --num-layers 12 \
  --hidden-size 768 \
  --num-attention-heads 12 \
  --max-position-embeddings 2048 \
  --tokenizer-path sberbank-ai/rugpt3small_based_on_gpt2 \
  --no-load-optim \
  --export-huggingface /path/to/converted/checkpoint

After converting we can use HuggingFace model:

from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained("/path/to/converted/checkpoint")

Note! Conversion is worked for all models except RuGPT3XL. For using of RuGPT3XL see example of inference of RuGPT3XL here or Open In Colab.

Owner
Sberbank AI
Sberbank AI
A Fast Command Analyser based on Dict and Pydantic

Alconna Alconna 隶属于ArcletProject, 在Cesloi内有内置 Alconna 是 Cesloi-CommandAnalysis 的高级版,支持解析消息链 一般情况下请当作简易的消息链解析器/命令解析器 文档 暂时的文档 Example from arclet.alcon

19 Jan 03, 2023
A Facebook Messenger Chatbot using NLP

A Facebook Messenger Chatbot using NLP This project is about creating a messenger chatbot using basic NLP techniques and models like Logistic Regressi

6 Nov 20, 2022
An end to end ASR Transformer model training repo

END TO END ASR TRANSFORMER 本项目基于transformer 6*encoder+6*decoder的基本结构构造的端到端的语音识别系统 Model Instructions 1.数据准备: 自行下载数据,遵循文件结构如下: ├── data │ ├── train │

旷视天元 MegEngine 10 Jul 19, 2022
Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers

beyond masking Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers The code is coming Figure 1: Pipeline of token-based pre-

Yunjie Tian 23 Sep 27, 2022
Collection of useful (to me) python scripts for interacting with napari

Napari scripts A collection of napari related tools in various state of disrepair/functionality. Browse_LIF_widget.py This module can be imported, for

5 Aug 15, 2022
Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module.

Import Subtitles for Blender VSE Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module. Supported formats by py

4 Feb 27, 2022
text to speech toolkit. 好用的中文语音合成工具箱,包含语音编码器、语音合成器、声码器和可视化模块。

ttskit Text To Speech Toolkit: 语音合成工具箱。 安装 pip install -U ttskit 注意 可能需另外安装的依赖包:torch,版本要求torch=1.6.0,=1.7.1,根据自己的实际环境安装合适cuda或cpu版本的torch。 ttskit的

KDD 483 Jan 04, 2023
The PyTorch based implementation of continuous integrate-and-fire (CIF) module.

CIF-PyTorch This is a PyTorch based implementation of continuous integrate-and-fire (CIF) module for end-to-end (E2E) automatic speech recognition (AS

Minglun Han 24 Dec 29, 2022
Chinese version of GPT2 training code, using BERT tokenizer.

GPT2-Chinese Description Chinese version of GPT2 training code, using BERT tokenizer or BPE tokenizer. It is based on the extremely awesome repository

Zeyao Du 5.6k Jan 04, 2023
Baseline code for Korean open domain question answering(ODQA)

Open-Domain Question Answering(ODQA)는 다양한 주제에 대한 문서 집합으로부터 자연어 질의에 대한 답변을 찾아오는 task입니다. 이때 사용자 질의에 답변하기 위해 주어지는 지문이 따로 존재하지 않습니다. 따라서 사전에 구축되어있는 Knowl

VUMBLEB 69 Nov 04, 2022
Knowledge Graph,Question Answering System,基于知识图谱和向量检索的医疗诊断问答系统

Knowledge Graph,Question Answering System,基于知识图谱和向量检索的医疗诊断问答系统

wangle 823 Dec 28, 2022
NLP library designed for reproducible experimentation management

Welcome to the Transfer NLP library, a framework built on top of PyTorch to promote reproducible experimentation and Transfer Learning in NLP You can

Feedly 290 Dec 20, 2022
Anomaly Detection 이상치 탐지 전처리 모듈

Anomaly Detection 시계열 데이터에 대한 이상치 탐지 1. Kernel Density Estimation을 활용한 이상치 탐지 train_data_path와 test_data_path에 존재하는 시점 정보를 포함하고 있는 csv 형태의 train data와

CLUST-consortium 43 Nov 28, 2022
Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision Training Efficiency We show the training efficiency of our DSLP model b

Chenyang Huang 37 Jan 04, 2023
Natural language processing summarizer using 3 state of the art Transformer models: BERT, GPT2, and T5

NLP-Summarizer Natural language processing summarizer using 3 state of the art Transformer models: BERT, GPT2, and T5 This project aimed to provide in

Samuel Sharkey 1 Feb 07, 2022
Code and data accompanying Natural Language Processing with PyTorch

Natural Language Processing with PyTorch Build Intelligent Language Applications Using Deep Learning By Delip Rao and Brian McMahan Welcome. This is a

Joostware 1.8k Jan 01, 2023
Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Official code for our Interspeech 2021 - Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*. Visually-grounded spoken language datasets c

Ian Palmer 3 Jan 26, 2022
Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision

Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Chenyang Huang 37 Jan 04, 2023
Multiple implementations for abstractive text summurization , using google colab

Text Summarization models if you are able to endorse me on Arxiv, i would be more than glad https://arxiv.org/auth/endorse?x=FRBB89 thanks This repo i

463 Dec 26, 2022