This repository serves as a place to document a toy attempt on how to create a generative text model in Catalan, based on GPT-2

Overview

GPT-2 in Catalan

This repository serves as a place to document a toy attempt on how to create a generative text model in Catalan, based on GPT-2. In other words... this is more of a prototype and a personal playground than a serious attempt to have a fully functional GPT-2 in Catalan.

Nevertheless, I hope this can also help someone else train their own GPT-2 model and provide some pointers on how to do so.

Suggestions and constructive criticism are always welcome!

1. GPT-2 📝

1.1. What is GPT-2

GPT-2 (GPT-2 stands for Generative Pre-trained Transformer 2) is a transformer-based language model trained in large volumes of data and was not trained with a specific task in mind. Nevertheless, it has probably been used mostly for generating new text.

A better and further explanation can be found here (http://jalammar.github.io/illustrated-gpt2/).

1.2. Why GPT-2

It is undeniable that GPT-2 played a large role and became very popular when it came out. It has also created some controversy. These aside, GPT-2 acted as a big step forward in terms of generating texts... And is also "faster" to train on custom data than its next generation sibling, GPT-3.

2. Training 🔨

2.1. Requirements 📎

You will need a powerful GPU or reduce the batch size. You can also use a VM from a Cloud service such as Google Colab or Microsoft Azure.

2.2. Training Script 📈

The training is implemented in the train_GPT2.py script, which serves as a skeleton. You can run it from the Commandline and passing all the arguments.

e.g.

cd src
./train_GPT2.py \
    --model DeepESP/gpt2-spanish \
    --tokenizer DeepESP/gpt2-spanish \
    --train_path ../data/catalan_corpus_train.csv \
    --test_path ../data/catalan_corpus_test.csv \
    --n_epochs 1 \
    --train_batch_size 4 \
    --eval_batch_size 8 \
    --eval_steps 100 \
    --save_steps 1000 \
    --warmup_steps 100 \
    --output gpt2-catalan

2.3. About the data used 📂 open_file_folder

The data used has mostly been the WikiCorpus data provided by the Computer Science department @ FIB, UPC (Facultat d'Informàtica de Barcelona, Universitat Politècnica de Catalunya).

You can download it using the datasets library from Huggingface:

from datasets import load_dataset

dataset = load_dataset("wikicorpus, 'raw_ca')

Or you can use the download_wikicorpus.py file in this repository, which also splits the data in train/test and can create a smaller subset for testing, if desired.

2.3.1. WikiCorpus PROs 👍

Well, the data is already obtained. That's always a pro.

2.3.2. WikiCorpus CONs 👎

We are limiting the knowledge of the Language model to data from the Wikipedia. Therefore, this model will probably be more error-prone with informal text inputs. This includes data from chats, colloquialisms and text from social media.

Additionally, the size of the data is tiny with respect to what it should be.

Further training for specific tasks

Once the model is trained in Catalan and we have a base, we can further train this model for a specific task in mind.

A couple of Proof of Concepts (PoC) have been done using data gathered from Twitter and also from Catalan songs.

Testing the model 🐱

We can test the trained model easily using the script test_generation.py.

cd src
python .\test_generation.py -t DeepESP/gpt2-spanish -m ../data/gpt2-catalan -i generation_test.txt

3. Questions

3.1. Why Catalan

Artificial Intelligence should not be only for largely spoken languages, such as English or even Spanish. Catalan, a minority language, is my mother tongue and it's always fun to see something you work with also operating in your own language. So why not?

3.2. Why use a Pretrained model in Spanish

Although Spanish and Catalan are different languages, they share a lot of expressions, vocabulary and grammatical structures. Therefore, basing a Catalan model on a previously trained model in a close language such as Spanish is not unreasonable.

Transferring the knowledge from it to our model is better than starting from zero, specially to save computational time.

3.3. Can I use another data/language

Even though the scripts are all prepared with the Catalan language in mind, the scripts should work with any text data, be it Catalan from the Wikicorpus,

Feel free to change the CatalanDataset class or swap it with yours, since probably formatting of the input text is the most varying aspect between projects.

Be sure to also change the base model, since if you want to train another language (e.g. German), basing it on a pre-trained model in Spanish will not work well.

4. TO-DO 🚧

Since we are actually using the Transfer learning approach and relying on a previously pretrained model in Spanish, we probably don't have as an accurate model as we should.

More varied data should also be used during the training, because it is very biased towards informative data (for obvious reasons).

Owner
Laura
.
Laura
A library for Multilingual Unsupervised or Supervised word Embeddings

MUSE: Multilingual Unsupervised and Supervised Embeddings MUSE is a Python library for multilingual word embeddings, whose goal is to provide the comm

Facebook Research 3k Jan 06, 2023
An open-source NLP research library, built on PyTorch.

An Apache 2.0 NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks. Quic

AI2 11.4k Jan 01, 2023
VMD Audio/Text control with natural language

This repository is a proof of principle for performing Molecular Dynamics analysis, in this case with the program VMD, via natural language commands.

Andrew White 13 Jun 09, 2022
Python library to make development of portfolio analysis faster and easier

Trafalgar Python library to make development of portfolio analysis faster and easier Installation 🔥 For the moment, Trafalgar is still in beta develo

Santosh Passoubady 641 Jan 01, 2023
Conditional Transformer Language Model for Controllable Generation

CTRL - A Conditional Transformer Language Model for Controllable Generation Authors: Nitish Shirish Keskar, Bryan McCann, Lav Varshney, Caiming Xiong,

Salesforce 1.7k Dec 28, 2022
EdiTTS: Score-based Editing for Controllable Text-to-Speech

Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech

Neosapience 99 Jan 02, 2023
Chinese Named Entity Recognization (BiLSTM with PyTorch)

BiLSTM-CRF for Name Entity Recognition PyTorch version A PyTorch implemention of Bi-LSTM-CRF model for Chinese Named Entity Recognition. 使用 PyTorch 实现

5 Jun 01, 2022
Include MelGAN, HifiGAN and Multiband-HifiGAN, maybe NHV in the future.

Fast (GAN Based Neural) Vocoder Chinese README Todo Submit demo Support NHV Discription Include MelGAN, HifiGAN and Multiband-HifiGAN, maybe include N

Zhengxi Liu (刘正曦) 134 Dec 16, 2022
Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Spanish Language Models 💃🏻 A repository part of the MarIA project. Corpora 📃 Corpora Number of documents Number of tokens Size (GB) BNE 201,080,084

Plan de Tecnologías del Lenguaje - Gobierno de España 203 Dec 20, 2022
Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

Machel Reid 82 Dec 19, 2022
GooAQ 🥑 : Google Answers to Google Questions!

This repository contains the code/data accompanying our recent work on long-form question answering.

AI2 112 Nov 06, 2022
DeLighT: Very Deep and Light-Weight Transformers

DeLighT: Very Deep and Light-weight Transformers This repository contains the source code of our work on building efficient sequence models: DeFINE (I

Sachin Mehta 440 Dec 18, 2022
An official repository for tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a University of Edinburgh master's course.

PMR computer tutorials on HMMs (2021-2022) This is a repository for computer tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a Univer

Vaidotas Šimkus 10 Dec 06, 2022
GPT-3: Language Models are Few-Shot Learners

GPT-3: Language Models are Few-Shot Learners arXiv link Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-trainin

OpenAI 12.5k Jan 05, 2023
NLTK Source

Natural Language Toolkit (NLTK) NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets, and tutorials supporting

Natural Language Toolkit 11.4k Jan 04, 2023
End-2-end speech synthesis with recurrent neural networks

Introduction New: Interactive demo using Google Colaboratory can be found here TTS-Cube is an end-2-end speech synthesis system that provides a full p

Tiberiu Boros 214 Dec 07, 2022
Implementation of ProteinBERT in Pytorch

ProteinBERT - Pytorch (wip) Implementation of ProteinBERT in Pytorch. Original Repository Install $ pip install protein-bert-pytorch Usage import torc

Phil Wang 92 Dec 25, 2022
A script that automatically creates a branch name using google translation api and jira api

About google translation api와 jira api을 사용하여 자동으로 브랜치 이름을 만들어주는 스크립트 Setup 환경변수에 다음 3가지를 등록해야 한다. JIRA_USER : JIRA email (ex: hyunwook.kim 2 Dec 20, 2021

Voilà turns Jupyter notebooks into standalone web applications

Rendering of live Jupyter notebooks with interactive widgets. Introduction Voilà turns Jupyter notebooks into standalone web applications. Unlike the

Voilà Dashboards 4.5k Jan 03, 2023
A Python script which randomly chooses and prints a file from a directory.

___ ____ ____ _ __ ___ / _ \ | _ \ | _ \ ___ _ __ | '__| / _ \ | |_| || | | || | | | / _ \| '__| | | | __/ | _ || |_| || |_| || __

yesmaybenookay 0 Aug 06, 2021