AfriBERTa: Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages

Last update: Nov 24, 2022

This repository contains the code for the paper "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages", which appeared at the first Workshop on Multilingual Representation Learning at EMNLP 2021.

AfriBERTa was trained on 11 languages: Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya, and Yorùbá. It was evaluated on named entity recognition (NER) and text classification across 10 languages, some of which it was not pretrained on. It outperformed mBERT and XLM-R on several languages and is competitive overall.

Pretrained models

We release the following pretrained models:

AfriBERTa Small (97M params)
AfriBERTa Base (111M params)
AfriBERTa Large (126M params)
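
As a rough illustration of how one of these checkpoints might be loaded, the sketch below uses the Hugging Face `transformers` Auto classes. The Hub namespace `castorini` and the `afriberta_<size>` naming are assumptions that are not stated in this README, and the helper names `afriberta_hub_id` and `load_afriberta` are hypothetical; only the parameter counts come from the list above.

```python
# Parameter counts come from the README list; the Hub IDs are an assumption.
AFRIBERTA_PARAMS = {"small": "97M", "base": "111M", "large": "126M"}


def afriberta_hub_id(size: str, namespace: str = "castorini") -> str:
    """Build an assumed Hugging Face Hub ID, e.g. 'castorini/afriberta_base'."""
    key = size.strip().lower()
    if key not in AFRIBERTA_PARAMS:
        raise ValueError(
            f"unknown size {size!r}; expected one of {sorted(AFRIBERTA_PARAMS)}"
        )
    return f"{namespace}/afriberta_{key}"


def load_afriberta(size: str = "base"):
    """Download tokenizer and masked-LM weights (needs `transformers` and network access)."""
    # Imported lazily so the ID helper above works without transformers installed.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    hub_id = afriberta_hub_id(size)  # assumed ID; verify it before relying on it
    tokenizer = AutoTokenizer.from_pretrained(hub_id)
    model = AutoModelForMaskedLM.from_pretrained(hub_id)
    return tokenizer, model
```

Calling `load_afriberta("small")` would fetch the smallest checkpoint if the assumed ID is correct. For NER or text classification, the corresponding `AutoModelForTokenClassification` or `AutoModelForSequenceClassification` class could likely be swapped in.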

Reproducing Experiments

Datasets and Tokenizer

Below are details on how to obtain the datasets and trained sentencepiece tokenizer:

Language Modelling: The data for language modelling can be downloaded from this URL

NER: To obtain the NER dataset, please download it from this repository

Text Classification: To obtain the topic classification dataset, please download it from this repository

Tokenizer: The trained sentencepiece tokenizer can be downloaded from this URL

Training

To train AfriBERTa and evaluate on both downstream tasks, simply install all requirements in requirements.txt, download the relevant datasets and run the following script:

bash run_all.sh

This script will:

Train the multilingual language model from scratch and save the model as well as relevant logs
Evaluate the trained language model on NER for all ten languages over 5 seeds
Evaluate the trained language model on text classification for both languages over 5 seeds
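
Because each downstream evaluation is repeated over 5 seeds, the per-seed scores usually need to be aggregated. The sketch below shows one common way to do this (mean and sample standard deviation). It is not taken from `run_all.sh`: the function name `summarize_seeds` is hypothetical, and the scores are placeholders rather than results from the paper.

```python
from statistics import mean, stdev


def summarize_seeds(scores_by_seed: dict[int, float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of a metric across seeds."""
    values = list(scores_by_seed.values())
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(values), stdev(values)


# Placeholder F1 values for five hypothetical seeds; NOT results from the paper.
placeholder_f1 = {1: 0.80, 2: 0.82, 3: 0.81, 4: 0.79, 5: 0.83}
avg, spread = summarize_seeds(placeholder_f1)
print(f"F1 over {len(placeholder_f1)} seeds: {avg:.3f} +/- {spread:.3f}")
```

Reporting the spread alongside the mean makes it easier to judge whether a gap between two models is larger than seed-to-seed variation.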

Citation

@inproceedings{ogueji-etal-2021-small,
    title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
    author = "Ogueji, Kelechi  and
      Zhu, Yuxin  and
      Lin, Jimmy",
    booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.mrl-1.11",
    pages = "116--126",
}

Owner

Kelechi
