DziriBERT: a Pre-trained Language Model for the Algerian Dialect

Last update: Jan 07, 2023

Overview

DziriBERT

DziriBERT is the first Transformer-based language model pre-trained specifically for the Algerian dialect. It handles Algerian text written in both Arabic and Latin characters. It sets new state-of-the-art results on Algerian text classification datasets, even though it was pre-trained on much less data (~1 million tweets).

The model is publicly available at: https://huggingface.co/alger-ia/dziribert.

For more information, please see our paper: https://arxiv.org/pdf/2109.12346.pdf

Evaluation

The Twifil dataset was used to compare DziriBERT with current multilingual, standard Arabic and dialectal Arabic models:

Model	Sentiment acc.	Emotion acc.
bert-base-multilingual-cased	73.6 %	59.4 %
aubmindlab/bert-base-arabert	72.1 %	61.2 %
CAMeL-Lab/bert-base-arabic-camelbert-mix	77.1 %	65.7 %
qarib/bert-base-qarib	77.7 %	67.6 %
UBC-NLP/MARBERT	80.1 %	68.4 %
alger-ia/dziribert	80.3 %	69.3 %

To reproduce these results, first install the requirements:

pip install -r requirements.txt

Then, run the following evaluation script:

python3 evaluate_model.py

These results were obtained on a Tesla K80 GPU.
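The evaluation script itself is not reproduced here. As a rough illustration of what the accuracy columns measure and how one of the compared checkpoints could be loaded with a classification head, here is a minimal sketch. It is not the content of evaluate_model.py: the helper names `accuracy` and `load_classifier` are hypothetical, and the default `num_labels=3` is an assumption that must be replaced with the actual number of classes for the task.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of examples whose predicted class matches the gold label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def load_classifier(model_name: str = "alger-ia/dziribert", num_labels: int = 3):
    """Load a checkpoint with a freshly initialised classification head.

    num_labels=3 is an assumed placeholder, not a value from the paper.
    The head is untrained and must be fine-tuned before evaluation.
    """
    # Imported lazily so that accuracy() works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model
```

Passing any model name from the table above as `model_name` would load that baseline in the same way, which keeps the comparison setup identical across models.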

Pretrained DziriBERT

DziriBERT has been uploaded to the HuggingFace hub in order to facilitate its use: https://huggingface.co/alger-ia/dziribert.

It can be easily downloaded and loaded using the transformers library:

from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("alger-ia/dziribert")
model = BertForMaskedLM.from_pretrained("alger-ia/dziribert")
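Once loaded, the masked language model can be queried for the most likely tokens at a masked position. The following is a minimal sketch rather than code from the original repository: the helper names `mask_word` and `top_predictions` are hypothetical, and it assumes the `fill-mask` pipeline from transformers is installed and that the model can be downloaded from the hub.

```python
from typing import List, Tuple


def mask_word(sentence: str, word: str, mask_token: str = "[MASK]") -> str:
    """Replace the first occurrence of `word` in `sentence` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


def top_predictions(masked_sentence: str, k: int = 5) -> List[Tuple[str, float]]:
    """Return the k most likely fillers for the masked position with their scores."""
    # Imported lazily so that mask_word() works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="alger-ia/dziribert")
    # Each prediction is a dict that includes "token_str" and "score".
    predictions = fill_mask(masked_sentence, top_k=k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

For example, `top_predictions("أنا من الجزائر من ولاية [MASK]")` would ask the model to complete a sentence meaning "I am from Algeria, from the province of ...". The actual suggestions depend on the downloaded weights.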

How to cite

@article{dziribert,
  title={DziriBERT: a Pre-trained Language Model for the Algerian Dialect},
  author={Abdaoui, Amine and Berrimi, Mohamed and Oussalah, Mourad and Moussaoui, Abdelouahab},
  journal={arXiv preprint arXiv:2109.12346},
  year={2021}
}

Contact

Please contact [email protected] with any questions, feedback, or requests.
