Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks

Overview

NERDA

Not only is NERDA a mesmerizing muppet-like character. NERDA is also a Python package that offers a slick, easy-to-use interface for fine-tuning pretrained transformers for Named Entity Recognition (NER) tasks.

You can also use NERDA to access a selection of precooked NERDA models that you can use right off the shelf for NER tasks.

NERDA is built on Hugging Face transformers and the popular PyTorch framework.

Installation guide

NERDA can be installed from PyPI with

pip install NERDA

If you want the development version then install directly from GitHub.
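For example (assuming the package lives in the ebanalyse/NERDA GitHub repository; adjust the URL if yours differs):

pip install git+https://github.com/ebanalyse/NERDA.git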

Named-Entity Recognition tasks

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Example

Task

Identify person names and organizations in text:

Jim bought 300 shares of Acme Corp.

Solution

Named Entity   Type
'Jim'          Person
'Acme Corp.'   Organization
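In the BIO tagging scheme commonly used for NER data sets such as CoNLL-2003 (and seen in NERDA's predictions later in this README), this solution corresponds to one tag per token. Purely as an illustration:

tokens = ['Jim', 'bought', '300', 'shares', 'of', 'Acme', 'Corp.']
tags   = ['B-PER', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG']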

Read more about NER on Wikipedia.

Train Your Own NERDA Model

Say we want to fine-tune a pretrained Multilingual BERT transformer for NER in English.

Load package.

from NERDA.models import NERDA

Instantiate a NERDA model (with default settings) for the CoNLL-2003 English NER data set.

from NERDA.datasets import get_conll_data
model = NERDA(dataset_training = get_conll_data('train'),
              dataset_validation = get_conll_data('valid'),
              transformer = 'bert-base-multilingual-uncased')

By default, the network architecture is analogous to that of the models in Hvingelby et al. 2020.

The model can then be trained/fine-tuned by invoking the train method, e.g.

model.train()

Note: this will take some time depending on the specs of your machine (if you want to skip training, you can go ahead and use one of the models that we have already precooked for you instead).

After the model has been trained, it can be used for predicting named entities in new texts.

# text to identify named entities in.
text = 'Old MacDonald had a farm'
model.predict_text(text)
([['Old', 'MacDonald', 'had', 'a', 'farm']], [['B-PER', 'I-PER', 'O', 'O', 'O']])

This means that the model identified 'Old MacDonald' as a PERson.

Please note that the NERDA model configuration above was instantiated with all default settings. You can, however, customize your NERDA model in many ways:

  • Use your own data set (fine-tune a transformer for any given language)
  • Choose whatever transformer you like
  • Set all of the hyperparameters for the model (see the sketch after this list)
  • Even apply your own network architecture
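For illustration, a customized configuration might look like the sketch below. tag_scheme, tag_outside and hyperparameters follow NERDA's documented interface, but treat the exact argument names and the data-set format as assumptions to verify against the documentation:

from NERDA.models import NERDA

# Sketch of a customized NERDA model; my_training_data/my_validation_data are
# placeholders for your own data set in NERDA's expected sentences/tags format.
model = NERDA(dataset_training = my_training_data,
              dataset_validation = my_validation_data,
              tag_scheme = ['B-PER', 'I-PER', 'B-ORG', 'I-ORG'],  # entity tags in your data
              tag_outside = 'O',                                  # tag for non-entity tokens
              transformer = 'bert-base-multilingual-uncased',     # any Hugging Face transformer
              hyperparameters = {'epochs': 4,
                                 'warmup_steps': 500,
                                 'train_batch_size': 13,
                                 'learning_rate': 0.0001})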

Read more about advanced usage of NERDA in the detailed documentation.

Use a Precooked NERDA model

We have precooked a number of NERDA models for Danish and English that you can download and use right off the shelf.

Here is an example.

Instantiate a multilingual BERT model that has been fine-tuned for NER in Danish, DA_BERT_ML.

from NERDA.precooked import DA_BERT_ML
model = DA_BERT_ML()

(Down)load the network from the web:

model.download_network()
model.load_network()

You can now predict named entities in new (Danish) texts:

# (Danish) text to identify named entities in:
# 'Jens Hansen har en bondegård' = 'Old MacDonald had a farm'
text = 'Jens Hansen har en bondegård'
model.predict_text(text)
([['Jens', 'Hansen', 'har', 'en', 'bondegård']], [['B-PER', 'I-PER', 'O', 'O', 'O']])

List of Precooked Models

The table below shows the precooked NERDA models publicly available for download.

Model          Language  Transformer        Dataset     F1-score
DA_BERT_ML     Danish    Multilingual BERT  DaNE        82.8
DA_ELECTRA_DA  Danish    Danish ELECTRA     DaNE        79.8
EN_BERT_ML     English   Multilingual BERT  CoNLL-2003  90.4
EN_ELECTRA_EN  English   English ELECTRA    CoNLL-2003  89.1

The F1-score is the micro-averaged F1-score across entity tags, evaluated on the respective test sets (which were used neither for training nor for validation of the models).
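If you train your own model, you can evaluate it on a held-out test set in the same way; evaluate_performance is part of NERDA's interface, but treat the exact call below as an assumption to check against the documentation:

test = get_conll_data('test')
model.evaluate_performance(test)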

Note that we have not spent a lot of time on actually fine-tuning the models, so there could be room for improvement. If you are able to improve the models, we will be happy to hear from you and include your NERDA model.

Model Performance

The table below summarizes the performance (F1-scores) of the precooked NERDA models.

Level      DA_BERT_ML  DA_ELECTRA_DA  EN_BERT_ML  EN_ELECTRA_EN
B-PER      93.8        92.0           96.0        95.1
I-PER      97.8        97.1           98.5        97.9
B-ORG      69.5        66.9           88.4        86.2
I-ORG      69.9        70.7           85.7        83.1
B-LOC      82.5        79.0           92.3        91.1
I-LOC      31.6        44.4           83.9        80.5
B-MISC     73.4        68.6           81.8        80.1
I-MISC     86.1        63.6           63.4        68.4
AVG_MICRO  82.8        79.8           90.4        89.1
AVG_MACRO  75.6        72.8           86.3        85.3
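AVG_MICRO pools true/false positives and false negatives across all entity tags before computing F1, whereas AVG_MACRO is the unweighted mean of the per-tag F1-scores. A minimal, self-contained illustration of the difference (not NERDA code; the counts are made up):

# Micro- vs. macro-averaged F1 from per-tag counts.
def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical (true positive, false positive, false negative) counts per tag.
counts = {'B-PER': (90, 10, 5), 'B-ORG': (60, 30, 25)}

macro = sum(f1(*c) for c in counts.values()) / len(counts)  # average of per-tag F1s
micro = f1(*map(sum, zip(*counts.values())))                # F1 of the pooled counts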

'NERDA'?

'NERDA' originally stood for 'Named Entity Recognition for DAnish'. However, this is somewhat misleading, since the functionality is no longer limited to Danish. On the contrary, it generalizes: NERDA supports fine-tuning of transformers for NER tasks in any language.

Background

NERDA is developed as a part of Ekstra Bladet's activities on Platform Intelligence in News (PIN). PIN is an industrial research project carried out in collaboration between the Technical University of Denmark, the University of Copenhagen, and Copenhagen Business School, with funding from Innovation Fund Denmark. The project runs from 2020 to 2023 and develops recommender systems and natural language processing systems geared for news publishing, some of which, like NERDA, are open-sourced.

Read more

The detailed documentation for NERDA including code references and extended workflow examples can be accessed here.

Contact

We hope that you will find NERDA useful.

Please direct any questions and feedback to us!

If you want to contribute (which we encourage you to do), open a PR.

If you encounter a bug or want to suggest an enhancement, please open an issue.
