KIND: an Italian Multi-Domain Dataset for Named Entity Recognition

KIND (Kessler Italian Named-entities Dataset)

KIND is an Italian dataset for Named-Entity Recognition.

It contains more than one million tokens annotated with three classes: persons, locations, and organizations. Most of the dataset (around 600K tokens) has manual gold annotations and spans three domains: news, literature, and political discourse.

To build the dataset, we use only freely available texts whose license permits both research and commercial use.

In particular, we release four chapters with texts taken from: (i) Wikinews (WN), as a source of news texts from recent decades; (ii) Italian fiction books (FIC) whose authors died more than 70 years ago; (iii) writings and speeches by the Italian politician Aldo Moro (AM); and (iv) writings and speeches by the Italian politician Alcide De Gasperi (ADG).

Wikinews

Wikinews is a free, multilingual project of collaborative journalism. The Italian chapter contains more than 11,000 news articles, released under the Creative Commons Attribution 2.5 License.

In building KIND, we randomly choose 1,000 articles evenly distributed over the last 20 years, for a total of 308,622 tokens.
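The selection procedure is not spelled out beyond "random" and "evenly distributed", so the following is only a minimal sketch of one plausible reading: stratified sampling with an equal quota per publication year. The function name `sample_evenly_by_year`, the record layout (`id`, `year`), the year range, and the synthetic article list are all assumptions made for illustration, not part of KIND.

```python
import random
from collections import Counter, defaultdict


def sample_evenly_by_year(articles, n_total, seed=0):
    """Draw n_total articles with (near-)equal counts per publication year.

    Hypothetical helper: `articles` is a list of dicts with a "year" key.
    """
    rng = random.Random(seed)

    # Group the articles by publication year.
    by_year = defaultdict(list)
    for article in articles:
        by_year[article["year"]].append(article)

    # Split the budget evenly; the first `remainder` years get one extra.
    years = sorted(by_year)
    quota, remainder = divmod(n_total, len(years))

    sample = []
    for i, year in enumerate(years):
        k = quota + (1 if i < remainder else 0)
        if k > len(by_year[year]):
            raise ValueError(f"year {year} has too few articles for quota {k}")
        sample.extend(rng.sample(by_year[year], k))
    return sample


# Synthetic stand-in for an article dump: 20 years, 550 articles per year.
articles = [
    {"id": f"{year}-{i}", "year": year}
    for year in range(2001, 2021)
    for i in range(550)
]

selected = sample_evenly_by_year(articles, n_total=1000)
per_year = Counter(a["year"] for a in selected)
print(len(selected), "articles,", len(per_year), "years")
```

With 1,000 articles over 20 years the quota works out to 50 articles per year. A real dump is unlikely to be this uniform, which is why the sketch raises an error instead of silently under-sampling a sparse year.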

Literature

For fiction, we annotate 86 book chapters taken from 10 books by Italian authors who all died more than 70 years ago, for a total of 192,448 tokens. The plain texts are taken from the Liber Liber website.

In particular, we choose: Il giorno delle Mésules (Ettore Castiglioni, 12,853 tokens), L'amante di Cesare (Augusto De Angelis, 13,464 tokens), Canne al vento (Grazia Deledda, 13,945 tokens), 1861-1911 - Cinquant’anni di vita nazionale ricordati ai fanciulli (Guido Fabiani, 10,801 tokens), Lettere dal carcere (Antonio Gramsci, 10,655 tokens), Anarchismo e democrazia (Errico Malatesta, 11,557 tokens), L'amore negato (Maria Messina, 31,115 tokens), La luna e i falò (Cesare Pavese, 10,705 tokens), La coscienza di Zeno (Italo Svevo, 56,364 tokens), Le cose più grandi di lui (Luciano Zuccoli, 20,989 tokens).

In selecting works without copyright, we favored texts as recent as possible, so that a model trained on this data can be applied effectively to contemporary novels, whose language is more likely to resemble that of these texts.
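As a quick consistency check, the per-book counts listed above can be added up and compared with the stated FIC total of 192,448 tokens. This is a small sketch using only the figures from the list; the variable names are illustrative.

```python
# Per-book token counts exactly as listed above (title -> tokens).
FIC_TOKENS = {
    "Il giorno delle Mésules": 12_853,
    "L'amante di Cesare": 13_464,
    "Canne al vento": 13_945,
    "1861-1911 - Cinquant’anni di vita nazionale ricordati ai fanciulli": 10_801,
    "Lettere dal carcere": 10_655,
    "Anarchismo e democrazia": 11_557,
    "L'amore negato": 31_115,
    "La luna e i falò": 10_705,
    "La coscienza di Zeno": 56_364,
    "Le cose più grandi di lui": 20_989,
}

fic_total = sum(FIC_TOKENS.values())

# Identify the book that contributes the most tokens and its share.
largest_title = max(FIC_TOKENS, key=FIC_TOKENS.get)
largest_share = FIC_TOKENS[largest_title] / fic_total

print(f"{len(FIC_TOKENS)} books, {fic_total:,} tokens")
print(f"largest: {largest_title} ({largest_share:.1%})")
```

The sum matches the stated total. It also shows that the chapter is not balanced across books: La coscienza di Zeno alone accounts for a bit under a third of the tokens, which may matter when splitting the data by book.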

Aldo Moro's Works

Aldo Moro's writings have recently been collected by the University of Bologna and published on a platform called Edizione Nazionale delle Opere di Aldo Moro.

The project is still ongoing and currently contains 806 documents, for a total of about one million tokens.

In the first release of KIND, we include 392,604 tokens from Aldo Moro's works, with silver annotations (see the reference below).

Alcide De Gasperi's Writings

Finally, we annotate 158 documents (150,632 tokens) from Alcide Digitale, spanning 50 years of European history.

The complete corpus contains a comprehensive collection of Alcide De Gasperi’s public documents, 2,762 in total, written or transcribed between 1901 and 1954.
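The per-chapter token counts given in the sections above can be combined to check the overview figures ("more than one million tokens", "around 600K" gold). The sketch below assumes that WN, FIC, and ADG are the gold-annotated chapters and AM the silver one; this split is inferred from the text, not stated as a table in the source.

```python
# Token counts per chapter, as stated in the sections above.
GOLD = {"WN": 308_622, "FIC": 192_448, "ADG": 150_632}  # assumed gold chapters
SILVER = {"AM": 392_604}                                 # silver annotations

gold_total = sum(GOLD.values())
silver_total = sum(SILVER.values())
total = gold_total + silver_total

print(f"gold:   {gold_total:,} tokens")
print(f"silver: {silver_total:,} tokens")
print(f"total:  {total:,} tokens ({gold_total / total:.1%} gold)")
```

Under that assumption the gold portion comes to 651,702 tokens and the whole release to 1,044,306 tokens, in line with the rounded figures in the overview.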

License

The NER annotations in (i), (ii), and (iii) are released under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. Annotations from Alcide De Gasperi's writings are released under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Owner
Digital Humanities
Digital Humanities Unit at Fondazione Bruno Kessler