Journalism AI – Quotes extraction for modular journalism

This repo contains the code for the Guardian and AFP contribution to the JournalismAI Festival 2021.

Further reading can be found in our blog post.

The aim of the project is to extract quotes from news articles using Named Entity Recognition, add coreferencing information, and format the results for an exploratory search tool.

The contribution consists of several self-contained pieces of work, namely:

a regular expression pipeline that attempts to extract quotes by pattern matching
a rule set to define different types of quotes and guide the quote annotation
custom annotation recipes for the Prodigy software enabling quick and efficient data annotation
a post-processing pipeline for extracting quotes using a trained spaCy model and adding coreferencing information
example data and data schema for displaying the extracted quote information in a search tool
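
To give a feel for the pattern-matching approach, here is a minimal sketch of regex-based quote extraction. It is not the code in this repo: the pattern, the list of reporting verbs and the function name `extract_quotes` are all assumptions made for illustration, and it only handles the simple "quote, verb, speaker" form.

```python
import re

# Hypothetical pattern (not taken from this repo): a quoted span, followed by
# a reporting verb, followed by a capitalised speaker name.
QUOTE_THEN_SPEAKER = re.compile(
    r'["“](?P<quote>[^"”]+)["”],?\s+'
    r'(?P<verb>said|says|told|added)\s+'
    r'(?P<speaker>[A-Z][\w-]*(?: [A-Z][\w-]*)*)'
)


def extract_quotes(text):
    """Return one dict per match with the quote, reporting verb and speaker."""
    results = []
    for match in QUOTE_THEN_SPEAKER.finditer(text):
        results.append(
            {
                # Drop surrounding whitespace and a trailing comma inside the quote marks.
                "quote": match.group("quote").strip().rstrip(","),
                "verb": match.group("verb"),
                "speaker": match.group("speaker"),
            }
        )
    return results


if __name__ == "__main__":
    sample = (
        "“We will publish the findings next week,” said Jane Smith. "
        "The minister declined to comment."
    )
    for item in extract_quotes(sample):
        print(item)
```

Patterns like this miss many real-world cases (speaker before the quote, quotes split across sentences, pronoun speakers), which is one reason the project also covers annotation and a trained model.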

Repo structure

Each folder in this repo reflects one of the pieces of work mentioned above.

regex_pipeline/ – code to run the regular expression-based quote extraction
annotation_rules/ – document with rules and definitions to guide the quote annotation step
annotation_scripts/ – custom annotation scripts for Prodigy
coreference/ – proof of concept for a rule-based coreferencing tool
schema/ – data output schema and example data

Each folder contains a separate README file with instructions to set up and run each piece of work.
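
As a rough idea of what a rule-based coreferencing step can look like, the sketch below replaces pronoun speakers with the most recently named speaker. This is an assumed, simplified rule written for illustration only; the function name `resolve_speakers`, the record fields and the pronoun list are hypothetical and do not describe the code in coreference/.

```python
# Hypothetical pronoun list; a real rule set would be larger and language-aware.
PRONOUNS = {"he", "she", "they"}


def resolve_speakers(quotes):
    """Replace pronoun speakers with the most recent named speaker.

    `quotes` is a list of dicts with a "speaker" key, in document order.
    Returns new dicts and leaves the input untouched.
    """
    resolved = []
    last_named = None
    for record in quotes:
        speaker = record["speaker"]
        if speaker.lower() in PRONOUNS:
            # Keep the original pronoun so the substitution can be audited.
            # If no named speaker has been seen yet, the speaker stays None.
            resolved.append({**record, "speaker": last_named, "resolved_from": speaker})
        else:
            last_named = speaker
            resolved.append(dict(record))
    return resolved


if __name__ == "__main__":
    records = [
        {"quote": "We will publish the findings next week", "speaker": "Jane Smith"},
        {"quote": "It took months of work", "speaker": "she"},
    ]
    for item in resolve_speakers(records):
        print(item)
```

A "most recent named speaker" rule is easy to reason about but fails when several people are quoted in turn, so it should be read as a starting point rather than a working resolver.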

Owner

Journalism AI collab 2021
