Data preprocessing rosetta parser for Python

Overview

datapreprocessing_rosetta_parser

I've never done any NLP or text data processing before, so I wanted to use this hackathon as a learning opportunity, specifically targeting popular packages like pandas, BeautifulSoup and spaCy.

The main idea of my project is to recreate Jelle Teijema's preprocessing pipeline and then run a Dutch language model on each document to extract things of interest, such as emails, URLs, organizations, people and dates. Maybe at this point it shouldn't be considered just preprocessing, hmmm. Anyway, I've used the nl_core_news_lg model. It is not very reliable, especially for organization and person names, but it still allows for interesting queries.
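
As a rough illustration of the extraction step (not necessarily how generate.py does it), the sketch below groups items from a spaCy Doc into plain lists. The function name collect_entities and the label names PERSON, ORG and DATE are assumptions; the labels a given model version emits can be checked with nlp.get_pipe("ner").labels.

```python
from collections import defaultdict


def collect_entities(doc, labels=("PERSON", "ORG", "DATE")):
    """Group interesting items from a spaCy-style Doc into plain lists.

    `doc` must be iterable over tokens exposing `.text`, `.like_email` and
    `.like_url`, and have `.ents` whose items expose `.text` and `.label_`.
    A spaCy Doc provides all of these.
    """
    found = defaultdict(list)
    for token in doc:
        # Check emails first so an address is never also counted as a URL.
        if token.like_email:
            found["emails"].append(token.text)
        elif token.like_url:
            found["urls"].append(token.text)
    for ent in doc.ents:
        # The label names are an assumption; they depend on the model version.
        if ent.label_ in labels:
            found[ent.label_].append(ent.text)
    return dict(found)


# Usage, assuming spaCy and the Dutch model are installed:
#   import spacy
#   nlp = spacy.load("nl_core_news_lg")
#   print(collect_entities(nlp("Stuur een mail naar jan@example.org")))
```

Keeping the function independent of the loaded pipeline makes it easy to try other models or label sets without touching the rest of the script.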

Moreover, I've decided to try summarizing the documents and collecting their most frequent words. My script tries to find the N_SUMMARY_SENTENCES most important sentences and stores them in the summary column. Please note, my Dutch is not very strong, so I can't really judge how well it works :)
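
One common way to do this kind of extractive summary is to score each sentence by how frequent its words are in the whole document. The sketch below uses only the standard library and is an assumption about the general approach, not a copy of what generate.py does. Only the name N_SUMMARY_SENTENCES comes from the script; the helper names, the three-letter word filter and the value 2 are made up for illustration.

```python
import re
from collections import Counter

N_SUMMARY_SENTENCES = 2  # name from the README; the value 2 is an assumption


def tokenize(text):
    """Lowercase words of at least three letters (a crude stop-word filter)."""
    return [w for w in re.findall(r"[^\W\d_]+", text.lower()) if len(w) >= 3]


def top_words(text, n=5):
    """Return the n most frequent words, most frequent first."""
    return [word for word, _ in Counter(tokenize(text)).most_common(n)]


def summarize(text, n_sentences=N_SUMMARY_SENTENCES):
    """Keep the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))
    # Score of a sentence = summed document frequency of its words.
    scores = [sum(freq[w] for w in tokenize(s)) for s in sentences]
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(best[:n_sentences])
    return " ".join(sentences[i] for i in keep)
```

Summing frequencies favours long sentences; dividing the score by the sentence length is a simple variation if that turns out to be a problem.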

Finally, the script also saves the cleaned title and file contents, as per the track's anticipated output.
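
What "cleaned" means is up to the script. As a minimal sketch, assuming the raw contents may contain leftover HTML tags and irregular whitespace, the cleaning could look like this. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; clean_text and _TextCollector are hypothetical names.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean_text(raw):
    """Drop markup and collapse every run of whitespace to one space."""
    parser = _TextCollector()
    parser.feed(raw)
    parser.close()  # flush any text that is still buffered
    # Joining chunks with a space avoids gluing words across tags.
    return " ".join(" ".join(parser.chunks).split())
```

Note that this keeps the text inside script and style elements, so real web pages would need extra filtering.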

Output file

generate.py reads the .csv files from the input_data folder and produces an output .csv file with | as the separator. It is pretty heavy (about 1.8x the size of the input csv, ~75MB) and has a total of 15 columns:

| Column name | Description |
| --- | --- |
| filename | Original filename provided in the input file |
| file_content | Original file contents provided in the input file |
| id | The dot-separated numbers from the filename |
| category | Type of the file |
| filename_date | Date extracted from the filename |
| parsed_date | Date extracted from the file contents |
| found_emails | Emails found in the file contents |
| found_urls | URLs found in the file contents |
| found_organizations | Organizations found in the file contents |
| found_people | People found in the file contents |
| found_dates | Dates found in the file contents |
| summary | Summary of the document |
| top5words | Top 5 most frequently used words in the file contents |
| title | Somewhat cleaned title |
| abstract | Somewhat cleaned file contents |
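
The found_* columns hold lists, but a .csv cell can only hold text. The sketch below shows, with the standard csv module and made-up rows, how such a list can be written as its Python repr into a |-separated file and parsed back with ast.literal_eval. The sample ids and addresses are invented, and it is an assumption that the lists are stored as Python literals.

```python
import ast
import csv
import io

# Made-up rows; only the column names come from the table above.
rows = [
    {"id": "1.2.3", "found_emails": ["a@example.org", "b@example.org"]},
    {"id": "4.5", "found_emails": []},
]

# Write: each list is stored as its Python repr, e.g. "['a@example.org', ...]".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "found_emails"], delimiter="|")
writer.writeheader()
for row in rows:
    writer.writerow({**row, "found_emails": repr(row["found_emails"])})

# Read: every cell comes back as a string, so lists must be parsed again.
buffer.seek(0)
restored = [
    {**row, "found_emails": ast.literal_eval(row["found_emails"])}
    for row in csv.DictReader(buffer, delimiter="|")
]
print(restored == rows)
```

A cell that itself contains a | is quoted automatically by the csv module, so the separator stays unambiguous.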

Some interesting queries that I could think of at 12pm

  1. Load the processed output .csv file:
import pandas as pd
df = pd.read_csv('./output_data/processed_data.csv', sep='|',
                 index_col=0, dtype=str)
  2. All unique emails found in the documents:
import ast
# Each cell holds a list written as text, so parse it back into a list first
emails = sum([ast.literal_eval(x) for x in df['found_emails']], [])
unique_emails = set(emails)
  3. Top 10 most frequent email domains in the documents:
from collections import Counter
domains = [x.split('@')[1] for x in emails]
d_counter = Counter(domains)
print(d_counter.most_common(10))
  4. Top 10 organizations mentioned in the documents:
orgs = sum([ast.literal_eval(x) for x in df['found_organizations']], [])
o_counter = Counter(orgs)
print(o_counter.most_common(10))
  5. Find the IDs of documents that contain the word "confidential":
# na=False treats documents with an empty abstract as non-matching
df.loc[df['abstract'].str.contains('confidential', na=False), 'id']
  6. How many documents and categories there are in the dataset:
print(f'Total number of documents: {len(df)}')
print('Documents by category:')
print(df['category'].value_counts())

And I am sure you can be significantly more creative with this :)
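
The list-valued queries above assume that every cell parses cleanly. A slightly more defensive helper is sketched below, on the assumption that some cells may be empty (which pandas reads as NaN) or malformed. The names parse_list_cell and flatten_column are hypothetical.

```python
import ast
from collections import Counter
from itertools import chain


def parse_list_cell(cell):
    """Turn a cell such as "['a', 'b']" into a list; anything else becomes []."""
    if not isinstance(cell, str):
        return []  # e.g. NaN for an empty cell
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return list(value) if isinstance(value, (list, tuple, set)) else []


def flatten_column(cells):
    """Concatenate all parsed lists in linear time."""
    return list(chain.from_iterable(parse_list_cell(c) for c in cells))


# Made-up cells standing in for df['found_emails']:
cells = ["['a@x.nl', 'b@y.nl']", float("nan"), "[]", "not a list", "['c@x.nl']"]
emails = flatten_column(cells)
print(Counter(e.split("@")[1] for e in emails).most_common(1))
```

With the real data this would be called as flatten_column(df['found_emails']). Using chain.from_iterable also avoids the repeated list copying that sum(..., []) does on large columns.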

How to generate output data

  1. Install dependencies with conda and switch to the environment:
conda env create -f environment.yml
conda activate ftm_hackathon

Alternatively (not tested), you can install the packages into your current environment manually:

pip install spacy tqdm pandas bs4
  2. Download the Dutch spaCy model, ~500MB:
python -m spacy download nl_core_news_lg
  3. Put your raw .csv files into the input_data folder.

  4. Run generate.py. On my 6-year-old laptop it takes ~17 minutes.

  5. The result will be written to output_data/processed_data.csv

Owner
ASReview hackathon for Follow the Money