PLStream: A Framework for Fast Polarity Labelling of Massive Data Streams

Last update: Aug 02, 2022

Overview

Motivation

When dataset freshness matters, annotating high-speed unlabelled data streams becomes critical, yet it remains an open problem.
We propose PLStream, a novel Apache Flink-based framework for fast polarity labelling of massive data streams, such as tweets or online product reviews.

Environment Requirements

The required Python packages are listed in requirements.txt.

Flink v1.13
Python 3.7
Java 8

DataSource

The datasets can be downloaded from https://course.fast.ai/datasets#nlp

Tweets

1.6 million labelled tweets
Source: Sentiment140

Yelp Reviews

280,000 training and 19,000 test samples in each polarity
Source: Yelp Review Polarity

Amazon Reviews

1,800,000 training and 200,000 testing samples in each polarity
Source: Amazon product review polarity

Quick Start

Try PLStream quickly on the Yelp review dataset.

Data Preparation

cd PLStream
wget https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz
tar zxvf yelp_review_polarity_csv.tgz
mv yelp_review_polarity_csv/train.csv train.csv
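
Before starting the pipeline, it can help to check that the extracted file looks as expected. The following is a minimal sketch, not part of PLStream. It assumes that train.csv has no header row and that each row holds a class index ("1" or "2") followed by the review text. The helper name count_labels is hypothetical, and the in-memory sample stands in for the real file.

```python
import csv
import io
from collections import Counter


def count_labels(lines):
    """Count how often each class index appears in a polarity CSV.

    Assumption: no header row; column 0 is the class index and
    column 1 is the review text.
    """
    reader = csv.reader(lines)
    return Counter(row[0] for row in reader if row)


# Small in-memory sample standing in for train.csv (made-up rows).
sample = io.StringIO('"1","Bad."\n"2","Good, really."\n"2","Nice"\n')
counts = count_labels(sample)
print(dict(counts))
```

For the real file, the same function could be called as `with open("train.csv", newline="", encoding="utf-8") as f: print(count_labels(f))`.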

1. Install the required environment for PLStream

Please make sure the Environment Requirements listed above are met.

pip install -r requirements.txt

2. Start Redis-Server in a terminal

redis-server

3. Run PLStream

python PLStream.py

Each output record has the form "original text" + "label" + "@@@@".
With the help of a split("@@@@") call, the labelled dataset can then be reorganized further.
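
The split("@@@@") step might look like the sketch below. It is an illustration under stated assumptions, not PLStream's own code: it assumes the label is a single trailing character (for example "0" or "1") appended directly to the text, which the README does not confirm. The helper name parse_labelled_output and the sample string are hypothetical.

```python
from typing import List, Tuple

RECORD_SEPARATOR = "@@@@"  # separator named in the README


def parse_labelled_output(raw: str, label_len: int = 1) -> List[Tuple[str, str]]:
    """Split raw output into (text, label) pairs.

    Assumption: the label is the last `label_len` characters of each record.
    Adjust `label_len` (or the slicing) if the real label format differs.
    """
    pairs = []
    for record in raw.split(RECORD_SEPARATOR):
        record = record.strip()
        if len(record) <= label_len:
            continue  # skip the empty piece after the final separator
        text, label = record[:-label_len], record[-label_len:]
        pairs.append((text.strip(), label))
    return pairs


# Made-up sample in the assumed "text + label + @@@@" layout.
sample = "Great food and friendly staff1@@@@Terrible service, never again0@@@@"
pairs = parse_labelled_output(sample)
for text, label in pairs:
    print(label, text)
```

Returning a list of tuples keeps the result easy to pass to csv.writer or to load into a DataFrame.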

Optional

To see the labelling accuracy, run: python PLStream_acc.py
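
For reference, labelling accuracy is commonly computed as the fraction of produced labels that match the ground-truth labels. The sketch below shows that calculation only. It does not reproduce what PLStream_acc.py does internally, and the function name labelling_accuracy and the sample labels are hypothetical.

```python
from typing import Sequence


def labelling_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Return the fraction of positions where predicted equals expected."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not predicted:
        raise ValueError("cannot compute accuracy of an empty sequence")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(predicted)


# Made-up labels: three of the four predictions match.
acc = labelling_accuracy(["1", "0", "1", "1"], ["1", "0", "0", "1"])
print(f"accuracy = {acc:.2f}")
```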
