A pipeline for quickly building TF-IDF + LogReg baselines for text classification.

Last update: Dec 07, 2022

Overview

Text Classification Baseline

A pipeline for quickly building TF-IDF + LogReg baselines for text classification.

Usage

Instead of writing custom code for a specific text classification task, you only need to:

install pipeline:

pip install text-classification-baseline

run pipeline:

either in terminal:

text-clf-train

or in python:

import text_clf

text_clf.train()

No data preparation is needed, only a CSV file with two raw columns (with arbitrary names):

text
target

NOTE: the target can be presented in any format, including text - not necessarily integers from 0 to n_classes-1.
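
As an illustration of the expected input, here is a minimal sketch that writes and re-reads a two-column CSV file using only the standard library. The rows, the column names text and target, and the temporary file location are made up for the example; any column names work as long as they match text_column and target_column in config.yaml.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical toy data; targets are plain strings, not integer ids.
rows = [
    {"text": "the battery died after two days", "target": "negative"},
    {"text": "fast delivery, great quality", "target": "positive"},
    {"text": "works as described", "target": "positive"},
]

# Written to a temporary folder here; a real run would point
# train_data_path in config.yaml at a file such as data/train.csv.
csv_path = Path(tempfile.mkdtemp()) / "train.csv"
with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "target"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the layout survived the round trip
# (the text containing a comma is quoted automatically by csv).
with csv_path.open(newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded[1])
```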

Config

The user interface consists of only one file config.yaml.

Edit config.yaml to set up the desired configuration, then train a text classification model with the following command:

terminal:

text-clf-train --path_to_config config.yaml

python:

import text_clf

text_clf.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
verbose: true
path_to_save_folder: models

# data
data:
  train_data_path: data/train.csv
  valid_data_path: data/valid.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 0.0

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  multi_class: auto
  n_jobs: -1

NOTE: the tf-idf and logreg sections hold the parameters of sklearn's TfidfVectorizer and LogisticRegression respectively, so you can parameterize instances of these classes however you want.
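
One detail of the default config is worth a sketch: a YAML loader reads the unquoted scalar (1, 1) as a string, while TfidfVectorizer expects ngram_range to be a tuple of two integers. The snippet below shows one possible conversion, assuming the tf-idf section has already been parsed into a dict. The helper name parse_ngram_range is hypothetical, and this is not necessarily how the package itself handles the value.

```python
import ast

# Stand-in for the "tf-idf" section of config.yaml after YAML parsing.
# The unquoted scalar (1, 1) arrives as the string "(1, 1)".
tfidf_params = {
    "lowercase": True,
    "ngram_range": "(1, 1)",
    "max_df": 1.0,
    "min_df": 0.0,
}


def parse_ngram_range(value):
    """Hypothetical helper: turn "(1, 2)" or [1, 2] into the tuple (1, 2)."""
    if isinstance(value, str):
        # literal_eval only evaluates Python literals, never arbitrary code.
        value = ast.literal_eval(value)
    low, high = (int(v) for v in value)
    if low < 1 or low > high:
        raise ValueError(f"invalid ngram_range: {value!r}")
    return (low, high)


tfidf_params["ngram_range"] = parse_ngram_range(tfidf_params["ngram_range"])
print(tfidf_params["ngram_range"])
```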

Output

After training the model, the pipeline will return the following files:

model.joblib - sklearn pipeline with TF-IDF and LogReg steps
target_names.json - mapping from encoded target labels (0 to n_classes-1) to their names
config.yaml - config that was used to train the model
logging.txt - logging file
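
To show how the label mapping might be used after training, here is a minimal sketch that decodes predicted class ids back to their names. The file content below is a made-up stand-in: the exact layout of target_names.json is an assumption, and the predicted list stands in for the output of the saved model.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical content for target_names.json: encoded label -> name.
# JSON object keys are always strings, hence "0" rather than 0.
model_dir = Path(tempfile.mkdtemp())
(model_dir / "target_names.json").write_text(
    json.dumps({"0": "negative", "1": "positive"}), encoding="utf-8"
)

# Load the mapping and convert the keys back to integers.
with (model_dir / "target_names.json").open(encoding="utf-8") as f:
    target_names = {int(k): v for k, v in json.load(f).items()}

# Stand-in for encoded labels predicted by the trained pipeline.
predicted = [1, 0, 1]
decoded = [target_names[i] for i in predicted]
print(decoded)
```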

Requirements

Python >= 3.6

Citation

If you use text-classification-baseline in a scientific publication, we would appreciate a reference to the following BibTeX entry:

@misc{dayyass2021textclf,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training text classification baselines},
    howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
    year         = {2021}
}



Releases (v0.1.6)

v0.1.6 (Nov 6, 2021)
Release v0.1.6

fixed token frequency support (#85)

fixed threshold selection for binary classification (#86)

v0.1.5 (Oct 21, 2021)
Release v0.1.5 🥳🎉🍾

added pymorphy2 lemmatization (#81)

added token frequency support (#85)

added threshold selection for binary classification (#86)

added arbitrary save folder name (#83)

pymorphy2 lemmatization (config.yaml)

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: pymorphy2

token frequency support

text_clf.token_frequency.get_token_frequency(path_to_config) -
get token frequency of train dataset according to the config file parameters
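
For intuition about what a token frequency table contains, here is a rough standalone sketch. It does not call the package: the function count_tokens is hypothetical, and it only mimics two vectorizer settings, lowercasing and scikit-learn's default token pattern (words of two or more characters).

```python
import re
from collections import Counter

# scikit-learn's default TfidfVectorizer token pattern: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_tokens(texts, lowercase=True):
    """Hypothetical helper: count token occurrences across all texts."""
    counts = Counter()
    for text in texts:
        if lowercase:
            text = text.lower()
        counts.update(TOKEN_PATTERN.findall(text))
    return counts


# Made-up training texts.
texts = ["The cat sat", "the cat ran a lot"]
frequency = count_tokens(texts)
print(frequency)
```

Single-character tokens such as "a" are dropped by this pattern, which is why they do not appear in the counts.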

threshold selection for binary classification

text_clf.pr_roc_curve.get_precision_recall_curve(path_to_model_folder) -
get precision and recall metrics for precision-recall curve

text_clf.pr_roc_curve.get_roc_curve(path_to_model_folder) -
get false positive rate (fpr) and true positive rate (tpr) metrics for roc curve

text_clf.pr_roc_curve.plot_precision_recall_curve(precision, recall) -
plot precision-recall curve

text_clf.pr_roc_curve.plot_roc_curve(fpr, tpr) -
plot roc curve

text_clf.pr_roc_curve.plot_precision_recall_f1_curves_for_thresholds(precision, recall, thresholds) -
plot precision, recall, f1-score curves for probability thresholds
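
A common way to use such curves is to pick the probability threshold that maximizes the F1-score. The sketch below does this in plain Python on made-up numbers. The function best_f1_threshold is hypothetical and not part of the package, and it assumes the inputs follow the convention of scikit-learn's precision_recall_curve, where precision and recall have one more element than thresholds.

```python
def best_f1_threshold(precision, recall, thresholds):
    """Hypothetical helper: return (threshold, f1) with the highest F1-score."""
    best_threshold, best_f1 = None, -1.0
    # zip stops at the shortest input, which skips the extra trailing
    # (precision=1, recall=0) point that has no threshold attached.
    for p, r, t in zip(precision, recall, thresholds):
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


# Made-up curve values for illustration only.
precision = [0.5, 0.6, 0.75, 1.0, 1.0]
recall = [1.0, 0.9, 0.75, 0.4, 0.0]
thresholds = [0.2, 0.4, 0.6, 0.8]

best_threshold, best_f1 = best_f1_threshold(precision, recall, thresholds)
print(best_threshold, best_f1)
```

Maximizing F1 is only one possible criterion; a task that values recall over precision (or the reverse) would choose a different point on the same curve.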

arbitrary save folder name (config.yaml)

experiment_name: model
v0.1.4 (Oct 10, 2021)
fixed load_20newsgroups.py (#65 #71)

added Makefile (#71)

added logging confusion matrix (#72)

replaced all "valid" occurrences with "test" (#74)

updated docstrings (#77)

changed python interface - train function returns model and target_names_mapping (#78)

v0.1.3 (Sep 2, 2021)
added hyper-parameters tuning (#58)

v0.1.2 (Aug 19, 2021)
fixed bug with multiple logging (#55)

v0.1.1 (Aug 11, 2021)
added logging (#43)

added unittests (#49)

added CI with linter, tests, codecov (#46 #49)

added docker (#48)

v0.1.0 (Aug 7, 2021)

First release.

Owner

Dani El-Ayyass

NLP Tech Lead @ Sber AI, Master Student in Applied Mathematics and Computer Science @ CMC MSU

PyPI project: https://pypi.org/project/text-classification-baseline/
