A simple machine learning package to cluster keywords in higher-level groups.

Last update: Dec 18, 2022

Overview

Simple Keyword Clusterer

A simple machine learning package to cluster keywords in higher-level groups.

Example:
"Senior Frontend Engineer" --> "Frontend Engineer"
"Junior Backend developer" --> "Backend developer"

Installation

pip install simple_keyword_clusterer

Usage

# import the package
from simple_keyword_clusterer import Clusterer

# read your keywords into a list (one keyword per line)
with open("../my_keywords.txt", "r") as f:
    data = f.read().splitlines()

# instantiate object
clusterer = Clusterer()

# apply clustering
df = clusterer.extract(data)

print(df)

Performance

By default, the algorithm picks the number of clusters automatically by choosing the one that yields the best silhouette score.
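Silhouette-based selection can be sketched as follows. This is not the package's internal code: the TF-IDF features, the use of KMeans, the candidate range, and the helper name `pick_n_clusters` are all assumptions made for illustration, built only on scikit-learn calls.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def pick_n_clusters(keywords, candidates=range(2, 6), random_state=0):
    """Hypothetical helper: return (best_k, scores), best_k maximizing the silhouette score."""
    X = TfidfVectorizer().fit_transform(keywords)
    scores = {}
    for k in candidates:
        # The silhouette score is only defined for 2 <= k <= n_samples - 1.
        if not 2 <= k <= X.shape[0] - 1:
            continue
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    if not scores:
        raise ValueError("no valid candidate number of clusters for this input")
    best_k = max(scores, key=scores.get)
    return best_k, scores


keywords = [
    "senior frontend engineer", "junior frontend engineer", "frontend engineer",
    "senior backend developer", "junior backend developer", "backend developer",
    "data scientist", "senior data scientist", "lead data scientist",
]
best_k, scores = pick_n_clusters(keywords)
print(best_k, scores)
```

A higher silhouette score means points sit closer to their own cluster than to neighbouring ones, which is why the candidate with the maximum score is kept.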

You can also specify the number of clusters yourself:

# instantiate object
clusterer = Clusterer(n_clusters=4)

# apply clustering
df = clusterer.extract(data)

For best results, reduce the variance of the data by keeping all keywords within the same semantic context.
For example, a file of job-title keywords should stay coherent and should not include unrelated terms such as gardening keywords.

If the items are clearly separable, the algorithm should still be able to produce usable output.

Customization

You can customize the clustering mechanism through two files:

blacklist.txt
to_normalize.txt

If the clustering identifies unwanted groups, you can blacklist the words responsible by appending them to the blacklist.txt file.
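The effect of a blacklist can be pictured with a minimal sketch. It assumes blacklist.txt holds one word per line and that matching is case-insensitive; neither detail is confirmed by the package, and `load_blacklist` and `remove_blacklisted` are hypothetical helper names.

```python
def load_blacklist(path):
    """Read one blacklisted word per line (assumed file layout)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def remove_blacklisted(keyword, blacklist):
    """Drop every whitespace-separated word that appears in the blacklist."""
    kept = [word for word in keyword.split() if word.lower() not in blacklist]
    return " ".join(kept)


# In practice the set would come from load_blacklist("blacklist.txt").
blacklist = {"remote", "urgent"}
cleaned = remove_blacklisted("Senior Frontend Engineer Remote", blacklist)
print(cleaned)  # Senior Frontend Engineer
```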

The to_normalize.txt file contains tuples, each describing a transformation to apply to the keywords. For instance:

("back end", "backend"), ("front end", "frontend"), ("sr", "Senior"), ("jr", "junior")

Add your own tuples to the file to extend these transformations.
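How such tuples might be applied is sketched below. Lowercasing the input first and matching on word boundaries are assumptions, not documented behavior of the package; `normalize` is a hypothetical helper, and the rule targets are lowercased here for consistency.

```python
import re

# Illustrative rules in (source, target) form, mirroring the tuples above.
TO_NORMALIZE = [
    ("back end", "backend"),
    ("front end", "frontend"),
    ("sr", "senior"),
    ("jr", "junior"),
]


def normalize(keyword, rules=TO_NORMALIZE):
    """Apply each (source, target) rule to the lowercased keyword, in order."""
    text = keyword.lower()
    for source, target in rules:
        # \b keeps short sources such as "sr" from matching inside longer words.
        pattern = rf"\b{re.escape(source)}\b"
        text = re.sub(pattern, lambda _match, t=target: t, text)
    return text


print(normalize("Sr Back End Developer"))   # senior backend developer
print(normalize("Jr Front End Engineer"))   # junior frontend engineer
```

Rules run in list order, so a later rule sees the output of the earlier ones.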

Dependencies

Scikit-learn
Pandas
Matplotlib
Seaborn
Numpy
NLTK
Tqdm

Make sure to download the NLTK English stopwords and the Punkt tokenizer models by running the following in Python:

import nltk

nltk.download("stopwords")
nltk.download("punkt")

Contact

If you would like to get in touch, send me an email. You can find my contact information on my website.

Owner

Andrea D'Agostino
