Buckshot++ is a new algorithm that finds highly stable clusters efficiently.

Overview

Buckshot++: An Outlier-Resistant and Scalable Clustering Algorithm. (Inspired by the Buckshot Algorithm.)

Here, we introduce a new algorithm, which we name Buckshot++. Buckshot++ improves upon k-means by addressing its main shortcoming: the need to fix the number of clusters, K, in advance. Typically, K is found in the following manner:

  1. settle on some metric,
  2. evaluate that metric at multiple values of K,
  3. use a greedy stopping rule to determine when to stop (typically the bend in an elbow curve).
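The three steps above can be sketched as follows. This is only an illustration of the conventional elbow heuristic, not part of Buckshot++: the function name `elbow_k`, the `min_gain` threshold, and the synthetic data are all assumptions made for the example, and the metric used is the k-means inertia from scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_k(X, k_max=8, min_gain=0.3, seed=0):
    """Greedy elbow rule (hypothetical helper): keep increasing K until the
    relative drop in inertia falls below min_gain, then return the previous K."""
    prev = KMeans(n_clusters=1, n_init=10, random_state=seed).fit(X).inertia_
    for k in range(2, k_max + 1):
        cur = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        if (prev - cur) / prev < min_gain:
            return k - 1  # the bend: adding a k-th cluster no longer pays off
        prev = cur
    return k_max

# Synthetic data: three tight, well-separated blobs (an assumption for the demo).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

chosen_k = elbow_k(X)
print("K chosen by the elbow rule:", chosen_k)
```

The result depends on the arbitrary `min_gain` threshold, which is one reason this kind of greedy stopping rule can be fragile on less clean data.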

There must be a better way. We detail the following 3 improvements that the Buckshot++ algorithm makes to k-means.

  1. Not all metrics are created equal. K-means doesn't prescribe which metric to use for finding K, and we found that some of the commonly implemented metrics are too inconsistent from one iteration to the next. Buckshot++ prescribes the silhouette score for finding K.
  2. In k-means, every single point is clustered -- even the noise and outliers. But what we really care about is the pattern and not the noise. We show here an elegant way to overcome this problem -- even simpler than k-medoids or k-medians.
  3. Finally, the computational complexity of running k-means multiple times on the whole dataset to find the best K can be prohibitive. We show below a surprisingly simple alternative with better asymptotics.
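For readers unfamiliar with the silhouette score mentioned in the first point, here is a minimal NumPy sketch of its standard definition: for each point, a is the mean distance to the other points in its own cluster, b is the mean distance to the points of the nearest other cluster, and the score is (b - a) / max(a, b). The helper name `silhouette` and the toy data are assumptions for illustration; in practice `sklearn.metrics.silhouette_score` would be used.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette over all points (illustrative re-implementation)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton cluster: score stays 0 by convention
        a = dist[i, same].mean()
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

X = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette(X, [0, 0, 1, 1])  # natural grouping
bad = silhouette(X, [0, 1, 0, 1])   # groups that cut across the gap
print(round(good, 3), round(bad, 3))
```

The natural grouping scores close to 1, while the grouping that mixes the two ends of the line scores below 0, which is the behavior that makes the score usable for comparing different values of K.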

Details of the Buckshot++ algorithm

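ALGORITHM: Buckshot++
INPUTS: population of N vectors
        B := number of bootstrap samples
        F := max number of clusters to try
        M := cluster quality metric
OUTPUT: the optimal K for kmeans

Take B bootstrap samples where each sample is of size 1/B (that is, a fraction 1/B of the population, or N/B vectors).
for each bootstrap sample do
  for each counter k from 2 to F do
    Compute kmeans with k centers.
    Compute the metric M on the clusters.
  Store the values of M as this sample's metrics vector, indexed by k.
Compute the centroid (element-wise mean) of all B metrics vectors.
Get argmax of the centroid vector; the corresponding k is the output K.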
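The pseudocode above can be sketched in a few lines with scikit-learn. This is a minimal illustration under stated assumptions, not the code in the buckshotpp module: the function name `buckshot_pp`, its parameters, and the synthetic data are hypothetical; the metric M is taken to be the silhouette score, and each bootstrap sample is assumed to hold N/B vectors drawn with replacement.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def buckshot_pp(X, n_samples=5, max_k=6, seed=0):
    """Return (best K, per-sample score matrix). Hypothetical helper."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sample_size = n // n_samples              # each sample is 1/B of the data
    ks = list(range(2, max_k + 1))
    scores = np.empty((n_samples, len(ks)))   # one metrics vector per sample
    for b in range(n_samples):
        idx = rng.choice(n, size=sample_size, replace=True)  # bootstrap draw
        sample = X[idx]
        for j, k in enumerate(ks):
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=seed).fit_predict(sample)
            scores[b, j] = silhouette_score(sample, labels)
    mean_scores = scores.mean(axis=0)         # centroid of the metrics vectors
    return ks[int(np.argmax(mean_scores))], scores

# Synthetic data: three well-separated blobs (an assumption for the demo).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

best_k, scores = buckshot_pp(X)
print("estimated K:", best_k)
```

Each row of `scores` corresponds to one bootstrap sample, and the column-wise mean is the averaged curve whose argmax gives K.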

Explanation of Buckshot++

The Buckshot++ algorithm was motivated by the Buckshot algorithm, which essentially finds cluster centers by performing hierarchical clustering on a sample and then performing k-means with those cluster centers as inputs. Hierarchical clustering has relatively high time complexity, which is why Buckshot performs it only on a sample. The key difference between hierarchical clustering and k-means is that the former is more deterministic/stable but less scalable than the latter, as the next table shows.

%matplotlib inline
import pandas as pd
pd.set_option('display.max_rows', 500)
tbl = pd.DataFrame({'k-means': ['O(N * k * d * i)', 'random initial means; local minimum; outlier'],
                    'hierarchical': ['O(N^2 * logN)', 'outlier']},
                   index=['Computational Complexity', 'Sources of Instability'])
tbl
                          k-means                                       hierarchical
Computational Complexity  O(N * k * d * i)                              O(N^2 * logN)
Sources of Instability    random initial means; local minimum; outlier  outlier

Hierarchical clustering's higher time complexity means that, for large inputs, running k-means multiple times is still faster than running hierarchical clustering just once. The Buckshot algorithm runs hierarchical clustering just once on a small sample in order to initialize cluster centers for k-means. Since O(N^2 * logN) grows faster than quadratically in N, the sample must be very small to keep this step computationally tractable. But a key critique of Buckshot is that a small sample can fail to reveal the right structure.
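To make the comparison concrete, the two complexity expressions from the table can be evaluated directly. The numbers below (N, k, d, i) are made-up example values, the logarithm is assumed to be base 2, and constant factors are ignored, so the result is only an order-of-magnitude estimate.

```python
import math

def kmeans_ops(n, k, d, i):
    """Operation count implied by O(N * k * d * i)."""
    return n * k * d * i

def hierarchical_ops(n):
    """Operation count implied by O(N^2 * log N), log taken base 2."""
    return n * n * math.log2(n)

# Hypothetical workload: 1M vectors, 20 clusters, 100 dimensions, 50 iterations.
n, k, d, i = 1_000_000, 20, 100, 50
one_kmeans = kmeans_ops(n, k, d, i)
one_hierarchical = hierarchical_ops(n)
runs_per_hierarchical = one_hierarchical / one_kmeans

print(f"k-means:      {one_kmeans:.2e} operations")
print(f"hierarchical: {one_hierarchical:.2e} operations")
print(f"k-means runs per hierarchical run: {runs_per_hierarchical:.0f}")
```

Under these assumptions, one hierarchical pass costs roughly as much as 200 k-means runs on the same data.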

Buckshot++'s key innovation lies in the step "Take B bootstrap samples where each sample is of size 1/B." While Buckshot is doing hierarchical on a sample, Buckshot++ is doing multiple kmeans on bootstrap samples. Doing kmeans many times can still finish sooner than doing hierarchical just once, as the time complexities above show. An added bonus is that bootstrapping is a great way to smooth out noise and improve stability. In fact, that is exactly why Bagging (a.k.a. Bootstrap Aggregating) and Random Forests work so well.
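The smoothing effect of averaging can be illustrated with a small simulation. Everything here is synthetic and assumed for the example: a hypothetical noise-free metric curve that peaks at K = 5, Gaussian noise standing in for sample-to-sample variation, and a deliberately large number of curves. It does not model the fact that real bootstrap samples get smaller as B grows.

```python
import numpy as np

rng = np.random.default_rng(0)
ks = np.arange(2, 11)
true_curve = 0.6 - 0.02 * (ks - 5) ** 2       # hypothetical curve, peak at K = 5
n_samples = 400
# One noisy copy of the curve per simulated bootstrap sample.
curves = true_curve + rng.normal(scale=0.05, size=(n_samples, ks.size))

single_picks = ks[curves.argmax(axis=1)]      # K picked from each curve alone
bagged_pick = ks[curves.mean(axis=0).argmax()]  # K picked from the average curve
hit_rate = (single_picks == 5).mean()

print("share of single curves that pick K=5:", round(float(hit_rate), 2))
print("K picked from the averaged curve:", int(bagged_pick))
```

Individual curves frequently peak at a neighboring K, while the averaged curve recovers the underlying peak, which is the same variance-reduction argument that motivates bagging.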

Python implementation of Buckshot++

The core algorithm implementation is in the buckshotpp module. We use it below to cluster a news headlines dataset.

from buckshotpp import Clusterings, plot_mult_samples
from numpy.random import choice
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
import nltk; nltk.download('punkt', quiet=True)
import matplotlib.pyplot as plt; plt.rcParams['figure.dpi'] = 120
import warnings; warnings.filterwarnings('ignore')

vecSpaceMod = Clusterings({'file_loc': 'data/news_headlines.csv',
                           'tf_dampen': True,
                           'common_word_pct': 1,
                           'rare_word_pct': 1,
                           'dim_redu': False}
                         )  # Instantiate a Clusterings object using parameters.
news_df = vecSpaceMod.get_file() # Read news_headlines.csv into a df.
metrics_byK = vecSpaceMod.buckshot(news_df)
plot_mult_samples(metrics_byK, 'silhouette')

[Figure: silhouette score versus K, one green curve per bootstrap sample and their average in red]

An insight from this chart

Each green curve is generated from a bootstrap sample, and the red curve is their average. Recall the sources of instability for k-means listed in the table above; outliers are one of them. The concept of an outlier has a somewhat different meaning in the context of clustering. In supervised learning, an outlier is a rare observation that's far from other observations distance-wise. In clustering, a faraway observation is its own well-separated cluster. Our interpretation is that "rare" is the operative word, and that outliers are singleton clusters that exert undue influence on the formation of other clusters. Look at how bagging led to a more stable estimate of the optimal number of clusters in the graph above.

Not all metrics are created equal

Two internal clustering metrics implemented in scikit-learn are the Silhouette Coefficient and the Calinski-Harabasz criterion. Comparing the Silhouette plotted above with the Calinski-Harabasz plotted below, it's clear that the Calinski-Harabasz curves are far more extreme, perhaps implausibly so.
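One property that may help explain the difference in scale: the silhouette is bounded between -1 and 1, whereas Calinski-Harabasz is an unbounded ratio of between-cluster to within-cluster dispersion. The sketch below computes both on synthetic blobs; the data and the variable names are assumptions for illustration, and it does not reproduce the news headlines experiment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic data: three tight, well-separated blobs (an assumption for the demo).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

results = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results[k] = (silhouette_score(X, labels),
                  calinski_harabasz_score(X, labels))
    sil, ch = results[k]
    print(f"K={k}: silhouette={sil:.2f}, Calinski-Harabasz={ch:.0f}")
```

On data like this, moving from K=2 to K=3 changes the silhouette by a fraction of its range, while Calinski-Harabasz jumps by more than an order of magnitude.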

plot_mult_samples(metrics_byK, 'calinski')

[Figure: Calinski-Harabasz score versus K for each bootstrap sample]

Internal or External Clustering Metrics?

This data contains a field named "STORY" that indicates which story a headline belongs to. With this field as the ground truth, we compute Adjusted Mutual Information (a chance-corrected version of Mutual Information, a widely used external metric) using the code below. The score is 1 for a perfect match and close to 0 for a random labeling; it can even be slightly negative. Using the K resulting from Buckshot++, we obtained a score of about 0.64, an indicator that the model performance is reasonable.

X = vecSpaceMod.term_weight_matr(news_df.TITLE)
kmeans_fit = KMeans(n_clusters=20).fit(X)  # the argument comes from the inflection point of the silhouette plot
mutual_info = adjusted_mutual_info_score(labels_true=news_df.STORY, labels_pred=kmeans_fit.labels_)
mutual_info
0.6435601965984835
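A useful property of this score is that it compares partitions, not label values, so cluster IDs do not need to match the ground-truth IDs. The toy labels below are invented for illustration.

```python
from sklearn.metrics import adjusted_mutual_info_score

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
relabeled = [2, 2, 2, 0, 0, 0, 1, 1, 1]   # same grouping, different cluster IDs
mixed = [0, 0, 1, 1, 1, 2, 2, 2, 0]       # two items assigned to the wrong group

score_same = adjusted_mutual_info_score(truth, relabeled)
score_mixed = adjusted_mutual_info_score(truth, mixed)
print(score_same, score_mixed)
```

The relabeled clustering scores 1 because the grouping is identical, while the partially wrong clustering scores lower.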

Practically, does Buckshot++ produce well-separated clusters?

Taking a look at the documents and their corresponding "predictedCluster", the results certainly do seem reasonable.

cluster_results = pd.DataFrame({'predictedCluster': kmeans_fit.labels_,
                                'document': news_df.TITLE})
cluster_results.sort_values(by='predictedCluster', inplace=True)

cluster_results
predictedCluster document
25 0 SAC Capital Starts Anew as Point72
50 0 Zebra Technologies to Acquire Enterprise Busin...
23 0 Fine Tuning: Good Wife just gets better
21 0 Boulder's Wealth May Be A Factor For Lowest Ob...
6 0 Power restored to nuclear plant in Waterford, ...
73 0 Electricity out as Millstone shifts to diesel
59 1 Twitter's head of media Chloe Sladden steps do...
28 1 Twitter's revolving door: media head Chloe Sla...
12 1 Twitter Exec Exodus Continues with Media Chief...
67 2 Sony Xperia C3 arrives with 5MP selfie camera,...
30 2 Leaked: Images Of Sony's Xperia C3 'Selfie Phone'
45 2 Sony Xperia Z2 Encased In A Block Of Ice, Cont...
90 2 Sony Xperia Z4 Concept Emerges as Fan Imagines...
78 2 If you hate the word 'selfie' look away now, t...
71 3 Twitter Executive Quits Amid Stalling Growth
47 3 Twitter COO quits, signalling management shake-up
52 3 Twitter Loses a Powerful Executive
31 3 Second Twitter executive quits hours after Row...
20 3 Twitter COO resigns as growth lags
61 3 Twitter COO Rowghani resigns amid lacklustre g...
57 4 'Goodbye Twitter' COO Ali Rowghani, says bye t...
69 4 Twitter chief operating officer resigns as use...
66 4 UPDATE 3-Twitter chief operating officer resig...
86 4 Twitter chief operating officer Ali Rowghani h...
76 4 Ali Rowghani, Twitter's COO, resigns after mon...
49 4 Twitter COO Ali Rowghani Just Announced Via Tw...
13 4 Twitter COO Ali Rowghani Exits
35 4 Second Twitter exec resigns with goodbye tweet...
39 5 Why almost everything you've been told about u...
77 5 Why Fargo Works So Well as a TV Show
0 6 'Mad Men' Preview: Buckle Up For 7 'Dense' Epi...
4 6 'Mad Men' end in sight for Weiner
36 6 Weiner reflects on the beginning of the end of...
42 7 Giant mystery crater in Siberia has scientists...
85 7 Mysterious giant crater in the earth discovere...
60 7 Massive Crater Discovered in Siberia
92 7 Massive mystery crater at 'end of the world'
16 7 Mysterious crater in Siberia spawns wild Inter...
43 8 Inflation rise stalls wage hopes in the UK
82 8 The Least Obese City in the Country
19 8 Real wages could resume fall as "Easter effect...
55 8 UK Inflation Rise To 1.8% Delays Real Wage Ris...
26 8 Virginia's Governor Challenges Abortion Clinic...
51 8 BREAKING NEWS: Transport costs lead to hike in...
8 8 Cable prices climb 4 times faster than inflati...
79 9 Despite Safety Issues, GM's Sales Still Increa...
17 9 Chrysler Group LLC reports June 2014 US sales ...
40 9 GM June Sales Up 9 Percent, Best June Since 2007
87 9 Ford sales fall, GM barely even; Jeep powers C...
18 10 Gov. McAuliffe Makes Health Announcements
48 10 Microsoft wants Windows XP dead and has announ...
74 10 McAuliffe puts focus on women's health
7 11 Sony makes duckfacing official with Xperia C3,...
54 11 Sony to announce 'Selfie' phone on July 8th wi...
27 11 Sony prepares to launch a smartphone that has ...
91 11 Sony Xperia C3 launches as "world's best selfi...
88 11 Sony unveils Xperia C3 smartphone with LED fla...
11 11 Sony Xperia C3 Boasts 5MP "PROselfie" Front-fa...
44 12 UK CPI rises to 1.8% in April, core CPI hits 2%
75 12 Rising CO2 Levels Will Lower Nutritional Value...
1 12 Here's How Climate Change Will Make Food Less ...
81 12 Rising CO2 levels also make our food less nutr...
80 13 Nutrition in Crops Are Cut down Drastically by...
2 13 Rising carbon dioxide levels reduce nutrients ...
68 13 With carbon dioxide levels up, nutrients in cr...
64 14 Inflation back up: Modest rise to 1.8% in Apri...
83 14 US plants prepare for long-term nuclear waste ...
22 14 Nuclear Plant Operators Deal With Radioactive ...
32 14 US plants prepare long-term nuclear waste stor...
84 15 'Mad Men' takes off on its final flight
3 15 'Mad Men' mixology
5 15 'Mad Men': 7 things to know for Season 7
9 15 Mad Men - the (Blaxploitation) Movie
37 15 TV Review: Mad Men Season 7
46 15 'Mad Men': Season 7 Premiere Guide (Video)
70 15 10 Things You Never Knew About 'Mad Men'!
53 15 'Mad Men' Season 7 Spoilers: Everything We Kno...
72 15 Rich Sommer from AMC's 'Mad Men' Season Premiere
63 16 Fargo (FX) Season Finale 2014 "Morton's Fork"
56 16 Before 'Fargo's' season finale, a sequel (or p...
65 16 'Fargo' Season 1 Spoilers: Episode 10 Synopsis...
62 17 Google Glass headsets get new designs in colla...
41 17 Google's first fashionable Glass frames are de...
89 17 Google Glass Still Trying To Look Cool
34 17 Net-a-Porter Embraces Google Glass
15 18 Routine pelvic exams not recommended under new...
14 18 Doctors group nixes routine pelvic exams
38 18 Metro Detroit doctors wary of recommendation a...
10 18 Doctors against having frequent pelvic exams
58 19 Technology stocks falling for 2nd day in a row
24 19 UPDATE 5-JPMorgan profit weaker than expected ...
29 19 JPMorgan profit weaker than expected
33 19 Marks and Spencer's profits fall for third year

Summary of the key advantages of Buckshot++

  • Accurate method of estimating the number of clusters (a clearly best Silhouette emerged every time, while typical elbow heuristic searches can be hit or miss).
  • Scalable (faster search for K achieved by using k-means rather than hierarchical; running k-means on subsample rather than everything).
  • Noise resistant when used in conjunction with k-means++ (sampling with replacement lessens the chance of selecting an outlier in the bootstrap sample).
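The noise-resistance point can be quantified with a short calculation. Assuming each bootstrap sample draws N/B items uniformly with replacement from N items, the chance that one particular item (say, a lone outlier) is absent from a given sample is (1 - 1/N)^(N/B), which approaches exp(-1/B) for large N. The values of N and B below are made-up examples.

```python
import math

def prob_absent(n_population, sample_size):
    """Probability that one specific item never appears in a sample
    drawn uniformly with replacement."""
    return (1 - 1 / n_population) ** sample_size

n_population, n_bootstrap = 1000, 10          # hypothetical N and B
sample_size = n_population // n_bootstrap     # N/B items per bootstrap sample
p_absent = prob_absent(n_population, sample_size)

print(f"P(outlier absent from a sample) = {p_absent:.3f}")
print(f"large-N approximation exp(-1/B) = {math.exp(-1 / n_bootstrap):.3f}")
```

With these example values a single outlier is missing from about 90% of the bootstrap samples, so most of the per-sample curves are computed without it.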
Owner
John Jung
Senior Machine Learning Engineer