Outline:
1. What is K-Means Clustering?
2. Implementing K-means from Scratch with NumPy
1. Cluster Initialization
2. Data Labeling and Centroid Updates
3. K-Means Function Differentiation
4. Fitting it Together
3. K-Means for Video Keyframe Extraction: Bee Pose Estimation
4. Implementing K-means with scikit-learn
5. Summary
6. Resources

See my GitHub learning-repo for all of the code behind this post.
1. What is K-Means Clustering?
K-means clustering is an algorithm that partitions data into a user-defined number of groups, k. K-means is a form of unsupervised machine learning, meaning that the input data carry no labels before the algorithm is run.
Clustering data with algorithms such as k-means is valuable for a variety of reasons. Primarily, clustering serves to identify unique groups in unlabeled datasets when building data analytics pipelines. These labels are useful for data inspection, data interpretation, and training AI models. K-means and its variants are used in a variety of contexts, including:
- Research. E.g., Categorizing single-cell RNA sequencing results¹
- Computer Science. E.g., Clustering emails for spam detection and filtering²
- Marketing. E.g., Customer group segmentation for credit card ad targeting³
2. Implementing K-Means from Scratch with NumPy
To gain a fundamental understanding of how k-means works, we will examine each step of the algorithm. We’ll do this with visual explanations and by building a model from scratch with NumPy.
The algorithm and mathematical function behind k-means are beautiful yet relatively simple. Let’s start with an overview:
In summary, the k-means algorithm has three steps:
- Assign initial cluster center (centroid) positions
- Label the data based on the nearest centroid
- Move each centroid to the mean position of its newly labeled data points, then return to step 2 until the cluster centers converge.
Let’s move on to building the model. These are the functions that we’ll need to write in order to use the algorithm:
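Roughly, the pieces look like this (the names below are placeholders; the repo’s k_means.py defines the actual interface):

```python
# Building blocks sketched in the subsections below (illustrative signatures):
#
#   initialize_centroids(X, k)        # k-means++ starting positions      (2.1)
#   assign_labels(X, centroids)       # nearest-centroid label per point  (2.2)
#   update_centroids(X, labels, ...)  # mean position of each cluster     (2.2)
#   KMeans(k).fit(X)                  # ties the steps together           (2.4)
```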
2.1. Cluster Initialization
The first step of the k-means algorithm is for the user to select the number of groups that the data should be clustered into, k.
In the original implementation of the algorithm, once k was selected, the starting positions of the cluster centers (or centroids) were chosen by picking k of the input data points uniformly at random.
This approach turns out to be unreliable: the randomly chosen centroids can land close to one another, leading to poor clusterings and slow convergence. In 2006, Arthur and Vassilvitskii developed a more effective approach to centroid initialization, which they published in 2007 as k-means++⁴.
Rather than choosing all of the initial centroids uniformly at random, k-means++ picks the first centroid at random and then selects each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. Let’s visualize how it works:
Now that we’ve built up the intuition behind k-means++, let’s implement the function for it:
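Here is a minimal NumPy sketch of that initialization (names and signatures are illustrative; the version in the repo’s k_means.py may differ):

```python
import numpy as np

def initialize_centroids(X, k, rng=None):
    """k-means++ initialization (illustrative sketch).

    The first centroid is a uniformly random data point; each subsequent
    centroid is drawn with probability proportional to its squared distance
    from the nearest centroid chosen so far.
    """
    rng = np.random.default_rng(rng)
    n_samples = X.shape[0]
    centroids = [X[rng.integers(n_samples)]]

    for _ in range(1, k):
        # Squared distance from each point to its closest existing centroid.
        dists = np.min(
            ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        probs = dists / dists.sum()
        centroids.append(X[rng.choice(n_samples, p=probs)])

    return np.array(centroids)
```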
Of note, rather than choosing k by hand, several principled techniques can be used to identify a suitable number of clusters. Khyati Mahendru explains two of these approaches, the elbow and silhouette methods, in her article. It’s worth a read!
2.2. Data Labeling and Centroid Updates
Following centroid initialization, the algorithm enters an iterative process of data labeling and centroid position updates.
In each iteration, the input data will first be labeled based on their proximity to the centroids. After this, each centroid’s position will be updated to the average position of the data in its cluster.
These two steps will be repeated until the label assignments/centroid positions no longer change (or converge). Let’s visualize this process:
Now, let’s implement the data labeling code:
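A minimal sketch of this step (again illustrative rather than the repo’s exact code):

```python
def assign_labels(X, centroids):
    """Label each point with the index of its nearest centroid."""
    # (n_samples, k) matrix of squared Euclidean distances.
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(sq_dists, axis=1)
```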
And lastly, we’ll implement the centroid position update function:
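One possible sketch, which leaves a centroid in place if its cluster happens to be empty:

```python
def update_centroids(X, labels, centroids):
    """Move each centroid to the mean position of the points assigned to it."""
    new_centroids = centroids.copy()
    for i in range(len(centroids)):
        members = X[labels == i]
        if len(members) > 0:  # keep the old position if a cluster empties
            new_centroids[i] = members.mean(axis=0)
    return new_centroids
```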
2.3. K-Means Function Differentiation
The third step of the k-means algorithm is to update the position of the centroids. We saw that these centroids are updated to the average position of all of the cluster’s labeled points.
Updating the centroid to the average cluster position might seem intuitive, but what is the mathematical rationale behind this step? The rationale lies in the differentiation of the k-means equation.
Let’s make this intuition concrete by exploring an animated proof of the k-means function differentiation. The proof shows that the positional updates fall out of the k-means equation minimizing the within-cluster variance.
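In standard notation, a sketch of that argument looks like this, where J is the within-cluster sum of squares and C_i is the set of points currently assigned to centroid μ_i:

```latex
J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,
\qquad
\frac{\partial J}{\partial \mu_i} = -2 \sum_{x \in C_i} (x - \mu_i) = 0
\;\;\Rightarrow\;\;
\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
```

Setting the partial derivative to zero shows that the variance-minimizing position for each centroid is exactly the mean of its assigned points.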
2.4. Fitting it Together
Now that we’ve constructed the backbone functions for our k-means model, let’s tie it together in a single fit function that will fit our model to the input data. We will also define the __init__ function here:
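A compact sketch of how the pieces could fit together, reusing the helper functions sketched above (parameter names and defaults are assumptions, not necessarily those used in the repo’s k_means.py):

```python
import numpy as np

class KMeans:
    def __init__(self, k=3, max_iter=300, seed=None):
        self.k = k
        self.max_iter = max_iter
        self.seed = seed
        self.centroids = None
        self.labels = None

    def fit(self, X):
        # 1. k-means++ initialization.
        self.centroids = initialize_centroids(X, self.k, rng=self.seed)

        for _ in range(self.max_iter):
            # 2. Label each point by its nearest centroid.
            self.labels = assign_labels(X, self.centroids)
            # 3. Move centroids to the mean position of their clusters.
            new_centroids = update_centroids(X, self.labels, self.centroids)
            # Stop once the centroid positions converge.
            if np.allclose(new_centroids, self.centroids):
                break
            self.centroids = new_centroids
        return self
```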
Now we can put the model to use with my walkthrough notebook found here. This notebook uses synthetically generated data (shown in the videos above) to demonstrate the functionality of our newly written k_means.py code.
3. K-Means for Video Keyframe Extraction: Bee Pose Estimation
Wonderful – we’ve worked our way through the construction of a k-means model entirely from scratch. Rather than just tossing that code aside, let’s use it in an example scenario.
Over the past few years, there have been impressive advancements in the neuroscience and deep learning research communities that have enabled highly accurate, automated animal behavioral tracking and analysis*. The frameworks used in this research domain implement a variety of convolutional neural network architectures, and they lean heavily on transfer learning to reduce the amount of training data that researchers need to generate. Two popular examples of these frameworks are DeepLabCut and SLEAP.
- Side note: this subdomain is commonly dubbed computational neuroethology
To train models for automated tracking of specific points on animals, researchers typically have to manually label 100–150 unique frames from their behavioral videos. All things considered, this is a remarkably small number for enabling automated tracking of arbitrarily long behavioral videos!
However, an important consideration when labeling these training frames is that they should be as distinct from one another as possible. It would be wasteful to label only the first 5 seconds of a single video when hours and hours of recordings exist, because the behavior and body states of the animals in those 5 seconds will likely not represent the features of the entire video dataset. As a result, the model would not be trained to recognize a sufficiently wide variety of features.
So what does this have to do with k-means? Rather than having to manually identify unique keyframes from the videos, algorithms such as k-means can be used to automatically cluster the video frames into distinct groups. Let’s visualize how this works:
To get a hands-on understanding of this process, you can follow along with the code used to isolate these frames with my walkthrough notebook.
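As a rough sketch of the idea (assuming the frames have already been loaded into a NumPy array, e.g., with OpenCV; the notebook’s actual preprocessing differs in the details), the extraction boils down to clustering flattened frames with the from-scratch KMeans class above and keeping the frame closest to each centroid:

```python
import numpy as np

def extract_keyframes(frames, k=100):
    """frames: array of shape (n_frames, height, width) of grayscale frames."""
    # Flatten each frame into a 1-D feature vector.
    X = frames.reshape(len(frames), -1).astype(float)

    model = KMeans(k=k).fit(X)

    # For each cluster, keep the frame nearest to the centroid as its keyframe.
    keyframe_idx = []
    for i in range(k):
        members = np.where(model.labels == i)[0]
        if len(members) == 0:
            continue
        dists = ((X[members] - model.centroids[i]) ** 2).sum(axis=1)
        keyframe_idx.append(members[np.argmin(dists)])
    return sorted(keyframe_idx)
```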
4. Implementing K-means with scikit-learn
In the real world, one should generally avoid implementing self-constructed algorithms unless necessary. Instead, we should rely on carefully designed, efficient frameworks maintained by expert contributors, both paid and volunteer.
In this instance, let’s see how easy it is to implement k-means with scikit-learn. The documentation for the KMeans class can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).
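A minimal usage example (the random data here is just a stand-in for the synthetic dataset from earlier):

```python
from sklearn.cluster import KMeans
import numpy as np

X = np.random.rand(500, 2)  # toy data standing in for the synthetic set above

model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
model.fit(X)

print(model.cluster_centers_)  # learned centroid positions
print(model.labels_[:10])      # cluster assignments for the first 10 points
print(model.inertia_)          # within-cluster sum of squares
```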
The scikit-learn implementation of the model initialization and the fitting is very similar to ours (not a coincidence!), but we got to skip writing ~250 lines of the k_means.py code. Moreover, the scikit-learn framework implements optimized BLAS routines for k-means that make their implementation much faster than ours.
Long story short – learning from scratch is invaluable, but working from scratch isn’t.
5. Summary
In this post, we explored the fundamentals of the math and intuition behind the k-means algorithm. We built a k-means model from scratch using NumPy, used it to extract unique keyframes from an animal behavior video, and learned how to implement k-means with scikit-learn.
I hope this article was valuable for you! Feel free to reach out to me with any comments, ideas, or questions.
6. Resources
References:
1. Hicks SC, Liu R, Ni Y, Purdom E, Risso D (2021). mbkmeans: Fast clustering for single cell data using mini-batch k-means. PLoS Comput Biol 17(1): e1008625.
2. Sharma A, Rastogi V (2014). Spam Filtering using K mean Clustering with Local Feature Selection Classifier. Int J Comput Appl 108: 35-39.
3. Muhammad Shahzad, Bank Customer Segmentation (PCA-KMeans)
4. Arthur D, Vassilvitskii S (2007). k-means++: The Advantages of Careful Seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. pp. 1027–1035.
Educational Resources:
- Google Machine Learning: Clustering
- Andrew Ng, CS229 Lecture Notes, K-Means
- Chris Piech, K-Means