Welcome back to the world of I-don’t-know-why-I-was-freaking-out-about-this-algorithm-because-I-totally-intuitively-get-it-now, where we take complicated sounding Machine Learning algorithms and break them down into simple steps using fun illustrations.
Today we’ll tackle another clustering algorithm called DBSCAN (Density-Based Spatial Clustering of Applications with Noise). To understand DBSCAN better, check out the K-Means and Hierarchical clustering articles first.
As the name suggests, DBSCAN identifies clusters by the density of the points. Clusters usually sit in high-density regions, while outliers tend to sit in low-density regions. The 3 main advantages of using it (according to the pioneers of this algorithm) are that it requires minimal domain knowledge, can discover clusters of arbitrary shape, and is efficient for large databases.
Now that we have the introduction out of the way, let’s get to the fun part – actually understanding how this works. Let’s suppose our raw data looks like this:

The first thing to do is count the number of points close to each point. We determine this closeness by drawing a circle of a certain radius (eps) around a point; any other point that falls inside this circle is said to be close to the first point. For instance, start with this pink point and draw a circle around it.

We see that this circle overlaps, fully or partially, 7 other points. So we say that the pink point is close to 7 points.
The radius of the circle, called eps, is the first parameter of the DBSCAN algorithm that we need to define. We need to choose eps carefully: if the value is too small, a large part of the data will not be clustered, and if it is too large, clusters will merge and a lot of data points will end up in the same cluster. In general, a smaller eps value is preferred.
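If you’d like to see this neighbour-counting step in code, here’s a minimal NumPy sketch (the sample points and the eps value below are made up purely for illustration):

```python
import numpy as np

# Made-up 2D points: two small dense groups and one isolated point.
points = np.array([[1.0, 1.2], [1.1, 1.0], [0.9, 1.1],
                   [5.0, 5.1], [5.2, 4.9], [9.0, 0.5]])
eps = 0.5  # radius of the circle drawn around each point

# Pairwise Euclidean distances between all points.
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# For each point, count how many *other* points fall inside its eps-circle.
close_counts = (dists <= eps).sum(axis=1) - 1  # subtract 1 to exclude the point itself
print(close_counts)  # -> [2 2 2 1 1 0]
```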
Now consider this blue point. We see that it is close to 3 points because its circle with radius eps overlaps 3 other points.

Likewise, for all the remaining points, we count the number of close points. Once we do that, we need to decide what points are Core Points and what points are Non-Core Points.
This is where our second parameter of the algorithm, minPoints, comes in. We use minPoints to determine whether a point is a Core Point or not. Suppose we set minPoints to 4; then a point is a Core Point if at least 4 points are close to it, and a Non-Core Point if fewer than 4 points are close to it.
As a general rule, minPoints ≥ number of dimensions in the dataset + 1. Larger values are usually better for datasets with noise. minPoints should be at least 3, and the larger our dataset, the larger the minPoints value should be.
For our example, let’s set minPoints to 4. Then we can say that the pink point is a Core Point because at least 4 points are close to it, but the blue point is a Non-Core Point because only 3 points are close to it.
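If you want to play with this rule in code, here’s a small sketch that classifies every point at once (the helper name classify_core, the sample points, and the parameter values are just illustrative choices, not code from a library):

```python
import numpy as np

def classify_core(points, eps, min_points):
    """Return a boolean array: True where a point has at least min_points close points."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    close_counts = (dists <= eps).sum(axis=1) - 1   # neighbours, excluding the point itself
    return close_counts >= min_points               # True -> Core Point, False -> Non-Core Point

# Five points packed together and one point off on its own.
points = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],
                   [1.1, 0.8], [1.3, 1.2], [4.0, 4.0]])
is_core = classify_core(points, eps=0.6, min_points=4)
print(is_core)  # the five dense points come out True (Core), the isolated one False
```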
Ultimately, using the above process, we can determine that the following highlighted points are Core Points…

…and the remaining ones are Non-Core Points.
Now, we randomly pick a Core Point and assign it to the first cluster. Here, the randomly selected point starts the blue cluster.

Next, the Core Points that are close to the blue cluster (meaning they overlap the circle with radius eps)…

…are all added to the blue cluster:

Then, the Core Points close to the growing blue cluster are added to it. Below, we see that 2 Core Points and 1 Non-Core Point are close to the blue cluster, but we only add the 2 Core Points to the cluster.

Ultimately, all the Core Points that are close to the growing blue cluster are added to it and the data will look like this:

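This expansion step can be sketched in code too, assuming we already know which points are Core Points (say, from the classify_core helper above). The function name expand_core_cluster and the breadth-first approach are illustrative choices of mine, not the only way to do it:

```python
import numpy as np
from collections import deque

def expand_core_cluster(points, is_core, eps, seed):
    """Return the set of Core Point indices reachable from `seed` via eps-closeness."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    cluster = {seed}
    frontier = deque([seed])
    while frontier:
        i = frontier.popleft()
        # Core Points close to cluster member i join the cluster and keep growing it.
        for j in np.where((dists[i] <= eps) & is_core)[0]:
            if j not in cluster:
                cluster.add(int(j))
                frontier.append(int(j))
    return cluster
```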
Next, we add all the Non-Core Points close to the blue cluster to it. For instance, these 2 Non-Core Points are close to the blue cluster, so they are added to it:

However, since they are not Core Points, we do not use them to extend the blue cluster any further. That means this other Non-Core Point that is close to Non-Core Point 1 will not be added to the blue cluster.
So, unlike Core Points, Non-Core Points can only join a cluster and cannot be used to extend it further.
After adding all our Non-Core Points, we are done creating our blue cluster and it looks like this:

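Here’s a sketch of that final step for one cluster; again, the helper name attach_non_core and its inputs are illustrative assumptions that pair with the sketches above:

```python
import numpy as np

def attach_non_core(points, is_core, eps, core_cluster):
    """Given the indices of a cluster's Core Points, add the close Non-Core Points."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    cluster = set(core_cluster)
    for i in core_cluster:                      # only Core Points can recruit new members
        for j in np.where((dists[i] <= eps) & ~is_core)[0]:
            cluster.add(int(j))                 # joins the cluster, but expansion stops here
    return cluster
```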
Now, because none of the remaining Core Points are close to the first cluster, we start the process of forming a new cluster. First, we randomly pick a Core Point (that’s not already in a cluster) and assign it to our second cluster, the yellow one.

Then we add all the Core Points close to the yellow cluster and use them to extend the cluster further.

And then the Non-Core Points that are close to the yellow cluster are added to it. After doing that, our data, now with 2 clusters, looks like this:

We keep repeating this process of creating clusters until we have no Core Points left. In our case, since all the Core Points have been assigned to a cluster, we’re done making new clusters. Finally, any remaining Non-Core Points that are not close to any Core Point, and so are not part of any cluster, are called outliers.
And just like that we made our 2 clusters and found outliers AND came out the other side unscathed.
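If you’d rather not build this loop by hand, scikit-learn ships a DBSCAN implementation. Here’s a minimal sketch on made-up data; one caveat is that scikit-learn’s min_samples counts the point itself, so “at least 4 close points” in the convention above roughly corresponds to min_samples=5 here:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two made-up dense blobs plus two far-away points that should end up as outliers.
rng = np.random.default_rng(0)
blob1 = rng.normal(loc=[0, 0], scale=0.3, size=(30, 2))
blob2 = rng.normal(loc=[5, 5], scale=0.3, size=(30, 2))
outliers = np.array([[10.0, -3.0], [-4.0, 8.0]])
X = np.vstack([blob1, blob2, outliers])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(np.unique(labels))  # [-1  0  1]: two clusters (0 and 1) plus -1 for the outliers
```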
One might be inclined to ask – why DBSCAN over K-Means or Hierarchical clustering?
K-Means and Hierarchical clustering are suitable for compact, well-separated clusters, and both are severely affected by noise and outliers in the data. DBSCAN, on the other hand, captures clusters of complex shapes and does a great job of identifying outliers. Another nice thing about DBSCAN is that, unlike K-Means, we don’t have to specify the number of clusters (k); the algorithm finds the clusters for us. The figure below illustrates some of these differences and why DBSCAN can be powerful when used appropriately.

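As a quick, purely illustrative sanity check of that claim, here’s a small comparison on scikit-learn’s two-moons data, a classic non-convex shape; the eps and min_samples values are just reasonable guesses for this dataset, and DBSCAN typically recovers the two moons while K-Means cuts straight across them:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interlocking half-moons with a little noise.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Agreement with the true moon membership (adjusted Rand index; 1.0 = perfect recovery).
print("K-Means:", adjusted_rand_score(y_true, kmeans_labels))
print("DBSCAN :", adjusted_rand_score(y_true, dbscan_labels))
```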
That’s all for today. Please feel free to connect with me on LinkedIn or email me at [email protected] to send me questions and suggestions for any other Algorithms that you want illustrated!