A Comprehensive Introduction to Clustering Methods
Among unsupervised learning methods, some of the most well-researched and widely used fall under clustering. The basic idea is simple: if we can define distances between data points, then points that are closer together may exhibit some group characteristic we can exploit for modeling or use to extract new understanding. Some examples include patients with similar blood test results who have the same disease, consumers with a similar purchase history who belong to the same socioeconomic class or occupation, and flowers with similar colors and petal lengths that belong to the same species. Of course, some of these problems can be solved as classification problems, but only if labels are available. Clustering, like other unsupervised methods, operates without a label of interest.
We will cover the following topics in clustering:
> Distance Metrics for Real Numbers
> Assessing Clustering Tendency in Data
> DBSCAN Clustering
> Agglomerative Clustering
> K-Means Clustering
> Extensions and Mixed Data Types
> Choosing the # of Clusters
Distance Metrics for Real Numbers
Given two data points a and b, we need a way to define a distance between them. There are many common distance metrics for vectors of real numbers. The most common is of course the Euclidean distance, which treats the points a and b as the two ends of the hypotenuse of a right triangle and takes the length of that hypotenuse as the distance. This is how most people who are not familiar with math would define distance on a map, often called “as the crow flies”. There are pros and cons to using Euclidean distance as a metric. On the positive side, most optimization methods are designed with Euclidean distance in mind and the computational costs can be well constrained. Euclidean distance is also graphically straightforward and well understood by most people. On the negative side, the fact that we’re squaring differences significantly amplifies the effect of outliers. For two vectors a,b, the Euclidean distance is found by taking the differences of corresponding pairs of coordinates of a and b, squaring them, summing the results, and taking the square root.
Consider now trying to find the distance we’d have to walk between two buildings in Manhattan that are several blocks apart. Since we have to follow the street grid, we can’t simply cut diagonally from one building to another, so the Euclidean distance wouldn’t give us a realistic estimate. We have to imagine traversing along a grid where the two points sit on vertices. This is equivalent to putting the two points on opposite corners of a rectangle and taking the distance to be half the perimeter. This metric goes by many names, including absolute value, Manhattan, and Taxicab distance. It is found by taking the absolute values of the differences between corresponding pairs of coordinates in a and b and summing them. Optimization algorithms using Manhattan distance are often more computationally expensive and complex, but are significantly more resistant to outliers.
There are several more related distances we can define, but the question naturally arises: how many distance metrics are there, and how do you come up with a new one? Mathematicians have a set of rules that define the properties of a valid distance metric. Given a set of objects, any function that takes two objects and returns a single value, and that also satisfies the following properties, can be considered a distance metric.
Distance Metric: Given elements a,b,c in a set S, a distance metric is a function d with the following properties:
1) Non-negativity — d(a,b)≥0
2) Indiscernibility — d(a,b)=0 ⟺ a=b
3) Symmetry — d(a,b)=d(b,a)
4) Triangle Inequality — d(a,c)≤d(a,b)+d(b,c)
for all a,b,c∈S
These properties capture the natural consequences we would expect from a measure of ‘distance’. (1) makes sure the distance can never be less than zero. (2) makes sure that the distance is only zero when the two elements being compared are equal. (3) makes sure that it doesn’t matter which point we start from when measuring the distance between two elements. (4) makes sure that the shortest distance between two elements is going directly from one to the other. Any function satisfying these properties fits easily into existing frameworks for clustering and is often already implemented in a software package. Now, equipped with two examples of distance and the general definition of a distance metric, we can give a very general and powerful distance metric of which the Euclidean and Manhattan distances are special cases (p=2 and p=1 respectively): the Minkowski distance. Note that the Minkowski distance is only a valid distance metric for p≥1 (try to figure out which property is violated). Its definition is as follows, with the value of p controlling how the plane is ‘warped’:
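For two vectors a,b ∈ ℝⁿ,

$$d_p(a,b) = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p}$$

With p=1 this reduces to the Manhattan distance and with p=2 to the Euclidean distance; as p grows, the unit circle of the metric changes shape from a diamond (p=1) through a circle (p=2) toward a square (as p→∞).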
Let’s code these distance metrics in Python and see how the distances differ between two sample vectors:
# Sample vectors
a = [2, 1, 5, 3, 0.1, 0.5, 0.2, 1]
b = [15, 3, 3, 2, 0, 0, 0.5, 1]

# Manhattan distance: sum of absolute coordinate differences
manhattan = 0
for i, j in zip(a, b):
    manhattan += abs(i - j)
print("Manhattan Distance = " + str(manhattan))

# Euclidean distance: square root of the sum of squared coordinate differences
euclidean = 0
for i, j in zip(a, b):
    euclidean += (i - j) ** 2
print("Euclidean Distance = " + str(euclidean ** 0.5))

# Minkowski distance: generalizes both (p=1 is Manhattan, p=2 is Euclidean)
def Minkowski(a, b, p):
    base = sum([abs(i - j) ** p for i, j in zip(a, b)])
    return base ** (1 / p)

print("Minkowski Distance(p=3) : " + str(Minkowski(a, b, 3)))
print("Minkowski Distance(p=4) : " + str(Minkowski(a, b, 4)))
print("Minkowski Distance(p=5) : " + str(Minkowski(a, b, 5)))
print("Minkowski Distance(p=100) : " + str(Minkowski(a, b, 100)))
Assessing Clustering Tendency of Data
Before we begin studying algorithms to perform clustering, we’d like to figure out how amenable our data is to clustering in the first place. Is our data relatively homogeneous already, or will we be able to find meaningful separations? There are three main ways to answer this question, and ideally all of them should be used.
1) Domain-knowledge approaches
Sometimes the nature of the data and collection methods can inform us about the need for clustering. For example, we might have crime data from districts with varying socioeconomic classes without a corresponding label. We might have ER patients from a very large hospital system and would like to examine different severity patients for different kinds of models. We might suspect we have sensor data from a large variety of vehicles and should try to separate on vehicle type. There are also similar situations in which clustering can be used to reduce bias in data by separating out sub-populations that are more or less representative of the entire population.
2) Summary statistics and plots
When our data is relatively clean and low-dimensional, looking at a table of summary statistics or some scatter plots can usually reveal how well clustering would work on the data. Look for things like large ‘clumps’ of points in scatter plots between features, large variances, large differences between median and mean, properties of the data between quantiles, and so on.
3) Hopkins Statistic
One way to think about clustering tendency is to ask ourselves what a data set without any clustering tendency would look like. Well, it would look like data from a uniform distribution. Whatever metric we decide on should be minimized if the data is indeed drawn from such a distribution. We can test if this is the case with good ol’ hypothesis testing. Let the null hypothesis be that our data was drawn from a uniform distribution, then the more strength we have against this null hypothesis the more sure we can be that our data is indeed susceptible to clustering methods.
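As a rough illustration, here is a minimal sketch of one common form of the Hopkins statistic, assuming a numeric numpy array X (the helper name and sample_size default are just choices for this sketch, and a proper hypothesis test would compare the result against the statistic’s null distribution). Values near 0.5 are consistent with uniform data, while values approaching 1 suggest clustering tendency.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, sample_size=50, random_state=0):
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    m = min(sample_size, n - 1)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # Nearest-neighbor distances from m uniformly generated points to the real data
    uniform_points = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u_dist = nn.kneighbors(uniform_points, n_neighbors=1, return_distance=True)[0].ravel()
    # Nearest-neighbor distances from m sampled real points to the rest of the data
    # (n_neighbors=2 because each sampled point's closest neighbor is itself)
    sample_idx = rng.choice(n, m, replace=False)
    w_dist = nn.kneighbors(X[sample_idx], n_neighbors=2, return_distance=True)[0][:, 1]
    return u_dist.sum() / (u_dist.sum() + w_dist.sum())

print("Hopkins statistic:", hopkins_statistic(np.random.rand(200, 2)))  # roughly 0.5 for uniform data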
Preparing Data for Clustering
All of the major pre-processing steps we use before modeling should also be used before clustering. Standardizing our data by subtracting the mean and dividing by the standard deviation (for each feature) is very important, since features with large ranges will otherwise dominate distance metrics between points. We should also take care to remove or impute all missing values in our data. Lastly, though we will not cover it in depth here, high degrees of correlation between features and highly noisy features can make it harder to achieve a meaningful clustering, so methods like principal component analysis (PCA) are often applied to the data prior to clustering.
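As a short sketch of what this preparation might look like in practice (the toy DataFrame below is just for illustration):
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.DataFrame({"f1": [1.0, 2.0, None, 4.0, 5.0],
                   "f2": [10.0, 0.5, 3.0, 8.0, 2.0],
                   "f3": [100, 220, 150, 180, 90]})   # toy numeric data

X_prep = SimpleImputer(strategy="median").fit_transform(df)  # impute missing values
X_prep = StandardScaler().fit_transform(X_prep)              # zero mean, unit variance per feature
X_prep = PCA(n_components=2).fit_transform(X_prep)           # optional: decorrelate / reduce noise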
DBSCAN Clustering
One of the most intuitive and flexible clustering algorithms is DBSCAN (density-based spatial clustering of applications with noise). This algorithm essentially treats clusters as ‘dense’ regions of data and uses two parameters, min_points and epsilon. Min_points controls the minimum number of points needed to justify a cluster, and epsilon controls how far out from a point we look to find more points. Some of the big advantages DBSCAN enjoys over other clustering algorithms are its resistance to outliers, its few and straightforward parameters, its ability to find arbitrarily shaped clusters, and its natural extension to other distance metrics. Some of the drawbacks include setting the epsilon parameter (which can be difficult, especially in high-dimensional data) and poor performance on data with large fluctuations in density. The algorithm can be described in the following steps (a from-scratch sketch follows the list):
1) Pick a random unvisited point to start the process
2) Look within epsilon distance of the point for other points; if fewer than min_points are found, mark the point as noise (for now) and go back to (1)
3) If at least min_points are found within epsilon, designate these points a cluster and grow it by repeating (2) and (3) from each newly added point
4) Stop when every point has been visited
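To make these steps concrete, here is a compact from-scratch sketch of the idea, assuming a numeric numpy array X and Euclidean distance; it is simplified for readability (it precomputes the full distance matrix) and is not meant to replace the optimized library implementation used below.
import numpy as np

def simple_dbscan(X, eps, min_points):
    n = len(X)
    labels = np.full(n, -1)                  # -1 marks noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = list(np.where(dists[i] <= eps)[0])
        if len(neighbors) < min_points:
            continue                          # too sparse here: leave point i as noise for now
        labels[i] = cluster_id                # start a new cluster
        while neighbors:                      # expand the cluster outward
            j = neighbors.pop()
            if labels[j] == -1:
                labels[j] = cluster_id        # claim unassigned points for this cluster
            if visited[j]:
                continue
            visited[j] = True
            j_neighbors = np.where(dists[j] <= eps)[0]
            if len(j_neighbors) >= min_points:
                neighbors.extend(j_neighbors) # only 'core' points keep the expansion going
        cluster_id += 1
    return labels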
Let’s implement this clustering by generating some fake data with 3 clusters, and then using the popular Python package called Sklearn to perform DBSCAN clustering with 3 different values of epsilon.
from scipy.cluster import hierarchy
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import DBSCAN
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=450, centers=centers, cluster_std=0.4,random_state=0)
X = StandardScaler().fit_transform(X)
# -----Perform DBSCAN clustering with epsilon=0.1
cluster_assignments = DBSCAN(eps=0.1, min_samples=10,metric = 'minkowski', p = 2).fit_predict(X)
plt.figure(figsize=(10,4))
plt.subplot(1,3,1);plt.title("DBSCAN Clustering (epsilon=0.1)", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
# -----Perform DBSCAN clustering with epsilon=0.3
cluster_assignments = DBSCAN(eps=0.3, min_samples=10, metric = 'minkowski', p = 2).fit_predict(X)
plt.subplot(1,3,2);plt.title("DBSCAN Clustering (epsilon=0.3)", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
# -----Perform DBSCAN clustering with epsilon=0.6
cluster_assignments = DBSCAN(eps=0.6, min_samples=10, metric = 'minkowski', p = 2).fit_predict(X)
plt.subplot(1,3,3); plt.title("DBSCAN Clustering (epsilon=0.6)", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
From the above we can see how much the value of epsilon affects the outcome of the clustering. The choice of epsilon depends on how closely bound we think our clusters are: it’s clear that choosing epsilon = 0.6 causes the epsilon neighborhood of each point to become so large that it expands into the other clusters, whereas choosing epsilon = 0.1 causes the epsilon neighborhoods to be too small to form meaningful clusters. We can also explore the effect of using different distance metrics on the clustering result. Recall that Manhattan Distance and Euclidean Distance are just special cases of the Minkowski distance (with p=1 and p=2 respectively), and that distances between vectors decrease as p increases. Compare the effect of setting too small of an epsilon neighborhood to setting a distance metric (Minkowski with p=1000) under which distances are very small.
# -----Perform DBSCAN clustering with Manhattan Distance
cluster_assignments = DBSCAN(eps=0.3, min_samples=10, metric = 'minkowski', p = 1).fit_predict(X)
plt.figure(figsize=(10,4))
plt.subplot(1,3,1);plt.title("DBSCAN Clustering (Manhatten Distance)", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
# -----Perform DBSCAN clustering with Euclidean Distance
cluster_assignments = DBSCAN(eps=0.3, min_samples=10, metric = 'minkowski', p = 2).fit_predict(X)
plt.subplot(1,3,2);plt.title("DBSCAN Clustering (Euclidean Distance)", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
# -----Perform DBSCAN clustering with Minkowski Distance
cluster_assignments = DBSCAN(eps=0.3, min_samples=10, metric = 'minkowski', p = 1000).fit_predict(X)
plt.subplot(1,3,3); plt.title("DBSCAN Clustering (Minkowski Distance, p=1000)", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
Finally, we can look at the effect of varying the minimum number of points that constitute a cluster (the min_samples parameter in this Sklearn implementation). The way we’ve generated the points, each cluster contains roughly 150 points.
# -----Perform DBSCAN clustering with 5 samples to form a cluster
cluster_assignments = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
plt.figure(figsize=(10,4))
plt.subplot(1,3,1);plt.title("DBSCAN Clustering (min_points = 5)", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
# -----Perform DBSCAN clustering with 20 samples to form a cluster
cluster_assignments = DBSCAN(eps=0.3, min_samples=20).fit_predict(X)
plt.subplot(1,3,2);plt.title("DBSCAN Clustering (min_points = 20)", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
# -----Perform DBSCAN clustering with 100 samples to form a cluster
cluster_assignments = DBSCAN(eps=0.3, min_samples=100).fit_predict(X)
plt.subplot(1,3,3); plt.title("DBSCAN Clustering (min_points = 100)", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
Agglomerative Clustering
The idea of agglomerative clustering is very simple and the algorithm itself should look very natural to anyone familiar with greedy algorithms. There are only 4 major steps to the process:
1) Describe a distance between two clusters, called the inter-cluster distance
2) Make each point its own cluster
3) Find the two clusters with the smallest distance between them, and merge them into a single cluster
4) Repeat (3) until we are down to a defined # of clusters or sufficiently minimized cluster distances
Notice that at each iteration of this algorithm we need to recompute distances between the newly merged cluster and all the remaining clusters, which is a very expensive operation. The algorithm has a time complexity of O(n^3) and a memory requirement of O(n^2), making it very difficult to use for large data sets. There is also the question of how exactly we define a distance between two clusters, the inter-cluster distance. There are three common approaches; given two clusters A and B, we can define the inter-cluster distance as (a short sketch of all three follows the list):
single-link distance: The distance between the two closest points in clusters A and B
complete-link distance: The distance between the two furthest points in clusters A and B
group average distance: The average of distances between points in cluster A and points in cluster B
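To make these three definitions concrete, here is a small sketch computing each of them for two toy clusters (the points are just an illustration), using Euclidean distances between points:
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])      # toy cluster A
B = np.array([[3.0, 0.0], [4.0, 1.0]])      # toy cluster B

pairwise = cdist(A, B)                       # all cross-cluster point distances
print("single-link   :", pairwise.min())    # closest pair across clusters
print("complete-link :", pairwise.max())    # furthest pair across clusters
print("group average :", pairwise.mean())   # average over all cross-cluster pairs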
Note that the inter-cluster distance is NOT the same thing as the distance metric we defined between points above. For all the examples below we will assume Euclidean distance between points. Let’s generate sample data in 3 clusters and see how different inter-cluster distances affect the outcome. We specify the stopping criterion as 3 clusters and run agglomerative clustering with single, complete, and group average inter-cluster distances. Try to understand why each of these outcomes might have happened given the type of inter-cluster distance specified (try changing the cluster_std argument in make_blobs()).
# -----Perform agglomerative clustering with average linkage
Z = hierarchy.linkage(X, 'average')
cluster_assignments = hierarchy.fcluster(Z, 3, 'maxclust')
plt.figure(figsize=(10,4))
plt.subplot(1,3,1);plt.title("Average Link Agglomerative Clustering", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
# -----Perform agglomerative clustering with complete linkage
Z = hierarchy.linkage(X, 'complete')
cluster_assignments = hierarchy.fcluster(Z, 3, 'maxclust')
plt.subplot(1,3,2);plt.title("Complete Link Agglomerative Clustering", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
# -----Perform agglomerative clustering with single linkage
Z = hierarchy.linkage(X, 'single')
cluster_assignments = hierarchy.fcluster(Z, 3, 'maxclust')
plt.subplot(1,3,3); plt.title("Single Link Agglomerative Clustering", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
plt.tight_layout()
K-Means Clustering
One of the most well-known clustering methods is K-Means. The name comes from the necessity of specifying the number of clusters K you’d like to find, and from the fact that the center of each cluster is the mean of the points assigned to it. There are many variations and extensions of this method, such as Mini Batch K-Means and K-Medoids, that we’ll briefly discuss. The basic K-Means procedure can be described as follows (a minimal from-scratch sketch follows the list):
1) Decide on value for K
2) Randomly pick K points to act as cluster centers
3) Assign other data points to nearest clusters (based on distance from K cluster centers)
4) Take the mean of data points in each cluster, make this the new cluster center
5) Repeat steps (3) and (4) until the desired stopping criterion (no data points change cluster, or a minimal distance threshold) is reached
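Here is a minimal from-scratch sketch of this loop (Lloyd’s algorithm), assuming a numeric numpy array X; it skips practical details like handling empty clusters and multiple restarts, which the library implementation below takes care of.
import numpy as np

def simple_kmeans(X, k, n_iter=100, random_state=0):
    rng = np.random.default_rng(random_state)
    centers = X[rng.choice(len(X), k, replace=False)]          # step 2: random initial centers
    for _ in range(n_iter):
        # step 3: assign each point to the nearest center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 4: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 5: stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers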
From reading through the algorithm, you may observe that two things play a large part in the outcome: the K data points we ‘randomly’ choose to act as cluster centers in the beginning, and the value of K we specify. We can mitigate the first problem somewhat by trying multiple different random initializations; the second problem is much harder. A general strategy is to plot a metric like % of variance explained against the number of clusters and look for an ‘elbow’, a point where the relative improvement from adding another cluster is much less than before (overall variance explained heads toward 100% as the # of clusters approaches the # of data points). We will examine this problem in more detail soon.
from sklearn.cluster import KMeans
# -----Perform Kmeans clustering with k=3
cluster_assignments = KMeans(3).fit(X).predict(X)
plt.figure(figsize=(10,4))
plt.subplot(1,3,1);plt.title("KMeans Clustering (k=3)", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
# -----Perform Kmeans clustering with k=5
cluster_assignments = KMeans(5).fit(X).predict(X)
plt.subplot(1,3,2);plt.title("KMeans Clustering (k=5)", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
# -----Perform Kmeans clustering with k=7
cluster_assignments = KMeans(7).fit(X).predict(X)
plt.subplot(1,3,3); plt.title("KMeans Clustering (k=7)", fontsize='small')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=cluster_assignments,s=25, edgecolor='k')
Extensions and Mixed Data Types
Oftentimes, our data comes with restrictions that prevent us from using any of the algorithms above in their natural form. The two most common limitations are large data size and mixed data types. For very large data, none of the algorithms above perform especially well; however, there is an extension of K-Means called Mini Batch K-Means that provides a significant performance upgrade and is very usable for large data sizes. This algorithm is essentially K-Means, except that instead of using the entire data set at the cluster assignment step, only a small random sample of the data (known as a mini batch, usually small enough to fit in RAM) is used. There is a small amount of error introduced by this procedure (about 2–8% for k<20) compared to K-Means, so the number of clusters should be kept to a minimum. When we have binary or one-hot encoded variables present in our data, the distance metric in many of these algorithms does not make sense, since such features do not have a continuous range over which to define a notion of closeness. In this case, a good algorithm to use is K-Prototypes, which handles mixed data types by extending K-Means with a different notion of distance. We will not discuss these algorithms at length, but their presence should be known as they have important practical usage. A brief usage sketch of Mini Batch K-Means follows.
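As a quick illustration of the first of these, here is a brief usage sketch of Sklearn’s MiniBatchKMeans on our generated data X (the batch_size value is just a choice for this toy example):
from sklearn.cluster import MiniBatchKMeans

# Each update step uses a small random batch of X rather than the full data set
mbk = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=0)
mini_batch_assignments = mbk.fit_predict(X)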
Choosing the # of Clusters
With the examples above we generated the data to purposefully be in roughly 3 clusters. We also only had two explanatory features so we could do a good job visually assessing how discriminating each clustering was. In reality, data can be very high-dimensional and have various clustering tendencies, making it very difficult to use plots such as the one above to determine quality or choose the number of clusters. To get a numerical understanding of how good our clusters are, first we have to try to determine a metric for a ‘good’ clustering. What are some favorable and unfavorable properties? What task are we trying to accomplish?
Before we go further into common metrics for cluster quality, we must caveat that the quality of a clustering largely depends on what we’re trying to achieve. If we have a specific goal in mind (such as improving an existing model or finding low-cost patients), we should assess our clusters based on these metrics (such as classification accuracy or cluster average cost) directly instead of relying completely on statistical measures. When we are simply looking for meaningful distinctions in the data, we should examine the metrics below more closely.
One of the first things most people look for when doing clustering is homogeneous cluster data, that is, observations within a cluster should look relatively similar. This is easy enough to capture mathematically: we can simply look at how far each observation in a cluster is from the cluster’s mean. This quantity is known as the Within-Cluster-Sum-of-Squares (WCSS). It is formally defined as follows:
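$$\mathrm{WCSS}(X, C) = \sum_{i=1}^{n} \big\lVert x_i - \mu_{C(x_i)} \big\rVert^2$$

where $C(x_i)$ denotes the cluster that observation $x_i$ is assigned to and $\mu_{C(x_i)}$ is the mean (center) of that cluster.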
Most implementations of K-Means actually seek to minimize this quantity subject to some constraints. However, there are some issues with using it as a metric for performance. For example, look at what happens when each observation is its own cluster, that is when C(xi)={xi}: the WCSS becomes 0. So directly minimizing this quantity is not the way to go. In practice, the way to use this metric is to plot it against the number of clusters and look for an ‘elbow’: a point where the marginal decrease in WCSS from adding another cluster is much smaller than before.
Having similar observations within a cluster is great. However, we can generate a set of very similar observations and split them into N clusters and this property would hold. We in fact need observations within a cluster to look similar AND observations between clusters to look different. Another way to say this is given an observation in a cluster, we’d like it to be close to other observations in its own cluster but far from observations in the nearest cluster over. The silhouette coefficient captures this notion.
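For each observation $x_i$, let $a(i)$ be its average distance to the other points in its own cluster and $b(i)$ its average distance to the points in the nearest neighboring cluster. The per-point silhouette value and the overall coefficient are then

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad SC(X, C) = \frac{1}{n}\sum_{i=1}^{n} s(i).$$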
The silhouette coefficient SC(X,C) is bounded between -1 and 1. High values of the silhouette coefficient indicate dense and well separated clusters but because of dependencies on cluster convexity it is a poor choice to compare between different clustering methods.
Let’s take a look at how these two metrics behave on our generated data with 3 clusters.
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
%matplotlib inline

k_tests = [2,3,4,5,6]
wcss = []; silo = []
for k in k_tests:
    clustering = KMeans(k).fit(X)
    wcss.append(clustering.inertia_)
    silo.append(silhouette_score(X, clustering.predict(X)))
plt.figure(figsize=(10,4))
plt.subplot(1,2,1);plt.title("Within-Cluster-Sum-of-Squares", fontsize='small')
plt.scatter(k_tests, wcss, marker='o', edgecolor='k')
plt.plot(k_tests, wcss, linestyle='--')
plt.subplot(1,2,2);plt.title("Silhouette Width", fontsize='small')
plt.scatter(k_tests, silo, marker='o', edgecolor='k')
plt.plot(k_tests, silo, linestyle='--')
They both point to the ‘correct’ number of clusters, three. Obviously, with real data we will not have prior knowledge about the existence of clusters and these plots will not look nearly as nice, but the general idea is the same. Another well-known measure is the Gap Statistic, which is similar in spirit to the Hopkins Statistic in that it compares our clustering against clusterings of random uniform data and uses that as a baseline to help gauge quality.
Conclusion
Hopefully by now you are familiar with some of the most popular methods in statistical clustering, as well as how to implement them in Python. These methods are widely used in a variety of domains, both for knowledge discovery and to aid predictive models, and adding them to our tool set gives us one more thing to look for in new data. Thanks for reading.