3.3 Overview: k-means Clustering

The first process that we will use in this chapter to demonstrate clustering is k-means.

The modeler first selects a k-value (in k-means clustering, the ‘k’ represents the number of clusters to use). Since the k-means model doesn’t “know” where it should start, it then begins with a random placement of k cluster centroids. A centroid is a single point, made up of the mean value of each feature within some particular group of observations.

Each record is then assigned to the nearest centroid, as measured by Euclidean distance.
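The assignment step can be sketched in a few lines of Python. This is only an illustration: the four two-feature records, the two starting centroids, and the helper name nearest_centroid are all made up for the example and do not come from any dataset in this chapter.

```python
import math

# Hypothetical toy data: four records with two features each.
records = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]

# Hypothetical starting centroids for a k=2 model.
centroids = [(1.0, 1.5), (8.5, 9.0)]

def nearest_centroid(record, centroids):
    """Return the index of the centroid closest to the record."""
    # math.dist computes the Euclidean distance between two points.
    distances = [math.dist(record, centroid) for centroid in centroids]
    return distances.index(min(distances))

labels = [nearest_centroid(record, centroids) for record in records]
print(labels)  # [0, 0, 1, 1]
```

The first two records land in cluster 0 and the last two in cluster 1, because those are the centroids they sit closest to.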

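Following that initial assignment, the model repeatedly recalculates each centroid as the mean of the records currently assigned to it, and then reassigns each record to its nearest centroid. Each pass reduces within-cluster variance (or at worst leaves it unchanged), which raises the ratio of between-cluster variance to within-cluster variance.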
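To make the idea of a between-cluster to within-cluster ratio concrete, here is a rough sketch that scores two different groupings of the same toy records. It assumes sums of squared distances as the measure of variance; the function names and the data are hypothetical, chosen only for illustration.

```python
from statistics import fmean

def centroid_of(points):
    """Mean of each feature across the given points."""
    return [fmean(column) for column in zip(*points)]

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def between_within_ratio(records, labels):
    """Ratio of between-cluster to within-cluster sum of squares."""
    overall = centroid_of(records)
    between_ss = 0.0
    within_ss = 0.0
    for cluster in set(labels):
        members = [r for r, lab in zip(records, labels) if lab == cluster]
        centroid = centroid_of(members)
        # Spread of the members around their own centroid.
        within_ss += sum(squared_distance(m, centroid) for m in members)
        # Distance of the centroid from the overall mean, weighted by cluster size.
        between_ss += len(members) * squared_distance(centroid, overall)
    return between_ss / within_ss

# Hypothetical toy data: two tight groups that sit far apart.
records = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]

compact = between_within_ratio(records, [0, 0, 1, 1])  # neighbors grouped together
mixed = between_within_ratio(records, [0, 1, 0, 1])    # distant records grouped together
print(compact > mixed)  # True
```

Grouping neighbors together gives a far higher ratio than mixing distant records into the same cluster, which is the kind of improvement the model is looking for as it moves the centroids.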

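Once the model has reached a point at which additional movement of centroids would no longer improve the model quality, it settles on a final position for the centroids, and a final set of cluster assignments for the records. This stopping point is a local optimum: it is the best result reachable from the particular starting centroids, but not necessarily the best possible result overall.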
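The whole cycle of starting, assigning, adjusting, and stopping can be sketched as a short loop. This is a simplified illustration and not the code inside any particular library: it assumes that the starting centroids are k records drawn at random, that each centroid is moved to the mean of its assigned records, and that the process stops when no record changes cluster. The function name simple_kmeans and its seed parameter are hypothetical.

```python
import math
import random
from statistics import fmean

def simple_kmeans(records, k, max_iter=100, seed=0):
    """A minimal k-means loop; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: use k randomly chosen records as the starting centroids.
    centroids = rng.sample(records, k)
    labels = None
    for _ in range(max_iter):
        # Assign every record to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(record, centroids[j]))
            for record in records
        ]
        # Stop once no record changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of the records assigned to it.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(fmean(col) for col in zip(*members))
    return centroids, labels

# Hypothetical toy data: two tight groups that sit far apart.
records = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids, labels = simple_kmeans(records, k=2)
print(sorted(centroids))  # [(1.25, 1.5), (8.5, 8.75)]
```

With data this cleanly separated, every possible starting pair leads to the same two final centroids. Messier data can settle in different places depending on where the centroids start.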

Imagine that in the plot below, each blue dot represents one observation from the dataset, and that the red dots represent centroids. The six points in the upper-right portion of the graph would be assigned to the cluster whose centroid is nearby, while the eight points in the lower-right portion would be assigned to the other centroid. It may have taken several iterations for the model to arrive at these centroid positions, but at this point there is no need for further adjustment: the clusters are as well separated as they can be.

In higher-dimensional spaces, the relationship between records and centroids is much tougher to visualize, but the same principle applies: records are still assigned to the cluster whose centroid is nearest.

Let’s imagine that we are working with normalized data in a k=3 model with five features. The three centroids’ values are:

Cluster 1 centroid	(1.5, 2.15, 0.65, 1.15, -1.05)
Cluster 2 centroid	(-1.85, 1.12, 1.01, 0.35, 0.72)
Cluster 3 centroid	(0.16, -1.07, 1.42, -0.15, -0.89)

We are looking at a record for a customer named Pete, whose values for these five features are: (1.15, 0.35, -0.49, 1.82, -0.07). Which cluster will Pete be assigned to?

Pete will be assigned to Cluster 1, as his Euclidean distance to its centroid is 2.46. His distances to the Cluster 2 and Cluster 3 centroids are 3.82 and 3.35, respectively.
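These distances can be checked with a few lines of Python. The sketch below uses the centroid values and Pete's values exactly as given above; only the variable names are invented for the example. Each distance is the square root of the sum of the squared differences across the five features.

```python
import math

centroids = {
    "Cluster 1": (1.5, 2.15, 0.65, 1.15, -1.05),
    "Cluster 2": (-1.85, 1.12, 1.01, 0.35, 0.72),
    "Cluster 3": (0.16, -1.07, 1.42, -0.15, -0.89),
}
pete = (1.15, 0.35, -0.49, 1.82, -0.07)

# Euclidean distance from Pete to each centroid.
distances = {name: math.dist(pete, centroid) for name, centroid in centroids.items()}
for name, distance in distances.items():
    print(f"{name}: {distance:.2f}")
# Cluster 1: 2.46
# Cluster 2: 3.82
# Cluster 3: 3.35

# The smallest distance determines the assignment.
assigned = min(distances, key=distances.get)
print(assigned)  # Cluster 1
```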

Because the initial assignment of centroid values is made randomly, it is possible to obtain slightly different results from repeated runs of k-means, even when using the same data and running the function the same way. For this reason, a ‘random_state’ value is sometimes used within the kmeans() function in order to ensure a reproducible result.
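As an illustration, the sketch below assumes the scikit-learn library, whose KMeans class accepts a random_state parameter. The data is randomly generated stand-in data, and the values 654 and n_init=10 are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in data: 60 records with five features.
data = np.random.default_rng(0).normal(size=(60, 5))

# Two separate fits that share a random_state begin from the same
# starting centroids, so they arrive at the same cluster assignments.
first = KMeans(n_clusters=3, n_init=10, random_state=654).fit(data)
second = KMeans(n_clusters=3, n_init=10, random_state=654).fit(data)

same_labels = np.array_equal(first.labels_, second.labels_)
print(same_labels)
```

Leaving random_state unset, or changing its value, may produce a different set of cluster assignments from the same data.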