3.4 How Many Clusters?

As noted in the previous section, the k-means algorithm requires the modeler to specify the number of clusters.

This leads to the question, “How many clusters should be made?” Students new to clustering often look for a definitive answer, but there is none, either in general or for any specific dataset. In contrast to regression and classification, clustering is performed without a target variable, or response variable, in the dataset, so there is no known outcome against which a particular number of clusters can be scored. Essentially, clustering is a form of rearranging the observations – and in some ways, it is analogous to other forms of rearrangement.

Imagine walking into a large room. Most of the space inside the room is empty, but in the corner, there is a couch, a desk, a coffee table, and two reclining chairs.

How should the furniture be arranged? What is the best way to set it up? Should the couch be placed near the desk, so that someone can recline on the couch with their feet propped up on the desk? Should the desk go into one of the corners, or perhaps somewhere closer to the center? Should the reclining chairs be placed alongside the couch, or should they be perpendicular to the couch? Should they be next to one another, facing one another, or arranged in some other way?

Of course, there is no single “correct” answer to those questions. There may, however, be arrangements that are preferable from the point of view of those who will use the room.

Clustering, a form of unsupervised learning, can be seen the same way – while it does not have a single “correct” answer, it may lend itself to arrangements that work well for the stakeholders involved.

The decision regarding the number of clusters may not be driven by the dataset at all. A company might say, “We have six inside sales teams, so please take all of our corporate accounts, and separate them into six distinct clusters.”

A clothing company may use a clustering model to help it determine size options for its customers. As the company weighs the decision regarding the number of clusters, it may seek to balance two competing considerations – more clusters will mean that customers are more likely to find something that suits them well, but too many clusters could lead to manufacturing and supply chain complications.

In this chapter, we will explore the use of elbow plots (for k-means clustering) and dendrograms (for hierarchical clustering) to help us develop a starting point for answering the “How many clusters?” question.
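
As a preview of the idea behind an elbow plot, the sketch below computes the within-cluster sum of squared distances for several values of k on a small made-up dataset. Everything in it is an assumption made for illustration: the twelve hand-picked points, the function names, and the deterministic "farthest-first" choice of starting centers, which is used here only to keep the result reproducible (library implementations typically use randomized seeding such as k-means++). It is a minimal standard-library sketch, not the implementation used later in the chapter.

```python
# Toy data (made up for illustration): three visibly separate groups of 2-D points.
points = [
    (1, 1), (1, 2), (2, 1), (2, 2),   # lower-left group
    (8, 8), (8, 9), (9, 8), (9, 9),   # upper-right group
    (1, 8), (1, 9), (2, 8), (2, 9),   # upper-left group
]


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_centers(points, k):
    """Deterministic starting centers: begin with the first point, then
    repeatedly add the point farthest from the centers chosen so far.
    (An assumption for reproducibility, not the usual random seeding.)"""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans_inertia(points, k, max_iter=100):
    """Run basic k-means and return the within-cluster sum of squares."""
    centers = farthest_first_centers(points, k)
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep a center that lost all points
        if new_centers == centers:  # no center moved, so the algorithm has converged
            break
        centers = new_centers
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# One value per candidate k; plotting these against k gives an elbow plot.
inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

For this toy data, the within-cluster sum of squares falls from roughly 267.3 at k = 1 to 6.0 at k = 3, the number of groups built into the data. A bend of that kind is what an elbow plot is meant to reveal, although real datasets rarely show it so cleanly.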