3.9 Okay, So Once Again, How Many Clusters?


As noted in the previous section, the elbow plot only provides us with a starting point to begin exploring and assessing k-values.  Other approaches are also valid, such as the silhouette method or an analysis of the percentage change in SSE as k increases.  However, all of these approaches share an important limitation that is hard to grasp at first – the statistically optimal answer may not align with the business needs of the model.
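To make the two statistical approaches mentioned above concrete, here is a minimal sketch that computes silhouette scores and the percentage drop in SSE (scikit-learn's `inertia_`) across candidate k-values.  The synthetic data is purely illustrative, not the Portland dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data only; swap in your own feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

results = {}
prev_sse = None
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil = silhouette_score(X, km.labels_)
    # Percentage improvement in SSE relative to the previous k.
    pct_drop = None if prev_sse is None else 100 * (prev_sse - km.inertia_) / prev_sse
    results[k] = (round(sil, 3), km.inertia_, pct_drop)
    prev_sse = km.inertia_

for k, (sil, sse, drop) in results.items():
    drop_txt = "n/a" if drop is None else f"{drop:.1f}%"
    print(f"k={k}: silhouette={sil}, SSE drop vs k-1: {drop_txt}")
```

Both metrics can point to different "best" k-values on the same data, which is part of why they are starting points rather than final answers.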

If your manager says, “I have four tech support teams, so I want you to take our client dataset and split it into four clusters of relatively similar sizes,” then that is what you should do.  For virtually any dataset imaginable, moving from k=1 (the full group) to k=4 (what this manager has requested) will yield a massive improvement in within-cluster homogeneity.  The k=4 solution here aligns with the business needs, which is more relevant than any elbow chart, silhouette plot, or other statistical analysis.  
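The business-driven version of this is straightforward: fix k at the required value and then sanity-check the result.  The sketch below assumes a hypothetical client DataFrame (the column names are stand-ins); note that k-means does not guarantee similar-sized groups, so the size check at the end matters.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical client data; real column names and values will differ.
rng = np.random.default_rng(0)
clients = pd.DataFrame({
    "annual_spend": rng.gamma(2.0, 500.0, size=400),
    "support_tickets": rng.poisson(3, size=400),
})

# Scale first so no single variable dominates the distance metric.
X = StandardScaler().fit_transform(clients)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
clients["cluster"] = km.labels_

# k-means does not guarantee equal-sized groups, so inspect the split.
print(clients["cluster"].value_counts().sort_index())
```

If the sizes come back badly unbalanced, that is a conversation to have with the manager before handing over the four groups.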

Okay, but back to our dataset with the Portland families.  Since we do not have a very clear bend, or elbow, let’s just start with k=3 and see what we would obtain.  How strongly separated are these groups?  
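A table of cluster means like the one described below can be produced by z-scoring the features and then averaging within each cluster, so every value reads as "standard deviations above or below the dataset mean."  The variables below mirror the ones discussed, but the data itself is a synthetic stand-in for the Portland dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Portland families dataset.
rng = np.random.default_rng(7)
families = pd.DataFrame({
    "household_size": rng.integers(1, 7, size=500),
    "home_sqft": rng.normal(1800, 450, size=500),
    "num_pets": rng.poisson(1.2, size=500),
    "household_income": rng.normal(85000, 20000, size=500),
    "entertainment_spend": rng.gamma(2.0, 150.0, size=500),
})

# Z-score so cluster means are in "standard deviations from the mean."
Z = pd.DataFrame(StandardScaler().fit_transform(families),
                 columns=families.columns)
km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(Z)
Z["cluster"] = km.labels_

# Rows = clusters; values near 0 sit at the dataset mean.
print(Z.groupby("cluster").mean().round(2))
```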

From the mean values shown here, we can see that Cluster 0 stands out for having the largest household sizes and the smallest homes.  Their pet totals, household incomes, and entertainment spending are all close to the dataset mean.  

We can see some other clear ‘breakouts’ here, such as the high travel spending and large homes for Cluster 1, and the small household sizes and low number of young children for Cluster 2.

What if we try a larger k-value?

With k=4, we have some “breakout” clusters for pet ownership – whereas the k=3 model only showed pet ownership hovering around the dataset mean, we now have Cluster 0 as our pet enthusiasts, and Cluster 1 as their opposite.  For some variables, such as household income and entertainment spending, we don’t yet have any true “breakout” groups.  

Next, let’s try k=6:

With k=6, we can see quite a bit more differentiation from group to group.  For most, but not all variables, some cluster now stands out as being a full standard deviation above or below the mean.  
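The "full standard deviation above or below the mean" check can be automated: on z-scored features, flag each variable where some cluster's mean is at least 1.0 away from the overall mean.  The data below is synthetic, and the 1.0 cutoff simply mirrors the rule of thumb used above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# Synthetic data with 6 groups across 5 illustrative variables.
X, _ = make_blobs(n_samples=600, centers=6, n_features=5, random_state=3)
cols = [f"var_{i}" for i in range(5)]
Z = pd.DataFrame(StandardScaler().fit_transform(X), columns=cols)

labels = KMeans(n_clusters=6, n_init=10, random_state=3).fit_predict(Z)
cluster_means = Z.groupby(labels).mean()

# True where at least one cluster "breaks out" on that variable,
# i.e., its mean sits a full standard deviation from the dataset mean.
breakout = (cluster_means.abs() >= 1.0).any()
print(breakout)
```

Variables with no breakout cluster are candidates for the "inconsequential differences" concern discussed at the end of this section.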

At this point, you may be wondering, “Okay but when should we stop?  Do we just keep iterating through higher and higher k values forever?’

At some point, the incremental improvement from going from k to k+1 becomes so small that it is not worth adding another cluster.  Each additional cluster also carries a cost for a business, since it is one more group to keep track of.
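One hedged way to operationalize "the improvement is too small to bother" is a simple stopping rule: keep increasing k until the percentage SSE improvement from k to k+1 falls below some threshold.  The 5% cutoff below is an assumed, illustrative value, and in practice it is a business judgment call.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data only.
X, _ = make_blobs(n_samples=400, centers=5, random_state=1)

threshold = 5.0  # percent; an assumed cutoff, not a statistical law
prev_sse = KMeans(n_clusters=1, n_init=10, random_state=1).fit(X).inertia_
chosen_k = 1
for k in range(2, 15):
    sse = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
    improvement = 100 * (prev_sse - sse) / prev_sse
    if improvement < threshold:
        break  # the marginal gain no longer justifies another group
    chosen_k, prev_sse = k, sse

print(f"Stopped at k={chosen_k}")
```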

For the sake of this example, we will stop here at k=7.

Some analysts may prefer to use visualizations to assist with this process.  If visualizations reveal inconsequential differences among multiple clusters for key variables from the dataset, that can be a sign that the k-value could be too low or too high.
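A numeric stand-in for that visual check is to compare the (z-scored) mean profiles of every pair of clusters: if two clusters barely differ on any key variable, the chosen k may be too high.  The data below deliberately over-clusters three true groups with k=6, and the 0.25 cutoff is an assumed, illustrative threshold.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# Data truly has 3 groups, but we deliberately over-cluster with k=6.
X, _ = make_blobs(n_samples=600, centers=3, n_features=4, random_state=5)
Z = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=6, n_init=10, random_state=5).fit_predict(Z)
means = pd.DataFrame(Z).groupby(labels).mean()

# Flag cluster pairs whose mean profiles barely differ on any variable.
for a in means.index:
    for b in means.index:
        if a < b:
            gap = (means.loc[a] - means.loc[b]).abs().max()
            if gap < 0.25:
                print(f"Clusters {a} and {b} look nearly identical "
                      f"(max gap {gap:.2f})")
```

The same logic works in reverse for a too-low k: if every cluster is far from every other on most variables, there may still be room to split further.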