
3.12 Another Approach: Hierarchical Clustering


An alternative to k-means clustering is hierarchical clustering.  Unlike k-means, hierarchical clustering does not involve any randomness.  

With hierarchical clustering, we can decide on a metric for measuring distances between records as well as a method for determining one cluster’s distance from another cluster.  To think about why the distance between two clusters could be calculated differently, imagine someone asking you “How far apart are Boston and Cambridge?”5

While this question seems quite straightforward, the best way to answer it is not always immediately clear.  Since the cities border one another along several bridges and a stretch of land, you could say something like "one nanometer."  Alternatively, you could pick a common landmark in Boston, like the Boston Common, and answer with its distance to a Cambridge landmark, such as Harvard Square (4.3 miles).  Or you could say something like "18 miles" because you are thinking about the distance from a southerly part of Boston to a northern part of Cambridge.  The distances between hierarchical clusters work in a similar way!

| Method | How it Works | Keep in Mind… |
|---|---|---|
| Single | The distance between two clusters is the shortest distance between any two records (one from each cluster) | Emphasis on the two most similar points can lead to "long chains" (other points in Cluster A may be very far from Cluster B, but those long distances would be ignored) |
| Complete | The distance between two clusters is the greatest distance between any two records (one from each cluster) | Many points in Cluster A will be close to points in Cluster B, since the inter-cluster distance is assessed by the largest distance between A and B |
| Ward | When joining clusters together, considers the resulting sum of squared errors (based on records' distances from centroids); makes the merger decision in a way that minimizes SSE within the resulting new cluster | Inter-cluster distance metric is similar to the process used by k-means to determine centroid placement |
| Average | The distance between two clusters is the arithmetic mean of all the pairwise distances between records in the two clusters | Less impacted by outliers than single or complete, which are defined by just one pairwise relationship |
| Centroid | The distance between two clusters is the distance from one cluster's centroid to the other's centroid | Based on just one distance (Cluster A's centroid to Cluster B's centroid); also, note that the centroid is not necessarily an actual point in the dataset |
| Median | The distance between two clusters is the median of all the pairwise distances between records in the two clusters | Compared with the average method, would be even less impacted by outliers, since the median is robust to unusual values |

After seeing all of these inter-cluster distance metrics, you might naturally wonder, "Which one should I use?"  This question is so specific to the particular quirks of each dataset, and to the purposes of each model, that there is no one-size-fits-all answer.  Automated methods of picking the "best" metric may rely on things like between-cluster and within-cluster variance; however, as is the case with k-means clustering, the method recommended by such an approach may not actually be ideal for the situation at hand, regardless of its statistical efficiency.

The best thing we can recommend is to explore and assess – look at the clusters generated by different model variants, and see how well those align with the goals of the model.  
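One way to explore, sketched below under stated assumptions (the array name X_scaled and the toy data are placeholders, not the book's actual data), is to fit the linkage with several methods and glance at a rough diagnostic such as the cophenetic correlation alongside a visual review of the resulting clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Toy standardized data, standing in for the real standardized columns
rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(60, 5))

# Pairwise Euclidean distances between records
distances = pdist(X_scaled, metric="euclidean")

for method in ["single", "complete", "average", "centroid", "median", "ward"]:
    Z = linkage(X_scaled, method=method, metric="euclidean")
    # Cophenetic correlation: how faithfully the hierarchy preserves the
    # original pairwise distances (one rough diagnostic among many)
    coph_corr, _ = cophenet(Z, distances)
    print(f"{method:>9}: cophenetic correlation = {coph_corr:.3f}")
```

No single number should settle the question; the point is simply to generate candidate models to inspect against the goals at hand.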

For the hierarchical clustering model here, we will take a random sample of 60 rows from port_fam.  We do this only for demonstration purposes – using a smaller number of records will enable us to more clearly see the labels on the dendrogram.

After importing the dataset, taking the sample, and standardizing the numeric variables, we will generate a distance matrix.  This matrix will form the basis for a hierarchical clustering model built from this data.  Notice that we are using the 'ward' method in this example, which joins the pair of clusters whose merger produces the smallest increase in the sum of squared errors between data points and their cluster centroid.  A similar sum-of-squares criterion underlies k-means clustering.
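The original code is not reproduced here, but the steps just described might look something like this sketch (the file name, random seed, and variable names are assumptions made for illustration):

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler

# Hypothetical file name; adjust to match the actual data source
port_fam = pd.read_csv("port_fam.csv")

# Random sample of 60 rows, purely so the dendrogram labels stay readable
port_sample = port_fam.sample(n=60, random_state=123).reset_index(drop=True)

# Standardize the numeric variables
scaler = StandardScaler()
X_scaled = scaler.fit_transform(port_sample.select_dtypes("number"))

# Pairwise Euclidean distance matrix (condensed form)
dist_matrix = pdist(X_scaled, metric="euclidean")

# Hierarchical clustering with Ward's method
Z = linkage(dist_matrix, method="ward")
```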

Next, we generate a dendrogram to view our model.
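Continuing the sketch above, one way to draw the dendrogram from the linkage matrix Z is:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

plt.figure(figsize=(14, 6))
dendrogram(Z, labels=port_sample.index.to_list())
plt.xlabel("Record (row number in port_sample)")
plt.ylabel("Distance")
plt.show()
```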

Across the x-axis are the rows in our dataset sample, numbered 0 through 59.  The numbers on the y-axis represent distances.

The horizontal lines on a dendrogram represent the joining of clusters.  The height at which a horizontal line sits indicates the distance at which the two clusters were joined.  Note that hierarchical clusters are nested; in other words, once records have joined together in such a model, they remain joined at greater distances.

Near the left-most part of the graph, we can see that records 10 and 53 have joined at a distance of approximately 2.  Since we specified 'euclidean' as our distance metric earlier, this calculation is performed by the model in the same way that was demonstrated earlier in the chapter.  Here is a peek into the data to see why these two records are fused together at this point.
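To check this for ourselves, we could compute the Euclidean distance between the two standardized records directly (continuing the sketch above; the exact value depends on which rows land in the sample, but for the records shown here it is roughly 2):

```python
import numpy as np

# Euclidean distance between standardized records 10 and 53
diff = X_scaled[10] - X_scaled[53]
print(np.sqrt(np.sum(diff ** 2)))
```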

After viewing the dendrogram, we can make a decision regarding the number of clusters.  As with k-means clustering, the number of clusters is up to the modeler and should align with the business purpose at hand.  With a distance cutoff of 8, as shown below, we will obtain a four-cluster result.
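One way to visualize that cutoff, building on the dendrogram sketch above, is to draw a horizontal line at a distance of 8 and let the dendrogram color the clusters that form below it:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

plt.figure(figsize=(14, 6))
# color_threshold colors each cluster formed below the cutoff differently
dendrogram(Z, labels=port_sample.index.to_list(), color_threshold=8)
plt.axhline(y=8, color="gray", linestyle="--")  # the distance cutoff
plt.ylabel("Distance")
plt.show()
```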

The number of clusters in the model is equal to the number of vertical lines crossed by the distance threshold.  With a smaller distance threshold, we could wind up with a much larger number of clusters.

The code steps shown below are used to generate the cluster assignments for each record, and to reattach them to the port_sample dataframe.
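One way to carry out those steps with SciPy, continuing the sketch above (the column name "cluster" is an assumption), is:

```python
from scipy.cluster.hierarchy import fcluster

# Assign each record to a cluster by cutting the tree at distance 8
cluster_labels = fcluster(Z, t=8, criterion="distance")

# Reattach the assignments to the sampled dataframe
port_sample["cluster"] = cluster_labels
print(port_sample["cluster"].value_counts())
```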

We can generate per-cluster summary stats just as we did before with the k-means model.
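Continuing with the assumed "cluster" column, a per-cluster summary might look like:

```python
# Mean of each numeric variable within each cluster
cluster_summary = port_sample.groupby("cluster").mean(numeric_only=True)
print(cluster_summary.round(2))
```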

This table helps us to identify the salient features for each of the clusters.  We can see, for instance, that Cluster 2 households tend to be rather large, but without the same incomes or spending habits as those of the other clusters.  Clusters 1 and 3 both tend to have larger homes and bigger incomes, but they tend to spend differently – Cluster 1 goes bigger on travel, whereas Cluster 3 tends to spend more on entertainment.


5 For context, Boston and Cambridge are both located in eastern Massachusetts.  They share a water border and a land border.