3.13 Hierarchical Clustering with Variable Reweighting
Earlier in this chapter, we explained the importance of scaling, or standardizing, when working with variables measured in different units and on different scales.
As we demonstrated, scaling solves an important problem: a single unit of home value, measured in dollars, is not as meaningful as a single unit of household size, measured in people. A home with two residents is fundamentally different from one with 22, yet a home worth $500,000 has essentially the same value as one worth $500,020. As measured by Euclidean distance on the raw values, though, these two differences are identical: each pair is exactly 20 units apart.
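This point can be checked numerically. The minimal sketch below uses Python's standard `math.dist`; the three homes, written as (residents, home value) pairs, are made-up illustrations built from the figures in the text, not rows from the dataset.

```python
import math

# Hypothetical homes as (residents, home value in dollars).
base = (2, 500_000)
crowded = (22, 500_000)          # 20 more residents, same value
slightly_pricier = (2, 500_020)  # same residents, $20 more value

# Unscaled Euclidean distance treats both gaps as 20 "units".
d_people = math.dist(base, crowded)
d_dollars = math.dist(base, slightly_pricier)

print(d_people, d_dollars)  # 20.0 20.0
```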
When we converted home values and home residents to z-scores, we essentially leveled the playing field: a relative difference from the mean home value can now be compared to a relative difference from the mean household size in an “apples to apples” way. While scaling solves this problem, it may leave us with a new one. What if we want one variable to influence the model more than another? What if we think that home value should be 2.5 times as influential as household size? What if we think that household size should be 5 times as influential as home value?
We can achieve this with variable weighting. After first standardizing the variables, we can rescale them, multiplying each by a constant that reflects its desired weight in the model. Constants greater than 1 increase a variable's influence, while constants between 0 and 1 reduce it.
There are no firm rules about what those weights should be. Yet again (stop us if you’ve heard this one before), the decision should ultimately align with the business purposes of the model.
For the example below, we will assume that Lobster Land management is interested in retaining all of the original variables, but wishes to place a special emphasis on household income, pet ownership, and travel spending. Perhaps the park is considering leasing space on its property during the off-season to pet owners who like to take vacations, and wishes to identify likely clients. Whatever the reason, we will adjust the z-scores by multiplying each by the following constants:
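The two steps, standardizing and then multiplying by a constant, can be sketched with pandas as follows. Everything in this example is assumed for illustration: the data frame, its column names, and the weights of 2.5 and 1.0 are hypothetical and are not the values used in the chapter's model.

```python
import pandas as pd

# Made-up data; column names and values are illustrative only.
df = pd.DataFrame({
    "home_value": [250_000, 400_000, 550_000, 700_000],
    "household_size": [1, 2, 4, 5],
})

# Step 1: standardize each column to z-scores (mean 0, std 1).
z = (df - df.mean()) / df.std(ddof=0)

# Step 2: multiply each standardized column by its weight.
# These weights are hypothetical, not the ones used in the chapter.
weights = pd.Series({"home_value": 2.5, "household_size": 1.0})
weighted = z * weights  # the Series index aligns with the column names

# After weighting, each column's standard deviation equals its weight.
print(weighted.std(ddof=0))
```

Because Euclidean distance squares each difference, multiplying a standardized column by a constant w multiplies that column's contribution to the squared distance by w². The constants are therefore best read as a relative emphasis rather than as exact shares of influence.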

The head() function offers a good way for us to see what the re-weighting accomplished. For the variables that we scaled up, positive z-scores became far more positive, while negative ones became far more negative. For the ones that we scaled down, the ranges of their standardized values became more compressed.

Using the reweighted values, we create a new distance matrix, and then a new dendrogram, as shown below.


Using a y-axis threshold of 250, we will again generate four clusters with this data.
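One way to carry out these steps is with SciPy's hierarchical clustering tools. The sketch below is an assumption about tooling rather than the chapter's own code. It uses synthetic two-column data standing in for the reweighted z-scores, Ward linkage (suggested by the `ward_cluster` label names), and a cut height of 5 chosen for the toy data rather than the threshold of 250 used with the real data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Synthetic stand-in for the reweighted z-scores: four tight,
# well-separated groups of 10 points each (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.1, size=(10, 2)) for c in centers])

# Condensed matrix of pairwise Euclidean distances.
distances = pdist(X, metric="euclidean")

# Agglomerative clustering with Ward linkage on those distances.
Z = linkage(distances, method="ward")

# Cut the tree at a fixed height: merges above it are not applied,
# so each remaining subtree becomes one cluster.
labels = fcluster(Z, t=5, criterion="distance")

print(len(set(labels)))  # 4
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree, which is how a suitable cut height is usually chosen by eye.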

For the variables that we scaled up (household income, pet ownership, and travel spending), we should expect more variation from cluster to cluster, and less within-cluster variation. In other words, households that spend a lot on travel should be much more likely to be clustered together now, while those that spend very little on travel should also be much likelier to be grouped together.
Let’s compare the means and standard deviations from the unweighted model clusters (ward_cluster) with the ones from the weighted model clusters (ward_cluster2). As you look at the tables below, bear in mind that the cluster labels are arbitrary. In other words, there is nothing that connects Cluster 1 from one model iteration to Cluster 1 in another.
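A per-cluster summary of this kind can be produced with a pandas `groupby`. The sketch below uses a made-up data frame: the `income` and `travel_spend` column names and all of the values are hypothetical, while `ward_cluster` and `ward_cluster2` follow the label names used above.

```python
import pandas as pd

# Made-up households with cluster labels from two model runs.
df = pd.DataFrame({
    "income": [40_000, 45_000, 80_000, 90_000, 60_000, 62_000],
    "travel_spend": [2_000, 2_400, 4_000, 4_400, 3_000, 3_100],
    "ward_cluster": [0, 0, 1, 1, 0, 1],
    "ward_cluster2": [0, 0, 1, 1, 2, 2],
})

cols = ["income", "travel_spend"]

# Mean and standard deviation of each variable within each cluster.
unweighted = df.groupby("ward_cluster")[cols].agg(["mean", "std"])
weighted = df.groupby("ward_cluster2")[cols].agg(["mean", "std"])

print(unweighted)
print(weighted)
```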


The tables above show us what we should expect to see from the re-weighting: it caused the inter-cluster differences for these variables to increase, while the intra-cluster differences decreased.
How do we know this? In the first model, the per-cluster income means stretched from $49,434.00 to $76,019.00, for a range of $26,585.00, while the per-cluster travel means stretched from $2,345.60 to $3,831.48, for a range of $1,485.88. In the weighted model, the differences from cluster to cluster became starker: the ranges grew to $35,765.64 for income ($79,701.54 - $43,935.90) and to $2,029.85 for travel spending ($4,246.23 - $2,216.38). These are increases of 34.53% and 36.61%, respectively.
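The arithmetic behind these figures can be reproduced in a few lines of Python. The dollar amounts are the per-cluster extremes quoted above; the helper names `spread` and `pct_increase` are introduced here purely for illustration.

```python
# Lowest and highest per-cluster means reported in the text (dollars).
income_unweighted = (49_434.00, 76_019.00)
income_weighted = (43_935.90, 79_701.54)
travel_unweighted = (2_345.60, 3_831.48)
travel_weighted = (2_216.38, 4_246.23)

def spread(low_high):
    """Range between the lowest and highest cluster mean."""
    low, high = low_high
    return high - low

def pct_increase(before, after):
    """Percentage growth from `before` to `after`."""
    return (after - before) / before * 100

income_jump = pct_increase(spread(income_unweighted), spread(income_weighted))
travel_jump = pct_increase(spread(travel_unweighted), spread(travel_weighted))

print(round(income_jump, 2), round(travel_jump, 2))  # 34.53 36.61
```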
Meanwhile, intra-cluster differences fell considerably for the heavily emphasized variables. In the unweighted model, the mean of the four clusters' standard deviations for household income was $17,369.71. In the weighted model, that figure fell to $13,911.27. The impact on intra-cluster difference for travel spending was even starker, as the mean of those intra-cluster standard deviations fell from $563.10 to just $305.07.
Alternatively, we can look at the impact of the re-weighting visually. Below, the plot on the left shows the distribution of travel spending in the unweighted model, with a separate color for each cluster. It shows some notable distinctions, such as the way most of Cluster 0 falls to the right of Cluster 3. In the plot on the right, which shows the weighted model, the distinctions are much sharper, with far less overlap from cluster to cluster.

The scatterplots below also help to tell the story. In the unweighted model, depicted on the left, we can see some general patterns, with Clusters 0 and 1 occupying much of the upper-right portion of the graph. In the weighted model, depicted on the right, all four clusters are now clearly identifiable, with almost no overlap among them.
