Report an error

3.7 Scaling the Variables

Variable scaling, or standardization, is not necessarily a requirement for all clustering models. If we were to build a clustering model based on the outdoor activity survey mentioned earlier in this chapter, we would not need to perform any scaling, as all of the values reflect survey responses from 1 to 10.

However, it is much more common to encounter datasets whose variables are measured in different units. In such instances, the modeler should definitely scale the data in order to enable a meaningful comparison between them. To see why, let’s consider a model that includes the appraised value of someone’s home, as well as the number of residents who live in the home.

Before we look at the Euclidean distances involved, let’s take a moment to consider three families, the Smiths, the Joneses, and the Williamses. Glancing at the table below, we see that all three families have very similar estimated home values. The most remarkable difference we see here is that whereas the Smiths and Joneses seem to be average-sized families, with four and three members, respectively, the Williams family is quite large.

Family Name	Est. Home Value (USD)	Household members
Smith	500,000	4
Jones	540,000	3
Williams	550,000	11

What if we were asked to put the two most similar families from this group together? Glancing at the numbers above, and applying just a bit of domain knowledge, would lead us to place the Smiths with the Joneses.

What would happen if we did this calculation with the original variables? The figure below demonstrates the answer to this question.

As indicated by the figure above, if we simply calculate distances using the unscaled values, we obtain a result suggesting that Jones and Williams are the closest match. In the Euclidean distance calculation, home value dominates, while household size is effectively muted.

We can solve this problem with some method of variable scaling. The most typical type of scaling involves converting the original values to z-scores. An observation’s z-score for some variable represents that value’s distance from the mean, as measured in standard deviations. We can find the z-score for any value x by subtracting that variable’s mean from x, and then dividing by the variable’s standard deviation.

When we use z-scores, rather than the original values, the variables’ relative importance are now brought onto a level playing field. For this example, we will assume that the values for these three families come from a larger dataset, for which the mean home value is $525,000, with a standard deviation of $12,400, and for which the mean household size is 3.15, with a standard deviation of 0.86.

After defining homemade_zscore(), as shown above, we pass all three households’ values into the function.

Now that the original variables have become z-scores, an unusual household size bears the same impact as an unusual home value. The Euclidean distance calculations shown below, performed with the z-scores, show a result that will appear far more sensible to readers: now, the Smiths and the Joneses are the most similar pair, while both appear to be quite distinct from the Williams family.

Another method for standardizing data, min-max scaling, involves adjusting each value by subtracting the minimum value for that variable and dividing by its range. This approach ensures that all values for the scaled variables will fall between 0 and 1, with the largest value in any column taking a value of 1, and the smallest taking a value of 0.

To find the min-max scaled number for the Smiths’ household size, we start by subtracting the minimum household size value from theirs: 4-3 = 1. Then, we divide by the range for that variable (8) to arrive at 1 / 8 = 0.125.

To find the min-max scaled number for the Jones’ home value, we start by subtracting the minimum home value from theirs: 540,000-500,000 = 40,000. Then, we divided by the range of home values (50,000) to arrive at 40,000 / 50,000 = 0.8.

Family	Original Home Value	Original Household Size	Min-Max Scaled Home Value	Min-Max Scaled Household Size
Smith	500,000	4	0	0.125
Jones	540,000	3	0.8	0
Williams	550,000	11	1	1

Conversion of the data to z-scores is far more common than min-max scaling. Min-max scaling may be preferred in situations in which the modeler wishes to constrain the data to values between 0 and 1, and/or avoid dealing with negative numbers. Note that min-max scaling results can be heavily influenced by the presence of outliers. Outliers expand the dataset’s range, which tends to mute the distinctions among non-outlier values when the data is scaled this way.