
3.2 Euclidean Distance

Intuitively, we can look at the people, places, and things all around us, and develop a general sense of the difference, or sameness, among them. Without requiring any math, or any coding, people are readily able to distinguish between images of urban and rural landscapes, between groups of teenagers at a rock concert and groups of senior citizens on bingo night, or between eloquent speakers and buffoons.

Ask them precisely how they can make such distinctions, however, and you might be met with blank stares – after all, they do it intuitively, without really needing to think about the process. To build a model, however, we cannot rely on intuition alone. Instead, we need some way to measure those differences, and to enable comparisons among the relative distances between various objects. For k-means clustering, that method will be Euclidean distance.

The Euclidean distance between any two records is found by first squaring each of the pairwise differences between those records, summing those squares, and then taking the square root of that sum.
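Written as a formula, the description above looks like this. The symbols are our own choice of notation: p and q stand for the two records, n is the number of attributes being compared, and p_i and q_i are the two records' values for the i-th attribute.

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```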

Here, we will first demonstrate how Euclidean distance can be used to measure the difference between two records; in the next section, we will see how Euclidean distance can determine a record’s cluster assignment.

For this example, we will assume that a group of five friends has been asked to rate their interest levels in a series of outdoor activities, on a scale from 1 to 10. A response of “10” means that the respondent would be highly interested in such an activity during his or her time off from work. A “1” means that the respondent would have no interest at all in pursuing the activity. A “5” represents a neutral perspective, and all of the other integer values between 1 and 10 can also be used to express relative interest in the activity.

	Basketball	Golf	Frisbee	Swimming	Skydiving	Hiking	Fishing
Anson	6	6	8	5	7	4	9
Bradford	5	7	3	4	6	8	6
Carlton	2	3	2	1	9	10	5
Dalton	5	10	5	8	4	2	7
Edgar	10	5	1	9	2	3	4

From among Bradford, Carlton, Dalton, and Edgar, who is most similar to Anson? Who is most different? A quick “eyeball test” here will not help us, especially since there are some in the group who are very similar to Anson in particular categories, yet very different in others.

To find the Euclidean distance between any two records, we will follow the procedure described above. We will:

Square each of the pairwise differences;
Sum those squared values;
And take the square root of that sum.

Let’s begin with Anson and Bradford. The differences between their answers, moving across the categories, subtracting Bradford’s numbers from Anson’s, are: 1, -1, 5, 1, 1, -4, and 3 (note that if we had found these by subtracting Anson’s scores from Bradford’s scores, the end result would be the same, since we will square these values in the next step).

Those squared values, respectively, are 1, 1, 25, 1, 1, 16, and 9. Those values sum to 54. The square root of 54 is approximately 7.35 – this is the Euclidean distance between Anson and Bradford. In and of itself, though, this number does not offer us much meaning; only when we can compare it to the other distances can we get a comparative sense of Anson’s distance, or closeness, to Bradford.
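The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the steps described above, not a required part of the method; the variable names (anson, bradford, differences, squares, total, distance) are our own, and the ratings are copied from the table.

```python
import math

# Ratings copied from the table, in column order:
# Basketball, Golf, Frisbee, Swimming, Skydiving, Hiking, Fishing
anson = [6, 6, 8, 5, 7, 4, 9]
bradford = [5, 7, 3, 4, 6, 8, 6]

# Pairwise differences (Anson's score minus Bradford's score)
differences = [a - b for a, b in zip(anson, bradford)]

# Step 1: square each of the pairwise differences
squares = [d ** 2 for d in differences]

# Step 2: sum those squared values
total = sum(squares)

# Step 3: take the square root of that sum
distance = math.sqrt(total)

print(differences)         # [1, -1, 5, 1, 1, -4, 3]
print(squares)             # [1, 1, 25, 1, 1, 16, 9]
print(total)               # 54
print(round(distance, 2))  # 7.35
```

Swapping the order of subtraction (b - a instead of a - b) flips the signs in differences but leaves squares, total, and distance unchanged.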

As it turns out, Anson is even closer to Dalton – their Euclidean distance (7.21) is just a hair smaller than the distance between Anson and Bradford (7.35), while his distances to Carlton and Edgar are tied at 11.53. To see where those numbers come from, follow the steps above and see if you can work them out with a pen, paper, and a calculator.
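Once you have tried it by hand, one way to check your answers is the short sketch below. It assumes Python 3.8 or later, where the standard library's math.dist function returns the Euclidean distance between two points. The ratings dictionary and the name distances_from_anson are our own, with the values copied from the table.

```python
import math

# Ratings copied from the table, in column order:
# Basketball, Golf, Frisbee, Swimming, Skydiving, Hiking, Fishing
ratings = {
    "Anson":    [6, 6, 8, 5, 7, 4, 9],
    "Bradford": [5, 7, 3, 4, 6, 8, 6],
    "Carlton":  [2, 3, 2, 1, 9, 10, 5],
    "Dalton":   [5, 10, 5, 8, 4, 2, 7],
    "Edgar":    [10, 5, 1, 9, 2, 3, 4],
}

# Euclidean distance from Anson to each of the other four friends,
# rounded to two decimal places
distances_from_anson = {
    name: round(math.dist(ratings["Anson"], scores), 2)
    for name, scores in ratings.items()
    if name != "Anson"
}

# Print from nearest to farthest
for name, d in sorted(distances_from_anson.items(), key=lambda item: item[1]):
    print(name, d)
# Dalton 7.21
# Bradford 7.35
# Carlton 11.53
# Edgar 11.53
```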

Besides Euclidean distance, there are dozens of other ways to measure the distances between records. Here is an article showing nine different types of distance metrics, their formulas, and their associated pros and cons.¹

¹ Grootendorst, Maarten. “9 Distance Measures in Data Science: The advantages and pitfalls of common distance measures,” Towards Data Science, 01FEB2021. https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa