
1.20 Another Type of Correlation: Spearman


Although the Pearson product-moment correlation is by far the most common type of correlation, it is not the only game in town.

An alternative, the Spearman correlation, is calculated from the rank relationships between two variables. Since the Spearman correlation between two variables is based on the pairwise differences in the rankings of values within the data, rather than on the values themselves, it mutes the impact of outliers.

To see the Spearman correlation in action, let’s start by taking a look at the relationship between soda sales and snow-cone slushie sales at Lobster Land in July 1976.

For nearly all of the days shown here, the number of sodas sold looks pretty stable – and so does the number of slushies sold.  However, something a bit crazy happened on July 31st that year – as part of a one-day sales promotion, Lobster Land offered slushies for $0.01 each!  On that day, slushie sales jumped to roughly ten times their usual level.
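Since the original figures aren't reproduced here, a toy stand-in with the same shape can be built like this. All of the numbers below are invented, not Lobster Land's actual data: 31 days of steady soda sales, slushie sales that track them, and a tenfold slushie spike on the last day.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the July 1976 data: steady soda sales,
# slushie sales that track them, and a promotion-day spike.
sodas = np.append(np.arange(185, 215), 200)   # 31 days of soda sales
slushees = sodas / 2.0                        # slushies track sodas
slushees[-1] = slushees[:-1].mean() * 10      # July 31st: ~tenfold jump

july = pd.DataFrame({
    "date": pd.date_range("1976-07-01", periods=31),
    "sodas_sold": sodas,
    "slushees_sold": slushees,
})
print(july.tail(3))
```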

If we calculate the Pearson product-moment correlation between soda and slushie sales for that month, with the 31st removed from the calculation, we see evidence of the strong linear relationship between these variables, as the correlation is nearly 0.77.

Using all of the data, however, we arrive at a correlation that looks downright confusing.  The inclusion of this single outlier value brings us from a positive correlation with a large magnitude all the way to one with a small magnitude and a negative sign.  Indeed, things get so crazy here that the result suggests that more soda sales are associated with fewer slushie sales!
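A quick sketch with invented numbers (again, not the book's actual data) shows the same effect: for 30 days slushie sales track soda sales perfectly, and one promotion-day outlier wrecks the Pearson correlation. In this particular toy dataset the outlier collapses the correlation toward zero rather than flipping its sign, but the lesson is the same.

```python
import numpy as np
from scipy.stats import pearsonr

# Invented data: slushies track sodas exactly for 30 days,
# then a huge slushie outlier on day 31.
sodas = np.append(np.arange(185, 215), 200).astype(float)
slushees = sodas / 2.0
slushees[-1] = 1000.0   # promotion-day outlier

r_clean, _ = pearsonr(sodas[:-1], slushees[:-1])  # July 31st removed
r_full, _ = pearsonr(sodas, slushees)             # all 31 days
print(round(r_clean, 2), round(r_full, 2))
```

One extreme value in one variable is enough to dominate the variance term in Pearson's formula, dragging the correlation far from the value that describes the other 30 days.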

Based on our domain knowledge of this dataset, and of these variables, we know that this is a nonsense result.  Rather than remove the outlier row, though, we can take an alternative approach that preserves all of our data – we can use a different type of correlation.

This alternative, the Spearman correlation, generates a result based on the ranks of the variables’ values, rather than the underlying values themselves.  In the formula below, D stands for the pairwise differences in ranks, and n stands for the number of observations.
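Written out, the standard rank-difference form of the Spearman correlation (which assumes no tied ranks) is:

    r_s = 1 − (6 × ΣD²) / (n(n² − 1))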

To see how this mutes the impact of outliers, take a look at the dataframe below.  Using pandas’ rank() function, we add ‘soda_rank’ and ‘slushee_rank’ columns for each observation.  The ranks here are ascending, with the lowest value taking rank 1 and the highest taking rank 31; the ranking can be done in either direction, as long as it remains consistent between the two variables.
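A small invented example (five days instead of 31, with a slushie spike on the last day) shows how rank() behaves. The column names mirror the ones used in the text; note that the 1000-slushie outlier simply gets the top rank of 5 – exactly the rank it would get if it were 111.

```python
import pandas as pd

# Five invented days of sales, with a slushie outlier on the last day
df = pd.DataFrame({
    "sodas_sold":    [210, 195, 220, 205, 200],
    "slushees_sold": [105,  98, 110, 103, 1000],
})

# Ascending ranks: the smallest value in each column gets rank 1
df["soda_rank"] = df["sodas_sold"].rank()
df["slushee_rank"] = df["slushees_sold"].rank()
print(df)
```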

Next, we’ll find the differences between ‘soda_rank’ and ‘slushee_rank’ for each observation, and then square those values:

Those values are shown in the ‘squared_diffs’ column below:
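Continuing the five-day invented example from above-style data (still not the real Lobster Land numbers), the differencing and squaring step looks like this:

```python
import pandas as pd

# Five invented days of sales, with ranks as before
df = pd.DataFrame({
    "sodas_sold":    [210, 195, 220, 205, 200],
    "slushees_sold": [105,  98, 110, 103, 1000],
})
df["soda_rank"] = df["sodas_sold"].rank()
df["slushee_rank"] = df["slushees_sold"].rank()

# Pairwise rank differences, squared
df["squared_diffs"] = (df["soda_rank"] - df["slushee_rank"]) ** 2
print(df["squared_diffs"].sum())
```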

Finally, we can obtain the Spearman correlation as shown below, by plugging the sum of the squared rank differences (ΣD²) and the number of observations (n) into the Spearman correlation formula shown earlier.
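With the same five-day invented data, the "long way" computation is just arithmetic on the squared rank differences:

```python
import pandas as pd

# Five invented days of sales
df = pd.DataFrame({
    "sodas_sold":    [210, 195, 220, 205, 200],
    "slushees_sold": [105,  98, 110, 103, 1000],
})
df["soda_rank"] = df["sodas_sold"].rank()
df["slushee_rank"] = df["slushees_sold"].rank()
df["squared_diffs"] = (df["soda_rank"] - df["slushee_rank"]) ** 2

n = len(df)                          # number of observations
sum_d2 = df["squared_diffs"].sum()   # sum of squared rank differences
rho = 1 - (6 * sum_d2) / (n * (n**2 - 1))
print(rho)  # 0.4
```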

Of course, it’s much quicker to arrive at this number with the spearmanr() function from scipy!  That approach is shown below.  However, seeing it done the “long way” – even just once – helps build an understanding of where this statistic comes from.  Since the ranks of n objects can differ by at most n-1, outliers have only a limited impact on a Spearman correlation.
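On the same five-day invented data, scipy.stats.spearmanr() returns the same value as the hand computation:

```python
import pandas as pd
from scipy.stats import spearmanr

# Five invented days of sales, with a slushie outlier on the last day
df = pd.DataFrame({
    "sodas_sold":    [210, 195, 220, 205, 200],
    "slushees_sold": [105,  98, 110, 103, 1000],
})

# spearmanr returns the correlation and a p-value
rho, pval = spearmanr(df["sodas_sold"], df["slushees_sold"])
print(round(rho, 3))  # 0.4
```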


https://commons.wikimedia.org/wiki/File:Sigh_and_Elaine_Bell_Catering_Pop-Up,_Sonoma,_California_01.jpg