
1.19 Relationships Between Variables: Covariance and Correlation


Yet another central component of EDA is building an understanding of the interrelationships between variables.  Understanding such relationships can have many practical applications for a business. For instance, if Lobster Land identifies a direct, positive, linear relationship between expected high daily temperature and interest in the waterslide and wave pool, it can make staff allocation decisions after checking the weather forecast.  

First, let’s take a look at covariance, which measures the way that two variables move together.  When the covariance between two variables is positive, we expect to see that as one variable goes higher, so does the other; likewise, as one goes down, so does the other.  When the covariance is negative, the variables tend to move in opposite directions.

Let’s look at ten days’ worth of Lobster Land data to examine the relationship between high daily temperature and total water ride demand (each time a visitor uses either the waterslide or the wave pool, this variable is incremented by one):

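| Day # | High Temperature | Water Ride Demand |
|---|---|---|
| 1 | 77 | 912 |
| 2 | 65 | 759 |
| 3 | 81 | 1256 |
| 4 | 68 | 569 |
| 5 | 85 | 1612 |
| 6 | 78 | 944 |
| 7 | 71 | 939 |
| 8 | 75 | 822 |
| 9 | 88 | 1539 |
| 10 | 73 | 898 |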

This table shows us that there seems to be a relationship between the two variables.  It is not perfectly linear – note that the hottest day only shows the second-highest water ride demand – but there are other potential factors that could influence demand.  From this table, we do not know which days were weekends, or whether there were other weather-related influences, such as thunderstorms, among this group of ten days.

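To compute the covariance by hand, we subtract each variable's mean from every observation, then multiply the two deviations together for each day:

| Day # | High Temperature (x) | Water Ride Demand (y) | Mean High Temperature | Mean Water Ride Demand | x - mean(x) | y - mean(y) | (x - mean(x)) * (y - mean(y)) |
|---|---|---|---|---|---|---|---|
| 1 | 77 | 912 | 76.1 | 1009 | 0.9 | -97 | -87.3 |
| 2 | 65 | 759 | 76.1 | 1009 | -11.1 | -250 | 2775 |
| 3 | 81 | 1256 | 76.1 | 1009 | 4.9 | 247 | 1210.3 |
| 4 | 68 | 569 | 76.1 | 1009 | -8.1 | -440 | 3564 |
| 5 | 85 | 1479 | 76.1 | 1009 | 8.9 | 470 | 4183 |
| 6 | 78 | 944 | 76.1 | 1009 | 1.9 | -65 | -123.5 |
| 7 | 71 | 939 | 76.1 | 1009 | -5.1 | -70 | 357 |
| 8 | 75 | 822 | 76.1 | 1009 | -1.1 | -187 | 205.7 |
| 9 | 88 | 1512 | 76.1 | 1009 | 11.9 | 503 | 5985.7 |
| 10 | 73 | 898 | 76.1 | 1009 | -3.1 | -111 | 344.1 |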

After summing all of those values in the right-most column, we arrive at this figure:  18414.0.  We take this sum and divide it by the total number of records, 10, to find the covariance: 1841.4.

Note that when determining a covariance for a sample, rather than for a population, we would use n-1 in our denominator.  With these ten records, that would give 18414.0 / 9 = 2046.0.
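The hand calculation above can be reproduced in a few lines of Python. The following is a minimal sketch, assuming NumPy is installed; the variable names `temps` and `water_total` are illustrative, and the values are the ten rows of the worked table with the deviation columns. It computes both the population covariance (dividing by n) and the sample covariance (dividing by n-1).

```python
import numpy as np

# Ten days of data from the worked table (variable names are illustrative)
temps = np.array([77, 65, 81, 68, 85, 78, 71, 75, 88, 73])
water_total = np.array([912, 759, 1256, 569, 1479, 944, 939, 822, 1512, 898])

# Deviation of each observation from its variable's mean
temp_dev = temps - temps.mean()
demand_dev = water_total - water_total.mean()

# Sum of the products of paired deviations (the right-most table column)
sum_of_products = (temp_dev * demand_dev).sum()

n = len(temps)
cov_population = sum_of_products / n       # population: divide by n
cov_sample = sum_of_products / (n - 1)     # sample: divide by n - 1

# np.cov returns a 2x2 matrix; the off-diagonal entry is the covariance.
# ddof=0 divides by n, while the default divides by n - 1.
cov_numpy_population = np.cov(temps, water_total, ddof=0)[0, 1]
cov_numpy_sample = np.cov(temps, water_total)[0, 1]

print(round(cov_population, 1), round(cov_sample, 1))
```

The manual results and the `np.cov` results should agree, and the population figure should match the 1841.4 found by hand.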

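When people speak of the “correlation” between two variables, they are most typically referring to a Pearson product-moment correlation. Covariance tells us the direction of a relationship between two variables, but its magnitude depends on the units in which those variables are measured, so it is hard to interpret on its own.  Pearson’s correlation rescales the covariance to a unitless value between -1 and 1, which indicates both the direction and the strength of the linear relationship.  This value is found by dividing the covariance by the product of the two variables’ standard deviations.  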
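As a rough illustration of that division, the sketch below computes r from the covariance and the two standard deviations, then compares the result against NumPy's built-in `np.corrcoef`. It assumes NumPy is installed and reuses the ten rows from the worked covariance table; the variable names are illustrative. The covariance and the standard deviations must use the same denominator convention (here, the population versions).

```python
import numpy as np

# Ten days of data from the worked table (variable names are illustrative)
temps = np.array([77, 65, 81, 68, 85, 78, 71, 75, 88, 73])
water_total = np.array([912, 759, 1256, 569, 1479, 944, 939, 822, 1512, 898])

# Population covariance (divide by n)
cov_population = np.cov(temps, water_total, ddof=0)[0, 1]

# np.std defaults to the population standard deviation (ddof=0),
# which matches the denominator used for the covariance above
r_manual = cov_population / (temps.std() * water_total.std())

# Built-in check: off-diagonal entry of the correlation matrix
r_numpy = np.corrcoef(temps, water_total)[0, 1]

print(round(r_manual, 3), round(r_numpy, 3))
```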

We can also check a correlation’s significance, using the stats.pearsonr() function from scipy’s stats module.
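A minimal sketch of such a call is shown here, assuming SciPy is installed. The list names `temps` and `water_total` are illustrative, and the values are the ten rows from the worked covariance table.

```python
from scipy import stats

# Ten days of data from the worked table (variable names are illustrative)
temps = [77, 65, 81, 68, 85, 78, 71, 75, 88, 73]
water_total = [912, 759, 1256, 569, 1479, 944, 939, 822, 1512, 898]

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(temps, water_total)

print(f"r = {r:.3f}, p-value = {p_value:.5f}")
```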

The first number returned is r, the correlation coefficient.  The second number returned is a p-value.  This very low number indicates that if there were truly no linear relationship between high temperature and water ride demand, the chance of obtaining an r value at least as extreme as the one obtained here would be around 0.02%.  We can conclude that this correlation is statistically significant!  

Always keep in mind that variables can be strongly related to one another, yet still show a very low linear correlation, if their relationship is a nonlinear one.  

For an example of this, let’s take a look at some recent data from the ‘Strongman-O-Meter’ attraction at Lobster Land.  The Strongman-O-Meter enables a participant to swing a sledgehammer against a bell that is positioned on the ground.  Based on the amount of force applied to the bell, a ball rises to a certain point on the meter.  Many participants try out the Strongman-O-Meter for bragging rights – they hope to demonstrate their physical prowess by reaching the highest point on the meter.¹¹

The secret, however, is that using a moderate level of force is the way to reach the top of the meter.  Applying too little force, or applying too much force, will prevent the meter from reaching the top.  

We gathered some data from Lobster Land, based on the 2500 most recent instances in which a visitor tried the Strongman-O-Meter.   That data is shown below, in which we can see the average meter_height values for 20 separate levels of applied force.

As the plot below indicates, there is clearly a close relationship between force_level and meter_height.  Up until a force level of 11, the meter height steadily increases; beyond 11, the meter height actually decreases, albeit with a gentler slope after 14.

While this relationship is undeniably present, it is a non-linear one.  As a result, if we try to express the relationship with a Pearson product-moment correlation, we get a result that suggests only a moderate relationship.

In reality, however, as the graph shows, we can very clearly predict the expected meter height, using the amount of force applied.  The takeaway is that we should not conclude from an unremarkable Pearson correlation alone that two variables lack a strong relationship.
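The same effect can be seen with a small synthetic example. The sketch below uses made-up values rather than the actual Lobster Land data: a hypothetical inverted-U curve over 20 force levels, peaking at level 11. It assumes NumPy is installed. Even though meter height here is fully determined by force level, Pearson's r comes out low, while a quadratic fit tracks the data almost perfectly.

```python
import numpy as np

# Hypothetical data (not the real attraction data): 20 force levels,
# with meter height rising to a peak at level 11 and falling afterward
force_level = np.arange(1, 21)
meter_height = 100 - (force_level - 11) ** 2

# Pearson's r only captures the linear part of the relationship
r_linear = np.corrcoef(force_level, meter_height)[0, 1]

# A quadratic fit captures the curve; correlate its predictions
# with the observed heights to see how well it tracks the data
coeffs = np.polyfit(force_level, meter_height, deg=2)
predicted = np.polyval(coeffs, force_level)
r_quadratic = np.corrcoef(predicted, meter_height)[0, 1]

print(round(r_linear, 2), round(r_quadratic, 2))
```

In this made-up case the relationship is deterministic, yet the linear correlation is weak, which is why plotting the data before relying on r is a sensible habit.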


11 https://commons.wikimedia.org/wiki/File:Herne_-_Cranger_Kirmes_2012_070_ies.jpg