
16. Advanced Modeling Techniques – Interaction Terms


In standard, plain-vanilla regression models, we consider each independent variable's impact on the response variable in isolation. In fact, the correct interpretation of each coefficient in such a model is that it shows how a one-unit change in an input variable impacts the response variable, with all else held constant. In many real-life situations, though, there are complex relationships among variables. When interactions among the independent variables are present, changing the level of one input not only affects the response variable, but also changes the way that other independent variables impact the response.

In everyday life, we can observe interactions with food and tastes.  Imagine asking a room full of people, “How many of you like ketchup?”  Next, imagine asking them, “How many of you like ice cream?”  Now, ask that same group, “Okay, great…so how many of you like to put ketchup on your ice cream?”  Somehow, even in a room full of people who love both ketchup and ice cream, we’ll be hard-pressed to find even a single fan of tomato paste-laden mint chocolate chip.  

As silly as this example may sound, it could be modeled with linear regression. We could use some kind of numeric rating system for dishes as a response variable, and the model could offer statistical evidence that putting ketchup and ice cream together would indeed be a waste of both ketchup and ice cream.

Interactions can also be observed in many everyday situations involving people.  

Imagine that you are assigned to a group project with Terry, one of your favorite co-workers.  When you work with Terry, you notice that you feel more energized.  Somehow, your creativity becomes greater when Terry is there to brainstorm with you.  Meanwhile, Terry knows that you are a great listener, and that you are reliable.  Therefore, Terry always feels more motivated when you are assigned to the same project team.  In short, you make Terry better, and Terry makes you better.  

If we could model your expected contribution to a team project, we would assign some coefficient value to it – in other words, your presence on a team contributes x units of achievement. Likewise, we would do the same for Terry, and come up with a coefficient of y units. We could do the same thing for all the workers in your office, who would each have some unique 'value-added' coefficient.

The strong, positive interaction between you and Terry, though, means that when you and Terry work together, the collective impact of your contributions exceeds x + y. When you and Terry are on the same team, the 'standard' regression model with separate terms for each of you would consistently underpredict the team's contribution. The missing element is the interaction term. By including an interaction term, we can capture your input, Terry's input, and the special 'extra' benefit that comes from you and Terry being placed together.


Whereas the model without the interaction term would take this form:

y = β₁X₁ + β₂X₂

The model with the interaction would take this form:

y = β₁X₁ + β₂X₂ + β₃X₁X₂

The third term here is the interaction term: the coefficient β₃, multiplied by the product of X₁ and X₂, is added to the other terms in the model.
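To make this concrete with some made-up coefficients: suppose β₁ = 2, β₂ = 3, and β₃ = 0.5. With X₁ = 10 and X₂ = 10, the model without the interaction predicts 2(10) + 3(10) = 50, while the model with the interaction predicts 2(10) + 3(10) + 0.5(10)(10) = 100. The extra 50 units of the response come entirely from the interaction term.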

If any independent variable is included in a regression model as part of an interaction term, then that variable should be kept in the model, regardless of its p-value. This is known as the hierarchical principle.¹

In marketing, interactions are sometimes referred to as synergies.  We will see one in the example below.  

Let’s take a look at the dataset mspend_rev.csv, which can be found at lobsterland.net/datasets.

This dataset contains 52 weeks' worth of data for Lobster Land's online merchandise store, also known as the "Merch Store." Each row includes merch_sales, the store's weekly merchandise revenue; the other columns provide info about how the Merch Store's marketing budget was spent during the prior week. All figures here are expressed in USD, and all are rounded to the nearest dollar. The spending categories are: outdoor display ads, such as billboards and bus stop posters; streaming service ads, including commercials played on YouTube, Spotify, and Pandora; and college radio, sponsored content from Lobster Land that helps to support a local, independent radio station run by college students.

It looks like we can use this dataset to predict merch_sales from the other variables with an ordinary least squares (OLS) linear regression model. Before we go any further, though, let's check our potential independent variables for correlations with one another.
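As a minimal sketch, here is how that check might look in pandas. The file path and the column names (merch_sales, outdoor_display, stream_service, college_radio) are assumptions based on the dataset description above.

import pandas as pd

# Load the weekly Merch Store data (file assumed to be in the working
# directory; column names assumed from the description above)
mspend = pd.read_csv('mspend_rev.csv')

# Pairwise correlations among the potential independent variables
print(mspend[['outdoor_display', 'stream_service', 'college_radio']].corr())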

Since none of our potential independent variables are highly correlated with any others, we can proceed with the model. 

Let’s start by trying to model this with all three inputs:
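A sketch of this first model, using the statsmodels formula interface (column names assumed as above):

import statsmodels.formula.api as smf

# OLS model with all three marketing inputs as predictors
model1 = smf.ols('merch_sales ~ outdoor_display + stream_service + college_radio',
                 data=mspend).fit()
print(model1.summary())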

From this first model, we can see that college radio ad spending does not have a significant impact on merch sales: its p-value is above our alpha threshold of 0.05, and its confidence interval includes 0. The model's F-statistic tells us that there is a meaningful relationship between the response variable and our inputs as a group. The very low p-values for outdoor_display and stream_service indicate that these independent variables are highly significant. Therefore, let's now build a model using just these two inputs.
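Continuing the sketch from above:

# Refit the model with only the two significant inputs
model2 = smf.ols('merch_sales ~ outdoor_display + stream_service',
                 data=mspend).fit()
print(model2.summary())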

This model looks pretty good. The adjusted R-squared of 0.794 shows us that the model explains nearly 80 percent of the variation in merch sales. As a diagnostic check, let's take a look at a residuals vs. fitted values plot.
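One way to build this plot is with matplotlib and the lowess smoother from statsmodels; this is just a sketch, and any plotting approach that shows residuals against fitted values with a smoothing line would work.

import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# Residuals vs. fitted values, with a lowess smoothing line
fitted = model2.fittedvalues
resid = model2.resid
smooth = lowess(resid, fitted)  # returns sorted (fitted, smoothed resid) pairs

plt.scatter(fitted, resid, alpha=0.6)
plt.plot(smooth[:, 0], smooth[:, 1], color='blue')
plt.axhline(0, color='gray', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()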

The plot here shows the model predictions (the fitted values) on the x-axis, with the model residuals on the y-axis, along with a solid blue smoothing line. In an ideal model, we would see a flat blue line where y equals 0, and a roughly even distribution of residuals on either side of the line. This plot looks that way in the middle, but there seems to be an issue at the edges – at the high end of the fitted values, the model is underpredicting sales, and at the low end, it is overpredicting sales. Perhaps, therefore, the model is missing an interaction effect between these two ad formats – it may be the case that when spending on one format is high, it makes the other format more effective, and vice versa.

When an interaction between two independent variables positively impacts the response variable, the inclusion of both inputs in the model leads to a greater outcome value than we would expect by summing the expected contributions of the two independent variables.  In the marketing scenario here, this may be caused by consumers’ perception of the messages.  When we are exposed to marketing messages through multiple formats, the messages may tend to be more ‘sticky’ in our minds.  

Using statsmodels, we can code an interaction by adding a new term to the model inputs – we will pass the names of the two variables involved in the interaction, separated by a : symbol.  
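Here is what that might look like for our two remaining inputs:

# Model with main effects plus an outdoor_display:stream_service interaction
model3 = smf.ols('merch_sales ~ outdoor_display + stream_service '
                 '+ outdoor_display:stream_service',
                 data=mspend).fit()
print(model3.summary())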

Sure enough, the interaction term is quite significant, with a p-value of just 0.032.  Keeping the hierarchical principle in mind, we will keep the main effects of outdoor_display and stream_service in the model as well.  

A look at the residual plot for the model with the interaction term shows a big improvement.  The smoothing line here looks nearly perfect, and we don’t see any patterns here of overprediction or underprediction. 

To see how the model with an interaction term makes a prediction for a record, we can use the code below. First, we use the predict() function to find the model's fitted value for the observation in the first row. Then, after viewing the params attribute, we can manually generate the prediction: we add the intercept to the β₁X₁ and β₂X₂ terms, and then add the interaction coefficient multiplied by the product of the two input values.
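A sketch of that process, again assuming the column names used above (in the statsmodels params output, the interaction coefficient is labeled outdoor_display:stream_service):

# Model's fitted value for the first row of the dataset
print(model3.predict(mspend.iloc[[0]]))

# Coefficient estimates, including the interaction coefficient
print(model3.params)

# Reproduce the prediction by hand
row = mspend.iloc[0]
b = model3.params
manual = (b['Intercept']
          + b['outdoor_display'] * row['outdoor_display']
          + b['stream_service'] * row['stream_service']
          + b['outdoor_display:stream_service']
          * row['outdoor_display'] * row['stream_service'])
print(manual)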


1. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. New York: Springer, 2013.