
11.13 ARIMA Methods for Forecasting


Next, we will explore another set of forecasting methods, collectively known as ARIMA.  ARIMA stands for Auto-Regressive Integrated Moving Average.  

An ARIMA model is defined by three components, often referred to as p, d, and q.  These correspond to the autoregressive, integrated, and moving average elements, respectively.  Each of these three elements is described below.

The p component is the auto-regressive term for the model.  

For an AR(1) model, we would use the previous value in the series (lag-1) to predict the next value.  This is not a naive forecast, though!  To generate that y_t prediction, we multiply the previous value by some coefficient, a_1.  The equation shown below also includes ε_t, a white noise error term, unrelated to any other such values in the series.

y_t = a_1 y_{t-1} + ε_t

You can think of an AR(1) model as being akin to a simple linear regression, in which y_t is the response variable, y_{t-1} is the input variable, a_1 is the beta coefficient, and ε_t is the residual.

For a stationary AR(1) process, the a_1 coefficient falls between -1 and 1; this constraint keeps the series stable and stationary.  When a_1 is negative, we are working with a mean-reverting series: higher values in period t-1 are likely to be followed by lower ones in period t, and vice versa.  When a_1 is positive, the series exhibits momentum, with high values tending to be followed by high values in the next period, and low values by low values.

A higher-order AR model incorporates more previous terms into the prediction for the next value in the series.  An AR(2) model, for instance, uses the previous two lags, each of which has its own coefficient:

y_t = a_1 y_{t-1} + a_2 y_{t-2} + ε_t

For p lags, we can write this equation more generally as:

y_t = a_1 y_{t-1} + a_2 y_{t-2} + … + a_p y_{t-p} + ε_t
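As a quick illustration (this code is not from the chapter itself), the sketch below simulates an AR(2) series using the ArmaProcess class from statsmodels; the coefficient values 0.5 and 0.3 are arbitrary choices for demonstration.

import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

# AR(2) example: y_t = 0.5*y_{t-1} + 0.3*y_{t-2} + ε_t
# ArmaProcess expects the lag polynomial 1 - a_1*B - a_2*B^2,
# so the AR coefficients are passed with their signs flipped
ar = np.array([1, -0.5, -0.3])
ma = np.array([1])   # no moving average terms
ar2_process = ArmaProcess(ar, ma)
simulated_ar2 = ar2_process.generate_sample(nsample=500)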

Next, we will consider the d component, which indicates the degree of differencing that our model will require.  Differencing will be covered in more detail later in this chapter.  

A first order difference is often written as:

                        (1 - B) y_t

In this notation, B stands for the backshift operator, which shifts the series back by one period: B y_t = y_{t-1}.  The first-order difference (1 - B) y_t is therefore just y_t - y_{t-1}.

Differencing can also be applied more than once.  We can write a dth order difference as:

(1 - B)^d y_t
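In pandas, differencing maps directly onto the .diff() method.  The short sketch below, using a made-up series purely for illustration, shows a first-order and a second-order difference.

import pandas as pd

y = pd.Series([10, 12, 15, 19, 24])   # hypothetical values for illustration

first_diff = y.diff()           # (1 - B) y_t  =  y_t - y_{t-1}
second_diff = y.diff().diff()   # (1 - B)^2 y_t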

Finally, we arrive at the q component, which indicates the number of previous error terms to incorporate into our prediction for the following period.  This is the Moving Average, or MA, component of the model.  Note, however, that this is completely unrelated to the moving average / simple moving average method covered earlier in this chapter.  

For a first order MA model, or an MA(1) model, we make a prediction for a period using the previous period’s error term, or “shock” term, multiplied by a coefficient:

y_t = m_1 ε_{t-1} + ε_t

A positive value for the m_1 coefficient implies a momentum effect, whereas a negative value for m_1 is associated with mean reversion.

For a second order MA model, or an MA(2) model, we would just extend this to incorporate information from the previous two shock terms:

y_t = m_1 ε_{t-1} + m_2 ε_{t-2} + ε_t

For q previous error terms, we can write this equation more generally as:

y_t = m_1 ε_{t-1} + m_2 ε_{t-2} + … + m_q ε_{t-q} + ε_t
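As with the AR sketch earlier, we can simulate an MA(2) series with statsmodels' ArmaProcess; the coefficients 0.6 and 0.2 below are arbitrary values chosen only for demonstration.

import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

# MA(2) example: y_t = ε_t + 0.6*ε_{t-1} + 0.2*ε_{t-2}
ar = np.array([1])             # no autoregressive terms
ma = np.array([1, 0.6, 0.2])   # MA lag polynomial: 1 + m_1*B + m_2*B^2
ma2_process = ArmaProcess(ar, ma)
simulated_ma2 = ma2_process.generate_sample(nsample=500)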

We can combine these elements to generate a single model that incorporates aspects of each.  Such a model is an ARIMA model, and its form is denoted by the three numbers that make up the p, d, and q values.  An ARIMA (2,1,2) model, for instance, uses two autoregressive lags, one degree of differencing, and two previous error terms in order to generate a prediction for the following period.
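To make this concrete, here is a minimal sketch of fitting an ARIMA (2,1,2) model with statsmodels; the 'sales' series below is randomly generated placeholder data, used only so the example runs on its own.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Placeholder data: a random-walk-style series, just to make the example runnable
np.random.seed(0)
sales = pd.Series(np.random.normal(loc=0, scale=5, size=200).cumsum() + 100)

model = ARIMA(sales, order=(2, 1, 2))   # p = 2, d = 1, q = 2
results = model.fit()
print(results.summary())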

Stationarity:  Who Cares?

Earlier in this chapter, stationarity was defined as the condition in which a time series’ mean, variance, and autocorrelation remain consistent throughout.  

Stationarity is a requirement for ARIMA modeling.  It ensures that we can use the same model to generate predictions across the entire time series.  If the mean, variance, and autocorrelation were changing throughout the series, we would need to continually update the model parameters for every single prediction, increasing the complexity to such a degree that the ARIMA process would not be reliable.

When working with data that is non-stationary due to the presence of a trend, we can often achieve stationarity through a process called differencing.  

To check for stationarity, we can use the Augmented Dickey-Fuller test.   

Differencing a Time Series

Let’s take another look at the New Coke data from earlier in this chapter. 

Visually, we can see a clear trend here – across the period, New Coke sales have steadily declined.  We can verify this by running the Augmented Dickey-Fuller test, as shown below.  The null hypothesis of the Augmented Dickey-Fuller test is that the time series is non-stationary, due to trend.  For our test, we will assume an alpha threshold of 0.05.  
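The chapter's original code is not reproduced here, but a minimal sketch of that test, assuming the weekly sales figures are stored in a pandas Series called newcoke, might look like this:

from statsmodels.tsa.stattools import adfuller

# 'newcoke' is assumed to be a pandas Series of weekly New Coke sales
adf_result = adfuller(newcoke)
print('Test statistic:', adf_result[0])
print('p-value:', adf_result[1])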

With a test-statistic of -2.79, and a p-value of 0.059, we will fail to reject the null hypothesis.

We can achieve stationarity through a transformation.  In this case, we'll convert the actual weekly sales figures into values representing proportional change.  We will achieve this by shifting the time series by one period and dividing the shifted values by the current ones for each observation, as shown below.
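A sketch of that transformation in pandas, following the description above and again assuming the original series is called newcoke, might be:

# Shift the series by one period and divide the shifted values by the current
# ones; the first observation becomes NaN and is dropped
newcoke2 = (newcoke.shift(1) / newcoke).dropna()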

As shown below, newcoke2 appears to be de-trended, albeit with a bit of a spike at the very end.

To really be sure, though, we should run the Augmented Dickey-Fuller test on this adjusted data.  

With this very low p-value, we can reject the null hypothesis and conclude that newcoke2 no longer suffers from trend-based non-stationarity.

Autocorrelation

A time series exhibits autocorrelation when its value at time t is closely related to its value at some other period, such as t-1.  

A contrasting situation to positive autocorrelation is mean reversion.  In a mean-reverting system, periods with unusually high values are likely to be followed by periods with lower ones, and vice versa.

A real life example of autocorrelation can be seen with American college football teams’ season performances.  When a team wins many games in a season, its ranking tends to go higher.  As the team climbs higher in the national rankings, it is more likely to be featured on nationally-televised games, and to earn a better placement in post-season bowl games.  This heightened exposure draws the interest of recruited high school players, who are then more likely to want to attend that school.  The addition of top-tier recruits to the team makes it likelier that the team will succeed in the next season, which contributes to better ranking, more exposure, and, once again, better recruiting.  In short, a great season makes it more probable that the next one will be great, too (and vice versa for a poor season).  

Professional teams’ season performances, on the other hand, more typically exhibit mean reversion.  Prior to the start of each season, these teams select new players through a system called the draft.  In the draft, teams with the worst records in the preceding year get the best picks, whereas those with the best records in the preceding year get the worst ones.  While this system does not guarantee mean reversion, it stands in stark contrast with the collegiate system, in which success or failure is more likely to be perpetuated into the future.  

A time series’ lag-1 autocorrelation is the correlation between its values at time t and its values at time t-1.  Its lag-2 autocorrelation is the correlation between its values at time t and its values at time t-2, and so on.  

White Noise 

In a white noise model, each value is randomly generated, and is uncorrelated with any previous values.  White noise data has a constant mean and variance.  

The code shown below will generate Gaussian white noise, or normally-distributed white noise.  The random.normal() function from numpy will generate the normally-distributed random values.  The loc argument specifies the values’ mean, the scale argument specifies the standard deviation, and the size argument indicates how many values to generate.  
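A minimal sketch of that step is shown below; the mean of 0, standard deviation of 1, and length of 500 are arbitrary choices for illustration.

import numpy as np

np.random.seed(42)   # arbitrary seed, used only for reproducibility
white_noise = np.random.normal(loc=0, scale=1, size=500)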

We will use the plot_acf() function from statsmodels to confirm that none of the autocorrelations extend beyond the shaded confidence band (drawn in blue by default); values that stay inside that band are not statistically distinguishable from zero.  This tells us that past observations will not be helpful for predicting future values.
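A sketch of that plot, using the white_noise array generated above:

from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

plot_acf(white_noise, lags=20)
plt.show()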

When assessing the plot above, bear in mind that the lag-0 autocorrelation will always be 1, since lag-0 is just the observation itself.  

White noise can be thought of as an ARIMA (0,0,0) model.  

Let’s break that down a bit.  For a white noise model:

  • p = 0, because each data point is totally unrelated to the other points; in other words, there are no “lags” that would help us to generate any particular prediction;
  • d = 0, because there is no trend in the data.  Therefore, no differencing is needed;
  • q = 0, as we cannot make any updates based on previous error terms, since the data points move about randomly.

At this point, you might be wondering, “What was the point of that?  If white noise is just this big messy pile of unrelated values, why take the time to learn about it?”  

The concept of white noise is important in assessment of forecasting models.  One of the criteria we can look for when assessing a forecasting model is whether the residuals are uncorrelated, with a constant mean and variance; in other words, we want the model’s residuals to be white noise.  If the model’s residuals contain discernible patterns, that would suggest that something meaningful in the data (such as a component of its trend or seasonality) was not captured by the model.  

A Random Walk

In a random walk, the predicted value for the next period equals this period’s value, plus some random error term.  This can be represented in an equation as follows:

y_t = y_{t-1} + ε_t

The change in value for a random walk is just white noise.  We can see that by rearranging the above equation as follows:

y_t - y_{t-1} = ε_t

To simulate a random walk in Python, the first step is identical to what we did with the white noise simulation.  In this model, however, the level of each observation depends on the previous observation, plus or minus some random noise.  For that reason, the level of the series at any particular point in time is found by taking the cumulative sum of all the random changes up to that point.

In the example below, a seed value was used.  If you re-run this code, over and over, but without a seed value, you will find that your results vary from iteration to iteration.  
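A minimal sketch of such a simulation appears below; the seed value and series length are arbitrary, so the resulting path will not match the chapter's figure exactly.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)   # arbitrary seed, used only for reproducibility

# Each step is white noise; the level of the walk is the cumulative sum of the steps
steps = np.random.normal(loc=0, scale=1, size=500)
random_walk = np.cumsum(steps)

plt.plot(random_walk)
plt.show()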

While a simulated random walk like the one above may drift upward or downward across the time period, the individual changes in level across the series are just white noise:

A slight variation on the random walk model is a random walk with drift.  In a random walk with drift, the prediction for the next period includes a constant drift term, μ:

                   y_t = μ + y_{t-1} + ε_t

For a random walk with drift, the change in values across the series is white noise, but with μ as its mean:

                    y_t - y_{t-1} = μ + ε_t

A random walk can be thought of as an ARIMA (0,1,0) model.  

  • p = 0, because once the series is differenced, the resulting changes are unrelated to one another; there are no “lags” of those changes that would help us to generate any particular prediction;
  • d = 1, because the value at any given point depends on the most recent known value, so one difference is needed to make the series stationary;
  • q = 0, as we cannot make any updates based on previous error terms, since the differenced values move about randomly.

Building an ARIMA Model

On mid-summer days, a street magician named Harry the Magnificent sometimes sets up shop along Lobster Claw Avenue, just outside of Lobster Land’s gates.  He performs for free, but he accepts cash gratuities, dropped into a top hat, by delighted fans.  

Harry is a bit unpredictable, but his performances tend to follow a general pattern.  First, Harry tends to start his shows quietly, by performing just a couple of short card tricks for a small crowd of onlookers.  As that small crowd lets out a few “ooohs” and “aaahs” in response to Harry’s tricks, others tend to stop and take notice.  From there, the crowd grows a bit more,  and the increasing volume of the “ooh” and “aah” crescendo attracts even more attention from curious passersby.  The crowd grows more and more, until Harry ends the show with a grand finale, and reminds fans to leave a few dollars in the top hat.

For a few minutes afterwards, Harry tends to take a quick break for basic necessities.  He sometimes uses this time to chat with fans, sign autographs, and pose for selfies, too.  

After Harry’s energy level is reset, he starts up another show, and the cycle repeats.  

To better understand the patterns with the crowd sizes around Harry, the summer interns with Lobster Land’s analytics team analyzed some old video footage from a single day in 1983, the year that Harry got started with his performances.  

Starting at 10:45 a.m., the interns diligently recorded the crowd sizes at five-minute intervals, all the way until 9:00 p.m. that evening.   Here are the numbers that they recorded:

At first, the crowd_count list just looks like a big, jumbled pile of integers.  We can add time stamps to it, though, by first generating values for the times, before joining the times and the crowd counts together as a series. 
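A sketch of that step is shown below.  It assumes the observations live in a list called crowd_count, and it uses an arbitrary placeholder date to stand in for the unspecified day in 1983.

import pandas as pd

# Timestamps every five minutes, starting at 10:45 a.m. on a placeholder date
times = pd.date_range(start='1983-07-01 10:45', periods=len(crowd_count), freq='5min')
crowds = pd.Series(crowd_count, index=times)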

With the data now in this format, perhaps we can plot it to gain a visual sense of the patterns here:

Harry’s crowd data exhibits clear autocorrelation, with this irregular series of peaks and valleys.  As Harry gets a show started, some of the passersby on Lobster Claw Avenue stop to check it out.  This draws more onlookers, who are curious to learn about what’s going on.  The total crowd size continues to grow until Harry takes his break.  It remains subdued until Harry starts up another show, at which point the cycle repeats.  

Using the data from this single day in 1983, we will see whether we can predict crowd sizes using an ARIMA model.  

Determining the Model Order 

Choosing the model terms for an ARIMA model is part art, part science.  Sometimes, it involves judgment calls on the part of the modeler.  All else equal, the principle of parsimony should be kept in mind (go with a simpler model, rather than a more complicated one).  Also, when stuck between model options, iteration is a viable way to go forward – just try out some variations, and then compare them using a metric such as AIC.  

Since our crowd data for Harry the Magnificent does not contain a trend, we will not include a differencing term.

To gain a sense of the best AR term, an effective tool for us to use is the partial autocorrelation plot.  The partial autocorrelation between a measurement in the time series and some particular lag tells us about the direct relationship between that measurement and the lag, absent the effects of any interim lags.  

As a general rule, we will set the AR order to the number of consecutive significant lags at the start of the PACF plot, shown below.
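A sketch of that plot, assuming the series built above is called crowds:

from statsmodels.graphics.tsaplots import plot_pacf
import matplotlib.pyplot as plt

plot_pacf(crowds, lags=20)
plt.show()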

Based on the PACF plot above, it looks like AR(2) is the way to go here.  We will use the lags from the previous period, and from the period before that one, to help us predict the next value in the series. 

To gain a sense of the optimal MA term, we will use the autocorrelation plot.  Again using that same approach, we will look for the number of consecutive significant lags.  Based on the plot, it looks like we could choose either MA(1) or MA(2).  
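The corresponding ACF plot can be generated the same way:

from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

plot_acf(crowds, lags=20)
plt.show()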

To compare some different options, we can loop through several model versions, using the code shown below.  
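The exact grid searched in the chapter is not reproduced here; a sketch of one way to run such a comparison, looping over a handful of candidate (p, q) combinations and printing each model's AIC, follows.

from statsmodels.tsa.arima.model import ARIMA

# 'crowds' is the series built earlier; d is fixed at 0 since no trend is present
for p in range(1, 3):
    for q in range(0, 3):
        try:
            results = ARIMA(crowds, order=(p, 0, q)).fit()
            print(f'ARIMA({p},0,{q})  AIC: {results.aic:.2f}')
        except Exception:
            print(f'ARIMA({p},0,{q})  failed to fit')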

Since the ARIMA(2,0,2) model has a slightly lower AIC value than the (2,0,1) model, we’ll go with ARIMA(2,0,2) here.

With our model now in place, we can generate in-sample predictions.  We will use the line below to generate predictions for the last 50 timestamps in the series, and to save those forecasted means:
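A sketch of that step, refitting the chosen ARIMA(2,0,2) model and then saving the predicted means for the final 50 timestamps:

from statsmodels.tsa.arima.model import ARIMA

results = ARIMA(crowds, order=(2, 0, 2)).fit()
forecast_means = results.get_prediction(start=len(crowds) - 50).predicted_mean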

We can plot those means, as shown below. 

However, the plotted mean values are hard to interpret without the actual outcome values as a benchmark for comparison.  To enable this comparison, we will first create a series that pairs the predicted values with their associated timestamps:
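One way to build that series, assuming forecast_means holds the 50 predicted means from the previous step:

import pandas as pd

# Pair the 50 predicted means with the final 50 timestamps from the crowds series
predictions = pd.Series(forecast_means.values, index=crowds.index[-50:])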

Next, we will plot the entire ‘crowds’ series along with the ‘predictions’ series.  The contrast between the red and blue lines shows that our ARIMA model mostly got it right, albeit with a few dips that extended below the actual “valleys” in the crowds data, and a few peaks that were just one or two periods off from the true peaks in the data.
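For reference, a sketch of that comparison plot is shown below, with the colors chosen to match the description above (blue for the actual crowd counts, red for the ARIMA predictions).

import matplotlib.pyplot as plt

plt.plot(crowds, color='blue', label='actual crowd counts')
plt.plot(predictions, color='red', label='ARIMA(2,0,2) predictions')
plt.legend()
plt.show()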