8.8 A word of caution regarding extrapolation


Logistic regression is relatively easy to implement. The 0s and 1s it predicts for particular records are easy to interpret, but therein lies a pitfall: we may miss warning signs that our predictions are unreliable due to extrapolation. Extrapolation occurs when a modeler attempts to predict a response value using independent variable values beyond the range of the data used to construct the model. This is dangerous because the model only “knows” the relationship between the inputs and the outcome for the data it sees during model fitting. The nature of that relationship may change at other ranges of values, so any model prediction based on extrapolated inputs would not be reliable.

In general, extrapolation is much easier to catch when performing linear regression, which returns a numerical outcome. For instance, if we used a linear regression model to predict Lobster Land’s revenue based on visitor numbers, and our model predicted an all-time high of $100 million in a single day, that astronomical figure would clearly make us sit up and take notice due to its improbability. Digging into the data a bit, we would find that some wildly out-of-range input values could have caused the model to deliver a predicted y-value that is orders of magnitude bigger than any true outcome for that variable.

We would not be able to spot problems caused by extrapolation so easily in a logistic regression model. Since the model only delivers 0s and 1s as categorical outcomes, and since the associated probabilities are range-bound as well, nothing would “scream out” to the modeler to indicate that an inappropriate input value had been used in a prediction. When making predictions, we should be mindful of the ranges of the variables used to build the model. This is less likely to be an issue in a scenario such as the one shown here – in which the training set and test set were partitioned from the same original, larger dataset – than in a scenario in which some new, outside data was brought in to be checked against an existing model.
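One simple safeguard is to compare each new record against the minimum and maximum each predictor took in the training data. The sketch below illustrates this idea with pandas; the column names and values are hypothetical, not taken from the Lobster Land dataset.

```python
import pandas as pd

def flag_extrapolation(train: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Return a boolean mask marking values in `new` that fall outside
    the min/max range each column had in the training data."""
    lo, hi = train.min(), train.max()
    return (new < lo) | (new > hi)

# Hypothetical predictors from past park days
train = pd.DataFrame({"visitors": [500, 800, 1200], "temp_f": [60, 75, 90]})
new = pd.DataFrame({"visitors": [900, 5000], "temp_f": [70, 95]})

mask = flag_extrapolation(train, new)
# A row with any True contains an out-of-range input, so a prediction
# for that row would rest on extrapolation.
print(mask.any(axis=1).tolist())  # → [False, True]
```

A prediction for the flagged row is not necessarily wrong, but it deserves extra scrutiny before anyone acts on it.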

Model accuracy can be improved by:

  • Avoiding the ‘kitchen sink’ approach – Throwing every predictor variable in your dataset into the model introduces ‘noise’, which weakens the model’s predictive strength and obscures the meaning of the other features’ coefficient values.
  • Eliminating predictor variables that are highly correlated with one another – keeping them makes it harder to know the true effect of individual predictors on the outcome.
  • Keeping the model as simple as possible – If we hypothesize that the number of visits depends on whether a family has their own car, we could add an interaction term to the model to account for this condition. However, if this more complex model performs roughly as well as a model without the interaction term when tested against the validation data, then the simpler model is generally preferred: it is easier to interpret, more computationally efficient, and less prone to overfitting.
  • Checking the dataset for class imbalance – If your dataset contains many more customers who churn than customers who renew, your model may be biased toward predicting churn; oversampling the minority class (or another rebalancing technique) can help correct this.
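The last point can be sketched with simple resampling in pandas. The churn data below is hypothetical; the idea is just to draw minority-class rows with replacement until the two classes are balanced.

```python
import pandas as pd

# Hypothetical customer records: churn = 1 means the customer left
df = pd.DataFrame({
    "monthly_visits": [1, 2, 9, 8, 7, 6, 10, 5],
    "churn":          [0, 0, 1, 1, 1, 1, 1, 1],
})

majority = df[df["churn"] == 1]   # 6 churners
minority = df[df["churn"] == 0]   # 2 renewers

# Oversample the minority class (with replacement) to match the
# majority class size before fitting the model.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced["churn"].value_counts().to_dict())  # → {1: 6, 0: 6}
```

Oversampling should be applied only to the training partition; the test set is left untouched so that accuracy estimates reflect the real class mix.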