1.23 Imputing Missing Values


At times, we may wish to replace missing values with a substitute, rather than leave them alone or remove them from the dataset.  

One approach for missing value imputation involves replacing NaNs with a central measure, such as a median or a mean.  This gives the data modeler many more complete observations than would otherwise be available, allowing certain types of visualizations to render, or certain types of models to be built.  

A drawback associated with such an approach is that it tends to reduce the dispersion of the variable treated in this way: adding more data points at the center produces a variable with a lower standard deviation and variance than the original, untreated version.  

In the example shown below, we first determine the median ‘PRECIP’ value to be 0, with the help of the describe() function.  After determining that the median is 0, we can use pandas’ fillna() function to replace the missing ‘PRECIP’ values with 0.
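The original figure for this step is not reproduced here, so the following is a minimal sketch with a hypothetical ‘PRECIP’ column standing in for the actual dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical weather data; the real 'PRECIP' column is not shown in this text
df = pd.DataFrame({"PRECIP": [0.0, 0.0, 1.2, np.nan, 0.0, np.nan, 0.3]})

# The '50%' row of describe() reports the median -- 0.0 for this sample
print(df["PRECIP"].describe())

# Replace the missing 'PRECIP' values with the median, 0
df["PRECIP"] = df["PRECIP"].fillna(0)
```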

A slightly more sophisticated way to impute a central value is to use some other “clue” from within the data (such as a known categorical value for a record), and then replace an NaN with an average for that category, rather than for the entire dataset.

In this example, we’ll take a look at how to do that for the missing ‘GoldZoneRev’ values from the 2020 dataset.  For each of the six Gold Zone values that we are missing, we do know the day of the week.  

Let’s check to see if ‘WEEKDAY’ offers us any hints about what to expect for ‘GoldZoneRev’.  

After grouping the data by ‘WEEKDAY’, and then generating this table of mean values, we can see a pattern – ‘GoldZoneRev’ tends to spike considerably higher on Fridays, Saturdays, and Sundays, compared with the other weekdays.
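The grouped table itself did not survive extraction, so here is a sketch of how it could be generated, using a small made-up stand-in for the 2020 dataset (the column names ‘WEEKDAY’ and ‘GoldZoneRev’ follow the text; the values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 2020 dataset
df = pd.DataFrame({
    "WEEKDAY": ["Mon", "Tue", "Fri", "Sat", "Sun", "Mon", "Sat"],
    "GoldZoneRev": [120.0, 110.0, 480.0, 510.0, np.nan, np.nan, 495.0],
})

# Mean 'GoldZoneRev' for each day of the week (NaNs are excluded by default)
print(df.groupby("WEEKDAY")["GoldZoneRev"].mean())
```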

Knowing this, we can write a separate function to replace missing values for ‘GoldZoneRev’ with the known average for the day of the week.
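The function from the original example is not shown in this text; a sketch of the idea, assuming a small hypothetical DataFrame, might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names follow the text, values are illustrative
df = pd.DataFrame({
    "WEEKDAY": ["Mon", "Tue", "Fri", "Sat", "Sat", "Mon"],
    "GoldZoneRev": [120.0, 110.0, 480.0, 510.0, np.nan, np.nan],
})

# Per-weekday means, computed from the known (non-missing) values
weekday_means = df.groupby("WEEKDAY")["GoldZoneRev"].mean()

def fill_with_weekday_mean(row):
    """Return the weekday's average revenue when 'GoldZoneRev' is missing."""
    if pd.isna(row["GoldZoneRev"]):
        return weekday_means[row["WEEKDAY"]]
    return row["GoldZoneRev"]

df["GoldZoneRev"] = df.apply(fill_with_weekday_mean, axis=1)
```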

Alternatively, we could accomplish the same result using a lambda function, as shown below.
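The lambda version from the original is likewise not reproduced here; one common way to express it, again with a hypothetical stand-in DataFrame, is a `groupby`/`transform` one-liner:

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names follow the text, values are illustrative
df = pd.DataFrame({
    "WEEKDAY": ["Mon", "Tue", "Fri", "Sat", "Sat", "Mon"],
    "GoldZoneRev": [120.0, 110.0, 480.0, 510.0, np.nan, np.nan],
})

# For each weekday group, fill that group's NaNs with the group's own mean
df["GoldZoneRev"] = df.groupby("WEEKDAY")["GoldZoneRev"].transform(
    lambda s: s.fillna(s.mean())
)
```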

Another approach, LOCF, stands for “Last Observation Carried Forward.”  This is often used with seasonal data, such as temperature.  To see why, think about the weather in Boston: if we were missing a single day’s high temperature in July, it seems more reasonable to use the most recent known value than something like the year-long average, which incorporates values from every other season, too.
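In pandas, LOCF corresponds to a forward fill.  A minimal sketch, with made-up July temperatures:

```python
import numpy as np
import pandas as pd

# Hypothetical July high temperatures with one missing day
temps = pd.Series([88.0, 91.0, np.nan, 87.0],
                  index=pd.date_range("2020-07-01", periods=4))

# Carry the last observation forward into the gap
temps = temps.ffill()
```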

We might also use LOCF in other circumstances.  Imagine that we are working with health care data, collected as part of a decades-long, longitudinal study.  Such data is not easy to obtain, nor is it easy to replace!  What if we are missing a single weight value, or a single blood pressure reading, from a patient at one of the periodic check-ups? Rather than throw away the entire observation, we might instead substitute that missing value with the most recent known value for that particular patient. This could be achieved with the approach shown below.
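The original example is not reproduced here; a sketch of the idea, with hypothetical column names and values, applies the forward fill within each patient’s records so that one patient’s value is never carried into another’s:

```python
import numpy as np
import pandas as pd

# Hypothetical longitudinal check-up records (sort by visit date first
# in real data, so "forward" means chronologically forward)
visits = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "weight":     [180.0, np.nan, 182.0, 150.0, np.nan],
})

# Carry each patient's most recent known weight forward -- never across patients
visits["weight"] = visits.groupby("patient_id")["weight"].ffill()
```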

Yet another method for imputing missing values involves the creation of a model, using an observation’s known values as inputs, to predict the likely ‘true’ value for an NaN.  
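As a sketch of this model-based approach, assume a hypothetical dataset in which ‘revenue’ can be predicted from ‘ad_spend’; we fit a simple linear model on the complete cases and use it to fill the gap:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'ad_spend' is known for every row; 'revenue' has a gap
df = pd.DataFrame({
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "revenue":  [10.0, 20.0, 30.0, np.nan, 50.0],
})

# Fit a simple linear model using only the complete cases
known = df.dropna()
slope, intercept = np.polyfit(known["ad_spend"], known["revenue"], deg=1)

# Predict the likely 'true' value for each NaN from that row's known inputs
mask = df["revenue"].isna()
df.loc[mask, "revenue"] = slope * df.loc[mask, "ad_spend"] + intercept
```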

When it comes to variable imputation, there simply is not a ‘one-size-fits-all’ approach; often, the modeler’s discretion is required.  However, there are two ironclad imputation rules that we can put forth here:

  • Rule #1: Always be transparent.  Remember that imputation involves changing the original dataset in some way.  When you do this, just be sure to make it completely clear to your audience that you have altered the data;
  • Rule #2: Never impute the response variable for your model.  The response variable, a.k.a. target variable, or dependent variable, is the outcome that your model aims to predict.  Imputing it would mean building and evaluating the model against outcomes we invented ourselves, rather than against real data.  

Variable Imputation

| Method | Generally Used When… | Keep in Mind… |
| --- | --- | --- |
| Replacement with a central value | You want to replace NaNs with a value, but do not see any other ‘clues’ within the data regarding the means of replacement | This method will reduce the variability/dispersion of the imputed variable |
| Replacement with 0 | Among the known values for this variable, 0 is very common (as with ‘PRECIP’ in the example above) | This is not a wise choice for a numeric variable that contains many non-zero values; inserting zero values into the column will warp the variable’s summary statistics |
| Last Observation Carried Forward | You have a very strong reason to believe that one observation is likely to be similar to nearby observations in the dataset for that variable | This will not work well with ‘mean-reverting’ data (e.g. stock market returns) |
| Imputation based on a known categorical value | The values of a known categorical variable bear a strong relationship with the numeric value that you wish to impute | This should only be used when there is a clear relationship between a categorical value and the known values for the variable being imputed |
| Imputation based on a model that predicts the likely value for the NaN, using known values from that observation as inputs | Based on the complete cases in the dataset, you can see that the values for some variable can be predicted by a combination of other variables | This method is only appropriate when you have evidence that the dependent and independent variables are related |