
1.22 Removing Missing Values: The Sledgehammer and the Scalpel

When dealing with missing values, one option is simply to remove every row that contains any missing value. In pandas, this can be achieved by calling the dropna() method on the entire dataframe. The upside of this approach is that it leaves you with only the “complete cases”: the rows for which every variable has a known value. The downside is that it can be the equivalent of taking a sledgehammer to your dataset. If NaN values are spread across many of your rows and columns, you could be left with a tiny fraction of your original observations.

With lobsterland’s 2020 data, the original dataset has 106 rows and 16 columns, as indicated below from the dataframe’s shape attribute.
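As a minimal sketch of that check, the snippet below builds a synthetic stand-in with the same dimensions and reads its shape attribute. The values are random and the var_* column names are placeholders; only GoldZoneRev and PRECIP are names taken from the text, and the real dataset is not reproduced here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 2020 data: 106 rows x 16 columns of random numbers.
# Only GoldZoneRev and PRECIP are column names from the text; var_* are placeholders.
rng = np.random.default_rng(0)
columns = ["GoldZoneRev", "PRECIP"] + [f"var_{i}" for i in range(14)]
df = pd.DataFrame(rng.random((106, 16)), columns=columns)

# shape is an attribute (no parentheses) holding a (rows, columns) tuple.
print(df.shape)  # (106, 16)
```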

After assessing the column-by-column missingness of the data, we can see that there are six missing values for GoldZoneRev (total daily revenue from Gold Zone receipts), and four missing values for PRECIP (total rainfall).
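One common way to get a column-by-column count of missing values is to chain isna() with sum(). The sketch below assumes a synthetic dataframe in which six GoldZoneRev values and four PRECIP values have been set to NaN at arbitrary positions, to mirror the counts described above; the positions and the var_* columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 106 rows x 16 columns; var_* names are placeholders.
rng = np.random.default_rng(0)
columns = ["GoldZoneRev", "PRECIP"] + [f"var_{i}" for i in range(14)]
df = pd.DataFrame(rng.random((106, 16)), columns=columns)

# Inject NaNs at arbitrary (assumed) positions to mirror the counts in the text.
# .loc slicing by label is inclusive, so 0:5 covers six rows and 10:13 covers four.
df.loc[0:5, "GoldZoneRev"] = np.nan
df.loc[10:13, "PRECIP"] = np.nan

# isna() marks each missing cell as True; sum() then counts the Trues per column.
missing = df.isna().sum()
print(missing[missing > 0])  # show only the columns that have missing values
```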

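By calling the dropna() method with no arguments, as shown below, we are instructing pandas to drop all rows that contain any NaN values. Doing so here removes 10 rows, or approximately 9.4 percent of our data. Because six plus four equals the number of rows removed, we also know that the missing GoldZoneRev and PRECIP values fall in different rows.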
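The sledgehammer approach can be sketched as follows, again on a synthetic stand-in. The NaN positions are assumed and chosen so that the two columns' missing values do not share rows, which matches the 10-row loss described above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 106 rows x 16 columns; var_* names are placeholders.
rng = np.random.default_rng(0)
columns = ["GoldZoneRev", "PRECIP"] + [f"var_{i}" for i in range(14)]
df = pd.DataFrame(rng.random((106, 16)), columns=columns)
df.loc[0:5, "GoldZoneRev"] = np.nan   # six missing values (assumed positions)
df.loc[10:13, "PRECIP"] = np.nan      # four missing values (assumed positions)

# With no arguments, dropna() drops every row containing at least one NaN.
# It returns a new dataframe; df itself is left unchanged.
complete_cases = df.dropna()

rows_removed = len(df) - len(complete_cases)
pct_removed = round(rows_removed / len(df) * 100, 1)
print(rows_removed)  # 10
print(pct_removed)   # 9.4
```

Assigning the result to a new name, rather than overwriting df, keeps the original rows available in case the scalpel approach turns out to be the better fit.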

An alternative, more surgical approach is to remove only the rows for which there are NaN values in particular columns of interest. This is more akin to using a scalpel, or another fine instrument, to carefully remove NaNs from certain places (perhaps the outcome variable for a model that you wish to create) while leaving the rest of the original dataset intact.

Using such an approach here, as shown below, leaves us with 102 rows, as we are only removing the four that contain NaN values for PRECIP.
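In pandas, one way to take the scalpel approach is to pass the columns of interest to the subset parameter of dropna(). The sketch below uses the same kind of synthetic stand-in as before (assumed NaN positions, placeholder var_* columns), so it illustrates the technique rather than reproducing the original code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 106 rows x 16 columns; var_* names are placeholders.
rng = np.random.default_rng(0)
columns = ["GoldZoneRev", "PRECIP"] + [f"var_{i}" for i in range(14)]
df = pd.DataFrame(rng.random((106, 16)), columns=columns)
df.loc[0:5, "GoldZoneRev"] = np.nan   # six missing values (assumed positions)
df.loc[10:13, "PRECIP"] = np.nan      # four missing values (assumed positions)

# subset limits the NaN check to the listed columns, so only rows with a
# missing PRECIP value are dropped.
precip_known = df.dropna(subset=["PRECIP"])

print(precip_known.shape)                        # (102, 16)
print(precip_known["GoldZoneRev"].isna().sum())  # 6: these rows are kept
```

The six rows with a missing GoldZoneRev value survive, which is the point of the scalpel: rows are lost only where the column you care about is unknown.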