
1.21 Missing Values: Overview


When performing EDA on a dataset, another key step is assessing how much of the data is missing. In other words, how many actual values does the dataset contain, compared with the number of missing ones? Are the missing values spread evenly across the dataset, or are they concentrated in a small subset of its variables?
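These questions can be answered with a few lines of code. The following is a minimal sketch that assumes the data is held in a pandas DataFrame; the toy dataset and its column names are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names and values are made up.
df = pd.DataFrame({
    "title": ["Dune", "Emma", "Ulysses", "Beloved"],
    "pages": [412, np.nan, 730, 324],
    "date_returned": ["2024-01-05", None, None, None],
})

# Total number of cells (rows times columns).
total_cells = df.size

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Number of missing values in the whole dataset.
total_missing = int(missing_per_column.sum())

# Fraction of each column that is missing.
missing_share = df.isna().mean()

print(f"{total_missing} of {total_cells} cells are missing")
print(missing_per_column)
print(missing_share)
```

Comparing the per-column counts shows whether the missing values are spread evenly or concentrated in a few variables. In this toy data, most of them sit in a single column.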

Values can be missing from datasets for myriad reasons.  These reasons could include, but are not limited to:

  • Instrument malfunction
  • Human error
  • Pending actions, where a value does not exist yet (for example, a library tracking book circulation has a ‘Date Returned’ column that stays empty while a book is checked out)

Survey data might include missing values for questions that respondents did not wish to answer, were unable to answer (perhaps because of confusing wording), or skipped because the questionnaire was too long.

In Python, and in pandas in particular, we will usually see missing values represented by NaN, which stands for “Not a Number.”
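NaN is a special floating-point value, and it behaves differently from ordinary numbers. The short sketch below uses only the Python standard library to show two properties worth knowing; the variable names are illustrative.

```python
import math

# Create a NaN value using plain Python.
nan = float("nan")

# NaN is not equal to anything, including itself,
# so an equality check cannot be used to find it.
print(nan == nan)

# Use a dedicated check instead.
print(math.isnan(nan))

# Count the missing entries in a small list of measurements.
values = [3.5, nan, 7.0]
n_missing = sum(math.isnan(v) for v in values)
print(n_missing)
```

Because NaN never compares equal to itself, missing values should be located with a dedicated function such as `math.isnan`, not with `==`.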