1.24 Impossible Values & Inconsistent Values

There is no automated function that can indicate the presence of an impossible value with 100 percent reliability. To spot an impossible value, an analyst simply needs to understand the dataset and its variables.

Let’s suppose that you are looking at a dataset from a major retailer, in which one column shows consumers’ annual spending totals. At first, it might be tempting to assume that a negative value in the ‘annual spending’ column is impossible. However, it may be the case that negative values can result from credit card chargebacks, or from refunds related to purchases from a previous period.

At times, no dataset description will be available. In that case, deciding whether a value is impossible may simply be a judgment call on the part of the analyst.

In the example below, after bringing in the week_vacation dataset and calling the describe() function, we spot something unusual in the ‘householdpax’ column: at least one row has a value of -1. Since this variable tells us the number of people within a particular household, we can use our judgment here to say that something must be wrong.
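The original output is not reproduced here, so the following is a minimal sketch of the describe() step. It uses a small made-up DataFrame as a stand-in for week_vacation; the values are invented for illustration and are not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the week_vacation dataset (values are invented)
week_vacation = pd.DataFrame({
    "householdpax": [2, 4, -1, 3, 0, 1, 2, 5, 0, 2],
})

# describe() reports count, mean, std, min, quartiles, and max
# for each numeric column
summary = week_vacation.describe()
print(summary)

# The 'min' row is where a negative household size becomes visible
min_pax = summary.loc["min", "householdpax"]
print(min_pax)
```

Scanning the min and max rows of the summary is a quick first check for values that fall outside the range a variable can plausibly take.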

How big of a problem is this? Since ‘householdpax’ is a discrete variable, we can use value_counts() here to assess the issue:

This table reveals two types of impossibilities here! In addition to having four households with -1 residents, we also have 69 households with 0 residents.
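The value_counts() check described above can be sketched as follows. The DataFrame is again a made-up stand-in for week_vacation, so its counts will not match the four and 69 reported for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the week_vacation dataset (values are invented)
week_vacation = pd.DataFrame({
    "householdpax": [2, 4, -1, 3, 0, 1, 2, 5, 0, 2],
})

# Count how often each household size occurs; sort_index() orders the
# table by household size rather than by frequency
pax_counts = week_vacation["householdpax"].value_counts().sort_index()
print(pax_counts)

# Total number of rows with fewer than one person in the household
n_impossible = (week_vacation["householdpax"] < 1).sum()
print(n_impossible)
```

Sorting by index places the suspicious values (-1 and 0) at the top of the table, where they are easy to spot.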

What should we do after encountering such a value? Again, this decision needs to involve the modeler’s judgment.

It might be reasonable here to assume that the ‘-1’ values resulted from a data entry error, and that the true values for those households should be 1. However, the presence of the ‘0’ values as well suggests that some other problem might be the root cause. Since these values make up such a tiny percentage of the overall data, we can simply remove them.
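One way to carry out the removal is with a boolean mask, sketched below on a made-up stand-in for week_vacation. In this toy data the impossible rows are a much larger share than in the real dataset, and the threshold of 1 is an assumption based on the reasoning above that a household must contain at least one person.

```python
import pandas as pd

# Hypothetical stand-in for the week_vacation dataset (values are invented)
week_vacation = pd.DataFrame({
    "householdpax": [2, 4, -1, 3, 0, 1, 2, 5, 0, 2],
})

# Flag rows whose household size is below 1 (assumed to be impossible)
impossible = week_vacation["householdpax"] < 1

# Check what fraction of the data would be lost before dropping anything
share_impossible = impossible.mean()
print(share_impossible)

# Keep only the plausible rows; copy() gives an independent DataFrame
week_vacation_clean = week_vacation.loc[~impossible].copy()
print(week_vacation_clean["householdpax"].min())
```

Computing the share first makes the "tiny percentage" judgment explicit before any rows are discarded.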

Inconsistent values appear within datasets when the same value (whether numeric or categorical) is represented in different ways.

Inconsistencies might arise when a dataset is merged from other existing datasets. Among a team of librarians, perhaps one characterizes a certain type of fee as ‘LATE’, another labels it ‘late’, and a third labels it ‘not on time’.

A peek at the ‘cruise_returners’ dataset shows us that ‘cruise_theme’ is one of the dataset’s variables. If we run the nunique() function to see the number of distinct categories, we might reasonably come to believe that there are five different cruise themes in this data.
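As a minimal sketch of that step, the snippet below runs nunique() on a made-up stand-in for cruise_returners. The theme labels are invented for illustration; the real dataset's labels are not shown in this section.

```python
import pandas as pd

# Hypothetical stand-in for cruise_returners (theme labels are invented)
cruise_returners = pd.DataFrame({
    "cruise_theme": [
        "Music", "music", "Food & Wine", "Food and Wine",
        "Adventure", "Music", "Adventure",
    ],
})

# nunique() counts distinct labels exactly as they are spelled
n_themes = cruise_returners["cruise_theme"].nunique()
print(n_themes)
```

Because nunique() compares labels character by character, two spellings of the same theme count as two separate categories.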

A closer look at the cruise themes reveals, however, that there are not really five separate themes here, but three. Two of those three appear inconsistently in the dataset. We can fix this with a Python dictionary, which is a set of key-value pairs. The solution here is parallel to the one shown earlier in this chapter for collapsing the levels of a factor.
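The dictionary-based fix can be sketched as follows, again on a made-up stand-in for cruise_returners. The theme labels and the name theme_fixes are assumptions for illustration, and replace() is one reasonable choice here rather than necessarily the method used in the original example.

```python
import pandas as pd

# Hypothetical stand-in for cruise_returners (theme labels are invented)
cruise_returners = pd.DataFrame({
    "cruise_theme": [
        "Music", "music", "Food & Wine", "Food and Wine",
        "Adventure", "Music", "Adventure",
    ],
})

# Listing the distinct labels exposes the inconsistent spellings
print(cruise_returners["cruise_theme"].unique())

# Keys are the inconsistent spellings; values are the labels to keep
theme_fixes = {
    "music": "Music",
    "Food and Wine": "Food & Wine",
}

# replace() swaps only the labels named in the dictionary
# and leaves every other label unchanged
cruise_returners["cruise_theme"] = cruise_returners["cruise_theme"].replace(theme_fixes)

n_themes_fixed = cruise_returners["cruise_theme"].nunique()
print(n_themes_fixed)
```

Using replace() means the dictionary only has to list the labels that need fixing. With map(), any label missing from the dictionary would become NaN, so every category would have to be included.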