1.4 Data Types: Numeric and Categorical

An initial step in exploring a dataset is to assess the types of its variables. While more specific sub-categorizations can be made, we can start by looking at which variables are numeric, and which are categorical.

Answering the question “Is this variable numeric or categorical?” is not always as easy as it first seems.

To determine whether a variable is numeric, we will use a two-part test:

(1) Are the variable’s values represented by numbers, or their equivalents?³; and

(2) Do those numbers have meaningful mathematical properties?

For a variable to be properly identified as “numeric,” it must satisfy both of the conditions above.

For data that is represented with text-based descriptions, like the ‘day_type’ column in lobsterland_2021, it is easy to see right away that the variable is non-numeric.

However, what about ‘Spec_Event’? Is it numeric or categorical? Let’s apply the two-part test mentioned above.

If we call the info() method on the data frame, we can see each variable’s data type:

The screenshot above indicates that Python sees ‘Spec_Event’ as an integer.⁴
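As a minimal sketch of this check, the snippet below builds a tiny stand-in for lobsterland_2021 and inspects its column types. The data frame name `lobsterland_sample` and all of the values in it are invented for illustration; only the column names ‘day_type’ and ‘Spec_Event’ come from the dataset discussed here.

```python
import pandas as pd

# Hypothetical stand-in for lobsterland_2021; the values are invented.
lobsterland_sample = pd.DataFrame(
    {
        "day_type": ["weekday", "weekend", "weekend", "weekday"],
        "Spec_Event": pd.Series([2, 4, 6, 2], dtype="int64"),
    }
)

# info() prints a summary that includes each column's data type.
lobsterland_sample.info()

# dtypes returns the same type information as a Series,
# which is easier to work with in code.
print(lobsterland_sample.dtypes)

# True here only means the column is *stored* as numbers (part 1 of the test).
spec_event_is_stored_as_number = pd.api.types.is_numeric_dtype(
    lobsterland_sample["Spec_Event"]
)
print(spec_event_is_stored_as_number)
```

Note that this only tells us how the values are stored. Whether those numbers carry real numeric meaning still has to be judged from the dataset description.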

From the dataset description, though, we can see that the numbers associated with that variable do not represent quantities; instead, they are “stand-in” values that indicate whether things like comedy shows, country music shows, poetry readings, and other types of events occurred on some particular date.

‘Spec_Event’ is not, therefore, a numeric variable. Yes, it is represented by the values 1, 2, 3, 4, 5, and 6, so it passes part 1 of the two-part test shown above. However, these values do not have real numeric properties. We cannot add a live comedy show (2) to a rock music show (4) to get a children’s-themed show (6). Similarly, we cannot say that the square root of a rock music show is a live comedy show. Since it fails to meet the criterion for part 2 of the test, it must be categorical.
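The sketch below illustrates this reasoning under some assumptions: the series values are invented, and the only codes used are 2, 4, and 6, as named above. It shows that pandas will compute arithmetic on the codes without complaint, and one way to record our decision that the codes are labels, by converting the column to the pandas `category` dtype.

```python
import pandas as pd

# Invented values; only the codes 2, 4, and 6 are named in the text.
spec_event = pd.Series([2, 4, 6, 2, 4, 2], name="Spec_Event", dtype="int64")

# Part 1 of the test: the values are stored as numbers.
passes_part_1 = pd.api.types.is_numeric_dtype(spec_event)

# pandas will compute a mean, but an "average event type" has no meaning.
misleading_mean = spec_event.mean()

# Part 2 is a judgment based on the dataset description; pandas cannot
# decide it for us. Once we decide the codes are labels, we can say so:
spec_event_cat = spec_event.astype("category")

# Counting how often each category occurs is a summary that does make sense.
event_counts = spec_event_cat.value_counts()
print(event_counts)
```

After the conversion, the column is treated as a set of distinct labels rather than as quantities.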

The use of numbers to represent categories is common across datasets. This can happen with the days of the week (represented as 1 through 7), the months of the year (represented as 1 through 12), transaction ID values, ZIP codes, and countless other situations in which numbers are simply chosen to represent various groups.

You may wonder, “Who cares? Why does it matter how we view the ‘Spec_Event’ variable? Whether we label it ‘numeric’ or ‘categorical’ the data itself is still the same, right?”

If we do not pay attention to the variables’ type, we can get very misleading results from the models that we build with our data.

For instance, if we used ‘Spec_Event’ in a linear regression model without first converting it to a categorical variable, we would be implying that the event types exist along a continuum, with a children’s-themed show (6) being “worth” three times as much as a live comedy show (2). For something distance-based, like clustering or k-nearest neighbors, using ‘Spec_Event’ as a numeric input would create a similar problem – it would imply that events whose numeric values are farther apart are inherently more different from one another than events whose numeric values are closer together.
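To make the distance problem concrete, here is a small sketch that compares distances between event types using the raw codes and using indicator (“dummy”) columns created with `pd.get_dummies`. One-hot encoding is one common way to convert such a variable, not necessarily the approach used later in this book; the helper `euclidean` and the three-row series are made up for illustration.

```python
import math
import pandas as pd

# Codes named in the text: 2 = live comedy, 4 = rock music,
# 6 = children's-themed show.
spec_event = pd.Series([2, 4, 6], name="Spec_Event", dtype="int64")

# One indicator column per category, so no category is "larger" than another.
dummies = pd.get_dummies(spec_event, prefix="Spec_Event", dtype=int)
print(dummies)


def euclidean(a, b):
    """Straight-line distance between two equal-length lists of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# With raw codes, comedy looks twice as far from a children's show
# as it does from a rock show.
raw_comedy_vs_rock = abs(2 - 4)
raw_comedy_vs_kids = abs(2 - 6)

# With indicator columns, every pair of distinct event types is equally far apart.
rows = dummies.to_numpy().tolist()
onehot_comedy_vs_rock = euclidean(rows[0], rows[1])
onehot_comedy_vs_kids = euclidean(rows[0], rows[2])
```

For a linear regression, one of the indicator columns is usually dropped (for example with `drop_first=True`) so that the remaining columns are not perfectly collinear with the intercept.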

When making data visualizations, it is essential that we understand our variable types. Plotting packages such as matplotlib and seaborn (which we will examine in Chapter 2) will typically render a numeric variable that is mapped to color using a gradient scale, implying a continuous spectrum across the values, whereas they will depict a categorical variable in a way that presents each category as completely distinct.

³ By ‘equivalents’ we are including text representations of numbers. If a column of data contained values like ‘three’, ‘twelve’, and ‘eighty-two’ to indicate the number of season pass sign-ups that day at Lobster Land, this would be a numeric variable, as those values stand for actual quantities.

⁴ int64 technically means “64-bit signed integer.” This just means that this variable could hold a huge range of possible integer values. A float64, by contrast, is a 64-bit floating-point number, which can also hold decimal values that fall between integers.