1.9 Categorical Data: Collapsing the Levels

At times, a data modeler may wish to reduce the number of distinct levels in a factor variable.

Let’s take a closer look at the day_type variable from the lobsterland_2021 dataset, with some help from the pandas value_counts() method, which tallies how many rows fall into each level.
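As a minimal sketch of that step, the snippet below builds a small stand-in for lobsterland_2021 and counts the levels of day_type. The rows are invented purely for illustration, since the real dataset is not reproduced here; only the column name and the eight level names come from the text.

```python
import pandas as pd

# Hypothetical stand-in for lobsterland_2021 (invented rows, for illustration only).
lobsterland_2021 = pd.DataFrame({
    "day_type": [
        "Sunny", "Sunny", "Very Sunny", "Partly Sunny",
        "Cloudy", "Partly Cloudy", "Overcast",
        "Rainy", "Very Rainy", "Sunny",
    ]
})

# Count how many rows fall into each level of day_type.
level_counts = lobsterland_2021["day_type"].value_counts()
print(level_counts)

# nunique() reports the number of distinct levels directly.
n_levels = lobsterland_2021["day_type"].nunique()
print(n_levels)
```

On the real data, the counts would of course differ, but the shape of the result (one row per level, sorted by frequency) would be the same.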

Do we really need that many levels? The value_counts() result shows eight different categorical descriptors of the day type. If we wished to visualize the relationship between day_type and some quantitative variable, such as unique visitors, it would be easier to do with fewer day_type categories.

Perhaps we can lump “Overcast”, “Cloudy”, and “Partly Cloudy” together, and simply label that category “Cloudy.” Similarly, we could turn “Rainy” and “Very Rainy” into “Rainy”, and combine “Partly Sunny”, “Very Sunny”, and “Sunny” into “Sunny.” With just three unique levels of day_type, this variable would become easier to visualize or summarize.

To collapse a factor in Python, we will again use a Python dictionary, here named ‘mapping’, whose keys are the original levels and whose values are the new labels. After passing ‘mapping’ to the replace() method of the day_type column, we are left with just three unique levels for day_type.⁶
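The collapsing step can be sketched as follows. The sample values are invented for illustration, and the dictionary name mapping follows the text. Levels that already carry their final label (“Cloudy”, “Rainy”, “Sunny”) do not need an entry, because replace() leaves unmatched values untouched.

```python
import pandas as pd

# Hypothetical sample of day_type values (invented, for illustration only).
day_type = pd.Series(
    ["Sunny", "Very Sunny", "Partly Sunny", "Cloudy", "Partly Cloudy",
     "Overcast", "Rainy", "Very Rainy"],
    name="day_type",
)

# Keys are the original levels; values are the collapsed labels.
mapping = {
    "Overcast": "Cloudy",
    "Partly Cloudy": "Cloudy",
    "Very Rainy": "Rainy",
    "Partly Sunny": "Sunny",
    "Very Sunny": "Sunny",
}

# replace() swaps each whole value that matches a key and leaves the rest as-is.
collapsed = day_type.replace(mapping)

print(collapsed.value_counts())
```

With a DataFrame, the same idea would be applied to the column and assigned back, for example lobsterland_2021["day_type"] = lobsterland_2021["day_type"].replace(mapping).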

Factor collapsing comes in handy whenever we wish to lump qualitative data together in order to reduce the number of distinct categories. For instance, we could take the 48 contiguous states of the United States and reduce them to six categories: New England, Mid-Atlantic, Southeast, Midwest, Southwest, and Pacific Northwest. This is akin to binning, in that it involves a tradeoff between specificity and ease of use. Turning 48 states into six groups makes something like a bar chart or pie chart feasible, but at a cost: the distinctions between Connecticut and Vermont, or between Idaho and Oregon, would be lost among the regional groupings.
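The state-to-region idea can be sketched the same way. The example below covers only the four states named above, and the dictionary name region_of and the sample rows are hypothetical. It uses map() rather than replace(), a deliberate swap: with map(), any state missing from the dictionary becomes NaN instead of silently passing through, which makes gaps in a long lookup table easy to spot.

```python
import pandas as pd

# Hypothetical sample of state values (invented, for illustration only).
states = pd.Series(
    ["Connecticut", "Vermont", "Idaho", "Oregon", "Vermont"],
    name="state",
)

# Partial lookup table: only the four states mentioned in the text.
region_of = {
    "Connecticut": "New England",
    "Vermont": "New England",
    "Idaho": "Pacific Northwest",
    "Oregon": "Pacific Northwest",
}

# map() returns NaN for any state that has no entry in the dictionary.
regions = states.map(region_of)

# List any states that the lookup table failed to cover.
unmapped = states[regions.isna()].unique()

print(regions.value_counts())
print("Unmapped states:", list(unmapped))
```

After the mapping, Connecticut and Vermont are indistinguishable in the regions column, which is exactly the loss of specificity described above.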

⁶ If day_type had been stored as a categorical variable beforehand, this operation may convert it back to a plain ‘string’ (object) column, depending on the pandas version.
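If a categorical column is wanted after collapsing, one option is to cast the result explicitly rather than rely on the dtype that the collapsing step returns. The sketch below assumes string data and invented sample values; it is one possible approach, not necessarily the one used with the original dataset.

```python
import pandas as pd

# Hypothetical sample of day_type values stored as plain strings.
day_type = pd.Series(["Sunny", "Very Sunny", "Overcast", "Very Rainy", "Rainy"])

mapping = {
    "Overcast": "Cloudy",
    "Very Rainy": "Rainy",
    "Very Sunny": "Sunny",
}

# Collapse the levels first, then cast the result to a categorical dtype.
collapsed_cat = day_type.replace(mapping).astype("category")

# Inspect the resulting dtype and the levels it records.
print(collapsed_cat.dtype)
print(list(collapsed_cat.cat.categories))
```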