1.6 Numeric Data: Binning
At times, we may wish to convert numeric data into categorical data. This is typically performed through a process known as discretization, or binning.
The benefit of binning comes from the resulting simplification of our data. If we have n observations in the dataset, we may have as many as n unique values for any of our continuous numeric variables. That can be a lot to keep track of! If we group those values into categories, however, the data becomes easier to manage.
A real-life binning example occurs at the end of each semester, when a university professor tallies the final grades for a course. For this group of seven students, the final averages come out as follows:
| Name | Final Average |
|------|---------------|
| Albert | .85153 |
| Brianna | .87001 |
| Charles | .94411 |
| Diana | .96442 |
| Edward | .83157 |
| Freddie | .89341 |
| Gina | .91312 |
At the end of the semester, the professor does not send these final averages to the registrar. Instead, he sends one letter grade per student, using the following conversion table:
| Final Average Range | Final Grade |
|---------------------|-------------|
| .93 and above | A |
| .90 to below .93 | A- |
| .87 to below .90 | B+ |
| .83 to below .87 | B |
After performing that conversion, this is what the professor actually sends for the seven students in this example:
| Name | Final Grade |
|------|-------------|
| Albert | B |
| Brianna | B+ |
| Charles | A |
| Diana | A |
| Edward | B |
| Freddie | B+ |
| Gina | A- |
Note how the data has been simplified: Whereas each student had a unique score in the professor’s spreadsheet of final numeric averages, there are just four unique values remaining after the binning process.
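The professor's conversion can be sketched in pandas. The DataFrame below is reconstructed from the tables above, and the bin edges mirror the conversion table (each bin includes its left edge):

```python
import pandas as pd

# Reconstruct the professor's gradebook from the table above
grades = pd.DataFrame({
    "Name": ["Albert", "Brianna", "Charles", "Diana", "Edward", "Freddie", "Gina"],
    "Final Average": [0.85153, 0.87001, 0.94411, 0.96442, 0.83157, 0.89341, 0.91312],
})

# Bin edges follow the conversion table: [.83, .87), [.87, .90), [.90, .93), [.93, 1.0)
grades["Final Grade"] = pd.cut(
    grades["Final Average"],
    bins=[0.83, 0.87, 0.90, 0.93, 1.00],
    labels=["B", "B+", "A-", "A"],
    right=False,  # each bin includes its left edge, so .87001 maps to B+
)
print(grades[["Name", "Final Grade"]])
```

Running this reproduces the binned table of letter grades shown above.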
Take this tiny example of just seven students, and expand it across the full course rosters of each course at the entire university, to see why the registrar would greatly prefer to see the categorical, binned grades. Limiting the number of unique values, and bringing these into a common format, makes the data far easier to manage.
While binning certainly brings simplicity and convenience, it carries an associated cost: whenever binning occurs, some of the precision in the original data is lost. In our example above, Freddie’s final average of .89341 was more than two full percentage points higher than Brianna’s average of .87001. After binning, however, an observer of the grades would not be able to see any distinction between Freddie’s grade and Brianna’s.
This tradeoff between simplicity and precision must be considered whenever deciding how many bins to create: a larger number of bins means that less precision will be lost, whereas a smaller number of bins means that the resulting categories will be easier to work with.
Informally, people simplify data in a similar way, without consciously thinking about it.
Suppose that you are a graduate student studying in Boston, and that a good childhood friend of yours, thousands of miles away, is considering U.S. graduate programs. If she asks you, “What is Boston like in the month of January?”, you might reply with just a single word: “Cold.” If you’re feeling especially descriptive, you might even tell your friend, “Cold and snowy.”
Alternatively, you could be more precise than this, with the use of quantitative values in your responses. When your friend asked you about Boston in January, you could have instead replied, “The average daily high temperature in Boston in January is 37 degrees Fahrenheit, and the average low is 24 degrees. The city averages 12.9 inches of snow during that month, as well as 3.4 inches of rain.”
That second answer would be quite a mouthful, though!
It is much simpler for you to say “cold and snowy” – and much simpler for your friend to process, too. Admittedly, that answer is imprecise. It lumps Boston together with every other cold and snowy city in the world. We could use “cold and snowy” just as readily for Buffalo, New York; Aomori, Japan; or Harbin, China. Which of these places is actually coldest? Which is snowiest? When we reduce the description to just a single category or two, we lose those distinctions – but we can still convey a general idea in a concise way.
With pandas, we can bin numerical variables into categories using cut() and qcut(). The cut() function generates equal-width bins, whereas the qcut() function generates equal-frequency bins. Equal-width binning produces bins that each span the same range of values, whereas equal-frequency binning produces bins that each contain a similar number of records.
Let’s take a look at the difference between those two approaches, using the ‘Average’ variable (representing the day’s average temperature) from the lobsterland2021 dataset.
To generate equal-width bins, cut() starts by looking at the range of values in this column. Here is a look at the max, the min, and the range for ‘Average’:

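Since the lobsterland2021 dataset is not reproduced here, the sketch below uses a synthetic stand-in for the ‘Average’ column (an assumption for illustration); the method calls are the same either way:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for lobsterland2021 -- a year of daily average temperatures
rng = np.random.default_rng(7)
lobsterland2021 = pd.DataFrame({"Average": rng.uniform(40.0, 72.1, size=365).round(1)})

col_max = lobsterland2021["Average"].max()
col_min = lobsterland2021["Average"].min()
print("max:", col_max)
print("min:", col_min)
print("range:", col_max - col_min)
```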
With a range of approximately 32.1, and a user-specified parameter calling for three bins, the resulting bins each have a range of approximately 10.7.

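A minimal equal-width sketch, again using a synthetic stand-in for the dataset (the "Cool"/"Moderate"/"Hot" labels are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for lobsterland2021 (the real dataset is not reproduced here)
rng = np.random.default_rng(7)
lobsterland2021 = pd.DataFrame({"Average": rng.uniform(40.0, 72.1, size=365).round(1)})

# Three equal-width bins: each spans roughly one third of the overall range
width_bins = pd.cut(lobsterland2021["Average"], bins=3, labels=["Cool", "Moderate", "Hot"])
print(width_bins.value_counts())
```

Because the bins have equal widths rather than equal counts, the three groups will generally hold different numbers of records.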
With qcut(), the data are instead split by quantile. Specifying q=3 tells qcut() to generate three such bins. Note that using qcut() will not always generate bins of exactly equal counts. The data may not be evenly divisible by the number of bins specified, or some values may occur repeatedly throughout the column, leading to a slight imbalance among the resulting groups.

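The equal-frequency counterpart, sketched with the same synthetic stand-in:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for lobsterland2021 (the real dataset is not reproduced here)
rng = np.random.default_rng(7)
lobsterland2021 = pd.DataFrame({"Average": rng.uniform(40.0, 72.1, size=365).round(1)})

# Three equal-frequency bins: each holds roughly one third of the 365 records
freq_bins = pd.qcut(lobsterland2021["Average"], q=3, labels=["Cool", "Moderate", "Hot"])
print(freq_bins.value_counts())
```

The counts will be close to, but not always exactly, 365 / 3 per bin, for the reasons described above.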
To make such a change permanent within the dataframe, you can assign the results back to the name of the variable that you wish to modify:

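A sketch of that assignment, again using the synthetic stand-in dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for lobsterland2021 (the real dataset is not reproduced here)
rng = np.random.default_rng(7)
lobsterland2021 = pd.DataFrame({"Average": rng.uniform(40.0, 72.1, size=365).round(1)})

# Overwrite the numeric column with its binned version to make the change permanent
lobsterland2021["Average"] = pd.qcut(
    lobsterland2021["Average"], q=3, labels=["Cool", "Moderate", "Hot"]
)
print(lobsterland2021["Average"].dtype)  # the column is now categorical
```

Note that the original numeric values are gone after this step, so it is worth keeping a copy of the dataframe if you may need them again.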
So which is the right way to bin your variables – equal width or equal frequency? As with so much else in the world of data science, it depends. The answer to this question will often come back to the business or research purposes of the modeler. In fact, there really isn’t any requirement to use either of these methods – someone could instead set temperature cutoffs at certain points, based on his own definitions of “Cool”, “Moderate”, and “Hot”, and bin the temperatures based on those thresholds.
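Custom thresholds of that kind can be passed to cut() directly as a list of bin edges. The 55- and 65-degree cutoffs below are arbitrary, for illustration only:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for lobsterland2021 (the real dataset is not reproduced here)
rng = np.random.default_rng(7)
lobsterland2021 = pd.DataFrame({"Average": rng.uniform(40.0, 72.1, size=365).round(1)})

# Domain-driven cutoffs rather than equal-width or equal-frequency bins;
# the 55 and 65 thresholds are hypothetical definitions of "Cool"/"Moderate"/"Hot"
custom = pd.cut(
    lobsterland2021["Average"],
    bins=[-np.inf, 55, 65, np.inf],
    labels=["Cool", "Moderate", "Hot"],
)
print(custom.value_counts())
```

Using -inf and inf as the outermost edges guarantees that no value falls outside the bins.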
One note of caution with any form of binning is that the modeler should examine the results afterwards, using value_counts(), to ensure that the outcome aligns with his expectations. If a dataset contains some extreme outliers, equal-width binning could lead to some very imbalanced groups (to see why, think about how outliers affect the range).
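A small illustration of that outlier effect, using a made-up series of temperatures: a single extreme value stretches the range, so equal-width binning crowds nearly every record into one bin.

```python
import pandas as pd

# Mostly moderate temperatures, plus one extreme outlier
temps = pd.Series([58.0, 60.0, 61.0, 59.0, 62.0, 60.5, 150.0])

# The outlier stretches the range; almost everything lands in the lowest bin
print(pd.cut(temps, bins=3).value_counts())
```

A value_counts() check like this would immediately reveal the imbalance.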
When a binned variable is the response variable in a classification model, it is especially important to understand the balance of records from group to group. If a numeric variable is binned into five groups, but one group contains 95 percent of the records, then a classification model with 90 percent accuracy may not be very impressive. If those five groups were binned with equal frequencies, however, 90 percent accuracy would be phenomenal!