2.9 Histograms

Checking the distribution of a dataset is one of the first things to do in exploratory data analysis, because it helps us determine how the data should be treated. A histogram is a quick and effective way to do this for a single continuous numeric variable. A histogram typically depicts the variable of interest on the x-axis, grouped into bins of equal width. The y-axis usually shows a count: the number of records that fall into each bin. ⁵

By changing the number of bins, we can alter the appearance of the histogram; keep in mind, however, that we are not changing the data itself when we do this.

There is no “correct” number of bins to use with a histogram, just as there is no “correct” amount of zoom to use when taking a photograph. A photographer who zooms in on a subject gets a more close-up view, whereas a photographer who does not zoom in gets a more general, big-picture view.
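The point that binning changes the view but not the data can be checked numerically. The sketch below is only an illustration: it uses synthetic, right-skewed values (not the lobsterland_2021 data) and NumPy's histogram function to show that the bin counts always add up to the same number of records, whatever number of bins is chosen.

```python
import numpy as np

# Synthetic, right-skewed stand-in for a revenue variable (an assumption,
# not the book's dataset)
rng = np.random.default_rng(0)
revenue = rng.gamma(shape=2.0, scale=500.0, size=200)

totals = {}
for n_bins in (5, 15, 40):
    # np.histogram returns the count per bin and the bin edges
    counts, edges = np.histogram(revenue, bins=n_bins)
    totals[n_bins] = int(counts.sum())
    bin_width = edges[1] - edges[0]
    print(f"bins={n_bins:>2}  width={bin_width:8.1f}  records={totals[n_bins]}")
```

More bins means narrower bins and a more detailed picture; fewer bins means wider bins and a broader one. The total number of records is the same in every case.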

Histograms can be helpful for identifying a variable’s skewness. In the histograms below, we will see clear evidence of positive skew, or right skew, for the merchandise revenue variable.

The histogram shown above is built with a default number of bins – note that only the variable name and the dataset source were passed to the sns.histplot() function.

We can adjust that default bin number by explicitly passing a ‘bins’ parameter to the histplot() function, as shown below.

The histogram shown above depicts the exact same data distribution as does the previous one; however, this one provides a more detailed perspective. Each bin’s width is precisely 1/15th of the variable’s range.

If we want to see a more general picture of the distribution, we can reduce the number of bins, as shown below:

With the hue parameter, we can add even more information to the plot. Let’s see if we can spot any relationship between daily merchandise revenue and special event types. We will include multiple='stack' so that the groups are stacked on top of one another rather than overlapping in the graph.

That’s a tough graph to interpret! The event codes are just listed as numbers here (remember that the dataset description for lobsterland_2021 can be found in Chapter 1), but the bigger problem with this graph is that these Spec_Event levels are not currently coded as distinct category levels. Let’s check the data type for this variable to verify this:
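One way to make this check, sketched on a synthetic stand-in for the Spec_Event column (the codes 0 to 3 are assumptions), is to inspect the column's dtype attribute:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Spec_Event column, stored as integer codes
rng = np.random.default_rng(42)
df = pd.DataFrame({"Spec_Event": rng.integers(0, 4, size=200)})

event_dtype = df["Spec_Event"].dtype
print(event_dtype)  # an integer dtype, not a categorical one
```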

Because the event categories are “seen” as integer values, they are being treated that way in the plot. Rather than appear as distinct colors, the categories are plotted along a continuum of shades of purple.

We can address each of those issues with the code shown below. More details about this code can be found in Chapter 1, but we can summarize here by saying that the code converts this variable to a categorical type and renames the category levels more descriptively.
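A minimal sketch of this kind of conversion follows. The mapping from codes to names is hypothetical (the actual codes are documented in the dataset description in Chapter 1), and the data are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Spec_Event column, stored as integer codes
rng = np.random.default_rng(42)
df = pd.DataFrame({"Spec_Event": rng.integers(0, 4, size=200)})

# Hypothetical code-to-name mapping, for illustration only
event_names = {0: "None", 1: "Country", 2: "Rock", 3: "Poetry"}

# Convert to a categorical type, then give the levels descriptive names
df["Spec_Event"] = (
    df["Spec_Event"].astype("category").cat.rename_categories(event_names)
)

print(df["Spec_Event"].dtype)
print(list(df["Spec_Event"].cat.categories))
```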

The resulting plot is coded the same way as the previous one, but the result is quite different.

It is now far easier for us to extract meaning from this plot. We can see here that on days with country music or rock music performances, merchandise revenue tends to be on the lower side. Poetry readings, however, seem to coincide with higher merchandise sales totals.

A note of caution: do not confuse a histogram with a bar plot. A histogram shows the distribution of a continuous numeric variable; a bar plot shows a value, such as a count or a mean, for each level of a categorical variable.

⁵ Alternatively, a histogram can depict relative frequencies on the y-axis. In seaborn, a user can switch a histogram from counts to relative frequencies by passing stat='probability' to the histplot() function.