
2.16 Additional Seaborn Plot Types

A kernel density estimate (KDE) plot can be regarded as a smoothed histogram: instead of discretizing the observations into separate bins, it displays a univariate distribution as a continuous curve. The curve is an estimate of the variable’s probability density function, with the peaks in the graph showing us where the data is most concentrated. Just as we can change the level of detail in a histogram by adjusting the bin width, we can change the level of detail in a kdeplot by adjusting its bandwidth. A smaller bandwidth follows the data more closely, while a larger one produces a smoother curve.

By specifying cumulative=True in the kdeplot() function, we can see a cumulative distribution function (CDF). A CDF’s y-axis value starts at 0.0 and rises to 1.0 (or 100%) as more data points are accounted for along the x-axis. The CDF allows us to estimate probabilities: its height at any x-value is the share of observations at or below that value. In this case, we can determine the chances of merchandise revenue falling within a certain range, or at or below a given value.

By eyeballing the chart above, we can estimate that the probability of Lobster Land’s merchandise revenue falling between $10,000 and $35,000 is 0.6 or 60%. That figure is the difference between the curve’s heights at those two x-values.

Seaborn’s lmplot() function generates a scatterplot with a best-fit line, fitted via ordinary least squares linear regression. This function will not give us specific details about the model or its coefficients, though.
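A minimal sketch of lmplot() follows, using synthetic data with a made-up linear relationship between two columns. Because lmplot() only draws the line, the sketch also shows one possible way to recover the slope and intercept separately, here with NumPy's `polyfit`; this is a swapped-in helper for illustration, not something lmplot() exposes.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: the true slope of 4.0 is an arbitrary assumption
rng = np.random.default_rng(seed=2)
visitors = rng.integers(1_000, 5_000, size=120)
revenue = 4.0 * visitors + rng.normal(0, 1_500, size=120)
df = pd.DataFrame({"UniqueVisitor": visitors, "LobsteramaRev": revenue})

# Scatterplot plus OLS best-fit line (with a shaded confidence band by default)
grid = sns.lmplot(data=df, x="UniqueVisitor", y="LobsteramaRev")

# lmplot draws the fit but does not report it;
# a degree-1 polynomial fit gives the same least-squares line
slope, intercept = np.polyfit(df["UniqueVisitor"], df["LobsteramaRev"], deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```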

A jointplot enables us to see a scatterplot depicting an x-y relationship, along with separate histograms for the x and the y variables, as shown below:

A hexagonal bin plot is in some ways like a scatterplot, in terms of its depiction of numeric variables on the x- and y-axes. However, rather than depicting each point individually, it groups the points into hexagonal bins, as shown below. The bins are shaded based on the frequency of observations within each one. The plot below shows us, for instance, that points are most concentrated where the unique visitor count is just below 3000, and staff hours are between 600 and 800.

Hexagonal bin plots are especially effective when you wish to plot a huge dataset, and a scatter plot is ineffective due to overplotting.

The section on basic visualization types noted a potential drawback associated with boxplots and with bar plots that depict average values – they do not communicate the number of records that belong to each group.

A swarm plot addresses this issue by ensuring that each point is plotted separately, with no overlap among points. Swarm plots work well for many datasets, such as the one shown here – however, they may not work well with enormous datasets, due to the way they try to create unique points for each observation.
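To illustrate, the sketch below builds three groups of deliberately unequal size and draws them with swarmplot(). The `DayType` grouping and the revenue values are hypothetical.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(seed=5)
# Deliberately unequal group sizes, which a box plot or bar plot would hide
group_sizes = {"Weekday": 60, "Weekend": 25, "Holiday": 8}
df = pd.DataFrame(
    [
        {"DayType": day_type, "DailyGrossRev": rev}
        for day_type, n in group_sizes.items()
        for rev in rng.normal(20_000, 4_000, size=n)
    ]
)

# Each observation becomes its own marker, nudged sideways to avoid overlap
ax = sns.swarmplot(data=df, x="DayType", y="DailyGrossRev")
```

With only 8 markers in the smallest group and 60 in the largest, the difference in group size is visible at a glance.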

Next, we will look at a violin plot – but before we do, let’s have a look at another box plot. In the plot below, we can gain a sense of the distribution of daily gross revenue for each of the seven days of the week. However, because each of the boxes has a uniform width, we do not gain any sense of where the distribution is most dense.

Enter the violin plot. As shown in the example below, a violin plot also depicts the distribution of a numeric variable, separated by category. Unlike a box plot, however, a violin plot becomes wider at the places where the distribution is densest. The violin plot contains a “mini” box plot for each group, with a white dot at the median. From the plot below, we can see that Tuesday and Wednesday are the most predictable days, in terms of daily revenue. Each of those days shows a wide ‘bulge’ near the center of the distribution, whereas other days appear longer and narrower.
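The contrast between the two plot types can be sketched side by side. The data here is synthetic: Tuesday and Wednesday are given a smaller spread on purpose so that their violins bulge near the center, mimicking the pattern described above.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
rng = np.random.default_rng(seed=6)
# Assumed spreads: Tue and Wed are made more predictable than the other days
spread = {"Tue": 1_500, "Wed": 1_500}
df = pd.DataFrame(
    [
        {"Day": day, "DailyGrossRev": rev}
        for day in days
        for rev in rng.normal(20_000, spread.get(day, 5_000), size=50)
    ]
)

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
# Box plot: uniform box width, so density within the box is not visible
sns.boxplot(data=df, x="Day", y="DailyGrossRev", order=days, ax=ax_box)
# Violin plot: width tracks density; inner="box" adds the miniature box plot
sns.violinplot(data=df, x="Day", y="DailyGrossRev", order=days, inner="box", ax=ax_violin)
```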

Heatmaps offer a color coding system for depicting variables’ values. Part of heatmaps’ appeal comes from the way that they can quickly draw a reader’s attention to some particular value.
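One common use of a heatmap, assumed here for illustration, is to display a correlation matrix. The sketch below builds three synthetic columns, computes their pairwise correlations, and color-codes the result with each value annotated in its cell.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic columns with invented relationships
rng = np.random.default_rng(seed=7)
visitors = rng.normal(3_000, 600, size=200)
df = pd.DataFrame(
    {
        "UniqueVisitor": visitors,
        "StaffHours": 0.2 * visitors + rng.normal(0, 60, size=200),
        "LobsteramaRev": 4.0 * visitors + rng.normal(0, 3_000, size=200),
    }
)

corr = df.corr()  # pairwise Pearson correlations

# Fixing vmin and vmax at -1 and 1 centers the diverging color scale on zero
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
```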

A seaborn pairplot can be a helpful tool for a modeler seeking to view several relationships among variables at once. This function generates a histogram for each of the variables passed to it, as well as separate scatter plots for each variable pairing, as shown below.

In the first pairplot, some of the scatter plots appeared twice, albeit with the axis relationships reversed. For instance, the plot in the upper right corner depicts ‘LobsteramaRev’ on the x-axis, with ‘UniqueVisitor’ on the y-axis. In the lower left corner, we can see ‘UniqueVisitor’ on the x-axis, with ‘LobsteramaRev’ on the y-axis.

Pie charts can be effective for telling a very general story. They can work well when there are a small number of categories depicted, and when the differences among them are large.

The trouble with pie charts is that when there are too many pie “slices”, or when the slices are similar in size, it becomes very hard for an audience to discern the relative sizes of the items being compared. In the image below, we can see a pie chart that depicts the relative occurrences of the special events at Lobster Land.

In the plot shown above, which slice is bigger – comedy show or country music? And how does youth show stack up against either of these? You could pose these questions to many people and get a very inconsistent set of responses. Although comedy show and country music have equal-sized slices, it’s tempting to think that comedy show is larger, because of its lighter color.

We can address this shortcoming by annotating our plot with the per-slice percentages, as shown below.
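In matplotlib, per-slice percentages can be added with the `autopct` argument of `pie()`. The event counts below are hypothetical (and "Fireworks" is an invented fourth category); comedy show and country music are tied on purpose so that the labels make the tie explicit.

```python
import matplotlib.pyplot as plt

# Hypothetical counts of special events, not the book's data
event_counts = {
    "Comedy show": 12,
    "Country music": 12,
    "Youth show": 10,
    "Fireworks": 16,
}

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    list(event_counts.values()),
    labels=list(event_counts.keys()),
    autopct="%1.1f%%",  # print each slice's share with one decimal place
    startangle=90,
)
ax.set_title("Special events (synthetic counts)")
```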

We can also avoid any issues related to pie slice interpretability by just using an alternative plot. Let’s see how the same data would look, if presented as a bar plot built with seaborn’s countplot() function:

With this depiction of the data, there would not be any interpretability challenges for the audience. The relative heights of bars are just far easier to discern, compared with the areas associated with each slice of a pie.
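A countplot() version of such a chart might look like the sketch below. The event counts are hypothetical (and "Fireworks" is an invented category); they are expanded into one row per event, since countplot() tallies rows rather than accepting precomputed totals.

```python
import pandas as pd
import seaborn as sns

# Hypothetical counts of special events, not the book's data
event_counts = {
    "Comedy show": 12,
    "Country music": 12,
    "Youth show": 10,
    "Fireworks": 16,
}

# One row per event occurrence, the shape countplot() expects
df = pd.DataFrame(
    {"Event": [name for name, n in event_counts.items() for _ in range(n)]}
)

# One bar per category, with bar height equal to the number of rows
ax = sns.countplot(data=df, x="Event")
ax.set_ylabel("Number of events")
```

With bars, the tie between comedy show and country music is immediately visible as two bars of equal height.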

In this chapter, we have only scratched the surface regarding plotting options available in matplotlib and seaborn. For examples of additional seaborn plot types, you can view the Gallery on the seaborn home page. For the complete matplotlib documentation, we encourage you to check out that library’s home page.