2.10 Box Plots
Box plots summarize the spread of numerical data and can be used to compare groups. For example: are fireworks associated with days of higher total revenue at Lobster Land? Box plots are also effective at highlighting outliers.
Typically, a box plot depicts a categorical variable on one axis and a numeric variable on the other. The markings on the box plot are essentially a five-number summary: the box's lower boundary is the 25th percentile, and the upper boundary is the 75th percentile. The box, therefore, depicts the interquartile range (IQR). The median is depicted with a thick black line in the middle.

By default, the seaborn boxplot treats as an outlier any value that is either greater than the 75th percentile plus 1.5 × the IQR, or less than the 25th percentile minus 1.5 × the IQR. Outliers are drawn as separate points, beyond the short horizontal lines (the caps) that sit perpendicular to the "whiskers" extending from the boxes. In the box plot shown below, there are two low outliers for daily gross revenue that occurred on days with fireworks, and one high outlier for daily gross revenue on a day without fireworks.
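The 1.5 × IQR rule described above can be worked through by hand. The sketch below uses only Python's standard library; the revenue figures and the helper name `iqr_outliers` are invented for illustration and do not come from the Lobster Land data. Quartile conventions differ slightly between libraries, so the exact cutoffs a plotting library draws may not match these numbers to the last decimal.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (lower_cutoff, upper_cutoff, outliers) under the k x IQR rule."""
    # method="inclusive" interpolates linearly between the observed data points
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_cutoff = q1 - k * iqr
    upper_cutoff = q3 + k * iqr
    outliers = [v for v in values if v < lower_cutoff or v > upper_cutoff]
    return lower_cutoff, upper_cutoff, outliers

# Hypothetical daily revenue values (in thousands), for illustration only
revenue = [10, 12, 13, 14, 15, 16, 18, 40]
lower, upper, flagged = iqr_outliers(revenue)
print(lower, upper, flagged)  # 7.125 22.125 [40]
```

Here the 25th and 75th percentiles are 12.75 and 16.5, so the IQR is 3.75 and any value below 7.125 or above 22.125 would be drawn as a separate point.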

An important caveat to keep in mind with box plots is that they offer no information to the reader about the number of records in each category. This limitation can be easy to miss. Naturally, we are predisposed to associate a larger container with more items; however, the size of the box in a box plot only shows how far apart the 75th percentile and 25th percentile values are.
The Lobster Land box plot shown above indicates that days with fireworks are associated with greater total revenue.
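To make this caveat concrete, here is a minimal sketch with two made-up groups: one holds 5 records and the other holds 100, yet both produce the same 25th and 75th percentiles and therefore the same box. The group names, the values, and the helper `box_edges` are assumptions for illustration only.

```python
from statistics import quantiles

def box_edges(values):
    """Return the (25th percentile, 75th percentile) pair that bounds the box."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3

# Two hypothetical groups: the second repeats the first 20 times over
groups = {
    "small": [1, 2, 3, 4, 5],
    "large": [1, 2, 3, 4, 5] * 20,
}

# Record the count and the box boundaries for each group
summary = {
    name: {"n": len(vals), "box": box_edges(vals)}
    for name, vals in groups.items()
}

for name, stats in summary.items():
    print(name, stats["n"], stats["box"])
```

Both groups report box boundaries of 2.0 and 4.0 even though one has twenty times as many records, so reporting the group counts alongside a box plot is one way to give readers the missing context.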
Simple Visualizations
| Plot Type | Most Often Used When… | Advantages | Keep in Mind… |
|---|---|---|---|
| scatterplot | You want to depict the relationship between two numeric variables | These enable a viewer to quickly spot a linear relationship, a non-linear relationship, or a lack of relationship between variables. | Overplotting can occur, especially with large datasets that have heavily concentrated values. |
| line plot | You want to depict the measurement of a single numeric variable across time | Line plots are especially good for enabling a viewer to see changes across time | A narrower range of values along the y-axis can make changes appear more dramatic, whereas a wider range of values along the y-axis has the opposite effect |
| bar plot | You want to either compare frequencies of the levels of a categorical variable, by counts, or compare levels of a categorical variable by some central tendency measure such as a mean or median | These tend to be easy for viewers to interpret. People tend to be very good at discerning the height differences among bars. | Bar plots showing counts are among the most straightforward types of plots. Bar plots that show a central measure, such as a mean or median, can be misleading if the categories are very imbalanced by counts. |
| histogram | You want to depict the distribution of a single numeric variable | Can help to identify right or left skewness for a univariate distribution | Determining the optimal number of bins can be a matter of trial-and-error. It can depend on your goals, as well as your audience |
| boxplot | You want to compare the distribution of a numeric variable, separated out by category levels | Can be used to quickly identify outliers | Box sizes bear no relationship to the number of records in each group. |