
2.5 Scatter Plots

The fundamental purpose of a scatter plot is to depict the relationship between two numeric variables. One of these variables is placed horizontally, along the x-axis, while the other is placed vertically, along the y-axis. When there is a cause-and-effect relationship between them, the analyst places the input variable (or independent variable) on the x-axis, with the outcome variable (or dependent variable) on the y-axis.

A scatter plot can also depict information about other variables, besides the ones that form the x- and y-axes. For instance, plot points can be assigned size, shape, color, or transparency attributes that adjust based on the characteristics of other variables.

Using a pandas DataFrame that is already in our Python environment, we can generate a scatter plot in seaborn with the scatterplot() function. We pass the data source to the data parameter, and the names of the variables for each axis to the x and y parameters, as shown below.
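A minimal sketch of such a call follows. The park dataset is not reproduced here, so the DataFrame is a synthetic stand-in with made-up values, and the column names unique_visitors and merch_revenue are assumptions rather than the actual names in the data.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the park dataset (made-up values, assumed column names)
rng = np.random.default_rng(0)
visitors = rng.integers(1000, 5000, size=99)
df = pd.DataFrame({
    "unique_visitors": visitors,
    "merch_revenue": visitors * 12.5 + rng.normal(0, 1500, size=99),
})

# data= names the DataFrame; x= and y= name the column placed on each axis
ax = sns.scatterplot(data=df, x="unique_visitors", y="merch_revenue")
```

By default, seaborn takes the axis labels from the column names passed to x and y.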

In a notebook, a line of metadata enclosed in angle brackets (< >) appears immediately above the plot. To avoid seeing it, we can run the code with a semicolon at the end of the line, as shown below:
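The metadata in question is the text representation of the matplotlib Axes object that scatterplot() returns, which a notebook echoes when the call is the last expression in a cell. The sketch below shows the trailing semicolon in place; the data is a synthetic stand-in and the column names are assumptions.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (made-up values, assumed column names)
rng = np.random.default_rng(0)
visitors = rng.integers(1000, 5000, size=99)
df = pd.DataFrame({
    "unique_visitors": visitors,
    "merch_revenue": visitors * 12.5 + rng.normal(0, 1500, size=99),
})

# As the last line of a notebook cell, the trailing semicolon keeps the
# returned Axes object from being echoed as text above the plot
sns.scatterplot(data=df, x="unique_visitors", y="merch_revenue");
```

Assigning the result to a variable (for example, ax = sns.scatterplot(...)) has the same effect, since the cell then no longer ends with a bare expression.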

This plot reveals a relationship that fits with what we would probably expect to see – as the number of unique visitors increases, the total merchandise revenue at the park also increases, in an almost perfectly linear fashion.

Each point on the plot represents one observation – it tells us the number of unique visitors, and the total merchandise revenue generated at the park, for one day’s operation in the summer of 2021. We cannot discern 99 distinct points here because of overplotting – some of the points are plotted directly atop one another.

Overplotting is not necessarily a problem – the plot still paints a very clear picture of the linear relationship between unique visitors and merchandise revenue. However, sometimes adjusting the alpha setting of a scatter plot helps to make some points more visible. Alpha values range from 0 to 1, with 0 being completely transparent, and 1 being completely opaque. When no alpha value is specified, the default setting is 1.
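As a hedged illustration of the alpha setting, the sketch below draws the points at 40% opacity so that overlapping points appear darker. The data is synthetic and the column names are assumptions; the value 0.4 is only an example.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (made-up values, assumed column names)
rng = np.random.default_rng(0)
visitors = rng.integers(1000, 5000, size=99)
df = pd.DataFrame({
    "unique_visitors": visitors,
    "merch_revenue": visitors * 12.5 + rng.normal(0, 1500, size=99),
})

# alpha=0.4 makes each point 40% opaque, so stacked points look darker
ax = sns.scatterplot(data=df, x="unique_visitors", y="merch_revenue", alpha=0.4);
```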

In each of the examples shown above, seaborn conveniently labeled our axes, using the same names as the variables from the dataset. Suppose we wanted to use more formal descriptions for these terms, rather than the abbreviated names that appeared in the dataset. Using the plt.xlabel(), plt.ylabel(), and plt.title() functions shown below, we can customize the title and axis labels to make the plot even easier for a reader to interpret. Note the use of the \n symbol in the plt.title() function. This is known as an “escape sequence”, and is used to create the line break that appears in the graph’s title. More types of plot customizations will be shown later in this chapter.
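The following sketch shows one way these three functions might be combined. The label and title wording is illustrative only, the data is synthetic, and the column names are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (made-up values, assumed column names)
rng = np.random.default_rng(0)
visitors = rng.integers(1000, 5000, size=99)
df = pd.DataFrame({
    "unique_visitors": visitors,
    "merch_revenue": visitors * 12.5 + rng.normal(0, 1500, size=99),
})

ax = sns.scatterplot(data=df, x="unique_visitors", y="merch_revenue")

# Replace the default labels (the column names) with fuller descriptions
plt.xlabel("Unique Visitors")
plt.ylabel("Merchandise Revenue")

# The \n escape sequence splits the title across two lines
plt.title("Merchandise Revenue vs. Unique Visitors\nLobster Land, Summer 2021");
```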

Although the axes of a scatter plot should always consist of numeric variables, we can also incorporate categorical data into such a plot type. In the graph below, we accomplish that by adding the ‘hue’ parameter. A legend is added automatically. We can see here that unique visitor numbers are strongly correlated with Gold Zone revenue, and that the values of each of these two numeric variables tend to be higher on days when Lobster Land hosts fireworks shows.
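A minimal sketch of the ‘hue’ parameter follows. ‘Fireworks’ is the variable named in the text, but its “Yes”/“No” values, the other column names, and all of the data here are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (made-up values; assumed column names and levels)
rng = np.random.default_rng(0)
visitors = rng.integers(1000, 5000, size=99)
df = pd.DataFrame({
    "unique_visitors": visitors,
    "gold_zone_rev": visitors * 8.0 + rng.normal(0, 2000, size=99),
    "Fireworks": np.where(np.arange(99) % 3 == 0, "Yes", "No"),
})

# hue= colors each point by its category and adds a legend automatically
ax = sns.scatterplot(data=df, x="unique_visitors", y="gold_zone_rev",
                     hue="Fireworks");
```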

Of course, this plot does not tell us anything about causality. Perhaps more people are drawn to the park on days with fireworks, or perhaps the park simply holds more fireworks shows on busier days, such as weekends and holidays.

If we want to adjust the assignment of colors to values, as well as the placement of the levels on the legend, we can make this tweak by using the ‘hue_order’ parameter, as shown below.
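The sketch below illustrates ‘hue_order’ under the same kind of assumptions: synthetic data, assumed column names, and assumed “Yes”/“No” levels for ‘Fireworks’. Here “Yes” appears first in the data, so listing “No” first changes both the color assignment and the legend order.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (made-up values; assumed column names and levels)
rng = np.random.default_rng(0)
visitors = rng.integers(1000, 5000, size=99)
df = pd.DataFrame({
    "unique_visitors": visitors,
    "gold_zone_rev": visitors * 8.0 + rng.normal(0, 2000, size=99),
    "Fireworks": np.where(np.arange(99) % 3 == 0, "Yes", "No"),
})

# hue_order lists the levels in the order they should receive colors
# and appear in the legend
ax = sns.scatterplot(data=df, x="unique_visitors", y="gold_zone_rev",
                     hue="Fireworks", hue_order=["No", "Yes"]);
```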

As for changing the colors themselves, we will explore that later in this chapter.

Note that we can also bring information about numeric variables onto a scatter plot, beyond the ones depicted on the two axes. In the plot below, we pass the ‘Precip’ variable to the scatterplot() function, via the ‘size’ parameter. This enables us to see larger points for rainier days.
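A hedged sketch of the ‘size’ parameter follows. ‘Precip’ is the variable named in the text; the remaining column names and every value are made up. The optional sizes argument, which sets the smallest and largest point areas, is included only as an example.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (made-up values, assumed column names)
rng = np.random.default_rng(0)
visitors = rng.integers(1000, 5000, size=99)
df = pd.DataFrame({
    "unique_visitors": visitors,
    "gold_zone_rev": visitors * 8.0 + rng.normal(0, 2000, size=99),
    "Precip": rng.exponential(0.3, size=99).round(2),
})

# size= scales each point by the Precip value for that day;
# sizes= gives the (smallest, largest) point areas to map onto
ax = sns.scatterplot(data=df, x="unique_visitors", y="gold_zone_rev",
                     size="Precip", sizes=(20, 200));
```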

It might surprise you to see that on several of the days with the highest rainfall totals, there were large numbers of unique visitors. Strange as it may seem, there are several possible explanations. For one, the park is open for only 12 hours each day (from 9 a.m. to 9 p.m.), but the weather data is collected across a 24-hour period. What if torrential rains fell on some morning between 2 a.m. and 4:30 a.m., long before the park opened? It could also be the case that heavy rains fell mid-day, causing people to leave other places (like the beach!) and head to the Gold Zone.

Seaborn’s relplot() function gives us the opportunity to easily generate side-by-side plots in the manner shown below. By specifying col=‘Fireworks’ we are instructing seaborn to separate the graph into two plots, which show the relationship between Gold Zone revenue and unique visitors on days without fireworks, and then on days with fireworks.
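A minimal sketch of relplot() with the ‘col’ parameter follows, using synthetic data, assumed column names, and assumed “Yes”/“No” levels for ‘Fireworks’. relplot() draws scatter plots by default and returns a FacetGrid rather than a single Axes.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (made-up values; assumed column names and levels)
rng = np.random.default_rng(0)
visitors = rng.integers(1000, 5000, size=99)
df = pd.DataFrame({
    "unique_visitors": visitors,
    "gold_zone_rev": visitors * 8.0 + rng.normal(0, 2000, size=99),
    "Fireworks": np.where(np.arange(99) % 3 == 0, "Yes", "No"),
})

# col= creates one panel per Fireworks level, laid out side by side
g = sns.relplot(data=df, x="unique_visitors", y="gold_zone_rev",
                col="Fireworks");
```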

Alternatively, if we wish to present the ‘Fireworks’ information down the rows, rather than across the columns, we can instead use the ‘row’ parameter, as shown below.
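The row-wise layout can be sketched the same way by swapping ‘col’ for ‘row’. As before, the data is synthetic, and the column names and “Yes”/“No” levels are assumptions.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (made-up values; assumed column names and levels)
rng = np.random.default_rng(0)
visitors = rng.integers(1000, 5000, size=99)
df = pd.DataFrame({
    "unique_visitors": visitors,
    "gold_zone_rev": visitors * 8.0 + rng.normal(0, 2000, size=99),
    "Fireworks": np.where(np.arange(99) % 3 == 0, "Yes", "No"),
})

# row= stacks one panel per Fireworks level vertically
g = sns.relplot(data=df, x="unique_visitors", y="gold_zone_rev",
                row="Fireworks");
```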

We can also pass separate categorical variables to the ‘col’ and ‘row’ parameters to see a separate plot for each combination of their levels. Since there are six different special event designations, and two fireworks designations, this produces 12 separate plots (6 × 2), one for every possible combination.
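The sketch below combines the two parameters. The special_event column name and its six placeholder labels are hypothetical, as are the other column names, the “Yes”/“No” levels, and all of the values. The height argument simply keeps each of the 12 panels small.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (made-up values; assumed column names and levels)
rng = np.random.default_rng(0)
visitors = rng.integers(1000, 5000, size=99)
events = [f"Event {c}" for c in "ABCDEF"]  # six placeholder designations
df = pd.DataFrame({
    "unique_visitors": visitors,
    "gold_zone_rev": visitors * 8.0 + rng.normal(0, 2000, size=99),
    "Fireworks": np.where(np.arange(99) % 3 == 0, "Yes", "No"),
    "special_event": np.resize(events, 99),  # cycle through the six labels
})

# One panel per (Fireworks, special_event) pair: 2 rows x 6 columns
g = sns.relplot(data=df, x="unique_visitors", y="gold_zone_rev",
                col="special_event", row="Fireworks", height=2);
```

A panel is drawn for every combination, even one with no observations, so some panels may be empty in real data.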