
3.11 Visualizing Clusters

In some clustering tutorials and examples, you may encounter impressive-looking plots that depict clusters in elliptical shapes, in different colors, and with esoteric labels such as “Dim1” and “Dim2.”

These plots, which are based on Principal Components³, can be important diagnostic tools for modelers. However, they have very little expository use, as their actual meaning can often elude both presenters and audience.⁴

Simple visualizations that depict significant cluster-to-cluster differences are the best way to visually demonstrate a clustering model. Below, we will take a look at several of these. Note that there is no requirement that every visualization should include every variable, or every cluster.

After writing the cluster labels back to a column in the original dataset, we will examine that column’s datatype.

Since the clusters are initially represented by numbers, we should instruct pandas that they are in fact categorical. Otherwise, some plot types would depict the individual clusters as shades along a continuous color gradient, rather than as discrete color values. In the following step, we will also replace the current cluster values in the dataset with the cluster nicknames – this will make the plots even more interpretable for an audience.
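As a minimal sketch of the datatype check and conversion just described, consider the toy example below. The column names (`cluster`, `household_income`), the data values, and the pairing of numbers with nicknames are all assumptions made for illustration; they are not taken from the chapter's dataset.

```python
import pandas as pd

# Toy stand-in for the household dataset (illustrative values only).
df = pd.DataFrame({
    "household_income": [52000, 88000, 61000, 47000],
    "cluster": [0, 2, 1, 0],  # numeric labels, as a k-means model would return them
})

# Before conversion, pandas treats the labels as ordinary integers.
print(df["cluster"].dtype)

# Hypothetical mapping from numeric label to nickname.
nicknames = {
    0: "YOLO",
    1: "Saving for College",
    2: "Golden Age Globetrotters",
}

# Swap the numbers for nicknames, then tell pandas the column is categorical.
df["cluster"] = df["cluster"].map(nicknames).astype("category")

print(df["cluster"].dtype)
print(df["cluster"].cat.categories.tolist())
```

With the column stored as a category, plotting libraries have what they need to treat each cluster as a discrete group rather than a point on a numeric scale.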

Perhaps management is interested in comparing the levels of pet ownership from group to group. The barplot below delivers this information.
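One way such a barplot might be produced is sketched here with pandas and matplotlib. The `pets` column name and the toy values are assumptions for illustration only, and just three of the clusters are included to keep the example short.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data: one row per household (illustrative values, assumed column names).
df = pd.DataFrame({
    "cluster": ["Pets First, Pets Always", "Pets First, Pets Always",
                "Busy and Blue Collar", "Busy and Blue Collar",
                "YOLO", "YOLO"],
    "pets": [3, 4, 0, 1, 2, 2],
})

# Average number of pets per cluster, sorted so the leading cluster comes first.
avg_pets = df.groupby("cluster")["pets"].mean().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(avg_pets.index.tolist(), avg_pets.to_numpy())
ax.set_xlabel("Cluster")
ax.set_ylabel("Average number of pets")
ax.set_title("Average pet ownership by cluster")
ax.tick_params(axis="x", rotation=30)  # long nicknames read better at an angle
fig.tight_layout()
```

Calling `plt.show()` or `fig.savefig(...)` afterwards would display or save the figure.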

It probably should not surprise us that “Pets First, Pets Always” leads all clusters in average pet ownership. But there are other insights that we can take from this, too. We can see, for instance, that the “Busy and Blue Collar” group might be too busy for pets. The middle group of clusters tends to be more similar here, but it’s interesting to see that pet ownership is so much higher among “YOLO” members than among “Saving for College” members, despite the former group’s lower incomes and home square footage.

We can use visualizations to depict interrelationships among variables from cluster to cluster. Below is a scatter plot that shows average entertainment spending vs. average travel spending for all seven clusters.

First, we calculate the mean values for both of these variables for each of our clusters, storing the results in a separate data frame.
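A sketch of that aggregation step, assuming the spending columns are named `entertainment_spend` and `travel_spend` (hypothetical names) and using made-up values:

```python
import pandas as pd

# Toy household-level data (illustrative values, assumed column names).
df = pd.DataFrame({
    "cluster": ["YOLO", "YOLO",
                "Golden Age Globetrotters", "Golden Age Globetrotters"],
    "entertainment_spend": [900, 1100, 400, 600],
    "travel_spend": [1500, 2500, 7000, 9000],
})

# Mean of both spending variables for each cluster, kept in its own data frame.
cluster_means = (
    df.groupby("cluster")[["entertainment_spend", "travel_spend"]]
      .mean()
      .reset_index()  # turn the cluster label back into an ordinary column
)

print(cluster_means)
```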

Next, we use the plotting instructions below to generate the scatterplot.
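The plotting step might look something like the sketch below. The per-cluster means are invented for illustration, the column names are assumptions, and placing entertainment spending on the x-axis is simply one possible choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Per-cluster means (invented values; in practice these would come from a groupby).
cluster_means = pd.DataFrame({
    "cluster": ["YOLO", "Golden Age Globetrotters", "Busy and Blue Collar"],
    "entertainment_spend": [1000.0, 500.0, 450.0],
    "travel_spend": [2000.0, 8000.0, 1800.0],
})

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(cluster_means["entertainment_spend"], cluster_means["travel_spend"])

# Label each point with its cluster nickname so no legend is needed.
for row in cluster_means.itertuples(index=False):
    ax.annotate(
        row.cluster,
        (row.entertainment_spend, row.travel_spend),
        textcoords="offset points",
        xytext=(5, 5),
    )

ax.set_xlabel("Average entertainment spending")
ax.set_ylabel("Average travel spending")
ax.set_title("Entertainment vs. travel spending by cluster")
fig.tight_layout()
```

Because matplotlib chooses axis limits from the data, the axes in a plot like this will generally not start at zero, which is worth pointing out to an audience.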

The resulting plot helps us to spot clusters that are exceptional in these areas. Bearing in mind that our axes are not 0-based, we see a group of clusters here in the bottom left part of the graph, with relatively similar values in these areas. This plot reinforces for us that the Golden Age Globetrotters stand out for travel spending (but with an average-looking level of entertainment spending), while YOLO stands out in the opposite way, with high entertainment spending but average travel spending. Although their name suggests frugality, the Saving for College group stands out here on its own frontier.

If we wish to depict all clusters and all variables in a single plot, a heatmap is a great option. Note that with the heatmap shown below, we are using the standardized values (this way, the variables can all share a common scaling system).
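The sketch below shows one way to build such a heatmap with plain matplotlib: standardize each variable, average the standardized values within each cluster, and draw the resulting grid. The column names and values are invented for illustration, and only two variables and two clusters are used to keep the example small.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy household-level data (illustrative values, assumed column names).
df = pd.DataFrame({
    "cluster": ["YOLO", "YOLO", "Saving for College", "Saving for College"],
    "household_income": [40000, 50000, 90000, 100000],
    "pets": [2, 3, 1, 0],
})
features = ["household_income", "pets"]

# Standardize each variable (z-scores) so all columns share a common scale.
z = (df[features] - df[features].mean()) / df[features].std(ddof=0)

# Average standardized value of each variable within each cluster.
heat = z.groupby(df["cluster"]).mean()

fig, ax = plt.subplots(figsize=(6, 3))
im = ax.imshow(heat.to_numpy(), cmap="coolwarm", aspect="auto")
ax.set_xticks(range(len(features)))
ax.set_xticklabels(features)
ax.set_yticks(range(len(heat.index)))
ax.set_yticklabels(heat.index.tolist())
fig.colorbar(im, ax=ax, label="Mean standardized value")
fig.tight_layout()
```

A diverging color map such as `coolwarm` makes it easy to see which clusters sit above or below the overall average on each variable.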

Perhaps management is interested in just seeing a subset of clusters. What if they ask us to take a deeper dive into the clusters whose entertainment spending falls above the dataset mean? We can start by filtering the dataset to create a new subset only containing members of “Saving for College”, “YOLO”, and “Twentysomethings On the Move.”
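A minimal sketch of that filtering step, using toy data and an assumed `cluster` column name:

```python
import pandas as pd

# Toy data: one row per household (illustrative values, assumed column names).
df = pd.DataFrame({
    "cluster": ["YOLO", "Busy and Blue Collar", "Saving for College",
                "Twentysomethings On the Move", "Golden Age Globetrotters"],
    "entertainment_spend": [1000, 300, 900, 850, 500],
})
df["cluster"] = df["cluster"].astype("category")

# The three clusters named in the text.
keep = ["Saving for College", "YOLO", "Twentysomethings On the Move"]

# Keep only the rows belonging to those clusters.
entertainment_fans = df[df["cluster"].isin(keep)].copy()

# Drop the now-empty categories so they do not show up as blank groups in plots.
entertainment_fans["cluster"] = (
    entertainment_fans["cluster"].cat.remove_unused_categories()
)

print(entertainment_fans)
```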

Now, we can use the ‘entertainment_fans’ dataframe to analyze this new subgroup. For instance, with the boxplot shown below, in which the groups are ordered by average entertainment spending, we can compare the distribution of household income among these three segments. This shows us that the “YOLO” cluster is really living up to its name! Of the three, its members have the highest average entertainment spending but the lowest average household incomes.
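A boxplot along these lines could be sketched as follows. The values and column names are invented for illustration, and ordering the groups from highest to lowest average entertainment spending is an assumption about the direction of the sort.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy subset of three clusters (illustrative values, assumed column names).
entertainment_fans = pd.DataFrame({
    "cluster": ["YOLO"] * 3
               + ["Saving for College"] * 3
               + ["Twentysomethings On the Move"] * 3,
    "entertainment_spend": [1100, 1000, 1200, 900, 950, 1000, 800, 850, 900],
    "household_income": [38000, 42000, 45000, 95000, 105000, 110000,
                         60000, 65000, 72000],
})

# Order the clusters by average entertainment spending, highest first.
order = (
    entertainment_fans.groupby("cluster")["entertainment_spend"]
    .mean()
    .sort_values(ascending=False)
    .index.tolist()
)

# One array of household incomes per cluster, in that order.
data = [
    entertainment_fans.loc[
        entertainment_fans["cluster"] == name, "household_income"
    ].to_numpy()
    for name in order
]

fig, ax = plt.subplots(figsize=(8, 4))
ax.boxplot(data)
ax.set_xticks(range(1, len(order) + 1))  # boxes are drawn at positions 1..n
ax.set_xticklabels(order)
ax.set_ylabel("Household income")
ax.set_title("Household income for high entertainment-spending clusters")
fig.tight_layout()
```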

Earlier in this chapter, we mentioned that although categorical data would not be suitable as a k-means input, it could still be used as part of the cluster analysis process. The plot below demonstrates how this can work. By placing ‘county’ on the x-axis of this plot, with cluster counts separated by color, we can draw some interesting and valuable conclusions. We can see, for instance, that the Golden Age Globetrotters overwhelmingly tend to live in York County, as do the Saving for College members.

Meanwhile, most of our other cluster members are likely to live in Cumberland County. Sagadahoc County is far less common in the dataset, compared with either York or Cumberland. Sagadahoc contains no Golden Age Globetrotters and almost no Saving for College members.
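The county breakdown described above can be sketched with a cross-tabulation followed by a grouped bar chart. The handful of rows below are invented solely to make the example run; they do not reproduce the actual counts.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data: one row per household (illustrative values only).
df = pd.DataFrame({
    "county": ["York", "York", "Cumberland", "Cumberland",
               "Cumberland", "Sagadahoc"],
    "cluster": ["Golden Age Globetrotters", "Saving for College", "YOLO",
                "YOLO", "Busy and Blue Collar", "YOLO"],
})

# Count households for every county/cluster combination.
counts = pd.crosstab(df["county"], df["cluster"])
print(counts)

# Counties on the x-axis, one colored bar per cluster within each county.
ax = counts.plot(kind="bar", figsize=(8, 4), rot=0)
ax.set_xlabel("County")
ax.set_ylabel("Number of households")
ax.legend(title="Cluster")
ax.figure.tight_layout()
```

Note that the categorical variable here plays no role in fitting the clusters; it is only used afterwards to describe them.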

The possibilities for generating different plot types, with different subsets of clusters, are practically endless. You will probably be grateful to know, then, that we will not try to show them all here! With any clustering model that you build, we encourage you to explore the many ways of visually depicting your results.

Each of the plots depicted in this section was unique in some way. Some showed the original variables from the dataset, whereas others showed the standardized variables. The heatmap depicted all the variables and all the clusters, whereas the others depicted more limited “slices” of the data, such as particular variables or just a subset of clusters.

All that said, each of these plots has one very important thing in common: a depiction of actual variables from the dataset. You may encounter other types of cluster visualization tools that show combinations of the dataset’s features – while these can serve an exploratory, diagnostic purpose for a modeler, they are not effective as expository plots.

³ Principal Components are linear combinations of other variables.

⁴ A good rule for presentations: if you think your audience is not likely to be familiar with some concept you plan to mention, you should probably leave it out. A great rule for presentations: if you, the presenter, are unfamiliar with some concept that you plan to mention, you should definitely leave it out!