3.6 What About Categorical Variables?

Categorical variables should not be used as inputs in a k-means clustering model. They can, however, be used as inputs in some other types of models; we will say more about this in the section on hierarchical clustering.

Do not simply re-code your categorical variables with numbers, as this could lead to a nonsensical result.

Imagine that you have a dataset about streaming entertainment consumers. This dataset might contain several numeric variables that could be used for k-means modeling, such as age, income, household size, minutes of content streamed each month, and consecutive months with the service. Suppose that this dataset also contains a variable for ‘favorite genre’ that contains the following six labels: Drama, Action, Adventure, Horror, Comedy, and Documentary.

There is no effective way to use ‘genre’ in a k-means model. Assigning the values 1 through 6 to the genres listed above, or dummifying them into 0s and 1s, will not cause Python to raise an error, but it will lead to an unreliable result: k-means works by computing distances between records, and these values would not stand for actual numeric quantities.
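
To make the problem concrete, here is a minimal sketch of what an integer re-coding implies. The 1-through-6 assignment simply follows the order in which the genres are listed above, and the helper name `code_distance` is hypothetical; nothing here comes from a real dataset.

```python
# Arbitrary integer codes, assigned in the order the genres were listed.
genres = ["Drama", "Action", "Adventure", "Horror", "Comedy", "Documentary"]
codes = {genre: i for i, genre in enumerate(genres, start=1)}

def code_distance(a, b):
    """Distance between two genres along the made-up numeric axis."""
    return abs(codes[a] - codes[b])

# The coding claims Drama is five times "farther" from Documentary
# than it is from Action, purely because of the listing order.
print(code_distance("Drama", "Action"))
print(code_distance("Drama", "Documentary"))
```

Listing the genres in a different order would change these distances, even though nothing about the viewers has changed. A distance-based method such as k-means would treat the gaps as if they were meaningful.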

Categorical variables can still play an important role here, though! After we have built the clustering model, we can use a categorical variable as a grouping variable to analyze cluster-to-cluster differences and to generate visualizations that show how records from each category are distributed across the clusters. We will demonstrate this in the example used in this chapter.
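
As a rough sketch of that workflow (not the chapter's own example), the code below fits k-means on numeric columns only and then cross-tabulates the resulting cluster labels against the genre. The toy data, the column names, the choice of two clusters, and the standardization step are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; values and column names are invented.
df = pd.DataFrame({
    "age": [22, 25, 47, 52, 46, 23, 56, 27],
    "minutes_streamed": [1200, 1500, 300, 250, 400, 1350, 200, 1100],
    "favorite_genre": ["Action", "Horror", "Drama", "Documentary",
                       "Drama", "Action", "Documentary", "Comedy"],
})

# Only the numeric variables go into the model; genre is held out.
numeric_cols = ["age", "minutes_streamed"]
X = StandardScaler().fit_transform(df[numeric_cols])  # assumed preprocessing

km = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = km.fit_predict(X)

# After clustering, genre is used to describe the clusters.
genre_by_cluster = pd.crosstab(df["cluster"], df["favorite_genre"])
print(genre_by_cluster)
```

The cross-tabulation counts how many records of each genre fall into each cluster, which is one simple way to compare clusters on a variable that was never part of the model.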

In some (rare) cases, categorical data can be replaced with numeric values. For instance, if a local company is building a segmentation model on its consumers, and it includes a ‘street address’ variable, this could be converted into something like “miles from store location.”
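
One possible way to build such a variable is sketched below. It assumes each street address has already been converted to latitude and longitude by some geocoding step that is not shown, and it uses the haversine formula for straight-line distance, which will differ from driving distance. The coordinates, the names `store` and `customers`, and the helper `miles_between` are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean radius of the Earth

def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in miles; inputs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Hypothetical coordinates standing in for geocoded addresses.
store = (40.0, -75.0)
customers = {"A": (40.0, -75.0), "B": (41.0, -75.0)}

# A numeric variable that could stand in for 'street address'.
miles_from_store = {
    name: miles_between(*coords, *store) for name, coords in customers.items()
}
print(miles_from_store)
```

Unlike an arbitrary code, the resulting values are real quantities: a customer at 10 miles really is twice as far from the store as one at 5 miles.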