
3.8 Implementing k-means in Python with scikit-learn.


For this clustering implementation, we will use the dataset portland-families.csv, which can be found at lobsterland.net/datasets.   This simulated dataset contains information about thousands of families living in the Greater Portland area.2

Lobster Land wishes to build a clustering model based on this data.  

A description of each variable appears below:

householdID: an identifier variable for the households in the dataset
total_ppl: the number of individuals living in the home
own_rent: a categorical variable that indicates whether the home is owned or rented by its occupants
square_foot: indicates the number of square feet in the residence
household_income: an estimate of total annual income for the household
number_pets: the total number of pets owned by members of the household
county: a categorical variable; the counties in this dataset are Cumberland, Sagadahoc, and York
entertainment_spend_est: an estimate of the household’s annual spending on entertainment
travel_spend_est: an estimate of the household’s annual spending on travel
under_12: the number of members of the household who are less than 12 years old
LL_passholder: a categorical variable indicating whether anyone in the household is a current Lobster Land season passholder

We will start with some standard import statements, as shown below.  
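A minimal set of imports for this walkthrough might look like the following (the exact list can vary with your workflow):

```python
# Core imports: pandas/numpy for data handling, matplotlib for plotting,
# and scikit-learn for scaling and k-means clustering
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
```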

Next, we will bring the dataset into our environment, and take a peek at its first five rows:
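A sketch of this step is shown below; the DataFrame name ‘portland’ is just an illustrative choice, and the code assumes portland-families.csv has been saved to the working directory:

```python
# Load the dataset into a pandas DataFrame
portland = pd.read_csv('portland-families.csv')

# Peek at the first five rows
portland.head()
```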

Based on the dataset description, and on the rows shown here, we can see that we have the following numeric columns:  ‘total_ppl’, ‘square_foot’, ‘household_income’, ‘number_pets’, ‘entertainment_spend_est’, ‘travel_spend_est’, and ‘under_12’.  The rest of the columns are categorical, including ‘householdID’, whose values look numeric but do not represent truly numeric quantities.

Before going any further, we stop to analyze the results of the describe() function.  Viewing these summary statistics helps us establish a baseline regarding these variables.  It can also be a chance for the modeler to check for anything unusual, such as outliers or impossible values.
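One way to generate those summary statistics, again using the hypothetical ‘portland’ DataFrame:

```python
# Summary statistics for the numeric columns -- a quick scan for outliers
# or impossible values (e.g. negative square footage)
portland.describe()
```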

Checking for missing values is also an important step to take at this point.  If there are NaNs in any of the variables that we wish to use, scikit-learn will raise an error when we try to fit the model.
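A quick per-column count of missing values can be obtained like this:

```python
# Count the NaN values in each column; all zeros means no missing data
portland.isna().sum()
```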

Thankfully, we are NaN-free!  

To prepare for the modeling step, we will break the numeric variables out as a separate data frame (note: this is not required – alternatively, we could just identify the columns that we wish to include during the modeling step).
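A sketch of that step appears below; the names ‘numeric_cols’ and ‘port_num’ are illustrative:

```python
# Pull the numeric variables into their own DataFrame for modeling
numeric_cols = ['total_ppl', 'square_foot', 'household_income', 'number_pets',
                'entertainment_spend_est', 'travel_spend_est', 'under_12']
port_num = portland[numeric_cols]
```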

Next, we will standardize these values.  As noted above, standardization is not always a required processing step for building a clustering model.  Here, however, we are working with variables whose scales, ranges, and units differ considerably.
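One way to standardize the numeric variables, using scikit-learn’s StandardScaler and producing the ‘port_standard’ data frame referenced below:

```python
# Standardize each numeric column to mean 0 and standard deviation 1
scaler = StandardScaler()
port_standard = pd.DataFrame(scaler.fit_transform(port_num),
                             columns=port_num.columns)
```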

Now that the data have been standardized, note what happens to the mean and standard deviation of each column.  All the means have become 0, and all the standard deviations have become 1.  Each value in ‘port_standard’ is a z-score, rather than an original value from the dataset.  Each z-score in the data frame represents that variable’s distance from the mean, measured in standard deviations.
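These properties can be confirmed directly, and a call to head() shows the first five standardized rows discussed next:

```python
# Confirm that each standardized column has mean ~0 and standard deviation ~1
print(port_standard.describe().round(2))

# View the first five rows of z-scores
port_standard.head()
```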

Looking at the first five rows of the standardized data, we can see a blend of positive and negative numbers. This is expected, as any values below their column mean become negative as a result of the transformation to z-scores.

Next, we will explore the elbow plot, a common tool used to help modelers answer the “How many clusters?” question.  On an elbow plot, the values along the x-axis represent distinct k values, or numbers of clusters to use in a k-means model.  On the y-axis, we have ‘inertia’, also known as the sum of squared errors.  This is the sum of squared differences of all records from their clusters’ centroid values, for all variables in the model.  
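Written compactly, if record $x_i$ is assigned to the cluster whose centroid is $\mu_{c(i)}$, then

$$\text{inertia} = \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^{2}$$

which is the quantity scikit-learn reports as a fitted model’s inertia_ attribute.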

At k=1, the inertia value will always be highest.  This is because k=1 really represents a “non-model” in which all the records are simply lumped together into a single segment.  Unsurprisingly, inertia falls dramatically between k=1 and k=2, since k=2 represents the first meaningful separation of the records into different groups.  As k-values increase, inertia continues to decline, but in a less-dramatic way.  Stated informally, the “bang for the buck” that comes with each additional cluster tends to decrease as the k-values go higher.  Note that in the plot below, the k values only range from 1 to 9 – but the modeler can adjust this by changing the second parameter in the range() function.  

Note that if our only goal were to simply reduce SSE, we would build a k-means model with n clusters for the n observations in our dataset.  If that idea sounds a bit silly to you, then you are in good company – we think so too!  Putting every record into its own segment would defeat the purpose of clustering, which naturally involves tradeoffs.

As you view the elbow plot, always bear in mind that it does not provide an “answer” for your model – most often, it only provides a starting point for the modeler to consider.  If a sharp bend, or “elbow,” appears in such a plot at k=5, a modeler will often experiment with several nearby k-values before making a decision for the model.

The data used to build the elbow plot should be the same as the data used to build the model – note that we are passing ‘port_standard’ into the kmeans.fit() function below.
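A sketch of the loop that builds such a plot is shown below; the list name ‘inertia_vals’ is illustrative, and range(1, 10) produces the k values 1 through 9 mentioned above:

```python
# Fit a k-means model for each candidate k and record its inertia (SSE)
inertia_vals = []
k_range = range(1, 10)   # increase the second argument to test more clusters

for k in k_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(port_standard)
    inertia_vals.append(kmeans.inertia_)

# Plot inertia against k to look for an "elbow"
plt.plot(k_range, inertia_vals, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (SSE)')
plt.title('Elbow Plot')
plt.show()
```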

Do not expect elbow plots to always create a sharp, easy-to-spot “elbow.”  Especially when you are working with well-balanced, standardized data, you are far likelier to see a less-dramatic dropoff in SSE as you move from k-value to k-value along the x-axis.


2 Note that the dataset makes no claim regarding the representativeness of Greater Portland households in general.  You might notice, for instance, that the households included here have a greater number of pets, on average, than would be expected from a general population sample, or that they have higher incomes.