
14.3 Collaborative Filtering


Collaborative filtering systems help power many recommendation engines in the real world. They rest on some relatively simple ideas: people who are similar to one another are likely to prefer similar content, and when products A and B are frequently co-purchased, someone buying A is likely to be interested in B, and vice versa. The former approach is known as user-based collaborative filtering, whereas the latter is item-based collaborative filtering.

We can characterize a collaborative filtering system as being “content agnostic.” To make recommendations, the system does not actually need to know anything about the items themselves; all that really matters is the past data suggesting that the items are complementary.

Here, we will explore two ways to perform collaborative filtering with the ride_ratings dataset.  Before we go any further, though, we’ll address the issue of missingness.  As noted previously in the chapter, there are several rides for which we have NaN values.  A table of such values is shown below.

We only have 45 rows here, so simply removing every row that contains a NaN would cost us a fairly sizable chunk of our data. Alternatively, we could impute with a central value (such as the mean or median) for each ride, or for each rider, but let’s instead use a process that takes fuller advantage of the data we do have. For this imputation, we will replace each of a rider’s NaN values with that rider’s actual rating for the most highly correlated ride. For example, the correlation heatmap below shows that the ride most correlated with the Lobster Claw is the Dropkicker. Therefore, for each rider whose Lobster Claw rating is missing, we will replace that NaN with the rider’s actual rating for the Dropkicker.
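The imputation described above might be sketched in pandas as follows. The ratings here are invented stand-ins (the real ride_ratings table has 45 riders and 12 rides), and the column names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ride_ratings: riders as rows, rides as columns
ride_ratings = pd.DataFrame({
    "Lobster Claw": [5, np.nan, 4, 2],
    "Dropkicker":   [5, 3, 4, 1],
    "Sky Chairs":   [1, 4, 2, 5],
})

# Pairwise correlations between rides (NaNs are excluded pairwise)
corr = ride_ratings.corr()

# For each ride, fill missing ratings with the rider's rating
# on that ride's most highly correlated other ride
filled = ride_ratings.copy()
for ride in ride_ratings.columns:
    best_match = corr[ride].drop(ride).idxmax()
    filled[ride] = filled[ride].fillna(ride_ratings[best_match])
```

In this toy frame, the Dropkicker is the ride most correlated with the Lobster Claw, so the second rider’s missing Lobster Claw rating is replaced by their Dropkicker rating.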

Now that our data contains no NaN values, we can calculate the cosine similarity between each pair of riders, storing the result as user_matrix. Although not fully visible in the screenshot below, the user matrix measures 45 rows by 45 columns: one row and one column per rider.
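A minimal sketch of this step, computing the pairwise cosine similarities directly with NumPy (the ratings shown are invented, so the matrix here is 4 x 4 rather than 45 x 45):

```python
import numpy as np
import pandas as pd

# Hypothetical imputed ratings: one row per rider, one column per ride
ratings = pd.DataFrame(
    [[5, 5, 1], [3, 3, 4], [4, 4, 2], [2, 1, 5]],
    columns=["Lobster Claw", "Dropkicker", "Sky Chairs"],
)

# Normalize each rider's row to unit length, then take dot products:
# the result is an n_riders x n_riders cosine similarity matrix
X = ratings.to_numpy(dtype=float)
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
user_matrix = pd.DataFrame(unit @ unit.T,
                           index=ratings.index, columns=ratings.index)
```

The matrix is symmetric, and each rider’s similarity to themselves is 1.0 along the diagonal.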

With user_matrix in place, we can determine which riders are most similar to any particular rider. Suppose the park launches a new ride next season and offers the rider in position 15 of this matrix (rider #16 in the original, 1-based dataset) a chance to test it out before the official opening date of the season.
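Finding a rider’s closest neighbors amounts to sorting that rider’s row of the similarity matrix and dropping the self-similarity of 1.0. A sketch using an invented 5 x 5 matrix:

```python
import pandas as pd

# Hypothetical user similarity matrix for five riders
# (rows and columns are rider positions; the diagonal is 1.0)
user_matrix = pd.DataFrame(
    [[1.00, 0.90, 0.20, 0.75, 0.10],
     [0.90, 1.00, 0.30, 0.80, 0.15],
     [0.20, 0.30, 1.00, 0.25, 0.95],
     [0.75, 0.80, 0.25, 1.00, 0.05],
     [0.10, 0.15, 0.95, 0.05, 1.00]],
)

# Three riders most similar to the rider in position 1
target = 1
top = user_matrix[target].drop(target).sort_values(ascending=False).head(3)
```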

If this rider really loves the new attraction and gives it a high rating, then it would make sense for Lobster Land to recommend the new ride to riders 4, 14, 18, 7, 31, and 11 (this rider’s closest neighbors in the user matrix), too.

Next, we will build a slightly different type of matrix. This time, instead of finding the most similar riders, we will find the most similar rides. The ‘T’ notation in the code below transposes the original data, so that the rides, rather than the riders, form the index.
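In code, the only change from the user-based version is the transpose. A sketch, again with invented ratings:

```python
import numpy as np
import pandas as pd

# Hypothetical imputed ratings: riders as rows, rides as columns
ratings = pd.DataFrame(
    [[5, 5, 1], [3, 3, 4], [4, 4, 2], [2, 1, 5]],
    columns=["Lobster Claw", "Dropkicker", "Sky Chairs"],
)

# .T transposes the frame so that each ride becomes a row; the
# cosine similarities now compare rides rather than riders
items = ratings.T
X = items.to_numpy(dtype=float)
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
item_matrix = pd.DataFrame(unit @ unit.T,
                           index=items.index, columns=items.index)
```

With these toy numbers, the Lobster Claw comes out far more similar to the Dropkicker than to the Sky Chairs, since riders rate the first two rides in near lockstep.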

The item matrix can be seen below. For each of the 12 rides included here, we can now see which rides are most, and least, similar.

Again, we can put these results to practical use. Suppose someone visits Lobster Land and has a great time on the Matterhorn. If he tells a park staffer that he loved the Matterhorn and is looking for a follow-up ride recommendation, the staffer can suggest the Ferris Bueller, the Sky Chairs, and the Twisty Slide as the three best options.
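That lookup is just a sorted slice of the item matrix. A sketch with invented similarity values (the real values would come from the 12 x 12 matrix):

```python
import pandas as pd

# Hypothetical item similarity matrix for four rides
rides = ["Matterhorn", "Ferris Bueller", "Sky Chairs", "Twisty Slide"]
item_matrix = pd.DataFrame(
    [[1.00, 0.92, 0.88, 0.85],
     [0.92, 1.00, 0.70, 0.60],
     [0.88, 0.70, 1.00, 0.55],
     [0.85, 0.60, 0.55, 1.00]],
    index=rides, columns=rides,
)

# Three best follow-up recommendations after the Matterhorn
recs = (item_matrix["Matterhorn"]
        .drop("Matterhorn")
        .sort_values(ascending=False)
        .head(3))
```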

As this example illustrates, item-based collaborative filtering is often more practical than user-based collaborative filtering. The user-based system shown above applies only to the riders whose ratings data is included here. The item-based system, on the other hand, has far broader applicability: it can generate a recommendation for any park visitor who has gone on, or expressed interest in, any particular ride.