Select Page

14.4 Content-Based Filtering


Suppose that tomorrow, someone launched a brand-new entertainment streaming service.  At the moment of launch, the service would not have collected any user feedback yet; therefore, it would not be able to produce any collaborative-based recommendations, whether user-based or item-based.  This phenomenon is sometimes referred to as the “cold start” problem.  A way to address this problem is to instead use a content-based system.  In a content-based filtering system the inherent characteristics of the items themselves form the basis for the recommendations.  
In the dataset ride_descriptions (available at lobsterland.net/datasets), we have information about 12 different rides at Lobster Land.  Seven categorical variables are included here, and they tell us whether the ride goes above 20m in height, whether it includes a sudden drop, whether riders go ‘upside down’ at any point during the ride, whether riders get wet during the ride, whether the ride includes a height requirement, whether the ride’s speed exceeds 40 miles per hour, and whether it includes sharp “whip” type of turn. 

We will set the ride_name variable as the dataset index, and then re-examine the dataframe.  

Now that all of our columns are the variables that describe the rides, the data is set up the right way for us to perform distance calculations.  Since we are working with categorical data, a good metric for us to use is Jaccard similarity.  

The Jaccard similarity between any two records, A and B, is found by taking the intersection of records A and B (how many categories do they have in common?) and dividing by the union of A and B (how many unique total categories are represented by either A or B?)  

The Jaccard similarity calculation ignores categories that are not present in either A or B, which can be especially helpful when there are a huge number of variables that could be involved.  For instance, let’s say there are two friends who both collect sneakers, and we wish to compare their collections using a similarity metric.  There may be thousands of possible types of collectible sneakers, so if we simply took the number that the two friends held in common, and divided by the total number of sneakers “out there” we would just get an infinitesimal value, even if the two collections were fairly similar.  By constraining the denominator to just include the number of sneakers that at least one of the two friends owns, we get a much more meaningful result.  

Expressed in terms of set notation, we can write the Jaccard similarity as follows:

For the 12 rides included in this dataset, we can arrive at a matrix of Jaccard similarity values as follows:

Since the pdist() function first gives us jaccard distances (1-similarity), we need to convert those to similarity values before viewing and comparing them.  

In the dataframe below, we can see the Jaccard similarity values for each ride.  

Once we have the Jaccard similarity values, we can sort the dataframe based on any column that we choose to use.  As we can see from the table below, Aqualoop, Dropkicker, and SaltPepper are the most similar rides to the Lobster Claw, based on this metric.  These would all be good recommendations for fans of the Lobster Claw who seek similar types of thrills.