4.14 Building the Ferris Bueller
Lobster Land has an existing ferris wheel within its park, but company management is considering adding a second ferris wheel. This ride would be called the “Ferris Bueller.” To conduct market research, Lobster Land surveyed thousands of adults living near Portland, Maine. Each survey respondent was shown between 6 and 12 separate “bundles” of ferris wheel options, and asked to rate the bundles from 1 to 10. Collectively, those ratings were averaged together by Lobster Land into the “rating” variable in the ferris_bueller.csv dataset.
| paxpercar | Here, “pax” is an abbreviation for passenger. The options include 2, 3, or 6 passengers per car. |
| height | 100, 200, or 300 feet, as measured from the ground to the top of the wheel. |
| opentop | Whether each ferris wheel “car” should be completely covered, or wide open at the top. |
| totaltime | Length of the ride, measured in seconds, from boarding to debarkation. Options are 80, 240, or 420 seconds. |
| sway | Should the cars be able to “sway” if rocked by the users? |
| color | Options are: Green, White, Red, and Purple. |
| rating | The average rating given to the bundle by all surveyed consumers. |
In the dataset, each observation represents one unique bundle.

The dataset has exactly 432 rows. This makes sense because we have three options for paxpercar, three for height, two for opentop, three for totaltime, two for sway, and four for color, and 3 × 3 × 2 × 3 × 2 × 4 = 432.
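This count can be verified by rebuilding the full-factorial grid of options. The sketch below is illustrative; the column names and option levels follow the descriptions above, not the actual CSV file.

```python
from itertools import product

import pandas as pd

# Attribute levels as described in the survey design above.
levels = {
    "paxpercar": [2, 3, 6],
    "height": [100, 200, 300],
    "opentop": ["Y", "N"],
    "totaltime": [80, 240, 420],
    "sway": ["Yes", "No"],
    "color": ["Green", "White", "Red", "Purple"],
}

# One row per unique combination of levels = one row per bundle.
grid = pd.DataFrame(list(product(*levels.values())), columns=list(levels))
print(len(grid))  # 3 * 3 * 2 * 3 * 2 * 4 = 432
```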
In the next step, we will use the get_dummies() function from pandas to convert categorical variables into the 0-or-1 format that can be understood by a linear regression model.

Notice that we are dummifying all of the input variables, including the ones based on numeric values. There are two important reasons for this. First, because this is survey data, we don't want to imply to the model that the numeric input values exist along a continuous range; instead, each was a discrete option presented to the respondents. Second, dummifying the inputs into separate choices helps us in the event of a non-linearity in the numeric values. In other words, if a middle option were the true preference of the respondents, we could identify this with dummified inputs, which would not be possible if we simply treated the values as a continuous range. Moreover, the objective of conjoint analysis is to determine which product feature levels consumers prefer, e.g., do consumers prefer rides that are 300 feet above the ground or 100? If we do not dummify these variables, we will not know which level consumers prefer, since a numeric variable is assigned just one coefficient in a linear regression model.
We do not dummify rating, since it is our outcome variable.
Notice also that the variable names must be specified in the ‘columns’ parameter; by default, get_dummies() only converts text-based columns, so the numeric inputs would otherwise be left as-is.
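A sketch of this step is shown below. The DataFrame name `ferris` and the handful of rows in it are illustrative placeholders, not the actual survey data, but the get_dummies() call itself matches the approach described above.

```python
import pandas as pd

# Illustrative stand-in for the survey data; in practice this would come
# from pd.read_csv('ferris_bueller.csv'). Ratings here are made up.
ferris = pd.DataFrame({
    "paxpercar": [2, 3, 6],
    "height": [100, 200, 300],
    "opentop": ["Y", "N", "Y"],
    "totaltime": [80, 240, 420],
    "sway": ["Yes", "No", "Yes"],
    "color": ["Green", "White", "Red"],
    "rating": [5.1, 6.3, 4.8],
})

# Dummify every input variable, including the numeric ones; leave rating alone.
dummies = pd.get_dummies(
    ferris,
    columns=["paxpercar", "height", "opentop", "totaltime", "sway", "color"],
    drop_first=True,
)
print(list(dummies.columns))
```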
After running the code, it is apparent that our dummified columns are missing certain levels. For instance, we have a column ‘sway_Yes’, but ‘sway_No’ is missing. Similarly, ‘paxpercar_3’ and ‘paxpercar_6’ are present, but ‘paxpercar_2’ is missing. That is because the drop_first=True argument caused pandas to drop one category from each variable to prevent multicollinearity. By default, the first category in alphabetical (or numeric) order is dropped.
The following steps fit the input variables, shown below in the ‘X’ dataframe, to the outcome variable, shown below as ‘y’:

Alternatively, you can select your own reference level if you feel it makes the interpretation easier. Omitting ‘drop_first=True’ keeps a dummy column for every level: note that ‘paxpercar_2’, which had previously been excluded, is now present, along with ‘height_100’, ‘opentop_Y’, and so on. You can then drop one column of your choosing from each attribute to serve as its reference level.
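A sketch of this alternative is shown below. The toy rows, the made-up ratings, and the particular reference columns dropped are all illustrative choices.

```python
import pandas as pd

# Illustrative stand-in for the survey data (ratings are made up).
ferris = pd.DataFrame({
    "paxpercar": [2, 3, 6, 2],
    "height": [100, 200, 300, 100],
    "opentop": ["Y", "N", "Y", "N"],
    "totaltime": [80, 240, 420, 80],
    "sway": ["Yes", "No", "Yes", "No"],
    "color": ["Green", "White", "Red", "Purple"],
    "rating": [5.1, 6.3, 4.8, 7.2],
})

# Without drop_first=True, every level gets its own column.
all_dummies = pd.get_dummies(
    ferris,
    columns=["paxpercar", "height", "opentop", "totaltime", "sway", "color"],
)

# Choose your own reference levels by dropping one column per attribute.
custom = all_dummies.drop(columns=[
    "paxpercar_6", "height_300", "opentop_N",
    "totaltime_420", "sway_No", "color_Purple",
])
print(list(custom.columns))
```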

As we’ll see in the next section, it does not matter which category ends up being the reference level – the results will be the same.