8.5 Bringing it all together: implementing logistic regression in Python


The ‘nyc_historical’ dataset includes information about Greater New York City (NYC) households that held Lobster Land family season passes in a particular season.  This dataset’s response variable, renew, indicates whether the family renewed for the following year.  Our goal here is to build a model that can predict whether a household will renew its pass.

Step #1:  Importing the Data, Checking for Missingness

After importing the dataset into our environment, we will perform a quick check to see if we have missing values here.
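This check can be sketched as follows. The filename is an assumption, and the snippet builds a tiny stand-in frame so it runs on its own; in practice you would run isna() on the full imported dataset.

```python
import pandas as pd

# In practice, load the real dataset (path is an assumption):
# df = pd.read_csv("nyc_historical.csv")

# Small stand-in frame so this snippet runs on its own
df = pd.DataFrame({
    "household_ID": [101, 102, 103],
    "total_visits": [4, 7, 2],
    "renew": [1, 1, 0],
})

# Count missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)
```

A result of all zeros, as here, means no imputation or row removal is needed.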

With no values missing from this dataset, we will proceed to the next step.  

Step #2: Checking for class imbalance

We must ensure that the dataset is not overwhelmingly dominated by either renewing or non-renewing customers, so that our model has a good chance of learning from both customer types.
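A quick way to inspect the class balance is value_counts() on the response column. Here a small stand-in series plays the role of df['renew']:

```python
import pandas as pd

# Stand-in response column; in practice this is df["renew"]
renew = pd.Series([1, 0, 1, 1, 0, 1, 0, 1, 0, 1], name="renew")

# Absolute counts and proportions of each class
counts = renew.value_counts()
proportions = renew.value_counts(normalize=True)
print(counts)
print(proportions)
```

If one class held, say, 95 percent of the records, we would want to address that imbalance before modeling.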

Step #3: Partitioning the Data

Next, we will perform a data partition.  We will randomly select 60 percent of the records to use as a training set, for building a model, and leave the other 40 percent as a test set for assessing the model.  We will make this partition before we “peek” at any of the data.  

The ‘random_state’ parameter must contain an integer so that the training set and test set will return the same results each time the code is run. If ‘random_state=None’ is used, or the parameter is not included at all, other people will be unable to replicate your results.

We do not pass ‘household_ID’ to this model, as it’s simply a unique identifier for each observation in the dataset.  It has no mathematical meaning or other value that would help us to predict renewal (it is very much akin to a university ID, or a phone number, in that it is numeric-looking, unique, and arbitrarily assigned).
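The partition step might look like the following sketch, using scikit-learn’s train_test_split on a small stand-in frame (the column names follow the chapter’s dataset; the values are fabricated for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame; in practice this is the imported nyc_historical data
df = pd.DataFrame({
    "household_ID": range(100),   # arbitrary identifier, excluded below
    "total_visits": range(100),
    "renew": [0, 1] * 50,
})

# Drop the identifier before modeling; separate inputs from the response
X = df.drop(columns=["household_ID", "renew"])
y = df["renew"]

# 60/40 split; a fixed random_state makes the partition reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)
print(X_train.shape, X_test.shape)
```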

The data partition leaves us with two dataframes (X_train and X_test), as well as two series (y_train and y_test).

Step #4:  Exploring the Training Set

Here is where we will start to build our familiarity with the training set.  We will use the describe() function here to look for outliers, and to see the distribution of the values. We can tell that:

  • For our numeric variables, the mean and median values are close
  • Some of the high maximum values may influence the outcome of the model as logistic regression models are sensitive to outliers. You could build a box plot to identify the rows containing these values. Afterwards, you could compare the outcome of two models: one retaining the outliers, another excluding them.
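A minimal sketch of this exploration, using fabricated numbers that include one unusually high value:

```python
import pandas as pd

# Stand-in numeric training data with one high outlier in total_visits
X_train = pd.DataFrame({
    "total_visits": [2, 3, 4, 4, 5, 21],
    "avgrides_perperson": [3.0, 4.5, 5.0, 5.5, 6.0, 6.5],
})

# Summary statistics: compare the mean against the 50% row (the median),
# and scan the max row for suspiciously high values
summary = X_train.describe()
print(summary)

# A box plot makes high outliers easy to spot (requires matplotlib):
# X_train.boxplot(column="total_visits")
```

Note how the single value of 21 pulls the mean of total_visits well above its median, the pattern that warrants the outlier comparison described above.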

Step #5:  Examining the correlations among potential inputs

We check our numerical independent variables for multicollinearity. If we do not remove features that are closely related to one another, we could obtain coefficients that do not make sense.
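One way to run this check is a pairwise correlation matrix over the candidate inputs; the sketch below uses fabricated values in which two inputs move together perfectly:

```python
import pandas as pd

# Stand-in numeric inputs; avgrides_perperson moves in lockstep
# with total_visits to illustrate a multicollinearity flag
X_train = pd.DataFrame({
    "total_visits": [2, 3, 4, 5, 6, 7],
    "avgrides_perperson": [3, 4, 5, 6, 7, 8],
    "avggoldzone_perperson": [40, 10, 35, 20, 15, 30],
})

# Pairwise Pearson correlations among candidate inputs
corr_matrix = X_train.corr()
print(corr_matrix.round(2))
```

Pairs with correlations near +1 or -1 are candidates for removal, since keeping both can produce coefficients that do not make sense.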

Step #6:  Dummifying to go from text to numbers

We dummify ‘homestate’, the only categorical variable in our dataset not already expressed in 0-or-1 fashion. Some of our categorical variables are already coded with 1s and 0s for binary classification (‘own_car’, ‘FB_Like’, ‘goldzone_playersclub’), and therefore do not need any further processing.  We specify drop_first=True inside the function so that we do not create a linear dependency among our input columns.  By default, ‘CT’ becomes the reference level for ‘homestate’ in this model, since its dummy column is the one dropped.

Next, we perform the same operation on our test set.  In order to compare model performance later on, we need these datasets to be formatted similarly; if we change the original variables in one, we must also do the same thing to the other.
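The dummification of both sets can be sketched with pd.get_dummies; the state values here are fabricated stand-ins:

```python
import pandas as pd

# Stand-in frames containing the categorical column
X_train = pd.DataFrame({"homestate": ["CT", "NJ", "NY", "NY"]})
X_test = pd.DataFrame({"homestate": ["NJ", "CT", "NY"]})

# drop_first=True drops the alphabetically first level (CT here),
# which then serves as the reference level
X_train = pd.get_dummies(X_train, columns=["homestate"], drop_first=True)
X_test = pd.get_dummies(X_test, columns=["homestate"], drop_first=True)

# The two sets must end up with matching columns; if the test set were
# missing a level, the frames would need to be realigned
print(X_train.columns.tolist())
```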

Step #7:   First iteration of the model

The model building continues with our training data. First, we import several modules. Note that some modelers prefer to place all of their import statements together, at the top of their script – that type of organization can make it easier for someone to check whether a particular import statement has been made.

As modelers, we can decide which features to explore for determining a customer’s likelihood of renewal.  Here, we will start by passing in all of our potential independent variables.  

Then we add an intercept to the model using ‘sm.add_constant’ before fitting the model.  When building logistic regression models, always include the constant.  For a model with a constant term, the probability of “1” class membership can still be determined when all of the input variables are set to 0.  If the model is fit without a constant term, the model will always predict ‘0’ when the input values are 0, which may not accurately reflect the given situation.

Step #8:   Interpreting the model summary

The model summary tells us that:

  • The model’s classification power is statistically significant, as the LLR p-value is much lower than 0.05.  The LLR p-value is the result of a likelihood-ratio test comparing a null model (a theoretical model using no predictors at all) to the results obtained from this model.  

  • Some continuous numeric features have p-values much higher than 0.05, indicating they do not play strong roles in predicting a family’s season pass renewal.

  • For categorical variables, p-value interpretation is a bit trickier – for each level of a factor, the p-value is telling us about the significance of that level’s impact, relative to the reference level.

  • When deciding whether to keep or remove a categorical variable, some judgment on the part of the modeler may be involved.  Keep in mind that the variable should either be retained, or dropped, on an “all-or-nothing” basis – in other words, do not retain particular levels while removing others. In addition, uncertainty about whether to include some variable can sometimes be resolved by comparing model iterations – are the overall model metrics, such as the LLR p-value, better or worse when that variable is included?

It can also sometimes be helpful to separately build a model using only the variable in question as your input, and your original response variable as the outcome.  If that model shows an insignificant result, then the variable can be dropped.  

Step #9:   Running another model iteration and assessing its results

For the next step, we refit the model with a different combination of inputs.  We remove ‘avgmerch_perperson’ and ‘avgfood_perperson’, as they are numeric and have high p-values.  After some iteration involving sets of inputs, we have decided to drop ‘FB_Like’, as well as ‘homestate’.
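Dropping those inputs before refitting can be sketched as follows (the frame is a stand-in holding only the columns named above):

```python
import pandas as pd

# Stand-in frame with the columns named in the text
X_train = pd.DataFrame({
    "total_visits": [3, 5, 7, 2],
    "avgmerch_perperson": [10.0, 12.0, 9.0, 11.0],
    "avgfood_perperson": [20.0, 18.0, 22.0, 19.0],
    "FB_Like": [1, 0, 1, 0],
})

# Remove the inputs flagged in this iteration; in the full dataset,
# the ‘homestate_*’ dummy columns would be dropped here as well
to_drop = ["avgmerch_perperson", "avgfood_perperson", "FB_Like"]
X_train_v2 = X_train.drop(columns=to_drop)
print(X_train_v2.columns.tolist())
```

The reduced frame is then passed through sm.add_constant and refit exactly as before.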

After switching to this second model version, the LLR p-value has declined from 6.021e-52 to 1.201e-54.  Limiting our inputs to the features that significantly impact whether a household will renew its season pass has led to a better, more compact model.  

Step #10:  Interpreting the coefficients

In linear regression, we learned that a coefficient represents the average change in Y associated with a one-unit increase in X.

With logistic regression, a variable’s coefficient shows us how a one-unit change in that input impacts the log-odds of the record belonging to the “1” outcome class.  
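A log-odds coefficient can be made more tangible by exponentiating it into an odds ratio; this sketch uses the own_car coefficient reported in the model summary:

```python
import numpy as np

# Exponentiating a log-odds coefficient yields an odds ratio.
# 0.9649 is the own_car coefficient from the model summary.
coef_own_car = 0.9649
odds_ratio = np.exp(coef_own_car)
print(round(odds_ratio, 3))
```

An odds ratio above 1 means the input raises the odds of the "1" outcome; here, car-owning households have roughly 2.6 times the odds of renewal of otherwise-identical non-owners.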

The summary above tells us that:

  • Having a car increases the log odds of a season pass renewal by 0.9649 compared to not having a car, holding all other variables constant.

  • Being a goldzone member increases the log odds of a season pass renewal by 0.5047 compared to a non-goldzone member, holding all other variables constant.

  • For each additional visit that a passholding family makes to Lobster Land during a particular season, the log-odds of renewal increase by 0.1286, holding other variables constant.

  • Incrementing a household’s average rides per person metric by one ride increases their log odds of renewal by 0.0578, holding other variables constant.

  • Incrementing a household’s average per-visit Gold Zone spending by one dollar increases their log odds of renewal by 0.0032, holding other variables constant.

  • At this point it is worth doing a ‘sniff test’ to see if these findings make sense. It is logical that owning a car would increase the chances of a family renewing their Lobster Land season pass, since a car makes it more convenient to travel to the park. If the primary passholder is a member of the Gold Zone, that indicates the person or the family is a die-hard fan of the Gold Zone’s games, and by extension Lobster Land; therefore, there is a good chance the household will renew its season pass. The number of visits a household makes to the park shows us their overall level of engagement – when a family keeps coming back, it makes sense that they would want to renew for the following season, too.

It should also be noted that the feature coefficients cannot be directly compared, since they are measured in different units.  

It would be tempting – but wrong – to say that Gold Zone spending does not “really” matter here, because its coefficient is just 0.0032.  Indeed, that figure is much smaller than the coefficient for visits, which is 0.1286.  However, ‘avggoldzone_perperson‘ is measured in dollars, whereas visits are incremented on a discrete count basis.  As any Gold Zone fan knows, it is easy to rack up many “units” of Gold Zone spending in a single trip.  In our training set, the 75th percentile of Gold Zone spending exceeds $116.60 on average, per visit.  In our model, that means this variable’s contribution to the log-odds for such visitors would be 116.60 × 0.0032 = 0.37312.

If we changed our currency units to cents instead of dollars, the spending variables’ contributions to model significance would remain unchanged, but their coefficients would become 1/100th as large, since each unit would now represent only 1/100th as much spending.

If ‘avggoldzone_perperson’ were priced in British pounds, the coefficient would be greater. How much greater?  We can know this if we know the exchange rate.  Let’s assume that one British pound is equivalent to 1.2 US dollars, or that 1 US dollar is equivalent to about 0.833 British pounds.  We can convert our spending variable as follows:
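A sketch of the conversion, using the assumed 1.2 exchange rate and fabricated spending values:

```python
import pandas as pd

USD_PER_GBP = 1.2  # assumed exchange rate from the text

# Stand-in column of per-person Gold Zone spending, in US dollars
spend_usd = pd.Series([50.0, 116.60, 80.0], name="avggoldzone_perperson")

# Each pound is worth 1.2 dollars, so the same spending becomes a
# smaller number of (larger) units
spend_gbp = spend_usd / USD_PER_GBP
print(spend_gbp.round(2).tolist())

# Because each unit is 1.2 times larger, the coefficient scales up by 1.2
coef_usd = 0.0032
coef_gbp = coef_usd * USD_PER_GBP
print(round(coef_gbp, 4))
```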

Then, after fitting another version of the model, we find that Gold Zone spending’s coefficient is 1.2 times greater in this version, since one “unit” of this variable is now 1.2 times more impactful (note that all of these coefficients are rounded to four decimal places).  Conversely, if we had converted US dollars to a currency in which one individual spending unit was less valuable, we would see a correspondingly smaller coefficient.

If all of our input variables were measured in the same units, it would be possible to determine the effects of the predictor variables on the outcome just by looking at the coefficient scores. Alternatively, the modeler could standardize each of the input variables in order to place them onto a common scale.9  Doing so would introduce an additional layer of complexity, though, since the coefficients would then be displayed in terms of z-scores – this would make the model results much tougher to explain to a general audience.


9 Choueiry, G. (n.d.) ‘Interpret the logistic regression intercept’. Quantifyinghealth.com. https://quantifyinghealth.com/interpret-logistic-regression-intercept/