9.1 Building and Assessing a Single Tree Model
The tree diagram shown and described above was based on the nyc_historical dataset, which you also saw in Chapter 8. The code snippet below shows several import statements. Afterwards, the dataset is brought into our environment, and the first five rows are viewed.
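Such a setup might look like the following sketch. The file name nyc_historical.csv is an assumption here; adjust the path to match your own copy of the data.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score

# Bring the dataset into our environment (file name assumed here)
nyc_historical = pd.read_csv('nyc_historical.csv')

# View the first five rows
nyc_historical.head()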

Since ‘homestate’ is not currently coded in a 0-or-1 format, it must be dummified. Note that because tree models are not linear, the multicollinearity concern that forces linear models to drop one level does not apply, so we are not required to drop a level when dummifying. Instead, we use full-rank dummification, generating a separate variable for each of CT, NJ, and NY.
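One way to do this with pandas is get_dummies(), leaving the drop_first argument at its default of False:

# Generate one 0/1 column per level (CT, NJ, NY); no level is dropped
nyc_historical = pd.get_dummies(nyc_historical, columns=['homestate'],
                                drop_first=False)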

Next, we will partition our data into training and test sets, with 60 percent of the rows sent to the training set (which will be used to build the model) and the other 40 percent of the rows assigned to the test set (which will be used to assess the model).
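A sketch of the partition, assuming the outcome column is named renewal (substitute the actual column name from your data); random_state is set only so the split can be reproduced:

# Separate the inputs from the outcome ('renewal' is an assumed name)
X = nyc_historical.drop(columns=['renewal'])
y = nyc_historical['renewal']

# Send 60% of the rows to the training set, 40% to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=123)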
As noted in the previous section, the model handles the variable selection for us – still, by calling the describe() function, we can get a sense of the dataset’s variables, including their types and their distributions.
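For instance, calling describe() on the training inputs summarizes each numeric column; adding include='all' extends the summary to any non-numeric columns as well:

# Count, mean, spread, and quartiles for every column
X_train.describe(include='all')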

Another thing for us to explore before building the model is the distribution of the response variable. In this training data, it looks like we have two renewers for every one non-renewer. This is important for us to note when assessing model accuracy, since the naive rate here would be 67%.²
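One quick way to check this is value_counts() with normalize=True, which here would show proportions of roughly 0.67 and 0.33:

# Proportion of each outcome class in the training data
y_train.value_counts(normalize=True)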

If the tree is allowed to grow in an unconstrained way, it will continue to split until all of its nodes are homogeneous. Allowing a tree to grow without constraints is nearly certain to result in overfitting. A model overfits when it captures patterns specific to the data used to build it, rather than generalizable patterns that we can expect to hold for new, yet-unseen data.
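To see this for ourselves, we can grow just such a tree. The sketch below fits an unconstrained tree (named tree_big, following the text), plots it, and scores it against the training data:

# Grow a tree with no constraints; it will split until every node is pure
tree_big = DecisionTreeClassifier(random_state=123)
tree_big.fit(X_train, y_train)

# Plotting the result yields a figure far too dense to read
plt.figure(figsize=(12, 6))
plot_tree(tree_big)
plt.show()

# Accuracy against the data used to build the model
accuracy_score(y_train, tree_big.predict(X_train))  # 1.0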

Here, we have fit the training set inputs to the training set outcome. Since we did not specify any constraints on tree growth, scikit-learn built this tree to its maximum possible size, in which every terminal node is homogeneous (containing only one outcome class). If you think the resulting tree model is hard to see, we don’t blame you – in fact, we agree! As shown just above, this model is 100% accurate against the data used to build it. A far better benchmark of model success, though, is its performance against new, yet-unseen data.
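Scoring the same model against the holdout data tells a different story:

# Accuracy against new, yet-unseen data
# (about 0.6352 in the run shown in the text)
accuracy_score(y_test, tree_big.predict(X_test))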


As we see here, the big model’s performance against the test set falls off considerably. This occurs because any splits beyond the first few mainly reflect the “noise” in the training set, rather than the overall “signal” of the data.
The early splits in a tree model – the ones near the root node – tend to involve relatively large numbers of records, and the rules associated with such splits tend to be broadly generalizable. Later on, however, an unconstrained tree may make splits based on very small numbers of records, using whatever criteria it can find to separate those last few records. This can lead to a hyper-specific rule, like “families who spend more than $48, but less than $52 per visit on merchandise, and who spend more than $45, but less than $49 per visit on the Gold Zone, are likely to renew.”
We can constrain the growth of the tree by setting a max_depth parameter.
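A sketch, using a maximum depth of three (the name tree_small is our own; the text simply contrasts this model with tree_big):

# No path from the root to a leaf may contain more than three splits
tree_small = DecisionTreeClassifier(max_depth=3, random_state=123)
tree_small.fit(X_train, y_train)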

We can then plot our tree in order to see it, as follows. Note that we pass all the input variable names in as a list, in the same order that they were passed to the model during fitting.
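A sketch using scikit-learn’s plot_tree function:

# Label each split with its variable name; feature_names must match
# the column order used during fitting
plt.figure(figsize=(12, 6))
plot_tree(tree_small,
          feature_names=list(X_train.columns),
          class_names=[str(c) for c in tree_small.classes_],
          filled=True)
plt.show()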



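Scoring the constrained tree against both sets makes the trade-off concrete:

# Training accuracy falls relative to the overgrown tree...
accuracy_score(y_train, tree_small.predict(X_train))

# ...but test accuracy rises (about 0.71 in the run shown in the text)
accuracy_score(y_test, tree_small.predict(X_test))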
With max_depth set to 3, we have built a model with considerably less accuracy against the training set than the overgrown tree_big model achieved. Against the test set, however, this model does much better than the overfit one. Since predictive models are most useful and valuable when they can predict results for new data, this 71% test accuracy – compared with the 63.52% accuracy of the overgrown model – is the most appropriate yardstick for assessment. We can also say that the smaller model has higher bias, but lower variance.
In machine learning, we are always balancing bias and variance because of the inherent tension between these two model characteristics. As bias goes up, variance tends to go down, and vice versa. So what is bias in this context?
Bias is a measure of how far off a model’s estimated values are from the true values; it stems from the simplifying assumptions a model makes in order to render the target function easier to learn. Variance refers to the differences in a model’s predictions when different portions of the training dataset are used. If there are too many input features, a model may suffer from high variance because it fits the training data too closely; if there are too few, the model may be oversimplified, leading to high bias. These are some of the problems associated with different bias-variance combinations:
High bias + low variance – A model with low variance tends to make relatively consistent predictions, but its high bias means those predictions are systematically off the mark. Such a model is not very sensitive to the specific nature of the input variables.
High bias + high variance – Predictions are both inconsistent and, on average, inaccurate; this is the worst of both worlds.
Low bias + low variance – This is ideal, but difficult to achieve.
Low bias + high variance – A model with high variance, by contrast, is tailored much too specifically to the data used to build it; as a result, it does not perform well with data it has not seen before. A model’s variance increases with its complexity.
2. In other words, if we simply predicted the most common outcome class each time, we would attain a 67% accuracy rate.