9. From Single Trees to Random Forests
Single Tree Models
Tree models can be used to predict numeric or categorical outcomes. A tree model consists of a series of IF/THEN rules, depicted visually. In the tree diagram shown below, we can trace any path from the top of the tree to the bottom to generate one such rule. We can view the top line in a node as its question (e.g. “Did this household visit Lobster Land fewer than 11.5 times in the previous season?”). If our answer is “Yes,” we pass the observation to the node on the left; if our answer is “No,” we pass it to the node on the right.
We can say, for instance, that IF a season pass-holding household visits Lobster Land 12 or more times in a season, and IF their average Gold Zone spending exceeds $104.50, and IF their average food spending per person is less than or equal to $63.80, they should be predicted to renew their season pass for the following year. From the diagram below, we can even know that 111 households in the data set used to build the model met those conditions; of that group, 109 of them renewed for the next season.

In the tree model above, we have a mixture of decision nodes (from which the tree makes an additional split of records, based on some particular criterion), and terminal nodes (the nodes at the bottom, from which there are no additional splits). A terminal node is sometimes referred to as a “leaf.”
Within each of the decision nodes, the diagram provides us with four pieces of information: the variable and value on which the split occurs, the Gini impurity for the node, the number of total records in the node, and the class outcomes for those records.
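A tree like this can be fit with scikit-learn, which reports those same pieces of information in each node. The sketch below assumes scikit-learn is installed; the data and the feature names (visits, avg_gold_zone) are invented for illustration, not the chapter’s actual Lobster Land dataset.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented stand-in data: [visits, average Gold Zone spending]; 1 = renewed
X = [[3, 30], [5, 45], [14, 120], [16, 95],
     [8, 50], [20, 140], [12, 110], [2, 25]]
y = [0, 0, 1, 1, 0, 1, 1, 0]

# criterion="gini" is the default; max_depth keeps the tree readable
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)

# Print the fitted tree as nested IF/THEN rules
print(export_text(clf, feature_names=["visits", "avg_gold_zone"]))
```

For a diagram like the one in this chapter, sklearn.tree.plot_tree draws each node with its split rule, Gini impurity, sample count, and class counts.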
To dive into this a bit, let’s take a closer look at that very first node, which is also called the root node.

The split at this node occurs based on a record’s number of visits. We can think of this first line within a decision node as being akin to a question: It’s essentially asking the household, “Did you visit fewer than 11.5 times during the previous season?” If that answer is “Yes”, then the record moves along to the left, where we will split based on car ownership. If that answer is “No” (in other words, if the household visited 12 or more times), then we move that record along to the right, where we split based on Gold Zone spending.
The information displayed here in this node also tells us that there are 1920 total observations within it; of these, 640 belong to the “0” class (did not renew their membership), whereas 1280 belong to the “1” class (did renew their membership).
The Gini value here is a measure of node impurity. When measuring node impurity, the model uses only one criterion – the classes of the response variable. The formula it uses is this:
Gini = 1 – (proportion of records that belong to Class 0)^2 – (proportion of records that belong to Class 1)^2
Since 640/1920 ≈ 0.33, we can say that one-third of the records in our root node belong to the 0 class. Therefore, in this two-class scenario, 0.67 of the records in the node must belong to the 1 class. We can calculate the Gini impurity value as follows:

Gini = 1 – (1/3)^2 – (2/3)^2 = 1 – 0.111 – 0.444 ≈ 0.444
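The same arithmetic is easy to verify in a few lines of Python:

```python
# Gini impurity of the root node: 1 - p0^2 - p1^2
p0 = 640 / 1920   # proportion in class 0 (did not renew)
p1 = 1280 / 1920  # proportion in class 1 (renewed)
gini = 1 - p0**2 - p1**2
print(round(gini, 3))  # 0.444
```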
You might be wondering, “Why did the first separation of records occur based on ‘visits’ and not some other variable?” The answer lies in the Gini impurity calculation.
Before deciding to split based on whether a household had visited fewer than 11.5 times during a season, the model evaluated every possible variable and value combination in the dataset. Using 11.5 visits as the threshold for the first split created the lowest possible weighted Gini impurity in the nodes that appear one level down from the root.¹ For each subsequent split in the tree, the model uses the same approach – it considers all of the available variables, and it selects the split that most reduces impurity in the nodes immediately beneath it. The tree’s decision-making process can be called a “greedy” algorithm, as it only considers a split’s impact one step ahead.
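The greedy search performed at each node can be sketched in plain Python. This is a simplified illustration with a made-up toy dataset, not the chapter’s data: for every feature, it tries the midpoints between consecutive observed values as candidate thresholds and keeps the split with the lowest weighted Gini impurity.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best split."""
    n = len(labels)
    best = (None, None, float("inf"))
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds: midpoints between consecutive observed values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] < t]
            right = [y for r, y in zip(rows, labels) if r[j] >= t]
            # Weight each child's impurity by its share of the records
            w = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
            if w < best[2]:
                best = (j, t, w)
    return best

# Toy data (invented): [visits, avg Gold Zone spending]; 1 = renewed
X = [[4, 20], [7, 35], [9, 40], [13, 110], [15, 90], [18, 120]]
y = [0, 0, 0, 1, 1, 1]
j, t, w = best_split(X, y)
print(j, t, w)  # 0 11.0 0.0
```

Here the search lands on feature 0 (visits) with a threshold of 11.0, because that split produces two perfectly pure children (weighted Gini of 0). In a real tree the winning split rarely reaches zero impurity, and the same search simply repeats inside each child node.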
One of the biggest benefits associated with trees is their interpretability. Because tree models generate visually interpretable results, they can be read by non-specialized audiences. For the tree model shown above, once readers know that “Yes” answers move to the left and “No” answers move to the right, they can classify any household from the dataset as a likely renewer or a likely non-renewer.
Another advantage of tree models is that they automate the process of variable selection. When we pass many input variables to a tree model, the tree splits on the ones that are most effective at reducing impurity. Meanwhile, other variables are simply ignored. Rather than needing to worry about parsimony by removing variables during the exploratory process, we can simply feed as many inputs to the tree as we wish to, and let it show us which ones are used.
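This automatic variable selection is visible in scikit-learn’s feature_importances_ attribute. In the sketch below (invented data, assuming scikit-learn is installed), the third input column is deliberate noise with no relationship to the outcome:

```python
from sklearn.tree import DecisionTreeClassifier

# Invented data: [visits, avg Gold Zone spending, noise]; 1 = renewed
X = [[3, 30, 7], [5, 45, 2], [14, 120, 9], [16, 95, 1],
     [8, 50, 4], [20, 140, 8], [12, 110, 3], [2, 25, 6]]
y = [0, 0, 1, 1, 0, 1, 1, 0]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Importances sum to 1 across the features the tree used;
# a variable never chosen for a split gets an importance of exactly 0
print(clf.feature_importances_)
```

Any input that is never chosen for a split, like the noise column here, receives an importance of exactly zero, showing us which variables the tree actually used.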
For a tree model, there is no need to standardize the input features beforehand – since the model is only interested in finding split points that separate observations by the response variable, the inputs do not need to share common units or scales.
Tree models are non-linear. This means that tree models can detect complex variable relationships, including situations in which some quantity of an input variable impacts the response variable in a particular way, but more of that input variable has a different effect (we will return to this topic near the end of this chapter). When we build tree models, we do not need to concern ourselves with multicollinearity, as we would when making a linear or logistic regression model.
Like all other modeling types, though, trees have their flaws.
Trees can be unstable, especially when they are built with relatively small datasets. The random assignment of records to the training and test sets can sometimes lead to very different results when those assignments change.
Trees can also be prone to overfitting. More will be said about overfitting later in this chapter.
Unlike linear models, tree models do not allow us to precisely quantify the nature of the relationship between an input variable and the response variable. With a regression model, for instance, we can use statistical inference to identify the significance of some particular input variable. Having done that, we can make statements like, “a one-unit change in this input leads to a 15.79-unit change in the response variable, all else equal.” Tree models do not let us make these types of statements.
¹ In the second level of this tree, there are 1920 records. Approximately 82% of them are in the left-most node, which splits on own_car. The other 18% are in the right-most node, which splits on avggoldzone_perperson. The weighted Gini impurity for the two nodes is (0.82 × 0.476) + (0.18 × 0.134) ≈ 0.4144.
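The footnote’s weighted-impurity arithmetic can be reproduced directly, using the node shares and Gini values it reports:

```python
# Weighted Gini impurity one level below the root (values from the footnote)
left_share, left_gini = 0.82, 0.476    # node splitting on own_car
right_share, right_gini = 0.18, 0.134  # node splitting on avggoldzone_perperson
weighted = left_share * left_gini + right_share * right_gini
print(round(weighted, 4))  # 0.4144
```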