9.2 What is a Random Forest?


A random forest is an ensemble learner built from many tree models. In machine learning, the word “ensemble” (French for “together”) refers to any method that combines the outputs of multiple individual models into a single prediction.

The “forest” in “random forest” comes from the use of many individual trees, whose collective predictions are combined to generate a single predicted outcome for a record.

The “random” comes from the way the individual trees within this “forest” are constructed. At each split made by each individual tree, only a randomly chosen subset of the features, rather than the entire feature set, is considered.

This restriction to a feature subset at each split is what makes the random forest so effective. If every split in every tree could choose from the same full feature set, the trees would be highly correlated with one another: each tree would pick the same strongest variables at each split point, and consequently each tree would look much like the others. Restricting the candidate features forces diversity among the individual trees, so the forest as a whole is better able to capture the specific quirks, or nuances, of the dataset.
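To make this concrete, here is a minimal sketch using scikit-learn (the library choice is our assumption, not the text’s). The max_features argument is what limits the candidate features at each split; scikit-learn also trains each tree on a bootstrap sample of the records by default, a second source of tree-to-tree diversity.

```python
# A minimal sketch, assuming scikit-learn; the toy data is hypothetical,
# standing in for a real feature matrix and labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# max_features="sqrt": each split considers a random subset of roughly
# sqrt(20) ~ 4 candidate features instead of all 20, decorrelating the trees.
forest = RandomForestClassifier(n_estimators=250, max_features="sqrt",
                                random_state=42)
forest.fit(X, y)
```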

To classify a record, the random forest uses the complete picture drawn from all of its individual trees. Imagine that we are predicting whether a prospective customer will subscribe to a service, and that our random forest contains 250 trees. The record is sent through each of the 250 individual trees, each of which is built in a slightly different way, and the model’s overall prediction is the majority “vote” among those individual components. If the random forest model is instead used to predict a numeric outcome, the final ensemble prediction is the average of the individual trees’ predictions.
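Continuing the sketch above, we can watch the voting happen by running one record through every fitted tree and tallying the votes ourselves (names like `record` are ours, for illustration; scikit-learn itself averages the trees’ class probabilities, a soft vote, but with fully grown trees the hard vote shown here almost always agrees).

```python
# Hedged continuation of the earlier sketch: one record, 250 tree "votes".
import numpy as np

record = X[:1]  # one prospective customer (illustrative)

# Each tree in forest.estimators_ predicts an index into forest.classes_.
votes = np.array([int(tree.predict(record)[0]) for tree in forest.estimators_])

# The majority vote among the 250 trees is the ensemble's classification.
majority_class = forest.classes_[np.bincount(votes).argmax()]
print(majority_class, forest.predict(record)[0])  # typically identical

# For a numeric outcome, a RandomForestRegressor would instead average the
# individual trees' predictions to produce the final ensemble prediction.
```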