9.8 Random Forests vs. Logistic Regression Models

When comparing the two types of classification models featured in this book, it is tempting to reach for a random forest to solve every problem. After all, random forests are more complex, more flexible, and often more accurate.

However, logistic regression models do something that random forests cannot do – they deliver a specific coefficient for each input variable that quantifies the relationship between that input and the log-odds of “1” class membership. A logistic regression model enables us to say, for instance, that “A 20-point increase in a person’s FICO score multiplies the odds of loan approval by a factor of 1.2,” which is the same as adding ln(1.2) ≈ 0.18 to the log-odds of approval.
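To make the coefficient interpretation concrete, here is a minimal sketch of the arithmetic. The intercept and coefficient below are hypothetical values, chosen only so that a 20-point FICO increase corresponds to an odds ratio of 1.2; they do not come from a fitted model.

```python
import math

# Hypothetical values for illustration only (not taken from a fitted model).
INTERCEPT = -6.0
COEF_PER_POINT = math.log(1.2) / 20  # log-odds added per 1-point FICO increase


def log_odds(fico):
    """Log-odds of approval under the assumed one-variable model."""
    return INTERCEPT + COEF_PER_POINT * fico


def odds(fico):
    """Odds of approval: exp(log-odds)."""
    return math.exp(log_odds(fico))


def probability(fico):
    """Probability of approval: the logistic function of the log-odds."""
    return 1.0 / (1.0 + math.exp(-log_odds(fico)))


# A 20-point increase adds a fixed amount to the log-odds ...
log_odds_change = log_odds(720) - log_odds(700)

# ... which multiplies the odds by a fixed factor, wherever we start.
odds_ratio_700 = odds(720) / odds(700)
odds_ratio_600 = odds(620) / odds(600)

print(round(log_odds_change, 4))
print(round(odds_ratio_700, 4), round(odds_ratio_600, 4))

# The change in probability, by contrast, depends on the starting score.
print(round(probability(620) - probability(600), 4))
print(round(probability(720) - probability(700), 4))
```

The coefficient acts additively on the log-odds and multiplicatively on the odds. The effect on the probability itself is not constant, which is why coefficients are usually reported as odds ratios.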

An important advantage of random forests – and of tree models in general – is their ability to handle non-linearities in data.

Non-linearities can occur for many reasons in an input-to-outcome relationship. One way a non-linearity can arise is when a small amount of an input affects the outcome in one way, but a larger amount of that same input affects the outcome in a different way.

For instance, a risk officer at a bank might say that someone with 0-2 credit cards is a high risk, because the person could be unfamiliar with credit. The risk officer might then say that a person with 3-6 cards is a moderate risk, whereas a person with 7+ cards is a high risk.

Similarly, a visitor at a buffet restaurant might say that eating 0-1 slices of pizza leaves him feeling dissatisfied (he’s still hungry!), eating 2-4 slices leaves him feeling satisfied, but eating 5+ slices leaves him dissatisfied again (perhaps he is feeling sick!).

In either of the above examples (credit cards and risk, or pizza and satisfaction), a tree model could easily handle the non-linearity: it could simply split the records at the right places, and then accurately categorize the records into the correct outcome class. A logistic regression model fit on the raw input, on the other hand, would be a poor choice for either example. Because it assigns a single coefficient to each variable, its predicted log-odds can only rise steadily or fall steadily as the input increases, so it cannot represent an outcome that goes one way and then reverses.
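The pizza example can be sketched in a few lines of plain Python. Everything here is an illustrative assumption: the eight-row dataset is made up to match the 0-1 / 2-4 / 5+ pattern, the logistic model is fit with simple gradient descent, and the tree is a small hand-rolled, depth-2 classifier that splits on Gini impurity, not any particular library's implementation.

```python
import math

# Hypothetical toy data: slices eaten -> 1 if satisfied, 0 if dissatisfied.
slices = [0, 1, 2, 3, 4, 5, 6, 7]
satisfied = [0, 0, 1, 1, 1, 0, 0, 0]


def fit_logistic(xs, ys, lr=0.05, steps=5000):
    """Fit p = sigmoid(b0 + b1 * x) by gradient descent on the log loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y
            g1 += (p - y) * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


def logistic_predict(coefs, x):
    b0, b1 = coefs
    return 1 if b0 + b1 * x >= 0 else 0  # equivalent to p >= 0.5


def gini(ys):
    """Gini impurity of a list of 0/1 labels."""
    if not ys:
        return 0.0
    p = sum(ys) / len(ys)
    return 2 * p * (1 - p)


def majority(ys):
    return 1 if 2 * sum(ys) >= len(ys) else 0


def build_tree(xs, ys, depth=2):
    """Return a leaf label, or a (threshold, left_subtree, right_subtree) tuple."""
    values = sorted(set(xs))
    if depth == 0 or len(set(ys)) == 1 or len(values) < 2:
        return majority(ys)
    best_score, best_t = None, None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate split halfway between neighboring values
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = len(left) * gini(left) + len(right) * gini(right)
        if best_score is None or score < best_score:
            best_score, best_t = score, t
    left = [(x, y) for x, y in zip(xs, ys) if x < best_t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= best_t]
    return (
        best_t,
        build_tree([x for x, _ in left], [y for _, y in left], depth - 1),
        build_tree([x for x, _ in right], [y for _, y in right], depth - 1),
    )


def tree_predict(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree


def accuracy(preds, ys):
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


coefs = fit_logistic(slices, satisfied)
tree = build_tree(slices, satisfied)

logit_acc = accuracy([logistic_predict(coefs, x) for x in slices], satisfied)
tree_acc = accuracy([tree_predict(tree, x) for x in slices], satisfied)

print("tree:", tree)
print("logistic accuracy:", logit_acc)
print("tree accuracy:", tree_acc)
```

On this toy data the tree places its two splits at 1.5 and 4.5 slices, which recovers the 0-1 / 2-4 / 5+ ranges and classifies every record correctly. The logistic model cannot do the same: its predictions can switch class at most once as the number of slices increases, so it cannot reproduce a dissatisfied-satisfied-dissatisfied pattern from the raw slice count alone.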