9.8 Random Forests vs. Logistic Regression Models
When comparing the two types of classification models featured in this book, it is tempting to reach for a random forest to solve every problem. After all, random forests are more complex, more powerful, and often more accurate.
However, logistic regression models do something that random forests cannot do: they deliver a specific coefficient for each input variable that quantifies the relationship between that input and the log-odds of “1” class membership. A logistic regression model enables us to say, for instance, that “A 20-point increase in a person’s FICO score increases the log-odds of loan approval by 1.2.” (Note that the change in log-odds is additive; the corresponding odds are multiplied by e^1.2.)
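This interpretation can be sketched numerically. The coefficient below is hypothetical (0.06 per FICO point, chosen so that a 20-point increase adds 1.2 to the log-odds); it is not fitted from any real data.

```python
import math

# Hypothetical fitted coefficient: each additional FICO point adds
# 0.06 to the log-odds of loan approval (illustrative value only).
coef_per_point = 0.06

# A 20-point increase adds 20 * 0.06 = 1.2 to the log-odds...
delta_log_odds = 20 * coef_per_point

# ...which multiplies the odds of approval by exp(1.2) ≈ 3.32.
odds_multiplier = math.exp(delta_log_odds)

print(delta_log_odds)             # 1.2
print(round(odds_multiplier, 2))  # 3.32
```

The distinction matters in practice: coefficients act additively on the log-odds scale, and only become multiplicative after exponentiating to the odds scale.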
On the other hand, an important advantage of random forests, and of tree models in general, is their ability to handle non-linearities in data.
Non-linearities can occur for many reasons in an input-to-outcome relationship. One way a non-linearity can arise is when a small amount of an input affects the outcome in one way, but a larger amount of that same input affects the outcome in a different way.
For instance, a risk officer at a bank might say that someone with 0-2 credit cards is a high risk, because the person could be unfamiliar with credit. The risk officer might then say that a person with 3-6 cards is a moderate risk, whereas a person with 7+ cards is again a high risk.
Similarly, a visitor at a buffet restaurant might say that eating 0-1 slices of pizza leaves him feeling dissatisfied (he’s still hungry!), eating 2-4 slices leaves him feeling satisfied, but eating 5+ slices leaves him dissatisfied again (perhaps he is feeling sick!).
In either of the above examples (credit cards and risk, or pizza and satisfaction), a tree model could easily handle the non-linearity: it could simply split the records at the right places and then accurately categorize them into the correct outcome class. A logistic regression model, on the other hand, would not be appropriate for either example, because it assigns a single coefficient to each variable, which forces the input’s relationship with the log-odds to move in one direction only.
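The credit-card example can be sketched with scikit-learn. The data below is fabricated to mirror the risk officer's rule (0-2 cards high risk, 3-6 lower risk, 7+ high risk again); a shallow decision tree can recover the two split points, while a logistic regression, limited to one monotone coefficient, cannot.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Fabricated illustration of the risk officer's rule:
# 0-2 cards -> high risk (1), 3-6 cards -> lower risk (0), 7+ -> high risk (1)
cards = np.arange(0, 11).reshape(-1, 1)              # 0..10 credit cards
risk = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1])   # 1 = high risk

# A depth-2 tree is enough: one split near 2.5, one near 6.5.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(cards, risk)

# Logistic regression fits a single coefficient, so its predictions
# can only rise or fall monotonically with the number of cards.
logit = LogisticRegression().fit(cards, risk)

print(tree.score(cards, risk))   # 1.0 -- the tree recovers the rule exactly
print(logit.score(cards, risk))  # below 1.0 -- some records misclassified
```

The tree handles the "high at both ends" pattern by carving the card count into three regions, which no single-coefficient model can imitate.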