9.8 Random Forests vs. Logistic Regression Models


When comparing the two types of classification models featured in this book, it is tempting at first to use a random forest to solve every problem.  After all, random forests are more complicated, more powerful, and often more accurate.  

However, logistic regression models do something that random forests cannot do: they deliver a specific coefficient for each input variable that quantifies the relationship between that input and the log-odds of “1” class membership.  A logistic regression model enables us to say, for instance, that “A 20-point increase in a person’s FICO score multiplies the odds of loan approval by a factor of 1.2,” which is the same as adding about 0.18 to the log-odds of approval.  
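The arithmetic that links a coefficient to log-odds, odds, and probability can be sketched in a few lines of Python. This is only an illustration: the coefficient value and the function names are hypothetical, chosen so that a 20-point FICO increase multiplies the odds of approval by 1.2.

```python
import math

# Hypothetical coefficient: change in the log-odds of approval per FICO point.
# It is chosen here so that 20 points multiply the odds by 1.2.
coef_per_point = math.log(1.2) / 20


def log_odds_change(coef, delta_x):
    """Additive change in log-odds when the input rises by delta_x."""
    return coef * delta_x


def odds_multiplier(coef, delta_x):
    """Multiplicative change in odds when the input rises by delta_x."""
    return math.exp(coef * delta_x)


def shifted_probability(start_prob, coef, delta_x):
    """Probability after the input rises by delta_x, from a starting probability."""
    start_odds = start_prob / (1 - start_prob)
    new_odds = start_odds * odds_multiplier(coef, delta_x)
    return new_odds / (1 + new_odds)


print(f"log-odds change: {log_odds_change(coef_per_point, 20):.4f}")
print(f"odds multiplier: {odds_multiplier(coef_per_point, 20):.4f}")
print(f"new probability from 0.50: {shifted_probability(0.5, coef_per_point, 20):.4f}")
```

The coefficient acts additively on the log-odds scale and multiplicatively on the odds scale. Its effect on the probability itself depends on where the applicant starts.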

An important advantage of random forests, and of tree models in general, is their ability to handle non-linearities in data.  

Non-linearities can arise in an input-to-outcome relationship for many reasons.  One common case is when a small amount of an input affects the outcome in one way, but a larger amount of that same input affects the outcome in a different way.   

For instance, a risk officer at a bank might say that someone with 0-2 credit cards is a high risk, because the person could be unfamiliar with credit.  The risk officer might then say that a person with 3-6 cards is a moderate risk, whereas a person with 7+ cards is again a high risk.  

Similarly, a visitor at a buffet restaurant might say that eating 0-1 slices of pizza leaves him feeling dissatisfied (he’s still hungry!), eating 2-4 slices leaves him feeling satisfied, but eating 5+ slices leaves him dissatisfied again (perhaps he is feeling sick!).  

In either of the above examples (credit cards and risk, or pizza and satisfaction), a tree model could easily handle the non-linearity: it could simply split the records at the right places, and then categorize the records into the correct outcome class.  A logistic regression model that uses the raw count as its input, on the other hand, would fit neither example well.  Because it assigns a single coefficient to each variable, its predicted log-odds can only rise or only fall as the input increases, so it cannot represent a "high, then moderate, then high" pattern unless the input is first recoded (for example, into bins).
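The credit card example can be sketched in plain Python to show why a single coefficient falls short. Everything here is a hypothetical illustration: the toy data has one record per card count, the "tree" is a hand-written rule with two splits rather than a fitted model, and the logistic side is represented by a search over every possible single cut. That search is valid because a logistic regression on one raw input, at any fixed probability cutoff, always predicts class 1 on just one side of a single cut.

```python
# Toy data (hypothetical): one record per card count from 0 to 10.
# Label 1 = high risk, 0 = moderate risk, using the bands from the text.
cards = list(range(11))
high_risk = [1 if (c <= 2 or c >= 7) else 0 for c in cards]


def accuracy(predictions, labels):
    """Share of records whose predicted class matches the label."""
    hits = sum(p == y for p, y in zip(predictions, labels))
    return hits / len(labels)


def best_single_cut_accuracy(xs, ys):
    """Best accuracy of any rule that predicts 1 on one side of a single cut.

    With one raw input, logistic regression predicts class 1 when
    b0 + b1 * x > 0, which is a rule of exactly this form.
    """
    cuts = [min(xs) - 0.5] + [x + 0.5 for x in sorted(set(xs))]
    best = 0.0
    for cut in cuts:
        above = [1 if x > cut else 0 for x in xs]  # positive coefficient
        below = [1 - p for p in above]             # negative coefficient
        best = max(best, accuracy(above, ys), accuracy(below, ys))
    return best


def two_split_tree(x, low_cut=2.5, high_cut=6.5):
    """Hand-written stand-in for a small tree with two splits."""
    if x <= low_cut:
        return 1   # 0-2 cards: high risk
    if x <= high_cut:
        return 0   # 3-6 cards: moderate risk
    return 1       # 7+ cards: high risk


tree_accuracy = accuracy([two_split_tree(c) for c in cards], high_risk)
single_cut_accuracy = best_single_cut_accuracy(cards, high_risk)
print(f"two-split tree:  {tree_accuracy:.3f}")
print(f"best single cut: {single_cut_accuracy:.3f}")
```

On this toy data the two-split rule classifies all 11 records correctly, while no single cut gets more than 8 of the 11 right.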