1.27 Demystifying the term “Machine Learning”
Machine learning is the process of building, assessing, and refining models with the application of statistical methods. Okay, but maybe that sounded like a bunch of jargon. You could have Googled your way to that. So what does it really mean?
Maybe the best way to answer that is to think about what wouldn’t be machine learning. Suppose we are trying to build a model that can predict the amount of time (measured in minutes) for your daily commute from home to work. Suppose we spend a few minutes thinking about which inputs would matter (day of week, amount of traffic, time of year, whether there is a major event in the city that day, weather, etc.) and we just write one ordinary least squares function with Python’s statsmodels. We use time as our dependent variable, all the things that we thought might matter as our inputs, and stop right there. We have an outcome variable, we have some inputs, and we have a model.
Mission accomplished…right?
At that point, we have built (but certainly not assessed or refined!) a model, and haven’t applied any statistical rigor to our decision about what to include as inputs. We have NOT engaged in machine learning.
Now suppose, alternatively, we could get our hands on 20 separate inputs that might be relevant for such a model, along with several hundreds of days worth of commuting data. We then throw ALL of those inputs into one version of the model, but then use the help of a tool like stepwise regression, or backward elimination, to eliminate variables that lack statistical significance.12 Now, p-values matter — not just our own intuition! We throw away variables with high p-values, and we try some new combinations of inputs. Maybe we even explore alternative ways to handle our categories (like collapsing a category with 8 levels into just 3 levels). We compare a few versions of the model and then we check their accuracy against additional commute time observations that we collect, in order to validate our results. Now, to circle back to the definition shown above, we have “built, assessed, and refined” our model using statistical methods. Aided by the software, and using our own judgment and intuition to steer the process, we have engaged in machine learning.
Once we have a model built, then things get fun. Now, with ease, we can pass huge datasets through the model, and spit out predictions for each record in a matter of seconds.
12 Automatic variable selection methods will be covered in Chapter 16.