9.4 Using GridSearch to Determine Optimal Hyperparameters


To arrive at an optimal set of hyperparameters, a modeler can use a grid search.  A grid search is a computationally intensive process that involves building and assessing many models, each with a slightly different combination of settings, in order to find an effective combination of settings for the data and the modeling task at hand.

First, we import RandomForestClassifier from sklearn.ensemble and create an instance of this class, clf.  Next, we identify some boundaries for our upcoming search by setting up the param_grid dictionary, as sketched below.
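Here is a minimal sketch of that setup; the specific candidate values below are illustrative placeholders of our own choosing, not necessarily the exact ones used in this chapter:

from sklearn.ensemble import RandomForestClassifier

# Instantiate the classifier; a fixed random_state keeps results reproducible
clf = RandomForestClassifier(random_state=0)

# Candidate values for each hyperparameter -- these ranges are placeholders,
# and the search simply has to start somewhere
param_grid = {
    'n_estimators': [25, 50, 75],
    'max_depth': [3, 5, 7],
    'max_features': [2, 3, 4],
    'min_samples_split': [2, 5, 10]
}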

The hyperparameters shown here are n_estimators (the total number of trees in the random forest), max_depth (the maximum allowable depth of each tree), max_features (the number of features each tree is allowed to consider at each split), and min_samples_split (the minimum number of records a decision node must contain before the model will consider splitting it further).  The values shown here are somewhat arbitrary; much like using a grid to search for buried treasure, the process simply has to start somewhere.  One important constraint to note with max_features, though, is that it should never exceed p - 1, where p is the total number of input features in the model.

When such a process is used, a separate random forest is built for every combination of hyperparameters in the grid.  As more options are added to the grid, therefore, the computational cost grows multiplicatively.  Here, there are 3 x 3 x 3 x 3 = 81 combinations, and each one is evaluated with five-fold cross-validation, which really means that 81 x 5 = 405 models are built.  If we added just one more option for each of the four hyperparameters, this number would jump to 1,280 total models (4 x 4 x 4 x 4 = 256, and 256 x 5 = 1,280).
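As a rough sketch of how this search can be wired up, assuming the clf and param_grid objects defined above and training data stored in X_train and y_train (these variable names are our own, standing in for the training features and labels from earlier in the chapter):

from sklearn.model_selection import GridSearchCV

# Exhaustively try every combination in param_grid, scoring each one
# with five-fold cross-validation on the training data
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)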

Through the grid search process, scikit-learn determines the best hyperparameter combination for the dataset.  In the example in this chapter, it does this by using five-fold cross-validation.  This means the training set is split into five separate folds, each representing 20% of the total data in the training set.  A model is built with four of the folds, while the remaining fold is used to assess it.  Next, a different fold is held out for assessment while the other 80% is used to build the model.  This process repeats until each fold has served as the assessment set once, for five rounds in total.  This is demonstrated in the image below:

The recommended settings for this model are shown in the output in the image below.  When a modeler encounters an "edge case," that is, a recommended value that falls along one of the borders of the grid, he or she may want to expand the search grid to be sure that the optimal setting has not been cut off by the boundaries of the search.  In this example, that would mean trying a larger max_depth option and a larger n_estimators option, since the recommended values are the largest ones the search was allowed to consider.
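The same recommendations can also be read directly from the fitted search object; a quick sketch, using the grid_search object from the earlier sketch:

print(grid_search.best_params_)   # the winning hyperparameter combination
print(grid_search.best_score_)    # its mean cross-validated score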

Remember that hyperparameter tuning is part art, part science.  The best_params_ results that appear above only tell us how this model performed against the folds of training data used to assess it during this cross-validation process.  While it certainly makes sense for us to use these hyperparameters going forward, we cannot say that they are the “best” ones in any objective sense.  First, and most importantly, we have no idea how the model will perform “in the wild” against totally new data from a different source.  Second, these results should not even be expected to be entirely consistent against this dataset – tiny changes in the random assignment of records to the various cross-validation “folds” can lead to slightly different outcomes.  

That said, we’ll give this grid another shot here.  We will specify our max_features and min_samples_split values (to avoid having the subsequent trees built with the default settings for these), and throw some new options out there for n_estimators and max_depth.  
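A sketch of that second grid might look like the following, where the specific values are again illustrative placeholders rather than the chapter's exact numbers:

# Second pass: pin the settings we are satisfied with, and try new
# (larger) options for n_estimators and max_depth.  Values are placeholders.
param_grid_2 = {
    'n_estimators': [75, 100, 150],
    'max_depth': [7, 9, 11],
    'max_features': [3],          # fixed, rather than left at the default
    'min_samples_split': [5]      # fixed, rather than left at the default
}

grid_search_2 = GridSearchCV(estimator=clf, param_grid=param_grid_2, cv=5, n_jobs=-1)
grid_search_2.fit(X_train, y_train)
print(grid_search_2.best_params_)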

It looks like our max_depth range can be expanded a bit further – and that we may not need such a large number of individual trees.  Let’s give it another shot:
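Again as an illustrative sketch, with placeholder values rather than the chapter's exact ones:

# Third pass: push max_depth higher, allow fewer trees, and let
# max_features and min_samples_split vary again.  Values are placeholders.
param_grid_3 = {
    'n_estimators': [50, 75],
    'max_depth': [11, 13, 15],
    'max_features': [2, 3, 4],
    'min_samples_split': [2, 5, 10]
}

grid_search_3 = GridSearchCV(estimator=clf, param_grid=param_grid_3, cv=5, n_jobs=-1)
grid_search_3.fit(X_train, y_train)
print(grid_search_3.best_params_)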

With this set of results, we have landed in the middle of our search ranges for max_depth, max_features, and min_samples_split.  We can stick with 75 trees for our n_estimators value; even though fewer trees might be slightly more efficient, this number will be okay for us.  
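To carry the winning combination forward, one option is to grab the refit model from the final search object; this continues the sketch above and assumes hold-out data in X_test and y_test (names of our own):

# GridSearchCV refits the best combination on the full training set by default
final_model = grid_search_3.best_estimator_

# Assess the tuned model against the hold-out set
test_accuracy = final_model.score(X_test, y_test)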

A modeler wishing to speed up this process can use a randomized search (RandomizedSearchCV in scikit-learn), which runs considerably faster because it samples only some of the possible combinations rather than exhaustively trying every one.
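A brief sketch, reusing the final grid and training data from the sketches above:

from sklearn.model_selection import RandomizedSearchCV

# Sample only 20 of the possible combinations instead of trying them all
random_search = RandomizedSearchCV(
    estimator=clf,
    param_distributions=param_grid_3,
    n_iter=20,
    cv=5,
    random_state=0,
    n_jobs=-1
)
random_search.fit(X_train, y_train)
print(random_search.best_params_)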

Also, there are other hyperparameters that can be adjusted beyond the ones shown here.  To see a complete list of options, along with the default settings built into scikit-learn, just call help(RandomForestClassifier) in your Google Colab or Jupyter Notebook environment, after first importing the RandomForestClassifier class.