Scikit-Learn - Cross-Validation & Hyperparameter Tuning Using GridSearch

1. Cross Validation

We generally split a dataset into train and test sets, train the model on the train set, and evaluate it on the test set. With this approach the model sees only the training portion, typically around 4/5 of the data, and its performance is measured on a single held-out split.

Cross-validation gives a more reliable estimate of how well the model generalizes because it makes use of more of the data. Several models are built, each on a different training set and each evaluated on a test set that does not overlap with the other test sets. The test scores are then aggregated, typically by averaging them.
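
The idea can be sketched with plain NumPy before involving any scikit-learn splitter. The helper name `k_fold_indices` is made up for this illustration; it cuts the sample indices into contiguous test folds and uses everything else for training, which is roughly what KFold does later in this tutorial.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs for contiguous, non-overlapping folds.

    Hypothetical helper for illustration only; sklearn's KFold does this job.
    """
    indices = np.arange(n_samples)
    # np.array_split tolerates n_samples not being divisible by n_folds
    folds = np.array_split(indices, n_folds)
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one goes into the training set
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(10, n_folds=5))
for i, (train_idx, test_idx) in enumerate(splits, start=1):
    print('Fold %d  train=%s  test=%s' % (i, train_idx, test_idx))
```

Each sample lands in exactly one test fold, so the five test scores together cover the whole dataset.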

Image Explaining 5-Fold Cross Validation

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

from collections import Counter

np.set_printoptions(precision=2)

%matplotlib inline

Default Classification Tasks Approach

Below we try the default approach to a classification task: divide the data into train/test sets, train a model, and evaluate it on the test set. Only one train/test combination is tried, without any cross-validation. Since the estimate rests on a single split, it may not reflect how well the model generalizes.

In [2]:
from sklearn import datasets

iris = datasets.load_iris()
X_iris, Y_iris = iris.data, iris.target
print('Dataset Size : ', X_iris.shape, Y_iris.shape)
Dataset Size :  (150, 4) (150,)

Splitting Datasets Into Train/Test Sets

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_iris, Y_iris, train_size=0.80, test_size=0.20, random_state=12, stratify=Y_iris)
print('Train/Test Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
Train/Test Sizes :  (120, 4) (30, 4) (120,) (30,)

Training Model

In [4]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
Out[4]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

Evaluating Model On Test Set.

In [5]:
print('Train Accuracy : %.2f'%knn.score(X_train, Y_train))
print('Test Accuracy : %.2f'%knn.score(X_test, Y_test))
Train Accuracy : 0.96
Test Accuracy : 1.00

Default Regression Tasks Approach

Below we try the default approach to a regression task: divide the data into train/test sets, train a model, and evaluate it on the test set. Only one train/test combination is tried, without any cross-validation. Since the estimate rests on a single split, it may not reflect how well the model generalizes. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the Boston cells in this tutorial need an older scikit-learn release.

In [6]:
boston = datasets.load_boston()
X_boston, Y_boston = boston.data, boston.target
print('Dataset Size : ', X_boston.shape, Y_boston.shape)
Dataset Size :  (506, 13) (506,)

Splitting Datasets Into Train/Test Sets

In [7]:
from sklearn.neighbors import KNeighborsRegressor

X_train, X_test, Y_train, Y_test = train_test_split(X_boston, Y_boston, train_size=0.80, test_size=0.20, random_state=12)
print('Train/Test Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
Train/Test Sizes :  (404, 13) (102, 13) (404,) (102,)

Training Model

In [8]:
knn = KNeighborsRegressor()
knn.fit(X_train, Y_train)
Out[8]:
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

Evaluating Model On Test Set.

In [9]:
print('Train R^2 Score : %.2f'%knn.score(X_train, Y_train))
print('Test R^2 Score : %.2f'%knn.score(X_test, Y_test))
Train R^2 Score : 0.71
Test R^2 Score : 0.54

The above implementation evaluates only one train/test split, so the score depends on which samples happened to land in the test set. Other splits may give noticeably different results. Hence it's worth trying several splits to get an estimate that generalizes well.

sklearn also provides various splitting strategies as mentioned below:

  • KFold
  • StratifiedKFold
  • ShuffleSplit
  • StratifiedShuffleSplit

sklearn provides the cross_val_score function, which evaluates a model on each train/test split produced by the chosen strategy and returns the test score of each split.
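
As a minimal sketch of how the per-split scores are usually summarised, the snippet below runs cross_val_score on iris and reports the mean and standard deviation. The choice of KNeighborsClassifier and cv=5 simply mirrors the rest of this tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# One accuracy value per split; cv=5 gives five values
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)

# Summarise the splits with their mean and spread
print('Split scores : ', scores)
print('Mean accuracy : %.2f (+/- %.2f)' % (scores.mean(), scores.std()))
```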

sklearn also provides the cross_validate function. It works like cross_val_score but returns a dictionary holding the fit time, score time, and test score of each split.
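
cross_validate can also evaluate several metrics at once and report train scores. The sketch below assumes the iris dataset and the 'accuracy' and 'f1_macro' scorer names; the result keys are then suffixed with the metric name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

results = cross_validate(
    KNeighborsClassifier(), X, y, cv=5,
    scoring=['accuracy', 'f1_macro'],  # several metrics in one pass
    return_train_score=True,           # also score on the training portion
)

# Every entry is an array with one value per split
for key in sorted(results):
    print('%-16s %s' % (key, results[key].round(2)))
```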

Below we try StratifiedKFold and StratifiedShuffleSplit on the classification dataset (iris), and KFold and ShuffleSplit on the regression dataset (boston).

KFold

K-Fold is the most common form of cross-validation. The dataset is divided into k folds, usually 5 or 10; in each iteration one fold serves as the test set and the remaining folds are combined to form the train set.

In [10]:
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import KFold,StratifiedKFold, ShuffleSplit, StratifiedShuffleSplit
In [11]:
print('Regression With Default CV (cv=5) : ', cross_val_score(KNeighborsRegressor(), X_boston, Y_boston, cv=5)) # Default KFold CV
print('Regression With KFold Cross Validation : ', cross_val_score(KNeighborsRegressor(), X_boston, Y_boston, cv=KFold(n_splits=5)))
Regression With Default CV (cv=5) :  [-1.11  0.15 -0.43 -0.01 -0.17]
Regression With KFold Cross Validation :  [-1.11  0.15 -0.43 -0.01 -0.17]
In [12]:
print('Regression With Default CV (cv=5) : \n', cross_validate(KNeighborsRegressor(), X_boston, Y_boston, cv=5)) # Default KFold CV
print('\nRegression With KFold Cross Validation : \n', cross_validate(KNeighborsRegressor(), X_boston, Y_boston, cv=KFold(n_splits=5)))
Regression With Default CV (cv=5) :
 {'fit_time': array([0., 0., 0., 0., 0.]), 'score_time': array([0.  , 0.01, 0.  , 0.  , 0.  ]), 'test_score': array([-1.11,  0.15, -0.43, -0.01, -0.17])}

Regression With KFold Cross Validation :
 {'fit_time': array([0., 0., 0., 0., 0.]), 'score_time': array([0., 0., 0., 0., 0.]), 'test_score': array([-1.11,  0.15, -0.43, -0.01, -0.17])}

The negative values are valid: R^2 drops below zero whenever a model predicts worse than always returning the mean of the test targets. KFold without shuffling takes contiguous blocks of rows, so a dataset stored in a meaningful order can produce folds like these.

Next we split the iris classification dataset with KFold and print the class distribution of the train and test sets for each split. Each printed value is the number of samples of a class in that subset divided by the size of the whole dataset (150), and np.bincount omits trailing classes that have no samples, so [0.2] means the test fold contains only class 0. Notice that the class distribution is not preserved: iris is sorted by class and KFold takes consecutive samples, so some test folds contain a single class. Ideally each class should have the same share in the train and test sets as in the whole dataset; if a class makes up 30% of the dataset, it should make up 30% of both the train set and the test set.

Hence we should generally use StratifiedKFold for classification datasets and KFold for regression datasets.

In [13]:
kfold = KFold(n_splits=5)
masks = []
for i, (train_indexes, test_indexes) in enumerate(kfold.split(X_iris)):
    mask = np.array([(False if j in train_indexes else True) for j in range(len(Y_iris))])
    print('Split[%d] Train Index Distribution by class : '%(i+1),np.bincount(Y_iris[train_indexes])/len(Y_iris))
    print('Split[%d] Test Index Distribution by class : '%(i+1), np.bincount(Y_iris[test_indexes])/len(Y_iris))
    masks.append(mask)
Split[1] Train Index Distribution by class :  [0.13 0.33 0.33]
Split[1] Test Index Distribution by class :  [0.2]
Split[2] Train Index Distribution by class :  [0.2  0.27 0.33]
Split[2] Test Index Distribution by class :  [0.13 0.07]
Split[3] Train Index Distribution by class :  [0.33 0.13 0.33]
Split[3] Test Index Distribution by class :  [0.  0.2]
Split[4] Train Index Distribution by class :  [0.33 0.27 0.2 ]
Split[4] Test Index Distribution by class :  [0.   0.07 0.13]
Split[5] Train Index Distribution by class :  [0.33 0.33 0.13]
Split[5] Test Index Distribution by class :  [0.  0.  0.2]
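
The fractions printed above are relative to the whole dataset. A variant that normalises within each test fold, and keeps a slot for absent classes, can make the imbalance easier to read. This is only a sketch; `class_proportions` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
n_classes = len(np.unique(y))

def class_proportions(labels, n_classes):
    """Share of each class within `labels` (hypothetical helper)."""
    # minlength keeps a slot for classes that are absent from this subset
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

test_props = []
for i, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(X), start=1):
    props = class_proportions(y[test_idx], n_classes)
    test_props.append(props)
    print('Split[%d] test class proportions : ' % i, props.round(2))
```

With unshuffled KFold the first and last test folds consist of a single class, which is the problem stratification addresses.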

Visualizing Splits Of KFold

Below we visualize the splits created by KFold in the previous step, using the masks recorded for each split. The Y-axis represents the split number. In the first split, the first 30 samples form the test set and the remaining 120 samples form the train set. In the next split, the next 30 samples form the test set, and so on.

In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.matshow(masks, cmap=plt.cm.Blues, fignum=1)
    plt.yticks(range(5), range(1,6))
    plt.grid(None);

StratifiedKFold

StratifiedKFold is commonly used for classification tasks. It works like KFold, except that it preserves the class distribution of the original dataset in every train/test split. If a class makes up 30% of the original dataset, it also makes up about 30% of both the train set and the test set of each split.

In [22]:
print('Classifying With Default CV (cv=5) : ', cross_val_score(KNeighborsClassifier(), X_iris, Y_iris, cv=5)) ## It uses StratifiedKFold default
print('Classifying With Stratified KFold Cross Validation : ', cross_val_score(KNeighborsClassifier(), X_iris, Y_iris, cv=StratifiedKFold(n_splits=5)))
Classifying With Default CV (cv=5) :  [0.97 1.   0.93 0.97 1.  ]
Classifying With Stratified KFold Cross Validation :  [0.97 1.   0.93 0.97 1.  ]
In [23]:
print('Classifying With Default CV (cv=5) : \n', cross_validate(KNeighborsClassifier(), X_iris, Y_iris, cv=5)) ## It uses StratifiedKFold default
print('\nClassifying With Stratified KFold Cross Validation : \n', cross_validate(KNeighborsClassifier(), X_iris, Y_iris, cv=StratifiedKFold(n_splits=5)))
Classifying With Default CV (cv=5) :
 {'fit_time': array([0., 0., 0., 0., 0.]), 'score_time': array([0., 0., 0., 0., 0.]), 'test_score': array([0.97, 1.  , 0.93, 0.97, 1.  ])}

Classifying With Stratified KFold Cross Validation :
 {'fit_time': array([0., 0., 0., 0., 0.]), 'score_time': array([0., 0., 0., 0., 0.]), 'test_score': array([0.97, 1.  , 0.93, 0.97, 1.  ])}

Given an integer cv, cross_val_score divides the dataset into that many folds (5 here) and, in each iteration, uses one fold as the test set and the remaining folds as the train set. For an integer cv it uses StratifiedKFold when the estimator is a classifier with binary or multiclass targets, and KFold otherwise, which is why both calls above return identical scores.
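
This choice of splitter can be inspected with check_cv from sklearn.model_selection. The sketch below uses made-up label arrays and passes classifier=True or False by hand to stand in for the estimator type.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y_class = np.array([0, 1, 2] * 10)    # made-up multiclass labels
y_reg = np.linspace(0.0, 1.0, 30)     # made-up continuous targets

# classifier=True mimics the case where the estimator is a classifier
cv_for_classifier = check_cv(5, y_class, classifier=True)
cv_for_regressor = check_cv(5, y_reg, classifier=False)

print(type(cv_for_classifier).__name__)
print(type(cv_for_regressor).__name__)
```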

Next we split the classification dataset with StratifiedKFold and print the class distribution of the train and test sets for each split, again as fractions of the whole dataset. Here the class distribution is preserved: every class has the same share in each train set and in each test set.

In [24]:
skfold = StratifiedKFold(n_splits=5)
masks = []
for i, (train_indexes, test_indexes) in enumerate(skfold.split(X_iris, Y_iris)):
    print('Split[%d] Train Index Distribution by class : '%(i+1),np.bincount(Y_iris[train_indexes])/len(Y_iris))
    print('Split[%d] Test Index Distribution by class : '%(i+1), np.bincount(Y_iris[test_indexes])/len(Y_iris))
    mask = np.array([(False if j in train_indexes else True) for j in range(len(Y_iris))])
    masks.append(mask)
Split[1] Train Index Distribution by class :  [0.27 0.27 0.27]
Split[1] Test Index Distribution by class :  [0.07 0.07 0.07]
Split[2] Train Index Distribution by class :  [0.27 0.27 0.27]
Split[2] Test Index Distribution by class :  [0.07 0.07 0.07]
Split[3] Train Index Distribution by class :  [0.27 0.27 0.27]
Split[3] Test Index Distribution by class :  [0.07 0.07 0.07]
Split[4] Train Index Distribution by class :  [0.27 0.27 0.27]
Split[4] Test Index Distribution by class :  [0.07 0.07 0.07]
Split[5] Train Index Distribution by class :  [0.27 0.27 0.27]
Split[5] Test Index Distribution by class :  [0.07 0.07 0.07]

Visualizing Splits Of StratifiedKFold

Below we visualize the splits created by StratifiedKFold in the previous step, using the masks recorded for each split. The Y-axis represents the split number. In the first split, 10 samples from each class (30 in total) form the test set and the remaining 120 samples form the train set, so the class proportions are preserved. Each following split takes the next 10 samples of each class as the test set, and so on.

In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.matshow(masks, cmap=plt.cm.Blues)
    plt.yticks(range(5), range(1,6))
    plt.grid(None);

ShuffleSplit

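ShuffleSplit, as its name suggests, builds each split from randomly selected indices. Each split is drawn independently, so test sets from different splits may overlap, and by default 10% of the samples go to the test set. It is commonly used for regression tasks. Because no random_state is passed below, every call draws different splits, which is why the two ShuffleSplit runs report different scores.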
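
A small sketch on 20 made-up samples shows the difference from KFold: with ShuffleSplit a sample may appear in several test sets or in none. The test_size and random_state values here are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(-1, 1)   # 20 toy samples, made up for illustration

# An integer random_state fixes the permutations so the splits are repeatable
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
all_splits = list(splitter.split(X))

test_counts = np.zeros(len(X), dtype=int)
for i, (train_idx, test_idx) in enumerate(all_splits, start=1):
    test_counts[test_idx] += 1
    print('Split[%d] test indices : ' % i, np.sort(test_idx))

# Unlike KFold, a sample may land in several test sets or in none
print('Times each sample was tested : ', test_counts)
```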

In [26]:
print('Regression With Default CV (cv=5) : ', cross_val_score(KNeighborsRegressor(), X_boston, Y_boston, cv=5)) # Default KFold CV
print('Regression With ShuffleSplit Cross Validation : ', cross_val_score(KNeighborsRegressor(), X_boston, Y_boston, cv=ShuffleSplit(n_splits=5)))
Regression With Default CV (cv=5) :  [-1.11  0.15 -0.43 -0.01 -0.17]
Regression With ShuffleSplit Cross Validation :  [0.48 0.64 0.54 0.68 0.54]
In [27]:
print('Regression With Default CV (cv=5) : \n', cross_validate(KNeighborsRegressor(), X_boston, Y_boston, cv=5)) # Default KFold CV
print('\nRegression With ShuffleSplit Cross Validation : \n', cross_validate(KNeighborsRegressor(), X_boston, Y_boston, cv=ShuffleSplit(n_splits=5)))
Regression With Default CV (cv=5) :
 {'fit_time': array([0., 0., 0., 0., 0.]), 'score_time': array([0., 0., 0., 0., 0.]), 'test_score': array([-1.11,  0.15, -0.43, -0.01, -0.17])}

Regression With ShuffleSplit Cross Validation :
 {'fit_time': array([0., 0., 0., 0., 0.]), 'score_time': array([0., 0., 0., 0., 0.]), 'test_score': array([0.24, 0.17, 0.65, 0.55, 0.53])}

Next we split the classification dataset with ShuffleSplit and again print the class distribution of the train and test sets for each split, as fractions of the whole dataset. Notice that the class distribution is not preserved: the three classes rarely have equal shares in the test set. Hence we should generally use StratifiedShuffleSplit for classification datasets and ShuffleSplit for regression datasets.

In [28]:
shuffle_split = ShuffleSplit(n_splits=5)
masks = []
for i, (train_indexes, test_indexes) in enumerate(shuffle_split.split(X_iris)):
    print('Split[%d] Train Index Distribution by class : '%(i+1),np.bincount(Y_iris[train_indexes])/len(Y_iris))
    print('Split[%d] Test Index Distribution by class : '%(i+1), np.bincount(Y_iris[test_indexes])/len(Y_iris))
    mask = np.array([(False if j in train_indexes else True) for j in range(len(Y_iris))])
    masks.append(mask)
Split[1] Train Index Distribution by class :  [0.31 0.31 0.28]
Split[1] Test Index Distribution by class :  [0.02 0.03 0.05]
Split[2] Train Index Distribution by class :  [0.29 0.29 0.31]
Split[2] Test Index Distribution by class :  [0.04 0.04 0.02]
Split[3] Train Index Distribution by class :  [0.31 0.29 0.29]
Split[3] Test Index Distribution by class :  [0.02 0.04 0.04]
Split[4] Train Index Distribution by class :  [0.3 0.3 0.3]
Split[4] Test Index Distribution by class :  [0.03 0.03 0.03]
Split[5] Train Index Distribution by class :  [0.29 0.28 0.33]
Split[5] Test Index Distribution by class :  [0.04 0.05 0.01]

Visualizing Splits Of ShuffleSplit

We can see from the visualization below that ShuffleSplit selects samples randomly, unlike KFold, which selects them serially.

In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.matshow(masks, cmap=plt.cm.Blues)
    plt.yticks(range(5), range(1,6))
    plt.grid(None);

StratifiedShuffleSplit

StratifiedShuffleSplit works like ShuffleSplit but is designed for classification tasks, where the class proportions need to be preserved after splitting the data.

In [30]:
print('Classifying With Default CV (cv=5) : ', cross_val_score(KNeighborsClassifier(), X_iris, Y_iris, cv=5)) ## It uses StratifiedKFold default
print('Classifying With StratifiedShuffleSplit Cross Validation : ', cross_val_score(KNeighborsClassifier(), X_iris, Y_iris, cv=StratifiedShuffleSplit(n_splits=5)))
Classifying With Default CV (cv=5) :  [0.97 1.   0.93 0.97 1.  ]
Classifying With StratifiedShuffleSplit Cross Validation :  [1. 1. 1. 1. 1.]
In [31]:
print('Classifying With Default CV (cv=5) : \n', cross_validate(KNeighborsClassifier(), X_iris, Y_iris, cv=5)) ## It uses StratifiedKFold default
print('\nClassifying With StratifiedShuffleSplit Cross Validation : \n', cross_validate(KNeighborsClassifier(), X_iris, Y_iris, cv=StratifiedShuffleSplit(n_splits=5)))
Classifying With Default CV (cv=5) :
 {'fit_time': array([0., 0., 0., 0., 0.]), 'score_time': array([0., 0., 0., 0., 0.]), 'test_score': array([0.97, 1.  , 0.93, 0.97, 1.  ])}

Classifying With StratifiedShuffleSplit Cross Validation :
 {'fit_time': array([0., 0., 0., 0., 0.]), 'score_time': array([0., 0., 0., 0., 0.]), 'test_score': array([0.93, 1.  , 0.87, 1.  , 1.  ])}
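
The two StratifiedShuffleSplit runs above report different scores because the splitter was created without a random_state. As a sketch, passing an integer random_state (123 is an arbitrary choice) makes repeated runs draw the same splits and therefore return the same scores.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# An integer random_state makes the splitter draw the same splits on every call
cv = StratifiedShuffleSplit(n_splits=5, random_state=123)

first_run = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)
second_run = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)

print('First run  : ', first_run)
print('Second run : ', second_run)
print('Identical  : ', np.array_equal(first_run, second_run))
```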

Next we split the classification dataset with StratifiedShuffleSplit and print the class distribution of the train and test sets for each split, as fractions of the whole dataset. Here the class distribution is preserved in both the train and the test sets.

In [32]:
shuffle_split = StratifiedShuffleSplit(n_splits=5)
masks = []
for i, (train_indexes, test_indexes) in enumerate(shuffle_split.split(X_iris, Y_iris)):
    print('Split[%d] Train Index Distribution by class : '%(i+1),np.bincount(Y_iris[train_indexes])/len(Y_iris))
    print('Split[%d] Test Index Distribution by class : '%(i+1), np.bincount(Y_iris[test_indexes])/len(Y_iris))
    mask = np.array([(False if j in train_indexes else True) for j in range(len(Y_iris))])
    masks.append(mask)
Split[1] Train Index Distribution by class :  [0.3 0.3 0.3]
Split[1] Test Index Distribution by class :  [0.03 0.03 0.03]
Split[2] Train Index Distribution by class :  [0.3 0.3 0.3]
Split[2] Test Index Distribution by class :  [0.03 0.03 0.03]
Split[3] Train Index Distribution by class :  [0.3 0.3 0.3]
Split[3] Test Index Distribution by class :  [0.03 0.03 0.03]
Split[4] Train Index Distribution by class :  [0.3 0.3 0.3]
Split[4] Test Index Distribution by class :  [0.03 0.03 0.03]
Split[5] Train Index Distribution by class :  [0.3 0.3 0.3]
Split[5] Test Index Distribution by class :  [0.03 0.03 0.03]

Visualizing Splits Of StratifiedShuffleSplit

In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.matshow(masks, cmap=plt.cm.Blues)
    plt.yticks(range(5), range(1,6))
    plt.grid(None);

sklearn also provides the validation_curve function, which takes a single hyperparameter and a list of values for it, and returns the train and test scores of every cross-validation split for each value. It is generally used for plotting.

In [ ]:
from sklearn.model_selection import validation_curve

n_neighbors = [1, 3, 5, 10, 20, 50]
train_scores, test_scores = validation_curve(KNeighborsClassifier(), X_iris, Y_iris, param_name="n_neighbors",
                                             param_range=n_neighbors, cv=StratifiedShuffleSplit(n_splits=5, random_state=123))

with plt.style.context(('seaborn', 'ggplot')):
    plt.plot(n_neighbors, train_scores.mean(axis=1), label="train accuracy")
    plt.plot(n_neighbors, test_scores.mean(axis=1), label="test accuracy")
    plt.ylabel('Accuracy')
    plt.xlabel('Number of neighbors')
    #plt.xlim([50, 0])
    plt.legend(loc="best");
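
Besides plotting, the arrays returned by validation_curve can be used directly. The following sketch, which assumes a KNeighborsClassifier on iris, prints their shape (one row per candidate value, one column per split) and picks the value with the highest mean test accuracy.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit, validation_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
n_neighbors = [1, 3, 5, 10, 20, 50]

train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name='n_neighbors', param_range=n_neighbors,
    cv=StratifiedShuffleSplit(n_splits=5, random_state=123),
)

# One row per candidate value, one column per CV split
print('Score array shapes : ', train_scores.shape, test_scores.shape)

# Average over splits, then pick the value with the highest mean test score
mean_test = test_scores.mean(axis=1)
best_k = n_neighbors[int(np.argmax(mean_test))]
print('Best n_neighbors by mean test accuracy : ', best_k)
```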

2. Hyperparameter Tuning Using GridSearch

Most machine learning models have more than one hyperparameter, and most come with default values for them. A model fitted with the defaults may not fit the data well: it can overfit or underfit. We look for a good trade-off between overfitting and underfitting by searching over a grid of hyperparameter values.

Grid search tries every combination of the values listed for each hyperparameter, records the model's performance for each combination according to an evaluation metric, and keeps track of the best-performing hyperparameters. We can implement it by hand with one nested loop per hyperparameter.
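
To see how many combinations such a search involves, the grid can be expanded with ParameterGrid from sklearn.model_selection. This sketch only enumerates the settings used in this section; it does not fit any model.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [None, 2, 3, 5],
    'max_features': ['auto', 'sqrt', 'log2'],
    'n_estimators': [10, 100],
}

# ParameterGrid expands the dictionary into every combination of values
combinations = list(ParameterGrid(param_grid))

print('Number of candidate settings : ', len(combinations))
for params in combinations[:3]:
    print(params)
```

The count is the product of the list lengths, 4 x 3 x 2 = 24, so the cost of a grid search grows quickly as hyperparameters are added.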

In [35]:
X_train, X_test, Y_train, Y_test = train_test_split(X_boston, Y_boston,
                                                    train_size=0.80,
                                                    test_size=0.20,
                                                    random_state=12)
In [36]:
from sklearn.ensemble import RandomForestRegressor

best_score = float('-inf')  # R^2 can be negative, so do not start from 0.0
best_params = {'max_depth': None, 'max_features': 'auto','n_estimators': 10}
for max_depth in [None, 2,3,5]:
    for max_features in ['auto','sqrt', 'log2']:
        for n_estimators in [10,100]:
            score = cross_val_score(RandomForestRegressor(n_estimators=n_estimators,
                                                          max_features=max_features,
                                                          max_depth=max_depth,
                                                          random_state=123
                                                          ),
                                    X_train,
                                    Y_train,
                                    cv=ShuffleSplit(n_splits=5, random_state=123),
                                    n_jobs=-1).mean()
            if score > best_score:
                best_score= score
                best_params['max_depth'],best_params['max_features'], best_params['n_estimators'] = max_depth, max_features, n_estimators

            print('max_depth : %s, max_features : %s, n_estimators : %s , Average R^2 Score : %.2f'%(str(max_depth), max_features, str(n_estimators), score))

print('\nBest Score : %.2f, Best Params : %s'%(best_score, str(best_params)))
max_depth : None, max_features : auto, n_estimators : 10 , Average R^2 Score : 0.89
max_depth : None, max_features : auto, n_estimators : 100 , Average R^2 Score : 0.90
max_depth : None, max_features : sqrt, n_estimators : 10 , Average R^2 Score : 0.85
max_depth : None, max_features : sqrt, n_estimators : 100 , Average R^2 Score : 0.88
max_depth : None, max_features : log2, n_estimators : 10 , Average R^2 Score : 0.85
max_depth : None, max_features : log2, n_estimators : 100 , Average R^2 Score : 0.88
max_depth : 2, max_features : auto, n_estimators : 10 , Average R^2 Score : 0.68
max_depth : 2, max_features : auto, n_estimators : 100 , Average R^2 Score : 0.68
max_depth : 2, max_features : sqrt, n_estimators : 10 , Average R^2 Score : 0.57
max_depth : 2, max_features : sqrt, n_estimators : 100 , Average R^2 Score : 0.60
max_depth : 2, max_features : log2, n_estimators : 10 , Average R^2 Score : 0.57
max_depth : 2, max_features : log2, n_estimators : 100 , Average R^2 Score : 0.60
max_depth : 3, max_features : auto, n_estimators : 10 , Average R^2 Score : 0.81
max_depth : 3, max_features : auto, n_estimators : 100 , Average R^2 Score : 0.83
max_depth : 3, max_features : sqrt, n_estimators : 10 , Average R^2 Score : 0.76
max_depth : 3, max_features : sqrt, n_estimators : 100 , Average R^2 Score : 0.73
max_depth : 3, max_features : log2, n_estimators : 10 , Average R^2 Score : 0.76
max_depth : 3, max_features : log2, n_estimators : 100 , Average R^2 Score : 0.73
max_depth : 5, max_features : auto, n_estimators : 10 , Average R^2 Score : 0.87
max_depth : 5, max_features : auto, n_estimators : 100 , Average R^2 Score : 0.89
max_depth : 5, max_features : sqrt, n_estimators : 10 , Average R^2 Score : 0.78
max_depth : 5, max_features : sqrt, n_estimators : 100 , Average R^2 Score : 0.81
max_depth : 5, max_features : log2, n_estimators : 10 , Average R^2 Score : 0.78
max_depth : 5, max_features : log2, n_estimators : 100 , Average R^2 Score : 0.81

Best Score : 0.90, Best Params : {'max_depth': None, 'max_features': 'auto', 'n_estimators': 100}
In [38]:
rf_best = RandomForestRegressor(**best_params)
rf_best.fit(X_train, Y_train)

print("Test R^2 Score : ", rf_best.score(X_test, Y_test))
Test R^2 Score :  0.8749413705189064

GridSearchCV

sklearn provides the GridSearchCV class, which takes the hyperparameters and their candidate values as a dictionary, tries every combination on the model, and keeps track of the results for each cross-validation split.

In [39]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

grid = GridSearchCV(RandomForestRegressor(random_state=123),
                    param_grid = {'max_depth': [None, 2,3,5], 'max_features' : ['auto','sqrt', 'log2'], 'n_estimators': [10,100],},
                    cv = ShuffleSplit(n_splits=5, random_state=123),
                    verbose=50,
                    n_jobs=-1)

grid.fit(X_train, Y_train)

print('\nBest R^2 Score : %.2f'%grid.best_score_, ' Best Params : ', str(grid.best_params_))
Fitting 5 folds for each of 24 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0339s.) Setting batch_size=10.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done   3 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done   6 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done   7 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done  58 tasks      | elapsed:    1.2s
[Parallel(n_jobs=-1)]: Done  68 out of 120 | elapsed:    1.3s remaining:    1.0s
[Parallel(n_jobs=-1)]: Done 110 out of 120 | elapsed:    1.8s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:    1.9s finished

Best R^2 Score : 0.90  Best Params :  {'max_depth': None, 'max_features': 'auto', 'n_estimators': 100}

The grid object also records every hyperparameter setting tried, along with its score on each cross-validation split, the mean and standard deviation of the scores, and the mean and standard deviation of the fit and score times. It also ranks the settings by performance, with the best one ranked 1, the next 2, and so on.

In [40]:
grid.cv_results_.keys()
Out[40]:
dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_max_depth', 'param_max_features', 'param_n_estimators', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])
In [41]:
pd.DataFrame(grid.cv_results_)[['param_max_depth', 'param_max_features', 'param_n_estimators','mean_test_score', 'rank_test_score']]
Out[41]:
    param_max_depth  param_max_features  param_n_estimators  mean_test_score  rank_test_score
 0             None                auto                  10         0.890995                2
 1             None                auto                 100         0.902970                1
 2             None                sqrt                  10         0.848199                7
 3             None                sqrt                 100         0.875427                4
 4             None                log2                  10         0.848199                7
 5             None                log2                 100         0.875427                4
 6                2                auto                  10         0.684550               19
 7                2                auto                 100         0.681664               20
 8                2                sqrt                  10         0.566652               23
 9                2                sqrt                 100         0.598139               21
10                2                log2                  10         0.566652               23
11                2                log2                 100         0.598139               21
12                3                auto                  10         0.812761               10
13                3                auto                 100         0.825652                9
14                3                sqrt                  10         0.757299               15
15                3                sqrt                 100         0.727955               17
16                3                log2                  10         0.757299               15
17                3                log2                 100         0.727955               17
18                5                auto                  10         0.873375                6
19                5                auto                 100         0.885054                3
20                5                sqrt                  10         0.783386               13
21                5                sqrt                 100         0.811359               11
22                5                log2                  10         0.783386               13
23                5                log2                 100         0.811359               11

The grid object also exposes the best model as the best_estimator_ attribute so that it can be used for prediction.

In [42]:
grid.best_estimator_
Out[42]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=123, verbose=0,
                      warm_start=False)
In [43]:
print('First Few preds : ', grid.predict(X_boston)[:5])
print('Actual Values   : ', Y_boston[:5])
First Few preds :  [25.36 23.09 35.49 33.91 35.45]
Actual Values   :  [24.  21.6 34.7 33.4 36.2]
In [44]:
print("Test R^2 Score : ", grid.score(X_test, Y_test))
Test R^2 Score :  0.8723191006047755
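
Calling predict or score on the search object, as done above, forwards the call to best_estimator_, which is refitted on the whole training set when refit is left at its default of True. The sketch below checks this on a small, assumed example (KNeighborsClassifier on iris rather than the Boston data).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12, stratify=y)

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={'n_neighbors': [1, 3, 5, 10]},
                      cv=5)   # refit=True by default
search.fit(X_train, y_train)

# predict on the search object is forwarded to best_estimator_
same_predictions = np.array_equal(search.predict(X_test),
                                  search.best_estimator_.predict(X_test))
print('Best params : ', search.best_params_)
print('Search and best_estimator_ agree : ', same_predictions)
```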

RandomizedSearchCV

RandomizedSearchCV is another approach to hyperparameter tuning. Unlike GridSearchCV, which tries every parameter setting passed to it, RandomizedSearchCV tries only a fixed number of settings sampled from the search space. The n_iter parameter (an integer) sets how many settings it samples. Below we demonstrate it on the Boston housing dataset, using the train/test split created in the GridSearchCV section.

In [45]:
from sklearn.model_selection import RandomizedSearchCV

grid = RandomizedSearchCV(RandomForestRegressor(random_state=123), n_iter=5,
                    param_distributions = {'max_depth': [None, 2,3,5], 'max_features' : ['auto','sqrt', 'log2'], 'n_estimators': [10,100],},
                    cv = ShuffleSplit(n_splits=5, random_state=123),
                    verbose=50,
                    n_jobs=-1)

grid.fit(X_train, Y_train)

print('\nBest R^2 Score : %.2f'%grid.best_score_, ' Best Params : ', str(grid.best_params_))
Fitting 5 folds for each of 5 candidates, totalling 25 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0148s.) Setting batch_size=26.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done   3 out of  25 | elapsed:    0.0s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done   4 out of  25 | elapsed:    0.0s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done   5 out of  25 | elapsed:    0.0s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done   6 out of  25 | elapsed:    0.0s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done   7 out of  25 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done   8 out of  25 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    0.2s finished

Best R^2 Score : 0.85  Best Params :  {'n_estimators': 10, 'max_features': 'log2', 'max_depth': None}

We can see from the above output that, although the search space contains 24 parameter settings, only 5 of them are tried. The log reports 25 fits because each setting is cross-validated on 5 splits.

Below we print the results of each parameter setting as a pandas DataFrame.

In [46]:
pd.DataFrame(grid.cv_results_)[['param_max_depth', 'param_max_features', 'param_n_estimators','mean_test_score', 'rank_test_score']]
Out[46]:
    param_max_depth  param_max_features  param_n_estimators  mean_test_score  rank_test_score
 0                3                sqrt                  10         0.757299                4
 1                5                log2                  10         0.783386                3
 2                2                sqrt                  10         0.566652                5
 3             None                log2                  10         0.848199                1
 4                3                auto                  10         0.812761                2
In [47]:
grid.best_estimator_
Out[47]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='log2', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=123, verbose=0,
                      warm_start=False)
In [48]:
print('First Few preds : ', grid.predict(X_boston)[:5])
print('Actual Values   : ', Y_boston[:5])
First Few preds :  [23.9  25.11 37.14 33.9  34.87]
Actual Values   :  [24.  21.6 34.7 33.4 36.2]
In [49]:
print("Test R^2 Score : ", grid.score(X_test, Y_test))
Test R^2 Score :  0.8680488725558099
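
RandomizedSearchCV also accepts distributions instead of lists, in which case each setting is sampled from the distribution. The sketch below is an assumed example on iris with KNeighborsClassifier; scipy.stats.randint and the chosen ranges are illustrative and not part of the search shown above.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={
        'n_neighbors': randint(1, 30),         # integers drawn from 1..29
        'weights': ['uniform', 'distance'],    # lists are sampled uniformly
    },
    n_iter=5,          # number of parameter settings to sample
    cv=5,
    random_state=123,  # makes the sampled settings repeatable
)
search.fit(X, y)

print('Settings tried : ', len(search.cv_results_['params']))
print('Best params : ', search.best_params_)
```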

This ends our small tutorial on cross-validation and hyperparameter tuning with grid search using scikit-learn. Please feel free to let us know your views in the comments section.

Sunny Solanki