Scikit-Learn - Ensemble Learning : Bootstrap Aggregation(Bagging) & Random Forests

Introduction

We discussed decision trees in depth in a separate tutorial. We saw there that a single decision tree overfits the train data very easily, so it is usually better to combine many decision trees to make a decision. The basic idea is that multiple overfitting estimators can be combined to reduce the effect of overfitting and produce predictions that generalize well. This idea is generally referred to as ensemble learning in the machine learning community.

There are two main ways to combine decision trees to make better decisions:

  • Averaging (Bootstrap Aggregation - Bagging & Random Forests) - We create many individual estimators and average their predictions to make the final prediction. Averaging estimators reduces variance and hence helps avoid overfitting.
  • Boosting - Base estimators are trained sequentially, each one trying to reduce the bias of the combined estimator, which helps avoid underfitting. The main idea is to combine several weak estimators into one powerful estimator.
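
The variance-reduction claim behind averaging can be illustrated with a small simulation. The sketch below is not part of the original notebook: it assumes each estimator's error is independent, zero-mean noise with unit variance (real trees trained on overlapping data are correlated, so the gain is smaller in practice), and the names true_value, n_estimators and n_trials are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
true_value = 5.0                 # hypothetical quantity every estimator tries to predict
n_estimators, n_trials = 25, 20000

# Each row is one trial; each column is one estimator's noisy prediction.
# Assumption: errors are independent with variance 1.
preds = true_value + rng.normal(scale=1.0, size=(n_trials, n_estimators))

single_var = preds[:, 0].var()           # variance of one estimator alone
averaged_var = preds.mean(axis=1).var()  # variance of the averaged prediction

print("Variance of a single estimator  : %.3f" % single_var)
print("Variance of the averaged output : %.3f" % averaged_var)
print("1 / n_estimators                : %.3f" % (1.0 / n_estimators))
```

Under the independence assumption, the variance of the average shrinks roughly in proportion to the number of estimators, which is the effect bagging tries to exploit.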

In this tutorial, we'll discuss bagging and random forests. We'll cover boosting in depth in a separate tutorial.

In [1]:
import numpy as np
import pandas as pd

import sklearn
from sklearn import ensemble, datasets, tree
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import GridSearchCV

import sys
import warnings

warnings.filterwarnings("ignore")

print("Python Version : ",sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)
Python Version :  3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
Scikit-Learn Version :  0.21.2

Bootstrap Aggregation (Bagging)

Bagging draws many sub-samples from the original data with replacement and trains a separate decision tree on each of them. When a prediction is needed for new data, it takes a vote over (classification) or averages (regression) the predictions of the individual trees. The basic idea is to reduce overfitting (high variance) by introducing some randomization.

Scikit-Learn provides BaggingRegressor and BaggingClassifier.
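
The procedure described above can be sketched by hand in a few lines. This is a minimal illustration, not how BaggingRegressor is implemented internally: it assumes synthetic data from make_regression, a hypothetical ensemble size n_trees, and plain decision trees as base estimators.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used here only for illustration.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_fit, y_fit, X_new = X[:250], y[:250], X[250:]

rng = np.random.RandomState(0)
n_trees = 25  # hypothetical ensemble size
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.randint(0, len(X_fit), size=len(X_fit))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_fit[idx], y_fit[idx]))

# Aggregate: for regression, average the per-tree predictions.
per_tree_preds = np.stack([t.predict(X_new) for t in trees])
bagged_preds = per_tree_preds.mean(axis=0)

print("Per-tree predictions shape :", per_tree_preds.shape)
print("Bagged predictions shape   :", bagged_preds.shape)
```

For classification, the averaging step would be replaced by a vote over the predicted classes.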

BaggingRegressor

We'll be explaining the usage of BaggingRegressor by using the Boston housing data set. We'll first train the model with default parameters and then do hyper-parameter tuning. We'll also be comparing the performance of tuned bagging estimator with decision tree and extra tree estimator of scikit-learn.

Load BOSTON Housing Dataset

In [2]:
from sklearn import datasets

boston = datasets.load_boston()
X_boston, Y_boston = boston.data, boston.target
print('Dataset features names : '+str(boston.feature_names))
print('Dataset features size : '+str(boston.data.shape))
print('Dataset target size : '+str(boston.target.shape))
Dataset features names : ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
Dataset features size : (506, 13)
Dataset target size : (506,)

Splitting Dataset into Train & Test sets

We'll split the dataset into two parts:

  • Training data, which will be used to train the model.
  • Test data, against which the performance of the trained model will be checked.

The train_test_split function of the model_selection module of sklearn splits the data into two sets, with 80% for training and 20% for testing. We also pass a seed (random_state=123) to train_test_split so that we always get the same split and can reproduce the results later.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_boston, Y_boston , train_size=0.80, test_size=0.20, random_state=123)
print('Train/Test Sets Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
Train/Test Sets Sizes :  (404, 13) (102, 13) (404,) (102,)

Fitting Model To Train Data

We can call the fit() method on the estimator, passing it the train features and the train target. It then trains a model on that data.

In [4]:
from sklearn.ensemble import BaggingRegressor

bag_regressor = BaggingRegressor(random_state=1)
bag_regressor.fit(X_train, Y_train)
Out[4]:
BaggingRegressor(base_estimator=None, bootstrap=True, bootstrap_features=False,
                 max_features=1.0, max_samples=1.0, n_estimators=10,
                 n_jobs=None, oob_score=False, random_state=1, verbose=0,
                 warm_start=False)

Evaluating Trained Model On Test Data.

Almost all models in the Scikit-Learn API provide a predict() method, which predicts the target variable for the test set passed to it. For regressors, the score() method returns the coefficient of determination (R^2).

In [5]:
Y_preds = bag_regressor.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Training Coefficient of R^2 : %.3f'%bag_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%bag_regressor.score(X_test, Y_test))
[21.49 25.87 48.19 21.28 29.36 41.65 24.    8.18 19.12 31.  ]
[15.  26.6 45.4 20.8 34.9 21.9 28.7  7.2 20.  32.2]
Training Coefficient of R^2 : 0.980
Test Coefficient of R^2 : 0.812

Finetuning Model By Doing Grid Search On Various Hyperparameters.

Below is a list of common hyperparameters that usually need tuning to get the best fit for our data. We'll try various hyperparameter settings on different cross-validation splits of the train data to find the setting that generalizes best, ideally one whose train and test scores are close to each other.

  • base_estimator (object or None) - Base estimator of which many instances will be created. If None is provided, a decision tree is used as the base estimator. default=None
  • n_estimators (int) - Number of base estimators whose results will be combined to produce the final prediction. default=10
  • bootstrap (bool) - Decides whether samples are drawn with replacement. True = with replacement. False = without replacement. default=True
  • bootstrap_features (bool) - Decides whether features are drawn with replacement. True = with replacement. False = without replacement. default=False
  • max_samples (int/float) - Number of samples to draw from the train data to train each base estimator. It accepts an int in [1, n_samples] or a float in (0.0, 1.0], which is treated as a fraction of n_samples. default=1.0
  • max_features (int/float) - Number of features to draw from the train data to train each base estimator. It accepts an int in [1, n_features] or a float in (0.0, 1.0], which is treated as a fraction of n_features. default=1.0
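
To see what max_samples, max_features, bootstrap and bootstrap_features actually do, we can inspect which rows and columns each fitted base estimator received. The sketch below is an illustration on synthetic data (make_regression), with arbitrarily chosen values; it relies on the estimators_samples_ and estimators_features_ attributes of a fitted bagging estimator.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Synthetic data, used here only for illustration.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Half of the rows (drawn with replacement) and 4 distinct features per estimator.
bag = BaggingRegressor(n_estimators=5, max_samples=0.5, max_features=4,
                       bootstrap=True, bootstrap_features=False, random_state=1)
bag.fit(X, y)

for rows, cols in zip(bag.estimators_samples_, bag.estimators_features_):
    print(len(rows), "rows drawn,", len(np.unique(rows)), "distinct | features:",
          sorted(cols.tolist()))
```

Because bootstrap=True, the number of distinct rows is usually smaller than the number of rows drawn; because bootstrap_features=False, the four feature indices never repeat within one estimator.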

GridSearchCV

It's a wrapper class provided by sklearn that loops through all combinations of the parameter values given in the param_grid parameter, evaluates each combination using the number of cross-validation folds given by the cv parameter, and stores all results in the cv_results_ attribute. It also stores the model with the best mean cross-validation score, refit on the full train data, in the best_estimator_ attribute, and that best score in the best_score_ attribute.
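
A small, self-contained run makes the bookkeeping concrete. The sketch below is an illustration only: the data is synthetic (make_regression) and the grid is a deliberately tiny, hypothetical one. It uses ParameterGrid to count the candidate settings before fitting.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Synthetic data, used here only for illustration.
X, y = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=0)

# Hypothetical small grid: 2 * 2 * 2 = 8 candidate settings.
param_grid = {"n_estimators": [5, 10],
              "max_samples": [0.5, 1.0],
              "bootstrap": [True, False]}
n_candidates = len(ParameterGrid(param_grid))

grid = GridSearchCV(BaggingRegressor(random_state=1), param_grid=param_grid, cv=3)
grid.fit(X, y)

print("Candidates          :", n_candidates)
print("Rows in cv_results_ :", len(grid.cv_results_["params"]))
print("Best params         :", grid.best_params_)
print("Best mean CV score  : %.3f" % grid.best_score_)
```

The same arithmetic applies to larger grids: six value lists of sizes 3, 3, 3, 3, 2 and 2 give 324 candidates, and with 3 folds that means 972 fits.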

NOTE

The n_jobs parameter is provided by many estimators. It sets the number of cores to use for parallelization; a value of -1 uses all cores. Scikit-Learn uses the joblib parallel processing library to run the work in parallel in the background.

Below we try various values for the above-mentioned hyperparameters to find the best estimator for our dataset, using 3-fold cross-validation.

In [6]:
%%time

n_samples = boston.data.shape[0]
n_features = boston.data.shape[1]

params = {'base_estimator': [None, LinearRegression(), KNeighborsRegressor()],
          'n_estimators': [20,50,100],
          'max_samples': [0.5,1.0, n_samples//2,],
          'max_features': [0.5,1.0, n_features//2,],
          'bootstrap': [True, False],
          'bootstrap_features': [True, False]}

bagging_regressor_grid = GridSearchCV(BaggingRegressor(random_state=1, n_jobs=-1), param_grid =params, cv=3, n_jobs=-1, verbose=1)
bagging_regressor_grid.fit(X_train, Y_train)

print('Train R^2 Score : %.3f'%bagging_regressor_grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%bagging_regressor_grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%bagging_regressor_grid.best_score_)
print('Best Parameters : ',bagging_regressor_grid.best_params_)
Fitting 3 folds for each of 324 candidates, totalling 972 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.2s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   15.8s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   36.2s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  1.0min
Train R^2 Score : 0.983
Test R^2 Score : 0.802
Best R^2 Score Through Grid Search : 0.870
Best Parameters :  {'base_estimator': None, 'bootstrap': True, 'bootstrap_features': False, 'max_features': 1.0, 'max_samples': 1.0, 'n_estimators': 50}
CPU times: user 2.21 s, sys: 199 ms, total: 2.41 s
Wall time: 1min 11s
[Parallel(n_jobs=-1)]: Done 972 out of 972 | elapsed:  1.2min finished

Printing First Few Cross-Validation Results

GridSearchCV keeps the results for every parameter combination tried on every cross-validation split. We can access them as a dictionary through the cv_results_ attribute. We convert it to a pandas dataframe for easier viewing.

In [7]:
cross_val_results = pd.DataFrame(bagging_regressor_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 324
Out[7]:
|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_base_estimator | param_bootstrap | param_bootstrap_features | param_max_features | param_max_samples | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.102222 | 0.000492 | 0.102727 | 0.000575 | None | True | True | 0.5 | 0.5 | 20 | {'base_estimator': None, 'bootstrap': True, 'b... | 0.807748 | 0.818529 | 0.761464 | 0.795999 | 0.024725 | 92 |
| 1 | 0.176272 | 0.052850 | 0.102964 | 0.000274 | None | True | True | 0.5 | 0.5 | 50 | {'base_estimator': None, 'bootstrap': True, 'b... | 0.834376 | 0.818371 | 0.769611 | 0.807546 | 0.027514 | 74 |
| 2 | 0.272792 | 0.049073 | 0.103111 | 0.000387 | None | True | True | 0.5 | 0.5 | 100 | {'base_estimator': None, 'bootstrap': True, 'b... | 0.834629 | 0.826096 | 0.768551 | 0.809860 | 0.029310 | 69 |
| 3 | 0.104769 | 0.002391 | 0.103690 | 0.001265 | None | True | True | 0.5 | 1 | 20 | {'base_estimator': None, 'bootstrap': True, 'b... | 0.799064 | 0.809131 | 0.777896 | 0.795407 | 0.013005 | 94 |
| 4 | 0.206177 | 0.000921 | 0.103582 | 0.000535 | None | True | True | 0.5 | 1 | 50 | {'base_estimator': None, 'bootstrap': True, 'b... | 0.825602 | 0.829979 | 0.784795 | 0.813530 | 0.020322 | 64 |

Comparing Performance Of Bagging With Decision Tree/Extra Tree

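Below we compare several bagging regressors with a single decision tree and a single extra tree. The "Accuracy" printed for each regressor is the R^2 value returned by score(). We can see that the bagging estimators overfit far less than the single trees: their train and test scores are much closer to each other.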

In [8]:
## Bagging Regressor with Default Params
bag_regressor = ensemble.BaggingRegressor(random_state=1)
bag_regressor.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_regressor.__class__.__name__,
                                                     bag_regressor.score(X_train, Y_train),bag_regressor.score(X_test, Y_test)))

## Bagging Regressor with KNeighborsRegressor as base estimator
bag_regressor = ensemble.BaggingRegressor(base_estimator=KNeighborsRegressor(), random_state=1)
bag_regressor.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_regressor.__class__.__name__,
                                                          bag_regressor.score(X_train, Y_train),bag_regressor.score(X_test, Y_test)))

## Bagging Regressor with the hyperparameters found by the grid search above
bag_regressor = ensemble.BaggingRegressor(random_state=1, **bagging_regressor_grid.best_params_)
bag_regressor.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_regressor.__class__.__name__,
                                                     bag_regressor.score(X_train, Y_train),bag_regressor.score(X_test, Y_test)))

## Decision Tree with Default Parameters
dtree_regressor = tree.DecisionTreeRegressor(random_state=1)
dtree_regressor.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(dtree_regressor.__class__.__name__,
                                                     dtree_regressor.score(X_train, Y_train),dtree_regressor.score(X_test, Y_test)))

## Extra Tree with Default Parameters
extra_tree_regressor = tree.ExtraTreeRegressor(random_state=1)
extra_tree_regressor.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(extra_tree_regressor.__class__.__name__,
                                                     extra_tree_regressor.score(X_train, Y_train),extra_tree_regressor.score(X_test, Y_test)))
BaggingRegressor : Train Accuracy : 0.98, Test Accuracy : 0.81
BaggingRegressor : Train Accuracy : 0.69, Test Accuracy : 0.58
BaggingRegressor : Train Accuracy : 0.98, Test Accuracy : 0.80
DecisionTreeRegressor : Train Accuracy : 1.00, Test Accuracy : 0.44
ExtraTreeRegressor : Train Accuracy : 1.00, Test Accuracy : 0.51

BaggingClassifier

We'll explain the usage of BaggingClassifier using the digits data set. We'll first train the model with default parameters and then do hyper-parameter tuning. We'll also compare the performance of the tuned bagging estimator with the decision tree and extra tree estimators of scikit-learn.

Load DIGITS Dataset

In [9]:
digits = datasets.load_digits()
X_digits, Y_digits = digits.data, digits.target
print('Dataset Size : ', X_digits.shape, Y_digits.shape)
Dataset Size :  (1797, 64) (1797,)

Splitting Dataset into Train & Test sets

Below we split the digits dataset into a train set (80%) and a test set (20%). We also pass a seed (random_state=123) so that we always get the same split and can reproduce the results later.

NOTE

Please note that we also use the stratify parameter, which keeps the class distribution the same in the train and test sets. For each class, roughly 80% of the samples go to the train set and 20% to the test set. This keeps any class from being over- or under-represented in either set.

In [10]:
X_train, X_test, Y_train, Y_test = train_test_split(X_digits, Y_digits, train_size=0.80, test_size=0.20, stratify=Y_digits, random_state=123)
print('Train/Test Set Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
Train/Test Set Sizes :  (1437, 64) (360, 64) (1437,) (360,)
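
The effect of stratify mentioned in the note above can be checked directly by comparing class proportions. The sketch below repeats the same split on the digits data and prints, for each digit, its share of the full data, the train set and the test set; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
_, _, y_tr, y_te = train_test_split(X, y, train_size=0.80, test_size=0.20,
                                    stratify=y, random_state=123)

# Fraction of each digit class in the full data, the train set and the test set.
full = np.bincount(y) / len(y)
train = np.bincount(y_tr) / len(y_tr)
test = np.bincount(y_te) / len(y_te)

for digit in range(10):
    print("digit %d : full=%.3f train=%.3f test=%.3f"
          % (digit, full[digit], train[digit], test[digit]))
```

With stratification, the three columns should agree closely for every digit; without it, the test-set proportions can drift from the full-data proportions by chance.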

Fitting Model To Train Data

In [11]:
from sklearn.ensemble import BaggingClassifier

bag_classifier = BaggingClassifier(random_state=1)
bag_classifier.fit(X_train, Y_train)
Out[11]:
BaggingClassifier(base_estimator=None, bootstrap=True, bootstrap_features=False,
                  max_features=1.0, max_samples=1.0, n_estimators=10,
                  n_jobs=None, oob_score=False, random_state=1, verbose=0,
                  warm_start=False)

Evaluating Trained Model On Test Data.

Almost all models in the Scikit-Learn API provide a predict() method, which predicts the target variable for the test set passed to it. For classifiers, the score() method returns the mean accuracy.

In [12]:
Y_preds = bag_classifier.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean())
print('Test Accuracy : %.3f'%bag_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%bag_classifier.score(X_train, Y_train))
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
Test Accuracy : 0.956
Test Accuracy : 0.956
Training Accuracy : 0.999

Finetuning Model By Doing Grid Search On Various Hyperparameters.

BaggingClassifier exposes the same tunable parameters as BaggingRegressor.

In [13]:
%%time

n_samples = digits.data.shape[0]
n_features = digits.data.shape[1]

params = {'base_estimator': [None, LogisticRegression(), KNeighborsClassifier()],
          'n_estimators': [20,50,100],
          'max_samples': [0.5, 1.0, n_samples//2, ],
          'max_features': [0.5, 1.0, n_features//2, ],
          'bootstrap': [True, False],
          'bootstrap_features': [True, False]}

bagging_classifier_grid = GridSearchCV(BaggingClassifier(random_state=1, n_jobs=-1), param_grid =params, cv=3, n_jobs=-1, verbose=1)
bagging_classifier_grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%bagging_classifier_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%bagging_classifier_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%bagging_classifier_grid.best_score_)
print('Best Parameters : ',bagging_classifier_grid.best_params_)
Fitting 3 folds for each of 324 candidates, totalling 972 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   18.2s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-1)]: Done 972 out of 972 | elapsed:  7.7min finished
Train Accuracy : 0.995
Test Accuracy : 0.989
Best Accuracy Through Grid Search : 0.984
Best Parameters :  {'base_estimator': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform'), 'bootstrap': True, 'bootstrap_features': True, 'max_features': 0.5, 'max_samples': 1.0, 'n_estimators': 50}
CPU times: user 1.96 s, sys: 448 ms, total: 2.41 s
Wall time: 7min 46s

Printing First Few Cross Validation Results

In [14]:
cross_val_results = pd.DataFrame(bagging_classifier_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 324
Out[14]:
|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_base_estimator | param_bootstrap | param_bootstrap_features | param_max_features | param_max_samples | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.207989 | 0.000359 | 0.102366 | 0.000742 | None | True | True | 0.5 | 0.5 | 20 | {'base_estimator': None, 'bootstrap': True, 'b... | 0.948347 | 0.916318 | 0.947368 | 0.937370 | 0.014868 | 301 |
| 1 | 0.277387 | 0.048700 | 0.103260 | 0.001007 | None | True | True | 0.5 | 0.5 | 50 | {'base_estimator': None, 'bootstrap': True, 'b... | 0.960744 | 0.930962 | 0.966316 | 0.952679 | 0.015500 | 187 |
| 2 | 0.374678 | 0.050667 | 0.105099 | 0.004744 | None | True | True | 0.5 | 0.5 | 100 | {'base_estimator': None, 'bootstrap': True, 'b... | 0.969008 | 0.937238 | 0.962105 | 0.956159 | 0.013652 | 157 |
| 3 | 0.136924 | 0.048241 | 0.101918 | 0.000354 | None | True | True | 0.5 | 1 | 20 | {'base_estimator': None, 'bootstrap': True, 'b... | 0.956612 | 0.943515 | 0.955789 | 0.951983 | 0.005988 | 193 |
| 4 | 0.269504 | 0.047858 | 0.108643 | 0.005364 | None | True | True | 0.5 | 1 | 50 | {'base_estimator': None, 'bootstrap': True, 'b... | 0.962810 | 0.958159 | 0.964211 | 0.961726 | 0.002582 | 129 |

Comparing Performance Of Bagging With Decision Tree/Extra Tree

In [15]:
bag_classifier = ensemble.BaggingClassifier(random_state=1)
bag_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_classifier.__class__.__name__,
                                                     bag_classifier.score(X_train, Y_train),bag_classifier.score(X_test, Y_test)))

bag_classifier = ensemble.BaggingClassifier(base_estimator=KNeighborsClassifier(), random_state=1)
bag_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_classifier.__class__.__name__,
                                                     bag_classifier.score(X_train, Y_train),bag_classifier.score(X_test, Y_test)))

bag_classifier = ensemble.BaggingClassifier(random_state=1, **bagging_classifier_grid.best_params_)
bag_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_classifier.__class__.__name__,
                                                     bag_classifier.score(X_train, Y_train),bag_classifier.score(X_test, Y_test)))

dtree_classifier = tree.DecisionTreeClassifier(random_state=1)
dtree_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(dtree_classifier.__class__.__name__,
                                                     dtree_classifier.score(X_train, Y_train),dtree_classifier.score(X_test, Y_test)))

extra_tree_classifier = tree.ExtraTreeClassifier(random_state=1)
extra_tree_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(extra_tree_classifier.__class__.__name__,
                                                     extra_tree_classifier.score(X_train, Y_train),extra_tree_classifier.score(X_test, Y_test)))
BaggingClassifier : Train Accuracy : 1.00, Test Accuracy : 0.96
BaggingClassifier : Train Accuracy : 0.99, Test Accuracy : 0.99
BaggingClassifier : Train Accuracy : 1.00, Test Accuracy : 0.99
DecisionTreeClassifier : Train Accuracy : 1.00, Test Accuracy : 0.88
ExtraTreeClassifier : Train Accuracy : 1.00, Test Accuracy : 0.82

Random Forests

Random forests are a refinement of bagging. Combining predictions from many decision trees works best when the predictions of those trees are as uncorrelated as possible, so that, in a sense, each tree handles some part of the problem better than the other trees. The problem with plain bagging is that every tree is grown greedily, like a single decision tree: at each node it picks the best split among all features. If a few features are strong predictors, most trees split on them first, and the trees end up making highly correlated predictions. Random forests change the split rule to counter this. When a node is split during the construction of a tree, the split is not chosen as the best among all features. Instead, the split that is picked is the best one within a random subset of the features.

Random forests still average (or vote over) the results of the individual trees when predicting, just as bagging does. The difference lies in training: restricting each split to a random subset of features produces trees that are less correlated with each other.
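
One way to make the relationship concrete is to view a random forest as bagging over trees that look at only a random subset of features at each split. The sketch below is a rough illustration under that assumption, not a claim about scikit-learn internals; the ensemble size and the choice of max_features="sqrt" are arbitrary. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

models = {
    # Plain bagging: every split searches all 64 features.
    "bagging (all features per split)": BaggingClassifier(
        DecisionTreeClassifier(random_state=1), n_estimators=30, random_state=1),
    # Bagging over trees that consider a random sqrt(n_features) subset per split:
    # a hand-built approximation of a random forest.
    "bagging (random subset per split)": BaggingClassifier(
        DecisionTreeClassifier(max_features="sqrt", random_state=1),
        n_estimators=30, random_state=1),
    # The dedicated estimator.
    "random forest": RandomForestClassifier(
        n_estimators=30, max_features="sqrt", random_state=1),
}

scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=3).mean()
    print("%-35s : mean CV accuracy %.3f" % (name, scores[name]))
```

How much the per-split feature sampling helps depends on the dataset, so the numbers printed here should be read as one data point rather than a general ranking.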

Extremely Randomized Trees

Scikit-Learn also provides a further-randomized variant of random forests, extremely randomized trees (ExtraTreesRegressor and ExtraTreesClassifier). As in random forests, a random subset of candidate features is used, but instead of searching for the most discriminative threshold, a threshold is drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule.
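
The difference between the two split rules can be sketched on a single node. The code below is a simplified illustration written for this tutorial, not scikit-learn's implementation: it assumes a squared-error criterion, synthetic data, and hypothetical helper names (sse_after_split, best_exhaustive_threshold).

```python
import numpy as np

def sse_after_split(x, y, threshold):
    """Sum of squared errors of the two children produced by x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    return sum(((part - part.mean()) ** 2).sum() for part in (left, right) if len(part))

def best_exhaustive_threshold(x, y):
    """Try every midpoint between consecutive distinct values of x."""
    xs = np.unique(x)
    thresholds = (xs[:-1] + xs[1:]) / 2.0
    errors = [sse_after_split(x, y, t) for t in thresholds]
    best = int(np.argmin(errors))
    return thresholds[best], errors[best]

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 5))                         # synthetic features
y = 3.0 * (X[:, 2] > 4.0) + rng.normal(scale=0.3, size=200)   # target driven by feature 2

# Both rules start from the same random subset of candidate features.
candidate_features = rng.choice(X.shape[1], size=3, replace=False)

# Random-forest style: best threshold per candidate feature, then best feature.
rf_split = min((best_exhaustive_threshold(X[:, f], y)[1], int(f))
               for f in candidate_features)

# Extra-trees style: one random threshold per candidate feature, then best feature.
et_candidates = []
for f in candidate_features:
    t = rng.uniform(X[:, f].min(), X[:, f].max())
    et_candidates.append((sse_after_split(X[:, f], y, t), int(f)))
et_split = min(et_candidates)

print("Exhaustive search : feature %d, error %.2f" % (rf_split[1], rf_split[0]))
print("Random thresholds : feature %d, error %.2f" % (et_split[1], et_split[0]))
```

The random-threshold split is never better than the exhaustive one on the same node, but it is much cheaper to compute and injects extra randomness, which further decorrelates the trees.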

RandomForestRegressor

We'll be explaining the usage of RandomForestRegressor by using the Boston housing data set. We'll first train the model with default parameters and then do hyper-parameter tuning.

Train/Test Split Boston Dataset

In [16]:
X_train, X_test, Y_train, Y_test = train_test_split(X_boston, Y_boston, train_size=0.80, test_size=0.20, random_state=123)
print('Train/Test Sets Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
Train/Test Sets Sizes :  (404, 13) (102, 13) (404,) (102,)

Fitting Model To Train Data

In [17]:
from sklearn.ensemble import RandomForestRegressor

rforest_regressor = RandomForestRegressor(random_state=1)
rforest_regressor.fit(X_train, Y_train)
Out[17]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=1, verbose=0,
                      warm_start=False)

Evaluating Trained Model On Test Data.

Almost all models in the Scikit-Learn API provide a predict() method, which predicts the target variable for the test set passed to it.

In [18]:
Y_preds = rforest_regressor.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Training Coefficient of R^2 : %.3f'%rforest_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%rforest_regressor.score(X_test, Y_test))
[21.16 25.88 48.8  21.4  29.51 40.95 23.39  7.97 20.29 30.91]
[15.  26.6 45.4 20.8 34.9 21.9 28.7  7.2 20.  32.2]
Training Coefficient of R^2 : 0.981
Test Coefficient of R^2 : 0.815

Finetuning Model By Doing Grid Search On Various Hyperparameters.

Below is a list of common hyperparameters that usually need tuning to get the best fit for our data. We'll try various hyperparameter settings on different cross-validation splits of the train data to find the setting that generalizes best, ideally one whose train and test scores are close to each other.

  • n_estimators - Number of trees whose results will be combined to produce the final prediction. default=10 in the scikit-learn version used here.
  • max_depth - Maximum depth of each tree, which defines how finely a tree can separate samples (the chain of "if-else" questions asked to decide the target variable). Large values of max_depth tend to overfit and small values tend to underfit, so we need to find a good value. default=None, which grows each tree until its leaves are pure or hold fewer than min_samples_split samples.
  • min_samples_split - Minimum number of samples required to split an internal node. It accepts an int (at least 2) or a float in (0.0, 1.0]. A float is treated as a fraction: ceil(min_samples_split * n_samples) samples.
  • min_samples_leaf - Minimum number of samples required at a leaf node. It accepts an int (at least 1) or a float in (0.0, 0.5]. A float is treated as a fraction: ceil(min_samples_leaf * n_samples) samples.
  • criterion - Cost function that the algorithm tries to minimize. In the version used here it supports mse (mean squared error) and mae (mean absolute error).
  • max_features - Number of features to consider when looking for the best split. It accepts an int (1 to n_features), a float in (0.0, 1.0] (a fraction of n_features), a string (sqrt, log2, auto) or None.
    • None - all n_features are considered.
    • sqrt - sqrt(n_features) features are considered for each split.
    • auto - n_features for the regressors in the version used here (the classifiers use sqrt(n_features)).
    • log2 - log2(n_features) features are considered for each split.
  • bootstrap - Decides whether each tree is trained on a bootstrap sample. True = samples are drawn with replacement. False = the whole train set is used to build each tree. default=True
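
The effect of two of these parameters can be verified on fitted trees. The sketch below is an illustration on synthetic data (make_regression) with arbitrarily chosen values; frac is a hypothetical name for the float passed as min_samples_leaf, and bootstrap=False is used so that the node sample counts refer to actual train rows.

```python
import math
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used here only for illustration.
X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)

# max_depth caps how deep each tree in the forest can grow.
shallow = RandomForestRegressor(n_estimators=10, max_depth=2, random_state=1).fit(X, y)
deepest = max(t.get_depth() for t in shallow.estimators_)
print("Deepest tree with max_depth=2 :", deepest)

# A float min_samples_leaf is a fraction of the train rows.
frac = 0.05
leafy = RandomForestRegressor(n_estimators=10, min_samples_leaf=frac,
                              bootstrap=False, random_state=1).fit(X, y)
# Leaves are the nodes without children (children_left == -1).
smallest_leaf = min(
    t.tree_.n_node_samples[t.tree_.children_left == -1].min()
    for t in leafy.estimators_)
print("ceil(frac * n_samples)        :", math.ceil(frac * len(X)))
print("Smallest leaf across trees    :", smallest_leaf)
```

This also shows why very large values such as n_samples//2 for min_samples_leaf leave almost no room for splitting: a node can only be split if both children can hold that many samples.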

Below we try various values for the above-mentioned hyper-parameters to find the best estimator for our dataset, using 3-fold cross-validation.

In [19]:
%%time

n_samples = X_boston.shape[0]
n_features = X_boston.shape[1]

params = {'n_estimators': [20,50,100],
          'max_depth': [None, 2, 5],
          'min_samples_split': [2, 0.5, n_samples//2, ],
          'min_samples_leaf': [1, 0.5, n_samples//2, ],
          'criterion': ['mse', 'mae'],
          'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5, n_features//2,  ],
          'bootstrap':[True, False]
         }

rf_regressor_grid = GridSearchCV(RandomForestRegressor(random_state=1), param_grid=params, n_jobs=-1, cv=3, verbose=1)
rf_regressor_grid.fit(X_train,Y_train)

print('Train R^2 Score : %.3f'%rf_regressor_grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%rf_regressor_grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%rf_regressor_grid.best_score_)
print('Best Parameters : ',rf_regressor_grid.best_params_)
Fitting 3 folds for each of 2268 candidates, totalling 6804 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 212 tasks      | elapsed:    2.5s
[Parallel(n_jobs=-1)]: Done 1112 tasks      | elapsed:   12.7s
[Parallel(n_jobs=-1)]: Done 2612 tasks      | elapsed:   35.0s
[Parallel(n_jobs=-1)]: Done 4712 tasks      | elapsed:   57.1s
[Parallel(n_jobs=-1)]: Done 6804 out of 6804 | elapsed:  1.5min finished
Train R^2 Score : 1.000
Test R^2 Score : 0.827
Best R^2 Score Through Grid Search : 0.882
Best Parameters :  {'bootstrap': False, 'criterion': 'mae', 'max_depth': None, 'max_features': 0.5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
CPU times: user 3.83 s, sys: 74.9 ms, total: 3.91 s
Wall time: 1min 29s

Printing First Few Cross Validation Results

In [20]:
cross_val_results = pd.DataFrame(rf_regressor_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 2268
Out[20]:
|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_bootstrap | param_criterion | param_max_depth | param_max_features | param_min_samples_leaf | param_min_samples_split | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.053245 | 0.001387 | 0.001341 | 0.000004 | True | mse | None | None | 1 | 2 | 20 | {'bootstrap': True, 'criterion': 'mse', 'max_d... | 0.859288 | 0.879591 | 0.832987 | 0.857349 | 0.019064 | 73 |
| 1 | 0.094573 | 0.010596 | 0.002523 | 0.000043 | True | mse | None | None | 1 | 2 | 50 | {'bootstrap': True, 'criterion': 'mse', 'max_d... | 0.879421 | 0.886612 | 0.830498 | 0.865597 | 0.024901 | 49 |
| 2 | 0.170163 | 0.010898 | 0.005014 | 0.000892 | True | mse | None | None | 1 | 2 | 100 | {'bootstrap': True, 'criterion': 'mse', 'max_d... | 0.879800 | 0.889290 | 0.830718 | 0.866692 | 0.025638 | 41 |
| 3 | 0.026885 | 0.012239 | 0.002204 | 0.001538 | True | mse | None | None | 1 | 0.5 | 20 | {'bootstrap': True, 'criterion': 'mse', 'max_d... | 0.774189 | 0.667366 | 0.684763 | 0.708832 | 0.046841 | 209 |
| 4 | 0.053009 | 0.017270 | 0.001979 | 0.000048 | True | mse | None | None | 1 | 0.5 | 50 | {'bootstrap': True, 'criterion': 'mse', 'max_d... | 0.768447 | 0.662649 | 0.659679 | 0.697017 | 0.050617 | 236 |

ExtraTreesRegressor

We'll be explaining the usage of ExtraTreesRegressor by using the Boston housing data set. We'll first train the model with default parameters and then do hyper-parameter tuning. We'll also be comparing the performance of tuned extra trees regression estimator with random forest, decision tree, and extra tree estimator of scikit-learn.

Fitting Model To Train Data

In [21]:
from sklearn.ensemble import ExtraTreesRegressor

extra_forest_regressor = ExtraTreesRegressor(random_state=1)
extra_forest_regressor.fit(X_train, Y_train)
Out[21]:
ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
                    max_features='auto', max_leaf_nodes=None,
                    min_impurity_decrease=0.0, min_impurity_split=None,
                    min_samples_leaf=1, min_samples_split=2,
                    min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
                    oob_score=False, random_state=1, verbose=0,
                    warm_start=False)

Evaluating Trained Model On Test Data.

Almost all models in the Scikit-Learn API provide a predict() method, which predicts the target variable for the test set passed to it.

In [22]:
Y_preds = extra_forest_regressor.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Training Coefficient of R^2 : %.3f'%extra_forest_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%extra_forest_regressor.score(X_test, Y_test))
[27.62 26.92 45.88 19.21 30.91 41.58 25.28  7.82 19.16 30.99]
[15.  26.6 45.4 20.8 34.9 21.9 28.7  7.2 20.  32.2]
Training Coefficient of R^2 : 1.000
Test Coefficient of R^2 : 0.829

Finetuning Model By Doing Grid Search On Various Hyperparameters.

ExtraTreesRegressor exposes the same tunable parameters as RandomForestRegressor.

In [23]:
%%time

n_samples = X_boston.shape[0]
n_features = X_boston.shape[1]

params = {'n_estimators': [20,50,100],
          'max_depth': [None, 2,5,],
          'min_samples_split': [2, 0.5, n_samples//2, ],
          'min_samples_leaf': [1, 0.5, n_samples//2, ],
          'criterion': ['mse', 'mae'],
          'max_features': [None, 'sqrt', 'auto', 'log2', 0.3, 0.5, n_features//2],
          'bootstrap':[True, False]
         }

ef_regressor_grid = GridSearchCV(ExtraTreesRegressor(random_state=1), param_grid=params, n_jobs=-1, cv=3, verbose=1)
ef_regressor_grid.fit(X_train,Y_train)

print('Train R^2 Score : %.3f'%ef_regressor_grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%ef_regressor_grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%ef_regressor_grid.best_score_)
print('Best Parameters : ',ef_regressor_grid.best_params_)
Fitting 3 folds for each of 2268 candidates, totalling 6804 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 212 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 1112 tasks      | elapsed:   11.3s
[Parallel(n_jobs=-1)]: Done 2612 tasks      | elapsed:   28.6s
[Parallel(n_jobs=-1)]: Done 4712 tasks      | elapsed:   46.9s
[Parallel(n_jobs=-1)]: Done 6804 out of 6804 | elapsed:  1.2min finished
Train R^2 Score : 1.000
Test R^2 Score : 0.870
Best R^2 Score Through Grid Search : 0.886
Best Parameters :  {'bootstrap': False, 'criterion': 'mae', 'max_depth': None, 'max_features': 0.5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
CPU times: user 3.11 s, sys: 58.2 ms, total: 3.16 s
Wall time: 1min 9s
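
The number of candidates in the log is the product of the lengths of the value lists in the grid, and the number of fits is that product times the number of folds. A small sketch with ParameterGrid reproduces the arithmetic without training anything. The values of n_samples and n_features here are placeholders assumed for the example and do not affect the count.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder sizes (assumed); they only supply concrete values and do not change the count.
n_samples, n_features = 506, 13

params = {'n_estimators': [20, 50, 100],
          'max_depth': [None, 2, 5],
          'min_samples_split': [2, 0.5, n_samples // 2],
          'min_samples_leaf': [1, 0.5, n_samples // 2],
          'criterion': ['mse', 'mae'],
          'max_features': [None, 'sqrt', 'auto', 'log2', 0.3, 0.5, n_features // 2],
          'bootstrap': [True, False]}

n_candidates = len(ParameterGrid(params))  # 3 * 3 * 3 * 3 * 2 * 7 * 2
n_fits = n_candidates * 3                  # cv=3 folds per candidate

print('Candidates : %d, Fits : %d' % (n_candidates, n_fits))
```

This gives 2268 candidates and 6804 fits, matching the first line of the log above. Counting the grid this way before running a search is a cheap check on how long it is likely to take.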

Printing First Few Cross Validation Results

In [24]:
cross_val_results = pd.DataFrame(ef_regressor_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 2268
Out[24]:
mean_fit_time std_fit_time mean_score_time std_score_time param_bootstrap param_criterion param_max_depth param_max_features param_min_samples_leaf param_min_samples_split param_n_estimators params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.053850 0.002915 0.001736 0.000338 True mse None None 1 2 20 {'bootstrap': True, 'criterion': 'mse', 'max_d... 0.883367 0.863108 0.811594 0.852792 0.030181 64
1 0.058733 0.011814 0.003005 0.000750 True mse None None 1 2 50 {'bootstrap': True, 'criterion': 'mse', 'max_d... 0.882857 0.878594 0.831779 0.864491 0.023111 44
2 0.091989 0.000384 0.004380 0.000034 True mse None None 1 2 100 {'bootstrap': True, 'criterion': 'mse', 'max_d... 0.880460 0.882770 0.832957 0.865476 0.022928 37
3 0.016946 0.002753 0.001093 0.000013 True mse None None 1 0.5 20 {'bootstrap': True, 'criterion': 'mse', 'max_d... 0.612855 0.573655 0.526824 0.571221 0.035142 255
4 0.032264 0.000045 0.001915 0.000008 True mse None None 1 0.5 50 {'bootstrap': True, 'criterion': 'mse', 'max_d... 0.608944 0.607546 0.560247 0.592325 0.022606 233
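
The first rows of cv_results_ follow the order of the grid, not the quality of the candidates. A minimal sketch, assuming a synthetic data set and a deliberately tiny grid (both chosen only for illustration), sorts the results by rank_test_score so the best combinations come first.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a tiny grid (assumptions to keep the example fast).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=1)
small_params = {'n_estimators': [10, 20], 'max_depth': [None, 2, 5]}

grid = GridSearchCV(ExtraTreesRegressor(random_state=1), param_grid=small_params, cv=3)
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']

# Rank 1 is the best candidate; mergesort keeps the original order among ties.
top = results.sort_values('rank_test_score', kind='mergesort')[columns]
print(top.head(3))
```

The first row of the sorted frame carries the same mean score that GridSearchCV reports as best_score_.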

Comparing Performance Of Random Forest With Decision Tree/Extra Tree

In [25]:
## Note: score() of a regressor returns the R^2 coefficient of determination, not accuracy.

## Random forest with default parameters.
rforest_regressor = ensemble.RandomForestRegressor(random_state=1)
rforest_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(rforest_regressor.__class__.__name__,
                                                     rforest_regressor.score(X_train, Y_train),rforest_regressor.score(X_test, Y_test)))

## Random forest with the best parameters found by grid search.
rforest_regressor = ensemble.RandomForestRegressor(random_state=1, **rf_regressor_grid.best_params_)
rforest_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(rforest_regressor.__class__.__name__,
                                                     rforest_regressor.score(X_train, Y_train),rforest_regressor.score(X_test, Y_test)))

## Extra trees ensemble with default parameters.
extra_forest_regressor = ensemble.ExtraTreesRegressor(random_state=1)
extra_forest_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(extra_forest_regressor.__class__.__name__,
                                                     extra_forest_regressor.score(X_train, Y_train),extra_forest_regressor.score(X_test, Y_test)))

## Extra trees ensemble with the best parameters found by grid search.
extra_forest_regressor = ensemble.ExtraTreesRegressor(random_state=1, **ef_regressor_grid.best_params_)
extra_forest_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(extra_forest_regressor.__class__.__name__,
                                                     extra_forest_regressor.score(X_train, Y_train),extra_forest_regressor.score(X_test, Y_test)))

## Single decision tree.
dtree_regressor = tree.DecisionTreeRegressor(random_state=1)
dtree_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(dtree_regressor.__class__.__name__,
                                                     dtree_regressor.score(X_train, Y_train),dtree_regressor.score(X_test, Y_test)))

## Single extra tree.
extra_tree_regressor = tree.ExtraTreeRegressor(random_state=1)
extra_tree_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(extra_tree_regressor.__class__.__name__,
                                                     extra_tree_regressor.score(X_train, Y_train),extra_tree_regressor.score(X_test, Y_test)))
RandomForestRegressor : Train R^2 : 0.98, Test R^2 : 0.81
RandomForestRegressor : Train R^2 : 1.00, Test R^2 : 0.83
ExtraTreesRegressor : Train R^2 : 1.00, Test R^2 : 0.83
ExtraTreesRegressor : Train R^2 : 1.00, Test R^2 : 0.87
DecisionTreeRegressor : Train R^2 : 1.00, Test R^2 : 0.44
ExtraTreeRegressor : Train R^2 : 1.00, Test R^2 : 0.51
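
The names are easy to confuse: tree.ExtraTreeRegressor is a single randomized tree, while ensemble.ExtraTreesRegressor is an ensemble of such trees. The following sketch, which assumes a synthetic data set purely for illustration, inspects the fitted ensemble and checks that its prediction is the mean of the predictions of its member trees.

```python
import numpy as np
from sklearn import ensemble, tree
from sklearn.datasets import make_regression

# Synthetic stand-in data (an assumption for illustration).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=1)

forest = ensemble.ExtraTreesRegressor(n_estimators=10, random_state=1).fit(X, y)

# Each member of the ensemble is a single randomized tree.
print('Member type : ', type(forest.estimators_[0]).__name__)
print('Number of trees : ', len(forest.estimators_))

# The ensemble prediction is the mean of the individual tree predictions.
per_tree = np.array([t.predict(X[:5]) for t in forest.estimators_])
averaged = per_tree.mean(axis=0)
print('Average of trees equals forest prediction : ', np.allclose(averaged, forest.predict(X[:5])))
```

Averaging many such trees is what reduces the variance of a single tree, which fits the gap between the single-tree and ensemble test scores above.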

RandomForestClassifier

We'll explain the usage of RandomForestClassifier using the digits data set. We'll first train the model with default parameters and then tune its hyperparameters.

Train/Test Split

In [26]:
X_train, X_test, Y_train, Y_test = train_test_split(X_digits, Y_digits, train_size=0.80, test_size=0.20, random_state=123)
print('Train/Test Sets Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
Train/Test Sets Sizes :  (1437, 64) (360, 64) (1437,) (360,)

Fitting Model To Train Data

In [27]:
from sklearn.ensemble import RandomForestClassifier

rforest_classifier = RandomForestClassifier(random_state=1)
rforest_classifier.fit(X_train, Y_train)
Out[27]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

Evaluating Trained Model On Test Data.

Almost all models in the Scikit-Learn API provide a predict() method, which predicts the target variable for the test set passed to it.

In [28]:
Y_preds = rforest_classifier.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean() )
print('Test Accuracy : %.3f'%rforest_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%rforest_classifier.score(X_train, Y_train))
[3 3 4 4 1 3 1 0 7 4 0 0 5 1 6]
[3 3 4 4 1 3 1 0 7 4 0 0 5 1 6]
Test Accuracy : 0.944
Test Accuracy : 0.944
Training Accuracy : 0.999
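
For a forest classifier, predict() is derived from predict_proba(): the class probabilities are averaged over the trees and the most probable class is returned. A minimal sketch of that relationship follows. It reloads the digits data with load_digits so that it runs on its own, and the variable names are chosen for the example only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.80, test_size=0.20, random_state=123)

clf = RandomForestClassifier(n_estimators=20, random_state=1).fit(X_tr, y_tr)

# Class probabilities averaged over all trees: one row per sample, one column per class.
proba = clf.predict_proba(X_te)
print('Probability matrix shape : ', proba.shape)

# predict() returns the class with the highest averaged probability.
preds_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print('Same as predict() : ', np.array_equal(preds_from_proba, clf.predict(X_te)))

# score() of a classifier is plain accuracy.
same_accuracy = np.isclose(clf.score(X_te, y_te), accuracy_score(y_te, clf.predict(X_te)))
print('score() equals accuracy_score() : ', same_accuracy)
```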

Finetuning Model By Doing Grid Search On Various Hyperparameters.

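RandomForestClassifier has the same parameters to tune as RandomForestRegressor, except that criterion accepts 'gini' and 'entropy' instead of 'mse' and 'mae'. The grid below leaves criterion at its default.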
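
The split criterion can also be tuned for classifiers, where the supported values include 'gini' and 'entropy'. The sketch below is a hedged illustration rather than part of the original experiment: it uses a deliberately small grid (an assumption to keep the run short) and reloads the digits data so that it runs on its own.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.80, test_size=0.20, random_state=123)

# Small illustrative grid that only varies the split criterion and the number of trees.
criterion_params = {'criterion': ['gini', 'entropy'], 'n_estimators': [20, 50]}

criterion_grid = GridSearchCV(RandomForestClassifier(random_state=1),
                              param_grid=criterion_params, cv=3)
criterion_grid.fit(X_tr, y_tr)

print('Best Parameters : ', criterion_grid.best_params_)
print('Best Accuracy Through Grid Search : %.3f' % criterion_grid.best_score_)
```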

In [31]:
%%time

n_samples = X_digits.shape[0]
n_features = X_digits.shape[1]

params = {'n_estimators': [20,50,100],
          'max_depth': [None, 2, 5,],
          'min_samples_split': [2, 0.5, n_samples//2, ],
          'min_samples_leaf': [1, 0.5, n_samples//2, ],
          'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5, n_features//2, ],
          'bootstrap':[True, False]
         }

rf_classifier_grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid=params, n_jobs=-1, cv=3, verbose=1)
rf_classifier_grid.fit(X_train,Y_train)

print('Train Accuracy : %.3f'%rf_classifier_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%rf_classifier_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%rf_classifier_grid.best_score_)
print('Best Parameters : ',rf_classifier_grid.best_params_)
Fitting 3 folds for each of 1134 candidates, totalling 3402 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   18.8s
[Parallel(n_jobs=-1)]: Done 625 tasks      | elapsed:   29.5s
[Parallel(n_jobs=-1)]: Done 1625 tasks      | elapsed:   44.4s
[Parallel(n_jobs=-1)]: Done 3025 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 3402 out of 3402 | elapsed:  1.2min finished
Train Accuracy : 1.000
Test Accuracy : 0.975
Best Accuracy Through Grid Search : 0.978
Best Parameters :  {'bootstrap': False, 'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
CPU times: user 2.82 s, sys: 261 ms, total: 3.09 s
Wall time: 1min 12s
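
The "Best Accuracy Through Grid Search" value is not a test-set score. It is the mean validation accuracy of the winning candidate across the cross-validation folds, which is why it can differ from the test accuracy. A small sketch, assuming a tiny grid chosen only to keep the run short, makes the relationship explicit.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.80, test_size=0.20, random_state=123)

# Tiny illustrative grid (an assumption, not the grid used in the tutorial).
grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    param_grid={'n_estimators': [20, 50], 'max_depth': [None, 5]}, cv=3)
grid.fit(X_tr, y_tr)

# best_score_ is the mean of the per-fold validation scores of the winning candidate.
fold_scores = [grid.cv_results_['split%d_test_score' % i][grid.best_index_] for i in range(3)]
print('Per-fold scores : ', np.round(fold_scores, 3))
print('Mean of folds   : %.3f' % np.mean(fold_scores))
print('best_score_     : %.3f' % grid.best_score_)

# With the default refit=True, best_estimator_ is already refitted on all of X_tr.
print('Test accuracy   : %.3f' % grid.best_estimator_.score(X_te, y_te))
```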

Printing First Few Cross Validation Results

In [32]:
cross_val_results = pd.DataFrame(rf_classifier_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 1134
Out[32]:
mean_fit_time std_fit_time mean_score_time std_score_time param_bootstrap param_max_depth param_max_features param_min_samples_leaf param_min_samples_split param_n_estimators params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.219051 0.001484 0.002320 0.000180 True None None 1 2 20 {'bootstrap': True, 'max_depth': None, 'max_fe... 0.939959 0.937238 0.949580 0.942241 0.005284 39
1 0.468140 0.020043 0.005235 0.000978 True None None 1 2 50 {'bootstrap': True, 'max_depth': None, 'max_fe... 0.946170 0.941423 0.951681 0.946416 0.004183 38
2 0.686423 0.064943 0.009341 0.000884 True None None 1 2 100 {'bootstrap': True, 'max_depth': None, 'max_fe... 0.950311 0.945607 0.960084 0.951983 0.006017 37
3 0.034605 0.000435 0.001700 0.000004 True None None 1 0.5 20 {'bootstrap': True, 'max_depth': None, 'max_fe... 0.616977 0.608787 0.441176 0.556019 0.080894 263
4 0.088249 0.003297 0.003656 0.000107 True None None 1 0.5 50 {'bootstrap': True, 'max_depth': None, 'max_fe... 0.633540 0.518828 0.457983 0.537230 0.072874 271
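
The rows above already hint that min_samples_split=0.5 hurts accuracy badly, since a node must then hold at least half of the samples before it can be split. One way to summarize such effects is to group the cross-validation results by a single hyperparameter. The sketch below assumes a tiny two-parameter grid for illustration and reloads the digits data so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)

# Small grid (assumed for brevity) that varies only two hyperparameters.
params = {'min_samples_split': [2, 0.5], 'n_estimators': [20, 50]}
grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid=params, cv=3)
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)

# Cast to float so the integer and fractional settings share one dtype for grouping.
split_setting = results['param_min_samples_split'].astype(float)
summary = results.groupby(split_setting)['mean_test_score'].mean()
print(summary)
```

The group for 0.5 should show a clearly lower mean score than the group for 2, in line with the table above.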

ExtraTreesClassifier

We'll explain the usage of ExtraTreesClassifier using the digits data set. We'll first train the model with default parameters and then tune its hyperparameters. We'll also compare the tuned extra trees classifier with the random forest, decision tree, and extra tree estimators of scikit-learn.

Fitting Model To Train Data

In [33]:
from sklearn.ensemble import ExtraTreesClassifier

extra_forest_classifier = ensemble.ExtraTreesClassifier(random_state=1)
extra_forest_classifier.fit(X_train, Y_train)
Out[33]:
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
                     oob_score=False, random_state=1, verbose=0,
                     warm_start=False)

Evaluating Trained Model On Test Data.

Almost all models in the Scikit-Learn API provide a predict() method, which predicts the target variable for the test set passed to it.

In [34]:
Y_preds = extra_forest_classifier.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean())
print('Test Accuracy : %.3f'%extra_forest_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%extra_forest_classifier.score(X_train, Y_train))
[3 3 4 4 1 3 1 0 7 4 0 0 5 1 6]
[3 3 4 4 1 3 1 0 7 4 0 0 5 1 6]
Test Accuracy : 0.950
Test Accuracy : 0.950
Training Accuracy : 1.000

Finetuning Model By Doing Grid Search On Various Hyperparameters.

ExtraTreesClassifier has the same parameters to tune as RandomForestClassifier.

In [35]:
%%time

n_samples = X_digits.shape[0]
n_features = X_digits.shape[1]

params = {'n_estimators': [20,50,100],
          'max_depth': [None, 2, 5,],
          'min_samples_split': [2, 0.5, n_samples//2, ],
          'min_samples_leaf': [1, 0.5, n_samples//2, ],
          'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5, n_features//2, ],
          'bootstrap':[True, False]
         }

ef_classifier_grid = GridSearchCV(ExtraTreesClassifier(random_state=1), param_grid=params, n_jobs=-1, cv=3, verbose=1)
ef_classifier_grid.fit(X_train,Y_train)

print('Train Accuracy : %.3f'%ef_classifier_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%ef_classifier_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%ef_classifier_grid.best_score_)
print('Best Parameters : ',ef_classifier_grid.best_params_)
Fitting 3 folds for each of 1134 candidates, totalling 3402 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 144 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]: Done 744 tasks      | elapsed:   11.5s
[Parallel(n_jobs=-1)]: Done 1744 tasks      | elapsed:   26.8s
[Parallel(n_jobs=-1)]: Done 3144 tasks      | elapsed:   44.7s
[Parallel(n_jobs=-1)]: Done 3402 out of 3402 | elapsed:   48.2s finished
Train Accuracy : 1.000
Test Accuracy : 0.983
Best Accuracy Through Grid Search : 0.982
Best Parameters :  {'bootstrap': False, 'max_depth': None, 'max_features': 0.3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
CPU times: user 2.94 s, sys: 125 ms, total: 3.07 s
Wall time: 48.6 s

Printing First Few Cross Validation Results

In [37]:
cross_val_results = pd.DataFrame(ef_classifier_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 1134
Out[37]:
mean_fit_time std_fit_time mean_score_time std_score_time param_bootstrap param_max_depth param_max_features param_min_samples_leaf param_min_samples_split param_n_estimators params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.053711 0.001229 0.002628 0.000549 True None None 1 2 20 {'bootstrap': True, 'max_depth': None, 'max_fe... 0.950311 0.960251 0.966387 0.958942 0.006631 39
1 0.130861 0.000975 0.009696 0.004870 True None None 1 2 50 {'bootstrap': True, 'max_depth': None, 'max_fe... 0.958592 0.962343 0.978992 0.966597 0.008857 30
2 0.291812 0.011959 0.009114 0.000343 True None None 1 2 100 {'bootstrap': True, 'max_depth': None, 'max_fe... 0.966874 0.962343 0.978992 0.969381 0.007013 28
3 0.022173 0.000474 0.001731 0.000020 True None None 1 0.5 20 {'bootstrap': True, 'max_depth': None, 'max_fe... 0.631470 0.558577 0.550420 0.580376 0.036507 307
4 0.060601 0.009223 0.003610 0.000019 True None None 1 0.5 50 {'bootstrap': True, 'max_depth': None, 'max_fe... 0.643892 0.587866 0.632353 0.621434 0.024163 301

Comparing Performance Of Random Forest With Decision Tree/Extra Tree

In [38]:
rforest_classifier = ensemble.RandomForestClassifier(random_state=1)
rforest_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(rforest_classifier.__class__.__name__,
                                                     rforest_classifier.score(X_train, Y_train),rforest_classifier.score(X_test, Y_test)))

rforest_classifier = ensemble.RandomForestClassifier(random_state=1, **rf_classifier_grid.best_params_)
rforest_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(rforest_classifier.__class__.__name__,
                                                     rforest_classifier.score(X_train, Y_train),rforest_classifier.score(X_test, Y_test)))

extra_forest_classifier = ensemble.ExtraTreesClassifier(random_state=1)
extra_forest_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(extra_forest_classifier.__class__.__name__,
                                                     extra_forest_classifier.score(X_train, Y_train),extra_forest_classifier.score(X_test, Y_test)))

extra_forest_classifier = ensemble.ExtraTreesClassifier(random_state=1, **ef_classifier_grid.best_params_)
extra_forest_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(extra_forest_classifier.__class__.__name__,
                                                     extra_forest_classifier.score(X_train, Y_train),extra_forest_classifier.score(X_test, Y_test)))

dtree_classifier = tree.DecisionTreeClassifier(random_state=1)
dtree_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(dtree_classifier.__class__.__name__,
                                                     dtree_classifier.score(X_train, Y_train),dtree_classifier.score(X_test, Y_test)))

extra_tree_classifier = tree.ExtraTreeClassifier(random_state=1)
extra_tree_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(extra_tree_classifier.__class__.__name__,
                                                     extra_tree_classifier.score(X_train, Y_train),extra_tree_classifier.score(X_test, Y_test)))
RandomForestClassifier : Train Accuracy : 1.00, Test Accuracy : 0.94
RandomForestClassifier : Train Accuracy : 1.00, Test Accuracy : 0.97
ExtraTreesClassifier : Train Accuracy : 1.00, Test Accuracy : 0.95
ExtraTreesClassifier : Train Accuracy : 1.00, Test Accuracy : 0.98
DecisionTreeClassifier : Train Accuracy : 1.00, Test Accuracy : 0.83
ExtraTreeClassifier : Train Accuracy : 1.00, Test Accuracy : 0.83
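
The last two rows compare single trees. A quick way to see how the two differ is to inspect their splitter parameter, as in this short sketch that only constructs the estimators and reads their default parameters.

```python
from sklearn import tree

dtree = tree.DecisionTreeClassifier(random_state=1)
etree = tree.ExtraTreeClassifier(random_state=1)

dtree_splitter = dtree.get_params()['splitter']
etree_splitter = etree.get_params()['splitter']

# A decision tree searches for the best threshold of each candidate feature, while
# an extra tree draws thresholds at random and keeps the best of those random draws.
print('%s splitter : %s' % (dtree.__class__.__name__, dtree_splitter))
print('%s splitter : %s' % (etree.__class__.__name__, etree_splitter))
```

The random splitter makes each extra tree cheaper to build and less correlated with its neighbours, and it is the ensemble average that turns that extra randomness into better test accuracy.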

Sunny Solanki