Updated On : May-30,2020 Time Investment : ~30 mins

Scikit-Learn - Ensemble Learning: Bootstrap Aggregation (Bagging) & Random Forests

Table of Contents

Introduction
Bootstrap Aggregation (Bagging)
- BaggingRegressor
- BaggingClassifier
Random Forests
References

Introduction

We already discussed decision trees in depth in a separate tutorial. We noticed there that a single decision tree over-fits the train data very easily, so it's a better idea to combine many decision trees to make a decision. The basic idea is that multiple overfitting estimators can be combined to reduce the effect of overfitting and produce predictions that generalize well. This idea is generally referred to as ensemble learning in the machine learning community.

There are 2 ways to combine decision trees to make better decisions:

Averaging (Bootstrap Aggregation - Bagging & Random Forests) - The idea is that we create many individual estimators and average their predictions to make the final prediction. Averaging estimators reduces variance and hence helps avoid overfitting.
Boosting - Base estimators are trained sequentially, where each step tries to reduce the bias of the combined estimator and hence avoid underfitting. The main idea is to combine several weak estimators to create a powerful estimator.
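
The variance-reduction claim for averaging can be illustrated with a small simulation. This is only a sketch under a strong assumption: each "estimator" is modelled as the true value plus independent Gaussian noise, which real decision trees trained on overlapping data do not satisfy exactly. The names `noisy_estimate` and `averaged_estimate` are made up for this illustration.

```python
import random
import statistics

def noisy_estimate(truth, noise_sd, rng):
    """One high-variance 'estimator': the true value plus random noise."""
    return truth + rng.gauss(0.0, noise_sd)

def averaged_estimate(truth, noise_sd, n_estimators, rng):
    """Average the outputs of several independent noisy estimators."""
    outputs = [noisy_estimate(truth, noise_sd, rng) for _ in range(n_estimators)]
    return sum(outputs) / len(outputs)

rng = random.Random(0)
truth, noise_sd, trials = 10.0, 2.0, 2000

# Repeat the experiment many times to measure the spread of each approach.
single = [noisy_estimate(truth, noise_sd, rng) for _ in range(trials)]
averaged = [averaged_estimate(truth, noise_sd, 25, rng) for _ in range(trials)]

var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)
print("variance of a single estimator     :", round(var_single, 3))
print("variance of a 25-estimator average :", round(var_averaged, 3))
```

With independent noise the variance of the average shrinks roughly in proportion to the number of estimators; correlated estimators gain less, which is the motivation for the extra randomization used by random forests.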

In this tutorial, we'll be discussing bagging and random forests. We'll cover boosting in depth in a separate tutorial.

import numpy as np
import pandas as pd

import sklearn
from sklearn import ensemble, datasets, tree
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import GridSearchCV

import sys
import warnings

warnings.filterwarnings("ignore")

print("Python Version : ",sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)

Python Version :  3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
Scikit-Learn Version :  0.21.2

Bootstrap Aggregation (Bagging)

Bagging starts by drawing many sub-samples of the original data with replacement and then trains a separate decision tree on each sub-sample. When a prediction is to be made on new data, it votes on or averages the predictions from the individual decision trees. The basic idea is to solve the overfitting problem (reducing high variance) by introducing some randomization.
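
The procedure described above can be sketched in a few lines of plain Python. This is a toy illustration, not scikit-learn's implementation: the base estimator here is a hypothetical one-feature 1-nearest-neighbour regressor (`fit_1nn`) standing in for a decision tree, and the data is synthetic.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample)."""
    return rng.choices(rows, k=len(rows))

def fit_1nn(rows):
    """Stand-in base estimator: 1-nearest-neighbour on a single feature."""
    def predict(x):
        nearest = min(rows, key=lambda row: abs(row[0] - x))
        return nearest[1]
    return predict

def fit_bagging(rows, n_estimators, rng):
    """Train one base estimator per bootstrap sample and average them."""
    models = [fit_1nn(bootstrap_sample(rows, rng)) for _ in range(n_estimators)]
    def predict(x):
        return sum(model(x) for model in models) / len(models)
    return predict

rng = random.Random(0)
# Synthetic data: y = 2x plus noise, with x running from 0.0 to 9.9.
data = [(i / 10, 2 * (i / 10) + rng.gauss(0, 0.5)) for i in range(100)]

sample = bootstrap_sample(data, rng)
unique_fraction = len(set(sample)) / len(data)
bagged = fit_bagging(data, n_estimators=25, rng=rng)

print("distinct rows in one bootstrap sample:", round(unique_fraction, 2))
print("bagged prediction at x=5.0:", round(bagged(5.0), 2))
```

Because sampling is done with replacement, each bootstrap sample contains only about 63% of the distinct rows on average, so every base estimator sees a slightly different view of the data.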

Scikit-Learn provides BaggingRegressor and BaggingClassifier.

BaggingRegressor

We'll be explaining the usage of BaggingRegressor by using the Boston housing data set. We'll first train the model with default parameters and then do hyper-parameter tuning. We'll also be comparing the performance of tuned bagging estimator with decision tree and extra tree estimator of scikit-learn.

Load BOSTON Housing Dataset

from sklearn import datasets

boston = datasets.load_boston()
X_boston, Y_boston = boston.data, boston.target
print('Dataset features names : '+str(boston.feature_names))
print('Dataset features size : '+str(boston.data.shape))
print('Dataset target size : '+str(boston.target.shape))

Dataset features names : ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
Dataset features size : (506, 13)
Dataset target size : (506,)

Splitting Dataset into Train & Test sets

We'll split the dataset into two parts:

Training data, which will be used for training the model.
Test data, against which the accuracy of the trained model will be checked.

The train_test_split function from the model_selection module of sklearn splits the data into two sets, with 80% for training and 20% for testing. We also pass a seed (random_state=123) to train_test_split so that we always get the same split and can reproduce results in the future.
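
Conceptually, a seeded split is just a reproducible shuffle of the row indices followed by a cut. The sketch below is a simplified stand-in for `train_test_split`; the helper name `simple_train_test_split` and the rounding rule (test size rounded up) are assumptions made for this illustration, and the shuffle will not reproduce scikit-learn's exact row order.

```python
import math
import random

def simple_train_test_split(n_samples, test_size, seed):
    """Shuffle row indices with a fixed seed, then cut them into two parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # same seed -> same order
    n_test = math.ceil(test_size * n_samples)     # assumed rounding rule
    return indices[n_test:], indices[:n_test]     # (train, test)

train_idx, test_idx = simple_train_test_split(506, test_size=0.20, seed=123)
print(len(train_idx), len(test_idx))              # 404 102

# Calling it again with the same seed gives the identical split.
again_train, again_test = simple_train_test_split(506, test_size=0.20, seed=123)
print(train_idx == again_train and test_idx == again_test)
```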

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_boston, Y_boston , train_size=0.80, test_size=0.20, random_state=123)
print('Train/Test Sets Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

Train/Test Sets Sizes :  (404, 13) (102, 13) (404,) (102,)

Fitting Model To Train Data

We can call the fit() method on the estimator, passing it the train features and the train target. It'll then train a model using that data.
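
The fit/predict convention can be seen in miniature with a toy estimator. `MeanRegressor` below is a hypothetical class written for this illustration only; it is not part of scikit-learn and ignores the features entirely.

```python
class MeanRegressor:
    """Toy estimator that follows the fit/predict convention."""

    def fit(self, X, y):
        # "Training" here is just remembering the average target value.
        self.mean_ = sum(y) / len(y)
        return self  # returning self allows chaining: fit(...).predict(...)

    def predict(self, X):
        # One prediction per input row.
        return [self.mean_ for _ in X]

X_toy = [[1.0], [2.0], [3.0], [4.0]]
y_toy = [10.0, 20.0, 30.0, 40.0]

model = MeanRegressor().fit(X_toy, y_toy)
preds = model.predict([[5.0], [6.0]])
print(preds)  # [25.0, 25.0]
```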

from sklearn.ensemble import BaggingRegressor

bag_regressor = BaggingRegressor(random_state=1)
bag_regressor.fit(X_train, Y_train)

BaggingRegressor(base_estimator=None, bootstrap=True, bootstrap_features=False,
                 max_features=1.0, max_samples=1.0, n_estimators=10,
                 n_jobs=None, oob_score=False, random_state=1, verbose=0,
                 warm_start=False)

Evaluating Trained Model On Test Data

Almost all models in the Scikit-Learn API provide a predict() method, which can be used to predict the target variable for the test set passed to it.

Y_preds = bag_regressor.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Training Coefficient of R^2 : %.3f'%bag_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%bag_regressor.score(X_test, Y_test))

[21.49 25.87 48.19 21.28 29.36 41.65 24.    8.18 19.12 31.  ]
[15.  26.6 45.4 20.8 34.9 21.9 28.7  7.2 20.  32.2]
Training Coefficient of R^2 : 0.980
Test Coefficient of R^2 : 0.812
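
For regressors, score() reports the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A minimal sketch of that formula follows; the helper name `r_squared` is made up for this illustration, and the ten values are the ones printed above, so the result on such a small subset will not match the full test-set score.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# First ten predictions and targets from the output above.
y_pred = [21.49, 25.87, 48.19, 21.28, 29.36, 41.65, 24.0, 8.18, 19.12, 31.0]
y_true = [15.0, 26.6, 45.4, 20.8, 34.9, 21.9, 28.7, 7.2, 20.0, 32.2]

print("R^2 on these ten rows: %.3f" % r_squared(y_true, y_pred))

# Two reference points: perfect predictions score 1, predicting the mean scores 0.
mean_only = [sum(y_true) / len(y_true)] * len(y_true)
print(r_squared(y_true, y_true), round(r_squared(y_true, mean_only), 6))
```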

Fine-tuning Model By Doing Grid Search On Various Hyperparameters

Below is a list of common hyperparameters that need tuning to get the best fit for our data. We'll try various hyperparameter settings on several splits of the train data to find the best fit, ideally one whose train and test scores are almost the same or differ only slightly.

base_estimator (object or None) - Base estimator of which many instances will be created. If None is provided, a decision tree is used as the base estimator. default=None
n_estimators (int) - Number of base estimators whose results will be combined to produce the final prediction. default=10
bootstrap (bool) - Decides whether samples are drawn with replacement. True = with replacement, False = without replacement. default=True
bootstrap_features (bool) - Decides whether features are drawn with replacement. True = with replacement, False = without replacement. default=False
max_samples (int/float) - Accepts an int in [1, n_samples] or a float in (0.0, 1.0]. It represents the number (or fraction) of samples to draw from the train data to train each base estimator. default=1.0
max_features (int/float) - Accepts an int in [1, n_features] or a float in (0.0, 1.0]. It represents the number (or fraction) of features to draw from the train data to train each base estimator. default=1.0
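
How these four sampling parameters interact can be sketched as follows. This is an illustration under assumptions, not scikit-learn's code: the helpers `resolve_count` and `draw_indices` are hypothetical, and the rule that a float fraction is truncated to an integer count is assumed here.

```python
import random

def resolve_count(value, n_total):
    """Turn an int count or a float fraction into a number of items."""
    if isinstance(value, float):
        return max(1, int(value * n_total))  # assumed: fraction is truncated
    return value

def draw_indices(n_total, value, with_replacement, rng):
    """Pick the row (or column) indices used by one base estimator."""
    k = resolve_count(value, n_total)
    if with_replacement:
        return [rng.randrange(n_total) for _ in range(k)]
    return rng.sample(range(n_total), k)

rng = random.Random(1)
n_samples, n_features = 404, 13

# Analogous to max_samples=0.5 with bootstrap=True.
rows = draw_indices(n_samples, 0.5, with_replacement=True, rng=rng)
# Analogous to max_features=6 with bootstrap_features=False.
cols = draw_indices(n_features, 6, with_replacement=False, rng=rng)

print(len(rows), "rows drawn,", len(set(rows)), "distinct")
print(len(cols), "feature columns drawn:", sorted(cols))
```

Drawing without replacement guarantees distinct indices, while drawing with replacement usually repeats some rows and leaves others out.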

GridSearchCV

It's a wrapper class provided by sklearn which loops through all parameter combinations provided as the param_grid parameter, using the number of cross-validation folds provided as the cv parameter, evaluates model performance on every combination, and stores all results in the cv_results_ attribute. It also stores the model with the best mean cross-validation score in the best_estimator_ attribute and that score in the best_score_ attribute.

NOTE

The n_jobs parameter is provided by many estimators. It accepts the number of cores to use for parallelization. If a value of -1 is given, all cores are used. It relies on the joblib parallel processing library to run tasks in parallel in the background.
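
The mapping from an n_jobs value to a worker count can be sketched as below. The helper `effective_workers` is hypothetical; the rule for negative values (n_cpus + 1 + n_jobs) follows what joblib's documentation describes, and treating None as a single worker is the usual default outside a parallel backend context.

```python
import os

def effective_workers(n_jobs, n_cpus):
    """Map an n_jobs setting to the number of parallel workers."""
    if n_jobs is None:
        return 1                            # default: run sequentially
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        return max(n_cpus + 1 + n_jobs, 1)  # -1 -> all cores, -2 -> all but one
    return n_jobs

n_cpus = os.cpu_count() or 1
for setting in (None, 1, 2, -1, -2):
    print(setting, "->", effective_workers(setting, n_cpus))
```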

Below, we'll try various values for the above-mentioned hyperparameters to find the best estimator for our dataset by doing 3-fold cross-validation on the data.

%%time

n_samples = boston.data.shape[0]
n_features = boston.data.shape[1]

params = {'base_estimator': [None, LinearRegression(), KNeighborsRegressor()],
          'n_estimators': [20,50,100],
          'max_samples': [0.5,1.0, n_samples//2,],
          'max_features': [0.5,1.0, n_features//2,],
          'bootstrap': [True, False],
          'bootstrap_features': [True, False]}

bagging_regressor_grid = GridSearchCV(BaggingRegressor(random_state=1, n_jobs=-1), param_grid =params, cv=3, n_jobs=-1, verbose=1)
bagging_regressor_grid.fit(X_train, Y_train)

print('Train R^2 Score : %.3f'%bagging_regressor_grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%bagging_regressor_grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%bagging_regressor_grid.best_score_)
print('Best Parameters : ',bagging_regressor_grid.best_params_)

Fitting 3 folds for each of 324 candidates, totalling 972 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.2s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   15.8s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   36.2s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  1.0min

Train R^2 Score : 0.983
Test R^2 Score : 0.802
Best R^2 Score Through Grid Search : 0.870
Best Parameters :  {'base_estimator': None, 'bootstrap': True, 'bootstrap_features': False, 'max_features': 1.0, 'max_samples': 1.0, 'n_estimators': 50}
CPU times: user 2.21 s, sys: 199 ms, total: 2.41 s
Wall time: 1min 11s

[Parallel(n_jobs=-1)]: Done 972 out of 972 | elapsed:  1.2min finished
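
The candidate and fit counts reported in the log above follow directly from the size of the grid. The sketch below enumerates an equivalent grid with itertools.product; the string labels for the base estimators are placeholders, and 253 and 6 are n_samples//2 and n_features//2 for the 506 x 13 Boston data.

```python
from itertools import product

param_grid = {
    "base_estimator": ["tree", "linear", "knn"],  # placeholder labels
    "n_estimators": [20, 50, 100],
    "max_samples": [0.5, 1.0, 253],
    "max_features": [0.5, 1.0, 6],
    "bootstrap": [True, False],
    "bootstrap_features": [True, False],
}

# Every candidate is one choice per parameter: the Cartesian product.
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

n_folds = 3
total_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", total_fits, "fits")  # 324 candidates, 972 fits
```

Adding one more value to any list multiplies the work, which is why grid search becomes expensive quickly.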

Printing First Few Cross-Validation Results

GridSearchCV maintains results for all parameter combinations tried with all cross-validation splits. We can access the results for all iterations as a dictionary through its cv_results_ attribute. We are converting it to a pandas dataframe for better visuals.

cross_val_results = pd.DataFrame(bagging_regressor_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.

Number of Various Combinations of Parameters Tried : 324

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_base_estimator	param_bootstrap	param_bootstrap_features	param_max_features	param_max_samples	param_n_estimators	params	split0_test_score	split1_test_score	split2_test_score	mean_test_score	std_test_score	rank_test_score
0	0.102222	0.000492	0.102727	0.000575	None	True	True	0.5	0.5	20	{'base_estimator': None, 'bootstrap': True, 'b...	0.807748	0.818529	0.761464	0.795999	0.024725	92
1	0.176272	0.052850	0.102964	0.000274	None	True	True	0.5	0.5	50	{'base_estimator': None, 'bootstrap': True, 'b...	0.834376	0.818371	0.769611	0.807546	0.027514	74
2	0.272792	0.049073	0.103111	0.000387	None	True	True	0.5	0.5	100	{'base_estimator': None, 'bootstrap': True, 'b...	0.834629	0.826096	0.768551	0.809860	0.029310	69
3	0.104769	0.002391	0.103690	0.001265	None	True	True	0.5	1	20	{'base_estimator': None, 'bootstrap': True, 'b...	0.799064	0.809131	0.777896	0.795407	0.013005	94
4	0.206177	0.000921	0.103582	0.000535	None	True	True	0.5	1	50	{'base_estimator': None, 'bootstrap': True, 'b...	0.825602	0.829979	0.784795	0.813530	0.020322	64
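
The rank_test_score column orders candidates by mean_test_score, with rank 1 being the best. A small sketch of that ordering, restricted to the five rows shown above (so the ranks here are relative to these five rows only, not to all 324 candidates):

```python
# mean_test_score values of the five rows printed above.
mean_test_score = [0.795999, 0.807546, 0.809860, 0.795407, 0.813530]

# Sort row positions from highest to lowest mean score.
order = sorted(range(len(mean_test_score)),
               key=lambda i: mean_test_score[i], reverse=True)

# Assign rank 1 to the best row, rank 2 to the next, and so on.
rank = [0] * len(mean_test_score)
for position, row in enumerate(order, start=1):
    rank[row] = position

best_index = order[0]
print("ranks among these rows:", rank)      # [4, 3, 2, 5, 1]
print("best row:", best_index, "score:", mean_test_score[best_index])
```

The relative order agrees with the rank_test_score values in the table (64 < 69 < 74 < 92 < 94).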

Comparing Performance Of Bagging With Decision Tree/Extra Tree

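Below we are comparing the performance of various bagging regression estimators with decision tree and extra tree estimators. Note that score() returns R^2 for regressors, even though the printed labels say accuracy. We can notice that the bagging estimators do not over-fit as badly as a single decision tree.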

## Bagging Regressor with Default Params
bag_regressor = ensemble.BaggingRegressor(random_state=1)
bag_regressor.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_regressor.__class__.__name__,
                                                     bag_regressor.score(X_train, Y_train),bag_regressor.score(X_test, Y_test)))

## Bagging Regressor with KNeighborsRegressor as base estimator
bag_regressor = ensemble.BaggingRegressor(base_estimator=KNeighborsRegressor(), random_state=1)
bag_regressor.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_regressor.__class__.__name__,
                                                          bag_regressor.score(X_train, Y_train),bag_regressor.score(X_test, Y_test)))

## Above Hyper-parameter tuned Bagging Regressor
bag_regressor = ensemble.BaggingRegressor(random_state=1, **bagging_regressor_grid.best_params_)
bag_regressor.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_regressor.__class__.__name__,
                                                     bag_regressor.score(X_train, Y_train),bag_regressor.score(X_test, Y_test)))

## Decision Tree with Default Parameters
dtree_regressor = tree.DecisionTreeRegressor(random_state=1)
dtree_regressor.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(dtree_regressor.__class__.__name__,
                                                     dtree_regressor.score(X_train, Y_train),dtree_regressor.score(X_test, Y_test)))

## Extra Tree with Default Parameters
extra_tree_regressor = tree.ExtraTreeRegressor(random_state=1)
extra_tree_regressor.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(extra_tree_regressor.__class__.__name__,
                                                     extra_tree_regressor.score(X_train, Y_train),extra_tree_regressor.score(X_test, Y_test)))

BaggingRegressor : Train Accuracy : 0.98, Test Accuracy : 0.81
BaggingRegressor : Train Accuracy : 0.69, Test Accuracy : 0.58
BaggingRegressor : Train Accuracy : 0.98, Test Accuracy : 0.80
DecisionTreeRegressor : Train Accuracy : 1.00, Test Accuracy : 0.44
ExtraTreeRegressor : Train Accuracy : 1.00, Test Accuracy : 0.51

BaggingClassifier

We'll be explaining the usage of BaggingClassifier by using digits data set. We'll first train the model with default parameters and then do hyper-parameter tuning. We'll also be comparing the performance of tuned bagging estimator with decision tree and extra tree estimator of scikit-learn.

Load DIGITS Dataset

digits = datasets.load_digits()
X_digits, Y_digits = digits.data, digits.target
print('Dataset Size : ', X_digits.shape, Y_digits.shape)

Dataset Size :  (1797, 64) (1797,)

Splitting Dataset into Train & Test sets

Below we are splitting the digits dataset into a train set (80%) and a test set (20%). We are also using a seed (random_state=123) so that we always get the same split and can reproduce results in the future.

NOTE

Please note that we are also using the stratify parameter, which prevents an unequal distribution of classes across the train and test sets. For each class, we'll have about 80% of its samples in the train set and 20% in the test set. This makes sure that no class dominates either the train or the test set.
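
The effect of stratification can be sketched by splitting each class separately. The helper `stratified_split` below is a hypothetical simplification written for this illustration (scikit-learn's own procedure differs in detail), and the labels are a toy, deliberately imbalanced example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Split indices so each class keeps roughly the same train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(test_size * len(members))  # per-class test count
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx

labels = [0] * 90 + [1] * 10   # toy labels: class 1 is only 10% of the data
train_idx, test_idx = stratified_split(labels, test_size=0.20, seed=123)

train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx))   # 80 20
print(train_share, test_share)         # 0.1 0.1
```

Without stratification, a random 20-row test set could easily contain no samples of the minority class at all.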

X_train, X_test, Y_train, Y_test = train_test_split(X_digits, Y_digits, train_size=0.80, test_size=0.20, stratify=Y_digits, random_state=123)
print('Train/Test Set Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

Train/Test Set Sizes :  (1437, 64) (360, 64) (1437,) (360,)

Fitting Model To Train Data

from sklearn.ensemble import BaggingClassifier

bag_classifier = BaggingClassifier(random_state=1)
bag_classifier.fit(X_train, Y_train)

BaggingClassifier(base_estimator=None, bootstrap=True, bootstrap_features=False,
                  max_features=1.0, max_samples=1.0, n_estimators=10,
                  n_jobs=None, oob_score=False, random_state=1, verbose=0,
                  warm_start=False)

Evaluating Trained Model On Test Data

Almost all models in the Scikit-Learn API provide a predict() method, which can be used to predict the target variable for the test set passed to it.

Y_preds = bag_classifier.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean())
print('Test Accuracy : %.3f'%bag_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%bag_classifier.score(X_train, Y_train))

[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
Test Accuracy : 0.956
Test Accuracy : 0.956
Training Accuracy : 0.999
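
Two small pieces sit behind these numbers: combining the base estimators' outputs and measuring accuracy. The sketch below uses plain majority voting with made-up votes; as far as I know, scikit-learn's BaggingClassifier averages predicted class probabilities when the base estimator provides them and only falls back to voting otherwise, so treat this as a simplified picture.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label predicted by most base estimators."""
    return Counter(votes).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical votes from three base estimators for four test samples.
votes_per_sample = [[5, 5, 3], [9, 9, 9], [1, 7, 7], [6, 6, 8]]
y_pred = [majority_vote(votes) for votes in votes_per_sample]
y_true = [5, 9, 1, 6]

print(y_pred, accuracy(y_true, y_pred))  # [5, 9, 7, 6] 0.75
```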

Fine-tuning Model By Doing Grid Search On Various Hyperparameters

BaggingClassifier has the same parameters to tune as that of BaggingRegressor.

%%time

n_samples = digits.data.shape[0]
n_features = digits.data.shape[1]

params = {'base_estimator': [None, LogisticRegression(), KNeighborsClassifier()],
          'n_estimators': [20,50,100],
          'max_samples': [0.5, 1.0, n_samples//2, ],
          'max_features': [0.5, 1.0, n_features//2, ],
          'bootstrap': [True, False],
          'bootstrap_features': [True, False]}

bagging_classifier_grid = GridSearchCV(BaggingClassifier(random_state=1, n_jobs=-1), param_grid =params, cv=3, n_jobs=-1, verbose=1)
bagging_classifier_grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%bagging_classifier_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%bagging_classifier_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%bagging_classifier_grid.best_score_)
print('Best Parameters : ',bagging_classifier_grid.best_params_)

Fitting 3 folds for each of 324 candidates, totalling 972 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   18.2s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-1)]: Done 972 out of 972 | elapsed:  7.7min finished

Train Accuracy : 0.995
Test Accuracy : 0.989
Best Accuracy Through Grid Search : 0.984
Best Parameters :  {'base_estimator': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform'), 'bootstrap': True, 'bootstrap_features': True, 'max_features': 0.5, 'max_samples': 1.0, 'n_estimators': 50}
CPU times: user 1.96 s, sys: 448 ms, total: 2.41 s
Wall time: 7min 46s

Printing First Few Cross Validation Results

cross_val_results = pd.DataFrame(bagging_classifier_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.

Number of Various Combinations of Parameters Tried : 324

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_base_estimator	param_bootstrap	param_bootstrap_features	param_max_features	param_max_samples	param_n_estimators	params	split0_test_score	split1_test_score	split2_test_score	mean_test_score	std_test_score	rank_test_score
0	0.207989	0.000359	0.102366	0.000742	None	True	True	0.5	0.5	20	{'base_estimator': None, 'bootstrap': True, 'b...	0.948347	0.916318	0.947368	0.937370	0.014868	301
1	0.277387	0.048700	0.103260	0.001007	None	True	True	0.5	0.5	50	{'base_estimator': None, 'bootstrap': True, 'b...	0.960744	0.930962	0.966316	0.952679	0.015500	187
2	0.374678	0.050667	0.105099	0.004744	None	True	True	0.5	0.5	100	{'base_estimator': None, 'bootstrap': True, 'b...	0.969008	0.937238	0.962105	0.956159	0.013652	157
3	0.136924	0.048241	0.101918	0.000354	None	True	True	0.5	1	20	{'base_estimator': None, 'bootstrap': True, 'b...	0.956612	0.943515	0.955789	0.951983	0.005988	193
4	0.269504	0.047858	0.108643	0.005364	None	True	True	0.5	1	50	{'base_estimator': None, 'bootstrap': True, 'b...	0.962810	0.958159	0.964211	0.961726	0.002582	129

Comparing Performance Of Bagging With Decision Tree/Extra Tree

bag_classifier = ensemble.BaggingClassifier(random_state=1)
bag_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_classifier.__class__.__name__,
                                                     bag_classifier.score(X_train, Y_train),bag_classifier.score(X_test, Y_test)))

bag_classifier = ensemble.BaggingClassifier(base_estimator=KNeighborsClassifier(), random_state=1)
bag_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_classifier.__class__.__name__,
                                                     bag_classifier.score(X_train, Y_train),bag_classifier.score(X_test, Y_test)))

bag_classifier = ensemble.BaggingClassifier(random_state=1, **bagging_classifier_grid.best_params_)
bag_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_classifier.__class__.__name__,
                                                     bag_classifier.score(X_train, Y_train),bag_classifier.score(X_test, Y_test)))

dtree_classifier = tree.DecisionTreeClassifier(random_state=1)
dtree_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(dtree_classifier.__class__.__name__,
                                                     dtree_classifier.score(X_train, Y_train),dtree_classifier.score(X_test, Y_test)))

extra_tree_classifier = tree.ExtraTreeClassifier(random_state=1)
extra_tree_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(extra_tree_classifier.__class__.__name__,
                                                     extra_tree_classifier.score(X_train, Y_train),extra_tree_classifier.score(X_test, Y_test)))

BaggingClassifier : Train Accuracy : 1.00, Test Accuracy : 0.96
BaggingClassifier : Train Accuracy : 0.99, Test Accuracy : 0.99
BaggingClassifier : Train Accuracy : 1.00, Test Accuracy : 0.99
DecisionTreeClassifier : Train Accuracy : 1.00, Test Accuracy : 0.88
ExtraTreeClassifier : Train Accuracy : 1.00, Test Accuracy : 0.82

Random Forests

Random Forests are a slight improvement over bagging. Combining predictions from various decision trees works well when the predictions of these trees are as uncorrelated as possible. In a sense, each sub-tree should predict some part of the problem better than all other sub-trees. The problem with bagging decision trees is that each tree is built greedily: at every node it picks the split that reduces error the most among all features. Because of this greedy approach, trees trained on different bootstrap samples tend to choose the same strong features and therefore produce correlated predictions. In a random forest, when splitting a node during the construction of a tree, the chosen split is no longer the best among all features. Instead, it is the best split among a random subset of features.

Random Forests change the algorithm so that each split considers only a random subset of features and picks the best split within that subset, which generates sub-trees that are less correlated. Random forests also average the results of the various sub-trees when predicting; it is only during training, when choosing splits, that they differ from bagging.
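
The difference in split selection can be sketched on toy data. This is a simplified illustration, not scikit-learn's tree builder: `best_split` and `weighted_variance` are hypothetical helpers, the split quality is measured by size-weighted variance of the targets, and the data is synthetic with only feature 0 carrying signal.

```python
import random

def weighted_variance(left, right):
    """Size-weighted variance of the targets on each side of a split."""
    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)
    n = len(left) + len(right)
    return (len(left) * variance(left) + len(right) * variance(right)) / n

def best_split(X, y, feature_ids):
    """Exhaustively search thresholds, but only on the given features."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in X}):
            left = [t for row, t in zip(X, y) if row[f] <= threshold]
            right = [t for row, t in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            score = weighted_variance(left, right)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

rng = random.Random(0)
X = [[rng.random() for _ in range(6)] for _ in range(40)]
y = [3 * row[0] + rng.gauss(0, 0.1) for row in X]   # only feature 0 matters

all_features = list(range(6))
subset = rng.sample(all_features, k=2)   # features one random-forest split may use

split_all = best_split(X, y, all_features)
split_subset = best_split(X, y, subset)
print("split using all features   : feature", split_all[1])
print("split using subset", subset, ": feature", split_subset[1])
```

Searching all features can never give a worse split than searching a subset, so an individual random-forest split may be slightly weaker; the payoff is that different trees end up relying on different features.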

Extremely Randomized Trees

Scikit-Learn also provides another variant of Random Forests that is further randomized in how splits are selected. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule.
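
The two ways of choosing a threshold can be contrasted on a single feature. This is a toy sketch with hypothetical helpers (`split_cost`, `exhaustive_threshold`, `random_threshold`); a real extra-trees split would draw one random threshold per candidate feature and keep the best of those.

```python
import random

def split_cost(values, targets, threshold):
    """Size-weighted variance of the targets after splitting at threshold."""
    sides = ([t for v, t in zip(values, targets) if v <= threshold],
             [t for v, t in zip(values, targets) if v > threshold])
    cost = 0.0
    for side in sides:
        if side:
            mean = sum(side) / len(side)
            cost += sum((t - mean) ** 2 for t in side)
    return cost / len(targets)

def exhaustive_threshold(values, targets):
    """Random-forest style: try every observed value and keep the best."""
    candidates = sorted(set(values))[:-1]   # the largest value cannot split
    return min(candidates, key=lambda th: split_cost(values, targets, th))

def random_threshold(values, rng):
    """Extra-trees style: draw a threshold uniformly from the feature's range."""
    return rng.uniform(min(values), max(values))

rng = random.Random(0)
values = [rng.random() for _ in range(50)]
targets = [1.0 if v > 0.6 else 0.0 for v in values]   # a clean step at 0.6

th_best = exhaustive_threshold(values, targets)
th_rand = random_threshold(values, rng)
cost_best = split_cost(values, targets, th_best)
cost_rand = split_cost(values, targets, th_rand)
print("exhaustive threshold:", round(th_best, 3), "cost:", round(cost_best, 4))
print("random threshold    :", round(th_rand, 3), "cost:", round(cost_rand, 4))
```

The random threshold is cheaper to compute and adds more diversity between trees, at the price of individually weaker splits.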

RandomForestRegressor

We'll be explaining the usage of RandomForestRegressor by using the Boston housing data set. We'll first train the model with default parameters and then do hyper-parameter tuning.

Train/Test Split Boston Dataset

X_train, X_test, Y_train, Y_test = train_test_split(X_boston, Y_boston, train_size=0.80, test_size=0.20, random_state=123)
print('Train/Test Sets Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

Train/Test Sets Sizes :  (404, 13) (102, 13) (404,) (102,)

Fitting Model To Train Data

from sklearn.ensemble import RandomForestRegressor

rforest_regressor = RandomForestRegressor(random_state=1)
rforest_regressor.fit(X_train, Y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=1, verbose=0,
                      warm_start=False)

Evaluating Trained Model On Test Data

Almost all models in the Scikit-Learn API provide a predict() method, which can be used to predict the target variable for the test set passed to it.

Y_preds = rforest_regressor.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Training Coefficient of R^2 : %.3f'%rforest_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%rforest_regressor.score(X_test, Y_test))

[21.16 25.88 48.8  21.4  29.51 40.95 23.39  7.97 20.29 30.91]
[15.  26.6 45.4 20.8 34.9 21.9 28.7  7.2 20.  32.2]
Training Coefficient of R^2 : 0.981
Test Coefficient of R^2 : 0.815

Fine-tuning Model By Doing Grid Search On Various Hyperparameters

Below is a list of common hyperparameters that need tuning to get the best fit for our data. We'll try various hyperparameter settings on several splits of the train data to find the best fit, ideally one whose train and test scores are almost the same or differ only slightly.

n_estimators - Number of base estimators whose results will be combined to produce the final prediction. default=10
max_depth - Defines how finely a tree can separate samples (the length of the chain of "if-else" questions asked to decide the target variable). As we increase max_depth the model tends to overfit, and low values of max_depth result in underfitting. We need to find the best value. default=None
min_samples_split - Minimum number of samples required to split an internal node. It accepts an int (2 or more) or a float in (0.0, 1.0]. A float is interpreted as ceil(min_samples_split * n_samples) samples.
min_samples_leaf - Minimum number of samples required at a leaf node. It accepts an int (1 or more) or a float in (0.0, 0.5]. A float is interpreted as ceil(min_samples_leaf * n_samples) samples.
criterion - Cost function which the algorithm tries to minimize. Currently it supports mse (mean squared error) & mae (mean absolute error).
max_features - Number of features to consider when looking for a split. It accepts an int in [1, n_features], a float in (0.0, 1.0], a string (sqrt, log2, auto) or None.
- None - n_features are used if None is provided.
- sqrt - sqrt(n_features) features are used for a split.
- auto - n_features features are used for a split by regressors (classifiers use sqrt(n_features)).
- log2 - log2(n_features) features are used for a split.
bootstrap - Decides whether samples are drawn with replacement. True = with replacement, False = without replacement. default=True
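
How the int, float and string forms of these settings translate into concrete counts can be sketched as follows. The helpers `resolve_min_samples` and `resolve_max_features` are hypothetical, the truncation of fractional results is an assumption, and 'auto' is left out because its meaning depends on whether the estimator is a regressor or a classifier.

```python
import math

def resolve_min_samples(value, n_samples):
    """An int is used as-is; a float is a fraction of the training rows."""
    if isinstance(value, float):
        return math.ceil(value * n_samples)
    return value

def resolve_max_features(value, n_features):
    """Translate a max_features setting into a number of features."""
    if value is None:
        return n_features
    if value == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if value == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(value, float):
        return max(1, int(value * n_features))   # assumed: fraction is truncated
    return value

n_samples, n_features = 404, 13   # shape of the Boston train set used here

print("min_samples_split=0.5 ->", resolve_min_samples(0.5, n_samples), "samples")
for setting in (None, "sqrt", "log2", 0.3, 0.5, 6):
    print("max_features =", setting, "->", resolve_max_features(setting, n_features))
```

With 13 features, several of these settings land on the same count (3), so a grid containing all of them repeats some work.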

Below, we'll try various values for the above-mentioned hyperparameters to find the best estimator for our dataset by doing 3-fold cross-validation on the data.

%%time

n_samples = X_boston.shape[0]
n_features = X_boston.shape[1]

params = {'n_estimators': [20,50,100],
          'max_depth': [None, 2, 5],
          'min_samples_split': [2, 0.5, n_samples//2, ],
          'min_samples_leaf': [1, 0.5, n_samples//2, ],
          'criterion': ['mse', 'mae'],
          'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5, n_features//2,  ],
          'bootstrap':[True, False]
         }

rf_regressor_grid = GridSearchCV(RandomForestRegressor(random_state=1), param_grid=params, n_jobs=-1, cv=3, verbose=1)
rf_regressor_grid.fit(X_train,Y_train)

print('Train R^2 Score : %.3f'%rf_regressor_grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%rf_regressor_grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%rf_regressor_grid.best_score_)
print('Best Parameters : ',rf_regressor_grid.best_params_)

Fitting 3 folds for each of 2268 candidates, totalling 6804 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 212 tasks      | elapsed:    2.5s
[Parallel(n_jobs=-1)]: Done 1112 tasks      | elapsed:   12.7s
[Parallel(n_jobs=-1)]: Done 2612 tasks      | elapsed:   35.0s
[Parallel(n_jobs=-1)]: Done 4712 tasks      | elapsed:   57.1s
[Parallel(n_jobs=-1)]: Done 6804 out of 6804 | elapsed:  1.5min finished

Train R^2 Score : 1.000
Test R^2 Score : 0.827
Best R^2 Score Through Grid Search : 0.882
Best Parameters :  {'bootstrap': False, 'criterion': 'mae', 'max_depth': None, 'max_features': 0.5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
CPU times: user 3.83 s, sys: 74.9 ms, total: 3.91 s
Wall time: 1min 29s

Printing First Few Cross Validation Results

cross_val_results = pd.DataFrame(rf_regressor_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.

Number of Various Combinations of Parameters Tried : 2268

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_bootstrap	param_criterion	param_max_depth	param_max_features	param_min_samples_leaf	param_min_samples_split	param_n_estimators	params	split0_test_score	split1_test_score	split2_test_score	mean_test_score	std_test_score	rank_test_score
0	0.053245	0.001387	0.001341	0.000004	True	mse	None	None	1	2	20	{'bootstrap': True, 'criterion': 'mse', 'max_d...	0.859288	0.879591	0.832987	0.857349	0.019064	73
1	0.094573	0.010596	0.002523	0.000043	True	mse	None	None	1	2	50	{'bootstrap': True, 'criterion': 'mse', 'max_d...	0.879421	0.886612	0.830498	0.865597	0.024901	49
2	0.170163	0.010898	0.005014	0.000892	True	mse	None	None	1	2	100	{'bootstrap': True, 'criterion': 'mse', 'max_d...	0.879800	0.889290	0.830718	0.866692	0.025638	41
3	0.026885	0.012239	0.002204	0.001538	True	mse	None	None	1	0.5	20	{'bootstrap': True, 'criterion': 'mse', 'max_d...	0.774189	0.667366	0.684763	0.708832	0.046841	209
4	0.053009	0.017270	0.001979	0.000048	True	mse	None	None	1	0.5	50	{'bootstrap': True, 'criterion': 'mse', 'max_d...	0.768447	0.662649	0.659679	0.697017	0.050617	236

ExtraTreesRegressor

We'll be explaining the usage of ExtraTreesRegressor by using the Boston housing data set. We'll first train the model with default parameters and then do hyper-parameter tuning. We'll also be comparing the performance of tuned extra trees regression estimator with random forest, decision tree, and extra tree estimator of scikit-learn.

Fitting Model To Train Data

from sklearn.ensemble import ExtraTreesRegressor

extra_forest_regressor = ExtraTreesRegressor(random_state=1)
extra_forest_regressor.fit(X_train, Y_train)

ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
                    max_features='auto', max_leaf_nodes=None,
                    min_impurity_decrease=0.0, min_impurity_split=None,
                    min_samples_leaf=1, min_samples_split=2,
                    min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
                    oob_score=False, random_state=1, verbose=0,
                    warm_start=False)

Evaluating Trained Model On Test Data

Almost all models in the Scikit-Learn API provide a predict() method, which can be used to predict the target variable for the test set passed to it.

Y_preds = extra_forest_regressor.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Training Coefficient of R^2 : %.3f'%extra_forest_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%extra_forest_regressor.score(X_test, Y_test))

[27.62 26.92 45.88 19.21 30.91 41.58 25.28  7.82 19.16 30.99]
[15.  26.6 45.4 20.8 34.9 21.9 28.7  7.2 20.  32.2]
Training Coefficient of R^2 : 1.000
Test Coefficient of R^2 : 0.829

Fine-tuning Model By Doing Grid Search On Various Hyperparameters

ExtraTreesRegressor has the same parameters to tune as that of RandomForestRegressor.

%%time

n_samples = X_boston.shape[0]
n_features = X_boston.shape[1]

params = {'n_estimators': [20,50,100],
          'max_depth': [None, 2,5,],
          'min_samples_split': [2, 0.5, n_samples//2, ],
          'min_samples_leaf': [1, 0.5, n_samples//2, ],
          'criterion': ['mse', 'mae'],
          'max_features': [None, 'sqrt', 'auto', 'log2', 0.3, 0.5, n_features//2],
          'bootstrap':[True, False]
         }

ef_regressor_grid = GridSearchCV(ExtraTreesRegressor(random_state=1), param_grid=params, n_jobs=-1, cv=3, verbose=1)
ef_regressor_grid.fit(X_train,Y_train)

print('Train R^2 Score : %.3f'%ef_regressor_grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%ef_regressor_grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%ef_regressor_grid.best_score_)
print('Best Parameters : ',ef_regressor_grid.best_params_)

Fitting 3 folds for each of 2268 candidates, totalling 6804 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 212 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 1112 tasks      | elapsed:   11.3s
[Parallel(n_jobs=-1)]: Done 2612 tasks      | elapsed:   28.6s
[Parallel(n_jobs=-1)]: Done 4712 tasks      | elapsed:   46.9s
[Parallel(n_jobs=-1)]: Done 6804 out of 6804 | elapsed:  1.2min finished

Train R^2 Score : 1.000
Test R^2 Score : 0.870
Best R^2 Score Through Grid Search : 0.886
Best Parameters :  {'bootstrap': False, 'criterion': 'mae', 'max_depth': None, 'max_features': 0.5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
CPU times: user 3.11 s, sys: 58.2 ms, total: 3.16 s
Wall time: 1min 9s
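
The "2268 candidates, totalling 6804 fits" line in the log follows directly from the grid: every combination of the listed values is tried once per fold. A minimal sketch of that arithmetic, assuming the Boston data shape of 506 samples and 13 features (hard-coded here so the snippet runs on its own):

```python
from itertools import product

# Assumption: shape of the Boston housing data (506 samples, 13 features).
n_samples, n_features = 506, 13

params = {'n_estimators': [20, 50, 100],
          'max_depth': [None, 2, 5],
          'min_samples_split': [2, 0.5, n_samples // 2],
          'min_samples_leaf': [1, 0.5, n_samples // 2],
          'criterion': ['mse', 'mae'],
          'max_features': [None, 'sqrt', 'auto', 'log2', 0.3, 0.5, n_features // 2],
          'bootstrap': [True, False]}

# Grid search tries every combination, i.e. the Cartesian product of all lists.
n_candidates = len(list(product(*params.values())))
n_fits = n_candidates * 3   # cv=3 folds per candidate

print('Candidates : %d' % n_candidates)
print('Total fits : %d' % n_fits)
```

Because the count is a product of the list lengths, adding a single extra value to one list can add hundreds of fits.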

Printing First Few Cross Validation Results¶

cross_val_results = pd.DataFrame(ef_regressor_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.

Number of Various Combinations of Parameters Tried : 2268

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_bootstrap	param_criterion	param_max_depth	param_max_features	param_min_samples_leaf	param_min_samples_split	param_n_estimators	params	split0_test_score	split1_test_score	split2_test_score	mean_test_score	std_test_score	rank_test_score
0	0.053850	0.002915	0.001736	0.000338	True	mse	None	None	1	2	20	{'bootstrap': True, 'criterion': 'mse', 'max_d...	0.883367	0.863108	0.811594	0.852792	0.030181	64
1	0.058733	0.011814	0.003005	0.000750	True	mse	None	None	1	2	50	{'bootstrap': True, 'criterion': 'mse', 'max_d...	0.882857	0.878594	0.831779	0.864491	0.023111	44
2	0.091989	0.000384	0.004380	0.000034	True	mse	None	None	1	2	100	{'bootstrap': True, 'criterion': 'mse', 'max_d...	0.880460	0.882770	0.832957	0.865476	0.022928	37
3	0.016946	0.002753	0.001093	0.000013	True	mse	None	None	1	0.5	20	{'bootstrap': True, 'criterion': 'mse', 'max_d...	0.612855	0.573655	0.526824	0.571221	0.035142	255
4	0.032264	0.000045	0.001915	0.000008	True	mse	None	None	1	0.5	50	{'bootstrap': True, 'criterion': 'mse', 'max_d...	0.608944	0.607546	0.560247	0.592325	0.022606	233
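
head() lists candidates in the order they were tried, not by quality. The sketch below sorts the results by rank_test_score and shows that best_score_ is the mean cross-validated score of the top-ranked candidate, which is why it can differ from the score on the held-out test set. The synthetic data and the deliberately tiny grid are assumptions chosen to keep the run short.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a deliberately tiny grid (both are assumptions).
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=1)
small_params = {'n_estimators': [10, 20], 'max_depth': [None, 2]}

grid = GridSearchCV(ExtraTreesRegressor(random_state=1), param_grid=small_params, cv=3)
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)

# rank_test_score == 1 marks the best candidate; sort to bring it to the top.
top = results.sort_values('rank_test_score')[['params', 'mean_test_score', 'rank_test_score']]
print(top.head())

# best_score_ is the mean cross-validated score of that top-ranked candidate.
best_mean = results.loc[grid.best_index_, 'mean_test_score']
print('best_score_ : %.3f' % grid.best_score_)
```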

Comparing Performance Of Random Forest With Decision Tree/Extra Tree¶

## score() returns R^2 for regressors, so the labels below say R^2 rather than accuracy.
rforest_regressor = ensemble.RandomForestRegressor(random_state=1)
rforest_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(rforest_regressor.__class__.__name__,
                                                     rforest_regressor.score(X_train, Y_train),rforest_regressor.score(X_test, Y_test)))

## Random forest with the best parameters found by the grid search.
rforest_regressor = ensemble.RandomForestRegressor(random_state=1, **rf_regressor_grid.best_params_)
rforest_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(rforest_regressor.__class__.__name__,
                                                     rforest_regressor.score(X_train, Y_train),rforest_regressor.score(X_test, Y_test)))

extra_forest_regressor = ensemble.ExtraTreesRegressor(random_state=1)
extra_forest_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(extra_forest_regressor.__class__.__name__,
                                                     extra_forest_regressor.score(X_train, Y_train),extra_forest_regressor.score(X_test, Y_test)))

## Extra trees with the best parameters found by the grid search.
extra_forest_regressor = ensemble.ExtraTreesRegressor(random_state=1, **ef_regressor_grid.best_params_)
extra_forest_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(extra_forest_regressor.__class__.__name__,
                                                     extra_forest_regressor.score(X_train, Y_train),extra_forest_regressor.score(X_test, Y_test)))

dtree_regressor = tree.DecisionTreeRegressor(random_state=1)
dtree_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(dtree_regressor.__class__.__name__,
                                                     dtree_regressor.score(X_train, Y_train),dtree_regressor.score(X_test, Y_test)))

## Single extra tree; the class name is read from the estimator that is actually scored.
extra_tree_regressor = tree.ExtraTreeRegressor(random_state=1)
extra_tree_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(extra_tree_regressor.__class__.__name__,
                                                     extra_tree_regressor.score(X_train, Y_train),extra_tree_regressor.score(X_test, Y_test)))

RandomForestRegressor : Train R^2 : 0.98, Test R^2 : 0.81
RandomForestRegressor : Train R^2 : 1.00, Test R^2 : 0.83
ExtraTreesRegressor : Train R^2 : 1.00, Test R^2 : 0.83
ExtraTreesRegressor : Train R^2 : 1.00, Test R^2 : 0.87
DecisionTreeRegressor : Train R^2 : 1.00, Test R^2 : 0.44
ExtraTreeRegressor : Train R^2 : 1.00, Test R^2 : 0.51
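
The gap between the single trees (test scores of 0.44 and 0.51) and the forests (0.81 to 0.87) comes from averaging many overfit trees. The sketch below illustrates this on a synthetic data set (an assumption, standing in for the Boston data): it rebuilds the forest prediction as the mean of the per-tree predictions available through estimators_, then compares the average single-tree R^2 with the forest R^2.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=300, n_features=13, noise=20.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.80, random_state=123)

forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_tr, y_tr)

# Predictions of every fitted tree, shape (n_estimators, n_test_samples).
per_tree = np.array([t.predict(X_te) for t in forest.estimators_])

# The forest prediction is the plain average over its trees.
averaged = per_tree.mean(axis=0)
same = np.allclose(averaged, forest.predict(X_te))

# R^2 of each individual tree versus R^2 of the averaged forest.
tree_scores = [t.score(X_te, y_te) for t in forest.estimators_]
mean_tree_r2 = float(np.mean(tree_scores))
forest_r2 = forest.score(X_te, y_te)

print('Average equals forest.predict : %s' % same)
print('Mean single-tree R^2          : %.3f' % mean_tree_r2)
print('Forest R^2                    : %.3f' % forest_r2)
```

Since squared error is convex, the error of the averaged prediction can never exceed the average error of the individual trees, so the forest R^2 is at least the mean single-tree R^2.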

RandomForestClassifier ¶

We'll explain how to use RandomForestClassifier with the digits data set. We'll first train the model with default parameters and then tune its hyperparameters.

Train/Test Split¶

X_train, X_test, Y_train, Y_test = train_test_split(X_digits, Y_digits, train_size=0.80, test_size=0.20, random_state=123)
print('Train/Test Sets Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

Train/Test Sets Sizes :  (1437, 64) (360, 64) (1437,) (360,)

Fitting Model To Train Data¶

from sklearn.ensemble import RandomForestClassifier

rforest_classifier = RandomForestClassifier(random_state=1)
rforest_classifier.fit(X_train, Y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

Evaluating Trained Model On Test Data.¶

Almost all models in the Scikit-Learn API provide a predict() method, which can be used to predict the target variable for the test set passed to it.

Y_preds = rforest_classifier.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean() )
print('Test Accuracy : %.3f'%rforest_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%rforest_classifier.score(X_train, Y_train))

[3 3 4 4 1 3 1 0 7 4 0 0 5 1 6]
[3 3 4 4 1 3 1 0 7 4 0 0 5 1 6]
Test Accuracy : 0.944
Test Accuracy : 0.944
Training Accuracy : 0.999
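
Besides predict(), forest classifiers also offer predict_proba(), which returns class probabilities averaged over the trees. A small sketch, assuming the digits data is loaded with load_digits(return_X_y=True) and split as above, shows that predict() simply picks the class with the highest averaged probability. The names clf, proba and from_proba are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_digits, Y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_digits, Y_digits, train_size=0.80,
                                          test_size=0.20, random_state=123)

clf = RandomForestClassifier(n_estimators=20, random_state=1).fit(X_tr, y_tr)

# Class probabilities averaged over all trees, shape (n_samples, n_classes).
proba = clf.predict_proba(X_te)

# predict() returns the class with the highest averaged probability.
from_proba = clf.classes_[np.argmax(proba, axis=1)]
matches = bool((from_proba == clf.predict(X_te)).all())

print('Probabilities for first test digit :', np.round(proba[0], 2))
print('argmax matches predict()           :', matches)
```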

Finetuning Model By Doing Grid Search On Various Hyperparameters.¶

RandomForestClassifier exposes the same tunable parameters as RandomForestRegressor. The grid below leaves criterion at its default.

%%time

n_samples = X_digits.shape[0]
n_features = X_digits.shape[1]

params = {'n_estimators': [20,50,100],
          'max_depth': [None, 2, 5,],
          'min_samples_split': [2, 0.5, n_samples//2, ],
          'min_samples_leaf': [1, 0.5, n_samples//2, ],
          'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5, n_features//2, ],
          'bootstrap':[True, False]
         }

rf_classifier_grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid=params, n_jobs=-1, cv=3, verbose=1)
rf_classifier_grid.fit(X_train,Y_train)

print('Train Accuracy : %.3f'%rf_classifier_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%rf_classifier_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%rf_classifier_grid.best_score_)
print('Best Parameters : ',rf_classifier_grid.best_params_)

Fitting 3 folds for each of 1134 candidates, totalling 3402 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   18.8s
[Parallel(n_jobs=-1)]: Done 625 tasks      | elapsed:   29.5s
[Parallel(n_jobs=-1)]: Done 1625 tasks      | elapsed:   44.4s
[Parallel(n_jobs=-1)]: Done 3025 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 3402 out of 3402 | elapsed:  1.2min finished

Train Accuracy : 1.000
Test Accuracy : 0.975
Best Accuracy Through Grid Search : 0.978
Best Parameters :  {'bootstrap': False, 'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
CPU times: user 2.82 s, sys: 261 ms, total: 3.09 s
Wall time: 1min 12s

Printing First Few Cross Validation Results¶

cross_val_results = pd.DataFrame(rf_classifier_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.

Number of Various Combinations of Parameters Tried : 1134

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_bootstrap	param_max_depth	param_max_features	param_min_samples_leaf	param_min_samples_split	param_n_estimators	params	split0_test_score	split1_test_score	split2_test_score	mean_test_score	std_test_score	rank_test_score
0	0.219051	0.001484	0.002320	0.000180	True	None	None	1	2	20	{'bootstrap': True, 'max_depth': None, 'max_fe...	0.939959	0.937238	0.949580	0.942241	0.005284	39
1	0.468140	0.020043	0.005235	0.000978	True	None	None	1	2	50	{'bootstrap': True, 'max_depth': None, 'max_fe...	0.946170	0.941423	0.951681	0.946416	0.004183	38
2	0.686423	0.064943	0.009341	0.000884	True	None	None	1	2	100	{'bootstrap': True, 'max_depth': None, 'max_fe...	0.950311	0.945607	0.960084	0.951983	0.006017	37
3	0.034605	0.000435	0.001700	0.000004	True	None	None	1	0.5	20	{'bootstrap': True, 'max_depth': None, 'max_fe...	0.616977	0.608787	0.441176	0.556019	0.080894	263
4	0.088249	0.003297	0.003656	0.000107	True	None	None	1	0.5	50	{'bootstrap': True, 'max_depth': None, 'max_fe...	0.633540	0.518828	0.457983	0.537230	0.072874	271
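
Rows 3 and 4 above score far lower than rows 0 to 2, and the only setting that changed is min_samples_split=0.5. According to the scikit-learn documentation, a float value for min_samples_split or min_samples_leaf is read as a fraction of the samples (rounded up), while an integer is an absolute count. The sketch below works out what the grid values mean inside one cross-validation fold. The helper effective_minimum is hypothetical, and the fold size is an approximation based on the 1437 training samples and cv=3.

```python
import math

# Sizes taken from the tutorial: full digits data and the training split.
n_total, n_train, cv = 1797, 1437, 3

# Approximate number of samples each tree sees inside one CV training fold.
fold_train = n_train - n_train // cv

def effective_minimum(value, n):
    """Resolve a min_samples_* setting to a sample count (hypothetical helper).

    Floats are fractions of n (rounded up); ints are used as-is.
    """
    if isinstance(value, float):
        return math.ceil(value * n)
    return value

for value in [2, 0.5, n_total // 2]:
    needed = effective_minimum(value, fold_train)
    print('setting=%-5s -> at least %d of %d samples' % (value, needed, fold_train))
```

With roughly 958 samples per fold, 0.5 demands 479 samples and n_samples//2 (computed from the full data set) demands 898, so the trees can make very few splits. That is consistent with the much lower scores of those candidates.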

ExtraTreesClassifier ¶

We'll explain how to use ExtraTreesClassifier with the digits data set. We'll first train the model with default parameters and then tune its hyperparameters. We'll also compare the tuned extra trees classifier with the random forest, decision tree, and extra tree estimators of scikit-learn.

Fitting Model To Train Data¶

from sklearn.ensemble import ExtraTreesClassifier

extra_forest_classifier = ExtraTreesClassifier(random_state=1)
extra_forest_classifier.fit(X_train, Y_train)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
                     oob_score=False, random_state=1, verbose=0,
                     warm_start=False)
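
The printed defaults above show bootstrap=False, whereas the RandomForestClassifier defaults printed earlier show bootstrap=True. A short sketch using get_params() confirms that both estimators expose the parameters tuned in this tutorial and differ in the bootstrap default. The list named tuned is just the set of grid keys used here.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

et_params = ExtraTreesClassifier().get_params()
rf_params = RandomForestClassifier().get_params()

# Hyperparameters tuned in this tutorial's grids.
tuned = ['n_estimators', 'max_depth', 'min_samples_split',
         'min_samples_leaf', 'max_features', 'bootstrap']

# Both estimators expose every one of these names.
shared = all(name in et_params and name in rf_params for name in tuned)
print('All tuned parameters shared      :', shared)

# The default for bootstrap differs between the two.
print('ExtraTreesClassifier bootstrap   :', et_params['bootstrap'])
print('RandomForestClassifier bootstrap :', rf_params['bootstrap'])
```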

Evaluating Trained Model On Test Data.¶

Almost all models in the Scikit-Learn API provide a predict() method, which can be used to predict the target variable for the test set passed to it.

Y_preds = extra_forest_classifier.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean())
print('Test Accuracy : %.3f'%extra_forest_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%extra_forest_classifier.score(X_train, Y_train))

[3 3 4 4 1 3 1 0 7 4 0 0 5 1 6]
[3 3 4 4 1 3 1 0 7 4 0 0 5 1 6]
Test Accuracy : 0.950
Test Accuracy : 0.950
Training Accuracy : 1.000

Finetuning Model By Doing Grid Search On Various Hyperparameters.¶

ExtraTreesClassifier exposes the same tunable parameters as RandomForestClassifier.

%%time

n_samples = X_digits.shape[0]
n_features = X_digits.shape[1]

params = {'n_estimators': [20,50,100],
          'max_depth': [None, 2, 5,],
          'min_samples_split': [2, 0.5, n_samples//2, ],
          'min_samples_leaf': [1, 0.5, n_samples//2, ],
          'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5, n_features//2, ],
          'bootstrap':[True, False]
         }

ef_classifier_grid = GridSearchCV(ExtraTreesClassifier(random_state=1), param_grid=params, n_jobs=-1, cv=3, verbose=1)
ef_classifier_grid.fit(X_train,Y_train)

print('Train Accuracy : %.3f'%ef_classifier_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%ef_classifier_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%ef_classifier_grid.best_score_)
print('Best Parameters : ',ef_classifier_grid.best_params_)

Fitting 3 folds for each of 1134 candidates, totalling 3402 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 144 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]: Done 744 tasks      | elapsed:   11.5s
[Parallel(n_jobs=-1)]: Done 1744 tasks      | elapsed:   26.8s
[Parallel(n_jobs=-1)]: Done 3144 tasks      | elapsed:   44.7s
[Parallel(n_jobs=-1)]: Done 3402 out of 3402 | elapsed:   48.2s finished

Train Accuracy : 1.000
Test Accuracy : 0.983
Best Accuracy Through Grid Search : 0.982
Best Parameters :  {'bootstrap': False, 'max_depth': None, 'max_features': 0.3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
CPU times: user 2.94 s, sys: 125 ms, total: 3.07 s
Wall time: 48.6 s

Printing First Few Cross Validation Results¶

cross_val_results = pd.DataFrame(ef_classifier_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.

Number of Various Combinations of Parameters Tried : 1134

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_bootstrap	param_max_depth	param_max_features	param_min_samples_leaf	param_min_samples_split	param_n_estimators	params	split0_test_score	split1_test_score	split2_test_score	mean_test_score	std_test_score	rank_test_score
0	0.053711	0.001229	0.002628	0.000549	True	None	None	1	2	20	{'bootstrap': True, 'max_depth': None, 'max_fe...	0.950311	0.960251	0.966387	0.958942	0.006631	39
1	0.130861	0.000975	0.009696	0.004870	True	None	None	1	2	50	{'bootstrap': True, 'max_depth': None, 'max_fe...	0.958592	0.962343	0.978992	0.966597	0.008857	30
2	0.291812	0.011959	0.009114	0.000343	True	None	None	1	2	100	{'bootstrap': True, 'max_depth': None, 'max_fe...	0.966874	0.962343	0.978992	0.969381	0.007013	28
3	0.022173	0.000474	0.001731	0.000020	True	None	None	1	0.5	20	{'bootstrap': True, 'max_depth': None, 'max_fe...	0.631470	0.558577	0.550420	0.580376	0.036507	307
4	0.060601	0.009223	0.003610	0.000019	True	None	None	1	0.5	50	{'bootstrap': True, 'max_depth': None, 'max_fe...	0.643892	0.587866	0.632353	0.621434	0.024163	301

Comparing Performance Of Random Forest With Decision Tree/Extra Tree¶

rforest_classifier = ensemble.RandomForestClassifier(random_state=1)
rforest_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(rforest_classifier.__class__.__name__,
                                                     rforest_classifier.score(X_train, Y_train),rforest_classifier.score(X_test, Y_test)))

rforest_classifier = ensemble.RandomForestClassifier(random_state=1, **rf_classifier_grid.best_params_)
rforest_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(rforest_classifier.__class__.__name__,
                                                     rforest_classifier.score(X_train, Y_train),rforest_classifier.score(X_test, Y_test)))

extra_forest_classifier = ensemble.ExtraTreesClassifier(random_state=1)
extra_forest_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(extra_forest_classifier.__class__.__name__,
                                                     extra_forest_classifier.score(X_train, Y_train),extra_forest_classifier.score(X_test, Y_test)))

extra_forest_classifier = ensemble.ExtraTreesClassifier(random_state=1, **ef_classifier_grid.best_params_)
extra_forest_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(extra_forest_classifier.__class__.__name__,
                                                     extra_forest_classifier.score(X_train, Y_train),extra_forest_classifier.score(X_test, Y_test)))

dtree_classifier = tree.DecisionTreeClassifier(random_state=1)
dtree_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(dtree_classifier.__class__.__name__,
                                                     dtree_classifier.score(X_train, Y_train),dtree_classifier.score(X_test, Y_test)))

extra_tree_classifier = tree.ExtraTreeClassifier(random_state=1)
extra_tree_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(extra_tree_classifier.__class__.__name__,
                                                     extra_tree_classifier.score(X_train, Y_train),extra_tree_classifier.score(X_test, Y_test)))

RandomForestClassifier : Train Accuracy : 1.00, Test Accuracy : 0.94
RandomForestClassifier : Train Accuracy : 1.00, Test Accuracy : 0.97
ExtraTreesClassifier : Train Accuracy : 1.00, Test Accuracy : 0.95
ExtraTreesClassifier : Train Accuracy : 1.00, Test Accuracy : 0.98
DecisionTreeClassifier : Train Accuracy : 1.00, Test Accuracy : 0.83
ExtraTreeClassifier : Train Accuracy : 1.00, Test Accuracy : 0.83
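
The comparison above repeats the same fit/score/print steps for each estimator. As a compact alternative, the sketch below loops over a list of estimators and reads the class name from the object that is actually scored, which avoids label mix-ups. It covers only the default-parameter estimators and assumes the digits data is loaded with load_digits(return_X_y=True); the scores dictionary is illustrative.

```python
from sklearn import ensemble, tree
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X_digits, Y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_digits, Y_digits, train_size=0.80,
                                          test_size=0.20, random_state=123)

# Default-parameter estimators only; tuned variants could be appended the same way.
estimators = [ensemble.RandomForestClassifier(random_state=1),
              ensemble.ExtraTreesClassifier(random_state=1),
              tree.DecisionTreeClassifier(random_state=1),
              tree.ExtraTreeClassifier(random_state=1)]

scores = {}
for est in estimators:
    est.fit(X_tr, y_tr)
    name = est.__class__.__name__   # taken from the object being scored
    scores[name] = (est.score(X_tr, y_tr), est.score(X_te, y_te))
    print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f" % (name, *scores[name]))
```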

This ends our short tutorial on the ensemble learning methods bagging and random forests using scikit-learn. Please let us know your views in the comments section.

Sunny Solanki
