We already discussed decision trees in depth in our decision tree tutorial. We noticed there that a single decision tree generally over-fits the training data very easily, hence it's a better idea to combine many decision trees when making a decision. The basic idea is that multiple overfitting estimators can be combined to reduce the effect of overfitting and produce predictions that generalize better. This idea is generally referred to as ensemble learning in the machine learning community.
There are two main ways to combine decision trees to make better decisions:
* Bagging (averaging) - train many decision trees independently on random sub-samples of the data and then average (or vote on) their predictions. Random forests belong to this family.
* Boosting - train decision trees sequentially, where each new tree tries to correct the mistakes of the previous ones.
In this tutorial, we'll be discussing bagging and random forests. We'll cover boosting in depth in a separate tutorial.
import numpy as np
import pandas as pd
import sklearn
from sklearn import ensemble, datasets, tree
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import GridSearchCV
import sys
import warnings
warnings.filterwarnings("ignore")
print("Python Version : ",sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)
Bagging starts by creating many sub-samples of the original data (drawn with replacement) and then trains a separate decision tree on each sub-sample. When a prediction is to be made on new data, it votes or averages the predictions from all the decision trees. The basic idea is to solve the overfitting problem (reduce high variance) by introducing some randomization.
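To make the idea concrete, here is a minimal from-scratch sketch of bagging (this snippet is ours, not part of the original tutorial) on a synthetic dataset generated with make_regression: we train several deep decision trees on bootstrap samples, average their predictions, and compare against a single tree.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
## Synthetic data purely for illustration
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=15.0, random_state=0)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(X_demo, y_demo, random_state=0)
rng = np.random.RandomState(0)
trees = []
for _ in range(25):
    ## Bootstrap sample: same size as the train set, drawn with replacement
    idx = rng.randint(0, len(X_demo_train), size=len(X_demo_train))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_demo_train[idx], y_demo_train[idx]))
## Average the predictions of the individual (over-fitted) trees
avg_preds = np.mean([t.predict(X_demo_test) for t in trees], axis=0)
single_tree = DecisionTreeRegressor(random_state=0).fit(X_demo_train, y_demo_train)
print('Single tree test R^2    : %.3f' % r2_score(y_demo_test, single_tree.predict(X_demo_test)))
print('Averaged trees test R^2 : %.3f' % r2_score(y_demo_test, avg_preds))
The bagging estimators described next automate exactly this loop and add options such as sampling features as well as rows.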
Scikit-Learn provides BaggingRegressor and BaggingClassifier estimators.
We'll be explaining the usage of BaggingRegressor by using the Boston housing data set. We'll first train the model with default parameters and then do hyper-parameter tuning. We'll also be comparing the performance of the tuned bagging estimator with the decision tree and extra tree estimators of scikit-learn.
from sklearn import datasets
boston = datasets.load_boston()
X_boston, Y_boston = boston.data, boston.target
print('Dataset features names : '+str(boston.feature_names))
print('Dataset features size : '+str(boston.data.shape))
print('Dataset target size : '+str(boston.target.shape))
We'll split the dataset into two parts:
* Training data - which will be used to train the model.
* Test data - against which the accuracy of the trained model will be checked.
The train_test_split function of the model_selection module of sklearn will help us split the data into two sets, with 80% for training and 20% for testing. We are also using a seed (random_state=123) with train_test_split so that we always get the same split and can reproduce results in the future as well.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_boston, Y_boston , train_size=0.80, test_size=0.20, random_state=123)
print('Train/Test Sets Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
We can call the fit() method on the estimator, passing it the train features and train target. It'll then train the model using that data.
from sklearn.ensemble import BaggingRegressor
bag_regressor = BaggingRegressor(random_state=1)
bag_regressor.fit(X_train, Y_train)
Almost all models in the Scikit-Learn API provide a predict() method which can be used to predict the target variable on the test set passed to it.
Y_preds = bag_regressor.predict(X_test)
print(Y_preds[:10])
print(Y_test[:10])
print('Training Coefficient of R^2 : %.3f'%bag_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%bag_regressor.score(X_test, Y_test))
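For regressors, the score() method returns the coefficient of determination R^2. As a quick sanity check (this snippet is ours, not part of the original tutorial), we can reproduce the test score by hand and with sklearn.metrics.r2_score:
from sklearn.metrics import r2_score
## R^2 = 1 - (sum of squared residuals) / (total sum of squares around the mean target)
manual_r2 = 1 - ((Y_test - Y_preds) ** 2).sum() / ((Y_test - Y_test.mean()) ** 2).sum()
print('Manual R^2       : %.3f' % manual_r2)
print('sklearn r2_score : %.3f' % r2_score(Y_test, Y_preds))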
Below is a list of common hyperparameters that need tuning to get the best fit for our data. We'll try various hyperparameter settings on splits of train/test data to find the best fit, i.e. one that has almost the same accuracy for both the train and test datasets, or at least a small difference between them.
* base_estimator - The base estimator to fit on random subsets of the dataset. It accepts an estimator object or None (a decision tree is used when None is given). default=None
* n_estimators - The number of base estimators in the ensemble. default=10
* bootstrap - Whether samples are drawn with replacement. default=True
* bootstrap_features - Whether features are drawn with replacement. default=False
* max_samples - It accepts int(1 - n_samples) or float(0.0 - 1.0] values. It represents the number of samples to draw from the train data to train a particular estimator.
* max_features - It accepts int(1 - n_features) or float(0.0 - 1.0] values. It represents the number of features to draw from the train data to train a particular estimator.
GridSearchCV is a wrapper class provided by sklearn which loops through all combinations of the parameters provided as the param_grid parameter, with the number of cross-validation folds provided as the cv parameter, evaluates model performance on all combinations and stores all results in the cv_results_ attribute. It also stores the model which performs best across all cross-validation folds in the best_estimator_ attribute and the best score in the best_score_ attribute.
The n_jobs parameter is provided by many estimators. It accepts the number of cores to use for parallelization. If a value of -1 is given, then it uses all cores. It uses the joblib parallel processing library to run things in parallel in the background.
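If you want to see the effect of n_jobs on your own machine, a rough timing comparison like the one below can be used. This is only an illustrative sketch (the small grid is ours and the actual speed-up depends on your hardware):
import time
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV
small_grid = {'n_estimators': [20, 50, 100]}   ## tiny illustrative grid
for n_jobs in (1, -1):
    start = time.time()
    GridSearchCV(BaggingRegressor(random_state=1), param_grid=small_grid, cv=3, n_jobs=n_jobs).fit(X_train, Y_train)
    print('n_jobs=%2d : %.2f seconds' % (n_jobs, time.time() - start))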
Below we'll try various values for the above-mentioned hyperparameters to find the best estimator for our dataset by doing 3-fold cross-validation on the data.
%%time
n_samples = boston.data.shape[0]
n_features = boston.data.shape[1]
params = {'base_estimator': [None, LinearRegression(), KNeighborsRegressor()],
'n_estimators': [20,50,100],
'max_samples': [0.5,1.0, n_samples//2,],
'max_features': [0.5,1.0, n_features//2,],
'bootstrap': [True, False],
'bootstrap_features': [True, False]}
bagging_regressor_grid = GridSearchCV(BaggingRegressor(random_state=1, n_jobs=-1), param_grid =params, cv=3, n_jobs=-1, verbose=1)
bagging_regressor_grid.fit(X_train, Y_train)
print('Train R^2 Score : %.3f'%bagging_regressor_grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%bagging_regressor_grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%bagging_regressor_grid.best_score_)
print('Best Parameters : ',bagging_regressor_grid.best_params_)
GridSearchCV maintains results for all parameter combinations tried across all cross-validation splits. We can access the results for all iterations as a dictionary through its cv_results_ attribute. We are converting it to a pandas dataframe for better readability.
cross_val_results = pd.DataFrame(bagging_regressor_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
Below we are comparing the performance of various bagging regression estimators with a decision tree and extra tree estimators. We can notice that bagging estimators do not over-fit like a decision tree.
## Bagging Regressor with Default Params
bag_regressor = ensemble.BaggingRegressor(random_state=1)
bag_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(bag_regressor.__class__.__name__,
bag_regressor.score(X_train, Y_train),bag_regressor.score(X_test, Y_test)))
## Bagging Regressor with KNeighborsRegressor as base estimator
bag_regressor = ensemble.BaggingRegressor(base_estimator=KNeighborsRegressor(), random_state=1)
bag_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(bag_regressor.__class__.__name__,
bag_regressor.score(X_train, Y_train),bag_regressor.score(X_test, Y_test)))
## Hyper-parameter tuned Bagging Regressor from above
bag_regressor = ensemble.BaggingRegressor(random_state=1, **bagging_regressor_grid.best_params_)
bag_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(bag_regressor.__class__.__name__,
bag_regressor.score(X_train, Y_train),bag_regressor.score(X_test, Y_test)))
## Decision Tree with Default Parameters
dtree_regressor = tree.DecisionTreeRegressor(random_state=1)
dtree_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(dtree_regressor.__class__.__name__,
dtree_regressor.score(X_train, Y_train),dtree_regressor.score(X_test, Y_test)))
## Extra Tree with Default Parameters
extra_tree_regressor = tree.ExtraTreeRegressor(random_state=1)
extra_tree_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(extra_tree_regressor.__class__.__name__,
extra_tree_regressor.score(X_train, Y_train),extra_tree_regressor.score(X_test, Y_test)))
We'll be explaining the usage of BaggingClassifier by using the digits data set. We'll first train the model with default parameters and then do hyper-parameter tuning. We'll also be comparing the performance of the tuned bagging estimator with the decision tree and extra tree estimators of scikit-learn.
digits = datasets.load_digits()
X_digits, Y_digits = digits.data, digits.target
print('Dataset Size : ', X_digits.shape, Y_digits.shape)
Below we are splitting the digits dataset into a train set (80%) and a test set (20%). We are also using a seed (random_state=123) so that we always get the same split and can reproduce results in the future as well.
Please make a note that we are also using the stratify parameter, which prevents an unequal distribution of classes between the train and test sets. For each class, we'll have 80% of its samples in the train set and 20% in the test set. This makes sure that no class dominates either the train or test set (we verify the class proportions right after splitting below).
X_train, X_test, Y_train, Y_test = train_test_split(X_digits, Y_digits, train_size=0.80, test_size=0.20, stratify=Y_digits, random_state=123)
print('Train/Test Set Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
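As a quick check of the stratified split (this snippet is ours, not part of the original tutorial), we can look at the class proportions in the full dataset and in each split; with stratify=Y_digits they should be nearly identical:
## Class proportions should be (almost) the same in the full data, train set and test set
for name, y in [('Full', Y_digits), ('Train', Y_train), ('Test', Y_test)]:
    classes, counts = np.unique(y, return_counts=True)
    print('%5s class proportions : %s' % (name, np.round(counts / counts.sum(), 2)))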
from sklearn.ensemble import BaggingClassifier
bag_classifier = BaggingClassifier(random_state=1)
bag_classifier.fit(X_train, Y_train)
Almost all models in the Scikit-Learn API provide a predict() method which can be used to predict the target variable on the test set passed to it.
Y_preds = bag_classifier.predict(X_test)
print(Y_preds[:15])
print(Y_test[:15])
print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean())
print('Test Accuracy : %.3f'%bag_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%bag_classifier.score(X_train, Y_train))
BaggingClassifier has the same parameters to tune as BaggingRegressor.
%%time
n_samples = digits.data.shape[0]
n_features = digits.data.shape[1]
params = {'base_estimator': [None, LogisticRegression(), KNeighborsClassifier()],
'n_estimators': [20,50,100],
'max_samples': [0.5, 1.0, n_samples//2, ],
'max_features': [0.5, 1.0, n_features//2, ],
'bootstrap': [True, False],
'bootstrap_features': [True, False]}
bagging_classifier_grid = GridSearchCV(BaggingClassifier(random_state=1, n_jobs=-1), param_grid =params, cv=3, n_jobs=-1, verbose=1)
bagging_classifier_grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%bagging_classifier_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%bagging_classifier_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%bagging_classifier_grid.best_score_)
print('Best Parameters : ',bagging_classifier_grid.best_params_)
cross_val_results = pd.DataFrame(bagging_classifier_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
bag_classifier = ensemble.BaggingClassifier(random_state=1)
bag_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_classifier.__class__.__name__,
bag_classifier.score(X_train, Y_train),bag_classifier.score(X_test, Y_test)))
bag_classifier = ensemble.BaggingClassifier(base_estimator=KNeighborsClassifier(), random_state=1)
bag_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_classifier.__class__.__name__,
bag_classifier.score(X_train, Y_train),bag_classifier.score(X_test, Y_test)))
bag_classifier = ensemble.BaggingClassifier(random_state=1, **bagging_classifier_grid.best_params_)
bag_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(bag_classifier.__class__.__name__,
bag_classifier.score(X_train, Y_train),bag_classifier.score(X_test, Y_test)))
dtree_classifier = tree.DecisionTreeClassifier(random_state=1)
dtree_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(dtree_classifier.__class__.__name__,
dtree_classifier.score(X_train, Y_train),dtree_classifier.score(X_test, Y_test)))
extra_tree_classifier = tree.ExtraTreeClassifier(random_state=1)
extra_tree_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(extra_tree_classifier.__class__.__name__,
extra_tree_classifier.score(X_train, Y_train),extra_tree_classifier.score(X_test, Y_test)))
Random Forests are a slight improvement over bagging. Combining predictions from various decision trees works well when the predictions of those trees are as uncorrelated as possible. The problem with bagging is that each tree is grown greedily on all the features, just like a single decision tree, so the trees tend to pick the same strong features for their splits and end up producing highly correlated predictions. Random Forests change the algorithm so that, when splitting a node during the construction of a tree, the split that is chosen is not the best among all features; instead, the split which is picked is the best within a random subset of the features. This extra randomization generates sub-trees that are less correlated with each other. Like bagging, random forests average (or vote on) the results of the various sub-trees when making a prediction; it's only during training, in how the splits are chosen, that they differ.
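To see this decorrelation effect in practice, here is a small illustrative sketch (ours, not from the original tutorial) that measures the average pairwise correlation between the predictions of the individual trees inside a plain bagging ensemble and inside a random forest with a restricted max_features. On typical datasets the random forest trees come out less correlated, though the exact numbers depend on the data:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
## Synthetic data purely for illustration
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
def avg_tree_correlation(fitted_ensemble):
    ## Predictions of each individual tree on the same data
    preds = np.array([est.predict(X_demo) for est in fitted_ensemble.estimators_])
    corr = np.corrcoef(preds)                          ## pairwise correlation matrix of tree predictions
    off_diag = corr[~np.eye(len(preds), dtype=bool)]   ## drop the diagonal (self-correlations of 1.0)
    return off_diag.mean()
bag = BaggingRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
## Note: RandomForestRegressor's default max_features uses all features for regression,
## so we restrict it explicitly to make the per-split feature sampling visible.
rf = RandomForestRegressor(n_estimators=50, max_features=0.3, random_state=0).fit(X_demo, y_demo)
print('Avg correlation between bagged trees        : %.3f' % avg_tree_correlation(bag))
print('Avg correlation between random forest trees : %.3f' % avg_tree_correlation(rf))
Restricting max_features generally lowers the correlation between trees, which is exactly what random forests exploit.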
Scikit-Learn also provides another version of Random Forests (Extra Trees) which is even more randomized in selecting splits. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule.
We'll be explaining the usage of RandomForestRegressor by using the Boston housing data set. We'll first train the model with default parameters and then do hyper-parameter tuning.
X_train, X_test, Y_train, Y_test = train_test_split(X_boston, Y_boston, train_size=0.80, test_size=0.20, random_state=123)
print('Train/Test Sets Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
from sklearn.ensemble import RandomForestRegressor
rforest_regressor = RandomForestRegressor(random_state=1)
rforest_regressor.fit(X_train, Y_train)
Almost all models in the Scikit-Learn API provide a predict() method which can be used to predict the target variable on the test set passed to it.
Y_preds = rforest_regressor.predict(X_test)
print(Y_preds[:10])
print(Y_test[:10])
print('Training Coefficient of R^2 : %.3f'%rforest_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%rforest_regressor.score(X_test, Y_test))
Below is a list of common hyperparameters that need tuning to get the best fit for our data. We'll try various hyperparameter settings on splits of train/test data to find the best fit, i.e. one that has almost the same accuracy for both the train and test datasets, or at least a small difference between them.
* n_estimators - The number of trees in the forest. default=10
* max_depth - The maximum depth of the trees. Nodes are expanded until all leaves are pure if None is used. default=None
* min_samples_split - The minimum number of samples required to split an internal node. default=2
* min_samples_leaf - The minimum number of samples required to be at a leaf node. default=1
* criterion - The function used to measure the quality of a split. It accepts mse (mean squared error) & mae (mean absolute error). default=mse
* max_features - The number of features to consider when looking for the best split. It accepts int, float, string or None as value.
    * None - n_features are used as value if None is provided.
    * sqrt - sqrt(n_features) features are used for split.
    * auto - n_features features are used for the regressor (sqrt(n_features) for the classifier).
    * log2 - log2(n_features) features are used for split.
* bootstrap - Whether bootstrap samples are used when building trees. default=True
* max_leaf_nodes - Grow trees with at most this many leaf nodes. default=None
Below we'll try various values for the above-mentioned hyper-parameters to find the best estimator for our dataset by doing 3-fold cross-validation on the data.
%%time
n_samples = X_boston.shape[0]
n_features = X_boston.shape[1]
params = {'n_estimators': [20,50,100],
'max_depth': [None, 2, 5],
'min_samples_split': [2, 0.5, n_samples//2, ],
'min_samples_leaf': [1, 0.5, n_samples//2, ],
'criterion': ['mse', 'mae'],
'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5, n_features//2, ],
'bootstrap':[True, False]
}
rf_regressor_grid = GridSearchCV(RandomForestRegressor(random_state=1), param_grid=params, n_jobs=-1, cv=3, verbose=1)
rf_regressor_grid.fit(X_train,Y_train)
print('Train R^2 Score : %.3f'%rf_regressor_grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%rf_regressor_grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%rf_regressor_grid.best_score_)
print('Best Parameters : ',rf_regressor_grid.best_params_)
cross_val_results = pd.DataFrame(rf_regressor_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
We'll be explaining the usage of ExtraTreesRegressor by using the Boston housing data set. We'll first train the model with default parameters and then do hyper-parameter tuning. We'll also be comparing the performance of the tuned extra trees estimator with the random forest, decision tree, and extra tree estimators of scikit-learn.
from sklearn.ensemble import ExtraTreesRegressor
extra_forest_regressor = ExtraTreesRegressor(random_state=1)
extra_forest_regressor.fit(X_train, Y_train)
Almost all models in the Scikit-Learn API provide a predict() method which can be used to predict the target variable on the test set passed to it.
Y_preds = extra_forest_regressor.predict(X_test)
print(Y_preds[:10])
print(Y_test[:10])
print('Training Coefficient of R^2 : %.3f'%extra_forest_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%extra_forest_regressor.score(X_test, Y_test))
ExtraTreesRegressor has the same parameters to tune as RandomForestRegressor.
%%time
n_samples = X_boston.shape[0]
n_features = X_boston.shape[1]
params = {'n_estimators': [20,50,100],
'max_depth': [None, 2,5,],
'min_samples_split': [2, 0.5, n_samples//2, ],
'min_samples_leaf': [1, 0.5, n_samples//2, ],
'criterion': ['mse', 'mae'],
'max_features': [None, 'sqrt', 'auto', 'log2', 0.3, 0.5, n_features//2],
'bootstrap':[True, False]
}
ef_regressor_grid = GridSearchCV(ExtraTreesRegressor(random_state=1), param_grid=params, n_jobs=-1, cv=3, verbose=1)
ef_regressor_grid.fit(X_train,Y_train)
print('Train R^2 Score : %.3f'%ef_regressor_grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%ef_regressor_grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%ef_regressor_grid.best_score_)
print('Best Parameters : ',ef_regressor_grid.best_params_)
cross_val_results = pd.DataFrame(ef_regressor_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
rforest_regressor = ensemble.RandomForestRegressor(random_state=1)
rforest_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(rforest_regressor.__class__.__name__,
rforest_regressor.score(X_train, Y_train),rforest_regressor.score(X_test, Y_test)))
rforest_regressor = ensemble.RandomForestRegressor(random_state=1, **rf_regressor_grid.best_params_)
rforest_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(rforest_regressor.__class__.__name__,
rforest_regressor.score(X_train, Y_train),rforest_regressor.score(X_test, Y_test)))
extra_forest_regressor = ensemble.ExtraTreesRegressor(random_state=1)
extra_forest_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(extra_forest_regressor.__class__.__name__,
extra_forest_regressor.score(X_train, Y_train),extra_forest_regressor.score(X_test, Y_test)))
extra_forest_regressor = ensemble.ExtraTreesRegressor(random_state=1, **ef_regressor_grid.best_params_)
extra_forest_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(extra_forest_regressor.__class__.__name__,
extra_forest_regressor.score(X_train, Y_train),extra_forest_regressor.score(X_test, Y_test)))
dtree_regressor = tree.DecisionTreeRegressor(random_state=1)
dtree_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(dtree_regressor.__class__.__name__,
dtree_regressor.score(X_train, Y_train),dtree_regressor.score(X_test, Y_test)))
extra_tree_regressor = tree.ExtraTreeRegressor(random_state=1)
extra_tree_regressor.fit(X_train, Y_train)
print("%s : Train R^2 : %.2f, Test R^2 : %.2f"%(extra_tree_regressor.__class__.__name__,
extra_tree_regressor.score(X_train, Y_train),extra_tree_regressor.score(X_test, Y_test)))
We'll be explaining the usage of RandomForestClassifier by using the digits data set. We'll first train the model with default parameters and then do hyper-parameter tuning.
X_train, X_test, Y_train, Y_test = train_test_split(X_digits, Y_digits, train_size=0.80, test_size=0.20, random_state=123)
print('Train/Test Sets Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
from sklearn.ensemble import RandomForestClassifier
rforest_classifier = RandomForestClassifier(random_state=1)
rforest_classifier.fit(X_train, Y_train)
Almost all models in the Scikit-Learn API provide a predict() method which can be used to predict the target variable on the test set passed to it.
Y_preds = rforest_classifier.predict(X_test)
print(Y_preds[:15])
print(Y_test[:15])
print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean() )
print('Test Accuracy : %.3f'%rforest_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%rforest_classifier.score(X_train, Y_train))
RandomForestClassifier has the same parameters to tune as RandomForestRegressor.
%%time
n_samples = X_digits.shape[0]
n_features = X_digits.shape[1]
params = {'n_estimators': [20,50,100],
'max_depth': [None, 2, 5,],
'min_samples_split': [2, 0.5, n_samples//2, ],
'min_samples_leaf': [1, 0.5, n_samples//2, ],
'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5, n_features//2, ],
'bootstrap':[True, False]
}
rf_classifier_grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid=params, n_jobs=-1, cv=3, verbose=1)
rf_classifier_grid.fit(X_train,Y_train)
print('Train Accuracy : %.3f'%rf_classifier_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%rf_classifier_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%rf_classifier_grid.best_score_)
print('Best Parameters : ',rf_classifier_grid.best_params_)
cross_val_results = pd.DataFrame(rf_classifier_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
We'll be explaining the usage of ExtraTreesClassifier by using the digits data set. We'll first train the model with default parameters and then do hyper-parameter tuning. We'll also be comparing the performance of the tuned extra trees classification estimator with the random forest, decision tree, and extra tree estimators of scikit-learn.
from sklearn.ensemble import ExtraTreesClassifier
extra_forest_classifier = ensemble.ExtraTreesClassifier(random_state=1)
extra_forest_classifier.fit(X_train, Y_train)
Almost all models in the Scikit-Learn API provide a predict() method which can be used to predict the target variable on the test set passed to it.
Y_preds = extra_forest_classifier.predict(X_test)
print(Y_preds[:15])
print(Y_test[:15])
print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean())
print('Test Accuracy : %.3f'%extra_forest_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%extra_forest_classifier.score(X_train, Y_train))
ExtraTreesClassifier has the same parameters to tune as RandomForestRegressor/ExtraTreesRegressor.
%%time
n_samples = X_digits.shape[0]
n_features = X_digits.shape[1]
params = {'n_estimators': [20,50,100],
'max_depth': [None, 2, 5,],
'min_samples_split': [2, 0.5, n_samples//2, ],
'min_samples_leaf': [1, 0.5, n_samples//2, ],
'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5, n_features//2, ],
'bootstrap':[True, False]
}
ef_classifier_grid = GridSearchCV(ExtraTreesClassifier(random_state=1), param_grid=params, n_jobs=-1, cv=3, verbose=1)
ef_classifier_grid.fit(X_train,Y_train)
print('Train Accuracy : %.3f'%ef_classifier_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%ef_classifier_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%ef_classifier_grid.best_score_)
print('Best Parameters : ',ef_classifier_grid.best_params_)
cross_val_results = pd.DataFrame(ef_classifier_grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
rforest_classifier = ensemble.RandomForestClassifier(random_state=1)
rforest_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(rforest_classifier.__class__.__name__,
rforest_classifier.score(X_train, Y_train),rforest_classifier.score(X_test, Y_test)))
rforest_classifier = ensemble.RandomForestClassifier(random_state=1, **rf_classifier_grid.best_params_)
rforest_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(rforest_classifier.__class__.__name__,
rforest_classifier.score(X_train, Y_train),rforest_classifier.score(X_test, Y_test)))
extra_forest_classifier = ensemble.ExtraTreesClassifier(random_state=1)
extra_forest_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(extra_forest_classifier.__class__.__name__,
extra_forest_classifier.score(X_train, Y_train),extra_forest_classifier.score(X_test, Y_test)))
extra_forest_classifier = ensemble.ExtraTreesClassifier(random_state=1, **ef_classifier_grid.best_params_)
extra_forest_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(extra_forest_classifier.__class__.__name__,
extra_forest_classifier.score(X_train, Y_train),extra_forest_classifier.score(X_test, Y_test)))
dtree_classifier = tree.DecisionTreeClassifier(random_state=1)
dtree_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(dtree_classifier.__class__.__name__,
dtree_classifier.score(X_train, Y_train),dtree_classifier.score(X_test, Y_test)))
extra_tree_classifier = tree.ExtraTreeClassifier(random_state=1)
extra_tree_classifier.fit(X_train, Y_train)
print("%s : Train Accuracy : %.2f, Test Accuracy : %.2f"%(extra_tree_classifier.__class__.__name__,
extra_tree_classifier.score(X_train, Y_train),extra_tree_classifier.score(X_test, Y_test)))
This ends our small tutorial on the ensemble learning methods bagging and random forests using scikit-learn. Please let us know your views in the comments section.
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.