Scikit-Learn - Decision Trees

Introduction

Decision trees are a class of algorithms built on "if" and "else" conditions. A prediction is made by following these conditions from the root of the tree down to a leaf. The conditions are learned from the training data: how many there are, which features and thresholds they test, and which answer each path leads to all depend on the dataset. Below, we cover the decision tree implementations available in scikit-learn for classification and regression tasks.
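
To make the "if"/"else" idea concrete, here is a small hand-written tree. The function name classify_flower and both thresholds are illustrative assumptions chosen for this sketch; they are not values learned from data.

```python
# A hand-written "tree" for iris-like measurements (in cm).
# The thresholds 2.5 and 1.7 are made up for illustration only.
def classify_flower(petal_length, petal_width):
    if petal_length <= 2.5:          # first question (root node)
        return "setosa"
    else:
        if petal_width <= 1.7:       # second question (internal node)
            return "versicolor"
        else:
            return "virginica"       # leaf

print(classify_flower(1.4, 0.2))
print(classify_flower(4.5, 1.4))
print(classify_flower(5.8, 2.2))
```

A trained decision tree has the same shape, except that the algorithm picks the features, thresholds and depth from the training data.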

Characteristics of decision trees:

  • Fast to train and easy to understand & interpret.
  • Binary splitting of questions is the essence of decision tree models.
  • Requires little preprocessing of data.
  • Can work with variables of different types (continuous & discrete)
  • Invariant to feature scaling.
  • Models are called "nonparametric" because they do not assume a fixed functional form and the number of learned conditions is not fixed in advance. Hyper-parameters such as max_depth still need tuning.
  • Given more data, the model becomes more flexible.
  • The number of tree parameters (conditions) grows with the number of samples, covering as much of the data domain as possible.
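
As a rough illustration of why trees are invariant to feature scaling, the sketch below searches for the best threshold on a single feature and repeats the search after converting centimeters to millimeters. The helpers gini and best_split and the toy numbers are assumptions made up for this example; this is a simplified search, not scikit-learn's implementation.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, left_indices) for the split with the lowest weighted Gini."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = values[order[k - 1]], values[order[k]]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [labels[i] for i in order[:k]]
        right = [labels[i] for i in order[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, (lo + hi) / 2, frozenset(order[:k]))
    return best[1], best[2]

lengths_cm = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]   # toy feature values
labels = ["a", "a", "a", "b", "b", "b"]
lengths_mm = [v * 10 for v in lengths_cm]      # same feature, rescaled

t_cm, left_cm = best_split(lengths_cm, labels)
t_mm, left_mm = best_split(lengths_mm, labels)
print(t_cm, t_mm)           # the threshold scales with the feature
print(left_cm == left_mm)   # the samples sent to the left branch do not change
```

Only the ordering of the values matters for the split, so rescaling a feature changes the threshold but not which samples end up on each side.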

We'll start by importing the modules needed for this tutorial. We'll also need the pydotplus library installed, as it is used to plot the decision trees trained by scikit-learn.

In [1]:
## We need to install pydotplus for this tutorial.
!pip install pydotplus
Requirement already satisfied: pydotplus in ./anaconda3/lib/python3.7/site-packages (2.0.2)
Requirement already satisfied: pyparsing>=2.0.1 in ./anaconda3/lib/python3.7/site-packages (from pydotplus) (2.4.7)
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

import sys
import warnings

warnings.filterwarnings('ignore')
np.set_printoptions(precision=2)

print("Python Version : ",sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)
Python Version :  3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
Scikit-Learn Version :  0.21.2

DecisionTreeClassifier

In this section we use the classic Iris classification dataset provided by scikit-learn. It has 150 samples across 3 categories of flowers, with 50 samples per category (iris-setosa, iris-versicolor, iris-virginica). We'll use the DecisionTreeClassifier provided by scikit-learn for the classification task.

Loading Data

Below we load the Iris dataset, which ships with the sklearn package. It returns a Bunch object, which behaves much like a dictionary.

In [2]:
from sklearn import datasets

iris = datasets.load_iris()
X, Y = iris.data, iris.target

print('Dataset features names : '+str(iris.feature_names))
print('Dataset features size : '+str(iris.data.shape))
print('Dataset target names : '+str(iris.target_names))
print('Dataset target size : '+str(iris.target.shape))
Dataset features names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Dataset features size : (150, 4)
Dataset target names : ['setosa' 'versicolor' 'virginica']
Dataset target size : (150,)

Splitting Dataset into Train & Test sets

We'll split the dataset into two parts:

  • Training data, which is used to train the model.
  • Test data, against which the accuracy of the trained model is checked.

The train_test_split function of sklearn's model_selection module helps us split the data into two sets, with 75% for training and 25% for testing. We also pass a seed (random_state=123) to train_test_split so that we always get the same split and can reproduce results in the future.


NOTE

Please note that we are also using the stratify parameter, which prevents an unequal distribution of classes between the train and test sets. For each class, about 75% of the samples go to the train set and about 25% to the test set. This makes sure that no class dominates either set.
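
The idea behind stratification can be sketched in a few lines of plain Python. The function stratified_split and the toy labels below are assumptions made up for this illustration; it is a simplified sketch of the concept and not the algorithm train_test_split uses internally.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx), sampling test_fraction of each class."""
    rng = random.Random(seed)            # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx

labels = [0] * 40 + [1] * 40 + [2] * 40   # hypothetical balanced toy labels
train_idx, test_idx = stratified_split(labels, test_fraction=0.25, seed=123)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Every class contributes the same share to each set, and calling the function again with the same seed returns the same indices.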

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.75, test_size=0.25, stratify=Y, random_state=123)
print('Train/Test Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
Train/Test Sizes :  (112, 4) (38, 4) (112,) (38,)

Fitting Model To Train Data

In [4]:
from sklearn.tree import DecisionTreeClassifier

tree_classifier = DecisionTreeClassifier(random_state=1)
tree_classifier.fit(X_train, Y_train)
Out[4]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1, splitter='best')

Evaluating Trained Model On Test Data.

Almost all models in the Scikit-Learn API provide a predict() method, which predicts the target variable for the test set passed to it. We'll also use score(), which returns the accuracy of a classifier, to check model accuracy on the test data.

In [5]:
Y_preds = tree_classifier.predict(X_test)

print(Y_preds)
print(Y_test)

print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean() )
print('Test Accuracy : %.3f'%tree_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%tree_classifier.score(X_train, Y_train))
[2 0 1 2 0 0 1 2 1 0 1 0 2 2 1 2 0 0 0 0 0 0 1 2 0 2 2 2 2 1 1 2 1 1 2 1 2
 1]
[2 0 1 2 0 0 1 2 1 0 1 0 2 2 1 2 0 0 0 0 0 0 1 2 0 1 2 2 2 1 1 2 1 1 2 1 2
 1]
Test Accuracy : 0.974
Test Accuracy : 0.974
Training Accuracy : 1.000
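
To see where the 0.974 comes from, the sketch below recomputes the accuracy by hand from the two arrays printed above. The variable names are our own; the values are copied from the output.

```python
# Predictions and true labels copied from the output above (38 test samples).
y_preds = [2, 0, 1, 2, 0, 0, 1, 2, 1, 0, 1, 0, 2, 2, 1, 2, 0, 0, 0, 0,
           0, 0, 1, 2, 0, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1]
y_test = [2, 0, 1, 2, 0, 0, 1, 2, 1, 0, 1, 0, 2, 2, 1, 2, 0, 0, 0, 0,
          0, 0, 1, 2, 0, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1]

# Accuracy is the fraction of positions where prediction and label agree.
matches = sum(p == t for p, t in zip(y_preds, y_test))
accuracy = matches / len(y_test)
print('Matches : %d of %d' % (matches, len(y_test)))
print('Test Accuracy : %.3f' % accuracy)
```

A single versicolor sample predicted as virginica accounts for the one miss, giving 37 correct out of 38.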

A DecisionTreeClassifier instance provides the predict_proba() method, which returns the probability the model assigns to each class. Below we print the probabilities predicted for the first few test samples.

In [6]:
tree_classifier.predict_proba(X_test)[:10]
Out[6]:
array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])
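
The probabilities above are all 0 or 1 because, for a decision tree, the predicted probability of a class is the fraction of training samples of that class in the leaf a sample lands in, and a fully grown tree ends in pure leaves. The helper leaf_probabilities and the class counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
def leaf_probabilities(class_counts):
    """Class probabilities for a leaf, given the training-sample count per class."""
    total = sum(class_counts)
    return [count / total for count in class_counts]

print(leaf_probabilities([0, 0, 12]))   # pure leaf: one class gets probability 1
print(leaf_probabilities([0, 9, 3]))    # mixed leaf, e.g. in a depth-limited tree
```

Limiting the tree with settings such as max_depth or min_samples_leaf tends to leave mixed leaves, and then predict_proba returns fractional values.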

Finetuning Model By Doing Grid Search On Various Hyperparameters.

Below is a list of common hyper-parameters that need tuning to get the best fit for our data. We'll try various hyper-parameter settings across several cross-validation splits of the training data to find the setting that generalizes best, ideally with little difference between train and test accuracy.

  • criterion: It accepts string argument specifying which function to use to measure the quality of a split.
    • gini - Gini Impurity. This is the default value.
    • entropy - Information Gain.
  • max_depth - Maximum depth of the tree, i.e. how many "if-else" questions can be asked in sequence before deciding the target variable. Larger values tend to over-fit and smaller values tend to under-fit, so we need to find a good value. The default is None, which grows the tree until all leaves are pure or contain fewer than min_samples_split samples.
  • max_features - Number of features to consider when looking for the best split. It accepts an int (1 to n_features), a float in (0.0, 1.0] interpreted as a fraction of n_features, a string (sqrt, log2, auto) or None.
    • None - n_features are used as value if None is provided.
    • sqrt - sqrt(n_features) features are used for split.
    • auto - sqrt(n_features) features are used for split (for classifiers, in the scikit-learn version used here).
    • log2 - log2(n_features) features are used for split.
  • min_samples_split - Minimum number of samples required to split an internal node. It accepts an int (2 to n_samples) or a float in (0.0, 1.0]. A float is treated as a fraction: ceil(min_samples_split * n_samples) samples.
  • min_samples_leaf - Minimum number of samples required at a leaf node. It accepts an int (1 to n_samples) or a float in (0.0, 0.5]. A float is treated as a fraction: ceil(min_samples_leaf * n_samples) samples.
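
The two criterion options can be written out directly from their textbook definitions. The sketch below is a plain-Python illustration with toy labels; the function names gini and entropy are our own and this is not scikit-learn's internal code.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over class proportions."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

pure = ['setosa'] * 4
mixed = ['setosa'] * 2 + ['virginica'] * 2
print(gini(pure), entropy(pure))     # both are 0 for a pure node
print(gini(mixed), entropy(mixed))   # both peak for an even two-class mix
```

Both measures are 0 for a pure node and grow as the classes become more mixed; a split is good when it lowers the weighted impurity of the child nodes.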

GridSearchCV

GridSearchCV is a wrapper class provided by sklearn. It loops through all combinations of the parameters passed as the param_grid parameter, evaluates each combination with the number of cross-validation folds given by the cv parameter, and stores all results in the cv_results_ attribute. It also stores the estimator with the best mean cross-validation score in the best_estimator_ attribute and that score in the best_score_ attribute.
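
To get a feel for how many models such a search fits, the sketch below enumerates the grid used in this tutorial with itertools.product. It only counts combinations and does not train anything; the variable names candidates and cv are our own.

```python
from itertools import product

n_features, n_samples = 4, 150   # shape of the iris data printed earlier

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 1, 2, 3, 4, 5, 6, 7],
    'max_features': [None, 'sqrt', 'auto', 'log2', 0.3, 0.5, 0.7,
                     n_features // 2, n_features // 3],
    'min_samples_split': [2, 0.3, 0.5, n_samples // 2, n_samples // 3, n_samples // 5],
    'min_samples_leaf': [1, 0.3, 0.5, n_samples // 2, n_samples // 3, n_samples // 5],
}

# Every combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]
cv = 3
print(len(candidates))        # 2 * 8 * 9 * 6 * 6 candidates
print(len(candidates) * cv)   # one fit per candidate per fold
print(candidates[0])
```

The counts, 5184 candidates and 15552 fits, are the same numbers GridSearchCV reports when verbose output is enabled.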


NOTE

The n_jobs parameter is provided by many estimators. It accepts the number of cores to use for parallelization; a value of -1 uses all cores. It relies on the joblib parallel processing library to run tasks in parallel in the background.

Below we try various values for the above-mentioned hyper-parameters to find the best estimator for our dataset using 3-fold cross-validation.

In [7]:
from sklearn.model_selection import GridSearchCV

n_features = X.shape[1]
n_samples = X.shape[0]

grid = GridSearchCV(DecisionTreeClassifier(random_state=1), cv=3, n_jobs=-1, verbose=5,
                    param_grid ={
                    'criterion': ['gini', 'entropy'],
                    'max_depth': [None,1,2,3,4,5,6,7],
                    'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
                    'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
                    'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
                    )

grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
Fitting 3 folds for each of 5184 candidates, totalling 15552 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 3681 tasks      | elapsed:    5.9s
Train Accuracy : 1.000
Test Accuracy : 0.974
Best Score Through Grid Search : 0.964
Best Parameters :  {'criterion': 'gini', 'max_depth': None, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
[Parallel(n_jobs=-1)]: Done 15552 out of 15552 | elapsed:    8.8s finished
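
The grid above mixes integer and fractional values for min_samples_split and min_samples_leaf. As a rough sketch of how a fraction maps to a sample count, the helper resolve_min_samples below (a hypothetical name, not a scikit-learn function) applies the ceil rule to the 112 training rows. Inside cross-validation each fit sees only part of those rows, so the actual counts are smaller.

```python
from math import ceil

def resolve_min_samples(value, n_samples):
    """Turn an int count or a float fraction into an absolute sample count."""
    if isinstance(value, float):
        return ceil(value * n_samples)   # fraction of the available samples
    return value                         # ints are used as-is

n_train = 112  # training rows from the split above
for value in [2, 0.3, 0.5, 75]:
    print(value, '->', resolve_min_samples(value, n_train))
```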

Printing First Few Cross-Validation Results

GridSearchCV keeps the results for every parameter combination tried on every cross-validation split. We can access them as a dictionary through the cv_results_ attribute. Below we convert it to a pandas dataframe for easier viewing.

In [8]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 5184
Out[8]:
mean_fit_time std_fit_time mean_score_time std_score_time param_criterion param_max_depth param_max_features param_min_samples_leaf param_min_samples_split params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.001922 0.000061 0.000724 0.000022 gini None None 1 2 {'criterion': 'gini', 'max_depth': None, 'max_... 0.923077 0.972973 1.000000 0.964286 0.032035 1
1 0.001439 0.000390 0.000535 0.000123 gini None None 1 0.3 {'criterion': 'gini', 'max_depth': None, 'max_... 0.923077 0.972973 0.972222 0.955357 0.023596 9
2 0.001997 0.000706 0.000468 0.000039 gini None None 1 0.5 {'criterion': 'gini', 'max_depth': None, 'max_... 0.923077 0.918919 0.972222 0.937500 0.023959 666
3 0.000884 0.000095 0.000354 0.000020 gini None None 1 75 {'criterion': 'gini', 'max_depth': None, 'max_... 0.333333 0.675676 0.666667 0.553571 0.161018 2161
4 0.001103 0.000361 0.000462 0.000139 gini None None 1 50 {'criterion': 'gini', 'max_depth': None, 'max_... 0.666667 0.918919 0.972222 0.848214 0.134430 979
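
The mean_test_score in the first row (0.964286) is slightly lower than the plain average of the three split scores. The numbers are consistent with a mean weighted by fold size, which is how older scikit-learn releases averaged by default (the iid option). The fold sizes below are an assumption inferred from the scores (36/39, 36/37, 36/36); they are not printed in the output.

```python
split_scores = [0.923077, 0.972973, 1.000000]   # row 0 of the table above
fold_sizes = [39, 37, 36]                       # assumed validation-fold sizes (sum = 112)

plain_mean = sum(split_scores) / len(split_scores)
weighted_mean = sum(s * n for s, n in zip(split_scores, fold_sizes)) / sum(fold_sizes)

print('Plain mean    : %.6f' % plain_mean)
print('Weighted mean : %.6f' % weighted_mean)
```

With these assumed sizes the weighted mean reproduces the reported value, so on a newer release the same search may report a slightly different mean_test_score.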

Plotting Feature Importance

We can access the importance of each feature in the decision tree through the feature_importances_ attribute. We also plot it below for better understanding.

In [9]:
print("Feature Importance : %s"%str(grid.best_estimator_.feature_importances_))
Feature Importance : [0.   0.02 0.42 0.56]
In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(10,4))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(4), iris.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();
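
When a plot is not convenient, the importances can simply be paired with the feature names and sorted. The sketch below uses the rounded values printed above together with the feature names from the dataset; the variable name ranked is our own.

```python
feature_names = ['sepal length (cm)', 'sepal width (cm)',
                 'petal length (cm)', 'petal width (cm)']
importances = [0.0, 0.02, 0.42, 0.56]   # rounded values copied from the output above

# Sort features from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print('%-18s %.2f' % (name, score))
print('Sum : %.2f' % sum(importances))
```

The importances sum to 1, and here the two petal measurements account for almost all of it.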

Visualizing Decision Tree Using GraphViz & PyDotPlus

We can visualize the decision tree using graphviz. Scikit-learn provides the export_graphviz() function, which converts a trained tree to graphviz format. We can then generate a graph from it with the graph_from_dot_data method of the pydotplus library.

Reading the tree from the root, we answer True or False to each question about the flower's features and follow the matching branch until we reach a leaf that gives the flower type.

In [ ]:
from io import StringIO  ## sklearn.externals.six is deprecated and absent from newer scikit-learn releases.
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus

dot_data = StringIO()

export_graphviz(grid.best_estimator_, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                class_names=iris.target_names,
                feature_names=iris.feature_names)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

Image(graph.create_png())

ExtraTreeClassifier

ExtraTreeClassifier is commonly referred to as an extremely randomized decision tree. When splitting the samples of a node into 2 groups, a random split is drawn for each of a randomly selected subset of features, and the best of these splits is chosen.
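
The split rule can be sketched in plain Python as follows. The function random_best_split, the toy rows and the use of Gini impurity as the quality measure are assumptions made for this illustration; it shows the idea of "random thresholds, keep the best" and is not scikit-learn's implementation.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def random_best_split(rows, labels, n_candidate_features, rng):
    """Draw one random threshold per sampled feature; keep the lowest-impurity split."""
    n_features = len(rows[0])
    best = None
    for feature in rng.sample(range(n_features), n_candidate_features):
        column = [row[feature] for row in rows]
        threshold = rng.uniform(min(column), max(column))   # random, not optimized
        left = [y for x, y in zip(column, labels) if x <= threshold]
        right = [y for x, y in zip(column, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, feature, threshold)
    return best

# Toy data: two features per row, three classes.
rows = [[5.1, 1.4], [4.9, 1.3], [6.4, 4.5], [6.9, 4.9], [6.3, 6.0], [7.1, 5.9]]
labels = [0, 0, 1, 1, 2, 2]
score, feature, threshold = random_best_split(rows, labels, 2, random.Random(1))
print(feature, round(threshold, 2), round(score, 3))
```

A regular decision tree would instead search every candidate threshold for each feature, which is slower but greedier; drawing thresholds at random trades some accuracy per split for extra randomness.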

Fitting Model To Train Data

In [12]:
from sklearn.tree import ExtraTreeClassifier

extra_tree_classifier = ExtraTreeClassifier(random_state=1)
extra_tree_classifier.fit(X_train, Y_train)
Out[12]:
ExtraTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                    max_features='auto', max_leaf_nodes=None,
                    min_impurity_decrease=0.0, min_impurity_split=None,
                    min_samples_leaf=1, min_samples_split=2,
                    min_weight_fraction_leaf=0.0, random_state=1,
                    splitter='random')

Evaluating Trained Model On Test Data.

Almost all models in the Scikit-Learn API provide a predict() method, which predicts the target variable for the test set passed to it.

In [13]:
Y_preds = extra_tree_classifier.predict(X_test)

print(Y_preds)
print(Y_test)

print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean() )
print('Test Accuracy : %.3f'%extra_tree_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%extra_tree_classifier.score(X_train, Y_train))
[2 0 1 2 0 0 1 2 1 0 1 0 2 2 1 2 0 0 0 0 0 0 1 2 0 1 2 2 2 1 1 2 1 1 2 1 2
 1]
[2 0 1 2 0 0 1 2 1 0 1 0 2 2 1 2 0 0 0 0 0 0 1 2 0 1 2 2 2 1 1 2 1 1 2 1 2
 1]
Test Accuracy : 1.000
Test Accuracy : 1.000
Training Accuracy : 1.000
In [14]:
extra_tree_classifier.predict_proba(X_test)[:10]
Out[14]:
array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])

Finetuning Model By Doing Grid Search On Various Hyperparameters.

ExtraTreeClassifier has the same hyperparameters as DecisionTreeClassifier.

In [15]:
n_features = X.shape[1]
n_samples = X.shape[0]

grid = GridSearchCV(ExtraTreeClassifier(random_state=1), cv=3, n_jobs=-1, verbose=5,
                    param_grid ={
                    'criterion': ['gini', 'entropy'],
                    'max_depth': [None,1,2,3,4,5,6,7],
                    'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
                    'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
                    'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
                    )

grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
Fitting 3 folds for each of 5184 candidates, totalling 15552 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 11224 tasks      | elapsed:    3.1s
Train Accuracy : 0.982
Test Accuracy : 0.974
Best Score Through Grid Search : 0.946
Best Parameters :  {'criterion': 'gini', 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2}
[Parallel(n_jobs=-1)]: Done 15552 out of 15552 | elapsed:    4.1s finished

Printing First Few Cross Validation Results

In [16]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 5184
Out[16]:
mean_fit_time std_fit_time mean_score_time std_score_time param_criterion param_max_depth param_max_features param_min_samples_leaf param_min_samples_split params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.001576 0.000044 0.000667 0.000030 gini None None 1 2 {'criterion': 'gini', 'max_depth': None, 'max_... 0.948718 0.864865 0.972222 0.928571 0.045766 173
1 0.001509 0.000022 0.000781 0.000148 gini None None 1 0.3 {'criterion': 'gini', 'max_depth': None, 'max_... 0.948718 0.891892 0.972222 0.937500 0.033444 13
2 0.001464 0.000040 0.000617 0.000024 gini None None 1 0.5 {'criterion': 'gini', 'max_depth': None, 'max_... 0.871795 0.864865 0.972222 0.901786 0.048562 269
3 0.001186 0.000010 0.000561 0.000011 gini None None 1 75 {'criterion': 'gini', 'max_depth': None, 'max_... 0.333333 0.675676 0.666667 0.553571 0.161018 1237
4 0.001218 0.000011 0.000639 0.000121 gini None None 1 50 {'criterion': 'gini', 'max_depth': None, 'max_... 0.666667 0.756757 0.750000 0.723214 0.041422 545

Plotting Feature Importance

In [17]:
print("Feature Importance : %s"%str(grid.best_estimator_.feature_importances_))
Feature Importance : [0.07 0.   0.66 0.27]
In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(10,4))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(4), iris.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();

NOTE

Please note that even though decision trees provide a nonparametric way to predict the target, they can over-fit or under-fit the data depending on how they are configured. Decision trees are therefore a poor fit for datasets with many features and few samples, where there is too little data to set the tree rules/conditions properly.

DecisionTreeRegressor

We'll now load the Boston dataset provided by sklearn and try DecisionTreeRegressor on it with different tree depths. We'll also compare performance on the train and test sets later.

Loading Data

Below we load the Boston dataset, which ships with the sklearn package in the version used here. It returns a Bunch object, which behaves much like a dictionary.

In [19]:
boston = datasets.load_boston()
X, Y  = boston.data, boston.target

print('Dataset features names : '+str(boston.feature_names))
print('Dataset features size : '+str(boston.data.shape))
print('Dataset target size : '+str(boston.target.shape))
Dataset features names : ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
Dataset features size : (506, 13)
Dataset target size : (506,)

Splitting Dataset into Train & Test sets

Below we split the Boston dataset into a train set (75%) and a test set (25%). We also pass a seed (random_state=1) so that we always get the same split and can reproduce results in the future.

In [20]:
X_train, X_test,Y_train, Y_test = train_test_split(X, Y, train_size=0.75, test_size=0.25, random_state=1)
print('Train/Test Set Sizes : ', X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)
Train/Test Set Sizes :  (379, 13) (379,) (127, 13) (127,)

Fitting Model To Train Data

In [21]:
from sklearn.tree import DecisionTreeRegressor

tree_regressor = DecisionTreeRegressor(random_state=1)
tree_regressor.fit(X_train, Y_train)
Out[21]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1, splitter='best')

Evaluating Trained Model On Test Data.

Almost all models in the Scikit-Learn API provide a predict() method, which predicts the target variable for the test set passed to it. For regressors, score() returns the R^2 coefficient of determination.

In [22]:
Y_preds = tree_regressor.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Training Coefficient of R^2 : %.3f'%tree_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%tree_regressor.score(X_test, Y_test))
[36.1 27.5 21.7 18.6 21.7 21.7 30.8 21.7 17.8 24.6]
[28.2 23.9 16.6 22.  20.8 23.  27.9 14.5 21.5 22.6]
Training Coefficient of R^2 : 1.000
Test Coefficient of R^2 : 0.699
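
The R^2 score reported above compares the model's squared error with that of always predicting the mean of the targets. The sketch below writes out the standard formula in plain Python; the function name r_squared is our own, and the two lists are the first ten predictions and targets printed above, so the last number describes only those ten rows and not the full test set.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [28.2, 23.9, 16.6, 22.0, 20.8, 23.0, 27.9, 14.5, 21.5, 22.6]
y_pred = [36.1, 27.5, 21.7, 18.6, 21.7, 21.7, 30.8, 21.7, 17.8, 24.6]
mean_only = [sum(y_true) / len(y_true)] * len(y_true)

print('Perfect predictions : %.3f' % r_squared(y_true, y_true))
print('Always the mean     : %.3f' % r_squared(y_true, mean_only))
print('First ten test rows : %.3f' % r_squared(y_true, y_pred))
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and a negative score means worse than predicting the mean.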

Finetuning Model By Doing Grid Search On Various Hyperparameters.

DecisionTreeRegressor has largely the same hyperparameters as DecisionTreeClassifier; the criterion options differ, so we leave criterion out of the grid. Below we try various values for the above-mentioned hyperparameters to find the best estimator for our dataset using 3-fold cross-validation.

In [24]:
n_features = X.shape[1]
n_samples = X.shape[0]

grid = GridSearchCV(DecisionTreeRegressor(random_state=1), cv=3, n_jobs=-1, verbose=5,
                    param_grid ={
                    'max_depth': [None,1,2,3,4,5,6,7],
                    'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
                    'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
                    'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
                    )

grid.fit(X_train, Y_train)
print('Train R^2 Score : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
Fitting 3 folds for each of 2592 candidates, totalling 7776 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 152 tasks      | elapsed:    0.3s
Train R^2 Score : 0.909
Test R^2 Score : 0.768
Best R^2 Score Through Grid Search : 0.780
Best Parameters :  {'max_depth': 5, 'max_features': 0.5, 'min_samples_leaf': 1, 'min_samples_split': 2}
[Parallel(n_jobs=-1)]: Done 7776 out of 7776 | elapsed:    2.9s finished

Printing First Few Cross Validation Results

In [25]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 2592
Out[25]:
mean_fit_time std_fit_time mean_score_time std_score_time param_max_depth param_max_features param_min_samples_leaf param_min_samples_split params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.009543 0.002699 0.002643 0.000548 None None 1 2 {'max_depth': None, 'max_features': None, 'min... 0.624971 0.514967 0.765272 0.635043 0.102302 39
1 0.003043 0.000175 0.001470 0.000079 None None 1 0.3 {'max_depth': None, 'max_features': None, 'min... 0.706766 0.577495 0.786565 0.690319 0.086036 16
2 0.002141 0.000282 0.001283 0.000041 None None 1 0.5 {'max_depth': None, 'max_features': None, 'min... 0.428475 0.659663 0.457726 0.515059 0.102745 123
3 0.001505 0.000201 0.001882 0.000942 None None 1 253 {'max_depth': None, 'max_features': None, 'min... -0.008807 0.343799 0.425772 0.252896 0.188767 833
4 0.002236 0.000361 0.001326 0.000029 None None 1 168 {'max_depth': None, 'max_features': None, 'min... 0.402911 0.576992 0.425772 0.468385 0.077212 242

Plotting Feature Importance

In [26]:
print("Feature Importance : %s"%str(grid.best_estimator_.feature_importances_))
Feature Importance : [0.01 0.   0.   0.   0.05 0.56 0.   0.06 0.04 0.   0.02 0.   0.25]
In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(12,8))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(13), boston.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();

Visualizing Decision Tree Using GraphViz & PyDotPlus

In [ ]:
dot_data = StringIO()
export_graphviz(grid.best_estimator_, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                feature_names=boston.feature_names,)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

ExtraTreeRegressor

ExtraTreeRegressor, like ExtraTreeClassifier, is an extremely randomized decision tree, here for regression problems. We'll follow the same process as in the previous examples to explain its usage.

Fitting Model To Train Data

In [29]:
from sklearn.tree import ExtraTreeRegressor

extra_tree_regressor = ExtraTreeRegressor(random_state=1)
extra_tree_regressor.fit(X_train, Y_train)
Out[29]:
ExtraTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
                   max_leaf_nodes=None, min_impurity_decrease=0.0,
                   min_impurity_split=None, min_samples_leaf=1,
                   min_samples_split=2, min_weight_fraction_leaf=0.0,
                   random_state=1, splitter='random')

Evaluating Trained Model On Test Data.

Almost all models in the Scikit-Learn API provide a predict() method, which predicts the target variable for the test set passed to it.

In [30]:
Y_preds = extra_tree_regressor.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Training Coefficient of R^2 : %.3f'%extra_tree_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%extra_tree_regressor.score(X_test, Y_test))
[25.2 23.6 16.8 20.9 18.4 23.  22.8 19.6 18.8 22. ]
[28.2 23.9 16.6 22.  20.8 23.  27.9 14.5 21.5 22.6]
Training Coefficient of R^2 : 1.000
Test Coefficient of R^2 : 0.752

Finetuning Model By Doing Grid Search On Various Hyperparameters.

ExtraTreeRegressor has largely the same hyperparameters as ExtraTreeClassifier. Below we try various values for the above-mentioned hyperparameters to find the best estimator for our dataset using 3-fold cross-validation.

In [31]:
n_features = X.shape[1]
n_samples = X.shape[0]

grid = GridSearchCV(ExtraTreeRegressor(random_state=1), cv=3, n_jobs=-1, verbose=5,
                    param_grid ={
                    'max_depth': [None,1,2,3,4,5,6,7],
                    'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
                    'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
                    'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
                    )

grid.fit(X_train, Y_train)
print('Train R^2 Score : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
Fitting 3 folds for each of 2592 candidates, totalling 7776 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  60 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 5880 tasks      | elapsed:    2.1s
Train R^2 Score : 0.907
Test R^2 Score : 0.780
Best R^2 Score Through Grid Search : 0.707
Best Parameters :  {'max_depth': 7, 'max_features': 0.7, 'min_samples_leaf': 1, 'min_samples_split': 2}
[Parallel(n_jobs=-1)]: Done 7776 out of 7776 | elapsed:    2.8s finished

Printing First Few Cross Validation Results

In [32]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 2592
Out[32]:
mean_fit_time std_fit_time mean_score_time std_score_time param_max_depth param_max_features param_min_samples_leaf param_min_samples_split params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.003376 0.001150 0.001233 0.000013 None None 1 2 {'max_depth': None, 'max_features': None, 'min... 0.663471 0.554314 0.644248 0.620790 0.047566 13
1 0.001512 0.000076 0.001171 0.000010 None None 1 0.3 {'max_depth': None, 'max_features': None, 'min... 0.503981 0.649523 0.534968 0.562669 0.062587 42
2 0.001167 0.000215 0.001085 0.000347 None None 1 0.5 {'max_depth': None, 'max_features': None, 'min... 0.455598 0.625403 0.435738 0.505448 0.085041 98
3 0.000550 0.000021 0.000526 0.000014 None None 1 253 {'max_depth': None, 'max_features': None, 'min... -0.008807 0.429324 0.388414 0.268909 0.197857 599
4 0.000525 0.000048 0.000460 0.000016 None None 1 168 {'max_depth': None, 'max_features': None, 'min... 0.455598 0.552272 0.435738 0.481135 0.050853 144

Plotting Feature Importance

In [33]:
print("Feature Importance : %s"%str(grid.best_estimator_.feature_importances_))
Feature Importance : [5.70e-02 2.24e-04 1.62e-02 1.03e-02 5.83e-02 5.92e-02 1.37e-02 2.91e-03
 1.30e-02 4.84e-02 6.10e-02 2.57e-02 6.34e-01]
In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(12,8))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(13), boston.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();

A single tree generally over-fits the data, so in practice it's a good idea to combine several decision trees to predict results. The two most common ways to combine multiple decision trees are random forests and gradient boosted trees.
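
As a minimal sketch of what "combining" trees can mean for classification, the function below takes the class predicted by each of several trees and returns the majority vote per sample. The function majority_vote and the three prediction lists are hypothetical; real ensemble libraries may combine trees differently, for example by averaging predicted probabilities.

```python
from collections import Counter

def majority_vote(per_tree_predictions):
    """Combine class predictions from several trees, one list of labels per tree."""
    combined = []
    for sample_votes in zip(*per_tree_predictions):
        # Pick the label that the most trees agree on for this sample.
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from three trees for the same four samples.
tree_a = [0, 1, 2, 1]
tree_b = [0, 1, 1, 1]
tree_c = [0, 2, 2, 1]
print(majority_vote([tree_a, tree_b, tree_c]))
```

An individual tree's mistake is outvoted as long as most of the other trees get that sample right.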

Sunny Solanki