Updated On : May-29,2020 Time Investment : ~30 mins

Scikit-Learn - Decision Trees¶

Table of Contents¶

Introduction
DecisionTreeClassifier
ExtraTreeClassifier
DecisionTreeRegressor
ExtraTreeRegressor
References

Introduction ¶

Decision trees are a class of algorithms built from "if-else" conditions. Each condition tests a feature value, and following the chain of answers from the root of the tree to a leaf yields a prediction. The algorithm learns these conditions from the training data, so the number of conditions, the features they test, and the thresholds they use differ from dataset to dataset. This tutorial covers the decision tree implementations available in scikit-learn for classification and regression tasks.

Characteristics of decision trees:

Fast to train and easy to understand & interpret.
Binary splitting of questions is the essence of decision tree models.
Requires little preprocessing of data.
Can work with variables of different types (continuous & discrete)
Invariant to feature scaling.
Models are called "nonparametric" because the number of parameters is not fixed in advance. They still have hyper-parameters, such as max_depth, that can be tuned.
The model becomes more flexible as it is given more data.
The number of tree parameters (conditions) grows with the number of samples, covering as much of the data's domain as possible.
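The last point can be checked with a small experiment. The sketch below is not part of the original tutorial: it fits unrestricted regression trees on growing subsets of a synthetic noisy sine curve (the data, the seed, and the subset sizes are assumptions chosen only for illustration) and prints how many leaves each tree ends up with.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)  # fixed seed so the run is repeatable
X_all = rng.uniform(0, 10, size=(400, 1))
y_all = np.sin(X_all[:, 0]) + rng.normal(scale=0.3, size=400)

leaf_counts = []
for n in (25, 100, 400):
    # No depth limit: the tree keeps splitting until every leaf is pure.
    tree = DecisionTreeRegressor(random_state=0).fit(X_all[:n], y_all[:n])
    leaf_counts.append(tree.get_n_leaves())
    print("samples: %3d  leaves: %3d  depth: %2d" % (n, tree.get_n_leaves(), tree.get_depth()))
```

The leaf count grows with the number of samples because nothing caps the size of the tree, which is what "nonparametric" refers to here.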

We'll start by importing the modules needed for this tutorial. The pydotplus library must be installed, as it is used to plot the decision trees trained by scikit-learn.

## We need to install pydotplus for this tutorial.
!pip install pydotplus

Requirement already satisfied: pydotplus in ./anaconda3/lib/python3.7/site-packages (2.0.2)
Requirement already satisfied: pyparsing>=2.0.1 in ./anaconda3/lib/python3.7/site-packages (from pydotplus) (2.4.7)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

import sys
import warnings

warnings.filterwarnings('ignore')
np.set_printoptions(precision=2)

print("Python Version : ",sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)

Python Version :  3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
Scikit-Learn Version :  0.21.2

DecisionTreeClassifier ¶

Below we load the classic IRIS classification dataset provided by scikit-learn, which has 150 samples from 3 categories of flowers, with 50 samples per category (iris-setosa, iris-virginica, iris-versicolor). We'll use the DecisionTreeClassifier provided by scikit-learn for the classification task.

Loading Data¶

Below we load the IRIS dataset, which ships with the sklearn package. load_iris() returns a Bunch object, which behaves much like a dictionary.

from sklearn import datasets

iris = datasets.load_iris()
X, Y = iris.data, iris.target

print('Dataset features names : '+str(iris.feature_names))
print('Dataset features size : '+str(iris.data.shape))
print('Dataset target names : '+str(iris.target_names))
print('Dataset target size : '+str(iris.target.shape))

Dataset features names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Dataset features size : (150, 4)
Dataset target names : ['setosa' 'versicolor' 'virginica']
Dataset target size : (150,)
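Since a Bunch behaves much like a dictionary, its fields can be read either by key or by attribute. The sketch below shows both, and also wraps the arrays in a pandas DataFrame for easier inspection; the DataFrame step and the names same_array and iris_df are our additions, not part of the original tutorial.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# A Bunch supports both dictionary-style and attribute-style access.
print(sorted(iris.keys()))
same_array = iris["data"] is iris.data

# Wrapping the arrays in a DataFrame makes them easier to inspect.
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["target"] = iris.target_names[iris.target]
print(iris_df.head())
```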

Splitting Dataset into Train & Test sets¶

We'll split the dataset into two parts:

Training data, which is used to train the model.
Test data, against which the accuracy of the trained model is checked.

The train_test_split function of sklearn's model_selection module splits the data into two sets, with 75% for training and 25% for testing. We also pass a seed (random_state=123) to train_test_split so that we always get the same split and can reproduce the results later.

NOTE

Please note that we also use the stratify parameter, which keeps the class proportions the same in the train and test sets. Each class contributes about 75% of its samples to the train set and 25% to the test set, so no class dominates either set.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.75, test_size=0.25, stratify=Y, random_state=123)
print('Train/Test Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

Train/Test Sizes :  (112, 4) (38, 4) (112,) (38,)
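To see what stratify does, we can count the samples of each class in the train set with and without it. This is a small check we added for illustration; it reuses the split sizes and seed from the call above, and the variable names are our own.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, Y = datasets.load_iris(return_X_y=True)

# Same split sizes and seed as above, with and without stratification.
_, _, Y_train_s, Y_test_s = train_test_split(X, Y, train_size=0.75, test_size=0.25, stratify=Y, random_state=123)
_, _, Y_train_p, Y_test_p = train_test_split(X, Y, train_size=0.75, test_size=0.25, random_state=123)

stratified_counts = np.bincount(Y_train_s, minlength=3)
plain_counts = np.bincount(Y_train_p, minlength=3)
print("Train class counts with stratify    :", stratified_counts)
print("Train class counts without stratify :", plain_counts)
```

With stratify the three class counts differ by at most one sample, whereas a plain random split is free to drift further from equal proportions.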

Fitting Model To Train Data¶

from sklearn.tree import DecisionTreeClassifier

tree_classifier = DecisionTreeClassifier(random_state=1)
tree_classifier.fit(X_train, Y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1, splitter='best')

Evaluating Trained Model On Test Data.¶

Almost all models in the scikit-learn API provide a predict() method, which predicts the target variable for the test set passed to it. We'll also use score(), which returns the accuracy of a classification model, to check accuracy on the test data.

Y_preds = tree_classifier.predict(X_test)

print(Y_preds)
print(Y_test)

print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean() )
print('Test Accuracy : %.3f'%tree_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%tree_classifier.score(X_train, Y_train))

[2 0 1 2 0 0 1 2 1 0 1 0 2 2 1 2 0 0 0 0 0 0 1 2 0 2 2 2 2 1 1 2 1 1 2 1 2
 1]
[2 0 1 2 0 0 1 2 1 0 1 0 2 2 1 2 0 0 0 0 0 0 1 2 0 1 2 2 2 1 1 2 1 1 2 1 2
 1]
Test Accuracy : 0.974
Test Accuracy : 0.974
Training Accuracy : 1.000

The DecisionTreeClassifier instance provides a predict_proba() method, which returns the probability the model assigns to each class. Below we print the probabilities predicted for the first few test samples.

tree_classifier.predict_proba(X_test)[:10]

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])
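The probabilities above are all 0 or 1 because a fully grown tree ends in pure leaves. For a decision tree, the predicted probability of a class is the fraction of training samples of that class in the leaf the sample falls into. The sketch below, which is our addition, limits the depth to 2 (an arbitrary illustrative value) so that some leaves may stay impure, and checks that predict() agrees with the most probable class.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, Y = datasets.load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.75, test_size=0.25, stratify=Y, random_state=123)

# max_depth=2 is an illustrative choice that can leave some leaves impure.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X_train, Y_train)
probs = shallow_tree.predict_proba(X_test)

# Each distinct row corresponds to one leaf reached by the test samples.
print(np.unique(probs, axis=0))

# predict() returns the class with the highest probability.
preds_from_probs = shallow_tree.classes_[probs.argmax(axis=1)]
print((preds_from_probs == shallow_tree.predict(X_test)).all())
```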

Finetuning Model By Doing Grid Search On Various Hyperparameters.¶

Below is a list of common hyper-parameters that need tuning to get the best fit for our data. We'll try various hyper-parameter settings on several cross-validation splits of the training data to find the setting whose accuracy is nearly the same on the train and test datasets.

criterion - It accepts a string specifying which function to use to measure the quality of a split.
- gini - Gini impurity. This is the default value.
- entropy - Entropy, which selects splits by information gain.
max_depth - Maximum depth of the tree, i.e. the longest chain of "if-else" questions asked before deciding the target variable. Large values of max_depth tend to over-fit and small values tend to under-fit, so we need to find a good value in between. The default is None, which grows the tree until all leaves are pure or contain fewer than min_samples_split samples.
max_features - Number of features to consider when looking for the best split. It accepts int (1 to n_features), float (0.0-1.0], string (sqrt, log2, auto), or None as value.
- None - All n_features features are used.
- sqrt - sqrt(n_features) features are used for each split.
- auto - sqrt(n_features) features are used for classifiers, while regressors use all n_features (in the scikit-learn version used here).
- log2 - log2(n_features) features are used for each split.
min_samples_split - Minimum number of samples required to split an internal node. It accepts int (2 or more) or float (0.0-1.0] values. A float is treated as a fraction, so ceil(min_samples_split * n_samples) samples are required.
min_samples_leaf - Minimum number of samples required at a leaf node. It accepts int (1 or more) or float (0.0-0.5] values. A float is treated as a fraction, so ceil(min_samples_leaf * n_samples) samples are required.
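To make the two criterion options concrete, here is a minimal NumPy sketch of Gini impurity and entropy computed from the class counts at a node, along with the impurity decrease of one candidate split. The helper names (gini, entropy, impurity_decrease) and the example counts are ours and chosen for illustration; scikit-learn computes these quantities internally.

```python
import numpy as np

def gini(class_counts):
    """Gini impurity: 1 - sum(p_k ** 2)."""
    p = np.asarray(class_counts, dtype=float) / np.sum(class_counts)
    return 1.0 - np.sum(p ** 2)

def entropy(class_counts):
    """Entropy in bits: sum(p_k * log2(1 / p_k)), skipping empty classes."""
    p = np.asarray(class_counts, dtype=float) / np.sum(class_counts)
    p = p[p > 0]
    return np.sum(p * np.log2(1.0 / p))

def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the sample-weighted impurity of the two children."""
    n_left, n_right = np.sum(left), np.sum(right)
    weighted = (n_left * impurity(left) + n_right * impurity(right)) / (n_left + n_right)
    return impurity(parent) - weighted

root = [50, 50, 50]   # a node holding three equally frequent classes
left = [50, 0, 0]     # a pure child
right = [0, 50, 50]   # a child that still mixes two classes

gini_gain = impurity_decrease(root, left, right, gini)
entropy_gain = impurity_decrease(root, left, right, entropy)
print("gini(root) = %.3f, entropy(root) = %.3f" % (gini(root), entropy(root)))
print("gini decrease = %.3f, information gain = %.3f" % (gini_gain, entropy_gain))
```

At each node, the tree prefers the candidate split with the largest decrease under the chosen criterion.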

GridSearchCV¶

It's a wrapper class provided by sklearn that loops through all combinations of the parameters given in the param_grid parameter, evaluates each combination with the number of cross-validation folds given in the cv parameter, and stores all results in the cv_results_ attribute. It also stores the model with the best mean cross-validation score, refit on the whole training set, in the best_estimator_ attribute and the best score in the best_score_ attribute.

NOTE

The n_jobs parameter is provided by many estimators. It accepts the number of cores to use for parallelization, and a value of -1 uses all cores. It relies on the joblib parallel-processing library to run tasks in parallel in the background.
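Before running a large search, it can help to see GridSearchCV on a tiny grid. The sketch below is our addition: the grid values are arbitrary, and ParameterGrid is used only to count the candidates in advance. It also shows that best_score_ is simply the highest mean cross-validation score among the candidates.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, Y = datasets.load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.75, test_size=0.25, stratify=Y, random_state=123)

# A deliberately small, illustrative grid: 2 * 3 * 2 = 12 candidates.
small_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, None],
    "min_samples_leaf": [1, 5],
}
n_candidates = len(ParameterGrid(small_grid))

search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid=small_grid, cv=3)
search.fit(X_train, Y_train)

print("Candidates tried :", n_candidates)
print("Best parameters  :", search.best_params_)
print("Best CV accuracy : %.3f" % search.best_score_)
print("Test accuracy    : %.3f" % search.best_estimator_.score(X_test, Y_test))
```

The number of fits is the number of candidates times the number of folds, so the grid size is worth checking before adding more values.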

Below we try various values for the above-mentioned hyper-parameters to find the best estimator for our dataset using 3-fold cross-validation.

from sklearn.model_selection import GridSearchCV

n_features = X.shape[1]
n_samples = X.shape[0]

grid = GridSearchCV(DecisionTreeClassifier(random_state=1), cv=3, n_jobs=-1, verbose=5,
                    param_grid ={
                    'criterion': ['gini', 'entropy'],
                    'max_depth': [None,1,2,3,4,5,6,7],
                    'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
                    'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
                    'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
                    )

grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)

Fitting 3 folds for each of 5184 candidates, totalling 15552 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 3681 tasks      | elapsed:    5.9s

Train Accuracy : 1.000
Test Accuracy : 0.974
Best Score Through Grid Search : 0.964
Best Parameters :  {'criterion': 'gini', 'max_depth': None, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2}

[Parallel(n_jobs=-1)]: Done 15552 out of 15552 | elapsed:    8.8s finished

Printing First Few Cross-Validation Results¶

GridSearchCV maintains results for all parameter combinations tried with all cross-validation splits. We can access results for all iterations as a dictionary by calling cv_results_ attribute on it. We are converting it to pandas dataframe for better visuals.

cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.

Number of Various Combinations of Parameters Tried : 5184

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_criterion	param_max_depth	param_max_features	param_min_samples_leaf	param_min_samples_split	params	split0_test_score	split1_test_score	split2_test_score	mean_test_score	std_test_score	rank_test_score
0	0.001922	0.000061	0.000724	0.000022	gini	None	None	1	2	{'criterion': 'gini', 'max_depth': None, 'max_...	0.923077	0.972973	1.000000	0.964286	0.032035	1
1	0.001439	0.000390	0.000535	0.000123	gini	None	None	1	0.3	{'criterion': 'gini', 'max_depth': None, 'max_...	0.923077	0.972973	0.972222	0.955357	0.023596	9
2	0.001997	0.000706	0.000468	0.000039	gini	None	None	1	0.5	{'criterion': 'gini', 'max_depth': None, 'max_...	0.923077	0.918919	0.972222	0.937500	0.023959	666
3	0.000884	0.000095	0.000354	0.000020	gini	None	None	1	75	{'criterion': 'gini', 'max_depth': None, 'max_...	0.333333	0.675676	0.666667	0.553571	0.161018	2161
4	0.001103	0.000361	0.000462	0.000139	gini	None	None	1	50	{'criterion': 'gini', 'max_depth': None, 'max_...	0.666667	0.918919	0.972222	0.848214	0.134430	979
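The first few rows are in grid order, not in order of quality. A sketch of how the results could be sorted by rank and trimmed to the most useful columns is shown below; it refits a much smaller grid (our choice, so that it runs quickly) rather than reusing the search above.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, Y = datasets.load_iris(return_X_y=True)

# Small illustrative grid so the example runs quickly.
search = GridSearchCV(DecisionTreeClassifier(random_state=1), cv=3,
                      param_grid={"max_depth": [1, 2, 3, None], "min_samples_leaf": [1, 10]})
search.fit(X, Y)

results = pd.DataFrame(search.cv_results_)
columns = ["param_max_depth", "param_min_samples_leaf", "mean_test_score", "std_test_score", "rank_test_score"]

# rank_test_score == 1 marks the best mean cross-validation score.
ranked = results[columns].sort_values("rank_test_score").reset_index(drop=True)
print(ranked.head())
```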

Plotting Feature Importance¶

We can access the importance of each feature in the decision tree through the feature_importances_ attribute. We also plot it below for easier interpretation.

print("Feature Importance : %s"%str(grid.best_estimator_.feature_importances_))

Feature Importance : [0.   0.02 0.42 0.56]

with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(10,4))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(4), iris.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();
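A text alternative to the heatmap is to pair each importance with its feature name and sort the result. The sketch below, our addition, fits a default tree on the full IRIS data for illustration and also confirms that the importances are normalized to sum to 1.

```python
import pandas as pd
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
tree = DecisionTreeClassifier(random_state=1).fit(iris.data, iris.target)

# Label each importance with its feature name and sort from most to least important.
importances = pd.Series(tree.feature_importances_, index=iris.feature_names).sort_values(ascending=False)
print(importances)
print("Sum of importances : %.3f" % importances.sum())
```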

Visualizing Decision Tree Using GraphViz & PyDotPlus¶

We can visualize the decision tree using graphviz. Scikit-learn provides the export_graphviz() function, which converts a trained tree to graphviz (DOT) format. We can then generate a graph from it with the graph_from_dot_data method of the pydotplus library.

We can easily ask questions about flower type based on flower features and get an answer from the decision tree based on True or False answer to the question.

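from io import StringIO ## sklearn.externals.six is deprecated and missing from newer scikit-learn releases; io.StringIO works the same way here.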
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus

dot_data = StringIO()

export_graphviz(grid.best_estimator_, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                class_names=iris.target_names,
                feature_names=iris.feature_names)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

Image(graph.create_png())
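If graphviz is not available, the same "if-else" rules can be printed as plain text. The sketch below assumes scikit-learn 0.21 or newer, where sklearn.tree.export_text is available; the shallow tree (max_depth=2) is our choice to keep the output short.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()

# max_depth=2 is an illustrative choice that keeps the printed tree short.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=1).fit(iris.data, iris.target)

# Each indented line is one condition; lines starting with "class:" are leaves.
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```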

ExtraTreeClassifier ¶

ExtraTreeClassifier is commonly referred to as an extremely randomized decision tree. When deciding how to split samples into 2 groups, it draws a random split for each of the randomly selected features and selects the best of them.
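The difference between the two splitting strategies can be illustrated on a single feature. The sketch below is a simplified illustration written for this tutorial, not scikit-learn's actual implementation: it compares the exhaustively searched best threshold with one threshold drawn uniformly at random between the feature's minimum and maximum. The helper functions and the choice of petal length are assumptions made for the example.

```python
import numpy as np
from sklearn import datasets

def gini(labels):
    """Gini impurity of a 1-D array of integer class labels."""
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def split_impurity(feature, labels, threshold):
    """Sample-weighted Gini impurity after splitting on feature <= threshold."""
    left = feature <= threshold
    n_left, n_right = left.sum(), (~left).sum()
    return (n_left * gini(labels[left]) + n_right * gini(labels[~left])) / len(labels)

X, Y = datasets.load_iris(return_X_y=True)
feature = X[:, 2]  # petal length (cm)
rng = np.random.RandomState(1)

# "Best" splitter: try the midpoint between every pair of adjacent unique values.
values = np.unique(feature)
candidates = (values[:-1] + values[1:]) / 2.0
best_threshold = min(candidates, key=lambda t: split_impurity(feature, Y, t))

# "Random" splitter: draw one threshold uniformly between the feature's min and max.
random_threshold = rng.uniform(feature.min(), feature.max())

best_impurity = split_impurity(feature, Y, best_threshold)
random_impurity = split_impurity(feature, Y, random_threshold)
print("best   threshold %.2f -> impurity %.3f" % (best_threshold, best_impurity))
print("random threshold %.2f -> impurity %.3f" % (random_threshold, random_impurity))
```

A random threshold is never better than the searched one on the training data, but it is much cheaper to compute and adds randomness that is useful when many trees are combined.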

Fitting Model To Train Data¶

from sklearn.tree import ExtraTreeClassifier

extra_tree_classifier = ExtraTreeClassifier(random_state=1)
extra_tree_classifier.fit(X_train, Y_train)

ExtraTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                    max_features='auto', max_leaf_nodes=None,
                    min_impurity_decrease=0.0, min_impurity_split=None,
                    min_samples_leaf=1, min_samples_split=2,
                    min_weight_fraction_leaf=0.0, random_state=1,
                    splitter='random')

Evaluating Trained Model On Test Data.¶

Almost all models in the scikit-learn API provide a predict() method, which predicts the target variable for the test set passed to it.

Y_preds = extra_tree_classifier.predict(X_test)

print(Y_preds)
print(Y_test)

print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean() )
print('Test Accuracy : %.3f'%extra_tree_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%extra_tree_classifier.score(X_train, Y_train))

[2 0 1 2 0 0 1 2 1 0 1 0 2 2 1 2 0 0 0 0 0 0 1 2 0 1 2 2 2 1 1 2 1 1 2 1 2
 1]
[2 0 1 2 0 0 1 2 1 0 1 0 2 2 1 2 0 0 0 0 0 0 1 2 0 1 2 2 2 1 1 2 1 1 2 1 2
 1]
Test Accuracy : 1.000
Test Accuracy : 1.000
Training Accuracy : 1.000

extra_tree_classifier.predict_proba(X_test)[:10]

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])

Finetuning Model By Doing Grid Search On Various Hyperparameters.¶

ExtraTreeClassifier has the same hyperparameters as DecisionTreeClassifier.

n_features = X.shape[1]
n_samples = X.shape[0]

grid = GridSearchCV(ExtraTreeClassifier(random_state=1), cv=3, n_jobs=-1, verbose=5,
                    param_grid ={
                    'criterion': ['gini', 'entropy'],
                    'max_depth': [None,1,2,3,4,5,6,7],
                    'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
                    'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
                    'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
                    )

grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)

Fitting 3 folds for each of 5184 candidates, totalling 15552 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 11224 tasks      | elapsed:    3.1s

Train Accuracy : 0.982
Test Accuracy : 0.974
Best Score Through Grid Search : 0.946
Best Parameters :  {'criterion': 'gini', 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2}

[Parallel(n_jobs=-1)]: Done 15552 out of 15552 | elapsed:    4.1s finished

Printing First Few Cross Validation Results¶

cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.

Number of Various Combinations of Parameters Tried : 5184

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_criterion	param_max_depth	param_max_features	param_min_samples_leaf	param_min_samples_split	params	split0_test_score	split1_test_score	split2_test_score	mean_test_score	std_test_score	rank_test_score
0	0.001576	0.000044	0.000667	0.000030	gini	None	None	1	2	{'criterion': 'gini', 'max_depth': None, 'max_...	0.948718	0.864865	0.972222	0.928571	0.045766	173
1	0.001509	0.000022	0.000781	0.000148	gini	None	None	1	0.3	{'criterion': 'gini', 'max_depth': None, 'max_...	0.948718	0.891892	0.972222	0.937500	0.033444	13
2	0.001464	0.000040	0.000617	0.000024	gini	None	None	1	0.5	{'criterion': 'gini', 'max_depth': None, 'max_...	0.871795	0.864865	0.972222	0.901786	0.048562	269
3	0.001186	0.000010	0.000561	0.000011	gini	None	None	1	75	{'criterion': 'gini', 'max_depth': None, 'max_...	0.333333	0.675676	0.666667	0.553571	0.161018	1237
4	0.001218	0.000011	0.000639	0.000121	gini	None	None	1	50	{'criterion': 'gini', 'max_depth': None, 'max_...	0.666667	0.756757	0.750000	0.723214	0.041422	545

Plotting Feature Importance¶

print("Feature Importance : %s"%str(grid.best_estimator_.feature_importances_))

Feature Importance : [0.07 0.   0.66 0.27]

with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(10,4))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(4), iris.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();

NOTE

Please note that even though decision trees provide a nonparametric way to predict the target, they sometimes over-fit and sometimes under-fit the data. In particular, decision trees do not work well on datasets with many features and few samples, because there is too little data to set the tree rules/conditions properly.
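The effect of tree depth on under-fitting and over-fitting can be seen by comparing train and test accuracy at a few depths. The sketch below is our addition; it reuses the IRIS split from earlier, and the depth values are arbitrary.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, Y = datasets.load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.75, test_size=0.25, stratify=Y, random_state=123)

train_scores, test_scores = {}, {}
for depth in (1, 2, 3, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, Y_train)
    train_scores[depth] = model.score(X_train, Y_train)
    test_scores[depth] = model.score(X_test, Y_test)
    print("max_depth=%-4s train accuracy: %.3f  test accuracy: %.3f" % (depth, train_scores[depth], test_scores[depth]))
```

A tree of depth 1 can predict only two of the three classes, so it under-fits. A large gap between train and test accuracy at greater depths would be a sign of over-fitting.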

DecisionTreeRegressor ¶

We'll now load the Boston dataset provided by sklearn and try DecisionTreeRegressor on it with different depths of the decision tree. We'll also visualize the results later, comparing performance on the train and test sets.

Loading Data¶

Below we load the Boston housing dataset, which ships with the scikit-learn version used here. load_boston() returns a Bunch object, which behaves much like a dictionary.

boston = datasets.load_boston()
X, Y  = boston.data, boston.target

print('Dataset features names : '+str(boston.feature_names))
print('Dataset features size : '+str(boston.data.shape))
print('Dataset target size : '+str(boston.target.shape))

Dataset features names : ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
Dataset features size : (506, 13)
Dataset target size : (506,)

Splitting Dataset into Train & Test sets¶

Below we split the Boston dataset into a train set (75%) and a test set (25%). We also pass a seed (random_state=1) so that we always get the same split and can reproduce the results later.

X_train, X_test,Y_train, Y_test = train_test_split(X, Y, train_size=0.75, test_size=0.25, random_state=1)
print('Train/Test Set Sizes : ', X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)

Train/Test Set Sizes :  (379, 13) (379,) (127, 13) (127,)

Fitting Model To Train Data¶

from sklearn.tree import DecisionTreeRegressor

tree_regressor = DecisionTreeRegressor(random_state=1)
tree_regressor.fit(X_train, Y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1, splitter='best')

Evaluating Trained Model On Test Data.¶

Almost all models in the scikit-learn API provide a predict() method, which predicts the target variable for the test set passed to it. For regression models, score() returns the coefficient of determination (R^2).

Y_preds = tree_regressor.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Training Coefficient of R^2 : %.3f'%tree_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%tree_regressor.score(X_test, Y_test))

[36.1 27.5 21.7 18.6 21.7 21.7 30.8 21.7 17.8 24.6]
[28.2 23.9 16.6 22.  20.8 23.  27.9 14.5 21.5 22.6]
Training Coefficient of R^2 : 1.000
Test Coefficient of R^2 : 0.699
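The R^2 value returned by score() can be reproduced by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is our addition. It uses load_diabetes as a stand-in dataset (an assumption made so the example does not depend on the Boston data), and the max_depth=3 setting is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# load_diabetes is a stand-in dataset for this sketch, not the Boston data used above.
X, Y = load_diabetes(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.75, test_size=0.25, random_state=1)

regressor = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_train, Y_train)
Y_preds = regressor.predict(X_test)

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((Y_test - Y_preds) ** 2)
ss_tot = np.sum((Y_test - Y_test.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

print("Manual R^2  : %.3f" % manual_r2)
print("score() R^2 : %.3f" % regressor.score(X_test, Y_test))
```

An R^2 of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of the test targets, and negative values mean it does worse.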

Finetuning Model By Doing Grid Search On Various Hyperparameters.¶

DecisionTreeRegressor has the same hyperparameters as DecisionTreeClassifier, except that its criterion measures regression error (mse by default, as shown above). Below we try various values for the above-mentioned hyperparameters to find the best estimator for our dataset using 3-fold cross-validation.

n_features = X.shape[1]
n_samples = X.shape[0]

grid = GridSearchCV(DecisionTreeRegressor(random_state=1), cv=3, n_jobs=-1, verbose=5,
                    param_grid ={
                    'max_depth': [None,1,2,3,4,5,6,7],
                    'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
                    'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
                    'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
                    )

grid.fit(X_train, Y_train)
print('Train R^2 Score : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)

Fitting 3 folds for each of 2592 candidates, totalling 7776 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 152 tasks      | elapsed:    0.3s

Train R^2 Score : 0.909
Test R^2 Score : 0.768
Best R^2 Score Through Grid Search : 0.780
Best Parameters :  {'max_depth': 5, 'max_features': 0.5, 'min_samples_leaf': 1, 'min_samples_split': 2}

[Parallel(n_jobs=-1)]: Done 7776 out of 7776 | elapsed:    2.9s finished

Printing First Few Cross Validation Results¶

cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.

Number of Various Combinations of Parameters Tried : 2592

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_max_depth	param_max_features	param_min_samples_leaf	param_min_samples_split	params	split0_test_score	split1_test_score	split2_test_score	mean_test_score	std_test_score	rank_test_score
0	0.009543	0.002699	0.002643	0.000548	None	None	1	2	{'max_depth': None, 'max_features': None, 'min...	0.624971	0.514967	0.765272	0.635043	0.102302	39
1	0.003043	0.000175	0.001470	0.000079	None	None	1	0.3	{'max_depth': None, 'max_features': None, 'min...	0.706766	0.577495	0.786565	0.690319	0.086036	16
2	0.002141	0.000282	0.001283	0.000041	None	None	1	0.5	{'max_depth': None, 'max_features': None, 'min...	0.428475	0.659663	0.457726	0.515059	0.102745	123
3	0.001505	0.000201	0.001882	0.000942	None	None	1	253	{'max_depth': None, 'max_features': None, 'min...	-0.008807	0.343799	0.425772	0.252896	0.188767	833
4	0.002236	0.000361	0.001326	0.000029	None	None	1	168	{'max_depth': None, 'max_features': None, 'min...	0.402911	0.576992	0.425772	0.468385	0.077212	242

Plotting Feature Importance¶

print("Feature Importance : %s"%str(grid.best_estimator_.feature_importances_))

Feature Importance : [0.01 0.   0.   0.   0.05 0.56 0.   0.06 0.04 0.   0.02 0.   0.25]

with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(12,8))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(13), boston.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();

Visualizing Decision Tree Using GraphViz & PyDotPlus¶

dot_data = StringIO()
export_graphviz(grid.best_estimator_, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                feature_names=boston.feature_names,)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

ExtraTreeRegressor ¶

ExtraTreeRegressor like ExtraTreeClassifier is an extremely randomized decision tree for regression problems. We'll follow the same process as previous examples to explain its usage.

Fitting Model To Train Data¶

from sklearn.tree import ExtraTreeRegressor

extra_tree_regressor = ExtraTreeRegressor(random_state=1)
extra_tree_regressor.fit(X_train, Y_train)

ExtraTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
                   max_leaf_nodes=None, min_impurity_decrease=0.0,
                   min_impurity_split=None, min_samples_leaf=1,
                   min_samples_split=2, min_weight_fraction_leaf=0.0,
                   random_state=1, splitter='random')

Evaluating Trained Model On Test Data.¶

Almost all models in the scikit-learn API provide a predict() method, which predicts the target variable for the test set passed to it.

Y_preds = extra_tree_regressor.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Training Coefficient of R^2 : %.3f'%extra_tree_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%extra_tree_regressor.score(X_test, Y_test))

[25.2 23.6 16.8 20.9 18.4 23.  22.8 19.6 18.8 22. ]
[28.2 23.9 16.6 22.  20.8 23.  27.9 14.5 21.5 22.6]
Training Coefficient of R^2 : 1.000
Test Coefficient of R^2 : 0.752

Finetuning Model By Doing Grid Search On Various Hyperparameters.¶

ExtraTreeRegressor has the same hyperparameters as DecisionTreeRegressor. Below we try various values for the above-mentioned hyperparameters to find the best estimator for our dataset using 3-fold cross-validation.

n_features = X.shape[1]
n_samples = X.shape[0]

grid = GridSearchCV(ExtraTreeRegressor(random_state=1), cv=3, n_jobs=-1, verbose=5,
                    param_grid ={
                    'max_depth': [None,1,2,3,4,5,6,7],
                    'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
                    'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
                    'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
                    )

grid.fit(X_train, Y_train)
print('Train R^2 Score : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)

Fitting 3 folds for each of 2592 candidates, totalling 7776 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  60 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 5880 tasks      | elapsed:    2.1s

Train R^2 Score : 0.907
Test R^2 Score : 0.780
Best R^2 Score Through Grid Search : 0.707
Best Parameters :  {'max_depth': 7, 'max_features': 0.7, 'min_samples_leaf': 1, 'min_samples_split': 2}

[Parallel(n_jobs=-1)]: Done 7776 out of 7776 | elapsed:    2.8s finished

Printing First Few Cross Validation Results¶

cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.

Number of Various Combinations of Parameters Tried : 2592

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_max_depth	param_max_features	param_min_samples_leaf	param_min_samples_split	params	split0_test_score	split1_test_score	split2_test_score	mean_test_score	std_test_score	rank_test_score
0	0.003376	0.001150	0.001233	0.000013	None	None	1	2	{'max_depth': None, 'max_features': None, 'min...	0.663471	0.554314	0.644248	0.620790	0.047566	13
1	0.001512	0.000076	0.001171	0.000010	None	None	1	0.3	{'max_depth': None, 'max_features': None, 'min...	0.503981	0.649523	0.534968	0.562669	0.062587	42
2	0.001167	0.000215	0.001085	0.000347	None	None	1	0.5	{'max_depth': None, 'max_features': None, 'min...	0.455598	0.625403	0.435738	0.505448	0.085041	98
3	0.000550	0.000021	0.000526	0.000014	None	None	1	253	{'max_depth': None, 'max_features': None, 'min...	-0.008807	0.429324	0.388414	0.268909	0.197857	599
4	0.000525	0.000048	0.000460	0.000016	None	None	1	168	{'max_depth': None, 'max_features': None, 'min...	0.455598	0.552272	0.435738	0.481135	0.050853	144

Plotting Feature Importance¶

print("Feature Importance : %s"%str(grid.best_estimator_.feature_importances_))

Feature Importance : [5.70e-02 2.24e-04 1.62e-02 1.03e-02 5.83e-02 5.92e-02 1.37e-02 2.91e-03
 1.30e-02 4.84e-02 6.10e-02 2.57e-02 6.34e-01]

with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(12,8))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(13), boston.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();

A single tree generally overfits the data, so in practice it's a good idea to combine several decision trees to predict results. The two most common ways to combine multiple decision trees are random forests and gradient boosted trees.
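As a starting point, the sketch below compares a single tree with the two ensemble approaches using scikit-learn's RandomForestClassifier and GradientBoostingClassifier. It is our addition: the settings (100 estimators, 5-fold cross-validation) are illustrative, and on a dataset as small as IRIS the difference between the models may be negligible.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, Y = datasets.load_iris(return_X_y=True)

models = {
    "Single decision tree": DecisionTreeClassifier(random_state=1),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "Gradient boosted trees": GradientBoostingClassifier(n_estimators=100, random_state=1),
}

mean_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy; cv=5 is an illustrative choice.
    mean_scores[name] = cross_val_score(model, X, Y, cv=5).mean()
    print("%-22s : %.3f" % (name, mean_scores[name]))
```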

References ¶

Sunny Solanki
