Updated On : May-29,2020 Time Investment : ~30 mins

# Scikit-Learn - Decision Trees

## Introduction

Decision trees are a class of algorithms built from "if" and "else" conditions. A sample is routed through these conditions until a decision is reached for the task at hand. The conditions are not written by hand: the training algorithm learns them from the data, so the number of conditions, the kind of conditions, and the answers they lead to differ for each dataset. Below we cover the usage of the decision tree implementations available in scikit-learn for classification and regression tasks.
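
To make the "if"/"else" idea concrete, here is a minimal sketch that prints the learned conditions of a small tree as text. It assumes scikit-learn 0.21 or later (for `export_text`); the choice of `max_depth=2` is only there to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short (max_depth=2 is an arbitrary choice).
tree = DecisionTreeClassifier(max_depth=2, random_state=1)
tree.fit(iris.data, iris.target)

# Each indented line is one learned condition; "class:" lines are the decisions.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each branch in the printout is a threshold test on one feature, and each leaf reports the class predicted for samples that reach it.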

Characteristics of decision trees:

• Fast to train and easy to understand and interpret.
• Binary splitting is the essence of decision tree models: each condition sends a sample down one of two branches.
• Require little preprocessing of data.
• Can work with variables of different types (continuous and discrete).
• Invariant to feature scaling.
• Called "nonparametric" because the model has no fixed set of parameters; the tree structure itself is learned from the data. Hyperparameters such as `max_depth` still need tuning.
• Become more flexible when given more data.
• The number of tree parameters (conditions) grows with the number of samples, covering as much of the data domain as possible.
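
The last two points can be checked with a small experiment. The sketch below is an illustration on synthetic noise data (not part of the tutorial's datasets): it fits unconstrained regression trees on increasingly large samples and reports how many nodes each tree ends up with.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
node_counts = []

for n_samples in (50, 200, 800):
    # Synthetic data: two random features and a pure-noise target.
    X_demo = rng.uniform(size=(n_samples, 2))
    y_demo = rng.normal(size=n_samples)

    # No depth limit, so the tree keeps splitting until leaves cannot be split further.
    model = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
    node_counts.append(model.tree_.node_count)
    print(n_samples, "samples ->", model.tree_.node_count, "nodes")
```

With no limits set, the tree adds conditions as long as there are samples left to separate, which is why the node count follows the sample count.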

We'll start by importing the modules needed for this tutorial. The `pydotplus` library must be installed as well, since it is used to plot the decision trees trained by scikit-learn.

```## We need to install pydotplus for this tutorial.
!pip install pydotplus
```
```Requirement already satisfied: pydotplus in ./anaconda3/lib/python3.7/site-packages (2.0.2)
Requirement already satisfied: pyparsing>=2.0.1 in ./anaconda3/lib/python3.7/site-packages (from pydotplus) (2.4.7)
```
```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

import sys
import warnings

warnings.filterwarnings('ignore')
np.set_printoptions(precision=2)

print("Python Version : ",sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)
```
```Python Version :  3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
Scikit-Learn Version :  0.21.2
```

## DecisionTreeClassifier

We'll use the classic IRIS classification dataset provided by scikit-learn. It has 150 samples across 3 categories of flowers, with 50 samples per category (iris-setosa, iris-virginica, iris-versicolor). We'll use `DecisionTreeClassifier` provided by scikit-learn for the classification task.

The dataset ships with the sklearn package and is loaded with `datasets.load_iris()`. It returns a `Bunch` object, which behaves almost like a dictionary.

```from sklearn import datasets

iris = datasets.load_iris()  ## Returns a Bunch object holding data, target and names.
X, Y = iris.data, iris.target

print('Dataset features names : '+str(iris.feature_names))
print('Dataset features size : '+str(iris.data.shape))
print('Dataset target names : '+str(iris.target_names))
print('Dataset target size : '+str(iris.target.shape))
```
```Dataset features names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Dataset features size : (150, 4)
Dataset target names : ['setosa' 'versicolor' 'virginica']
Dataset target size : (150,)
```

### Splitting Dataset into Train & Test sets

We'll split the dataset into two parts:

• `Training data` which will be used for training the model.
• `Test data` against which the accuracy of the trained model will be checked.

The `train_test_split` function of the `model_selection` module of sklearn splits the data into two sets, here 75% for training and 25% for testing. We also pass a seed (`random_state=123`) to `train_test_split` so that we always get the same split and can reproduce results in the future.

NOTE

Please make a note that we are also using the `stratify` parameter, which preserves the class proportions in both the train and test sets. For each class, roughly 75% of the samples go to the train set and 25% to the test set, so no class dominates either set.

```from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.75, test_size=0.25, stratify=Y, random_state=123)
print('Train/Test Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
```
```Train/Test Sizes :  (112, 4) (38, 4) (112,) (38,)
```
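
As a quick check of what `stratify` does, the sketch below repeats the same split and counts the samples per class on each side. The variable names are illustrative and the block reloads the data so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# Same split settings as in the tutorial, including stratification on the labels.
_, _, y_tr, y_te = train_test_split(
    X_iris, y_iris, train_size=0.75, test_size=0.25,
    stratify=y_iris, random_state=123)

# np.bincount gives the number of samples per class label (0, 1, 2).
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("Train class counts:", train_counts)
print("Test class counts :", test_counts)
```

Because the three classes are equally frequent in the full dataset, the per-class counts on each side should differ by at most one sample.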

### Fitting Model To Train Data

```from sklearn.tree import DecisionTreeClassifier

tree_classifier = DecisionTreeClassifier(random_state=1)
tree_classifier.fit(X_train, Y_train)
```
```DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1, splitter='best')```

### Evaluating Trained Model On Test Data

Almost all models in the scikit-learn API provide a `predict()` method, which predicts the target variable for the samples passed to it. We'll also use `score()`, which for classifiers returns the accuracy, to check the model on the test data.

```Y_preds = tree_classifier.predict(X_test)

print(Y_preds)
print(Y_test)

print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean() )
print('Test Accuracy : %.3f'%tree_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%tree_classifier.score(X_train, Y_train))
```
```[2 0 1 2 0 0 1 2 1 0 1 0 2 2 1 2 0 0 0 0 0 0 1 2 0 2 2 2 2 1 1 2 1 1 2 1 2
1]
[2 0 1 2 0 0 1 2 1 0 1 0 2 2 1 2 0 0 0 0 0 0 1 2 0 1 2 2 2 1 1 2 1 1 2 1 2
1]
Test Accuracy : 0.974
Test Accuracy : 0.974
Training Accuracy : 1.000
```

`DecisionTreeClassifier` also provides a `predict_proba()` method, which returns the probability the model assigns to each class. For a decision tree, this is the fraction of training samples of each class in the leaf that a sample falls into. Below we print the probabilities predicted for the first few test samples.

```tree_classifier.predict_proba(X_test)[:10]
```
```array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[0., 1., 0.],
[1., 0., 0.]])```
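
A fully grown tree usually has pure leaves, which is why the probabilities above are all 0 or 1. The sketch below is an illustration with a deliberately limited tree (`max_depth=1`, an arbitrary choice): with only two leaves for three classes, at least one leaf must mix classes, so fractional probabilities appear.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# A single split gives two leaves, so three classes cannot all be separated.
stump = DecisionTreeClassifier(max_depth=1, random_state=1).fit(X_iris, y_iris)
proba = stump.predict_proba(X_iris)

# Each distinct row corresponds to one leaf; every row sums to 1.
print(np.unique(proba, axis=0))
```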

### Finetuning Model By Doing Grid Search On Various Hyperparameters

Below is a list of common hyperparameters that need tuning to get the best fit for our data. We'll try various hyperparameter settings on several cross-validation splits of the training data to find the setting that generalizes best, ideally one where train and test accuracy are close to each other.

• criterion - String specifying which function to use to measure the quality of a split.
    • `gini` - Gini impurity. This is the default value.
    • `entropy` - Information gain.
• max_depth - Maximum depth of the tree, i.e. how many "if-else" questions can be asked in sequence before deciding the target variable. Larger values tend to over-fit and smaller values tend to under-fit, so we need to find a good value in between. The default is `None`, which lets the tree grow until leaves are pure or cannot be split further.
• max_features - Number of features to consider when looking for a split. It accepts int (1 to n_features), float (0.0-1.0], string (`sqrt`, `log2`, `auto`) or `None` as value. A float is treated as a fraction of `n_features`.
    • `None` - `n_features` are used as value if None is provided.
    • `sqrt` - `sqrt(n_features)` features are used for split.
    • `auto` - `sqrt(n_features)` features are used for split.
    • `log2` - `log2(n_features)` features are used for split.
• min_samples_split - Minimum number of samples required to split an internal node. It accepts int (at least 2) or float (0.0-1.0] values. A float is converted to ceil(min_samples_split * n_samples) samples.
• min_samples_leaf - Minimum number of samples required to be at a leaf node. It accepts int (at least 1) or float (0.0-0.5] values. A float is converted to ceil(min_samples_leaf * n_samples) samples.
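
The float form of `min_samples_split` can be checked against its integer equivalent. The sketch below is a small illustration on the full IRIS data, with 0.3 picked arbitrarily: it fits one tree with the fraction and one with the corresponding sample count and compares their sizes.

```python
import math
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

fraction = 0.3
# The documented conversion: ceil(fraction * n_samples).
as_count = math.ceil(fraction * len(X_iris))

tree_frac = DecisionTreeClassifier(min_samples_split=fraction, random_state=1).fit(X_iris, y_iris)
tree_int = DecisionTreeClassifier(min_samples_split=as_count, random_state=1).fit(X_iris, y_iris)

print("Equivalent sample count :", as_count)
print("Nodes (float / int form):", tree_frac.tree_.node_count, "/", tree_int.tree_.node_count)
```

Both settings describe the same constraint, so the two trees should come out the same size.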

### GridSearchCV

`GridSearchCV` is a wrapper class provided by sklearn. It loops through all combinations of the values given in the `param_grid` parameter, evaluates each combination with the number of cross-validation folds given as the `cv` parameter, and stores all results in the `cv_results_` attribute. It also stores the model with the best mean cross-validation score in the `best_estimator_` attribute and that score in the `best_score_` attribute.

NOTE

The `n_jobs` parameter is provided by many estimators. It accepts the number of cores to use for parallelization; a value of -1 uses all cores. The work is run in parallel in the background using the joblib library.

Below we try various values for the above-mentioned hyperparameters to find the best estimator for our dataset using 3-fold cross-validation.

```from sklearn.model_selection import GridSearchCV

n_features = X.shape[1]  ## Number of columns (features).
n_samples = X.shape[0]  ## Number of rows (samples).

grid = GridSearchCV(DecisionTreeClassifier(random_state=1), cv=3, n_jobs=-1, verbose=5,
param_grid ={
'criterion': ['gini', 'entropy'],
'max_depth': [None,1,2,3,4,5,6,7],
'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
)

grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
```
```Fitting 3 folds for each of 5184 candidates, totalling 15552 fits
```
```[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 3681 tasks      | elapsed:    5.9s
```
```Train Accuracy : 1.000
Test Accuracy : 0.974
Best Score Through Grid Search : 0.964
Best Parameters :  {'criterion': 'gini', 'max_depth': None, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
```
```[Parallel(n_jobs=-1)]: Done 15552 out of 15552 | elapsed:    8.8s finished
```
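
The candidate count in the log is the product of the lengths of the value lists in the grid. The sketch below reproduces it with `ParameterGrid`, using the IRIS shape (150 samples, 4 features) as literal values so that the block runs on its own.

```python
from sklearn.model_selection import ParameterGrid

n_features, n_samples = 4, 150  # Shape of the IRIS feature matrix.

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 1, 2, 3, 4, 5, 6, 7],
    'max_features': [None, 'sqrt', 'auto', 'log2', 0.3, 0.5, 0.7, n_features//2, n_features//3],
    'min_samples_split': [2, 0.3, 0.5, n_samples//2, n_samples//3, n_samples//5],
    'min_samples_leaf': [1, 0.3, 0.5, n_samples//2, n_samples//3, n_samples//5],
}

# ParameterGrid only enumerates combinations; it does not fit any model.
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates, "candidates ->", n_candidates * 3, "fits with cv=3")
```

This is 2 x 8 x 9 x 6 x 6 combinations, each fitted once per fold, so the grid grows quickly with every extra value.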

### Printing First Few Cross-Validation Results

`GridSearchCV` keeps the results for every parameter combination it tried on every cross-validation split. They are available as a dictionary through the `cv_results_` attribute. We convert it to a pandas dataframe for easier viewing.

```cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
```
```Number of Various Combinations of Parameters Tried : 5184
```
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | param_max_features | param_min_samples_leaf | param_min_samples_split | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001922 | 0.000061 | 0.000724 | 0.000022 | gini | None | None | 1 | 2 | {'criterion': 'gini', 'max_depth': None, 'max_... | 0.923077 | 0.972973 | 1.000000 | 0.964286 | 0.032035 | 1 |
| 1 | 0.001439 | 0.000390 | 0.000535 | 0.000123 | gini | None | None | 1 | 0.3 | {'criterion': 'gini', 'max_depth': None, 'max_... | 0.923077 | 0.972973 | 0.972222 | 0.955357 | 0.023596 | 9 |
| 2 | 0.001997 | 0.000706 | 0.000468 | 0.000039 | gini | None | None | 1 | 0.5 | {'criterion': 'gini', 'max_depth': None, 'max_... | 0.923077 | 0.918919 | 0.972222 | 0.937500 | 0.023959 | 666 |
| 3 | 0.000884 | 0.000095 | 0.000354 | 0.000020 | gini | None | None | 1 | 75 | {'criterion': 'gini', 'max_depth': None, 'max_... | 0.333333 | 0.675676 | 0.666667 | 0.553571 | 0.161018 | 2161 |
| 4 | 0.001103 | 0.000361 | 0.000462 | 0.000139 | gini | None | None | 1 | 50 | {'criterion': 'gini', 'max_depth': None, 'max_... | 0.666667 | 0.918919 | 0.972222 | 0.848214 | 0.134430 | 979 |
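
The first rows of `cv_results_` follow the order of the grid, not the quality of the models. A minimal sketch of sorting by rank is shown below; it uses a deliberately tiny grid (an assumption, to keep the run fast) and illustrative variable names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Tiny illustrative grid: only max_depth is varied.
small_grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                          param_grid={'max_depth': [1, 2, 3, None]}, cv=3)
small_grid.fit(X_iris, y_iris)

results = pd.DataFrame(small_grid.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']

# rank_test_score == 1 marks the best mean cross-validation score.
ranked = results.sort_values('rank_test_score')[columns]
print(ranked)
```

The top row after sorting carries the same mean score that `best_score_` reports.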

### Plotting Feature Importance

We can access the importance of each feature in the decision tree through the `feature_importances_` attribute. We have plotted it as well for better understanding.

```print("Feature Importance : %s"%str(grid.best_estimator_.feature_importances_))
```
```Feature Importance : [0.   0.02 0.42 0.56]
```
```with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(10,4))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(4), iris.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();
```
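
When a plot is not needed, the importances can simply be paired with the feature names. The sketch below fits a default tree on the full IRIS data (so the numbers may differ from the grid-search model above) and prints the features from most to least important.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier(random_state=1).fit(iris.data, iris.target)

importances = model.feature_importances_

# Pair each feature name with its importance and sort in descending order.
for name, score in sorted(zip(iris.feature_names, importances), key=lambda p: p[1], reverse=True):
    print("%-20s %.3f" % (name, score))

print("Sum of importances : %.3f" % importances.sum())
```

The importances are normalized, so they add up to 1 and can be read as shares of the total impurity reduction.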

### Visualizing Decision Tree Using GraphViz & PyDotPlus

We can visualize the decision tree using `graphviz`. Scikit-learn provides the `export_graphviz()` function, which converts a trained tree to the graphviz (DOT) format. We can then generate a graph from it using the `graph_from_dot_data` function of the `pydotplus` library.

Reading the tree from the root, each node asks a question about a flower feature, and following the `True` or `False` branch for each answer leads to a leaf with the predicted flower type.

```from io import StringIO  ## sklearn.externals.six is deprecated; the standard library provides StringIO.
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus

dot_data = StringIO()  ## In-memory text buffer that receives the DOT source.

export_graphviz(grid.best_estimator_, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                class_names=iris.target_names,
                feature_names=iris.feature_names)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

Image(graph.create_png())
```
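
If only the DOT source is needed, `export_graphviz` can return it as a string when `out_file=None`, which does not require `pydotplus` or a graphviz installation. The sketch below uses a small tree fitted on the full IRIS data as a stand-in for the grid-search model.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
model = DecisionTreeClassifier(max_depth=2, random_state=1).fit(iris.data, iris.target)

# With out_file=None the DOT description is returned instead of written to a file.
dot_source = export_graphviz(model, out_file=None,
                             filled=True, rounded=True,
                             class_names=list(iris.target_names),
                             feature_names=list(iris.feature_names))

print(dot_source[:200])  # First part of the DOT source.
```

The returned text can be saved to a `.dot` file and rendered later with any graphviz-compatible tool.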

## ExtraTreeClassifier

`ExtraTreeClassifier` is commonly referred to as an extremely randomized decision tree. When looking for the split that separates the samples of a node into 2 groups, a random split is drawn for each of the randomly selected `max_features` features, and the best of these random splits is chosen.
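
The difference from `DecisionTreeClassifier` is visible in the `splitter` setting. The sketch below fits both estimators with their defaults on the full IRIS data and prints the splitter strategy next to the size of the resulting tree.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

best_tree = DecisionTreeClassifier(random_state=1).fit(X_iris, y_iris)
extra_tree = ExtraTreeClassifier(random_state=1).fit(X_iris, y_iris)

# 'best' searches for the best threshold; 'random' draws thresholds at random.
for name, model in (("DecisionTreeClassifier", best_tree), ("ExtraTreeClassifier", extra_tree)):
    print("%-22s splitter=%-6s nodes=%d" % (name, model.splitter, model.tree_.node_count))
```

Randomly drawn thresholds are usually less efficient at separating classes, so an extra tree often needs more nodes to fit the same data.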

### Fitting Model To Train Data

```from sklearn.tree import ExtraTreeClassifier

extra_tree_classifier = ExtraTreeClassifier(random_state=1)
extra_tree_classifier.fit(X_train, Y_train)
```
```ExtraTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, random_state=1,
splitter='random')```

### Evaluating Trained Model On Test Data

Almost all models in the scikit-learn API provide a `predict()` method, which predicts the target variable for the samples passed to it.

```Y_preds = extra_tree_classifier.predict(X_test)

print(Y_preds)
print(Y_test)

print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean() )
print('Test Accuracy : %.3f'%extra_tree_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%extra_tree_classifier.score(X_train, Y_train))
```
```[2 0 1 2 0 0 1 2 1 0 1 0 2 2 1 2 0 0 0 0 0 0 1 2 0 1 2 2 2 1 1 2 1 1 2 1 2
1]
[2 0 1 2 0 0 1 2 1 0 1 0 2 2 1 2 0 0 0 0 0 0 1 2 0 1 2 2 2 1 1 2 1 1 2 1 2
1]
Test Accuracy : 1.000
Test Accuracy : 1.000
Training Accuracy : 1.000
```
```extra_tree_classifier.predict_proba(X_test)[:10]
```
```array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[0., 1., 0.],
[1., 0., 0.]])```

### Finetuning Model By Doing Grid Search On Various Hyperparameters

`ExtraTreeClassifier` has the same hyperparameters as `DecisionTreeClassifier`.

```n_features = X.shape[1]  ## Number of columns (features).
n_samples = X.shape[0]  ## Number of rows (samples).

grid = GridSearchCV(ExtraTreeClassifier(random_state=1), cv=3, n_jobs=-1, verbose=5,
param_grid ={
'criterion': ['gini', 'entropy'],
'max_depth': [None,1,2,3,4,5,6,7],
'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
)

grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
```
```Fitting 3 folds for each of 5184 candidates, totalling 15552 fits
```
```[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 11224 tasks      | elapsed:    3.1s
```
```Train Accuracy : 0.982
Test Accuracy : 0.974
Best Score Through Grid Search : 0.946
Best Parameters :  {'criterion': 'gini', 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2}
```
```[Parallel(n_jobs=-1)]: Done 15552 out of 15552 | elapsed:    4.1s finished
```

### Printing First Few Cross Validation Results

```cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
```
```Number of Various Combinations of Parameters Tried : 5184
```
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | param_max_features | param_min_samples_leaf | param_min_samples_split | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001576 | 0.000044 | 0.000667 | 0.000030 | gini | None | None | 1 | 2 | {'criterion': 'gini', 'max_depth': None, 'max_... | 0.948718 | 0.864865 | 0.972222 | 0.928571 | 0.045766 | 173 |
| 1 | 0.001509 | 0.000022 | 0.000781 | 0.000148 | gini | None | None | 1 | 0.3 | {'criterion': 'gini', 'max_depth': None, 'max_... | 0.948718 | 0.891892 | 0.972222 | 0.937500 | 0.033444 | 13 |
| 2 | 0.001464 | 0.000040 | 0.000617 | 0.000024 | gini | None | None | 1 | 0.5 | {'criterion': 'gini', 'max_depth': None, 'max_... | 0.871795 | 0.864865 | 0.972222 | 0.901786 | 0.048562 | 269 |
| 3 | 0.001186 | 0.000010 | 0.000561 | 0.000011 | gini | None | None | 1 | 75 | {'criterion': 'gini', 'max_depth': None, 'max_... | 0.333333 | 0.675676 | 0.666667 | 0.553571 | 0.161018 | 1237 |
| 4 | 0.001218 | 0.000011 | 0.000639 | 0.000121 | gini | None | None | 1 | 50 | {'criterion': 'gini', 'max_depth': None, 'max_... | 0.666667 | 0.756757 | 0.750000 | 0.723214 | 0.041422 | 545 |

### Plotting Feature Importance

```print("Feature Importance : %s"%str(grid.best_estimator_.feature_importances_))
```
```Feature Importance : [0.07 0.   0.66 0.27]
```
```with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(10,4))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(4), iris.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();
```

NOTE

Please make a note that even though decision trees can model the target in a nonparametric way, they over-fit when allowed to grow too deep and under-fit when constrained too much. They are also a poor fit for datasets with many features and few samples, because there is not enough data to set the tree rules/conditions reliably.
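
The trade-off in this note can be observed by sweeping `max_depth`. The sketch below is an illustration on the IRIS data with the same split settings used earlier; the list of depths is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, train_size=0.75, test_size=0.25,
    stratify=y_iris, random_state=123)

depth_scores = {}
for depth in (1, 2, 3, 5, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for each depth.
    depth_scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print("max_depth=%-4s train=%.3f test=%.3f" % ((depth,) + depth_scores[depth]))
```

A very shallow tree scores poorly on both sets (under-fit), while train accuracy keeps rising with depth; a widening gap between train and test accuracy is the usual sign of over-fitting.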

## DecisionTreeRegressor

We'll now load the Boston dataset provided by sklearn and try `DecisionTreeRegressor` on it with different depths of the decision tree. We'll also compare performance on the train and test sets and visualize the resulting tree later.

Below we are loading the Boston dataset, which comes with the sklearn version used here (0.21.2). It returns a `Bunch` object, which behaves almost like a dictionary. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so this code needs an older release.

```boston = datasets.load_boston()
X, Y  = boston.data, boston.target

print('Dataset features names : '+str(boston.feature_names))
print('Dataset features size : '+str(boston.data.shape))
print('Dataset target size : '+str(boston.target.shape))
```
```Dataset features names : ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
Dataset features size : (506, 13)
Dataset target size : (506,)
```

### Splitting Dataset into Train & Test sets

Below we are splitting the Boston dataset into the train set (75%) and test set (25%). We are also using a seed (`random_state=1`) so that we always get the same split and can reproduce results in the future.

```X_train, X_test,Y_train, Y_test = train_test_split(X, Y, train_size=0.75, test_size=0.25, random_state=1)
print('Train/Test Set Sizes : ', X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)
```
```Train/Test Set Sizes :  (379, 13) (379,) (127, 13) (127,)
```

### Fitting Model To Train Data

```from sklearn.tree import DecisionTreeRegressor

tree_regressor = DecisionTreeRegressor(random_state=1)
tree_regressor.fit(X_train, Y_train)
```
```DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1, splitter='best')```

### Evaluating Trained Model On Test Data

Almost all models in the scikit-learn API provide a `predict()` method, which predicts the target variable for the samples passed to it. For regressors, `score()` returns the coefficient of determination (R^2) instead of accuracy.

```Y_preds = tree_regressor.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Training Coefficient of R^2 : %.3f'%tree_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%tree_regressor.score(X_test, Y_test))
```
```[36.1 27.5 21.7 18.6 21.7 21.7 30.8 21.7 17.8 24.6]
[28.2 23.9 16.6 22.  20.8 23.  27.9 14.5 21.5 22.6]
Training Coefficient of R^2 : 1.000
Test Coefficient of R^2 : 0.699
```
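
For reference, the R^2 value reported by `score()` can be computed by hand as 1 - SS_res / SS_tot. The sketch below does this on the diabetes dataset bundled with scikit-learn, which is used here only as a stand-in regression dataset; `max_depth=3` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.25, random_state=1)

model = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_tr, y_tr)
preds = model.predict(X_te)

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((y_te - preds) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print("Manual R^2  : %.3f" % manual_r2)
print("score() R^2 : %.3f" % model.score(X_te, y_te))
```

A value of 1 means perfect predictions, 0 means no better than always predicting the mean of the test targets, and negative values are possible for models that do worse than that.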

### Finetuning Model By Doing Grid Search On Various Hyperparameters

`DecisionTreeRegressor` has the same hyperparameters as `DecisionTreeClassifier`, except that `criterion` takes regression criteria such as `mse`. Below we try various values for the above-mentioned hyperparameters to find the best estimator for our dataset using 3-fold cross-validation.

```n_features = X.shape[1]  ## Number of columns (features).
n_samples = X.shape[0]  ## Number of rows (samples).

grid = GridSearchCV(DecisionTreeRegressor(random_state=1), cv=3, n_jobs=-1, verbose=5,
param_grid ={
'max_depth': [None,1,2,3,4,5,6,7],
'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
)

grid.fit(X_train, Y_train)
print('Train R^2 Score : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
```
```Fitting 3 folds for each of 2592 candidates, totalling 7776 fits
```
```[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 152 tasks      | elapsed:    0.3s
```
```Train R^2 Score : 0.909
Test R^2 Score : 0.768
Best R^2 Score Through Grid Search : 0.780
Best Parameters :  {'max_depth': 5, 'max_features': 0.5, 'min_samples_leaf': 1, 'min_samples_split': 2}
```
```[Parallel(n_jobs=-1)]: Done 7776 out of 7776 | elapsed:    2.9s finished
```

### Printing First Few Cross Validation Results

```cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
```
```Number of Various Combinations of Parameters Tried : 2592
```
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_max_features | param_min_samples_leaf | param_min_samples_split | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.009543 | 0.002699 | 0.002643 | 0.000548 | None | None | 1 | 2 | {'max_depth': None, 'max_features': None, 'min... | 0.624971 | 0.514967 | 0.765272 | 0.635043 | 0.102302 | 39 |
| 1 | 0.003043 | 0.000175 | 0.001470 | 0.000079 | None | None | 1 | 0.3 | {'max_depth': None, 'max_features': None, 'min... | 0.706766 | 0.577495 | 0.786565 | 0.690319 | 0.086036 | 16 |
| 2 | 0.002141 | 0.000282 | 0.001283 | 0.000041 | None | None | 1 | 0.5 | {'max_depth': None, 'max_features': None, 'min... | 0.428475 | 0.659663 | 0.457726 | 0.515059 | 0.102745 | 123 |
| 3 | 0.001505 | 0.000201 | 0.001882 | 0.000942 | None | None | 1 | 253 | {'max_depth': None, 'max_features': None, 'min... | -0.008807 | 0.343799 | 0.425772 | 0.252896 | 0.188767 | 833 |
| 4 | 0.002236 | 0.000361 | 0.001326 | 0.000029 | None | None | 1 | 168 | {'max_depth': None, 'max_features': None, 'min... | 0.402911 | 0.576992 | 0.425772 | 0.468385 | 0.077212 | 242 |

### Plotting Feature Importance

```print("Feature Importance : %s"%str(grid.best_estimator_.feature_importances_))
```
```Feature Importance : [0.01 0.   0.   0.   0.05 0.56 0.   0.06 0.04 0.   0.02 0.   0.25]
```
```with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(12,8))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(13), boston.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();
```

### Visualizing Decision Tree Using GraphViz & PyDotPlus

```dot_data = StringIO()
export_graphviz(grid.best_estimator_, out_file=dot_data,
filled=True, rounded=True,
special_characters=True,
feature_names=boston.feature_names,)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
```

## ExtraTreeRegressor

`ExtraTreeRegressor`, like `ExtraTreeClassifier`, is an extremely randomized decision tree, used for regression problems. We'll follow the same process as in the previous examples to explain its usage.

### Fitting Model To Train Data

```from sklearn.tree import ExtraTreeRegressor

extra_tree_regressor = ExtraTreeRegressor(random_state=1)
extra_tree_regressor.fit(X_train, Y_train)
```
```ExtraTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
random_state=1, splitter='random')```

### Evaluating Trained Model On Test Data

Almost all models in the scikit-learn API provide a `predict()` method, which predicts the target variable for the samples passed to it.

```Y_preds = extra_tree_regressor.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Training Coefficient of R^2 : %.3f'%extra_tree_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%extra_tree_regressor.score(X_test, Y_test))
```
```[25.2 23.6 16.8 20.9 18.4 23.  22.8 19.6 18.8 22. ]
[28.2 23.9 16.6 22.  20.8 23.  27.9 14.5 21.5 22.6]
Training Coefficient of R^2 : 1.000
Test Coefficient of R^2 : 0.752
```

### Finetuning Model By Doing Grid Search On Various Hyperparameters

`ExtraTreeRegressor` has the same hyperparameters as `DecisionTreeRegressor`. Below we try various values for the above-mentioned hyperparameters to find the best estimator for our dataset using 3-fold cross-validation.

```n_features = X.shape[1]  ## Number of columns (features).
n_samples = X.shape[0]  ## Number of rows (samples).

grid = GridSearchCV(ExtraTreeRegressor(random_state=1), cv=3, n_jobs=-1, verbose=5,
param_grid ={
'max_depth': [None,1,2,3,4,5,6,7],
'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
)

grid.fit(X_train, Y_train)
print('Train R^2 Score : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
```
```Fitting 3 folds for each of 2592 candidates, totalling 7776 fits
```
```[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  60 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 5880 tasks      | elapsed:    2.1s
```
```Train R^2 Score : 0.907
Test R^2 Score : 0.780
Best R^2 Score Through Grid Search : 0.707
Best Parameters :  {'max_depth': 7, 'max_features': 0.7, 'min_samples_leaf': 1, 'min_samples_split': 2}
```
```[Parallel(n_jobs=-1)]: Done 7776 out of 7776 | elapsed:    2.8s finished
```

### Printing First Few Cross Validation Results

```cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
```
```Number of Various Combinations of Parameters Tried : 2592
```
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_max_features | param_min_samples_leaf | param_min_samples_split | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.003376 | 0.001150 | 0.001233 | 0.000013 | None | None | 1 | 2 | {'max_depth': None, 'max_features': None, 'min... | 0.663471 | 0.554314 | 0.644248 | 0.620790 | 0.047566 | 13 |
| 1 | 0.001512 | 0.000076 | 0.001171 | 0.000010 | None | None | 1 | 0.3 | {'max_depth': None, 'max_features': None, 'min... | 0.503981 | 0.649523 | 0.534968 | 0.562669 | 0.062587 | 42 |
| 2 | 0.001167 | 0.000215 | 0.001085 | 0.000347 | None | None | 1 | 0.5 | {'max_depth': None, 'max_features': None, 'min... | 0.455598 | 0.625403 | 0.435738 | 0.505448 | 0.085041 | 98 |
| 3 | 0.000550 | 0.000021 | 0.000526 | 0.000014 | None | None | 1 | 253 | {'max_depth': None, 'max_features': None, 'min... | -0.008807 | 0.429324 | 0.388414 | 0.268909 | 0.197857 | 599 |
| 4 | 0.000525 | 0.000048 | 0.000460 | 0.000016 | None | None | 1 | 168 | {'max_depth': None, 'max_features': None, 'min... | 0.455598 | 0.552272 | 0.435738 | 0.481135 | 0.050853 | 144 |

### Plotting Feature Importance

```print("Feature Importance : %s"%str(grid.best_estimator_.feature_importances_))
```
```Feature Importance : [5.70e-02 2.24e-04 1.62e-02 1.03e-02 5.83e-02 5.92e-02 1.37e-02 2.91e-03
1.30e-02 4.84e-02 6.10e-02 2.57e-02 6.34e-01]
```
```with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(12,8))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(13), boston.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();
```

A single tree generally overfits the data, so in practice it's a good idea to combine several decision trees to predict results. The two most common ways to combine multiple decision trees are random forests and gradient boosted trees.
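
As a rough illustration of combining trees, the sketch below fits one unconstrained tree and one random forest on the diabetes dataset bundled with scikit-learn. The dataset is a stand-in, and the forest size of 100 trees is an arbitrary choice; both are assumptions made for this example only.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.25, random_state=1)

# One fully grown tree versus an average over many randomized trees.
single_tree = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_tr, y_tr)

tree_r2 = single_tree.score(X_te, y_te)
forest_r2 = forest.score(X_te, y_te)
print("Single tree   test R^2 : %.3f" % tree_r2)
print("Random forest test R^2 : %.3f" % forest_r2)
```

Averaging many trees trained on different bootstrap samples tends to smooth out the noise that an individual tree memorizes, which is the motivation for the ensemble methods mentioned above.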

Sunny Solanki
