Treeinterpreter - Interpreting Tree-Based Model's Prediction of Individual Sample [Python]

Interpreting the predictions of a machine learning model has become important for checking the model's reliability. Common metrics such as accuracy, R2 score, mean squared error, ROC AUC, and precision-recall curves, together with plots of the model's global weights, do not give us full confidence in the model's performance. We often need to look further into which features contributed to a particular prediction, and by how much. Global weights are of little help when the question is about the contribution of an individual feature to one particular prediction. Finding out how each feature contributed to a particular prediction can help us make better decisions and check the reliability of the model.

treeinterpreter is one such library: it helps us find the contribution of individual features to a particular prediction for the tree-based models of scikit-learn. We'll primarily focus on it by going through various examples. Currently, treeinterpreter supports the scikit-learn models listed below:

  • DecisionTreeRegressor
  • DecisionTreeClassifier
  • ExtraTreeRegressor
  • ExtraTreeClassifier
  • RandomForestRegressor
  • RandomForestClassifier
  • ExtraTreesRegressor
  • ExtraTreesClassifier

treeinterpreter is based on the observation that, to make a prediction, a decision tree follows a single path from the root to a leaf. Each internal node on that path tests one feature and sends the sample left or right based on that feature's value. The tree thereby partitions the feature space into as many regions as it has leaves. Every node, internal or leaf, also has a value: the average target of the training samples that reach that node. The value at the root is therefore the average over the whole training set, and it serves as the base (bias) value. treeinterpreter uses these values to compute feature contributions: each step along the path changes the value from the parent node to the child node, and that difference is credited to the feature tested at the parent. The same process applies to random forests, where there is more than one tree and the final prediction is the average of the individual trees' predictions.
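To make the idea of node values concrete, here is a minimal sketch for a single split. The training pairs and the threshold are hypothetical, chosen only for illustration; the point is that each node's value is the mean target of the training samples reaching it, and the change from parent to child is what gets credited to the tested feature.

```python
from statistics import mean

# Hypothetical training data: (feature value, target) pairs for one feature.
train = [(1.0, 10.0), (2.0, 12.0), (3.0, 14.0), (6.0, 30.0), (7.0, 34.0)]
threshold = 4.0  # hypothetical split point chosen by the tree

# Node values are means of the targets of the samples that reach each node.
root_value = mean(y for _, y in train)                      # all samples
left_value = mean(y for x, y in train if x <= threshold)    # samples sent left
right_value = mean(y for x, y in train if x > threshold)    # samples sent right

# Contribution of the feature for a sample routed left or right.
left_contribution = left_value - root_value
right_contribution = right_value - root_value

print("base value        :", root_value)
print("left contribution :", left_contribution)
print("right contribution:", right_contribution)
```

In this toy setting, base value plus contribution reproduces the value of the child node that the sample lands in.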

Treeinterpreter - Interpreting Tree Based Model's Prediction of Individual Sample

Consider the tree shown above, generated for Boston house price prediction. We'll explain how the sample represented by the red line arrives at its final prediction. We start with a base price of 22.60 and subtract 2.64 because the value of the feature RM is less than 6.94, which gives 19.96. We then add 3.51 to 19.96 because the value of the feature LSTAT is less than 14.40 in the sample, which gives 23.47. Finally, we add 22.12 to 23.47 because the value of the feature DIS is less than 1.38 in the sample, which gives the final prediction of 45.59. In short, we start with the base value of 22.60 and add one contribution per feature tested along the path.

treeinterpreter takes a tree-based model and a set of samples as input and returns the predictions for each sample, the base value for each sample, and the contribution of each feature to each sample's prediction. This will become clear as we go through the examples below.
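The walk described above can be sketched in a few lines of plain Python. This is an illustration only, not treeinterpreter's implementation: the nested-dict tree layout, the decompose() helper, and the sample's feature values are assumptions made for the example, and the values on branches the sample does not take are placeholders. Only the thresholds and the node values along the path (22.60, 19.96, 23.47, 45.59) come from the text.

```python
# Hand-built tree: each internal node stores the feature it tests, the
# threshold, and its node value. Branches the sample does not visit carry
# placeholder values that are not taken from the original figure.
tree = {
    "value": 22.60, "feature": "RM", "threshold": 6.94,
    "left": {
        "value": 19.96, "feature": "LSTAT", "threshold": 14.40,
        "left": {
            "value": 23.47, "feature": "DIS", "threshold": 1.38,
            "left": {"value": 45.59},    # leaf reached by the sample
            "right": {"value": 23.00},   # placeholder leaf
        },
        "right": {"value": 15.00},       # placeholder leaf
    },
    "right": {"value": 37.00},           # placeholder leaf
}


def decompose(node, sample):
    """Return (bias, contributions, prediction) for one sample."""
    bias = node["value"]
    contributions = {}
    while "feature" in node:
        feature = node["feature"]
        branch = "left" if sample[feature] <= node["threshold"] else "right"
        child = node[branch]
        # Credit the change in node value to the feature tested at the parent.
        delta = child["value"] - node["value"]
        contributions[feature] = contributions.get(feature, 0.0) + delta
        node = child
    return bias, contributions, node["value"]


sample = {"RM": 6.0, "LSTAT": 10.0, "DIS": 1.2}  # hypothetical feature values
bias, contributions, prediction = decompose(tree, sample)

for name, delta in contributions.items():
    print(f"{name:>5}: {delta:+.2f}")
print(f"bias + contributions = {bias + sum(contributions.values()):.2f}")
```

A feature tested more than once on a path would simply accumulate several deltas, which is why the helper adds to any existing entry instead of overwriting it.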

We'll be explaining both classification and regression models through various examples. We'll start by importing the necessary libraries.

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

Regression

First, we'll explain the usage of treeinterpreter for a regression task.

Load Dataset

We'll be using the California housing dataset available from sklearn to show how treeinterpreter explains model predictions when the target variable is continuous. Below we have loaded the dataset, printed the part of its description that explains the individual features, and shown the first few samples.

In [2]:
from sklearn.datasets import fetch_california_housing

calif_housing = fetch_california_housing()

for line in calif_housing.DESCR.split("\n")[5:22]:
    print(line)

calif_housing_df = pd.DataFrame(data=calif_housing.data, columns=calif_housing.feature_names)
calif_housing_df["Price($)"] = calif_housing.target

calif_housing_df.head()
**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None
Out[2]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude Price($)
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422

Below we have divided the dataset into train (80%) and test (20%) sets. We'll use the train set for fitting the models and randomly select samples from the test set to explain model predictions using treeinterpreter.

In [3]:
from sklearn.model_selection import train_test_split

X_calif, Y_calif = calif_housing.data, calif_housing.target

print("Dataset Size : ", X_calif.shape, Y_calif.shape)

X_train_calif, X_test_calif, Y_train_calif, Y_test_calif = train_test_split(X_calif, Y_calif,
                                                                            train_size=0.8,
                                                                            test_size=0.2,
                                                                            random_state=123)

print("Train/Test Size : ", X_train_calif.shape, X_test_calif.shape, Y_train_calif.shape, Y_test_calif.shape)
Dataset Size :  (20640, 8) (20640,)
Train/Test Size :  (16512, 8) (4128, 8) (16512,) (4128,)

DecisionTreeRegressor

The first model that we'll fit to the train data is DecisionTreeRegressor. We have then printed the R2 score of the model on both the train and test datasets.

In [4]:
from sklearn.tree import DecisionTreeRegressor

dtree_reg = DecisionTreeRegressor(max_depth=10)
dtree_reg.fit(X_train_calif, Y_train_calif)
Out[4]:
DecisionTreeRegressor(criterion='mse', max_depth=10, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
In [5]:
print("Test  R^2 Score : %.2f"%dtree_reg.score(X_test_calif, Y_test_calif))
print("Train R^2 Score : %.2f"%dtree_reg.score(X_train_calif, Y_train_calif))
Test  R^2 Score : 0.70
Train R^2 Score : 0.83
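For regressors, score() reports the coefficient of determination R^2. As a reminder of what that number measures, here is a minimal sketch of the formula R^2 = 1 - SS_res / SS_tot. The helper name r2_score_manual and the four target/prediction pairs are made up for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean target.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]  # hypothetical targets
y_pred = [2.0, 5.0, 8.0, 9.0]  # hypothetical predictions

score = r2_score_manual(y_true, y_pred)
print(f"R^2 : {score:.2f}")
```

A score of 1.0 means perfect predictions and 0.0 means no better than predicting the mean, so the gap between the train and test scores above hints at some overfitting.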

We'll start by importing treeinterpreter. It exposes a single function named predict(), which takes a fitted model instance and the samples for which we need explanations. It returns three arrays as output.

  • The first array holds the predictions for the samples passed to the function.
  • The second array holds the bias (base value) for each sample, to which the individual feature contributions are added to produce the final prediction.
  • The third array has shape (#samples x #features) and holds the contribution of each feature for each sample; these are added to the base/bias value to produce the prediction.

Below we have passed the decision tree regressor and the California test dataset to the predict() function of treeinterpreter to generate predictions, biases, and feature contributions.

In [6]:
from treeinterpreter import treeinterpreter as ti

preds, bias, contributions = ti.predict(dtree_reg, X_test_calif)
preds.shape, bias.shape, contributions.shape
Out[6]:
((4128, 1), (4128,), (4128, 8))

Below we are explaining various values of output for the 0th sample. We are also adding contributions for the 0th sample to 0th bias value to generate a prediction.

In [7]:
print("Bias For Sample 0                        : %.2f"%bias[0])
print("Contributions For Sample 0               : %s"%contributions[0])
print("Prediction Based on Bias & Contributions : %.2f"%(bias[0] + contributions[0].sum()))
print("Actual Target Value                      : %.2f"%Y_test_calif[0])
print("Target Value As Per Treeinterpreter      : %.2f"%preds[0][0])
Bias For Sample 0                        : 2.07
Contributions For Sample 0               : [-0.16431123  0.          0.         -0.23541604  0.         -0.22254362
  0.04525048  0.10894851]
Prediction Based on Bias & Contributions : 1.60
Actual Target Value                      : 1.52
Target Value As Per Treeinterpreter      : 1.60
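The additivity shown in this output can be re-checked by hand. The sketch below copies the eight contributions printed for sample 0 and uses the base value with six decimals (2.069687, as it appears in the "Base" rows of the contribution tables in this tutorial) instead of the rounded 2.07. The variable names are chosen for the example.

```python
import math

# Base value with six decimals, as printed in the "Base" rows of this tutorial.
bias_0 = 2.069687

# Contributions printed above for sample 0, in feature order.
contributions_0 = [-0.16431123, 0.0, 0.0, -0.23541604,
                   0.0, -0.22254362, 0.04525048, 0.10894851]

# Prediction = base value + sum of per-feature contributions.
prediction_0 = bias_0 + math.fsum(contributions_0)
print(f"Prediction : {prediction_0:.2f}")

# Zero entries are features whose net contribution on this path is nil.
n_zero = sum(1 for c in contributions_0 if c == 0.0)
print(f"Features with zero contribution : {n_zero}")
```

The result rounds to the same 1.60 that treeinterpreter reports for this sample.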

Below we take a random sample from the test dataset. We have created a function named create_contrbutions_df() which takes the contributions, the index of the selected sample, and the feature names, and generates a pandas dataframe in which each row holds the contribution of one feature. The first row is the bias/base value (read from the global bias array), and the last row is the prediction calculated by adding all feature contributions to the base/bias value.

In [8]:
import random

random_sample = random.randint(0, len(X_test_calif) - 1)  # randint includes both endpoints
print("Selected Sample     : %d"%random_sample)
print("Actual Target Value : %.2f"%Y_test_calif[random_sample])
print("Predicted Value     : %.2f"%preds[random_sample][0])

def create_contrbutions_df(contributions, random_sample, feature_names):
    contribs = contributions[random_sample].tolist()
    contribs.insert(0, bias[random_sample])
    contribs = np.array(contribs)
    contrib_df = pd.DataFrame(data=contribs, index=["Base"] + feature_names, columns=["Contributions"])
    prediction = contrib_df.Contributions.sum()
    contrib_df.loc["Prediction"] = prediction
    return contrib_df

contrib_df = create_contrbutions_df(contributions, random_sample, calif_housing.feature_names)
contrib_df
Selected Sample     : 425
Actual Target Value : 1.67
Predicted Value     : 2.02
Out[8]:
Contributions
Base 2.069687
MedInc 0.393169
HouseAge 0.095787
AveRooms -0.134500
AveBedrms 0.000000
Population 0.000000
AveOccup -0.409689
Latitude 0.000000
Longitude 0.005552
Prediction 2.020006
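A table like this is easier to read when the features are ordered by how strongly they moved the prediction. The following sketch ranks the values copied from the output above by absolute contribution; the dictionary is typed in by hand for illustration and is not produced by treeinterpreter.

```python
# Feature contributions copied from the table above (sample 425).
contribs = {
    "MedInc": 0.393169, "HouseAge": 0.095787, "AveRooms": -0.134500,
    "AveBedrms": 0.0, "Population": 0.0, "AveOccup": -0.409689,
    "Latitude": 0.0, "Longitude": 0.005552,
}

# Sort by magnitude so that large negative pushes rank as high as positive ones.
ranked = sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, value in ranked:
    if value != 0.0:  # skip features with no net effect on this prediction
        print(f"{name:<10} {value:+.6f}")
```

For this sample, AveOccup and MedInc pull in opposite directions with similar strength, which is why the prediction ends up close to the base value.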

Below we have created a function that takes the contributions dataframe created earlier and builds a plotly waterfall chart from it. The chart shows how we start with the base value and add the contribution of each feature to arrive at the final prediction.

In [ ]:
import plotly.graph_objects as go

def create_waterfall_chart(contrib_df, prediction):
    fig = go.Figure(go.Waterfall(
        name = "Prediction", #orientation = "h", 
        measure = ["relative"] * (len(contrib_df)-1) + ["total"],
        x = contrib_df.index,
        y = contrib_df.Contributions,
        connector = {"mode":"between", "line":{"width":4, "color":"rgb(0, 0, 0)", "dash":"solid"}}
    ))

    fig.update_layout(title = "Prediction : %s"%prediction)

    return fig

create_waterfall_chart(contrib_df, contrib_df.loc["Prediction", "Contributions"])
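What a waterfall chart draws can be understood without plotting: every "relative" bar starts at the running total left by the previous bar, and the final bar shows where the running total ends. Here is a minimal sketch of those running totals, using the base value and contributions from the table of sample 425 shown earlier; the variable names are chosen for the example.

```python
from itertools import accumulate

labels = ["Base", "MedInc", "HouseAge", "AveRooms", "AveBedrms",
          "Population", "AveOccup", "Latitude", "Longitude"]
steps = [2.069687, 0.393169, 0.095787, -0.134500, 0.0,
         0.0, -0.409689, 0.0, 0.005552]

# Running total after each step: the level at which each bar ends.
running = list(accumulate(steps))

for label, step, level in zip(labels, steps, running):
    print(f"{label:<10} {step:+.6f} -> {level:.6f}")
```

The last running total matches the "Prediction" row of the dataframe, which is the height of the final bar in the chart.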

ExtraTreeRegressor

The second estimator that we'll use to demonstrate treeinterpreter is ExtraTreeRegressor. We'll follow the same process as in the first example for all the remaining examples.

Below we have fitted ExtraTreeRegressor to the train data and evaluated the R2 score on both test and train data. We have also calculated predictions, biases, and feature contributions for the test dataset using treeinterpreter.

In [10]:
from sklearn.tree import ExtraTreeRegressor

etree_reg = ExtraTreeRegressor(max_depth=15)
etree_reg.fit(X_train_calif, Y_train_calif)

print("Test  R^2 Score : %.2f"%etree_reg.score(X_test_calif, Y_test_calif))
print("Train R^2 Score : %.2f"%etree_reg.score(X_train_calif, Y_train_calif))

preds, bias, contributions = ti.predict(etree_reg, X_test_calif)
Test  R^2 Score : 0.66
Train R^2 Score : 0.87

Below we generate a contributions dataframe for a random test sample.

In [11]:
random_sample = random.randint(0, len(X_test_calif) - 1)  # randint includes both endpoints
print("Selected Sample     : %d"%random_sample)
print("Actual Target Value : %.2f"%Y_test_calif[random_sample])
print("Predicted Value     : %.2f"%preds[random_sample][0])

contrib_df = create_contrbutions_df(contributions, random_sample, calif_housing.feature_names)
contrib_df
Selected Sample     : 3000
Actual Target Value : 2.58
Predicted Value     : 4.90
Out[11]:
Contributions
Base 2.069687
MedInc 1.391115
HouseAge 0.632292
AveRooms 0.496890
AveBedrms 0.035665
Population 0.000000
AveOccup 0.000000
Latitude 0.194094
Longitude 0.076665
Prediction 4.896408

Below we have generated a waterfall chart from the contributions dataframe created in the previous step.

In [ ]:
create_waterfall_chart(contrib_df, contrib_df.loc["Prediction", "Contributions"])

RandomForestRegressor

The third sklearn estimator that we'll explore is RandomForestRegressor. Below we have fitted it to the train data and evaluated the R2 score on both test and train data. We have then generated predictions, biases, and contributions for the test dataset.

In [13]:
from sklearn.ensemble import RandomForestRegressor

rand_forest = RandomForestRegressor()
rand_forest.fit(X_train_calif, Y_train_calif)

print("Test  R^2 Score : %.2f"%rand_forest.score(X_test_calif, Y_test_calif))
print("Train R^2 Score : %.2f"%rand_forest.score(X_train_calif, Y_train_calif))

preds, bias, contributions = ti.predict(rand_forest, X_test_calif)
Test  R^2 Score : 0.78
Train R^2 Score : 0.96

Below we have generated contributions dataframe for a random sample of test data.

In [14]:
random_sample = random.randint(0, len(X_test_calif) - 1)  # randint includes both endpoints
print("Selected Sample     : %d"%random_sample)
print("Actual Target Value : %.2f"%Y_test_calif[random_sample])
print("Predicted Value     : %.2f"%preds[random_sample][0])

contrib_df = create_contrbutions_df(contributions, random_sample, calif_housing.feature_names)
contrib_df
Selected Sample     : 113
Actual Target Value : 3.03
Predicted Value     : 2.84
Out[14]:
Contributions
Base 2.070873
MedInc 1.168176
HouseAge -0.155213
AveRooms -0.054797
AveBedrms -0.056146
Population -0.027034
AveOccup -0.040711
Latitude 0.130924
Longitude -0.198070
Prediction 2.838000

We have now plotted a waterfall chart for the random test sample.

In [ ]:
create_waterfall_chart(contrib_df, contrib_df.loc["Prediction", "Contributions"])

ExtraTreesRegressor

The last sklearn estimator that we'll cover in the regression section is ExtraTreesRegressor. Below we have fitted ExtraTreesRegressor to the train data and evaluated the R2 score on both test and train data. We have then calculated predictions, biases, and feature contributions for the test data.

In [16]:
from sklearn.ensemble import ExtraTreesRegressor

etrees_reg = ExtraTreesRegressor()
etrees_reg.fit(X_train_calif, Y_train_calif)

print("Test  R^2 Score : %.2f"%etrees_reg.score(X_test_calif, Y_test_calif))
print("Train R^2 Score : %.2f"%etrees_reg.score(X_train_calif, Y_train_calif))

preds, bias, contributions = ti.predict(etrees_reg, X_test_calif)
Test  R^2 Score : 0.79
Train R^2 Score : 1.00

Below we have generated contributions dataframe on the random test sample.

In [17]:
random_sample = random.randint(0, len(X_test_calif) - 1)  # randint includes both endpoints
print("Selected Sample     : %d"%random_sample)
print("Actual Target Value : %.2f"%Y_test_calif[random_sample])
print("Predicted Value     : %.2f"%preds[random_sample][0])

contrib_df = create_contrbutions_df(contributions, random_sample, calif_housing.feature_names)
contrib_df
Selected Sample     : 3003
Actual Target Value : 2.73
Predicted Value     : 3.92
Out[17]:
Contributions
Base 2.069687
MedInc -0.599242
HouseAge -0.009282
AveRooms 0.045874
AveBedrms 0.223374
Population -0.101220
AveOccup 1.427132
Latitude 0.350550
Longitude 0.510232
Prediction 3.917104

We have then generated a waterfall chart for the contributions dataframe created in the previous step.

In [ ]:
create_waterfall_chart(contrib_df, contrib_df.loc["Prediction", "Contributions"])

Classification

The second section of this tutorial explains the usage of treeinterpreter for classification tasks. We'll be using the well-known wine classification dataset, which ships with sklearn. The dataset records the results of a chemical analysis of three different categories of wine. Below, we have loaded the dataset and printed the description of the individual features as well.

In [19]:
from sklearn.datasets import load_wine

wine = load_wine()

for line in wine.DESCR.split("\n")[5:29]:
    print(line)

wine_df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
wine_df["WineType"] = wine.target

wine_df.head()
**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2

Out[19]:
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline WineType
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0 0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0 0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0 0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0 0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0 0

We have now divided the dataset into the train (80%) and test (20%) sets.

In [20]:
from sklearn.model_selection import train_test_split

X_wine, Y_wine = wine.data, wine.target

print("Dataset Size : ", X_wine.shape, Y_wine.shape)

X_train_wine, X_test_wine, Y_train_wine, Y_test_wine = train_test_split(X_wine, Y_wine,
                                                                        train_size=0.8,
                                                                        test_size=0.2,
                                                                        stratify=Y_wine,
                                                                        random_state=123)

print("Train/Test Size : ", X_train_wine.shape, X_test_wine.shape, Y_train_wine.shape, Y_test_wine.shape)
Dataset Size :  (178, 13) (178,)
Train/Test Size :  (142, 13) (36, 13) (142,) (36,)

DecisionTreeClassifier

The first estimator that we'll use to demonstrate treeinterpreter on a classification task is DecisionTreeClassifier. Below we have trained DecisionTreeClassifier on the train data and evaluated the accuracy of the trained model on the test and train datasets. We have then generated predictions, biases, and contributions for the test samples.

Please note the shapes of the predictions, biases, and contributions this time. The last dimension of each equals the number of classes of the target variable.

  • The predictions array has three probabilities per sample, one for each class.
  • The biases array has three biases per sample, one for each class.
  • The contributions array has three values for each feature of each sample, one per class.
In [21]:
from sklearn.tree import DecisionTreeClassifier

dtree_classif = DecisionTreeClassifier()
dtree_classif.fit(X_train_wine, Y_train_wine)

print("Test  Accuracy : %.2f"%dtree_classif.score(X_test_wine, Y_test_wine))
print("Train Accuracy : %.2f"%dtree_classif.score(X_train_wine, Y_train_wine))

preds, bias, contributions = ti.predict(dtree_classif, X_test_wine)
preds.shape, bias.shape, contributions.shape
Test  Accuracy : 0.89
Train Accuracy : 1.00
Out[21]:
((36, 3), (36, 3), (36, 13, 3))

Below we have included code that explains the 0th sample of the test data. We also show how the feature contributions are added to the biases to obtain the predicted class.

In [22]:
print("Bias For Sample 0                        : %s"%bias[0])
print("Contributions For Sample 0               : %s"%contributions[0])
print("Prediction Based on Bias & Contributions : %.2f"%np.argmax((bias[0] + contributions[0].sum(axis=0))))
print("Actual Target Value                      : %.2f"%Y_test_wine[0])
print("Target Value As Per Treeinterpreter      : %.2f"%np.argmax(preds[0]))
Bias For Sample 0                        : [0.33098592 0.40140845 0.26760563]
Contributions For Sample 0               : [[ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [-0.02040816  0.02040816  0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [-0.29098592  0.55859155 -0.26760563]
 [ 0.          0.          0.        ]
 [-0.01959184  0.01959184  0.        ]
 [ 0.          0.          0.        ]]
Prediction Based on Bias & Contributions : 1.00
Actual Target Value                      : 1.00
Target Value As Per Treeinterpreter      : 1.00
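For classification, the same additivity holds once per class. The sketch below redoes the calculation for sample 0 by hand, using the bias vector and the three non-zero rows of the contributions matrix printed above (feature indices 2, 9, and 11, i.e. ash, color_intensity, and od280/od315_of_diluted_wines). The variable names are chosen for the example.

```python
# Per-class base values printed above for sample 0.
bias_0 = [0.33098592, 0.40140845, 0.26760563]

# The three non-zero rows of the contributions matrix; all other rows are zero.
nonzero_rows = [
    [-0.02040816, 0.02040816, 0.0],           # ash
    [-0.29098592, 0.55859155, -0.26760563],   # color_intensity
    [-0.01959184, 0.01959184, 0.0],           # od280/od315_of_diluted_wines
]

# Class score = class bias + sum of that class's column over all features.
class_scores = [
    b + sum(row[k] for row in nonzero_rows)
    for k, b in enumerate(bias_0)
]

# The predicted class is the one with the largest score.
predicted_class = max(range(len(class_scores)), key=class_scores.__getitem__)

print([round(s, 6) for s in class_scores])
print("Predicted class :", predicted_class)
```

The scores come out as class probabilities that sum to one, and the largest one identifies class 1, matching the output above.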

We have now taken a random sample of the test data. We have then redefined create_contrbutions_df(), which takes the contributions, the index of the random sample, and the feature names, to generate a contributions dataframe as in the regression section. The only difference is that the dataframe now has three columns, one for each class.

In [23]:
random_sample = random.randint(0, len(X_test_wine) - 1)  # randint includes both endpoints
print("Selected Sample     : %d"%random_sample)
print("Actual Target Value : %s"%wine.target_names[Y_test_wine[random_sample]])
print("Predicted Value     : %s"%wine.target_names[np.argmax(preds[random_sample])])

def create_contrbutions_df(contributions, random_sample, feature_names):
    contribs = contributions[random_sample].tolist()
    contribs.insert(0, bias[random_sample])
    contribs = np.array(contribs)
    contrib_df = pd.DataFrame(data=contribs, index=["Base"] + feature_names, columns=["Contributions_0", "Contributions_1", "Contributions_2"])
    prediction = contrib_df[["Contributions_0", "Contributions_1", "Contributions_2"]].sum()
    contrib_df.loc["Prediction"] = prediction
    return contrib_df

contrib_df = create_contrbutions_df(contributions, random_sample, wine.feature_names)
contrib_df
Selected Sample     : 32
Actual Target Value : class_0
Predicted Value     : class_0
Out[23]:
Contributions_0 Contributions_1 Contributions_2
Base 0.330986 4.014085e-01 0.267606
alcohol 0.000000 0.000000e+00 0.000000
malic_acid 0.000000 0.000000e+00 0.000000
ash 0.000000 0.000000e+00 0.000000
alcalinity_of_ash 0.000000 0.000000e+00 0.000000
magnesium 0.000000 0.000000e+00 0.000000
total_phenols 0.000000 0.000000e+00 0.000000
flavanoids 0.359926 5.311731e-02 -0.413043
nonflavanoid_phenols 0.000000 0.000000e+00 0.000000
proanthocyanins 0.000000 0.000000e+00 0.000000
color_intensity 0.158145 -3.035824e-01 0.145438
hue 0.000000 0.000000e+00 0.000000
od280/od315_of_diluted_wines 0.000000 0.000000e+00 0.000000
proline 0.150943 -1.509434e-01 0.000000
Prediction 1.000000 -2.775558e-17 0.000000
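The value -2.775558e-17 in the Prediction row is floating-point round-off, not a real negative probability: sums of binary floating-point numbers that are mathematically zero often leave a tiny residue. The following short sketch shows the effect and one way to tidy such values for display; the six-decimal rounding is an arbitrary choice for the example.

```python
import math

# Mathematically zero, but binary rounding leaves a tiny residue.
residue = 0.1 + 0.2 - 0.3
print(residue)

# Compare against zero with an absolute tolerance instead of ==.
is_zero = math.isclose(residue, 0.0, abs_tol=1e-9)
print(is_zero)

# Tidying the Prediction row from the table above for display.
prediction_row = [1.0, -2.775558e-17, 0.0]
cleaned = [round(p, 6) + 0.0 for p in prediction_row]  # "+ 0.0" turns -0.0 into 0.0
print(cleaned)
```

The same reasoning applies to the e-17 and e-18 entries in the Prediction rows of the other classification examples.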

Below we have generated a waterfall chart from the contributions dataframe, using the column of the predicted class (the column whose Prediction row has the largest value, here 1).

In [ ]:
idx = contrib_df.loc["Prediction"].values.argmax()
col = "Contributions_%d"%idx
contrib_df = contrib_df[[col]].rename(columns={col:"Contributions"})

create_waterfall_chart(contrib_df, wine.target_names[idx])

ExtraTreeClassifier

The second sklearn estimator that we'll train is ExtraTreeClassifier. We have printed the accuracy of the model on the test and train datasets and then generated predictions, biases, and contributions for this model on the test dataset.

In [25]:
from sklearn.tree import ExtraTreeClassifier

etree_classif = ExtraTreeClassifier()
etree_classif.fit(X_train_wine, Y_train_wine)

print("Test  Accuracy : %.2f"%etree_classif.score(X_test_wine, Y_test_wine))
print("Train Accuracy : %.2f"%etree_classif.score(X_train_wine, Y_train_wine))

preds, bias, contributions = ti.predict(etree_classif, X_test_wine)
Test  Accuracy : 0.89
Train Accuracy : 1.00

Below we have generated contributions dataframe for a random sample of the test dataset.

In [26]:
random_sample = random.randint(0, len(X_test_wine) - 1)  # randint includes both endpoints
print("Selected Sample     : %d"%random_sample)
print("Actual Target Value : %s"%wine.target_names[Y_test_wine[random_sample]])
print("Predicted Value     : %s"%wine.target_names[np.argmax(preds[random_sample])])

contrib_df = create_contrbutions_df(contributions, random_sample, wine.feature_names)
contrib_df
Selected Sample     : 14
Actual Target Value : class_2
Predicted Value     : class_2
Out[26]:
Contributions_0 Contributions_1 Contributions_2
Base 3.309859e-01 0.401408 0.267606
alcohol 0.000000e+00 -0.562500 0.562500
malic_acid 0.000000e+00 -0.104167 0.104167
ash 0.000000e+00 0.000000 0.000000
alcalinity_of_ash 0.000000e+00 0.000000 0.000000
magnesium -1.568473e-01 0.156716 0.000131
total_phenols 0.000000e+00 0.000000 0.000000
flavanoids 0.000000e+00 0.000000 0.000000
nonflavanoid_phenols -1.540299e-01 -0.016716 0.170746
proanthocyanins 0.000000e+00 0.000000 0.000000
color_intensity 1.989128e-02 0.098592 -0.118483
hue 0.000000e+00 0.000000 0.000000
od280/od315_of_diluted_wines 0.000000e+00 0.000000 0.000000
proline -4.000000e-02 0.026667 0.013333
Prediction 6.938894e-18 0.000000 1.000000

We have now generated a waterfall chart from the contributions dataframe, using the column of the predicted class (the column whose Prediction row is 1).

In [ ]:
idx = contrib_df.loc["Prediction"].values.argmax()
col = "Contributions_%d"%idx
contrib_df = contrib_df[[col]].rename(columns={col:"Contributions"})

create_waterfall_chart(contrib_df, wine.target_names[idx])

RandomForestClassifier

The third estimator that we have trained on the wine train data is RandomForestClassifier. We have also printed the test and train accuracy of the model. We then calculated predictions, biases, and feature contributions for the test samples.

In [28]:
from sklearn.ensemble import RandomForestClassifier

rf_classif = RandomForestClassifier()
rf_classif.fit(X_train_wine, Y_train_wine)

print("Test  Accuracy : %.2f"%rf_classif.score(X_test_wine, Y_test_wine))
print("Train Accuracy : %.2f"%rf_classif.score(X_train_wine, Y_train_wine))

preds, bias, contributions = ti.predict(rf_classif, X_test_wine)
Test  Accuracy : 0.97
Train Accuracy : 1.00

We have now taken a random test sample and generated a contributions dataframe based on it.

In [29]:
random_sample = random.randint(0, len(X_test_wine) - 1)  # randint includes both endpoints
print("Selected Sample     : %d"%random_sample)
print("Actual Target Value : %s"%wine.target_names[Y_test_wine[random_sample]])
print("Predicted Value     : %s"%wine.target_names[np.argmax(preds[random_sample])])

contrib_df = create_contrbutions_df(contributions, random_sample, wine.feature_names)
contrib_df
Selected Sample     : 9
Actual Target Value : class_0
Predicted Value     : class_0
Out[29]:
Contributions_0 Contributions_1 Contributions_2
Base 0.330282 0.403521 2.661972e-01
alcohol 0.129859 -0.163338 3.347835e-02
malic_acid 0.029504 0.003362 -3.286635e-02
ash 0.000000 0.000000 0.000000e+00
alcalinity_of_ash 0.016705 -0.006178 -1.052632e-02
magnesium 0.033831 -0.034356 5.250525e-04
total_phenols -0.013669 -0.012396 2.606508e-02
flavanoids 0.185725 0.019274 -2.049993e-01
nonflavanoid_phenols 0.002222 -0.002222 0.000000e+00
proanthocyanins 0.000000 0.040105 -4.010539e-02
color_intensity -0.017827 0.000987 1.684017e-02
hue 0.012495 -0.003192 -9.302326e-03
od280/od315_of_diluted_wines 0.006912 0.018154 -2.506624e-02
proline 0.183960 -0.163720 -2.023996e-02
Prediction 0.900000 0.100000 1.387779e-17
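Unlike the single-tree examples, the forest's Prediction row is not a hard 0/1 vector, because a forest averages the outputs of its trees. One plausible reading of 0.9 / 0.1, assuming a forest of 10 fully grown trees that each give a one-hot class probability, is that nine trees chose class_0 and one chose class_1. The per-tree votes below are hypothetical and only illustrate the averaging.

```python
# Hypothetical per-tree class probabilities for one sample:
# nine trees vote for class_0 and one tree votes for class_1.
tree_probs = [[1.0, 0.0, 0.0]] * 9 + [[0.0, 1.0, 0.0]]

n_trees = len(tree_probs)
n_classes = len(tree_probs[0])

# The forest's probability for each class is the mean over all trees.
forest_probs = [
    sum(p[k] for p in tree_probs) / n_trees
    for k in range(n_classes)
]

print(forest_probs)
```

Because the prediction is an average over trees, nearly every feature picks up a small non-zero contribution in the table above, whereas a single tree credits only the features on one path.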

Below we have plotted a waterfall chart of predicted class feature contributions.

In [ ]:
idx = contrib_df.loc["Prediction"].values.argmax()
col = "Contributions_%d"%idx
contrib_df = contrib_df[[col]].rename(columns={col:"Contributions"})

create_waterfall_chart(contrib_df, wine.target_names[idx])

ExtraTreesClassifier

The fourth and last estimator that we'll explain is ExtraTreesClassifier. We have trained ExtraTreesClassifier on train data and calculated accuracy on test and train data. We have then calculated predictions, biases, and feature contributions using treeinterpreter for test samples.

In [31]:
from sklearn.ensemble import ExtraTreesClassifier

etrees_classif = ExtraTreesClassifier()
etrees_classif.fit(X_train_wine, Y_train_wine)

print("Test  Accuracy : %.2f"%etrees_classif.score(X_test_wine, Y_test_wine))
print("Train Accuracy : %.2f"%etrees_classif.score(X_train_wine, Y_train_wine))

preds, bias, contributions = ti.predict(etrees_classif, X_test_wine)
Test  Accuracy : 0.97
Train Accuracy : 1.00

Below we have taken a random sample from test data and generated feature contributions dataframe from it.

In [32]:
random_sample = random.randint(0, len(X_test_wine) - 1)  # randint includes both endpoints
print("Selected Sample     : %d"%random_sample)
print("Actual Target Value : %s"%wine.target_names[Y_test_wine[random_sample]])
print("Predicted Value     : %s"%wine.target_names[np.argmax(preds[random_sample])])

contrib_df = create_contrbutions_df(contributions, random_sample, wine.feature_names)
contrib_df
Selected Sample     : 6
Actual Target Value : class_1
Predicted Value     : class_1
Out[32]:
Contributions_0 Contributions_1 Contributions_2
Base 3.309859e-01 0.401408 2.676056e-01
alcohol -1.578991e-01 0.197958 -4.005875e-02
malic_acid 2.619048e-03 -0.030175 2.755556e-02
ash 2.773810e-02 -0.033864 6.125541e-03
alcalinity_of_ash -7.813283e-03 -0.001394 9.207045e-03
magnesium -6.396396e-03 -0.019511 2.590734e-02
total_phenols 2.051282e-02 -0.020513 0.000000e+00
flavanoids 4.221722e-02 0.117587 -1.598046e-01
nonflavanoid_phenols 3.231152e-02 -0.008018 -2.429308e-02
proanthocyanins 3.346871e-03 0.011087 -1.443370e-02
color_intensity -1.199572e-01 0.228560 -1.086023e-01
hue -5.650711e-02 -0.034076 9.058280e-02
od280/od315_of_diluted_wines 7.329753e-02 0.032569 -1.058669e-01
proline -1.844559e-01 0.158381 2.607541e-02
Prediction 1.110223e-16 1.000000 8.673617e-17

Finally, we have generated a waterfall chart of the predicted class's feature contributions from the contributions dataframe created in the previous step.

In [ ]:
idx = contrib_df.loc["Prediction"].values.argmax()
col = "Contributions_%d"%idx
contrib_df = contrib_df[[col]].rename(columns={col:"Contributions"})

create_waterfall_chart(contrib_df, wine.target_names[idx])


Text Classification Example

We'll now explain how to use treeinterpreter on a text dataset with a tree-based classifier. We'll start by downloading the SMS spam collection (ham/spam messages) from the UCI machine learning repository.

In [2]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip
--2020-11-01 21:10:32--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203415 (199K) [application/x-httpd-php]
Saving to: ‘smsspamcollection.zip’

smsspamcollection.z 100%[===================>] 198.65K  60.6KB/s    in 3.3s

2020-11-01 21:10:38 (60.6 KB/s) - ‘smsspamcollection.zip’ saved [203415/203415]

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection
  inflating: readme

Below, we load the dataset as a list of text messages and their classes (ham/spam).

In [34]:
with open('SMSSpamCollection') as f:
    data = [line.strip().split('\t') for line in f.readlines()]

y, text = zip(*data)
In [35]:
import collections

collections.Counter(y)
Out[35]:
Counter({'ham': 4827, 'spam': 747})
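
The class counts above show a strong imbalance, which matters when reading the accuracy figures later on. As a rough sanity check, the sketch below computes the accuracy a trivial classifier would get by always predicting the majority class. The counts are copied from the output above and the variable names are ours.

```python
import collections

# Class counts copied from the Counter output above.
class_counts = collections.Counter({"ham": 4827, "spam": 747})

total = sum(class_counts.values())
majority_label, majority_count = class_counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = majority_count / total
print("Majority class    : %s" % majority_label)
print("Baseline accuracy : %.3f" % baseline_accuracy)
```

An accuracy close to this baseline would mean the model has learned little beyond the class prior.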

Below, we split the dataset into train (75%) and test (25%) sets, stratified by class.

In [36]:
from sklearn.model_selection import train_test_split

text_train, text_test, y_train, y_test = train_test_split(text, y,
                                                          random_state=42,
                                                          test_size=0.25,
                                                          stratify=y)
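
The expected split sizes can be worked out by hand as a quick check. The sketch below assumes that a fractional `test_size` is rounded up and that the remaining samples go to the training set. The total is taken from the class counts printed earlier, and the results can be compared with the matrix shapes printed after vectorization.

```python
import math

n_samples = 4827 + 747  # total messages, from the class counts printed earlier
test_size = 0.25

# Assumption: the fractional test size is rounded up and the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print("Train samples : %d" % n_train)
print("Test samples  : %d" % n_test)
```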

Below, we use the TF-IDF vectorizer to convert the text into a matrix of floating-point values. We then train a RandomForestClassifier on this matrix and print the test and train accuracy. Finally, we compute predictions, biases, and contributions for the test samples using treeinterpreter.

If you don't have a background in feature extraction from text data and are interested in learning about it, please feel free to check our tutorial on the topic.

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_features=500)
tfidf_vectorizer.fit(text_train)

X_train_tfidf = tfidf_vectorizer.transform(text_train)
X_test_tfidf = tfidf_vectorizer.transform(text_test)


print(X_train_tfidf.shape, X_test_tfidf.shape)

rf = RandomForestClassifier()

rf.fit(X_train_tfidf, y_train)

print("Test  Accuracy : %.2f"%rf.score(X_test_tfidf, y_test))
print("Train Accuracy : %.2f"%rf.score(X_train_tfidf, y_train))

preds, bias, contributions = ti.predict(rf, X_test_tfidf)
preds.shape, bias.shape, contributions.shape
(4180, 500) (1394, 500)
Test  Accuracy : 0.97
Train Accuracy : 1.00
Out[37]:
((1394, 2), (1394, 2), (1394, 500, 2))

Below, we take a random test sample and build a contributions dataframe showing which words contribute to its prediction. We keep only the Base row and the 20 rows with the largest contribution toward the sample's actual class, which here are 19 words plus the Prediction row.

In [42]:
# randint is inclusive on both ends, so use 0 .. len-1 to get a valid index.
random_sample = random.randint(0, len(text_test) - 1)
print("Selected Sample     : %d"%random_sample)
print("Actual Target Value : %s"%y_test[random_sample])
print("Predicted Value     : %s"%["ham", "spam"][np.argmax(preds[random_sample])])
print("Test Sample : ", text_test[random_sample])

contribs = contributions[random_sample].tolist()
contribs.insert(0, bias[random_sample])
contribs = np.array(contribs)
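# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead of get_feature_names_out().
feature_names = list(tfidf_vectorizer.get_feature_names_out())
contrib_df = pd.DataFrame(data=contribs, index=["Base"] + feature_names, columns=["ham", "spam"])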
prediction = contrib_df[["ham", "spam"]].sum()
contrib_df.loc["Prediction"] = prediction

first = pd.DataFrame(contrib_df.loc["Base"]).T
contrib_df = contrib_df[1:].sort_values(by=y_test[random_sample])[-20:]
contrib_df = pd.concat((first,contrib_df))
contrib_df
Selected Sample     : 838
Actual Target Value : spam
Predicted Value     : spam
Test Sample :  Thanks for the Vote. Now sing along with the stars with Karaoke on your mobile. For a FREE link just reply with SING now.
Out[42]:
ham spam
Base 0.866555 0.133445
day -0.006667 0.006667
good -0.007619 0.007619
liao -0.008205 0.008205
know -0.008327 0.008327
decimal -0.008680 0.008680
ya -0.009524 0.009524
pick -0.010112 0.010112
later -0.010119 0.010119
messages -0.010256 0.010256
gt -0.010351 0.010351
video -0.010397 0.010397
hi -0.011785 0.011785
ur -0.012854 0.012854
like -0.016663 0.016663
sat -0.020569 0.020569
just -0.066220 0.066220
reply -0.091385 0.091385
mobile -0.136736 0.136736
free -0.397951 0.397951
Prediction 0.200000 0.800000
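
Because the dataframe keeps only the highest-ranked rows, the values shown no longer add up to the prediction on their own. The sketch below sums the spam column as printed above and compares it with the Prediction row. The difference is the net contribution of the words that were left out of the table. Variable names are ours, and the values are the rounded figures from the output.

```python
# Spam-column values copied from the table above (rounded as printed).
base_spam = 0.133445
shown_contributions = [
    0.006667, 0.007619, 0.008205, 0.008327, 0.008680, 0.009524, 0.010112,
    0.010119, 0.010256, 0.010351, 0.010397, 0.011785, 0.012854, 0.016663,
    0.020569, 0.066220, 0.091385, 0.136736, 0.397951,
]
prediction_spam = 0.800000

shown_total = base_spam + sum(shown_contributions)

# Whatever is missing from the shown rows comes from the words left out of the table.
omitted_total = prediction_spam - shown_total
print("Base + shown contributions        : %.6f" % shown_total)
print("Net contribution of omitted words : %.6f" % omitted_total)
```

A negative remainder means the omitted words, taken together, pushed the prediction back toward ham.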

Below, we generate a waterfall chart of the main contributors kept in the dataframe above.

In [ ]:
pred = y_test[random_sample]
create_waterfall_chart(contrib_df[[pred]].rename(columns={pred:"Contributions"}), pred)




Sunny Solanki