- Introduction
- 1. Load Datasets
- 2. Estimator Performance
- 3. Machine Learning Metrics
- 4. Clustering Plot
- 5. Dimensionality Reduction Plots
- References

Scikit learn is a very commonly used library for various machine learning algorithms. Once we are done with training machine learning algorithms, we need the right way to visualize results as well as the performance of the algorithm. We need to visualize various metrics to understand results better and hence performance of the algorithm. Matplotlib is a very commonly used data visualization library for plotting results of ML algorithms. But plotting with matplotlib requires a quite a learning curve to plot ML algorithms results. Also, one minor mistake when implementing visualizations can result in interpreting results totally wrong way. This can even slow down the process of testing various algorithms if data scientist is involved in getting visualizations right. In short, it can even slow data scientist and whole train-test process.

But what if you have a library that is ready-made for plotting results of machine learning algorithms?

It'll fasten your whole process a lot and you won't have to get into getting visualizations right. You also don't need to worry about its styling as well. The data scientist can then peacefully concentrate on his/her machine learning algorithms performance and try many different experiments.

Python has a library called **Scikit-Plot** which provides visualizations for many machine learning metrics related to regression, classification, and clustering. Scikit-Plot is built on top of `matplotlib`

. So if you have some background on matplotlib then you can build on top of scikit-plot plots as well.

Scikit-plot has 4 main modules which are used for different visualizations as described below.

`estimators`

- It has methods for plotting the performance of various machine learning algorithms.`metrics`

- It has methods for plotting various machine learning metrics like confusion matrix, ROC AUC curves, precision-recall curves, etc.`cluster`

- It currently has one method for plotting elbow method plot for clustering to find out the best number of clusters for data.`decomposition`

- It has methods for plotting results of PCA decomposition.

As a part of this tutorial, we'll be explaining the usage of scikit-plot API for interpreting ML algorithms results.

We'll start by importing the necessary libraries for our tutorial.

In [1]:

```
import scikitplot as skplt
import sklearn
from sklearn.datasets import load_digits, load_boston, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import sys
import warnings
warnings.filterwarnings("ignore")
print("Scikit Plot Version : ", skplt.__version__)
print("Scikit Learn Version : ", sklearn.__version__)
print("Python Version : ", sys.version)
%matplotlib inline
```

We'll be loading three different datasets which we'll be using to train various machine learning models. We'll then visualize the results of these models.

The first dataset that we'll load is digits dataset which is 8x8 images of numbers. It's readily available in scikit-learn for our usage.

We'll then divide the dataset into the train (80%) and test sets(20%).

In [2]:

```
digits = load_digits()
X_digits, Y_digits = digits.data, digits.target
print("Digits Dataset Size : ", X_digits.shape, Y_digits.shape)
```

In [3]:

```
X_digits_train, X_digits_test, Y_digits_train, Y_digits_test = train_test_split(X_digits, Y_digits,
train_size=0.8,
stratify=Y_digits,
random_state=1)
print("Digits Train/Test Sizes : ",X_digits_train.shape, X_digits_test.shape, Y_digits_train.shape, Y_digits_test.shape)
```

The second dataset that we'll use if a cancer dataset which has information about the malignant and benign tumor. It's also readily available in scikit-learn for our use.

We'll divide it as well in train (80%) and test sets (20%). We have also printed features available with the dataset.

In [4]:

```
cancer = load_breast_cancer()
X_cancer, Y_cancer = cancer.data, cancer.target
print("Feautre Names : ", cancer.feature_names)
print("Cancer Dataset Size : ", X_cancer.shape, Y_cancer.shape)
```

In [5]:

```
X_cancer_train, X_cancer_test, Y_cancer_train, Y_cancer_test = train_test_split(X_cancer, Y_cancer,
train_size=0.8,
stratify=Y_cancer,
random_state=1)
print("Cancer Train/Test Sizes : ",X_cancer_train.shape, X_cancer_test.shape, Y_cancer_train.shape, Y_cancer_test.shape)
```

The third dataset that we'll use is the Boston housing price dataset. It has information about various houses of Boston and the price at which they were sold. We'll divide it as well in train and test sets with the same proportion as above mentioned datasets.

In [6]:

```
boston = load_boston()
X_boston, Y_boston = boston.data, boston.target
print("Boston Dataset Size : ", X_boston.shape, Y_boston.shape)
print("Boston Dataset Features : ", boston.feature_names)
```

In [7]:

```
X_boston_train, X_boston_test, Y_boston_train, Y_boston_test = train_test_split(X_boston, Y_boston,
train_size=0.8,
random_state=1)
print("Boston Train/Test Sizes : ",X_boston_train.shape, X_boston_test.shape, Y_boston_train.shape, Y_boston_test.shape)
```

The module that we'll be exploring is the `estimators`

module. We'll be plotting various plots after training ML models.

We can plot the cross-validation performance of models by passing it whole dataset. Scikit-plot provides a method named `plot_learning_curve()`

as a part of the `estimators`

module which accepts estimator, X, Y, cross-validation info, and scoring metric for plotting performance of cross-validation on the dataset.

Below we are plotting the performance of logistic regression on digits dataset with cross-validation.

In [ ]:

```
skplt.estimators.plot_learning_curve(LogisticRegression(), X_digits, Y_digits,
cv=7, shuffle=True, scoring="accuracy",
n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
title="Digits Classification Learning Curve");
```

Below we are plotting the performance of linear regression on the Boston dataset with cross-validation.

In [ ]:

```
skplt.estimators.plot_learning_curve(LinearRegression(), X_boston, Y_boston,
cv=7, shuffle=True, scoring="r2", n_jobs=-1,
figsize=(6,4), title_fontsize="large", text_fontsize="large",
title="Boston Regression Learning Curve ");
```

We can use many other scoring metrics for plotting purposes. Below is a list of scoring metrics available with scikit-learn.

In [10]:

```
sklearn.metrics.SCORERS.keys()
```

Out[10]:

The second chart that we'll be plotting is a bar chart depicting the importance of features for various ML models. We'll first train random forest on the Boston dataset and gradient boosting on the cancer dataset. We'll then plot feature importance available from both models as a bar chart using `plot_feature_importances()`

method of `estimators`

module of scikit-plot.

In [11]:

```
rf_reg = RandomForestRegressor()
rf_reg.fit(X_boston_train, Y_boston_train)
rf_reg.score(X_boston_test, Y_boston_test)
```

Out[11]:

In [12]:

```
gb_classif = GradientBoostingClassifier()
gb_classif.fit(X_cancer_train, Y_cancer_train)
gb_classif.score(X_cancer_test, Y_cancer_test)
```

Out[12]:

Below we have a combined chart of both random forest and gradient boosting into one figure because scikit-plot is based on matplotlib and lets us include more details on graphs using matplotlib.

In [ ]:

```
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)
skplt.estimators.plot_feature_importances(rf_reg, feature_names=boston.feature_names,
title="Random Forest Regressor Feature Importance",
x_tick_rotation=90, order="ascending",
ax=ax1);
ax2 = fig.add_subplot(122)
skplt.estimators.plot_feature_importances(gb_classif, feature_names=cancer.feature_names,
title="Gradient Boosting Classifier Feature Importance",
x_tick_rotation=90,
ax=ax2);
plt.tight_layout()
```

We'll now explore various plotting methods available as a part of the `metrics`

module of scikit-plot.

To start with it, we'll first train logistic regression on the digits dataset. We'll then use this trained model for various plotting methods.

In [14]:

```
log_reg = LogisticRegression()
log_reg.fit(X_digits_train, Y_digits_train)
log_reg.score(X_digits_test, Y_digits_test)
```

Out[14]:

The first metric that we'll plot is a confusion matrix. The confusion matrix let us analyze how our classification algorithm is doing for various classes of data.

Below we are plotting confusion matrix using `plot_confusion_matrix()`

method of `metrics`

module. We are plotting two confusion matrix where the second one has normalized its values before plotting.

We need to pass original values and predicted values in order to plot a confusion matrix.

In [ ]:

```
Y_test_pred = log_reg.predict(X_digits_test)
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)
skplt.metrics.plot_confusion_matrix(Y_digits_test, Y_test_pred,
title="Confusion Matrix",
cmap="Oranges",
ax=ax1)
ax2 = fig.add_subplot(122)
skplt.metrics.plot_confusion_matrix(Y_digits_test, Y_test_pred,
normalize=True,
title="Confusion Matrix",
cmap="Purples",
ax=ax2);
```

The second metric that we'll plot is the ROC AUC curve. Scikit-plot provides methods named `plot_roc()`

and `plot_roc_curve()`

as a part of `metrics`

module for plotting roc AUC curves. We need to pass original values and predicted probability to methods in order to plot the ROC AUC plot for each class of classification dataset.

It also plots the dashed line which depicts the random guess model covering 50% area of ROC AUC curve.

We can notice from the below plot that the area covered by the ROC AUC curve line of each class is more than 95% which is good. We want a line of each class to cover more than 90% area so that we can be sure that our model is doing well predicting each class even in an imbalanced dataset situation.

In [ ]:

```
Y_test_probs = log_reg.predict_proba(X_digits_test)
skplt.metrics.plot_roc_curve(Y_digits_test, Y_test_probs,
title="Digits ROC Curve", figsize=(12,6));
```

The third metric is the precision-recall curve which has almost the same usage as that of the ROC AUC curve. Both the ROC AUC curve and precision-recall curves are useful when you have an imbalanced dataset.

Scikit-plot provides methods named `plot_precision_recall()`

and `plot_precision_recall_curve()`

for plotting precision-recall curve. We need to pass original target values as well as predicted probabilities of that target values by our model to plot the precision-recall curve.

We can notice from the below plot that the area covered by the precision-recall curve line of each class is more than 95% which is good. We want a line of each class to cover more than 90% area so that we can be sure that our model is doing well predicting each class even in an imbalanced dataset situation.

In [ ]:

```
skplt.metrics.plot_precision_recall_curve(Y_digits_test, Y_test_probs,
title="Digits Precision-Recall Curve", figsize=(12,6));
```

The fourth metric that we'll like to plot is the silhouette analysis plot for clustering machine learning problems. The silhouette analysis lets us know how our clustering algorithm did in clustering various samples. It gives us results in the range of -1 to 1 and if the majority of our values are high towards 1 than it means that our clustering algorithm did well in clustering similar samples together. The silhouette score of the sample is between -1 to 1 where score 1 means that sample if far away from its neighboring clusters and score of -1 means that sample is near to its neighboring cluster than cluster it’s assigned. The value of 0 means that it’s on the boundary between two clusters. We need value for samples on the higher side(>0) to consider our model as a good model.

Scikit-plot provides a method named `plot_silhouette`

as a part of the `metrics`

module to plot the silhouette analysis plot. We need to pass is original data and labels predicted by our clustering algorithm in order to plot silhouette analysis.

Below we are first training our KMeans model on digits train data and then we are passing predicted test labels along with original test data to `plot_silhouette`

method for plotting silhouette analysis.

In [18]:

```
kmeans = KMeans(n_clusters=10, random_state=1)
kmeans.fit(X_digits_train, Y_digits_train)
cluster_labels = kmeans.predict(X_digits_test)
```

In [ ]:

```
skplt.metrics.plot_silhouette(X_digits_test, cluster_labels,
figsize=(8,6));
```

The fifth metric that we would like to visualize is the calibration curve also known as reliability curves. Scikit-plot provides a method named `plot_calibration_curve()`

as a part of the `estimators`

module for this purpose.

The calibration curve is suitable for comparing the performance of various models as well as understanding which threshold value for deciding class label is leading to model overfit or underfit. The points in various model lines which are above way above-dashed line have overfitted and one below the dashed line has under fitted. We need a model where points are mostly near the dashed line.

We are first training logistic regression random forest, gradient boosting, and extra trees classifier on cancer train data and then predicting the probability of test data generated by each model. We then pass actual test labels and list of predicted test probabilities by each model to `plot_calibration_curve()`

to plot calibration curve. We also pass a list of classifier names for having legends in the graph.

In [20]:

```
lr_probas = LogisticRegression().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test)
rf_probas = RandomForestClassifier().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test)
gb_probas = GradientBoostingClassifier().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test)
et_scores = ExtraTreesClassifier().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test)
probas_list = [lr_probas, rf_probas, gb_probas, et_scores]
clf_names = ['Logistic Regression', 'Random Forest', 'Gradient Boosting', 'Extra Trees Classifier']
```

In [ ]:

```
skplt.metrics.plot_calibration_curve(Y_cancer_test,
probas_list,
clf_names, n_bins=15,
figsize=(12,6)
);
```

The sixth metric that we'll be plotting is the KS statistics plot. Scikit-plot module `metrics`

has method named `plot_ks_statistic()`

for this purpose.

KS Statistics is for binary classification problems only. The KS statistic (Kolmogorov-Smirnov statistic) is the maximum difference between the cumulative true positive and cumulative false-positive rate. It captures the model's power of discriminating positive labels from a negative labels. KS Statistics Plot is for binary classification problems only.

We have first trained random forest classifier on cancer train data. We then passed original cancer test labels and predicted test probabilities by random forest trained model to `plot_ks_statistic()`

in order to plot the KS Statistics chart.

In [22]:

```
rf = RandomForestClassifier()
rf.fit(X_cancer_train, Y_cancer_train)
```

Out[22]:

In [ ]:

```
Y_cancer_probas = rf.predict_proba(X_cancer_test)
skplt.metrics.plot_ks_statistic(Y_cancer_test, Y_cancer_probas, figsize=(10,6));
```

Cumulative Gains Curve is the seventh metric that we'll be plotting using scikit-plot. It provides a method named `plot_cumulative_gain()`

as a part of the `metrics`

module for plotting this metric.

Cumulative gains chart tells us the percentage of samples in a given category which were truly predicted by targeting a percentage of the total number of samples. It means when we took that many percentages of samples from the total percentage that we get from the curve for y-axis are labels which were truly guessed by model from a total number of samples of that class in that many samples. The dashed line in the chart is the baseline curve (random guess model) and our model should perform better than it and both class curves should be above it ideally. Cumulative Gains Curve is for binary classification problems only.

We need to pass its original labels of data and predicted probabilities by the trained model in order to plot the cumulative gains curve.

In [ ]:

```
skplt.metrics.plot_cumulative_gain(Y_cancer_test, Y_cancer_probas, figsize=(10,6));
```

The eight and last metric that's available with scikit-plot is the Lift curve. Scikit-plot has a method named `plot_lift_curve()`

as a part of the `metrics`

module for plotting this curve.

The lift chart is derived from the cumulative chart by taking a ratio of cumulative gains for each curve to the baseline and showing this ratio on the y-axis. The x-axis has the same meaning as the above chart. The lift curve is for binary classification problems only.

We need to pass its original labels of data and predicted probabilities by the trained model in order to plot the lift curve.

In [ ]:

```
skplt.metrics.plot_lift_curve(Y_cancer_test, Y_cancer_probas, figsize=(10,6));
```

The only clustering plot that is available with scikit-plot is the elbow method plot. Scikit-plot provides method named `plot_elbow_curve()`

as a part of `cluster`

module for plotting elbow method curve.

The elbow method is useful in deciding the right number of clusters to be used to divide data into. If you are not aware of a number of clusters beforehand then the elbow method can help you decide a number of clusters to use.

Elbow method plots the number of clusters versus squared error for each sample with that many clusters. The plot generally looks like a human hand and elbow is a place which you select is a number of clusters. It means that more clusters than that are not improving squared errors hence its best number of clusters to choose to divide samples.

We need to pass the clustering algorithm and original data along with an array of cluster sizes to plotting method in order to plot the elbow method curve.

In [ ]:

```
skplt.cluster.plot_elbow_curve(KMeans(random_state=1),
X_digits,
cluster_ranges=range(2, 20),
figsize=(8,6));
```

There are two plots available with dimensionality reduction module of scikit-plot.

The first dimensionality reduction plot that we'll explore is the PCA component explained variance. Scikit-plot has a method named `plot_pca_component_variance`

as a part of the `decomposition`

module for this.

The PCA components explained variance chart let us know how much of the original data variance is contained within first n components. Below we can see from `red dot`

that `76.2%`

of digits data variance is present in 11 components. We can see that nearly 45 components have nearly a `100%`

variance of data. We can decide from this graph how much the percentage of original data's variance is enough for better model performance and take components accordingly which will result in reducing the dimension of original data.

We need to pass trained PCA on the dataset to method `plot_pca_component_variance()`

in order to plot this chart.

In [ ]:

```
pca = PCA(random_state=1)
pca.fit(X_digits)
skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6));
```

The second plot for dimensionality reduction that we'll plot is a 2D projection of data transformed through PCA. Scikit-plot provides method named `plot_pca_2d_projection()`

as a part of `decomposition`

module for this purpose.

We need to pass it trained PCA model along with dataset and its labels for plotting purposes.

In [ ]:

```
skplt.decomposition.plot_pca_2d_projection(pca, X_digits, Y_digits,
figsize=(10,10),
cmap="tab10");
```

This ends our small tutorial explaining various plotting functionalities available with **scikit-plot** for machine learning purposes. Please feel free to let us know your views in the comments section.

Sunny Solanki

Scikit-Learn - Anomaly Detection [Outliers Detection]

Scikit-Learn - Clustering: Density-Based Clustering of Applications with Noise [DBSCAN]

Scikit-Learn - Non-Linear Dimensionality Reduction: Manifold Learning

Scikit-Learn - Ensemble Learning : Boosting