Scikit learn is a very commonly used library for trying machine learning algorithms on our datasets. Once we have trained ML Model, we need the right way to understand performance of the model by visualizing various ML Metrics. We need to understand whether our model has generalized or not.
Matplotlib is a very commonly used data visualization library for plotting results of ML algorithms.
But plotting with matplotlib requires quite a learning curve. One minor mistake when implementing visualizations can result in interpreting results totally wrong way. This can even slow down the process of testing various algorithms if data scientist is involved in getting visualizations right.
In short, it can even slow down whole train-test process and experimentation cycle.
But what if you have a Python library that is ready-made for plotting ML Metrics?
It'll fasten your whole experimentation cycle a lot as you won't be involved in getting visualizations right. The data scientist can then peacefully concentrate on his/her machine learning algorithm's performance and try many different experiments.
Python has a library called Scikit-Plot which provides visualizations for many machine learning metrics related to regression, classification, and clustering. Scikit-Plot is built on top of matplotlib. So if you have some background on matplotlib then you can customize charts created using scikit-plot further.
As a part of this tutorial, We have explained how to use Python library scikit-plot to visualize ML metrics evaluating performance of ML Model. We have explained how to create charts to visualize ML Metrics for Classification, Dimensionality Reduction, and Clustering tasks. Apart from metrics, we have also explained training progress chart and features importances charts available from scikit-plot. We have trained scikit-learn models on toy datasets for our purposes. Tutorial tries to cover the whole API of scikit-plot.
Below, we have listed important sections of tutorial to give an overview of material covered.
We'll start by importing the necessary Python libraries for our tutorial. We have also printed the versions that we used in our tutorial.
import scikitplot as skplt import sklearn from sklearn.datasets import load_digits, load_boston, load_breast_cancer from sklearn.model_selection import train_test_split from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier from sklearn.linear_model import LinearRegression, LogisticRegression from sklearn.cluster import KMeans from sklearn.decomposition import PCA import matplotlib.pyplot as plt import sys import warnings warnings.filterwarnings("ignore") print("Scikit Plot Version : ", skplt.__version__) print("Scikit Learn Version : ", sklearn.__version__) print("Python Version : ", sys.version) %matplotlib inline
Scikit Plot Version : 0.3.7 Scikit Learn Version : 1.0.2 Python Version : 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
We'll be loading three different datasets which we'll be using to train various machine learning models. We'll then visualize the results of these models.
The first dataset that we'll load is digits dataset which is 8x8 images of numbers. It's readily available in scikit-learn for our usage.
We'll then divide the dataset into the train (80%) and test sets(20%).
digits = load_digits() X_digits, Y_digits = digits.data, digits.target print("Digits Dataset Size : ", X_digits.shape, Y_digits.shape)
Digits Dataset Size : (1797, 64) (1797,)
X_digits_train, X_digits_test, Y_digits_train, Y_digits_test = train_test_split(X_digits, Y_digits, train_size=0.8, stratify=Y_digits, random_state=1) print("Digits Train/Test Sizes : ",X_digits_train.shape, X_digits_test.shape, Y_digits_train.shape, Y_digits_test.shape)
Digits Train/Test Sizes : (1437, 64) (360, 64) (1437,) (360,)
The second dataset that we'll use if a cancer dataset which has information about the malignant and benign tumor. It's also readily available in scikit-learn for our use.
We'll divide it as well in train (80%) and test sets (20%). We have also printed features available with the dataset.
cancer = load_breast_cancer() X_cancer, Y_cancer = cancer.data, cancer.target print("Feautre Names : ", cancer.feature_names) print("Cancer Dataset Size : ", X_cancer.shape, Y_cancer.shape)
Feautre Names : ['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness' 'mean compactness' 'mean concavity' 'mean concave points' 'mean symmetry' 'mean fractal dimension' 'radius error' 'texture error' 'perimeter error' 'area error' 'smoothness error' 'compactness error' 'concavity error' 'concave points error' 'symmetry error' 'fractal dimension error' 'worst radius' 'worst texture' 'worst perimeter' 'worst area' 'worst smoothness' 'worst compactness' 'worst concavity' 'worst concave points' 'worst symmetry' 'worst fractal dimension'] Cancer Dataset Size : (569, 30) (569,)
X_cancer_train, X_cancer_test, Y_cancer_train, Y_cancer_test = train_test_split(X_cancer, Y_cancer, train_size=0.8, stratify=Y_cancer, random_state=1) print("Cancer Train/Test Sizes : ",X_cancer_train.shape, X_cancer_test.shape, Y_cancer_train.shape, Y_cancer_test.shape)
Cancer Train/Test Sizes : (455, 30) (114, 30) (455,) (114,)
The third dataset that we'll use is the Boston housing price dataset. It has information about various houses of Boston and the price at which they were sold. We'll divide it as well in train and test sets with the same proportion as above mentioned datasets.
boston = load_boston() X_boston, Y_boston = boston.data, boston.target print("Boston Dataset Size : ", X_boston.shape, Y_boston.shape) print("Boston Dataset Features : ", boston.feature_names)
Boston Dataset Size : (506, 13) (506,) Boston Dataset Features : ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
X_boston_train, X_boston_test, Y_boston_train, Y_boston_test = train_test_split(X_boston, Y_boston, train_size=0.8, random_state=1) print("Boston Train/Test Sizes : ",X_boston_train.shape, X_boston_test.shape, Y_boston_train.shape, Y_boston_test.shape)
Boston Train/Test Sizes : (404, 13) (102, 13) (404,) (102,)
Scikit-plot has 4 main modules which are used for different visualizations as described below.
The module that we'll be exploring is the estimators module. We'll be plotting various plots after training ML models.
We can plot the cross-validation performance of models by passing it whole dataset. Scikit-plot provides a method named plot_learning_curve() as a part of the estimators module which accepts estimator, X, Y, cross-validation info, and scoring metric for plotting performance of cross-validation on the dataset.
Below we are plotting the performance of logistic regression on digits dataset with cross-validation.
skplt.estimators.plot_learning_curve(LogisticRegression(), X_digits, Y_digits, cv=7, shuffle=True, scoring="accuracy", n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large", title="Digits Classification Learning Curve");
Below we are plotting the performance of linear regression on the Boston dataset with cross-validation.
skplt.estimators.plot_learning_curve(LinearRegression(), X_boston, Y_boston, cv=7, shuffle=True, scoring="r2", n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large", title="Boston Regression Learning Curve ");
We can use many other scoring metrics for plotting purposes. Below is a list of scoring metrics available with scikit-learn.
dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'accuracy', 'roc_auc', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'brier_score_loss', 'adjusted_rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_weighted'])
The second chart that we'll be plotting is a bar chart depicting the importance of features for various ML models. We'll first train random forest on the Boston dataset and gradient boosting on the cancer dataset. We'll then plot feature importance available from both models as a bar chart using plot_feature_importances() method of estimators module of scikit-plot.
rf_reg = RandomForestRegressor() rf_reg.fit(X_boston_train, Y_boston_train) rf_reg.score(X_boston_test, Y_boston_test)
gb_classif = GradientBoostingClassifier() gb_classif.fit(X_cancer_train, Y_cancer_train) gb_classif.score(X_cancer_test, Y_cancer_test)
Below we have a combined chart of both random forest and gradient boosting into one figure because scikit-plot is based on matplotlib and lets us include more details on graphs using matplotlib.
fig = plt.figure(figsize=(15,6)) ax1 = fig.add_subplot(121) skplt.estimators.plot_feature_importances(rf_reg, feature_names=boston.feature_names, title="Random Forest Regressor Feature Importance", x_tick_rotation=90, order="ascending", ax=ax1); ax2 = fig.add_subplot(122) skplt.estimators.plot_feature_importances(gb_classif, feature_names=cancer.feature_names, title="Gradient Boosting Classifier Feature Importance", x_tick_rotation=90, ax=ax2); plt.tight_layout()
The third chart that we'll plot is the calibration curve also known as reliability curve. Scikit-plot provides a method named plot_calibration_curve() as a part of the metrics module for this purpose.
The calibration curve is suitable for comparing the performance of various models as well as understanding which threshold value for deciding class label is leading to model overfit or underfit. The points in various model lines which are above way above-dashed line have overfitted and one below the dashed line has under fitted. We need a model where points are mostly near the dashed line.
We are first training logistic regression random forest, gradient boosting, and extra trees classifier on cancer train data and then predicting the probability of test data generated by each model. We then pass actual test labels and a list of predicted test probabilities by each model to plot_calibration_curve() to plot calibration curve. We also pass a list of classifier names for having legends in the graph.
lr_probas = LogisticRegression().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test) rf_probas = RandomForestClassifier().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test) gb_probas = GradientBoostingClassifier().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test) et_scores = ExtraTreesClassifier().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test) probas_list = [lr_probas, rf_probas, gb_probas, et_scores] clf_names = ['Logistic Regression', 'Random Forest', 'Gradient Boosting', 'Extra Trees Classifier']
skplt.metrics.plot_calibration_curve(Y_cancer_test, probas_list, clf_names, n_bins=15, figsize=(12,6) );
We'll now explore various plotting methods available as a part of the metrics module of scikit-plot.
To start with it, we'll first train logistic regression on the digits dataset. We'll then use this trained model for various plotting methods.
log_reg = LogisticRegression() log_reg.fit(X_digits_train, Y_digits_train) log_reg.score(X_digits_test, Y_digits_test)
The first metric that we'll plot is a confusion matrix. The confusion matrix let us analyze how our classification algorithm is doing for various classes of data.
Below we are plotting confusion matrix using plot_confusion_matrix() method of metrics module. We are plotting two confusion matrix where the second one has normalized its values before plotting.
We need to pass original values and predicted values in order to plot a confusion matrix.
Y_test_pred = log_reg.predict(X_digits_test) fig = plt.figure(figsize=(15,6)) ax1 = fig.add_subplot(121) skplt.metrics.plot_confusion_matrix(Y_digits_test, Y_test_pred, title="Confusion Matrix", cmap="Oranges", ax=ax1) ax2 = fig.add_subplot(122) skplt.metrics.plot_confusion_matrix(Y_digits_test, Y_test_pred, normalize=True, title="Confusion Matrix", cmap="Purples", ax=ax2);
The second metric that we'll plot is the ROC AUC curve. Scikit-plot provides methods named plot_roc() and plot_roc_curve() as a part of metrics module for plotting roc AUC curves. We need to pass original values and predicted probability to methods in order to plot the ROC AUC plot for each class of classification dataset.
It also plots the dashed line which depicts the random guess model covering 50% area of ROC AUC curve.
We can notice from the below plot that the area covered by the ROC AUC curve line of each class is more than 95% which is good. We want a line of each class to cover more than 90% area so that we can be sure that our model is doing well predicting each class even in an imbalanced dataset situation.
Y_test_probs = log_reg.predict_proba(X_digits_test) skplt.metrics.plot_roc_curve(Y_digits_test, Y_test_probs, title="Digits ROC Curve", figsize=(12,6));
The third metric is the precision-recall curve which has almost the same usage as that of the ROC AUC curve. Both the ROC AUC curve and precision-recall curves are useful when you have an imbalanced dataset.
Scikit-plot provides methods named plot_precision_recall() and plot_precision_recall_curve() for plotting precision-recall curve. We need to pass original target values as well as predicted probabilities of that target values by our model to plot the precision-recall curve.
We can notice from the below plot that the area covered by the precision-recall curve line of each class is more than 95% which is good. We want a line of each class to cover more than 90% area so that we can be sure that our model is doing well predicting each class even in an imbalanced dataset situation.
skplt.metrics.plot_precision_recall_curve(Y_digits_test, Y_test_probs, title="Digits Precision-Recall Curve", figsize=(12,6));
The fourth metric that we'll be plotting is the KS statistics plot. Scikit-plot module metrics has method named plot_ks_statistic() for this purpose.
KS Statistics is for binary classification problems only. The KS statistic (Kolmogorov-Smirnov statistic) is the maximum difference between the cumulative true positive and cumulative false-positive rate. It captures the model's power of discriminating positive labels from negative labels. KS Statistics Plot is for binary classification problems only.
We have first trained random forest classifier on cancer train data. We then passed original cancer test labels and predicted test probabilities by random forest trained model to plot_ks_statistic() in order to plot the KS Statistics chart.
rf = RandomForestClassifier() rf.fit(X_cancer_train, Y_cancer_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
Y_cancer_probas = rf.predict_proba(X_cancer_test) skplt.metrics.plot_ks_statistic(Y_cancer_test, Y_cancer_probas, figsize=(10,6));
Cumulative Gains Curve is the fifth metric that we'll be plotting using scikit-plot. It provides a method named plot_cumulative_gain() as a part of the metrics module for plotting this metric.
Cumulative gains chart tells us the percentage of samples in a given category that were truly predicted by targeting a percentage of the total number of samples. It means when we took that many percentages of samples from the total percentage that we get from the curve for y-axis are labels which were truly guessed by model from a total number of samples of that class in that many samples. The dashed line in the chart is the baseline curve (random guess model) and our model should perform better than it and both class curves should be above it ideally. Cumulative Gains Curve is for binary classification problems only.
We need to pass its original labels of data and predicted probabilities by the trained model in order to plot the cumulative gains curve.
skplt.metrics.plot_cumulative_gain(Y_cancer_test, Y_cancer_probas, figsize=(10,6));
The sixth and last metric that's available with scikit-plot is the Lift curve. Scikit-plot has a method named plot_lift_curve() as a part of the metrics module for plotting this curve.
The lift chart is derived from the cumulative chart by taking a ratio of cumulative gains for each curve to the baseline and showing this ratio on the y-axis. The x-axis has the same meaning as the above chart. The lift curve is for binary classification problems only.
We need to pass its original labels of data and predicted probabilities by the trained model in order to plot the lift curve.
skplt.metrics.plot_lift_curve(Y_cancer_test, Y_cancer_probas, figsize=(10,6));
The only clustering plot that is available with scikit-plot is the elbow method plot. Scikit-plot provides method named plot_elbow_curve() as a part of cluster module for plotting elbow method curve.
The elbow method is useful in deciding the right number of clusters to be used to divide data into. If you are not aware of number of clusters beforehand then the elbow method can help you decide number of clusters to use.
Elbow method plots the number of clusters versus squared error for each sample with that many clusters. The plot generally looks like a human hand and elbow is a place where you select a number of clusters. It means that more clusters than that are not improving squared errors hence it's best number of clusters to choose to divide samples.
We need to pass the clustering algorithm and original data along with an array of cluster sizes to plotting method in order to plot the elbow method curve.
skplt.cluster.plot_elbow_curve(KMeans(random_state=1), X_digits, cluster_ranges=range(2, 20), figsize=(8,6));
The second metric that we'll like to plot is the silhouette analysis plot for clustering machine learning problems.
The silhouette analysis lets us know how our clustering algorithm did in clustering various samples. It gives us results in the range of -1 to 1 and if the majority of our values are high towards 1 then it means that our clustering algorithm did well in clustering similar samples together.
The silhouette score of the sample is between -1 to 1 where score 1 means that sample is far away from its neighboring clusters and score of -1 means that sample is near to its neighboring cluster than cluster it's assigned. The value of 0 means that it's on the boundary between two clusters. We need value for samples on the higher side(>0) to consider our model a good model.
Scikit-plot provides a method named plot_silhouette() as a part of the metrics module to plot the silhouette analysis plot though it is a clustering metric. We need to pass is original data and labels predicted by our clustering algorithm in order to plot silhouette analysis.
Below we are first training our KMeans model on digits train data and then we are passing predicted test labels along with original test data to plot_silhouette() method for plotting silhouette analysis.
kmeans = KMeans(n_clusters=10, random_state=1) kmeans.fit(X_digits_train, Y_digits_train) cluster_labels = kmeans.predict(X_digits_test)
skplt.metrics.plot_silhouette(X_digits_test, cluster_labels, figsize=(8,6));
There are two plots available with dimensionality reduction module of scikit-plot.
The first dimensionality reduction plot that we'll explore is the PCA component explained variance. Scikit-plot has a method named plot_pca_component_variance() as a part of the decomposition module for this.
The PCA components explained variance chart let us know how much of the original data variance is contained within first n components.
Below we can see from red dot that 76.2% of digits data variance is present in 11 components. We can see that nearly 45 components have nearly a 100% variance of data. We can decide from this graph how much the percentage of original data's variance is enough for better model performance and take components accordingly which will result in reducing the dimension of original data.
We need to pass trained PCA on the dataset to method plot_pca_component_variance() in order to plot this chart.
pca = PCA(random_state=1) pca.fit(X_digits) skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6));
The second plot for dimensionality reduction that we'll plot is a 2D projection of data transformed through PCA. Scikit-plot provides method named plot_pca_2d_projection() as a part of decomposition module for this purpose.
We need to pass it trained PCA model along with dataset and its labels for plotting purposes.
skplt.decomposition.plot_pca_2d_projection(pca, X_digits, Y_digits, figsize=(10,10), cmap="tab10");
This ends our small tutorial explaining various plotting functionalities available with scikit-plot to visualize ML Metrics evaluating performance of ML Models.
Evaluating ML Metrics is generally a good starting point for checking performance of ML models. It can give us an idea of whether model has generalized or not. But there can be situations where ML metrics are not enough. The results of our metrics are quite good but our model is not generalizing. E.g., A simple cat vs dog image classifier can be using background pixels to classify images instead of actual object pixels.
In those situations, we need to look at predictions made by models on individual examples.
To interpret individual predictions, many algorithms have been developed. Different Python libraries (lime, shap, eli5, captum, treeinterpreter, etc.) provide an implementation of these algorithms. It can help us dig deeper to understand our model better at individual example levels.
We would suggest that you read about interpreting predictions of ML Models in your free time. Below we have listed tutorials providing guidance on topic.
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at firstname.lastname@example.org. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.
If you want to