Model Evaluation & Scoring Metrics using Scikit-Learn


In this tutorial, we'll discuss various model evaluation metrics provided in scikit-learn.

In scikit-learn, the default evaluation metric for classification is accuracy (the fraction of labels predicted correctly) and for regression it is r2, the coefficient of determination.

Scikit-learn's metrics module provides many other metrics that are better suited for particular situations, such as class imbalance. It also lets the user create custom evaluation metrics for a specific task.

We'll start by importing the necessary libraries for our tutorial and setting a few defaults.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
from sklearn import metrics, datasets, neighbors

import sys
import warnings
import itertools

warnings.filterwarnings("ignore")
np.set_printoptions(precision=2)
print("Python Verion : ", sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)

%matplotlib inline
Python Version :  3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
Scikit-Learn Version :  0.21.2

Classification Metrics

We'll be using scikit-learn's in-built methods to create the dataset and use various metrics to evaluate the performance of a model trained on that dataset. We'll create a classification dataset with 500 samples, 20 features, and 2 classes.

In [2]:
X,Y  = datasets.make_classification(n_samples=500, n_features=20, n_classes=2, random_state=1)
print('Dataset Size : ',X.shape,Y.shape)
Dataset Size :  (500, 20) (500,)

Splitting Dataset into Train/Test Sets

We'll be splitting the dataset into a train set (80% of samples) and a test set (20% of samples).

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, stratify=Y, random_state=1)
print('Train/Test Size : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
Train/Test Size :  (400, 20) (100, 20) (400,) (100,)

Model Initialization and Fitting to Train Data

We'll be using a simple LinearSVC model for training purposes. We'll then introduce various classification metrics that evaluate model performance on test data from various angles.

In [4]:
from sklearn.svm import LinearSVC

linear_svc = LinearSVC(random_state=1, C=0.1)
linear_svc.fit(X_train, Y_train)
Out[4]:
LinearSVC(C=0.1, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=1, tol=0.0001,
          verbose=0)

Classification Accuracy

It refers to the number of correct predictions divided by the total number of samples.

In [5]:
Y_preds = linear_svc.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean())
print('Test Accuracy : %.3f'%linear_svc.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%linear_svc.score(X_train, Y_train))
[0 1 0 1 1 0 0 0 1 0 1 0 0 0 1]
[0 1 0 1 1 0 0 0 1 0 0 0 0 0 1]
Test Accuracy : 0.930
Test Accuracy : 0.930
Training Accuracy : 0.953

Confusion Matrix

For binary and multi-class classification problems, the confusion matrix is another metric that helps in identifying which classes are easy to predict and which are hard to predict. It shows how many samples of each class are correctly classified and how many are confused with other classes.

In [6]:
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(Y_test, Y_preds)
print(conf_mat)
[[47  3]
 [ 4 46]]

A confusion matrix for a binary classification problem has the structure below.

[[TN, FP],
 [FN, TP]]

  • TN refers to True Negatives: the count of samples that actually belong to the negative class and that the model also predicted as negative.
  • FP refers to False Positives: the count of samples that actually belong to the negative class but that the model predicted as positive.
  • FN refers to False Negatives: the count of samples that actually belong to the positive class but that the model predicted as negative.
  • TP refers to True Positives: the count of samples predicted as positive that are actually positive.
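For a binary problem, these four counts can be unpacked directly from the confusion matrix using numpy's ravel(). Below is a quick sketch (not part of the original notebook) using the conf_mat computed above.

tn, fp, fn, tp = conf_mat.ravel()   ## row-major order of [[TN, FP], [FN, TP]]
print('TN : %d, FP : %d, FN : %d, TP : %d'%(tn, fp, fn, tp))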

Below we plot the confusion matrix, as a visualization helps in interpreting the results quickly.

In [7]:
with plt.style.context(('ggplot', 'seaborn')):
    fig = plt.figure(figsize=(6,6), num=1)
    plt.imshow(conf_mat, interpolation='nearest',cmap= plt.cm.Blues )
    plt.xticks([0,1],[0,1])
    plt.yticks([0,1],[0,1])
    plt.xlabel('Predicted Label')
    plt.ylabel('Actual Label')
    for i, j in itertools.product(range(conf_mat.shape[0]), range(conf_mat.shape[1])):
                plt.text(j, i,conf_mat[i, j], horizontalalignment="center",color="red")
    plt.grid(None)
    plt.title('Confusion Matrix')
    plt.colorbar();

Classification Report

The classification report provides precision, recall, f1-score, and support for each class.

  • Precision - Of all samples predicted as a particular class, how many actually belong to that class. $Precision = TP / (TP+FP)$.
  • Recall - Of all samples that actually belong to a particular class, how many were predicted as that class. $Recall = TP / (TP+FN)$.
  • f1-score - The harmonic mean of precision and recall. $F1 = 2 * (Precision * Recall) / (Precision + Recall)$
  • support - The number of occurrences of a particular class in y_true.
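As a sanity check, the precision, recall, and f1-score of the positive class can be computed by hand from the confusion matrix; the sketch below (not part of the original notebook) should agree with the scikit-learn functions used in the next cell.

tn, fp, fn, tp = conf_mat.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print('Precision : %.3f, Recall : %.3f, F1-Score : %.3f'%(precision, recall, f1))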

In [8]:
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score, precision_recall_fscore_support

print('Precision                                   : %.3f'%precision_score(Y_test, Y_preds))
print('Recall                                      : %.3f'%recall_score(Y_test, Y_preds))
print('F1-Score                                    : %.3f'%f1_score(Y_test, Y_preds))
print('\nPrecision Recall F1-Score Support Per Class : \n',precision_recall_fscore_support(Y_test, Y_preds))
print('\nClassification Report                       : ')
print(classification_report(Y_test, Y_preds))
Precision                                   : 0.939
Recall                                      : 0.920
F1-Score                                    : 0.929

Precision Recall F1-Score Support Per Class :
 (array([0.92, 0.94]), array([0.94, 0.92]), array([0.93, 0.93]), array([50, 50]))

Classification Report                       :
              precision    recall  f1-score   support

           0       0.92      0.94      0.93        50
           1       0.94      0.92      0.93        50

    accuracy                           0.93       100
   macro avg       0.93      0.93      0.93       100
weighted avg       0.93      0.93      0.93       100

The classification report is useful when we want to analyze the performance of a model on individual classes and to check that the model is not biased towards one class. It is especially helpful with imbalanced classes, where per-class results can reveal problems that overall accuracy hides, and it points us to the classes that need further improvement.

Let's go through an imbalanced-class scenario below to understand this better and to introduce the concept of ROC curves. We'll create a new dataset of 1000 samples and 10 classes and then make it imbalanced for our purpose.

In [9]:
X, Y = datasets.make_classification(n_samples=1000, n_classes=10, n_informative=10)
print('Dataset Size : ',X.shape, Y.shape)
Dataset Size :  (1000, 20) (1000,)

Below we create the imbalance by marking all samples with value 0 as 1 (the positive class) and samples of all remaining classes as 0. Since the original dataset has 10 roughly equal classes, about 10% of samples end up in the positive class and the remaining 90% in the negative class.

In [10]:
Y = (Y == 0).astype(int) ## We are creating imbalanced classes here.
In [11]:
np.bincount(Y)/  len(Y) ## We can see here that one class is 90% of samples whereas another class represents only 10%
Out[11]:
array([0.9, 0.1])

Fitting Default SVC Model To Imbalanced Data

We'll be using the default SVC model with scikit-learn's cross_val_score method and 5-fold cross-validation. It divides the dataset into 5 folds, takes one fold as test data and the remaining folds as train data, trains the default SVC model on the train data, and evaluates its performance on the test data. It repeats this for all 5 combinations, taking each fold once as the test set and the rest as the train set.

In [12]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

cross_val_score(SVC(), X, Y, cv=5)
Out[12]:
array([0.91, 0.91, 0.91, 0.92, 0.9 ])

We can see that SVC with default parameters gives about 90% accuracy on average over 5-fold cross-validation.

Fitting DummyClassifier To Imbalanced Data

We'll first try the DummyClassifier provided by scikit-learn, which (with the most_frequent strategy) always predicts the most frequently occurring label.

In [13]:
from sklearn.dummy import DummyClassifier

cross_val_score(DummyClassifier(strategy='most_frequent'), X, Y, cv=5)
Out[13]:
array([0.9, 0.9, 0.9, 0.9, 0.9])

After trying the DummyClassifier, which always predicts the most frequently occurring class, we can see that even this classifier gives 90% accuracy. It can be puzzling that both models give roughly 90% accuracy when one of them is simply guessing the most frequent class. In this kind of scenario, the classification report and ROC curves help identify how well our model performs on each individual class.

We'll first split our dataset into train and test sets. We'll then check the performance of default SVC and DummyClassifier on predicting individual classes using classification reports. We'll then introduce the ROC Curves concept to get better insights into model performance.

In [14]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, stratify=Y)
print('Train/Test Sizes : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
Train/Test Sizes :  (800, 20) (200, 20) (800,) (200,)

Below we initialize the default SVC model, train it, and check its performance on test data.

In [15]:
svc = SVC()
svc.fit(X_train, Y_train)

Y_preds = svc.predict(X_test)

print(classification_report(Y_test, Y_preds))
              precision    recall  f1-score   support

           0       0.92      1.00      0.96       180
           1       1.00      0.20      0.33        20

    accuracy                           0.92       200
   macro avg       0.96      0.60      0.65       200
weighted avg       0.93      0.92      0.90       200

We can see above that recall is quite bad for class 1.

Below we initialize the default DummyClassifier model, train it, and check its performance on test data.

In [16]:
dummy_classifier = DummyClassifier(strategy='most_frequent')
dummy_classifier.fit(X_train, Y_train)

Y_preds = dummy_classifier.predict(X_test)

print(classification_report(Y_test, Y_preds))
              precision    recall  f1-score   support

           0       0.90      1.00      0.95       180
           1       0.00      0.00      0.00        20

    accuracy                           0.90       200
   macro avg       0.45      0.50      0.47       200
weighted avg       0.81      0.90      0.85       200

We can see above that the DummyClassifier performs quite badly on class 1, as both precision and recall are zero for it.

ROC Curves

The ROC (Receiver Operating Characteristic) curve helps us better understand the performance of a model on an imbalanced dataset. It works with the output of the model's prediction function (decision function or predicted probabilities) and sweeps a range of threshold values, computing the false positive rate and true positive rate at each threshold. In the case of SVC, for example, the default threshold on the decision function output is 0, whereas the ROC curve tries various other thresholds such as [2, 1, -1, -2], including negative values. In the case of LogisticRegression, the output is a probability in [0, 1], the default threshold for separating the positive and negative classes is 0.5, and the ROC curve again sweeps over many other threshold values.

Note: It's restricted to binary classification tasks.
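Before plotting, it can help to see what "different thresholds" means in practice. The small sketch below (not part of the original notebook, assuming the svc model fitted above and labels 0/1) shows that SVC's predict() corresponds to thresholding decision_function() at 0, and that a stricter threshold changes the predictions.

scores = svc.decision_function(X_test)
print('Matches predict() at threshold 0 : ', np.all((scores > 0).astype(int) == svc.predict(X_test)))
print('Predictions at threshold 1.0     : ', (scores > 1.0).astype(int)[:10])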

The plot below is the ROC curve for the SVC model on the test set of the imbalanced dataset.

In [17]:
from sklearn.metrics import roc_curve, roc_auc_score

decision_function = svc.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(Y_test, decision_function)
acc = svc.score(X_test, Y_test)
auc = roc_auc_score(Y_test, svc.decision_function(X_test))

with plt.style.context(('ggplot','seaborn')):
    plt.figure(figsize=(8,6))
    plt.scatter(fpr, tpr, c='blue')
    plt.plot(fpr, tpr, label="Accuracy:%.2f AUC:%.2f" % (acc, auc), linewidth=2, c='red')
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (recall)")
    plt.title('ROC Curve')
    plt.legend(loc='best');

With a very high decision threshold, few samples are predicted positive, so there are few false positives but many false negatives (both the true positive rate and the false positive rate are low). With a very low threshold, almost everything is predicted positive, so the true positive rate is high but so is the false positive rate. So in general, the curve goes from the lower left (high threshold) to the upper right (low threshold). A diagonal line reflects chance performance, while the goal is to be as close to the top left corner as possible. We want the ROC curve to cover almost 100% of the area for good performance; 50% area coverage corresponds to a chance model (random prediction).
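The arrays returned by roc_curve can also be used to pick an operating point. The sketch below (not from the original notebook, using the fpr, tpr, and thresholds computed above) selects the threshold with the lowest false positive rate among those reaching at least 90% recall; the 90% target is just an illustrative choice.

candidate_idxs = np.where(tpr >= 0.90)[0]                     ## operating points with recall >= 0.9
best_idx = candidate_idxs[np.argmin(fpr[candidate_idxs])]     ## lowest FPR among them
print('Chosen Threshold : ', thresholds[best_idx])
print('TPR / FPR There  : ', tpr[best_idx], fpr[best_idx])
Y_preds_at_threshold = (svc.decision_function(X_test) >= thresholds[best_idx]).astype(int)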

Grid Search To Improve Model Performance On Unbalanced Dataset

For doing a grid search, we usually want to condense our model evaluation into a single number. A good way to do this with the ROC curve is to use the area under the curve (AUC). We can use it in GridSearchCV simply by specifying scoring="roc_auc".

In [18]:
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(SVC(),param_grid = {'gamma': ['auto', 'scale'], 'C': [1.0, 0.1, 0.01, 10.0]}, scoring="roc_auc", cv=5)
grid.fit(X, Y)
print('Best Parameters                               : ',grid.best_params_)
print('Best Score                                    : ',grid.best_score_)

decision_function = grid.best_estimator_.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(Y_test, decision_function)

print('True Positive Rates                           : ', tpr)
print('False Positive Rates                          : ', fpr)
print('Different Thresholds For Calculating TPR, FPR : ', thresholds)
print('Classification Report                         : ')
print(classification_report(Y_test, grid.best_estimator_.predict(X_test)))

acc = grid.best_estimator_.score(X_test, Y_test)
auc = roc_auc_score(Y_test, decision_function)

with plt.style.context(('ggplot', 'seaborn')):
    plt.figure(figsize=(8,6))
    plt.scatter(fpr, tpr, c='blue')
    plt.plot(fpr, tpr, label="Accuracy:%.2f AUC:%.2f" % (acc, auc), linewidth=2, c='red')
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (recall)")
    plt.title('ROC Curve')
    plt.legend(loc='best');
Best Parameters                               :  {'C': 10.0, 'gamma': 'scale'}
Best Score                                    :  0.9117532943158642
True Positive Rates                           :  [0.   0.05 1.   1.  ]
False Positive Rates                          :  [0. 0. 0. 1.]
Different Thresholds For Calculating TPR, FPR :  [ 2.73  1.73  0.01 -3.73]
Classification Report                         :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       180
           1       1.00      1.00      1.00        20

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200

In [19]:
grid = GridSearchCV(SVC(probability=True),param_grid = {'gamma': ['auto', 'scale'], 'C': [1.0, 0.1, 0.01, 10.0]}, scoring="roc_auc", cv=5)
grid.fit(X, Y)
print('Best Parameters                               : ',grid.best_params_)
print('Best Score                                    : ',grid.best_score_)

probs = grid.best_estimator_.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(Y_test, probs[:, 1])

print('True Positive Rates                           : ', tpr)
print('False Positive Rates                          : ', fpr)
print('Different Thresholds For Calculating TPR, FPR : ', thresholds)
print('Classification Report                         : ')
print(classification_report(Y_test, grid.best_estimator_.predict(X_test)))

acc = grid.best_estimator_.score(X_test, Y_test)
auc = roc_auc_score(Y_test, probs[:,1])

with plt.style.context(('ggplot', 'seaborn')):
    plt.figure(figsize=(8,6))
    plt.scatter(fpr, tpr, c='blue')
    plt.plot(fpr, tpr, label="Accuracy:%.2f, AUC:%.2f" % (acc, auc), linewidth=2, c='red')
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (recall)")
    plt.title('ROC Curve')
    plt.legend(loc='best');
Best Parameters                               :  {'C': 10.0, 'gamma': 'scale'}
Best Score                                    :  0.9117532943158642
True Positive Rates                           :  [0.   0.05 1.   1.  ]
False Positive Rates                          :  [0. 0. 0. 1.]
Different Thresholds For Calculating TPR, FPR :  [1.96e+00 9.61e-01 4.19e-01 3.41e-04]
Classification Report                         :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       180
           1       1.00      1.00      1.00        20

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200

We can see from the classification report and ROC curve above that our model performs quite well on the imbalanced dataset after parameter tuning.

Precision-Recall Curve

Precision and recall help a lot in the case of imbalanced datasets. Plotting precision against recall at different threshold values helps evaluate model performance on imbalanced classes better. The curve does not take true negatives (the majority class) into account, while true positives represent the minority class, which has only a few occurrences.

Note: It's restricted to binary classification tasks.

The plot below is the precision-recall curve for the SVC model on the test set of the imbalanced dataset.

In [20]:
from sklearn.metrics import precision_recall_curve, auc,average_precision_score

decision_function = svc.decision_function(X_test)
precision, recall, thresholds = precision_recall_curve(Y_test, decision_function)
acc = svc.score(X_test, Y_test)
p_auc = auc(recall, precision)

with plt.style.context(('ggplot', 'seaborn')):
    plt.figure(figsize=(8,6))
    plt.scatter(recall, precision, c='blue')
    plt.plot(recall, precision, label="Accuray:%.2f, AUC:%.2f" % (acc, p_auc), linewidth=2, c='red')
    plt.hlines(0.5,0.0,1.0, linestyle='dashed', colors=['orange'])
    plt.xlabel("Recall (Sensitivity)")
    plt.ylabel("Precision")
    plt.title('Precision Recall Curve')
    plt.legend(loc='best');

The precision-recall curve collapses when our model is not performing well on the imbalanced dataset. Notice that the AUC of the precision-recall curve here is about 50%, whereas the AUC of the ROC curve was around 90%. ROC curves can sometimes give optimistic results, hence it's better to also consider precision-recall curves for imbalanced datasets.

In [21]:
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(SVC(probability=True),param_grid = {'gamma': ['auto', 'scale'], 'C': [1.0, 0.1, 0.01, 10.0]}, cv=5)
grid.fit(X, Y)
print('Best Parameters                                        : ',grid.best_params_)
print('Best Score                                             : ',grid.best_score_)

decision_function = grid.best_estimator_.decision_function(X_test)
precision, recall, thresholds = precision_recall_curve(Y_test, decision_function)

print('Precision                                              : ', precision)
print('Recall                                                 : ', recall)
print('Different Thresholds For Calculating Precision, Recall : ', thresholds)
print('Classification Report                                  : ')
print(classification_report(Y_test, grid.best_estimator_.predict(X_test)))

acc = grid.best_estimator_.score(X_test, Y_test)
p_auc = auc(recall, precision)
ap = average_precision_score(Y_test, grid.predict_proba(X_test)[:,1])

with plt.style.context(('ggplot', 'seaborn')):
    plt.figure(figsize=(8,6))
    plt.scatter(recall, precision, c='blue')
    plt.plot(recall, precision, label="Accuracy:%.2f, AUC:%.2f, Average Precision %.2f" % (acc, p_auc, ap), linewidth=2, c='red')
    plt.xlabel("Recall (Sensitivity)")
    plt.ylabel("Precision")
    plt.title('Precision Recall Curve')
    plt.legend(loc='best');
Best Parameters                                        :  {'C': 10.0, 'gamma': 'scale'}
Best Score                                             :  0.928
Precision                                              :  [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Recall                                                 :  [1.   0.95 0.9  0.85 0.8  0.75 0.7  0.65 0.6  0.55 0.5  0.45 0.4  0.35
 0.3  0.25 0.2  0.15 0.1  0.05 0.  ]
Different Thresholds For Calculating Precision, Recall :  [0.01 0.18 0.56 0.71 0.74 1.   1.   1.   1.   1.   1.   1.   1.   1.
 1.   1.   1.   1.25 1.56 1.73]
Classification Report                                  :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       180
           1       1.00      1.00      1.00        20

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200

Log Loss (Logistic Loss or Cross-Entropy Loss)

Log loss refers to the negative log-likelihood of the true labels given the probabilities predicted by the classifier. It's the cost function that probabilistic classifiers try to minimize while updating the weights of the model. For a single sample with true label $y$ and predicted probability $y'$ of the positive class:

$$log\_loss = - y * log (y') - (1-y) * log(1 - y')$$
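scikit-learn's log_loss averages this quantity over all samples. A tiny worked example (a sketch with toy values, not from the original notebook) verifies the formula against the library function.

from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.6, 0.4])   ## predicted probability of the positive class
manual_log_loss = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print('Manual Log Loss  : %.4f'%manual_log_loss)
print('Sklearn Log Loss : %.4f'%log_loss(y_true, y_prob))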
In [22]:
from sklearn.metrics import log_loss

X, Y = datasets.make_classification(n_samples= 500)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, stratify=Y)
print('Train/Test Sizes : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

grid = GridSearchCV(SVC(probability=True),param_grid = {'C': [1.0, 0.1, 0.01, 10.0,]}, scoring="neg_log_loss", cv=5)
grid.fit(X, Y)

print('Best Parameters : ',grid.best_params_)
#print('Test Log Loss : %.3f'%grid.best_estimator_.score(X_test, Y_test))
#print('Train Log Loss : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Log Loss   : %.3f'%log_loss(Y_test, grid.best_estimator_.predict_proba(X_test)))
print('Train Log Loss  : %.3f'%log_loss(Y_train, grid.best_estimator_.predict_proba(X_train)))
Y_preds = grid.best_estimator_.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])
Train/Test Sizes :  (400, 20) (100, 20) (400,) (100,)
Best Parameters :  {'C': 0.1}
Test Log Loss   : 0.122
Train Log Loss  : 0.278
[0 0 0 0 1 0 1 0 0 1]
[0 0 0 0 1 0 1 0 0 1]

Zero One Classification Loss

It returns the number or the fraction of misclassified samples. It accepts a normalize parameter: if set to True (the default), it returns the fraction of misclassifications; if set to False, it returns the count of misclassifications.

In [23]:
from sklearn.metrics import zero_one_loss

print('Number of Misclassificied Examples   : ',zero_one_loss(Y_test, Y_preds, normalize=False))
print('Fraction of Misclassificied Examples : ',zero_one_loss(Y_test, Y_preds))
Number of Misclassificied Examples   :  3
Fraction of Misclassificied Examples :  0.030000000000000027

Balanced Accuracy Score

It returns the average of the recall of each class in a classification problem. It's useful when dealing with imbalanced datasets.

It has a parameter adjusted which, when set to True, adjusts the result for chance so that a randomly performing model gets a score of 0 and perfect performance gets a score of 1.0.
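Since balanced accuracy is just the mean of the per-class recalls, a quick sketch (not part of the original notebook) using recall_score(average=None) on the current predictions should reproduce the value reported below.

per_class_recall = recall_score(Y_test, Y_preds, average=None)   ## recall of each class separately
print('Per-Class Recall         : ', per_class_recall)
print('Mean of Per-Class Recall : ', per_class_recall.mean())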

In [24]:
from sklearn.metrics import balanced_accuracy_score

print('Balanced Accuracy          : ',balanced_accuracy_score(Y_test, Y_preds))
print('Balanced Accuracy Adjusted : ',balanced_accuracy_score(Y_test, Y_preds, adjusted=True))
Balanced Accuracy          :  0.97
Balanced Accuracy Adjusted :  0.94

Brier Loss

It computes the mean squared difference between the actual class labels and the probabilities predicted by the model. It should be as low as possible for good performance, and it's defined for binary classification problems only. By default it takes 1 as the positive class; if one needs to treat 0 as the positive class, the pos_label parameter can be used as below.

In [25]:
from sklearn.metrics import brier_score_loss

print('Brier Loss                       : ',brier_score_loss(Y_test, grid.predict_proba(X_test)[:, 1]))
print('Brier Loss (0 as Positive Class) : ', brier_score_loss(Y_test, grid.predict_proba(X_test)[:, 0], pos_label=0))
Brier Loss                       :  0.029492727292769395
Brier Loss (0 as Positive Class) :  0.02949272729276939

F-Beta Score

The F-beta score is a weighted harmonic mean of precision and recall whose balance is controlled by the beta parameter. If beta < 1 it lends more weight to precision, while beta > 1 lends more weight to recall. Its best value is 1.0 and its worst is 0.0.

It has a parameter called average which is required for multiclass problems. It accepts the values [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']. If None is specified, the score for each class is returned; otherwise the specified average is returned for a multiclass problem.
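The formula behind fbeta_score is $F_\beta = (1+\beta^2) \cdot Precision \cdot Recall / (\beta^2 \cdot Precision + Recall)$. A short sketch (not from the original notebook) computing it by hand for beta=0.5 from the current predictions:

p = precision_score(Y_test, Y_preds)
r = recall_score(Y_test, Y_preds)
beta = 0.5
print('Manual F-Beta (beta=0.5) : %.4f'%((1 + beta**2) * p * r / (beta**2 * p + r)))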

In [26]:
from sklearn.metrics import fbeta_score

print('Fbeta Favouring Precision : ', fbeta_score(Y_test, Y_preds, beta=0.5))
print('Fbeta Favouring Recall    : ' ,fbeta_score(Y_test, Y_preds, beta=2.0))
Fbeta Favouring Precision :  0.9756097560975608
Fbeta Favouring Recall    :  0.9638554216867469

Hamming Loss

It returns the fraction of labels that are misclassified.

In [27]:
from sklearn.metrics import hamming_loss

print('Hamming Loss : ', hamming_loss(Y_test, Y_preds))
Hamming Loss :  0.03

Regression Metrics

We'll now introduce model evaluation metrics for regression tasks. We'll start with loading the Boston dataset available in scikit-learn for our purpose.

In [28]:
#X, Y = datasets.make_regression(n_samples=200, n_features=20, )
boston = datasets.load_boston()
X, Y = boston.data, boston.target
print('Dataset Size : ', X.shape, Y.shape)
Dataset Size :  (506, 13) (506,)

We'll be splitting a dataset into train/test sets with 80% for a train set and 20% for the test set.

In [29]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, random_state=1, )
print('Train/Test Size : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
Train/Test Size :  (404, 13) (102, 13) (404,) (102,)

We'll now initialize a simple LinearSVR model and train it on the train dataset. We'll then check its performance by evaluating various regression metrics provided by scikit-learn.

In [30]:
from sklearn.svm import LinearSVR

svr = LinearSVR()
svr.fit(X_train, Y_train)
Out[30]:
LinearSVR(C=1.0, dual=True, epsilon=0.0, fit_intercept=True,
          intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
          random_state=None, tol=0.0001, verbose=0)

$R^2$ (Coefficient Of Determination)

The coefficient of determination $R^2$ is defined as $(1 - u/v)$, where

$u = ((y_{true} - y_{pred})^2).sum()$

$v = ((y_{true} - y_{true}.mean())^2).sum()$

The best possible score is 1.0, and it can be negative if the model is performing badly. A model that always predicts the mean of $y_{true}$, regardless of the input, gets a score of 0.0.

Note: The score() method of most regression estimators returns this metric, which is quite different from MSE (mean squared error); the two should not be confused.

In [31]:
from sklearn.metrics import r2_score

Y_preds = svr.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Test R^2     : %.3f'%r2_score(Y_test, Y_preds))
print('Test R^2     : %.3f'%svr.score(X_test, Y_test))
print('Training R^2 : %.3f'%svr.score(X_train, Y_train))
[27.63 25.4  13.66 16.64 18.39 19.68 28.14 17.55 16.29 21.42]
[28.2 23.9 16.6 22.  20.8 23.  27.9 14.5 21.5 22.6]
Test R^2     : 0.598
Test R^2     : 0.598
Training R^2 : 0.469
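As a sanity check (a sketch, not from the original notebook), computing u and v by hand as defined above reproduces r2_score on the same predictions.

u = ((Y_test - Y_preds)**2).sum()           ## residual sum of squares
v = ((Y_test - Y_test.mean())**2).sum()     ## total sum of squares
print('Manual Test R^2 : %.3f'%(1 - u/v))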

Below we do a grid search over various values of the C parameter of LinearSVR, using r2 as the evaluation metric to optimize.

In [32]:
grid = GridSearchCV(LinearSVR(),param_grid = {'C': [1.0, 0.1, 0.01, 10.0,]}, scoring="r2", cv=5)
grid.fit(X, Y)

print('Best Parameters : ',grid.best_params_)
print('Best Score      : ',grid.best_score_)
print('Test R^2        : %.3f'%r2_score(Y_test, grid.best_estimator_.predict(X_test)))
print('Test R^2        : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Training R^2    : %.3f'%grid.best_estimator_.score(X_train, Y_train))

Y_preds = grid.best_estimator_.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])
Best Parameters :  {'C': 1.0}
Best Score      :  0.15758816494998057
Test R^2        : 0.606
Test R^2        : 0.606
Training R^2    : 0.448
[30.99 25.9  16.58 18.15 14.26 18.95 30.69 18.3  19.04 24.13]
[28.2 23.9 16.6 22.  20.8 23.  27.9 14.5 21.5 22.6]

Mean Absolute Error

Mean absolute error is the sum of the absolute differences between actual and predicted target values, divided by the number of samples.

$$MAE = \frac 1 n {\sum_{i=1}^n |y_i - y_i'|}$$
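A quick sketch (not from the original notebook) verifying the formula by hand: the mean of the absolute residuals should match mean_absolute_error on the same predictions used below.

print('Manual Test MAE : %.3f'%np.abs(Y_test - Y_preds).mean())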
In [33]:
from sklearn.metrics import mean_absolute_error

print('Test MAE  : %.3f'%mean_absolute_error(Y_test, Y_preds))
print('Train MAE : %.3f'%mean_absolute_error(Y_train, svr.predict(X_train)))
Test MAE  : 4.568
Train MAE : 4.750

Below we do a grid search over various values of the C parameter of LinearSVR, using neg_mean_absolute_error as the evaluation metric to optimize.

In [34]:
grid = GridSearchCV(LinearSVR(),param_grid = {'C': [1.0, 0.1, 0.01, 10.0,]}, scoring="neg_mean_absolute_error", cv=5)
grid.fit(X, Y)

print('Best Parameters : ',grid.best_params_)
print('Test MAE        : %.3f'%mean_absolute_error(Y_test, grid.best_estimator_.predict(X_test)))
print('Train MAE       : %.3f'%mean_absolute_error(Y_train, grid.best_estimator_.predict(X_train)))
Y_preds = grid.best_estimator_.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])
Best Parameters :  {'C': 0.1}
Test MAE        : 3.791
Train MAE       : 3.459
[32.57 29.16 18.28 21.13 20.8  23.01 32.51 20.9  21.21 26.31]
[28.2 23.9 16.6 22.  20.8 23.  27.9 14.5 21.5 22.6]

Mean Squared Error

Mean squared error is the sum of the squared differences between actual and predicted values, divided by the number of samples.

$$MSE = \frac 1 n {\sum_{i=1}^n (y_i - y_i')^2}$$
In [35]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

print('Test MSE  : %.3f'%mean_squared_error(Y_test, Y_preds))
print('Train MSE : %.3f'%mean_squared_error(Y_train, svr.predict(X_train)))
Test MSE  : 24.166
Train MSE : 42.867

Below we do a grid search over various values of the C parameter of LinearSVR, using neg_mean_squared_error as the evaluation metric to optimize.

In [36]:
grid = GridSearchCV(LinearSVR(),param_grid = {'C': [1.0, 0.1, 0.01, 10.0,]}, scoring="neg_mean_squared_error", cv=5)
grid.fit(X, Y)

print('Best Parameters : ',grid.best_params_)
print('Test MSE        : %.3f'%mean_squared_error(Y_test, grid.best_estimator_.predict(X_test)))
print('Train MSE       : %.3f'%mean_squared_error(Y_train, grid.best_estimator_.predict(X_train)))
Y_preds = grid.best_estimator_.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])
Best Parameters :  {'C': 1.0}
Test MSE        : 26.686
Train MSE       : 31.321
[32.02 27.37 17.97 19.09 17.46 21.41 31.17 20.22 20.7  25.39]
[28.2 23.9 16.6 22.  20.8 23.  27.9 14.5 21.5 22.6]

Mean Squared Log Error

$$ MSLE(Y, Y') = \dfrac 1 n \sum_{i=1}^n (\log(y_i + 1) - \log(y_i' + 1))^2 $$

It cannot be used when the targets or predictions contain negative values.

In [37]:
from sklearn.metrics import mean_squared_log_error

print(mean_squared_log_error(Y_test, Y_preds))
0.11473521867831994

Median Absolute Error

$$ MED(Y, Y') = median(|y_1 - y_1'|,|y_2 - y_2'|,|y_3 - y_3'|,....,|y_n - y_n'|) $$
In [38]:
from sklearn.metrics import median_absolute_error

print('Median Absolute Error : ', median_absolute_error(Y_test, Y_preds))
print('Median Absolute Error : ', np.median(np.abs(Y_test - Y_preds)))
Median Absolute Error :  2.8996093566651595
Median Absolute Error :  2.8996093566651595

Explained Variance Score

It returns the explained variance regression score. The best value is 1.0, and lower values indicate a worse model.
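Explained variance can equivalently be written as $1 - Var(y - y') / Var(y)$; below is a short sketch (not from the original notebook) verifying that against the library function on the current predictions.

print('Manual Explained Variance : ', 1 - np.var(Y_test - Y_preds) / np.var(Y_test))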

In [39]:
from sklearn.metrics import explained_variance_score

print('Explained Variance Score : ', explained_variance_score(Y_test, Y_preds))
Explained Variance Score :  0.7549895376057316

Max Error (Maximum Residual Error)

It returns the maximum absolute difference between actual and predicted values across all samples.

$$ ME(Y, Y') = max(|y_1 - y_1'|,|y_2 - y_2'|,....,|y_n - y_n'| ) $$
In [40]:
from sklearn.metrics import max_error

print('Maximum Residual Error : ', max_error(Y_test, Y_preds))
print('Maximum Residual Error : ', max_error([1,2,3,4], [1,2,3.5,7])) ## here 4th sample has highest difference
Maximum Residual Error :  20.32500374655169
Maximum Residual Error :  3.0

Clustering Metrics

We'll now introduce evaluation metrics for unsupervised learning - clustering tasks.

Adjusted Rand Score

Clustering algorithms assign a cluster label to each sample, but those label ids need not match the original class labels. It can happen that samples labeled 1 in the original dataset are assigned a label other than 1 by the clustering algorithm, even when the grouping itself is correct.

We'll use the IRIS dataset and KMeans for explanation purposes. We'll also plot the results to show the difference, and we'll see how the score improves once we use adjusted_rand_score as the evaluation function.
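A tiny toy example (a sketch, not from the original notebook) shows why plain accuracy is misleading here: a clustering that recovers the groups perfectly but with permuted label ids gets zero accuracy yet a perfect adjusted Rand score.

from sklearn.metrics import accuracy_score, adjusted_rand_score

true_labels    = [0, 0, 1, 1, 2, 2]
cluster_labels = [1, 1, 2, 2, 0, 0]   ## same grouping, different label ids
print('Accuracy            : ', accuracy_score(true_labels, cluster_labels))
print('Adjusted Rand Score : ', adjusted_rand_score(true_labels, cluster_labels))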

In [41]:
from sklearn.cluster import KMeans, MeanShift
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, adjusted_rand_score, confusion_matrix

iris = load_iris()
X, Y = iris.data, iris.target

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, stratify=Y, random_state=12)
print('Train/Test Sizes  : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

kmeans = KMeans(n_clusters=3)
kmeans.fit(X_train, Y_train)
Y_preds = kmeans.predict(X_test)

#print(Y_test, Y_preds)
print('Confusion Matrix : ')
print(confusion_matrix(Y_test, Y_preds))
print('Accuracy of Model : %.3f'%accuracy_score(Y_test, Y_preds))
print('Adjusted Accuracy : %.3f'%adjusted_rand_score(Y_test, Y_preds))

with plt.style.context(('ggplot', 'seaborn')):
    plt.figure(figsize=(10,4))
    plt.subplot(121)
    plt.scatter(X_test[: , 1], X_test[:, 2], c=Y_test, cmap = plt.cm.viridis)
    plt.xlabel(iris.feature_names[1])
    plt.ylabel(iris.feature_names[2])
    plt.title('Y Original')
    plt.subplot(122)
    plt.scatter(X_test[: , 1], X_test[:, 2], c=Y_preds, cmap = plt.cm.viridis)
    plt.xlabel(iris.feature_names[1])
    plt.ylabel(iris.feature_names[2])
    plt.title('Y Predicted');
Train/Test Sizes  :  (120, 4) (30, 4) (120,) (30,)
Confusion Matrix :
[[ 0 10  0]
 [ 0  0 10]
 [ 8  0  2]]
Accuracy of Model : 0.067
Adjusted Accuracy : 0.808

Custom Scoring Function

Users can also define their own scoring function if it is not available among sklearn's built-in scoring functions. In GridSearchCV and cross_val_score, one can pass either an object with a __call__ method or a plain function to the scoring parameter. In both cases the callable must accept the estimator object, the test features (X), and the target (Y) as input and return a float.

Below we define RMSE (Root Mean Squared Error) both as a class and as a function. We'll then use it in cross_val_score() to check performance and compare its value with the square root of the negated neg_mean_squared_error.

In [42]:
class RootMeanSquareError(object):
    def __call__(self, model, X, Y):
        Y_preds = model.predict(X)
        return np.sqrt(((Y - Y_preds)**2).mean())

def rootMeanSquareError(model, X, Y):
    Y_preds = model.predict(X)
    return np.sqrt(((Y - Y_preds)**2).mean())

lsvr = LinearSVR(random_state=1)
print('Cross Val Score Using Object                                     : ',cross_val_score(lsvr, X, Y, scoring=RootMeanSquareError()))
print('Cross Val Score Using Function                                   : ', cross_val_score(lsvr, X, Y, scoring=rootMeanSquareError))
print('Cross Val Score Using Negative Mean Squared Error                : ', -1*cross_val_score(lsvr, X, Y, scoring='neg_mean_squared_error'))
print('Cross Val Score Using Square Root of Negative Mean Squared Error : ', np.sqrt(-1*cross_val_score(lsvr, X, Y, scoring='neg_mean_squared_error')))
Cross Val Score Using Object                                     :  [0.73 0.39 0.59]
Cross Val Score Using Function                                   :  [0.73 0.39 0.59]
Cross Val Score Using Negative Mean Squared Error                :  [0.53 0.15 0.35]
Cross Val Score Using Square Root of Negative Mean Squared Error :  [0.73 0.39 0.59]
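As an alternative (a sketch, not part of the original notebook), sklearn's make_scorer can wrap a plain metric function of (y_true, y_pred) into a scorer object; greater_is_better=False flips the sign so that grid search still maximizes the score, which is why the printed values come out negative.

from sklearn.metrics import make_scorer

def rmse(y_true, y_pred):
    ## root mean squared error between true and predicted targets
    return np.sqrt(((y_true - y_pred)**2).mean())

rmse_scorer = make_scorer(rmse, greater_is_better=False)
print('Cross Val Score Using make_scorer : ', cross_val_score(LinearSVR(random_state=1), X, Y, scoring=rmse_scorer))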

Below is a list of scikit-learn's built-in scorers.

In [43]:
print('List of Inbuilt Scorers : ')
sklearn.metrics.SCORERS
List of Inbuilt Scorers :
Out[43]:
{'explained_variance': make_scorer(explained_variance_score),
 'r2': make_scorer(r2_score),
 'max_error': make_scorer(max_error, greater_is_better=False),
 'neg_median_absolute_error': make_scorer(median_absolute_error, greater_is_better=False),
 'neg_mean_absolute_error': make_scorer(mean_absolute_error, greater_is_better=False),
 'neg_mean_squared_error': make_scorer(mean_squared_error, greater_is_better=False),
 'neg_mean_squared_log_error': make_scorer(mean_squared_log_error, greater_is_better=False),
 'accuracy': make_scorer(accuracy_score),
 'roc_auc': make_scorer(roc_auc_score, needs_threshold=True),
 'balanced_accuracy': make_scorer(balanced_accuracy_score),
 'average_precision': make_scorer(average_precision_score, needs_threshold=True),
 'neg_log_loss': make_scorer(log_loss, greater_is_better=False, needs_proba=True),
 'brier_score_loss': make_scorer(brier_score_loss, greater_is_better=False, needs_proba=True),
 'adjusted_rand_score': make_scorer(adjusted_rand_score),
 'homogeneity_score': make_scorer(homogeneity_score),
 'completeness_score': make_scorer(completeness_score),
 'v_measure_score': make_scorer(v_measure_score),
 'mutual_info_score': make_scorer(mutual_info_score),
 'adjusted_mutual_info_score': make_scorer(adjusted_mutual_info_score),
 'normalized_mutual_info_score': make_scorer(normalized_mutual_info_score),
 'fowlkes_mallows_score': make_scorer(fowlkes_mallows_score),
 'precision': make_scorer(precision_score, average=binary),
 'precision_macro': make_scorer(precision_score, pos_label=None, average=macro),
 'precision_micro': make_scorer(precision_score, pos_label=None, average=micro),
 'precision_samples': make_scorer(precision_score, pos_label=None, average=samples),
 'precision_weighted': make_scorer(precision_score, pos_label=None, average=weighted),
 'recall': make_scorer(recall_score, average=binary),
 'recall_macro': make_scorer(recall_score, pos_label=None, average=macro),
 'recall_micro': make_scorer(recall_score, pos_label=None, average=micro),
 'recall_samples': make_scorer(recall_score, pos_label=None, average=samples),
 'recall_weighted': make_scorer(recall_score, pos_label=None, average=weighted),
 'f1': make_scorer(f1_score, average=binary),
 'f1_macro': make_scorer(f1_score, pos_label=None, average=macro),
 'f1_micro': make_scorer(f1_score, pos_label=None, average=micro),
 'f1_samples': make_scorer(f1_score, pos_label=None, average=samples),
 'f1_weighted': make_scorer(f1_score, pos_label=None, average=weighted),
 'jaccard': make_scorer(jaccard_score, average=binary),
 'jaccard_macro': make_scorer(jaccard_score, pos_label=None, average=macro),
 'jaccard_micro': make_scorer(jaccard_score, pos_label=None, average=micro),
 'jaccard_samples': make_scorer(jaccard_score, pos_label=None, average=samples),
 'jaccard_weighted': make_scorer(jaccard_score, pos_label=None, average=weighted)}

Sunny Solanki