In this tutorial, we'll discuss various model evaluation metrics provided in scikit-learn.

In scikit-learn, the default metric for classification is `accuracy`, the fraction of labels classified correctly, and for regression it is `r2`, the coefficient of determination.

Scikit-learn has a `metrics` module that provides other metrics for other purposes, such as handling class imbalance. It also lets the user create custom evaluation metrics for a specific task.

We'll start by importing the necessary libraries for our tutorial and setting a few defaults.

In [1]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn import metrics, datasets, neighbors
import sys
import warnings
import itertools
warnings.filterwarnings("ignore")
np.set_printoptions(precision=2)
print("Python Version : ", sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)
%matplotlib inline
```

We'll be using scikit-learn's in-built methods to create the dataset and use various metrics to evaluate the performance of a model trained on that dataset. We'll create a classification dataset with 500 samples, 20 features, and 2 classes.

In [2]:

```
X,Y = datasets.make_classification(n_samples=500, n_features=20, n_classes=2, random_state=1)
print('Dataset Size : ',X.shape,Y.shape)
```

We'll split the dataset into a train set (80% of samples) and a test set (20% of samples).

In [3]:

```
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, stratify=Y, random_state=1)
print('Train/Test Size : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
```

We'll train a simple LinearSVC model. We'll then introduce various classification metrics that evaluate model performance on the test data from different angles.

In [4]:

```
from sklearn.svm import LinearSVC
linear_svc = LinearSVC(random_state=1, C=0.1)
linear_svc.fit(X_train, Y_train)
```

Accuracy refers to the number of correct predictions divided by the total number of samples.

In [5]:

```
Y_preds = linear_svc.predict(X_test)
print(Y_preds[:15])
print(Y_test[:15])
print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean())
print('Test Accuracy : %.3f'%linear_svc.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%linear_svc.score(X_train, Y_train))
```

For binary and multi-class classification problems, the confusion matrix is another metric which helps in identifying which classes are easy to predict and which are hard to predict. It shows how many samples of each class are correctly classified and how many are confused with other classes.

In [6]:

```
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(Y_test, Y_preds)
print(conf_mat)
```

**Confusion Matrix** for binary classification problems has the below-mentioned structure.

[[TN, FP ]

[FN, TP ]]

**TN** refers to True Negatives: the count of samples that actually belong to the negative class and that the model also predicted as negative.

**FP** refers to False Positives: the count of samples that actually belong to the negative class but that the model predicted as positive.

**FN** refers to False Negatives: the count of samples that actually belong to the positive class but that the model predicted as negative.

**TP** refers to True Positives: the count of samples predicted as positive that are actually positive.
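As a small illustration of this layout, the sketch below unpacks the four counts from a confusion matrix. The labels are made up for the example and are not the tutorial's data.

```python
from sklearn.metrics import confusion_matrix

# Toy labels chosen for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary labels {0, 1}, ravel() flattens [[TN, FP], [FN, TP]] row by row.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Rows correspond to actual labels and columns to predicted labels, which is why the flattened order is TN, FP, FN, TP.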

Below we plot the confusion matrix, which makes the results quicker to interpret.

In [7]:

```
# Note: matplotlib 3.6+ renamed the 'seaborn' style to 'seaborn-v0_8'.
with plt.style.context(('ggplot', 'seaborn')):
    fig = plt.figure(figsize=(6,6), num=1)
    plt.imshow(conf_mat, interpolation='nearest',cmap= plt.cm.Blues )
    plt.xticks([0,1],[0,1])
    plt.yticks([0,1],[0,1])
    plt.xlabel('Predicted Label')
    plt.ylabel('Actual Label')
    # Write each count into its cell of the matrix.
    for i, j in itertools.product(range(conf_mat.shape[0]), range(conf_mat.shape[1])):
        plt.text(j, i,conf_mat[i, j], horizontalalignment="center",color="red")
    plt.grid(None)
    plt.title('Confusion Matrix')
    plt.colorbar();
```

The classification report provides precision, recall, f1-score and support for each class.

**Precision** - It represents how many of the predictions of a particular class actually belong to that class. $Precision = TP / (TP+FP)$.

**Recall** - It represents how many of the samples that actually belong to a particular class were predicted as that class. $Recall = TP / (TP+FN)$.

**f1-score** - It's the harmonic mean of precision and recall. $F1-Score = 2 * (Precision * recall) / (Precision + recall)$

**support** - It represents the number of occurrences of a particular class in `y_true`.

In [8]:

```
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score, precision_recall_fscore_support
print('Precision : %.3f'%precision_score(Y_test, Y_preds))
print('Recall : %.3f'%recall_score(Y_test, Y_preds))
print('F1-Score : %.3f'%f1_score(Y_test, Y_preds))
print('\nPrecision Recall F1-Score Support Per Class : \n',precision_recall_fscore_support(Y_test, Y_preds))
print('\nClassification Report : ')
print(classification_report(Y_test, Y_preds))
```
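The precision, recall and f1-score formulas can be checked by hand. The sketch below computes them from raw counts on made-up labels (an assumption for illustration) and compares them with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels chosen for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# Count the outcomes for the positive class (label 1).
tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```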

The classification report is useful when we want to analyze the performance of a model on individual classes and check that it is not biased towards one class. It helps in the case of unbalanced classes, as we can see the model's performance on each individual class and use that to decide where to improve it.

Let’s go through an imbalanced class scenario below to understand more and introduce the concept of ROC curves. We'll create a new dataset of 1000 samples and 10 classes, and then make it imbalanced for our purpose.

In [9]:

```
X, Y = datasets.make_classification(n_samples=1000, n_classes=10, n_informative=10)
print('Dataset Size : ',X.shape, Y.shape)
```

Below we create the imbalance by relabeling class 0 as the positive class (1) and all other classes as the negative class (0). Since the 10 original classes are roughly equal in size, about 10% of samples end up positive and the remaining 90% negative.

In [10]:

```
Y = (Y == 0).astype(int) ## We are creating imbalanced classes here.
```

In [11]:

```
np.bincount(Y)/ len(Y) ## We can see here that one class is 90% of samples whereas another class represents only 10%
```

We'll be using the default `SVC` model with scikit-learn's `cross_val_score` function and 5-fold cross-validation. It divides the dataset into 5 folds, takes one fold as test data and the remaining folds as train data, trains the default SVC model on the train data and evaluates it on the test data. It repeats this for all 5 combinations, using each fold once as the test set.

In [12]:

```
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
cross_val_score(SVC(), X, Y, cv=5)
```

We can see that SVC with default parameters is giving 90% accuracy on average for 5-folds cross-validation.

We'll now try the `DummyClassifier` provided by scikit-learn which, with the `most_frequent` strategy, always predicts the most frequent label.

In [13]:

```
from sklearn.dummy import DummyClassifier
cross_val_score(DummyClassifier(strategy='most_frequent'), X, Y, cv=5)
```

After trying DummyClassifier, which always predicts the most frequent class, we can see that even this classifier gives 90% accuracy. It can be puzzling that both models give 90% accuracy when one of them is merely guessing the most frequent class. In this kind of scenario, the classification report and ROC curves help a lot in assessing the performance of a model on individual classes.

We'll first split our dataset into train and test sets. We'll then check the performance of default SVC and DummyClassifier on predicting individual classes using classification reports. We'll then introduce the ROC Curves concept to get better insights into model performance.

In [14]:

```
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, stratify=Y)
print('Train/Test Sizes : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
```

Below we initialize the default SVC model, train it and check its performance on test data.

In [15]:

```
svc = SVC()
svc.fit(X_train, Y_train)
Y_preds = svc.predict(X_test)
print(classification_report(Y_test, Y_preds))
```

We can see above that `recall` is quite bad for class 1.

Below we initialize the DummyClassifier model, train it and check its performance on test data.

In [16]:

```
dummy_classifier = DummyClassifier(strategy='most_frequent')
dummy_classifier.fit(X_train, Y_train)
Y_preds = dummy_classifier.predict(X_test)
print(classification_report(Y_test, Y_preds))
```

We can see above that DummyClassifier performs quite badly on class 1, as both `precision` and `recall` are really bad.

ROC (Receiver Operating Characteristic) curves help to better understand the performance of a model on an unbalanced dataset. A ROC curve takes the continuous output of the prediction function and applies different threshold values to it, recording the false positive rate and true positive rate at each threshold. In the case of SVC, for example, the default threshold on the output of `decision_function` is 0, whereas the ROC curve tries many thresholds such as [2, 1, -1, -2], including negative values. In the case of LogisticRegression, the default threshold is 0.5, and the ROC curve again sweeps over other threshold values. For logistic regression, the output is a probability in [0, 1], so 0.5 separates the positive and negative classes, whereas for SVC the decision function returns an unbounded score and the threshold is applied to that score to make a prediction.

**Note:** It's restricted to binary classification tasks.
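To make the threshold sweep concrete, here is a minimal sketch that computes the false positive rate and true positive rate by hand at a few thresholds. The scores, labels and the helper `rates_at` are hypothetical and only stand in for the output of a decision function.

```python
import numpy as np

# Hypothetical decision scores and true labels, made up for illustration.
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([-2.0, -0.5, -0.2, 0.3, 0.8, 1.5])

def rates_at(threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    y_pred = scores >= threshold
    tpr = np.sum(y_pred & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum(y_pred & (y_true == 0)) / np.sum(y_true == 0)
    return fpr, tpr

# Lowering the threshold moves the point from (0, 0) towards (1, 1).
for t in [2, 1, 0, -1, -3]:
    print("threshold", t, "-> (FPR, TPR) =", rates_at(t))
```

Each threshold yields one point of the ROC curve; `roc_curve` performs the same sweep over the thresholds present in the scores.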

The plot below shows the ROC curve for the SVC on the test set of the unbalanced dataset.

In [17]:

```
from sklearn.metrics import roc_curve, roc_auc_score
decision_function = svc.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(Y_test, decision_function)
acc = svc.score(X_test, Y_test)
auc = roc_auc_score(Y_test, svc.decision_function(X_test))
with plt.style.context(('ggplot','seaborn')):
    plt.figure(figsize=(8,6))
    plt.scatter(fpr, tpr, c='blue')
    plt.plot(fpr, tpr, label="Accuracy:%.2f AUC:%.2f" % (acc, auc), linewidth=2, c='red')
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (recall)")
    plt.title('ROC Curve')
    plt.legend(loc='best');
```

With a very high decision threshold, few samples are predicted as positive, so there are few false positives but also few true positives, while with a very low threshold, both the true positive rate and the false positive rate are high. So in general, the curve runs from the lower left to the upper right. A diagonal line reflects chance performance, while the goal is to be as close to the top left corner as possible. An area under the ROC curve close to 100% indicates good performance, whereas 50% corresponds to a chance model (random prediction).
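The area under the ROC curve can be obtained in two equivalent ways. The sketch below, on made-up scores, integrates the curve returned by `roc_curve` with `sklearn.metrics.auc` (trapezoidal rule) and compares it with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, auc

# Hypothetical decision scores and true labels, made up for illustration.
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([-2.0, -0.5, -0.2, 0.3, 0.8, 1.5])

fpr, tpr, thresholds = roc_curve(y_true, scores)
area_trapezoid = auc(fpr, tpr)               # integrate the curve points
area_direct = roc_auc_score(y_true, scores)  # computed directly from scores

print(area_trapezoid, area_direct)
```

Here 8 of the 9 positive/negative pairs are ranked correctly, which is another way to read the resulting area of 8/9.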

For grid search, we usually want to condense our model evaluation into a single number. A good way to do this with the ROC curve is to use the area under the curve (AUC). We can use this in GridSearchCV by specifying `scoring="roc_auc"`.

In [18]:

```
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(),param_grid = {'gamma': ['auto', 'scale'], 'C': [1.0, 0.1, 0.01, 10.0]}, scoring="roc_auc", cv=5)
grid.fit(X, Y)
print('Best Parameters : ',grid.best_params_)
print('Best Score : ',grid.best_score_)
decision_function = grid.best_estimator_.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(Y_test, decision_function)
print('True Positive Rates : ', tpr)
print('False Positive Rates : ', fpr)
print('Different Thresholds For Calculating TPR, FPR : ', thresholds)
print('Classification Report : ')
print(classification_report(Y_test, grid.best_estimator_.predict(X_test)))
acc = grid.best_estimator_.score(X_test, Y_test)
auc = roc_auc_score(Y_test, decision_function)
with plt.style.context(('ggplot', 'seaborn')):
    plt.figure(figsize=(8,6))
    plt.scatter(fpr, tpr, c='blue')
    plt.plot(fpr, tpr, label="Accuracy:%.2f AUC:%.2f" % (acc, auc), linewidth=2, c='red')
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (recall)")
    plt.title('ROC Curve')
    plt.legend(loc='best');
```

In [19]:

```
grid = GridSearchCV(SVC(probability=True),param_grid = {'gamma': ['auto', 'scale'], 'C': [1.0, 0.1, 0.01, 10.0]}, scoring="roc_auc", cv=5)
grid.fit(X, Y)
print('Best Parameters : ',grid.best_params_)
print('Best Score : ',grid.best_score_)
probs = grid.best_estimator_.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(Y_test, probs[:, 1])
print('True Positive Rates : ', tpr)
print('False Positive Rates : ', fpr)
print('Different Thresholds For Calculating TPR, FPR : ', thresholds)
print('Classification Report : ')
print(classification_report(Y_test, grid.best_estimator_.predict(X_test)))
acc = grid.best_estimator_.score(X_test, Y_test)
auc = roc_auc_score(Y_test, probs[:,1])
with plt.style.context(('ggplot', 'seaborn')):
    plt.figure(figsize=(8,6))
    plt.scatter(fpr, tpr, c='blue')
    plt.plot(fpr, tpr, label="Accuracy:%.2f, AUC:%.2f" % (acc, auc), linewidth=2, c='red')
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (recall)")
    plt.title('ROC Curve')
    plt.legend(loc='best');
```

We can notice above from the classification report and ROC Curve that our model is performing quite well in the case of the imbalanced dataset after parameter tuning.

Precision and recall help a lot in the case of imbalanced datasets. Plotting precision against recall at different thresholds gives a better evaluation of the model when classes are imbalanced. The curve does not take true negatives into consideration, as they belong to the majority class, and focuses on true positives, which belong to the minority class with only a few occurrences.

**Note:** It's restricted to binary classification tasks.

The plot below shows the precision-recall curve for the SVC on the test set of the unbalanced dataset.

In [20]:

```
from sklearn.metrics import precision_recall_curve, auc,average_precision_score
decision_function = svc.decision_function(X_test)
precision, recall, thresholds = precision_recall_curve(Y_test, decision_function)
acc = svc.score(X_test, Y_test)
p_auc = auc(recall, precision)
with plt.style.context(('ggplot', 'seaborn')):
    plt.figure(figsize=(8,6))
    plt.scatter(recall, precision, c='blue')
    plt.plot(recall, precision, label="Accuracy:%.2f, AUC:%.2f" % (acc, p_auc), linewidth=2, c='red')
    plt.hlines(0.5,0.0,1.0, linestyle='dashed', colors=['orange'])
    plt.xlabel("Recall (Sensitivity)")
    plt.ylabel("Precision")
    plt.title('Precision Recall Curve')
    plt.legend(loc='best');
```

The precision-recall curve drops sharply when our model does not perform well on an imbalanced dataset. Notice that the AUC of the precision-recall curve is around 50%, whereas the AUC of the ROC curve was around 90%. ROC curves sometimes give optimistic results, hence it's better to consider precision-recall curves as well in the case of imbalanced datasets.
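One reason the two curves read so differently is their chance baseline. The sketch below is an illustration on synthetic labels with random scores (not the tutorial's model): an uninformative scorer gets a ROC AUC near 0.5 regardless of imbalance, while its average precision sits near the fraction of positive samples.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 20000
y_true = (rng.random(n) < 0.10).astype(int)  # roughly 10% positives
random_scores = rng.random(n)                # scores carry no information

roc_auc = roc_auc_score(y_true, random_scores)
avg_prec = average_precision_score(y_true, random_scores)
prevalence = y_true.mean()

print("ROC AUC           : %.3f" % roc_auc)
print("Average precision : %.3f" % avg_prec)
print("Positive fraction : %.3f" % prevalence)
```

On a 10% positive class, a precision-recall score should therefore be compared with about 0.1 rather than 0.5.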

In [21]:

```
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(probability=True),param_grid = {'gamma': ['auto', 'scale'], 'C': [1.0, 0.1, 0.01, 10.0]}, cv=5)
grid.fit(X, Y)
print('Best Parameters : ',grid.best_params_)
print('Best Score : ',grid.best_score_)
decision_function = grid.best_estimator_.decision_function(X_test)
precision, recall, thresholds = precision_recall_curve(Y_test, decision_function)
print('Precision : ', precision)
print('Recall : ', recall)
print('Different Thresholds For Calculating Precision, Recall : ', thresholds)
print('Classification Report : ')
print(classification_report(Y_test, grid.best_estimator_.predict(X_test)))
acc = grid.best_estimator_.score(X_test, Y_test)
p_auc = auc(recall, precision)
ap = average_precision_score(Y_test, grid.predict_proba(X_test)[:,1])
with plt.style.context(('ggplot', 'seaborn')):
    plt.figure(figsize=(8,6))
    plt.scatter(recall, precision, c='blue')
    plt.plot(recall, precision, label="Accuracy:%.2f, AUC:%.2f, Average Precision %.2f" % (acc, p_auc, ap), linewidth=2, c='red')
    plt.xlabel("Recall (Sensitivity)")
    plt.ylabel("Precision")
    plt.title('Precision Recall Curve')
    plt.legend(loc='best');
```

Log loss is the negative log-likelihood of the true labels given the probabilities predicted by the classifier. Many classifiers use it as the cost function that they minimize while updating the weights of the model.

$$log\_loss = - y * log (y') - (1-y) * log(1 - y')$$

In [22]:

```
from sklearn.metrics import log_loss
X, Y = datasets.make_classification(n_samples= 500)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, stratify=Y)
print('Train/Test Sizes : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
grid = GridSearchCV(SVC(probability=True),param_grid = {'C': [1.0, 0.1, 0.01, 10.0,]}, scoring="neg_log_loss", cv=5)
grid.fit(X, Y)
print('Best Parameters : ',grid.best_params_)
#print('Test Log Loss : %.3f'%grid.best_estimator_.score(X_test, Y_test))
#print('Train Log Loss : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Log Loss : %.3f'%log_loss(Y_test, grid.best_estimator_.predict_proba(X_test)))
print('Train Log Loss : %.3f'%log_loss(Y_train, grid.best_estimator_.predict_proba(X_train)))
Y_preds = grid.best_estimator_.predict(X_test)
print(Y_preds[:10])
print(Y_test[:10])
```
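The log loss formula can be verified by hand on a few samples. In the sketch below, the probabilities are hypothetical values rather than model output, and the per-sample losses are averaged, which matches the default behavior of `log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p_pos = np.array([0.9, 0.2, 0.6, 0.4])  # hypothetical predicted P(y = 1)

# Apply -y*log(y') - (1-y)*log(1-y') to each sample, then average.
per_sample = -(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))
manual = per_sample.mean()

print(manual, log_loss(y_true, p_pos))
```

The last sample (true label 1, predicted probability 0.4) contributes the largest term, showing how confident mistakes are penalized more heavily.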

`zero_one_loss` returns either the number or the fraction of misclassifications. It accepts a `normalize` parameter: if set to `True` (the default), it returns the fraction of misclassifications; if set to `False`, it returns the number of misclassifications.

In [23]:

```
from sklearn.metrics import zero_one_loss
print('Number of Misclassified Examples : ',zero_one_loss(Y_test, Y_preds, normalize=False))
print('Fraction of Misclassified Examples : ',zero_one_loss(Y_test, Y_preds))
```
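For single-label problems, the fraction returned by `zero_one_loss` is simply one minus the accuracy. A minimal sketch on made-up labels:

```python
from sklearn.metrics import accuracy_score, zero_one_loss

# Toy labels chosen for illustration only; two of five are wrong.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 0]

fraction_wrong = zero_one_loss(y_true, y_pred)
count_wrong = zero_one_loss(y_true, y_pred, normalize=False)
accuracy = accuracy_score(y_true, y_pred)

print(fraction_wrong, count_wrong, 1 - accuracy)
```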

`balanced_accuracy_score` returns the average of the recall obtained on each class of a classification problem. It's useful when dealing with imbalanced datasets.

It has a parameter `adjusted`; when set to `True`, the result is adjusted for chance so that a randomly performing model gets a score of 0 and perfect performance gets 1.0.

In [24]:

```
from sklearn.metrics import balanced_accuracy_score
print('Balanced Accuracy : ',balanced_accuracy_score(Y_test, Y_preds))
print('Balanced Accuracy Adjusted : ',balanced_accuracy_score(Y_test, Y_preds, adjusted=True))
```
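Both variants can be reproduced from per-class recall. The sketch below uses made-up imbalanced labels; the chance adjustment shown rescales the score so that `1 / n_classes` maps to 0 and 1.0 stays at 1.0.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# Toy imbalanced labels chosen for illustration only (8 negatives, 2 positives).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

per_class_recall = recall_score(y_true, y_pred, average=None)
balanced = per_class_recall.mean()

# Rescale so that chance level (1 / n_classes) becomes 0.
n_classes = len(per_class_recall)
adjusted = (balanced - 1 / n_classes) / (1 - 1 / n_classes)

print(balanced, balanced_accuracy_score(y_true, y_pred))
print(adjusted, balanced_accuracy_score(y_true, y_pred, adjusted=True))
```

Plain accuracy on these labels is 0.8, while balanced accuracy is 0.6875 because the minority class is only half recovered.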

`brier_score_loss` computes the mean squared difference between the actual class labels and the probabilities predicted by the model. It should be as low as possible for good performance. It’s for binary classification problems only. By default it takes 1 as the positive class, so to treat 0 as the positive class one can use the `pos_label` parameter as below.

In [25]:

```
from sklearn.metrics import brier_score_loss
print('Brier Loss : ',brier_score_loss(Y_test, grid.predict_proba(X_test)[:, 1]))
print('Brier Loss (0 as Positive Class) : ', brier_score_loss(Y_test, grid.predict_proba(X_test)[:, 0], pos_label=0))
```
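The Brier score is straightforward to compute by hand. The sketch below uses hypothetical probabilities for the positive class and compares the result with `brier_score_loss`.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([1, 0, 1, 1])
p_pos = np.array([0.9, 0.2, 0.6, 0.4])  # hypothetical predicted P(y = 1)

# Mean of squared differences between probability and actual label.
manual = np.mean((p_pos - y_true) ** 2)

print(manual, brier_score_loss(y_true, p_pos))
```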

F-Beta score is the weighted harmonic mean of precision and recall, with the weighting controlled by the `beta` parameter. If `beta < 1` it lends more weight to precision, while `beta > 1` lends more weight to recall. Its best value is `1.0` and its worst is `0.0`.

It has a parameter called `average` which is required for multiclass problems. It accepts the values `[None, 'binary'(default), 'micro', 'macro', 'samples', 'weighted']`. If `None` is specified, the score for each class is returned; otherwise the scores are averaged as specified by the parameter.

In [26]:

```
from sklearn.metrics import fbeta_score
print('Fbeta Favouring Precision : ', fbeta_score(Y_test, Y_preds, beta=0.5))
print('Fbeta Favouring Recall : ' ,fbeta_score(Y_test, Y_preds, beta=2.0))
```
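The effect of `beta` is easiest to see from the formula $(1 + \beta^2) * P * R / (\beta^2 * P + R)$. The sketch below applies it to made-up labels where precision is higher than recall; `fbeta_manual` is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy labels chosen for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

p = precision_score(y_true, y_pred)  # 2 / 3
r = recall_score(y_true, y_pred)     # 1 / 2

def fbeta_manual(beta):
    """Weighted harmonic mean of precision and recall."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

for beta in (0.5, 1.0, 2.0):
    print(beta, fbeta_manual(beta), fbeta_score(y_true, y_pred, beta=beta))
```

Because precision exceeds recall here, the score with `beta=0.5` is higher than the score with `beta=2.0`.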

`hamming_loss` returns the fraction of labels that are misclassified.

In [27]:

```
from sklearn.metrics import hamming_loss
print('Hamming Loss : ', hamming_loss(Y_test, Y_preds))
```

We'll now introduce model evaluation metrics for regression tasks. We'll start with loading the Boston dataset available in scikit-learn for our purpose.

In [28]:

```
#X, Y = datasets.make_regression(n_samples=200, n_features=20, )
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
boston = datasets.load_boston()
X, Y = boston.data, boston.target
print('Dataset Size : ', X.shape, Y.shape)
```

We'll split the dataset into train/test sets, with 80% for the train set and 20% for the test set.

In [29]:

```
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, random_state=1, )
print('Train/Test Size : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
```

We'll now initialize a simple LinearSVR model and train it on the train dataset. We'll then check its performance by evaluating various regression metrics provided by scikit-learn.

In [30]:

```
from sklearn.svm import LinearSVR
svr = LinearSVR()
svr.fit(X_train, Y_train)
```

The coefficient of determination $R^2$ is defined as $(1- u/v)$.

$u = ((y_{true} - y_{pred})^2).sum()$

$v = ((y_{true} - y_{true}.mean())^2).sum()$

The best possible score is 1.0 and it can be negative as well if the model is performing badly. A constant model that always predicts the mean of $y_{true}$, regardless of the input, will have a score of 0.0.

**Note:** The `score()` method of most regression models outputs this metric, which is quite different from MSE (mean squared error). The two should not be confused.
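The definition of $R^2$ can be checked directly with NumPy. The sketch below follows scikit-learn's definition, in which $v$ is measured around the mean of the true values, and uses made-up numbers.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets and predictions chosen for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
r2_manual = 1 - u / v

# A constant model that always predicts the mean scores exactly 0.
mean_baseline = np.full_like(y_true, y_true.mean())

print(r2_manual, r2_score(y_true, y_pred))
print(r2_score(y_true, mean_baseline))
```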

In [31]:

```
from sklearn.metrics import r2_score
Y_preds = svr.predict(X_test)
print(Y_preds[:10])
print(Y_test[:10])
print('Test R^2 : %.3f'%r2_score(Y_test, Y_preds))
print('Test R^2 : %.3f'%svr.score(X_test, Y_test))
print('Training R^2 : %.3f'%svr.score(X_train, Y_train))
```

Below we do a grid search over various values of the parameter `C` of LinearSVR, using `r2` as the evaluation metric to be optimized.

In [32]:

```
grid = GridSearchCV(LinearSVR(),param_grid = {'C': [1.0, 0.1, 0.01, 10.0,]}, scoring="r2", cv=5)
grid.fit(X, Y)
print('Best Parameters : ',grid.best_params_)
print('Best Score : ',grid.best_score_)
print('Test R^2 : %.3f'%r2_score(Y_test, grid.best_estimator_.predict(X_test)))
print('Test R^2 : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Training R^2 : %.3f'%grid.best_estimator_.score(X_train, Y_train))
Y_preds = grid.best_estimator_.predict(X_test)
print(Y_preds[:10])
print(Y_test[:10])
```

Mean absolute error is the sum of the absolute differences between actual and predicted target values, divided by the number of samples.

$$MAE = \frac 1 n {\sum_{i=1}^n |x_i - y_i|}$$

In [33]:

```
from sklearn.metrics import mean_absolute_error
print('Test MAE : %.3f'%mean_absolute_error(Y_test, Y_preds))
print('Train MAE : %.3f'%mean_absolute_error(Y_train, svr.predict(X_train)))
```
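As a quick check of the mean absolute error definition, the sketch below computes it with NumPy on made-up values and compares it with `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy targets and predictions chosen for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Average of the absolute residuals.
mae_manual = np.abs(y_true - y_pred).mean()

print(mae_manual, mean_absolute_error(y_true, y_pred))
```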

Below we do a grid search over various values of the parameter `C` of LinearSVR, using `neg_mean_absolute_error` as the evaluation metric to be optimized.

In [34]:

```
grid = GridSearchCV(LinearSVR(),param_grid = {'C': [1.0, 0.1, 0.01, 10.0,]}, scoring="neg_mean_absolute_error", cv=5)
grid.fit(X, Y)
print('Best Parameters : ',grid.best_params_)
print('Test MAE : %.3f'%mean_absolute_error(Y_test, grid.best_estimator_.predict(X_test)))
print('Train MAE : %.3f'%mean_absolute_error(Y_train, grid.best_estimator_.predict(X_train)))
Y_preds = grid.best_estimator_.predict(X_test)
print(Y_preds[:10])
print(Y_test[:10])
```

Mean squared error is the sum of the squared differences between actual and predicted values, divided by the number of samples.

$$MSE = \frac 1 n {\sum_{i=1}^n (x_i - y_i)^2}$$

In [35]:

```
from sklearn.metrics import mean_squared_error, mean_squared_log_error
print('Test MSE : %.3f'%mean_squared_error(Y_test, Y_preds))
print('Train MSE : %.3f'%mean_squared_error(Y_train, svr.predict(X_train)))
```
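The sketch below computes mean squared error by hand on made-up values, dividing by the number of samples as `mean_squared_error` does, and also derives the root mean squared error from it.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy targets and predictions chosen for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse_manual = ((y_true - y_pred) ** 2).mean()
rmse_manual = np.sqrt(mse_manual)  # same units as the target

print(mse_manual, mean_squared_error(y_true, y_pred))
print(rmse_manual)
```

Squaring makes the single residual of size 1 contribute 1.0 of the 1.5 total, so large errors dominate the metric more than they do in mean absolute error.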

Below we do a grid search over various values of the parameter `C` of LinearSVR, using `neg_mean_squared_error` as the evaluation metric to be optimized.

In [36]:

```
grid = GridSearchCV(LinearSVR(),param_grid = {'C': [1.0, 0.1, 0.01, 10.0,]}, scoring="neg_mean_squared_error", cv=5)
grid.fit(X, Y)
print('Best Parameters : ',grid.best_params_)
print('Test MSE : %.3f'%mean_squared_error(Y_test, grid.best_estimator_.predict(X_test)))
print('Train MSE : %.3f'%mean_squared_error(Y_train, grid.best_estimator_.predict(X_train)))
Y_preds = grid.best_estimator_.predict(X_test)
print(Y_preds[:10])
print(Y_test[:10])
```

Mean squared log error cannot be used when the targets or predictions contain negative values.

In [37]:

```
from sklearn.metrics import mean_squared_log_error
print(mean_squared_log_error(Y_test, Y_preds))
```
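Mean squared log error applies the squared error to `log(1 + value)` instead of the raw values, which is why negative inputs are a problem. A minimal sketch on made-up non-negative values:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Toy non-negative targets and predictions chosen for illustration only.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# log1p(x) computes log(1 + x).
msle_manual = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

print(msle_manual, mean_squared_log_error(y_true, y_pred))
```

Working on the log scale means the metric measures relative rather than absolute differences.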

In [38]:

```
from sklearn.metrics import median_absolute_error
print('Median Absolute Error : ', median_absolute_error(Y_test, Y_preds))
print('Median Absolute Error : ', np.median(np.abs(Y_test - Y_preds)))
```

`explained_variance_score` returns the explained variance regression score. The best value is 1.0 and lower values indicate a worse model.

In [39]:

```
from sklearn.metrics import explained_variance_score
print('Explained Variance Score : ', explained_variance_score(Y_test, Y_preds))
```
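Explained variance differs from $R^2$ in that it ignores a constant offset in the predictions. The sketch below illustrates this with made-up values where every prediction is off by exactly 1.0.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# Toy targets chosen for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = y_true + 1.0  # every prediction is off by the same constant

# 1 - Var(residuals) / Var(y_true); a constant residual has zero variance.
ev_manual = 1 - np.var(y_true - y_pred) / np.var(y_true)

print(ev_manual, explained_variance_score(y_true, y_pred))
print(r2_score(y_true, y_pred))
```

The explained variance is 1.0 even though the predictions are biased, while $R^2$ stays below 1.0, so the two are worth reading together.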

`max_error` returns the largest absolute difference between the actual and predicted values over all samples.

$$ ME(Y, Y') = max(|y_1 - y_1'|,|y_2 - y_2'|,....,|y_n - y_n'| ) $$

In [40]:

```
from sklearn.metrics import max_error
print('Maximum Residual Error : ', max_error(Y_test, Y_preds))
print('Maximum Residual Error : ', max_error([1,2,3,4], [1,2,3.5,7])) ## here 4th sample has highest difference
```

We'll now introduce evaluation metrics for unsupervised learning - clustering tasks.

Clustering algorithms assign a cluster label to each sample, but the label ids need not match the ids used by the original classes. A class whose samples are labeled `1` in the original dataset may be given a label other than `1` by the clustering algorithm.

We'll use the IRIS dataset and KMeans for explanation purposes, and plot the results to show the difference. We'll see how the score improves once we use `adjusted_rand_score`, which does not depend on how the cluster ids are numbered, as the evaluation function.

In [41]:

```
from sklearn.cluster import KMeans, MeanShift
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, adjusted_rand_score, confusion_matrix
iris = load_iris()
X, Y = iris.data, iris.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, stratify=Y, random_state=12)
print('Train/Test Sizes : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
kmeans = KMeans(n_clusters=3)
kmeans.fit(X_train, Y_train)
Y_preds = kmeans.predict(X_test)
#print(Y_test, Y_preds)
print('Confusion Matrix : ')
print(confusion_matrix(Y_test, Y_preds))
print('Accuracy of Model : %.3f'%accuracy_score(Y_test, Y_preds))
print('Adjusted Accuracy : %.3f'%adjusted_rand_score(Y_test, Y_preds))
with plt.style.context(('ggplot', 'seaborn')):
    plt.figure(figsize=(10,4))
    plt.subplot(121)
    plt.scatter(X_test[: , 1], X_test[:, 2], c=Y_test, cmap = plt.cm.viridis)
    plt.xlabel(iris.feature_names[1])
    plt.ylabel(iris.feature_names[2])
    plt.title('Y Original')
    plt.subplot(122)
    plt.scatter(X_test[: , 1], X_test[:, 2], c=Y_preds, cmap = plt.cm.viridis)
    plt.xlabel(iris.feature_names[1])
    plt.ylabel(iris.feature_names[2])
    plt.title('Y Predicted');
```
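The label-numbering problem can be isolated with a tiny made-up example in which the clustering is perfect but the cluster ids are permuted. Accuracy treats every sample as wrong, while `adjusted_rand_score` only looks at which samples are grouped together.

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Toy labels chosen for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
# Same grouping as y_true, but the ids are permuted (0 -> 2, 1 -> 0, 2 -> 1).
y_clusters = [2, 2, 2, 0, 0, 0, 1, 1, 1]

acc = accuracy_score(y_true, y_clusters)
ari = adjusted_rand_score(y_true, y_clusters)

print("Accuracy            :", acc)
print("Adjusted Rand score :", ari)
```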

Users can also define their own scoring function if the one they need is not among sklearn's built-in scoring functions. In `GridSearchCV` and `cross_val_score`, one can pass an object that has a `__call__` method, or a function, to the `scoring` parameter. Either one needs to accept the estimator object, test features (X) and target (Y) as input and return a `float`.

Below we define RMSE (Root Mean Squared Error) both as a class and as a function. We'll then use each in `cross_val_score()` to check performance, and compare the values with the negative of `neg_mean_squared_error`.

In [42]:

```
class RootMeanSquareError(object):
    def __call__(self, model, X, Y):
        Y_preds = model.predict(X)
        return np.sqrt(((Y - Y_preds)**2).mean())

def rootMeanSquareError(model, X, Y):
    Y_preds = model.predict(X)
    return np.sqrt(((Y - Y_preds)**2).mean())
lsvr = LinearSVR(random_state=1)
print('Cross Val Score Using Object : ',cross_val_score(lsvr, X, Y, scoring=RootMeanSquareError()))
print('Cross Val Score Using Function : ', cross_val_score(lsvr, X, Y, scoring=rootMeanSquareError))
print('Cross Val Score Using Negative Mean Squared Error : ', -1*cross_val_score(lsvr, X, Y, scoring='neg_mean_squared_error'))
print('Cross Val Score Using Square Root of Negative Mean Squared Error : ', np.sqrt(-1*cross_val_score(lsvr, X, Y, scoring='neg_mean_squared_error')))
```
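An alternative to writing a callable by hand is `sklearn.metrics.make_scorer`, which wraps a plain metric function of `(y_true, y_pred)`. The sketch below is one possible setup, not the approach used above: the `rmse` helper, the synthetic dataset and the `LinearRegression` model are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

def rmse(y_true, y_pred):
    """Root mean squared error as a plain metric function."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

# greater_is_better=False makes the scorer return the negated RMSE,
# so that "higher is better" still holds for model selection.
rmse_scorer = make_scorer(rmse, greater_is_better=False)

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()

neg_rmse = cross_val_score(model, X_demo, y_demo, scoring=rmse_scorer, cv=5)
neg_mse = cross_val_score(model, X_demo, y_demo, scoring="neg_mean_squared_error", cv=5)

print("RMSE per fold            :", -neg_rmse)
print("sqrt(MSE) per fold       :", np.sqrt(-neg_mse))
```

Negating losses this way is the same convention that the built-in `neg_*` scorers follow, which keeps grid search maximizing a score in every case.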

Below is the list of scikit-learn's built-in scorers.

In [43]:

```
print('List of Inbuilt Scorers : ')
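# Note: SCORERS was removed in recent scikit-learn releases, which provide sklearn.metrics.get_scorer_names() instead.
sklearn.metrics.SCORERS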
```

Sunny Solanki