Supervised Learning - Classification using Scikit-Learn

Supervised Learning - Classification

Supervised learning is a type of machine learning in which the training data includes target values (labels), and the model learns to predict those targets for new inputs. Classification is a type of supervised learning where an algorithm predicts one output from a fixed list of classes. The task is binary classification when there are 2 classes and multi-class classification when there are more than 2.

Applications of Classification

  • Classifying mails as spam/ham.
  • Classifying tumor as malignant/benign.
  • Classifying digits in an image.
  • & many more.

Supervised Learning Workflow

In this tutorial, we'll cover classification problems and solve them using the scikit-learn library. We'll use LogisticRegression and KNeighborsClassifier (K-nearest neighbors) for explanation purposes. The dataset we'll use is the famous Iris flower dataset. It has 4 features, from which we'll predict the target variable, which is one of 3 classes of iris flowers.

We’ll start by importing scikit-learn and a few supporting libraries.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

import sklearn

import warnings
import sys

print("Python Version : ",sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)

warnings.filterwarnings("ignore") ## Silence all warnings (including future/deprecation warnings).
np.set_printoptions(precision=3)

## The magic command below renders plots inline in the current notebook.
## The alternative (%matplotlib notebook) renders interactive plots inside the notebook.
%matplotlib inline
Python Version :  3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
Scikit-Learn Version :  0.21.2

Loading Data

Below we load the Iris dataset, which ships with scikit-learn. load_iris() returns a Bunch object, which behaves like a Python dictionary whose keys can also be accessed as attributes. We'll also print details about the dataset.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
#type(iris) ## Type is Bunch object which is almost same as Python Dictionary.
print('Dataset features names : '+str(iris.feature_names))
print('Dataset features size : '+str(iris.data.shape))
print('Dataset target names : '+str(iris.target_names))
print('Dataset target size : '+str(iris.target.shape))
Dataset features names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Dataset features size : (150, 4)
Dataset target names : ['setosa' 'versicolor' 'virginica']
Dataset target size : (150,)
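
To illustrate the dictionary-like behaviour of the Bunch object, here is a minimal sketch showing that the same values are reachable through keys and through attributes. The variable names same_data, same_target and first_labels are only for illustration and are not part of the original notebook.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch supports dictionary-style access ...
print(sorted(iris.keys()))

# ... and attribute-style access to the very same underlying objects.
same_data = iris["data"] is iris.data
same_target = iris["target"] is iris.target
print(same_data, same_target)

# Integer targets index into target_names to recover readable class labels.
first_labels = iris.target_names[iris.target[:3]]
print(first_labels)
```

Indexing target_names with the target array is a handy way to turn predictions back into flower names when printing results.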

Visualizing Data

Below we visualize the data with a scatter plot of two attributes (sepal length on the X-axis vs. petal width on the Y-axis). You can try other attribute combinations to see how they are related. Points are color-coded by class.

In [3]:
with plt.style.context(('ggplot','seaborn')):
    plt.figure(figsize=(15,6))
    plt.subplot(121)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(iris.data[iris.target==i,0],iris.data[iris.target==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Visualizing Dataset')

Splitting Dataset into Train & Test sets

We'll split the dataset into two parts:

  • Training data, which will be used to train the model.
  • Test data, against which the accuracy of the trained model will be checked.

The train_test_split function from sklearn's model_selection module splits the data into two sets, with 80% for training and 20% for testing. We also pass a seed (random_state=123) so that we always get the same split and can reproduce results in the future.

Note that we also use the stratify parameter, which keeps the class proportions of the full dataset in both the train and test sets. For each class, 80% of the samples go to the train set and 20% to the test set. This ensures that no class dominates either set.

In [4]:
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, train_size=.8, test_size=.2, stratify=iris.target, random_state=123)
print('Train-Test dataset sizes : ',X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)
Train-Test dataset sizes :  (120, 4) (120,) (30, 4) (30,)
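
As a quick check of the two claims above (stratified class proportions and a reproducible split), the sketch below counts the samples per class in each set and repeats the split with the same seed. The helper split() is a hypothetical wrapper added only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

def split(seed):
    # Same call as in the tutorial, with the seed turned into a parameter.
    return train_test_split(iris.data, iris.target, train_size=0.8, test_size=0.2,
                            stratify=iris.target, random_state=seed)

X_train, X_test, Y_train, Y_test = split(123)

# With stratify, every class keeps the 80/20 proportion.
train_counts = np.bincount(Y_train)
test_counts = np.bincount(Y_test)
print(train_counts, test_counts)

# Calling again with the same seed returns an identical split.
_, X_test_again, _, _ = split(123)
same_split = np.array_equal(X_test, X_test_again)
print(same_split)
```

Since Iris has 50 samples per class, each class contributes 40 samples to the train set and 10 to the test set.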

Logistic Regression

Logistic Regression is a linear model for classification tasks. It can fit binary or multi-class (one-vs-rest) tasks. When there are more than 2 classes, it fits one linear decision boundary per class, each separating that class from the remaining classes. It should not be confused with linear regression, which is used for supervised regression tasks.

Initializing Model

Below we initialize the LogisticRegression model, a basic model used extensively for classification tasks. We pass a seed (random_state=123) so that the same results can be reproduced in the future.

In [5]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=123)
classifier
Out[5]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=123, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Fitting Model To Train Data

We can train a model by passing the train data and train labels to fit(). After training, fit() returns the fitted classifier object itself.

In [6]:
classifier.fit(X_train,Y_train)
Out[6]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=123, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Evaluating Trained Model On Test Data.

Almost all models in the scikit-learn API provide a predict() method, which predicts the target variable for the samples passed to it. Most models also provide a score() method, which returns accuracy for classification models. We'll use both methods below to compare results on the test data.

In [7]:
Y_preds = classifier.predict(X_test)

print(Y_preds)
print(Y_test)

print('Accuracy : %.3f'%(Y_preds == Y_test).mean() )
print('Accuracy : %.3f'%classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
[1 0 2 2 0 0 2 2 2 0 0 1 2 1 2 1 0 0 0 0 0 2 2 1 2 2 1 1 1 1]
[1 0 2 2 0 0 2 1 2 0 0 1 2 1 2 1 0 0 0 0 0 2 2 1 2 2 1 1 1 1]
Accuracy : 0.967
Accuracy : 0.967
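
Accuracy alone does not say which classes get confused with each other. The sketch below recomputes accuracy three ways and adds a confusion matrix using accuracy_score and confusion_matrix from sklearn.metrics. The max_iter=1000 argument is an assumption added here to avoid convergence warnings on newer scikit-learn releases, so the exact numbers may differ from the output above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, train_size=0.8, test_size=0.2,
    stratify=iris.target, random_state=123)

# max_iter=1000 is an assumption for this sketch, not a setting from the tutorial.
clf = LogisticRegression(random_state=123, max_iter=1000).fit(X_train, Y_train)
Y_preds = clf.predict(X_test)

# Three equivalent ways to compute accuracy.
manual_acc = (Y_preds == Y_test).mean()
metric_acc = accuracy_score(Y_test, Y_preds)
score_acc = clf.score(X_test, Y_test)
print(manual_acc, metric_acc, score_acc)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(Y_test, Y_preds)
print(cm)
```

The diagonal of the confusion matrix counts correct predictions, so its sum divided by the total number of test samples equals the accuracy.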

The majority of classifiers in scikit-learn also provide a predict_proba() method, which returns the probability the model assigns to each class for each sample.

In [8]:
print(classifier.predict_proba(X_test)[:10])  ## It returns probability predicted by model for each class for each example.
[[4.101e-02 7.560e-01 2.030e-01]
 [8.157e-01 1.842e-01 8.929e-05]
 [5.975e-04 1.789e-01 8.205e-01]
 [3.761e-03 3.018e-01 6.944e-01]
 [8.209e-01 1.790e-01 9.875e-05]
 [8.440e-01 1.559e-01 4.662e-05]
 [2.552e-04 3.270e-01 6.728e-01]
 [2.461e-03 4.976e-01 4.999e-01]
 [5.309e-03 3.420e-01 6.527e-01]
 [8.080e-01 1.919e-01 6.463e-05]]
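
To connect these probabilities with predict(), the sketch below checks that each row sums to 1 and that picking the most probable class reproduces the predicted labels. As before, max_iter=1000 is an assumption added for newer scikit-learn releases, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, train_size=0.8, test_size=0.2,
    stratify=iris.target, random_state=123)

# max_iter=1000 is an assumption for this sketch, not a setting from the tutorial.
clf = LogisticRegression(random_state=123, max_iter=1000).fit(X_train, Y_train)

proba = clf.predict_proba(X_test)

# Each row holds one probability per class, and the row sums to 1.
row_sums = proba.sum(axis=1)

# Columns follow the order of clf.classes_, so argmax maps back to a label.
labels_from_proba = clf.classes_[proba.argmax(axis=1)]
agrees = np.array_equal(labels_from_proba, clf.predict(X_test))
print(proba.shape, agrees)
```

Rows where the two largest probabilities are close, like the eighth row in the output above, are the samples the model is least sure about.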

As discussed above, logistic regression fits linear boundaries through the data to separate classes. We can access the coefficients of those boundaries through the coef_ and intercept_ attributes of the classifier. In binary classification, only 1 boundary separating the two classes is fitted. In our case, with 3 classes, 3 boundaries are fitted, each separating one class from the other 2.

In [9]:
print('Weight Coefficients : '+str(classifier.coef_))
print('Y-axis Intercept : '+str(classifier.intercept_))
Weight Coefficients : [[ 0.398  1.397 -2.17  -0.987]
 [ 0.244 -1.31   0.567 -1.182]
 [-1.386 -1.747  2.278  2.254]]
Y-axis Intercept : [ 0.248  1.026 -1.141]
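
The sketch below shows how these attributes are used: multiplying the features by coef_ and adding intercept_ gives one linear score per class, which is what decision_function() returns. The max_iter=1000 argument and the variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, train_size=0.8, test_size=0.2,
    stratify=iris.target, random_state=123)

# max_iter=1000 is an assumption for this sketch, not a setting from the tutorial.
clf = LogisticRegression(random_state=123, max_iter=1000).fit(X_train, Y_train)

# One row of weights per class, one weight per feature.
print(clf.coef_.shape, clf.intercept_.shape)

# Linear score for every (sample, class) pair: X . W^T + b
scores_manual = X_test @ clf.coef_.T + clf.intercept_
scores_sklearn = clf.decision_function(X_test)
print(np.allclose(scores_manual, scores_sklearn))
```

The class with the highest score for a sample is the one predict() returns.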

Visualizing Prediction Results On Test Data

Below we visualize how our model performed on the test data with scatter plots of sepal length vs. petal width, color-coded by flower class: actual classes on the left and predicted classes on the right.

In [10]:
with plt.style.context(('ggplot','seaborn')):
    plt.figure(figsize=(12,5))
    plt.subplot(121)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_test==i,0],X_test[Y_test==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Actual')

    plt.subplot(122)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_preds==i,0],X_test[Y_preds==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Prediction');

Finetuning Model By Doing Grid Search On Various Hyperparameters.

Below is a list of hyperparameters that we can tune to get the best estimator for our data.

  • penalty - Regularization penalty applied to the model weights to reduce over-fitting. It accepts one of the strings l1, l2, elasticnet, or none. elasticnet combines l1 and l2 in a proportion set by l1_ratio. default=l2
  • fit_intercept - Boolean specifying whether to include an intercept in the model ($y = mx + c$, where c is the intercept). default=True
  • C - Inverse of regularization strength ($1/\alpha$, where $\alpha$ is the regularization strength in the cost function). Smaller values mean stronger regularization. default=1.0
  • solver - Optimization algorithm. It accepts a string from the list ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']. default=liblinear in the version used here (0.21.2); the default changed to lbfgs in version 0.22
  • l1_ratio - When penalty is elasticnet, this parameter sets the proportion of l1 and l2 penalties. It accepts a float in the range [0.0, 1.0] or None. l1_ratio=0 is equivalent to penalty=l2 and l1_ratio=1 is equivalent to penalty=l1. default=None
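
To make the role of C concrete, here is a small sketch that fits the model with three values of C and prints the overall size of the learned weights. It uses the default l2 penalty. The chosen C values and max_iter=5000 are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

norms = {}
for C in [0.01, 1.0, 100.0]:
    # Small C = strong regularization, large C = weak regularization.
    clf = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms[C] = np.linalg.norm(clf.coef_)
    print("C=%-6s coefficient norm=%.3f" % (C, norms[C]))
```

The coefficient norm grows as C increases, because a larger C penalizes large weights less.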

GridSearchCV

GridSearchCV is a wrapper class provided by sklearn. It loops through all combinations of the parameters given in the param_grid parameter, evaluates each combination with the number of cross-validation folds given as the cv parameter, and stores all results in the cv_results_ attribute. It also stores the model with the best mean cross-validation score in the best_estimator_ attribute and that score in the best_score_ attribute.

Note: Many estimators provide an n_jobs parameter, which sets the number of cores to use for parallelization. A value of -1 uses all cores. We also use %%time, a Jupyter notebook cell magic command that prints the time taken by the cell to run. Timings will differ between computers depending on their configuration.
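
The loop that GridSearchCV performs can be sketched by hand with ParameterGrid and cross_val_score, both from sklearn.model_selection. This is a simplified illustration under stated assumptions (a tiny grid, the full Iris data, max_iter=1000), not a description of the library's internals.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score

X, y = load_iris(return_X_y=True)

# A deliberately small grid so the sketch runs quickly.
param_grid = {"C": [0.1, 0.5, 1.0], "fit_intercept": [True, False]}

# Manual version: score every combination with 3-fold cross-validation.
manual_results = []
for params in ParameterGrid(param_grid):
    model = LogisticRegression(max_iter=1000, **params)
    scores = cross_val_score(model, X, y, cv=3)
    manual_results.append((scores.mean(), params))
manual_best_score = max(score for score, _ in manual_results)

# GridSearchCV runs the same search and keeps the bookkeeping for us.
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=param_grid, cv=3)
grid.fit(X, y)

print(len(manual_results), manual_best_score, grid.best_score_)
```

On top of this loop, GridSearchCV refits the best combination on all the data passed to fit() and exposes it as best_estimator_.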

Below we try the liblinear solver. Only the l1 and l2 penalties can be used with this algorithm. It is a good choice for small datasets.

In [11]:
%%time

from sklearn.model_selection import GridSearchCV

params = {'penalty' : ['l1', 'l2',],
         'fit_intercept': [True, False],
         'C': np.linspace(0.1,1.0,10)}

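## solver is set explicitly because the default solver differs between scikit-learn versions.
grid = GridSearchCV(LogisticRegression(random_state=1, solver='liblinear', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)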
grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
Train Accuracy : 0.975
Test Accuracy : 0.933
Best Score Through Grid Search : 0.975
Best Parameters :  {'C': 0.9, 'fit_intercept': True, 'penalty': 'l1'}
CPU times: user 73.3 ms, sys: 24.2 ms, total: 97.5 ms
Wall time: 821 ms

Printing First Few Cross-Validation Results

The GridSearchCV object keeps every parameter combination tried and the results for each data split in the cv_results_ attribute, as a dictionary. Below we load those cross-validation results into a pandas DataFrame and print the first few entries.

In [12]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 40
Out[12]:
mean_fit_time std_fit_time mean_score_time std_score_time param_C param_fit_intercept param_penalty params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.002213 0.000188 0.028940 0.014129 0.1 True l1 {'C': 0.1, 'fit_intercept': True, 'penalty': '... 0.690476 0.666667 0.666667 0.675000 0.011356 39
1 0.001111 0.000197 0.004891 0.006528 0.1 True l2 {'C': 0.1, 'fit_intercept': True, 'penalty': '... 0.761905 0.820513 0.794872 0.791667 0.024162 35
2 0.004236 0.001956 0.000295 0.000062 0.1 False l1 {'C': 0.1, 'fit_intercept': False, 'penalty': ... 0.690476 0.666667 0.666667 0.675000 0.011356 39
3 0.001697 0.000581 0.000506 0.000359 0.1 False l2 {'C': 0.1, 'fit_intercept': False, 'penalty': ... 0.761905 0.820513 0.769231 0.783333 0.025973 36
4 0.002466 0.000695 0.000263 0.000052 0.2 True l1 {'C': 0.2, 'fit_intercept': True, 'penalty': '... 0.761905 0.794872 0.794872 0.783333 0.015724 36
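
The first rows of cv_results_ follow the order of the grid, not the quality of the models. A minimal sketch of ranking the results instead is shown below. It uses a small grid, and max_iter=1000 is an assumption for newer scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.1, 0.5, 1.0]}, cv=3)
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)

# Keep the most informative columns and put the best combination first.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
top = results.sort_values("rank_test_score")[columns]
print(top.head())
```

The first row of the sorted frame corresponds to best_params_ and best_score_.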

Below we try the saga solver. It supports the l1, l2, and elasticnet penalties as well as no penalty (none), and it's the only solver that supports the elasticnet penalty. It works faster for large datasets.

In [13]:
%%time

params = {'penalty' : ['l1', 'l2','elasticnet', 'none'],
         'fit_intercept': [True, False],
         'C': np.linspace(0.1,1.0,10),
         'l1_ratio': np.linspace(0.1,1.0,10)}

grid = GridSearchCV(LogisticRegression(random_state=1, solver='saga', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
Train Accuracy : 0.975
Test Accuracy : 1.000
Best Score Through Grid Search : 0.975
Best Parameters :  {'C': 0.1, 'fit_intercept': True, 'l1_ratio': 0.1, 'penalty': 'none'}
CPU times: user 6.77 s, sys: 334 ms, total: 7.1 s
Wall time: 1min 4s

Printing First Few Cross Validation Results

In [14]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 800
Out[14]:
mean_fit_time std_fit_time mean_score_time std_score_time param_C param_fit_intercept param_l1_ratio param_penalty params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.117888 0.002170 0.001064 0.000290 0.1 True 0.1 l1 {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.738095 0.871795 0.769231 0.791667 0.057050 742
1 0.117184 0.005631 0.000959 0.000173 0.1 True 0.1 l2 {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.785714 0.923077 0.820513 0.841667 0.058268 708
2 0.119445 0.002697 0.001686 0.000776 0.1 True 0.1 elasticnet {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.785714 0.897436 0.820513 0.833333 0.046718 722
3 0.106255 0.002538 0.002070 0.000443 0.1 True 0.1 none {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 1.000000 0.923077 1.000000 0.975000 0.036029 1
4 0.124330 0.012873 0.001101 0.000599 0.1 True 0.2 l1 {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.738095 0.871795 0.769231 0.791667 0.057050 742
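
Many of the 800 combinations above are redundant: l1_ratio only matters for the elasticnet penalty and C has no effect when there is no penalty, which is why rows that differ only in l1_ratio show identical scores. One possible way to avoid this is to pass param_grid as a list of dictionaries, which GridSearchCV also accepts. The sketch below only counts combinations with ParameterGrid and does not fit any model. The name split_grid is hypothetical, and note that newer scikit-learn releases spell the unpenalized option as None rather than the string 'none'.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = np.linspace(0.1, 1.0, 10)
l1_ratios = np.linspace(0.1, 1.0, 10)

# The single-dictionary grid used in the tutorial.
full_grid = {"penalty": ["l1", "l2", "elasticnet", "none"],
             "fit_intercept": [True, False],
             "C": C_values,
             "l1_ratio": l1_ratios}

# One dictionary per penalty family, listing only the parameters it uses.
split_grid = [
    {"penalty": ["l1", "l2"], "fit_intercept": [True, False], "C": C_values},
    {"penalty": ["elasticnet"], "fit_intercept": [True, False],
     "C": C_values, "l1_ratio": l1_ratios},
    {"penalty": ["none"], "fit_intercept": [True, False]},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)
```

The split grid covers the same distinct models with 242 combinations instead of 800, which should cut the search time roughly in proportion.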

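Below we try the sag solver. Only the l2 penalty or no penalty (none) can be used with this algorithm. It works faster for large datasets. The l1_ratio values in the grid only apply to the elasticnet penalty, so here they repeat the same fits without changing the scores.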

In [15]:
%%time

params = {'penalty' : ['l2', 'none'],
         'fit_intercept': [True, False],
         'C': np.linspace(0.1,1.0,10),
         'l1_ratio': np.linspace(0.1,1.0,10)}

grid = GridSearchCV(LogisticRegression(random_state=1, solver='sag', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
Train Accuracy : 0.975
Test Accuracy : 0.967
Best Score Through Grid Search : 0.975
Best Parameters :  {'C': 1.0, 'fit_intercept': True, 'l1_ratio': 0.1, 'penalty': 'l2'}
CPU times: user 3.5 s, sys: 124 ms, total: 3.62 s
Wall time: 32.4 s

Printing First Few Cross Validation Results

In [16]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 400
Out[16]:
mean_fit_time std_fit_time mean_score_time std_score_time param_C param_fit_intercept param_l1_ratio param_penalty params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.108505 0.003258 0.000981 0.000175 0.1 True 0.1 l2 {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.857143 0.923077 0.846154 0.875000 0.033664 371
1 0.109000 0.003525 0.000810 0.000037 0.1 True 0.1 none {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.952381 0.923077 1.000000 0.958333 0.031315 21
2 0.110173 0.004449 0.000863 0.000090 0.1 True 0.2 l2 {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.857143 0.923077 0.846154 0.875000 0.033664 371
3 0.105495 0.001606 0.000884 0.000017 0.1 True 0.2 none {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.952381 0.923077 1.000000 0.958333 0.031315 21
4 0.105184 0.000777 0.000757 0.000017 0.1 True 0.3 l2 {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.857143 0.923077 0.846154 0.875000 0.033664 371

Below we try the lbfgs solver. Only the l2 penalty or no penalty (none) can be used with this algorithm.

In [17]:
%%time

params = {'penalty' : ['l2','none'],
         'fit_intercept': [True, False],
         'C': np.linspace(0.1,1.0,10),
         'l1_ratio': np.linspace(0.1,1.0,10)}

grid = GridSearchCV(LogisticRegression(random_state=1, solver='lbfgs', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
Train Accuracy : 0.958
Test Accuracy : 0.967
Best Score Through Grid Search : 0.958
Best Parameters :  {'C': 0.7000000000000001, 'fit_intercept': False, 'l1_ratio': 0.1, 'penalty': 'l2'}
CPU times: user 2.26 s, sys: 85.3 ms, total: 2.35 s
Wall time: 31.5 s

Printing First Few Cross Validation Results

In [18]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 400
Out[18]:
mean_fit_time std_fit_time mean_score_time std_score_time param_C param_fit_intercept param_l1_ratio param_penalty params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.105871 0.002855 0.000697 0.000012 0.1 True 0.1 l2 {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.857143 0.948718 0.871795 0.891667 0.040042 371
1 0.105649 0.001703 0.000725 0.000019 0.1 True 0.1 none {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.928571 0.948718 0.948718 0.941667 0.009609 141
2 0.107423 0.002051 0.000733 0.000059 0.1 True 0.2 l2 {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.857143 0.948718 0.871795 0.891667 0.040042 371
3 0.103537 0.000132 0.000728 0.000047 0.1 True 0.2 none {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.928571 0.948718 0.948718 0.941667 0.009609 141
4 0.105095 0.001997 0.000700 0.000019 0.1 True 0.3 l2 {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.857143 0.948718 0.871795 0.891667 0.040042 371

Below we try the newton-cg solver. Only the l2 penalty or no penalty (none) can be used with this algorithm.

In [19]:
%%time

params = {'penalty' : ['l2','none'],
         'fit_intercept': [True, False],
         'C': np.linspace(0.1,1.0,10),
         'l1_ratio': np.linspace(0.1,1.0,10)}

grid = GridSearchCV(LogisticRegression(random_state=1, solver='newton-cg', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
Train Accuracy : 0.975
Test Accuracy : 0.933
Best Score Through Grid Search : 0.958
Best Parameters :  {'C': 0.1, 'fit_intercept': False, 'l1_ratio': 0.1, 'penalty': 'none'}
CPU times: user 1.58 s, sys: 49 ms, total: 1.63 s
Wall time: 31.5 s

Printing First Few Cross Validation Results

In [20]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 400
Out[20]:
mean_fit_time std_fit_time mean_score_time std_score_time param_C param_fit_intercept param_l1_ratio param_penalty params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.105452 0.001049 0.000815 0.000152 0.1 True 0.1 l2 {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.857143 0.948718 0.871795 0.891667 0.040042 371
1 0.141541 0.044703 0.000508 0.000167 0.1 True 0.1 none {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.928571 0.948718 0.948718 0.941667 0.009609 141
2 0.106475 0.002936 0.000529 0.000228 0.1 True 0.2 l2 {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.857143 0.948718 0.871795 0.891667 0.040042 371
3 0.104525 0.003361 0.000302 0.000099 0.1 True 0.2 none {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.928571 0.948718 0.948718 0.941667 0.009609 141
4 0.110624 0.011969 0.000277 0.000033 0.1 True 0.3 l2 {'C': 0.1, 'fit_intercept': True, 'l1_ratio': ... 0.857143 0.948718 0.871795 0.891667 0.040042 371

K-Nearest Neighbors

K-nearest neighbors is one of the simplest algorithms. It stores all points from the train dataset along with the class each belongs to. When a new, unknown point arrives for prediction, it finds a predefined number of the nearest stored points and assigns the majority class among them to the new point. The n_neighbors parameter sets the number of neighbors checked when predicting the class of a new, unseen point.
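
The idea can be sketched in a few lines of NumPy. This is a simplified illustration using Euclidean distance and a plain majority vote, not scikit-learn's implementation. The function knn_predict and the toy data are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, n_neighbors=3):
    """Predict the class of x_new by majority vote among its nearest neighbors."""
    # Euclidean distance from the new point to every stored training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the n_neighbors closest training points.
    nearest = np.argsort(distances)[:n_neighbors]
    # Majority class among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters: class 0 near the origin, class 1 near (5, 5).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.2, 0.2]))
pred_b = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]))
print(pred_a, pred_b)
```

There is no training step beyond storing the data. All the work happens at prediction time, which is why the search structures (ball tree, KD tree) discussed later matter for speed.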

Initializing Model

In [21]:
from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
knn_classifier
Out[21]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
                     weights='uniform')

Fitting Model To Train Data

In [22]:
knn_classifier.fit(X_train,Y_train)
Out[22]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
                     weights='uniform')

Evaluating Trained Model On Test Data.

In [23]:
Y_preds = knn_classifier.predict(X_test)

print(Y_preds)
print(Y_test)

print('Accuracy : %.3f'%(Y_preds == Y_test).mean())
print('Accuracy : %.3f'%knn_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
[1 0 2 2 0 0 2 2 2 0 0 2 2 1 2 1 0 0 0 0 0 2 2 1 2 2 1 1 1 1]
[1 0 2 2 0 0 2 1 2 0 0 1 2 1 2 1 0 0 0 0 0 2 2 1 2 2 1 1 1 1]
Accuracy : 0.933
Accuracy : 0.933
In [24]:
print(knn_classifier.predict_proba(X_test)[:10]) ## It returns probability predicted by model for each class for each example.
[[0.  1.  0. ]
 [1.  0.  0. ]
 [0.  0.  1. ]
 [0.  0.  1. ]
 [1.  0.  0. ]
 [1.  0.  0. ]
 [0.  0.  1. ]
 [0.  0.2 0.8]
 [0.  0.4 0.6]
 [1.  0.  0. ]]
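
With the default uniform weights, these probabilities are simply the fraction of the 5 neighbors that voted for each class, which is why only multiples of 0.2 appear above. The sketch below checks this. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, train_size=0.8, test_size=0.2,
    stratify=iris.target, random_state=123)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, Y_train)
proba = knn.predict_proba(X_test)

# Multiplying by the number of neighbors recovers the vote counts per class.
votes = proba * 5
whole_votes = np.allclose(votes, np.round(votes))
print(whole_votes, votes[:3])
```

A row such as [0. 0.2 0.8] therefore means 1 neighbor voted for versicolor and 4 for virginica.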

Visualizing Prediction Results On Test Data

In [25]:
with plt.style.context(('ggplot','seaborn')):
    plt.figure(figsize=(12,5))
    plt.subplot(121)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_test==i,0],X_test[Y_test==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Actual')

    plt.subplot(122)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_preds==i,0],X_test[Y_preds==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Prediction');

Finetuning Model By Doing Grid Search On Various Hyperparameters.

Below is a list of hyperparameters that we can tune to get the best estimator for our data.

  • n_neighbors - Number of neighbors to use to determine class of target. default=5
  • algorithm - Algorithm for finding nearest neighbors. It takes one of the values from list [ball_tree, kd_tree, brute, auto]. default=auto
  • leaf_size - Leaf size of KDTree and BallTree. It affects the speed of tree construction and queries as well as the memory required to store the tree. default=30
In [26]:
%%time

params = {'n_neighbors' : np.arange(1,10),
         'leaf_size': np.arange(5,50,5),
         'algorithm': ['ball_tree', 'kd_tree', 'brute', 'auto']}

grid = GridSearchCV(KNeighborsClassifier(n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
Train Accuracy : 0.983
Test Accuracy : 0.933
Best Score Through Grid Search : 0.983
Best Parameters :  {'algorithm': 'ball_tree', 'leaf_size': 5, 'n_neighbors': 3}
CPU times: user 2.38 s, sys: 84.4 ms, total: 2.46 s
Wall time: 27.2 s

Printing First Few Cross Validation Results

In [27]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
Number of Various Combinations of Parameters Tried : 324
Out[27]:
mean_fit_time std_fit_time mean_score_time std_score_time param_algorithm param_leaf_size param_n_neighbors params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.000408 0.000023 0.109026 0.003087 ball_tree 5 1 {'algorithm': 'ball_tree', 'leaf_size': 5, 'n_... 0.952381 0.974359 0.974359 0.966667 0.010483 136
1 0.000857 0.000341 0.112922 0.003247 ball_tree 5 2 {'algorithm': 'ball_tree', 'leaf_size': 5, 'n_... 0.928571 0.974359 0.974359 0.958333 0.021839 253
2 0.001043 0.000076 0.106558 0.000596 ball_tree 5 3 {'algorithm': 'ball_tree', 'leaf_size': 5, 'n_... 1.000000 0.974359 0.974359 0.983333 0.012230 1
3 0.001003 0.000124 0.110045 0.004013 ball_tree 5 4 {'algorithm': 'ball_tree', 'leaf_size': 5, 'n_... 0.952381 0.974359 0.974359 0.966667 0.010483 136
4 0.000965 0.000069 0.109340 0.001368 ball_tree 5 5 {'algorithm': 'ball_tree', 'leaf_size': 5, 'n_... 0.976190 0.974359 0.974359 0.975000 0.000874 55

Sunny Solanki