Supervised learning is a type of machine learning problem where the model is given target labels which it needs to learn to predict. Classification is a type of supervised learning where an algorithm predicts one output from a list of given classes. It can be a binary classification task, where there are two classes, or a multi-class problem, where there are more than two classes.
In this tutorial, we'll be covering classification problems and will try to solve them using the scikit-learn module. We'll be using LogisticRegression and KNeighborsClassifier for explanation purposes. The dataset we'll be using is the famous Iris flower dataset. It has 4 features, based on which we'll predict the target variable, which is one of the 3 classes of iris flowers.
We'll start by importing scikit-learn and a few supporting libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import warnings
import sys
print("Python Version : ",sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)
warnings.filterwarnings("ignore") ## We'll silence future warnings using this command.
np.set_printoptions(precision=3)
## The below magic function renders plots inside the current notebook.
## There is another option (%matplotlib notebook) which renders interactive plots in the notebook.
%matplotlib inline
Below we are loading the Iris dataset, which comes bundled with the sklearn package. It returns a Bunch object, which behaves almost like a Python dictionary. We'll also print details about the dataset.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
iris = load_iris()
#type(iris) ## The type is a Bunch object, which is almost the same as a Python dictionary.
print('Dataset features names : '+str(iris.feature_names))
print('Dataset features size : '+str(iris.data.shape))
print('Dataset target names : '+str(iris.target_names))
print('Dataset target size : '+str(iris.target.shape))
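Because the Bunch object behaves like a dictionary, the same fields can be accessed either as attributes or as keys. The short sketch below just demonstrates this equivalence and peeks at the bundled dataset description.
print(iris.keys())                        ## Fields available on the Bunch object.
print((iris["data"] == iris.data).all())  ## Key access and attribute access return the same array.
print(iris.DESCR[:200])                   ## First few lines of the dataset description.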
Below we are visualizing our data using a scatter plot which shows the relationship between two attributes of the data (sepal length on the X-axis vs petal width on the Y-axis). One can also try different combinations of attributes to see how they are related. We have also color-encoded the classes.
with plt.style.context(('ggplot','seaborn')):
    plt.figure(figsize=(15,6))
    plt.subplot(121)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(iris.data[iris.target==i,0], iris.data[iris.target==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Visualizing Dataset')
We'll split the dataset into two parts:
- Training data, which will be used to train the model.
- Test data, against which the accuracy of the trained model will be checked.
The train_test_split function of sklearn's model_selection module will help us split the data into two sets, with 80% for training and 20% for testing. We are also using a seed (random_state=123) with train_test_split so that we always get the same split and can reproduce results in the future as well.
Please make a note that we are also using the stratify parameter, which prevents an unequal distribution of classes between the train and test sets. For each class, we'll have 80% of its samples in the train set and 20% in the test set. This makes sure that we don't have any dominating class in either the train or the test set.
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, train_size=.8, test_size=.2, stratify=iris.target, random_state=123)
print('Train-Test dataset sizes : ',X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)
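As a quick sanity check, we can count the samples per class in each split to confirm that stratification preserved the 80/20 class proportions (a small sketch, assuming the split above has been run).
print('Train class counts : ', np.bincount(Y_train)) ## 40 samples per class expected.
print('Test class counts  : ', np.bincount(Y_test))  ## 10 samples per class expected.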
Logistic regression is a linear model for classification tasks. It can fit binary or multi-class (one-vs-rest) problems. When there are more than 2 output classes, it generates more than one linear boundary, each separating one class from the remaining classes. It should not be confused with linear regression, which is used for supervised regression tasks.
We are initializing the LogisticRegression model below, which is the basic model used extensively for classification tasks. We are initializing it with a seed (random_state=123) to reproduce the same results in the future.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=123)
classifier
We can train the model by passing it the train data and train labels. It also returns the trained classifier object after training.
classifier.fit(X_train,Y_train)
Almost all models in the scikit-learn API provide a predict() method which can be used to predict the target variable for the test set passed to it. Most models also provide a score() method which, in the case of classification models, generally returns accuracy. We'll utilize both methods below to compare results on the test data.
Y_preds = classifier.predict(X_test)
print(Y_preds)
print(Y_test)
print('Accuracy : %.3f'%(Y_preds == Y_test).mean() )
print('Accuracy : %.3f'%classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
The majority of classifiers in scikit-learn also provide the predict_proba() method which can be used to see the probabilities generated by the model for each class of the classification task.
print(classifier.predict_proba(X_test)[:10]) ## It returns probability predicted by model for each class for each example.
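The probabilities in each row sum to 1, and picking the class with the highest probability reproduces predict(); the small sketch below verifies both claims on our test set.
probs = classifier.predict_proba(X_test)
print(probs.sum(axis=1)[:5])                                         ## Each row of probabilities sums to 1.
print(np.all(classifier.classes_[probs.argmax(axis=1)] == Y_preds))  ## Highest-probability class matches predict().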
As we discussed above, logistic regression tries to generate lines through the data to separate classes. We can access the coefficients of those lines through the coef_ and intercept_ attributes of the classifier. In the case of binary classification, only 1 line separating both classes is generated. But in our case, which consists of 3 classes, 3 lines are generated, each separating one class from the other 2 classes.
print('Weight Coefficients : '+str(classifier.coef_))
print('Y-axis Intercept : '+str(classifier.intercept_))
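To see how these coefficients are used, the sketch below recomputes the per-class decision scores by hand as X * coef_.T + intercept_ and checks that the class with the highest score matches predict(), which is how scikit-learn's multi-class logistic regression picks its prediction.
scores = X_test.dot(classifier.coef_.T) + classifier.intercept_  ## One decision score per class for each sample.
manual_preds = classifier.classes_[scores.argmax(axis=1)]        ## Class with the highest score wins.
print(np.all(manual_preds == Y_preds))                           ## Should print True.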
Below we are trying to visualize how our model performed on the test data by plotting a scatter chart of sepal length vs petal width and color-encoding the points by flower class.
with plt.style.context(('ggplot','seaborn')):
    plt.figure(figsize=(12,5))
    plt.subplot(121)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_test==i,0], X_test[Y_test==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Actual')
    plt.subplot(122)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_preds==i,0], X_test[Y_preds==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Prediction');
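As an optional follow-up, we can overlay the misclassified test points on the actual classes to see exactly where the model went wrong (a small sketch using the same plotting style as above).
with plt.style.context(('ggplot','seaborn')):
    misclassified = Y_preds != Y_test  ## Boolean mask of wrongly predicted test samples.
    plt.figure(figsize=(6,5))
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_test==i,0], X_test[Y_test==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.scatter(X_test[misclassified,0], X_test[misclassified,3], facecolors='none', edgecolors='black', s=150, label='misclassified')
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Misclassified Test Samples');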
Below is a list of important hyperparameters that we can tune to get the best estimator for our data (a short usage sketch follows the list):
- penalty - Type of regularization to apply. It accepts l1, l2, elasticnet, and none. elasticnet refers to using both l1 and l2 penalties in proportion. default=l2
- fit_intercept - Whether to include an intercept in the model or not ($y = mx + c$ - here c is referring to the intercept). default=True
- C - Inverse of regularization strength; smaller values mean stronger regularization. default=1.0
- solver - Algorithm used for the optimization problem. default=liblinear
- l1_ratio - If the penalty is elasticnet then this parameter helps in determining the proportion of l1 & l2 penalties. It accepts a float in [0.0, 1.0] or None. l1_ratio=0 is equivalent to using penalty=l2 and l1_ratio=1 is equivalent to using penalty=l1. default=None
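As a small illustration (with arbitrarily chosen values, not tuned ones), these hyperparameters can be set directly when creating the estimator; note that the elasticnet penalty requires the saga solver together with an l1_ratio.
## Example values below are for illustration only.
example_clf = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5,
                                 C=0.5, fit_intercept=True, max_iter=1000, random_state=123)
example_clf.fit(X_train, Y_train)
print('Accuracy : %.3f'%example_clf.score(X_test, Y_test))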
GridSearchCV is a wrapper class provided by sklearn which loops through all combinations of the parameters provided via the param_grid parameter, using the number of cross-validation folds provided via the cv parameter, evaluates model performance on all combinations, and stores all results in the cv_results_ attribute. It also stores the model which performs best across all cross-validation folds in the best_estimator_ attribute and the best score in the best_score_ attribute.
Note: The n_jobs parameter is provided by many estimators. It accepts the number of cores to use for parallelization. If the value -1 is given, then it uses all cores. We are also using %%time, a Jupyter notebook cell magic command which prints the time taken by that cell to complete running. The time will differ between computers based on their configuration.
Below we are trying the liblinear solver for our purpose. We can only use the l1 and l2 penalties with this solver. It works fast for small datasets.
%%time
from sklearn.model_selection import GridSearchCV
params = {'penalty' : ['l1', 'l2'],
          'fit_intercept': [True, False],
          'C': np.linspace(0.1,1.0,10)}
grid = GridSearchCV(LogisticRegression(random_state=1, solver='liblinear', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
The GridSearchCV object maintains all the different parameters tried and the results generated for each split of the data in its cv_results_ attribute as a dictionary. Below we load those cross-validation results into a pandas DataFrame and print the first few entries.
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
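Because cv_results_ is just a dictionary of per-combination results, we can also rank the combinations ourselves; the sketch below sorts the DataFrame by the rank_test_score column that GridSearchCV fills in.
cross_val_results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score').head() ## Best combinations first.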
Below we are trying the saga solver for our purpose. We can use the l1, l2, and elasticnet penalties, or no penalty (none), with this solver. It's the only solver which supports the elasticnet penalty. It works fast for large datasets.
%%time
params = {'penalty' : ['l1', 'l2', 'elasticnet', 'none'],
          'fit_intercept': [True, False],
          'C': np.linspace(0.1,1.0,10),
          'l1_ratio': np.linspace(0.1,1.0,10)}
grid = GridSearchCV(LogisticRegression(random_state=1, solver='saga', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
Below we are trying the sag solver for our purpose. We can only use the l2 penalty or no penalty (none) with this solver. It works fast for large datasets.
%%time
params = {'penalty' : ['l2', 'none'],
          'fit_intercept': [True, False],
          'C': np.linspace(0.1,1.0,10)}
grid = GridSearchCV(LogisticRegression(random_state=1, solver='sag', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
Below we are trying the lbfgs solver for our purpose. We can only use the l2 penalty or no penalty (none) with this solver.
%%time
params = {'penalty' : ['l2', 'none'],
          'fit_intercept': [True, False],
          'C': np.linspace(0.1,1.0,10)}
grid = GridSearchCV(LogisticRegression(random_state=1, solver='lbfgs', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
Below we are trying the newton-cg solver for our purpose. We can only use the l2 penalty or no penalty (none) with this solver.
%%time
params = {'penalty' : ['l2', 'none'],
          'fit_intercept': [True, False],
          'C': np.linspace(0.1,1.0,10)}
grid = GridSearchCV(LogisticRegression(random_state=1, solver='newton-cg', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
K-nearest neighbors is one of the simplest algorithms: it stores all points from the train dataset along with the class each one belongs to. Later on, whenever a new unknown point comes in for prediction, it checks a predefined number of stored points nearest to that new point and assigns the majority class among them to the new point. n_neighbors is used to set the number of neighbors to check when predicting the class of new unseen points.
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
knn_classifier
knn_classifier.fit(X_train,Y_train)
Y_preds = knn_classifier.predict(X_test)
print(Y_preds)
print(Y_test)
print('Accuracy : %.3f'%(Y_preds == Y_test).mean())
print('Accuracy : %.3f'%knn_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print(knn_classifier.predict_proba(X_test)[:10]) ## It returns probability predicted by model for each class for each example.
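To make the neighbor-voting idea concrete, the sketch below uses kneighbors() to retrieve the 5 nearest training points of the first test sample and checks that the majority class among them matches the model's prediction (ties aside).
distances, indices = knn_classifier.kneighbors(X_test[:1])  ## Distances and indices of the 5 nearest train samples.
neighbor_classes = Y_train[indices[0]]                      ## Classes of those neighbors.
majority_class = np.bincount(neighbor_classes).argmax()     ## Majority vote among the neighbors.
print('Neighbor classes :', neighbor_classes)
print('Majority vote    :', majority_class, '| Model prediction :', knn_classifier.predict(X_test[:1])[0])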
with plt.style.context(('ggplot','seaborn')):
    plt.figure(figsize=(12,5))
    plt.subplot(121)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_test==i,0], X_test[Y_test==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Actual')
    plt.subplot(122)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_preds==i,0], X_test[Y_preds==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Prediction');
Below is a list of important hyperparameters that we can tune to get the best estimator for our data:
- n_neighbors - Number of neighbors to use when predicting the class of a new point. default=5
- algorithm - Algorithm used to compute the nearest neighbors. It accepts one of [ball_tree, kd_tree, brute, auto]. default=auto
- leaf_size - Leaf size passed to BallTree or KDTree. default=30
%%time
params = {'n_neighbors' : np.arange(1,10),
          'leaf_size': np.arange(5,50,5),
          'algorithm': ['ball_tree', 'kd_tree', 'brute', 'auto']}
grid = GridSearchCV(KNeighborsClassifier(n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.