Supervised learning is a type of machine learning problem where the model is given target labels which it needs to learn to predict. Classification is a type of supervised learning where an algorithm predicts one output from a list of given classes. It can be a binary classification task, where there are two classes, or a multi-class problem, where there are more than two classes.
In this tutorial, we'll be covering classification problems and will try to solve them using the scikit-learn module. We'll be using LogisticRegression and KNeighborsClassifier for explanation purposes. The dataset we'll be using is the famous Iris flower dataset. It has 4 features, based on which we'll predict the target variable, which is one of the 3 classes of iris flowers.
We'll start by importing scikit-learn and a few supporting libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import warnings
import sys
print("Python Version : ",sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)
warnings.filterwarnings("ignore") ## We'll silence future warnings using this command.
np.set_printoptions(precision=3)
## The below magic function renders plots inside the current notebook.
## There is another option (%matplotlib notebook) which renders interactive plots in the notebook.
%matplotlib inline
Below we are loading the Iris dataset, which comes bundled with the sklearn package. It returns a Bunch object, which behaves almost like a Python dictionary. We'll also print details about the dataset.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
iris = load_iris()
#type(iris) ## The type is a Bunch object, which is almost the same as a Python dictionary.
print('Dataset features names : '+str(iris.feature_names))
print('Dataset features size : '+str(iris.data.shape))
print('Dataset target names : '+str(iris.target_names))
print('Dataset target size : '+str(iris.target.shape))
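Because the Bunch object behaves like a dictionary, the same fields can be accessed either as attributes or as keys. The short sketch below just demonstrates this equivalence and peeks at the bundled dataset description.
print(iris.keys())                        ## Fields available on the Bunch object.
print((iris["data"] == iris.data).all())  ## Key access and attribute access return the same array.
print(iris.DESCR[:200])                   ## First few lines of the dataset description.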
Below we are visualizing our data using a scatter plot which shows the relationship between two attributes of the data (sepal length on the X-axis vs petal width on the Y-axis). One can also try different combinations of attributes to see how they are related. We have also color-encoded the classes.
with plt.style.context(('ggplot','seaborn')):
    plt.figure(figsize=(15,6))
    plt.subplot(121)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(iris.data[iris.target==i,0], iris.data[iris.target==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Visualizing Dataset')
We'll split the dataset into two parts:
- Training data, which will be used to train the model.
- Test data, against which the accuracy of the trained model will be checked.
The train_test_split function of sklearn's model_selection module will help us split the data into two sets, with 80% for training and 20% for testing. We are also using a seed (random_state=123) with train_test_split so that we always get the same split and can reproduce results in the future as well.
Please make a note that we are also using the stratify parameter, which prevents an unequal distribution of classes between the train and test sets. For each class, we'll have 80% of its samples in the train set and 20% in the test set. This makes sure that we don't have any dominating class in either the train or the test set.
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, train_size=.8, test_size=.2, stratify=iris.target, random_state=123)
print('Train-Test dataset sizes : ',X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)
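As a quick sanity check, we can count the samples per class in each split to confirm that stratification preserved the 80/20 class proportions (a small sketch, assuming the split above has been run).
print('Train class counts : ', np.bincount(Y_train)) ## 40 samples per class expected.
print('Test class counts  : ', np.bincount(Y_test))  ## 10 samples per class expected.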
Logistic regression is a linear model for classification tasks. It can fit binary or multi-class (one-vs-rest) problems. When there are more than 2 output classes, it generates more than one linear boundary, each separating one class from the remaining classes. It should not be confused with linear regression, which is used for supervised regression tasks.
We are initializing the LogisticRegression model below, which is the basic model used extensively for classification tasks. We are initializing it with a seed (random_state=123) to reproduce the same results in the future.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=123)
classifier
We can train the model by passing it the train data and train labels. It also returns the trained classifier object after training.
classifier.fit(X_train,Y_train)
Almost all models in the scikit-learn API provide a predict() method which can be used to predict the target variable for the test set passed to it. Most models also provide a score() method which, in the case of classification models, generally returns accuracy. We'll utilize both methods below to compare results on the test data.
Y_preds = classifier.predict(X_test)
print(Y_preds)
print(Y_test)
print('Accuracy : %.3f'%(Y_preds == Y_test).mean() )
print('Accuracy : %.3f'%classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
The majority of classifiers in scikit-learn also provide the predict_proba() method which can be used to see the probabilities generated by the model for each class of the classification task.
print(classifier.predict_proba(X_test)[:10]) ## It returns probability predicted by model for each class for each example.
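The probabilities in each row sum to 1, and picking the class with the highest probability reproduces predict(); the small sketch below verifies both claims on our test set.
probs = classifier.predict_proba(X_test)
print(probs.sum(axis=1)[:5])                                         ## Each row of probabilities sums to 1.
print(np.all(classifier.classes_[probs.argmax(axis=1)] == Y_preds))  ## Highest-probability class matches predict().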
As we discussed above, logistic regression tries to generate lines through the data to separate classes. We can access the coefficients of those lines through the coef_ and intercept_ attributes of the classifier. In the case of binary classification, only 1 line separating both classes is generated. But in our case, which consists of 3 classes, 3 lines are generated, each separating one class from the other 2 classes.
print('Weight Coefficients : '+str(classifier.coef_))
print('Y-axis Intercept : '+str(classifier.intercept_))
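To see how these coefficients are used, the sketch below recomputes the per-class decision scores by hand as X * coef_.T + intercept_ and checks that the class with the highest score matches predict(), which is how scikit-learn's multi-class logistic regression picks its prediction.
scores = X_test.dot(classifier.coef_.T) + classifier.intercept_  ## One decision score per class for each sample.
manual_preds = classifier.classes_[scores.argmax(axis=1)]        ## Class with the highest score wins.
print(np.all(manual_preds == Y_preds))                           ## Should print True.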
Below we are trying to visualize how our model performed on the test data by plotting a scatter chart of sepal length vs petal width and color-encoding the points by flower class.
with plt.style.context(('ggplot','seaborn')):
    plt.figure(figsize=(12,5))
    plt.subplot(121)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_test==i,0], X_test[Y_test==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Actual')
    plt.subplot(122)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_preds==i,0], X_test[Y_preds==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Prediction');
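As an optional follow-up, we can overlay the misclassified test points on the actual classes to see exactly where the model went wrong (a small sketch using the same plotting style as above).
with plt.style.context(('ggplot','seaborn')):
    misclassified = Y_preds != Y_test  ## Boolean mask of wrongly predicted test samples.
    plt.figure(figsize=(6,5))
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_test==i,0], X_test[Y_test==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.scatter(X_test[misclassified,0], X_test[misclassified,3], facecolors='none', edgecolors='black', s=150, label='misclassified')
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Misclassified Test Samples');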
Below is a list of important hyperparameters that we can tune to get the best estimator for our data (a short usage sketch follows the list):
- penalty - Type of regularization to apply. It accepts l1, l2, elasticnet, and none. elasticnet refers to using both l1 and l2 penalties in proportion. default=l2
- fit_intercept - Whether to include an intercept in the model or not ($y = mx + c$ - here c is referring to the intercept). default=True
- C - Inverse of regularization strength; smaller values mean stronger regularization. default=1.0
- solver - Algorithm used for the optimization problem. default=liblinear
- l1_ratio - If the penalty is elasticnet then this parameter helps in determining the proportion of l1 & l2 penalties. It accepts a float in [0.0, 1.0] or None. l1_ratio=0 is equivalent to using penalty=l2 and l1_ratio=1 is equivalent to using penalty=l1. default=None
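As a small illustration (with arbitrarily chosen values, not tuned ones), these hyperparameters can be set directly when creating the estimator; note that the elasticnet penalty requires the saga solver together with an l1_ratio.
## Example values below are for illustration only.
example_clf = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5,
                                 C=0.5, fit_intercept=True, max_iter=1000, random_state=123)
example_clf.fit(X_train, Y_train)
print('Accuracy : %.3f'%example_clf.score(X_test, Y_test))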
GridSearchCV is a wrapper class provided by sklearn which loops through all combinations of the parameters provided via the param_grid parameter, using the number of cross-validation folds provided via the cv parameter, evaluates model performance on all combinations, and stores all results in the cv_results_ attribute. It also stores the model which performs best across all cross-validation folds in the best_estimator_ attribute and the best score in the best_score_ attribute.
Note: The n_jobs parameter is provided by many estimators. It accepts the number of cores to use for parallelization. If the value -1 is given, then it uses all cores. We are also using %%time, a Jupyter notebook cell magic command which prints the time taken by that cell to complete running. The time will differ between computers based on their configuration.
Below we are trying the liblinear solver for our purpose. We can only use the l1 and l2 penalties with this solver. It works fast for small datasets.
%%time
from sklearn.model_selection import GridSearchCV
params = {'penalty' : ['l1', 'l2'],
          'fit_intercept': [True, False],
          'C': np.linspace(0.1,1.0,10)}
grid = GridSearchCV(LogisticRegression(random_state=1, solver='liblinear', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
The GridSearchCV object maintains all the different parameters tried and the results generated for each split of the data in its cv_results_ attribute as a dictionary. Below we load those cross-validation results into a pandas DataFrame and print the first few entries.
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
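Because cv_results_ is just a dictionary of per-combination results, we can also rank the combinations ourselves; the sketch below sorts the DataFrame by the rank_test_score column that GridSearchCV fills in.
cross_val_results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score').head() ## Best combinations first.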
Below we are trying the saga solver for our purpose. We can use the l1, l2, and elasticnet penalties, or no penalty (none), with this solver. It's the only solver which supports the elasticnet penalty. It works fast for large datasets.
%%time
params = {'penalty' : ['l1', 'l2', 'elasticnet', 'none'],
          'fit_intercept': [True, False],
          'C': np.linspace(0.1,1.0,10),
          'l1_ratio': np.linspace(0.1,1.0,10)}
grid = GridSearchCV(LogisticRegression(random_state=1, solver='saga', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
Below we are trying the sag solver for our purpose. We can only use the l2 penalty or no penalty (none) with this solver. It works fast for large datasets.
%%time
params = {'penalty' : ['l2', 'none'],
          'fit_intercept': [True, False],
          'C': np.linspace(0.1,1.0,10)}
grid = GridSearchCV(LogisticRegression(random_state=1, solver='sag', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
Below we are trying the lbfgs solver for our purpose. We can only use the l2 penalty or no penalty (none) with this solver.
%%time
params = {'penalty' : ['l2', 'none'],
          'fit_intercept': [True, False],
          'C': np.linspace(0.1,1.0,10)}
grid = GridSearchCV(LogisticRegression(random_state=1, solver='lbfgs', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
Below we are trying the newton-cg solver for our purpose. We can only use the l2 penalty or no penalty (none) with this solver.
%%time
params = {'penalty' : ['l2', 'none'],
          'fit_intercept': [True, False],
          'C': np.linspace(0.1,1.0,10)}
grid = GridSearchCV(LogisticRegression(random_state=1, solver='newton-cg', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
K-nearest neighbors is one of the simplest algorithms: it stores all points from the train dataset along with the class each one belongs to. Later on, whenever a new unknown point comes in for prediction, it checks a predefined number of stored points nearest to that new point and assigns the majority class among them to the new point. n_neighbors is used to set the number of neighbors to check when predicting the class of new unseen points.
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
knn_classifier
knn_classifier.fit(X_train,Y_train)
Y_preds = knn_classifier.predict(X_test)
print(Y_preds)
print(Y_test)
print('Accuracy : %.3f'%(Y_preds == Y_test).mean())
print('Accuracy : %.3f'%knn_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print(knn_classifier.predict_proba(X_test)[:10]) ## It returns probability predicted by model for each class for each example.
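To make the neighbor-voting idea concrete, the sketch below uses kneighbors() to retrieve the 5 nearest training points of the first test sample and checks that the majority class among them matches the model's prediction (ties aside).
distances, indices = knn_classifier.kneighbors(X_test[:1])  ## Distances and indices of the 5 nearest train samples.
neighbor_classes = Y_train[indices[0]]                      ## Classes of those neighbors.
majority_class = np.bincount(neighbor_classes).argmax()     ## Majority vote among the neighbors.
print('Neighbor classes :', neighbor_classes)
print('Majority vote    :', majority_class, '| Model prediction :', knn_classifier.predict(X_test[:1])[0])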
with plt.style.context(('ggplot','seaborn')):
    plt.figure(figsize=(12,5))
    plt.subplot(121)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_test==i,0], X_test[Y_test==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Actual')
    plt.subplot(122)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_preds==i,0], X_test[Y_preds==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Prediction');
Below is a list of important hyperparameters that we can tune to get the best estimator for our data:
- n_neighbors - Number of neighbors to use when predicting the class of a new point. default=5
- algorithm - Algorithm used to compute the nearest neighbors. It accepts one of [ball_tree, kd_tree, brute, auto]. default=auto
- leaf_size - Leaf size passed to BallTree or KDTree. default=30
%%time
params = {'n_neighbors' : np.arange(1,10),
          'leaf_size': np.arange(5,50,5),
          'algorithm': ['ball_tree', 'kd_tree', 'brute', 'auto']}
grid = GridSearchCV(KNeighborsClassifier(n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.