Scikit-Learn - Naive Bayes


Introduction

Naive Bayes estimators are probabilistic estimators based on Bayes' theorem with the "naive" assumption of strong (conditional) independence between features. Bayes' theorem helps us find the probability of an event occurring based on prior knowledge of conditions that can be related to the event. Naive Bayes classifiers have worked quite well for document classification and spam filtering applications. They require only a small amount of training data to estimate the probabilities needed for Bayes' theorem and are therefore quite fast to train.
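As a quick refresher (a generic sketch of the math, not tied to any particular scikit-learn class), the naive independence assumption lets Bayes' theorem reduce to the following decision rule, where y is a class and x_1, ..., x_n are the features:

P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y), \qquad \hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)

The estimators below differ only in how they model P(x_i | y).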

Scikit-Learn provides 4 naive Bayes estimators, which differ in the assumption they make about the distribution of a feature's values given a class:

  • BernoulliNB - It represents a classifier based on multivariate Bernoulli distributions. Data can have multiple features, but each one is assumed to be a binary (boolean) variable.
  • GaussianNB - It represents a classifier based on the assumption that the likelihood of the features is Gaussian.
  • ComplementNB - It represents a classifier that uses the complement of each class to compute model weights. It's an adaptation of the standard multinomial naive Bayes that is well suited for imbalanced classification problems.
  • MultinomialNB - It represents a classifier that is suited for multinomially distributed data, such as word counts in text classification.

We'll be explaining the usage of each one of the naive Bayes variants with examples.

We'll start by importing the necessary libraries for our tutorial.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

np.set_printoptions(precision=2)

%matplotlib inline

Load Dataset

We'll be using the digits dataset for explanation purposes. It contains 8x8 pixel images of the digits 0-9, with each sample image flattened into a vector of size 64.

In [2]:
from sklearn.datasets import load_digits

digits = load_digits()
X_digits, Y_digits = digits.data, digits.target
print('Dataset Size : ', X_digits.shape, Y_digits.shape)
Dataset Size :  (1797, 64) (1797,)
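To get a feel for what one sample looks like (an optional check; the variable name below is just for illustration), we can reshape the 64-length vector back into its 8x8 image form and display it:

In [ ]:
sample_image = X_digits[0].reshape(8, 8)  ## reshape the flattened 64-pixel vector back to 8x8
plt.imshow(sample_image, cmap=plt.cm.gray_r)
plt.title('Label : %d' % Y_digits[0]);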

Splitting Data Into Train/Test Sets

We'll split the dataset into two parts:

  • Training data, which will be used to train the model.
  • Test data, against which the accuracy of the trained model will be checked.

The train_test_split function of sklearn's model_selection module will help us split the data into two sets, with 80% for training and 20% for testing. We also pass a seed (random_state=123) to train_test_split so that we always get the same split and can reproduce results in the future.


NOTE

Please make a note that we are also using the stratify parameter, which prevents an unequal distribution of classes between the train and test sets. For each class, 80% of its samples go to the train set and 20% to the test set. This makes sure that no class dominates either the train or the test set.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_digits, Y_digits, train_size=0.80, test_size=0.20, stratify=Y_digits, random_state=123)
print('Train/Test Sizes : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
Train/Test Sizes :  (1437, 64) (360, 64) (1437,) (360,)
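If you want to verify the stratified split yourself (an optional sanity check), the per-class sample counts should be roughly proportional in both sets:

In [ ]:
print('Train class counts : ', np.bincount(Y_train))
print('Test class counts  : ', np.bincount(Y_test))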

BernoulliNB

The first estimator that we'll introduce is BernoulliNB, available from the naive_bayes module of sklearn. We'll first fit it to the data with default parameters and then try to improve its performance by doing hyperparameter tuning. We'll also evaluate its performance using a confusion matrix and point out important attributes of BernoulliNB which can give helpful insight once the model is trained.

Fitting Default Model To Train Data

We'll fit the model to the train data using the estimator's fit() method, passing it the train features and train labels. We are fitting a default model, without setting any parameters explicitly.

In [4]:
from sklearn.naive_bayes import BernoulliNB

bernoulli_nb =  BernoulliNB()
bernoulli_nb.fit(X_train, Y_train)
Out[4]:
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

Evaluating Trained Model On Test Data.

Almost all models in the Scikit-Learn API provide a predict() method which can be used to predict the target variable for the test set passed to it.

In [5]:
Y_preds = bernoulli_nb.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%bernoulli_nb.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%bernoulli_nb.score(X_train, Y_train))
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
Test Accuracy : 0.875
Training Accuracy : 0.864
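The score() call above is just a shortcut for computing accuracy explicitly; the same number can be obtained with accuracy_score from the metrics module (shown here only as a cross-check):

In [ ]:
from sklearn.metrics import accuracy_score

print('Test Accuracy : %.3f' % accuracy_score(Y_test, Y_preds))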

Plotting Confusion Matrix

We'll be plotting the confusion matrix to better understand the performance of our model. We have designed the method plot_confusion_matrix() which accepts original labels and predicted labels of data. It then plots a confusion matrix. We'll be reusing this method in the future as well when training other estimators.

In [6]:
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(Y_test, Y_preds):
    conf_mat = confusion_matrix(Y_test, Y_preds)
    fig = plt.figure(figsize=(6,6))
    plt.matshow(conf_mat, cmap=plt.cm.Blues, fignum=1)
    plt.yticks(range(10), range(10))
    plt.xticks(range(10), range(10))
    plt.colorbar();
    ## Annotate each cell with its count (x indexes columns, y indexes rows of the matrix).
    for i in range(10):
        for j in range(10):
            plt.text(i-0.2, j+0.1, str(conf_mat[j, i]), color='tab:red')
In [ ]:
plot_confusion_matrix(Y_test, bernoulli_nb.predict(X_test))

[Output: confusion matrix plot for the BernoulliNB predictions]
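As an aside, if you are on a reasonably recent scikit-learn release (roughly 1.0 or newer), the built-in ConfusionMatrixDisplay can produce a similar plot without a custom helper; we keep our own function in this tutorial so it also works on older versions:

In [ ]:
## Alternative to plot_confusion_matrix(); requires a recent scikit-learn version.
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(Y_test, bernoulli_nb.predict(X_test), cmap=plt.cm.Blues);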

Important Attributes of BernoulliNB

Below is a list of important attributes available through the estimator instance of BernoulliNB.

  • class_log_prior_ - It represents the log probability of each class.
  • feature_log_prob_ - It represents the log probability of each feature given a class. (n_classes x n_features)
In [8]:
bernoulli_nb.class_log_prior_
Out[8]:
array([-2.31, -2.29, -2.31, -2.29, -2.29, -2.29, -2.29, -2.31, -2.34,
       -2.3 ])
In [9]:
print("Log Probability of Each Feature per class : ", bernoulli_nb.feature_log_prob_.shape)
Log Probability of Each Feature per class :  (10, 64)
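Since these are log probabilities, exponentiating them recovers the actual class priors; as a quick check, they should sum to roughly 1:

In [ ]:
print(np.exp(bernoulli_nb.class_log_prior_))
print(np.exp(bernoulli_nb.class_log_prior_).sum())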

Finetuning Model By Doing Grid Search On Various Hyperparameters.

Below is a list of hyperparameters that commonly need tuning to get the best fit for our data. We'll try various hyperparameter settings on the train data to find the best fit, i.e. one that gives roughly the same accuracy on both the train and test sets, or only a small gap between them.

  • alpha - It accepts a float value representing the additive smoothing parameter. A value of 0.0 means no smoothing. The default value of this parameter is 1.0. (See the sketch of the smoothed estimate just below this list.)
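For a binary feature, this additive smoothing roughly amounts to the following estimate (a sketch of the usual Laplace/Lidstone formula, where N_{y,i} is the number of class-y training samples with feature i present and N_y is the total number of class-y samples):

P(x_i = 1 \mid y) = \frac{N_{y,i} + \alpha}{N_y + 2\alpha}

With \alpha = 0 this is just the empirical frequency; larger values pull all probabilities toward 1/2.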

GridSearchCV

It's a wrapper class provided by sklearn that loops through all combinations of the parameters provided via the param_grid parameter, evaluates model performance on each combination using the number of cross-validation folds given by the cv parameter, and stores all results in its cv_results_ attribute. It also stores the model that performs best across the cross-validation folds in the best_estimator_ attribute and the best score in the best_score_ attribute.


NOTE

The n_jobs parameter is provided by many estimators. It accepts the number of cores to use for parallelization. If a value of -1 is given, it uses all cores. It relies on the joblib parallel processing library to run things in parallel in the background.

Below we'll try various values for the above-mentioned hyperparameter to find the best estimator for our dataset, using 5-fold cross-validation.

In [10]:
%%time

from sklearn.model_selection import GridSearchCV

params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0],
         }

bernoulli_nb_grid = GridSearchCV(BernoulliNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)
bernoulli_nb_grid.fit(X_digits,Y_digits)

print('Train Accuracy : %.3f'%bernoulli_nb_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%bernoulli_nb_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%bernoulli_nb_grid.best_score_)
print('Best Parameters : ',bernoulli_nb_grid.best_params_)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
Train Accuracy : 0.869
Test Accuracy : 0.883
Best Accuracy Through Grid Search : 0.825
Best Parameters :  {'alpha': 0.01}
CPU times: user 128 ms, sys: 67.1 ms, total: 195 ms
Wall time: 2.48 s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    2.3s finished
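GridSearchCV keeps the full per-combination results in its cv_results_ attribute; loading them into a pandas DataFrame is a convenient (optional) way to inspect them:

In [ ]:
results_df = pd.DataFrame(bernoulli_nb_grid.cv_results_)
print(results_df[['param_alpha', 'mean_test_score', 'std_test_score', 'rank_test_score']])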

Plotting Confusion Matrix

Below we are plotting the confusion matrix again with the best estimator that we found using grid search.

In [ ]:
plot_confusion_matrix(Y_test, bernoulli_nb_grid.best_estimator_.predict(X_test))

[Output: confusion matrix plot for the best BernoulliNB estimator found by grid search]

GaussianNB

The second estimator that we'll introduce is GaussianNB, available from the naive_bayes module of sklearn. We'll first fit it to the data with default parameters and then evaluate its performance using a confusion matrix. We'll also point out important attributes of GaussianNB which can give helpful insight once the model is trained.

Fitting Default Model To Train Data

In [12]:
from sklearn.naive_bayes import GaussianNB

gaussian_nb = GaussianNB()
gaussian_nb.fit(X_train, Y_train)
Out[12]:
GaussianNB(priors=None, var_smoothing=1e-09)

Evaluating Trained Model On Test Data.

Almost all models in the Scikit-Learn API provide a predict() method which can be used to predict the target variable for the test set passed to it.

In [13]:
Y_preds = gaussian_nb.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%gaussian_nb.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%gaussian_nb.score(X_train, Y_train))
[5 9 9 6 1 6 6 9 7 7 4 2 1 4 3]
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
Test Accuracy : 0.856
Training Accuracy : 0.875

Plotting Confusion Matrix

In [ ]:
plot_confusion_matrix(Y_test, gaussian_nb.predict(X_test))

[Output: confusion matrix plot for the GaussianNB predictions]

Important Attributes of GaussianNB

Below is a list of important attributes available through the estimator instance of GaussianNB.

  • class_prior_ - It represents the probability of each class.
  • epsilon_ - It represents the absolute additive value added to the variances (for numerical stability).
  • sigma_ - It represents the variance of each feature per class. (n_classes x n_features)
  • theta_ - It represents the mean of each feature per class. (n_classes x n_features)
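These two attributes are exactly the parameters of the per-class Gaussian likelihood that GaussianNB evaluates (the standard formula, with theta_ supplying the means and sigma_ the variances):

P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}} \exp\left(-\frac{(x_i - \theta_{y,i})^{2}}{2\sigma_{y,i}^{2}}\right)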
In [15]:
gaussian_nb.class_prior_
Out[15]:
array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
In [16]:
gaussian_nb.epsilon_
Out[16]:
4.217280259413155e-08
In [17]:
print("Gaussian Naive Bayes Sigma Shape : ", gaussian_nb.sigma_.shape)
Gaussian Naive Bayes Sigma Shape :  (10, 64)
In [18]:
print("Gaussian Naive Bayes Theta Shape : ", gaussian_nb.theta_.shape)
Gaussian Naive Bayes Theta Shape :  (10, 64)

ComplementNB

The third estimator that we'll introduce is ComplementNB, available from the naive_bayes module of sklearn. We'll first fit it to the data with default parameters and then try to improve its performance by doing hyperparameter tuning. We'll also evaluate its performance using a confusion matrix and point out important attributes of ComplementNB which can give helpful insight once the model is trained.

Fitting Default Model To Train Data

In [19]:
from sklearn.naive_bayes import ComplementNB

complement_nb = ComplementNB()
complement_nb.fit(X_train, Y_train)
Out[19]:
ComplementNB(alpha=1.0, class_prior=None, fit_prior=True, norm=False)

Evaluating Trained Model On Test Data.

Almost all models in the Scikit-Learn API provide a predict() method which can be used to predict the target variable for the test set passed to it.

In [20]:
Y_preds = complement_nb.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%complement_nb.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%complement_nb.score(X_train, Y_train))
[5 9 2 6 1 6 6 9 1 7 4 2 1 4 3]
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
Test Accuracy : 0.822
Training Accuracy : 0.823
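All the naive Bayes estimators also expose a predict_proba() method, which returns per-class probabilities rather than just the predicted label; this can be useful for seeing where the model is uncertain (an optional look, not used elsewhere in the tutorial):

In [ ]:
probs = complement_nb.predict_proba(X_test[:3])  ## one row per sample, one column per class (0-9)
print(probs.round(2))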

Plotting Confusion Matrix

In [ ]:
plot_confusion_matrix(Y_test, complement_nb.predict(X_test))

[Output: confusion matrix plot for the ComplementNB predictions]

Important Attributes of ComplementNB

Below is a list of important attributes available through the estimator instance of ComplementNB.

  • class_log_prior_ - It represents the log probability of each class.
  • feature_log_prob_ - It represents the log probability of each feature given a class. (n_classes x n_features)
In [22]:
complement_nb.class_log_prior_
Out[22]:
array([-2.31, -2.29, -2.31, -2.29, -2.29, -2.29, -2.29, -2.31, -2.34,
       -2.3 ])
In [23]:
print("Log Probability of Each Feature per class : ", complement_nb.feature_log_prob_.shape)
Log Probability of Each Feature per class :  (10, 64)

Finetuning Model By Doing Grid Search On Various Hyperparameters.

Below is a list of hyperparameters that commonly need tuning to get the best fit for our data. We'll try various hyperparameter settings on the train data to find the best fit, i.e. one that gives roughly the same accuracy on both the train and test sets, or only a small gap between them.

  • alpha - It accepts a float value representing the additive smoothing parameter. A value of 0.0 means no smoothing. The default value of this parameter is 1.0.

Below we'll try various values for the above-mentioned hyperparameter to find the best estimator for our dataset, using 5-fold cross-validation.

In [24]:
%%time

params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0, ],
         }

complement_nb_grid = GridSearchCV(ComplementNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)
complement_nb_grid.fit(X_digits,Y_digits)

print('Train Accuracy : %.3f'%complement_nb_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%complement_nb_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%complement_nb_grid.best_score_)
print('Best Parameters : ',complement_nb_grid.best_params_)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
Train Accuracy : 0.818
Test Accuracy : 0.839
Best Accuracy Through Grid Search : 0.795
Best Parameters :  {'alpha': 10.0}
CPU times: user 81 ms, sys: 12.3 ms, total: 93.3 ms
Wall time: 156 ms
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  25 | elapsed:    0.1s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    0.1s finished

Plotting Confusion Matrix

Below we are plotting the confusion matrix again with the best estimator that we found using grid search.

In [ ]:
plot_confusion_matrix(Y_test, complement_nb_grid.best_estimator_.predict(X_test))

[Output: confusion matrix plot for the best ComplementNB estimator found by grid search]

MultinomialNB

The last estimator that we'll introduce is MultinomialNB, available from the naive_bayes module of sklearn. We'll first fit it to the data with default parameters and then try to improve its performance by doing hyperparameter tuning. We'll also evaluate its performance using a confusion matrix and point out important attributes of MultinomialNB which can give helpful insight once the model is trained.

Fitting Default Model To Train Data

In [26]:
from sklearn.naive_bayes import MultinomialNB

multinomial_nb = MultinomialNB()
multinomial_nb.fit(X_train, Y_train)
Out[26]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Evaluating Trained Model On Test Data.

Almost all models in the Scikit-Learn API provide a predict() method which can be used to predict the target variable for the test set passed to it.

In [27]:
Y_preds = multinomial_nb.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%multinomial_nb.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%multinomial_nb.score(X_train, Y_train))
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
Test Accuracy : 0.917
Training Accuracy : 0.900

Plotting Confusion Matrix

In [ ]:
plot_confusion_matrix(Y_test, multinomial_nb.predict(X_test))

[Output: confusion matrix plot for the MultinomialNB predictions]

Important Attributes of MultinomialNB

Below is a list of important attributes available through the estimator instance of MultinomialNB.

  • class_log_prior_ - It represents the log probability of each class.
  • feature_log_prob_ - It represents the log probability of each feature given a class. (n_classes x n_features)
In [29]:
multinomial_nb.class_log_prior_
Out[29]:
array([-2.31, -2.29, -2.31, -2.29, -2.29, -2.29, -2.29, -2.31, -2.34,
       -2.3 ])
In [30]:
print("Log Probability of Each Feature per class : ", multinomial_nb.feature_log_prob_.shape)
Log Probability of Each Feature per class :  (10, 64)
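As a rough way to interpret feature_log_prob_ (just an illustration; the indices are positions in the flattened 8x8 image), we can look up which pixels carry the highest weight for a particular class:

In [ ]:
## Indices of the 5 pixels with the highest log probability for digit 0.
print(np.argsort(multinomial_nb.feature_log_prob_[0])[-5:])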

Finetuning Model By Doing Grid Search On Various Hyperparameters.

Below is a list of hyperparameters that commonly need tuning to get the best fit for our data. We'll try various hyperparameter settings on the train data to find the best fit, i.e. one that gives roughly the same accuracy on both the train and test sets, or only a small gap between them.

  • alpha - It accepts a float value representing the additive smoothing parameter. A value of 0.0 means no smoothing. The default value of this parameter is 1.0.

Below we'll try various values for the above-mentioned hyperparameter to find the best estimator for our dataset, using 5-fold cross-validation.

In [31]:
%%time

params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0, ],
         }

multinomial_nb_grid = GridSearchCV(MultinomialNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)
multinomial_nb_grid.fit(X_digits,Y_digits)

print('Train Accuracy : %.3f'%multinomial_nb_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%multinomial_nb_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%multinomial_nb_grid.best_score_)
print('Best Parameters : ',multinomial_nb_grid.best_params_)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
Train Accuracy : 0.903
Test Accuracy : 0.922
Best Accuracy Through Grid Search : 0.875
Best Parameters :  {'alpha': 10.0}
CPU times: user 68.4 ms, sys: 12.1 ms, total: 80.5 ms
Wall time: 160 ms
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  25 | elapsed:    0.1s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    0.1s finished

Plotting Confusion Matrix

Below we are plotting the confusion matrix again with the best estimator that we found using grid search.

In [ ]:
plot_confusion_matrix(Y_test, multinomial_nb_grid.best_estimator_.predict(X_test))

[Output: confusion matrix plot for the best MultinomialNB estimator found by grid search]

This ends our small tutorial introducing the various naive Bayes implementations available with scikit-learn. Please feel free to let us know your views in the comments section.

Sunny Solanki