# Scikit-Learn - Naive Bayes

Updated On: May-26-2020 | Tags: sklearn, naive-bayes

## Introduction

Naive Bayes estimators are probabilistic models based on `Bayes' theorem`, with the "naive" assumption of strong (conditional) independence between features. Bayes' theorem lets us compute the probability of an event based on prior knowledge of conditions related to that event. Naive Bayes classifiers work quite well for document classification and spam filtering applications. They require only a small amount of training data to estimate the probabilities needed by Bayes' theorem and are therefore quite fast.
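
As a concrete illustration, here's a tiny sketch of Bayes' theorem applied to a hypothetical spam-filter setting; all the probability values below are made up for demonstration:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical spam-filter numbers (for illustration only):
p_spam = 0.2                 # prior: P(spam)
p_word_given_spam = 0.6      # likelihood: P(word "offer" | spam)
p_word_given_ham = 0.05      # likelihood: P(word "offer" | not spam)

# Total probability of seeing the word "offer" in any email.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior: probability the email is spam given it contains "offer".
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.75
```

A naive Bayes classifier applies this same update once per feature, multiplying the per-feature likelihoods together under the independence assumption.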

Scikit-Learn provides 4 naive Bayes estimators, each differing in the assumption it makes about the distribution of feature values within a class:

• BernoulliNB - It represents a classifier based on multivariate Bernoulli distributions. Data can have multiple features, but each is assumed to be a binary (boolean) variable.
• GaussianNB - It represents a classifier that assumes the likelihood of each feature is Gaussian (normally distributed).
• ComplementNB - It represents a classifier that uses the complement of each class to compute model weights. It's an adaptation of multinomial naive Bayes that is well suited for imbalanced classification problems.
• MultinomialNB - It represents a classifier suited for multinomially distributed data, such as word counts in text classification.

We'll be explaining the usage of each one of the naive Bayes variants with examples.

We'll start by importing the necessary libraries for our tutorial.

In :
```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

np.set_printoptions(precision=2)

%matplotlib inline
```

We'll be using the digits dataset for explanation purposes. It contains samples of the digits 0-9 as `8x8` pixel images, with each sample image flattened to a vector of size `64`.

In :
```from sklearn.datasets import load_digits

digits = load_digits()

X_digits, Y_digits = digits.data, digits.target
print('Dataset Size : ', X_digits.shape, Y_digits.shape)
```
```Dataset Size :  (1797, 64) (1797,)
```
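
To make the `8x8` layout concrete, here's a minimal self-contained sketch that reshapes the first flattened 64-element sample back into its pixel grid:

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each flattened 64-element sample can be reshaped back to its 8x8 pixel grid.
first_image = digits.data[0].reshape(8, 8)
print(first_image.shape)   # (8, 8)
print(digits.target[0])    # label of the first sample: 0
```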

### Splitting Data Into Train/Test Sets

We'll split the dataset into two parts:

• `Training data` which will be used to train the model.
• `Test data` against which accuracy of the trained model will be checked.

The `train_test_split` function of sklearn's `model_selection` module will help us split the data into two sets, with `80%` for training and `20%` for testing. We also pass a seed (`random_state=123`) to train_test_split so that we always get the same split and can reproduce the results in the future.

NOTE

Please make a note that we are also using the stratify parameter, which prevents an unequal distribution of classes between the train and test sets. For each class, we'll have 80% of its samples in the train set and 20% in the test set. This ensures that no class dominates either the train or the test set.

In :
```from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_digits, Y_digits, train_size=0.80, test_size=0.20, stratify=Y_digits, random_state=123)
print('Train/Test Sizes : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
```
```Train/Test Sizes :  (1437, 64) (360, 64) (1437,) (360,)
```
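
As a quick sanity check of the stratified split described above, the following self-contained sketch recomputes the split and counts per-class samples; each digit should keep roughly an 80/20 train/test ratio:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, Y = load_digits(return_X_y=True)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X, Y, train_size=0.80, test_size=0.20, stratify=Y, random_state=123)

# With stratify=Y, every digit class is split roughly 80/20.
train_counts = np.bincount(Y_tr)
test_counts = np.bincount(Y_te)
print(train_counts)
print(test_counts)
```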

## BernoulliNB

The first estimator that we'll be introducing is `BernoulliNB`, available in the `naive_bayes` module of sklearn. We'll first fit it to the data with default parameters and then try to improve its performance through hyperparameter tuning. We'll also evaluate its performance using a confusion matrix and point out important attributes of `BernoulliNB` that can give helpful insight once the model is trained.

### Fitting Default Model To Train Data

We'll fit the model to the train data using the estimator's `fit()` method, passing it the train features and train labels. We are fitting a default model without setting any parameters explicitly.

In :
```from sklearn.naive_bayes import BernoulliNB

bernoulli_nb =  BernoulliNB()
bernoulli_nb.fit(X_train, Y_train)
```
Out:
`BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)`

### Evaluating Trained Model On Test Data

Almost all models in the Scikit-Learn API provide a `predict()` method, which can be used to predict the target variable on the test set passed to it.

In :
```Y_preds = bernoulli_nb.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%bernoulli_nb.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%bernoulli_nb.score(X_train, Y_train))
```
```[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
Test Accuracy : 0.875
Training Accuracy : 0.864
```
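
Besides hard predictions, the estimator also exposes `predict_proba()`, which returns one probability per class for every sample; `predict()` simply picks the class with the highest probability. A self-contained sketch refitting a default `BernoulliNB` on the same split:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

X, Y = load_digits(return_X_y=True)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X, Y, train_size=0.80, test_size=0.20, stratify=Y, random_state=123)

model = BernoulliNB().fit(X_tr, Y_tr)

# One probability per class for each sample; each row sums to 1.
probs = model.predict_proba(X_te[:5])
print(probs.shape)                        # (5, 10)
print(np.allclose(probs.sum(axis=1), 1))  # True
```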

### Plotting Confusion Matrix

We'll plot the confusion matrix to better understand the performance of our model. We have designed the method `plot_confusion_matrix()`, which accepts the original labels and predicted labels of the data and plots a confusion matrix. We'll reuse this method later when training the other estimators.

In :
```from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(Y_test, Y_preds):
    conf_mat = confusion_matrix(Y_test, Y_preds)
    fig = plt.figure(figsize=(6, 6))
    plt.matshow(conf_mat, cmap=plt.cm.Blues, fignum=1)
    plt.yticks(range(10), range(10))
    plt.xticks(range(10), range(10))
    plt.colorbar()
    for i in range(10):
        for j in range(10):
            plt.text(i - 0.2, j + 0.1, str(conf_mat[j, i]), color='tab:red')
```
In [ ]:
```plot_confusion_matrix(Y_test, bernoulli_nb.predict(X_test))
```

### Important Attributes of BernoulliNB

Below is a list of important attributes available through an estimator instance of BernoulliNB.

• `class_log_prior_` - It represents the log probability of each class.
• `feature_log_prob_` - It represents the log probability of each feature conditioned on class. `(n_classes x n_features)`
In :
```bernoulli_nb.class_log_prior_
```
Out:
```array([-2.31, -2.29, -2.31, -2.29, -2.29, -2.29, -2.29, -2.31, -2.34,
-2.3 ])```
In :
```print("Log Probability of Each Feature per class : ", bernoulli_nb.feature_log_prob_.shape)
```
```Log Probability of Each Feature per class :  (10, 64)
```
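
Since `class_log_prior_` stores log probabilities, exponentiating it recovers the class priors, which should sum to 1. A self-contained sketch fitting on the full dataset:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.naive_bayes import BernoulliNB

X, Y = load_digits(return_X_y=True)
model = BernoulliNB().fit(X, Y)

# exp(log prior) = prior; the 10 class priors must sum to 1.
priors = np.exp(model.class_log_prior_)
print(np.round(priors, 3))
print(round(priors.sum(), 6))  # 1.0
```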

### Fine-Tuning Model By Doing Grid Search On Various Hyperparameters

Below is a list of hyperparameters that commonly need tuning to get the best fit for our data. We'll try various hyperparameter settings on various train/test splits to find the best fit, one with nearly the same accuracy on both the train and test sets, or at least a small difference between them.

• alpha - It accepts a float value representing the additive smoothing parameter. A value of `0.0` means no smoothing. The default is `1.0`.
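
To see what alpha does, here's a minimal arithmetic sketch of additive smoothing for a single binary feature, using the Bernoulli form `(count + alpha) / (total + 2 * alpha)`; the counts below are made up for illustration:

```python
# A feature that is never "on" in some class: without smoothing its
# estimated probability is exactly 0, which zeroes out the whole
# likelihood product. alpha > 0 avoids that.
count_1, count_total = 0, 100

smoothed = []
for alpha in (0.0, 1.0, 10.0):
    smoothed.append((count_1 + alpha) / (count_total + 2 * alpha))
print(smoothed)  # [0.0, ~0.0098, ~0.0833]
```

Larger alpha pulls every probability toward 0.5, so it trades variance for bias; grid search below picks the value that balances the two for this dataset.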

### GridSearchCV

It's a wrapper class provided by sklearn that loops through all parameter combinations given in the `param_grid` parameter, using the number of cross-validation folds given by the `cv` parameter, evaluates model performance on each combination, and stores all results in the `cv_results_` attribute. It also stores the model that performed best across the cross-validation folds in the `best_estimator_` attribute and the best score in the `best_score_` attribute.

NOTE

The n_jobs parameter is provided by many estimators. It accepts the number of CPU cores to use for parallelization. If a value of -1 is given, it uses all cores. It relies on the joblib parallel processing library to run jobs in parallel in the background.

We'll now try various values for the above-mentioned hyperparameter to find the best estimator for our dataset using `5-fold cross-validation`.

In :
```%%time

from sklearn.model_selection import GridSearchCV

params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0],
}

bernoulli_nb_grid = GridSearchCV(BernoulliNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)
bernoulli_nb_grid.fit(X_digits,Y_digits)

print('Train Accuracy : %.3f'%bernoulli_nb_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%bernoulli_nb_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%bernoulli_nb_grid.best_score_)
print('Best Parameters : ',bernoulli_nb_grid.best_params_)
```
```Fitting 5 folds for each of 5 candidates, totalling 25 fits
```
```[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
```
```Train Accuracy : 0.869
Test Accuracy : 0.883
Best Accuracy Through Grid Search : 0.825
Best Parameters :  {'alpha': 0.01}
CPU times: user 128 ms, sys: 67.1 ms, total: 195 ms
Wall time: 2.48 s
```
```[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    2.3s finished
```

### Plotting Confusion Matrix

Below we are plotting the confusion matrix again with the best estimator that we found using grid search.

In [ ]:
```plot_confusion_matrix(Y_test, bernoulli_nb_grid.best_estimator_.predict(X_test))
```

## GaussianNB

The next estimator that we'll be introducing is `GaussianNB`, available in the `naive_bayes` module of sklearn. We'll first fit it to the data with default parameters and then evaluate its performance using a confusion matrix. We'll also point out important attributes of `GaussianNB` that can give helpful insight once the model is trained.

### Fitting Default Model To Train Data

In :
```from sklearn.naive_bayes import GaussianNB

gaussian_nb = GaussianNB()
gaussian_nb.fit(X_train, Y_train)
```
Out:
`GaussianNB(priors=None, var_smoothing=1e-09)`

### Evaluating Trained Model On Test Data

Almost all models in the Scikit-Learn API provide a `predict()` method, which can be used to predict the target variable on the test set passed to it.

In :
```Y_preds = gaussian_nb.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%gaussian_nb.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%gaussian_nb.score(X_train, Y_train))
```
```[5 9 9 6 1 6 6 9 7 7 4 2 1 4 3]
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
Test Accuracy : 0.856
Training Accuracy : 0.875
```

### Plotting Confusion Matrix

In [ ]:
```plot_confusion_matrix(Y_test, gaussian_nb.predict(X_test))
```

### Important Attributes of GaussianNB

Below is a list of important attributes available through an estimator instance of GaussianNB.

• `class_prior_` - It represents the probability of each class.
• `epsilon_` - It represents the absolute additive value applied to variances for numerical stability.
• `sigma_` - It represents the variance of each feature per class. `(n_classes x n_features)`
• `theta_` - It represents the mean of each feature per class. `(n_classes x n_features)`
In :
```gaussian_nb.class_prior_
```
Out:
`array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])`
In :
```gaussian_nb.epsilon_
```
Out:
`4.217280259413155e-08`
In :
```print("Gaussian Naive Bayes Sigma Shape : ", gaussian_nb.sigma_.shape)
```
```Gaussian Naive Bayes Sigma Shape :  (10, 64)
```
In :
```print("Gaussian Naive Bayes Theta Shape : ", gaussian_nb.theta_.shape)
```
```Gaussian Naive Bayes Theta Shape :  (10, 64)
```
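
To see how `theta_` and `sigma_` are actually used, the following sketch recomputes GaussianNB's joint log likelihood for one sample by hand and checks that its argmax matches `predict()`. It reads `var_` when available, since `sigma_` was renamed in newer scikit-learn versions:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB

X, Y = load_digits(return_X_y=True)
model = GaussianNB().fit(X, Y)

# Per-class feature means and variances learned by the model.
theta = model.theta_
var = model.var_ if hasattr(model, "var_") else model.sigma_

# Joint log likelihood of sample x for each class c:
#   log P(c) + sum_f log N(x_f | theta[c, f], var[c, f])
x = X[0]
log_prior = np.log(model.class_prior_)
log_likelihood = -0.5 * np.sum(
    np.log(2 * np.pi * var) + (x - theta) ** 2 / var, axis=1)
manual_pred = np.argmax(log_prior + log_likelihood)

print(manual_pred, model.predict(X[:1])[0])  # both should be the same class
```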

## ComplementNB

The next estimator that we'll be introducing is `ComplementNB`, available in the `naive_bayes` module of sklearn. We'll first fit it to the data with default parameters and then try to improve its performance through hyperparameter tuning. We'll also evaluate its performance using a confusion matrix and point out important attributes of `ComplementNB` that can give helpful insight once the model is trained.

### Fitting Default Model To Train Data

In :
```from sklearn.naive_bayes import ComplementNB

complement_nb = ComplementNB()
complement_nb.fit(X_train, Y_train)
```
Out:
`ComplementNB(alpha=1.0, class_prior=None, fit_prior=True, norm=False)`

### Evaluating Trained Model On Test Data

Almost all models in the Scikit-Learn API provide a `predict()` method, which can be used to predict the target variable on the test set passed to it.

In :
```Y_preds = complement_nb.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%complement_nb.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%complement_nb.score(X_train, Y_train))
```
```[5 9 2 6 1 6 6 9 1 7 4 2 1 4 3]
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
Test Accuracy : 0.822
Training Accuracy : 0.823
```

### Plotting Confusion Matrix

In [ ]:
```plot_confusion_matrix(Y_test, complement_nb.predict(X_test))
```

### Important Attributes of ComplementNB

Below is a list of important attributes available through an estimator instance of ComplementNB.

• `class_log_prior_` - It represents the log probability of each class.
• `feature_log_prob_` - It represents the log probability of each feature conditioned on class. `(n_classes x n_features)`
In :
```complement_nb.class_log_prior_
```
Out:
```array([-2.31, -2.29, -2.31, -2.29, -2.29, -2.29, -2.29, -2.31, -2.34,
-2.3 ])```
In :
```print("Log Probability of Each Feature per class : ", complement_nb.feature_log_prob_.shape)
```
```Log Probability of Each Feature per class :  (10, 64)
```

### Fine-Tuning Model By Doing Grid Search On Various Hyperparameters

Below is a list of hyperparameters that commonly need tuning to get the best fit for our data. We'll try various hyperparameter settings on various train/test splits to find the best fit, one with nearly the same accuracy on both the train and test sets, or at least a small difference between them.

• alpha - It accepts a float value representing the additive smoothing parameter. A value of `0.0` means no smoothing. The default is `1.0`.

We'll now try various values for the above-mentioned hyperparameter to find the best estimator for our dataset using `5-fold cross-validation`.

In :
```%%time

params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0, ],
}

complement_nb_grid = GridSearchCV(ComplementNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)
complement_nb_grid.fit(X_digits,Y_digits)

print('Train Accuracy : %.3f'%complement_nb_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%complement_nb_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%complement_nb_grid.best_score_)
print('Best Parameters : ',complement_nb_grid.best_params_)
```
```Fitting 5 folds for each of 5 candidates, totalling 25 fits
Train Accuracy : 0.818
Test Accuracy : 0.839
Best Accuracy Through Grid Search : 0.795
Best Parameters :  {'alpha': 10.0}
CPU times: user 81 ms, sys: 12.3 ms, total: 93.3 ms
Wall time: 156 ms
```
```[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  25 | elapsed:    0.1s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    0.1s finished
```

### Plotting Confusion Matrix

Below we are plotting the confusion matrix again with the best estimator that we found using grid search.

In [ ]:
```plot_confusion_matrix(Y_test, complement_nb_grid.best_estimator_.predict(X_test))
```

## MultinomialNB

The last estimator that we'll be introducing is `MultinomialNB`, available in the `naive_bayes` module of sklearn. We'll first fit it to the data with default parameters and then try to improve its performance through hyperparameter tuning. We'll also evaluate its performance using a confusion matrix and point out important attributes of `MultinomialNB` that can give helpful insight once the model is trained.

### Fitting Default Model To Train Data

In :
```from sklearn.naive_bayes import MultinomialNB

multinomial_nb = MultinomialNB()
multinomial_nb.fit(X_train, Y_train)
```
Out:
`MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)`

### Evaluating Trained Model On Test Data

Almost all models in the Scikit-Learn API provide a `predict()` method, which can be used to predict the target variable on the test set passed to it.

In :
```Y_preds = multinomial_nb.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%multinomial_nb.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%multinomial_nb.score(X_train, Y_train))
```
```[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
Test Accuracy : 0.917
Training Accuracy : 0.900
```

### Plotting Confusion Matrix

In [ ]:
```plot_confusion_matrix(Y_test, multinomial_nb.predict(X_test))
```

### Important Attributes of MultinomialNB

Below is a list of important attributes available through an estimator instance of MultinomialNB.

• `class_log_prior_` - It represents the log probability of each class.
• `feature_log_prob_` - It represents the log probability of each feature conditioned on class. `(n_classes x n_features)`
In :
```multinomial_nb.class_log_prior_
```
Out:
```array([-2.31, -2.29, -2.31, -2.29, -2.29, -2.29, -2.29, -2.31, -2.34,
-2.3 ])```
In :
```print("Log Probability of Each Feature per class : ", multinomial_nb.feature_log_prob_.shape)
```
```Log Probability of Each Feature per class :  (10, 64)
```

### Fine-Tuning Model By Doing Grid Search On Various Hyperparameters

Below is a list of hyperparameters that commonly need tuning to get the best fit for our data. We'll try various hyperparameter settings on various train/test splits to find the best fit, one with nearly the same accuracy on both the train and test sets, or at least a small difference between them.

• alpha - It accepts a float value representing the additive smoothing parameter. A value of `0.0` means no smoothing. The default is `1.0`.

We'll now try various values for the above-mentioned hyperparameter to find the best estimator for our dataset using `5-fold cross-validation`.

In :
```%%time

params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0, ],
}

multinomial_nb_grid = GridSearchCV(MultinomialNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)
multinomial_nb_grid.fit(X_digits,Y_digits)

print('Train Accuracy : %.3f'%multinomial_nb_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%multinomial_nb_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%multinomial_nb_grid.best_score_)
print('Best Parameters : ',multinomial_nb_grid.best_params_)
```
```Fitting 5 folds for each of 5 candidates, totalling 25 fits
Train Accuracy : 0.903
Test Accuracy : 0.922
Best Accuracy Through Grid Search : 0.875
Best Parameters :  {'alpha': 10.0}
CPU times: user 68.4 ms, sys: 12.1 ms, total: 80.5 ms
Wall time: 160 ms
```
```[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  25 | elapsed:    0.1s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    0.1s finished
```

### Plotting Confusion Matrix

Below we are plotting the confusion matrix again with the best estimator that we found using grid search.

In [ ]:
```plot_confusion_matrix(Y_test, multinomial_nb_grid.best_estimator_.predict(X_test))
```

This ends our small tutorial introducing the various naive Bayes implementations available in scikit-learn. Please feel free to share your views in the comments section.

Sunny Solanki