# Scikit-Learn - Naive Bayes

Updated On: May-26-2020 | Tags: sklearn, naive-bayes

## Introduction

Naive Bayes estimators are probabilistic models based on `Bayes' theorem`, with the "naive" assumption of strong (conditional) independence between features. Bayes' theorem lets us compute the probability of an event based on prior knowledge of conditions related to that event. Naive Bayes classifiers work quite well for document classification and spam filtering applications. They require only a small amount of training data to estimate the probabilities needed by Bayes' theorem and are therefore quite fast.
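
As a concrete illustration, here's a tiny sketch of Bayes' theorem applied to a hypothetical spam-filter setting; all the probability values below are made up for demonstration:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical spam-filter numbers (for illustration only):
p_spam = 0.2                 # prior: P(spam)
p_word_given_spam = 0.6      # likelihood: P(word "offer" | spam)
p_word_given_ham = 0.05      # likelihood: P(word "offer" | not spam)

# Total probability of seeing the word "offer" in any email.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior: probability the email is spam given it contains "offer".
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.75
```

A naive Bayes classifier applies this same update once per feature, multiplying the per-feature likelihoods together under the independence assumption.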

Scikit-Learn provides 4 naive Bayes estimators, each differing in the assumption it makes about the distribution of feature values within a class:

• BernoulliNB - It represents a classifier based on multivariate Bernoulli distributions. Data can have multiple features, but each is assumed to be a binary (boolean) variable.
• GaussianNB - It represents a classifier that assumes the likelihood of each feature is Gaussian (normally distributed).
• ComplementNB - It represents a classifier that uses the complement of each class to compute model weights. It's an adaptation of multinomial naive Bayes that is well suited for imbalanced classification problems.
• MultinomialNB - It represents a classifier suited for multinomially distributed data, such as word counts in text classification.

We'll be explaining the usage of each one of the naive Bayes variants with examples.

We'll start by importing the necessary libraries for our tutorial.

In :
```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

np.set_printoptions(precision=2)

%matplotlib inline
```

We'll be using the digits dataset for explanation purposes. It contains samples of the digits 0-9 as `8x8` pixel images, with each sample image flattened to a vector of size `64`.

In :
```from sklearn.datasets import load_digits

digits = load_digits()

X_digits, Y_digits = digits.data, digits.target
print('Dataset Size : ', X_digits.shape, Y_digits.shape)
```
```Dataset Size :  (1797, 64) (1797,)
```
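
To make the `8x8` layout concrete, here's a minimal self-contained sketch that reshapes the first flattened 64-element sample back into its pixel grid:

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each flattened 64-element sample can be reshaped back to its 8x8 pixel grid.
first_image = digits.data[0].reshape(8, 8)
print(first_image.shape)   # (8, 8)
print(digits.target[0])    # label of the first sample: 0
```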

### Splitting Data Into Train/Test Sets

We'll split the dataset into two parts:

• `Training data` which will be used to train the model.
• `Test data` against which accuracy of the trained model will be checked.

The `train_test_split` function of sklearn's `model_selection` module will help us split the data into two sets, with `80%` for training and `20%` for testing. We also pass a seed (`random_state=123`) to train_test_split so that we always get the same split and can reproduce the results in the future.

NOTE

Please make a note that we are also using the stratify parameter, which prevents an unequal distribution of classes between the train and test sets. For each class, we'll have 80% of its samples in the train set and 20% in the test set. This ensures that no class dominates either the train or the test set.

In :
```from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_digits, Y_digits, train_size=0.80, test_size=0.20, stratify=Y_digits, random_state=123)
print('Train/Test Sizes : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
```
```Train/Test Sizes :  (1437, 64) (360, 64) (1437,) (360,)
```
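
As a quick sanity check of the stratified split described above, the following self-contained sketch recomputes the split and counts per-class samples; each digit should keep roughly an 80/20 train/test ratio:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, Y = load_digits(return_X_y=True)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X, Y, train_size=0.80, test_size=0.20, stratify=Y, random_state=123)

# With stratify=Y, every digit class is split roughly 80/20.
train_counts = np.bincount(Y_tr)
test_counts = np.bincount(Y_te)
print(train_counts)
print(test_counts)
```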

## BernoulliNB

The first estimator that we'll be introducing is `BernoulliNB`, available in the `naive_bayes` module of sklearn. We'll first fit it to the data with default parameters and then try to improve its performance through hyperparameter tuning. We'll also evaluate its performance using a confusion matrix and point out important attributes of `BernoulliNB` that can give helpful insight once the model is trained.

### Fitting Default Model To Train Data

We'll fit the model to the train data using the estimator's `fit()` method, passing it the train features and train labels. We are fitting a default model without setting any parameters explicitly.

In :
```from sklearn.naive_bayes import BernoulliNB

bernoulli_nb =  BernoulliNB()
bernoulli_nb.fit(X_train, Y_train)
```
Out:
`BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)`

### Evaluating Trained Model On Test Data

Almost all models in the Scikit-Learn API provide a `predict()` method, which can be used to predict the target variable on the test set passed to it.

In :
```Y_preds = bernoulli_nb.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%bernoulli_nb.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%bernoulli_nb.score(X_train, Y_train))
```
```[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
Test Accuracy : 0.875
Training Accuracy : 0.864
```
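
Besides hard predictions, the estimator also exposes `predict_proba()`, which returns one probability per class for every sample; `predict()` simply picks the class with the highest probability. A self-contained sketch refitting a default `BernoulliNB` on the same split:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

X, Y = load_digits(return_X_y=True)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X, Y, train_size=0.80, test_size=0.20, stratify=Y, random_state=123)

model = BernoulliNB().fit(X_tr, Y_tr)

# One probability per class for each sample; each row sums to 1.
probs = model.predict_proba(X_te[:5])
print(probs.shape)                        # (5, 10)
print(np.allclose(probs.sum(axis=1), 1))  # True
```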

### Plotting Confusion Matrix

We'll plot the confusion matrix to better understand the performance of our model. We have designed the method `plot_confusion_matrix()`, which accepts the original labels and predicted labels of the data and plots a confusion matrix. We'll reuse this method later when training the other estimators.

In :
```from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(Y_test, Y_preds):
    conf_mat = confusion_matrix(Y_test, Y_preds)
    fig = plt.figure(figsize=(6, 6))
    plt.matshow(conf_mat, cmap=plt.cm.Blues, fignum=1)
    plt.yticks(range(10), range(10))
    plt.xticks(range(10), range(10))
    plt.colorbar()
    for i in range(10):
        for j in range(10):
            plt.text(i - 0.2, j + 0.1, str(conf_mat[j, i]), color='tab:red')
```
In [ ]:
```plot_confusion_matrix(Y_test, bernoulli_nb.predict(X_test))
```

### Important Attributes of BernoulliNB

Below is a list of important attributes available through an estimator instance of BernoulliNB.

• `class_log_prior_` - It represents the log probability of each class.
• `feature_log_prob_` - It represents the log probability of each feature conditioned on class. `(n_classes x n_features)`
In :
```bernoulli_nb.class_log_prior_
```
Out:
```array([-2.31, -2.29, -2.31, -2.29, -2.29, -2.29, -2.29, -2.31, -2.34,
-2.3 ])```
In :
```print("Log Probability of Each Feature per class : ", bernoulli_nb.feature_log_prob_.shape)
```
```Log Probability of Each Feature per class :  (10, 64)
```
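
Since `class_log_prior_` stores log probabilities, exponentiating it recovers the class priors, which should sum to 1. A self-contained sketch fitting on the full dataset:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.naive_bayes import BernoulliNB

X, Y = load_digits(return_X_y=True)
model = BernoulliNB().fit(X, Y)

# exp(log prior) = prior; the 10 class priors must sum to 1.
priors = np.exp(model.class_log_prior_)
print(np.round(priors, 3))
print(round(priors.sum(), 6))  # 1.0
```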

### Fine-Tuning Model By Doing Grid Search On Various Hyperparameters

Below is a list of hyperparameters that commonly need tuning to get the best fit for our data. We'll try various hyperparameter settings on various train/test splits to find the best fit, one with nearly the same accuracy on both the train and test sets, or at least a small difference between them.

• alpha - It accepts a float value representing the additive smoothing parameter. A value of `0.0` means no smoothing. The default is `1.0`.
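
To see what alpha does, here's a minimal arithmetic sketch of additive smoothing for a single binary feature, using the Bernoulli form `(count + alpha) / (total + 2 * alpha)`; the counts below are made up for illustration:

```python
# A feature that is never "on" in some class: without smoothing its
# estimated probability is exactly 0, which zeroes out the whole
# likelihood product. alpha > 0 avoids that.
count_1, count_total = 0, 100

smoothed = []
for alpha in (0.0, 1.0, 10.0):
    smoothed.append((count_1 + alpha) / (count_total + 2 * alpha))
print(smoothed)  # [0.0, ~0.0098, ~0.0833]
```

Larger alpha pulls every probability toward 0.5, so it trades variance for bias; grid search below picks the value that balances the two for this dataset.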

### GridSearchCV

It's a wrapper class provided by sklearn that loops through all parameter combinations given in the `param_grid` parameter, using the number of cross-validation folds given by the `cv` parameter, evaluates model performance on each combination, and stores all results in the `cv_results_` attribute. It also stores the model that performed best across the cross-validation folds in the `best_estimator_` attribute and the best score in the `best_score_` attribute.

NOTE

The n_jobs parameter is provided by many estimators. It accepts the number of CPU cores to use for parallelization. If a value of -1 is given, it uses all cores. It relies on the joblib parallel processing library to run jobs in parallel in the background.

We'll now try various values for the above-mentioned hyperparameter to find the best estimator for our dataset using `5-fold cross-validation`.

In :
```%%time

from sklearn.model_selection import GridSearchCV

params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0],
}

bernoulli_nb_grid = GridSearchCV(BernoulliNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)
bernoulli_nb_grid.fit(X_digits,Y_digits)

print('Train Accuracy : %.3f'%bernoulli_nb_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%bernoulli_nb_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%bernoulli_nb_grid.best_score_)
print('Best Parameters : ',bernoulli_nb_grid.best_params_)
```
```Fitting 5 folds for each of 5 candidates, totalling 25 fits
```
```[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
```
```Train Accuracy : 0.869
Test Accuracy : 0.883
Best Accuracy Through Grid Search : 0.825
Best Parameters :  {'alpha': 0.01}
CPU times: user 128 ms, sys: 67.1 ms, total: 195 ms
Wall time: 2.48 s
```
```[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    2.3s finished
```

### Plotting Confusion Matrix

Below we are plotting the confusion matrix again with the best estimator that we found using grid search.

In [ ]:
```plot_confusion_matrix(Y_test, bernoulli_nb_grid.best_estimator_.predict(X_test))
```

## GaussianNB

The next estimator that we'll be introducing is `GaussianNB`, available in the `naive_bayes` module of sklearn. We'll first fit it to the data with default parameters and then evaluate its performance using a confusion matrix. We'll also point out important attributes of `GaussianNB` that can give helpful insight once the model is trained.

### Fitting Default Model To Train Data

In :
```from sklearn.naive_bayes import GaussianNB

gaussian_nb = GaussianNB()
gaussian_nb.fit(X_train, Y_train)
```
Out:
`GaussianNB(priors=None, var_smoothing=1e-09)`

### Evaluating Trained Model On Test Data

Almost all models in the Scikit-Learn API provide a `predict()` method, which can be used to predict the target variable on the test set passed to it.

In :
```Y_preds = gaussian_nb.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%gaussian_nb.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%gaussian_nb.score(X_train, Y_train))
```
```[5 9 9 6 1 6 6 9 7 7 4 2 1 4 3]
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
Test Accuracy : 0.856
Training Accuracy : 0.875
```

### Plotting Confusion Matrix

In [ ]:
```plot_confusion_matrix(Y_test, gaussian_nb.predict(X_test))
```

### Important Attributes of GaussianNB

Below is a list of important attributes available through an estimator instance of GaussianNB.

• `class_prior_` - It represents the probability of each class.
• `epsilon_` - It represents the absolute additive value applied to variances for numerical stability.
• `sigma_` - It represents the variance of each feature per class. `(n_classes x n_features)`
• `theta_` - It represents the mean of each feature per class. `(n_classes x n_features)`
In :
```gaussian_nb.class_prior_
```
Out:
`array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])`
In :
```gaussian_nb.epsilon_
```
Out:
`4.217280259413155e-08`
In :
```print("Gaussian Naive Bayes Sigma Shape : ", gaussian_nb.sigma_.shape)
```
```Gaussian Naive Bayes Sigma Shape :  (10, 64)
```
In :
```print("Gaussian Naive Bayes Theta Shape : ", gaussian_nb.theta_.shape)
```
```Gaussian Naive Bayes Theta Shape :  (10, 64)
```
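
To see how `theta_` and `sigma_` are actually used, the following sketch recomputes GaussianNB's joint log likelihood for one sample by hand and checks that its argmax matches `predict()`. It reads `var_` when available, since `sigma_` was renamed in newer scikit-learn versions:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB

X, Y = load_digits(return_X_y=True)
model = GaussianNB().fit(X, Y)

# Per-class feature means and variances learned by the model.
theta = model.theta_
var = model.var_ if hasattr(model, "var_") else model.sigma_

# Joint log likelihood of sample x for each class c:
#   log P(c) + sum_f log N(x_f | theta[c, f], var[c, f])
x = X[0]
log_prior = np.log(model.class_prior_)
log_likelihood = -0.5 * np.sum(
    np.log(2 * np.pi * var) + (x - theta) ** 2 / var, axis=1)
manual_pred = np.argmax(log_prior + log_likelihood)

print(manual_pred, model.predict(X[:1])[0])  # both should be the same class
```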

## ComplementNB

The next estimator that we'll be introducing is `ComplementNB`, available in the `naive_bayes` module of sklearn. We'll first fit it to the data with default parameters and then try to improve its performance through hyperparameter tuning. We'll also evaluate its performance using a confusion matrix and point out important attributes of `ComplementNB` that can give helpful insight once the model is trained.

### Fitting Default Model To Train Data

In :
```from sklearn.naive_bayes import ComplementNB

complement_nb = ComplementNB()
complement_nb.fit(X_train, Y_train)
```
Out:
`ComplementNB(alpha=1.0, class_prior=None, fit_prior=True, norm=False)`

### Evaluating Trained Model On Test Data

Almost all models in the Scikit-Learn API provide a `predict()` method, which can be used to predict the target variable on the test set passed to it.

In :
```Y_preds = complement_nb.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%complement_nb.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%complement_nb.score(X_train, Y_train))
```
```[5 9 2 6 1 6 6 9 1 7 4 2 1 4 3]
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
Test Accuracy : 0.822
Training Accuracy : 0.823
```

### Plotting Confusion Matrix

In [ ]:
```plot_confusion_matrix(Y_test, complement_nb.predict(X_test))
```

### Important Attributes of ComplementNB

Below is a list of important attributes available through an estimator instance of ComplementNB.

• `class_log_prior_` - It represents the log probability of each class.
• `feature_log_prob_` - It represents the log probability of each feature conditioned on class. `(n_classes x n_features)`
In :
```complement_nb.class_log_prior_
```
Out:
```array([-2.31, -2.29, -2.31, -2.29, -2.29, -2.29, -2.29, -2.31, -2.34,
-2.3 ])```
In :
```print("Log Probability of Each Feature per class : ", complement_nb.feature_log_prob_.shape)
```
```Log Probability of Each Feature per class :  (10, 64)
```

### Fine-Tuning Model By Doing Grid Search On Various Hyperparameters

Below is a list of hyperparameters that commonly need tuning to get the best fit for our data. We'll try various hyperparameter settings on various train/test splits to find the best fit, one with nearly the same accuracy on both the train and test sets, or at least a small difference between them.

• alpha - It accepts a float value representing the additive smoothing parameter. A value of `0.0` means no smoothing. The default is `1.0`.

We'll now try various values for the above-mentioned hyperparameter to find the best estimator for our dataset using `5-fold cross-validation`.

In :
```%%time

params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0, ],
}

complement_nb_grid = GridSearchCV(ComplementNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)
complement_nb_grid.fit(X_digits,Y_digits)

print('Train Accuracy : %.3f'%complement_nb_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%complement_nb_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%complement_nb_grid.best_score_)
print('Best Parameters : ',complement_nb_grid.best_params_)
```
```Fitting 5 folds for each of 5 candidates, totalling 25 fits
Train Accuracy : 0.818
Test Accuracy : 0.839
Best Accuracy Through Grid Search : 0.795
Best Parameters :  {'alpha': 10.0}
CPU times: user 81 ms, sys: 12.3 ms, total: 93.3 ms
Wall time: 156 ms
```
```[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  25 | elapsed:    0.1s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    0.1s finished
```

### Plotting Confusion Matrix

Below we are plotting the confusion matrix again with the best estimator that we found using grid search.

In [ ]:
```plot_confusion_matrix(Y_test, complement_nb_grid.best_estimator_.predict(X_test))
```

## MultinomialNB

The last estimator that we'll be introducing is `MultinomialNB`, available in the `naive_bayes` module of sklearn. We'll first fit it to the data with default parameters and then try to improve its performance through hyperparameter tuning. We'll also evaluate its performance using a confusion matrix and point out important attributes of `MultinomialNB` that can give helpful insight once the model is trained.

### Fitting Default Model To Train Data

In :
```from sklearn.naive_bayes import MultinomialNB

multinomial_nb = MultinomialNB()
multinomial_nb.fit(X_train, Y_train)
```
Out:
`MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)`

### Evaluating Trained Model On Test Data

Almost all models in the Scikit-Learn API provide a `predict()` method, which can be used to predict the target variable on the test set passed to it.

In :
```Y_preds = multinomial_nb.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%multinomial_nb.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%multinomial_nb.score(X_train, Y_train))
```
```[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
[5 9 9 6 1 6 6 9 8 7 4 2 1 4 3]
Test Accuracy : 0.917
Training Accuracy : 0.900
```

### Plotting Confusion Matrix

In [ ]:
```plot_confusion_matrix(Y_test, multinomial_nb.predict(X_test))
```

### Important Attributes of MultinomialNB

Below is a list of important attributes available through an estimator instance of MultinomialNB.

• `class_log_prior_` - It represents the log probability of each class.
• `feature_log_prob_` - It represents the log probability of each feature conditioned on class. `(n_classes x n_features)`
In :
```multinomial_nb.class_log_prior_
```
Out:
```array([-2.31, -2.29, -2.31, -2.29, -2.29, -2.29, -2.29, -2.31, -2.34,
-2.3 ])```
In :
```print("Log Probability of Each Feature per class : ", multinomial_nb.feature_log_prob_.shape)
```
```Log Probability of Each Feature per class :  (10, 64)
```

### Fine-Tuning Model By Doing Grid Search On Various Hyperparameters

Below is a list of hyperparameters that commonly need tuning to get the best fit for our data. We'll try various hyperparameter settings on various train/test splits to find the best fit, one with nearly the same accuracy on both the train and test sets, or at least a small difference between them.

• alpha - It accepts a float value representing the additive smoothing parameter. A value of `0.0` means no smoothing. The default is `1.0`.

We'll now try various values for the above-mentioned hyperparameter to find the best estimator for our dataset using `5-fold cross-validation`.

In :
```%%time

params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0, ],
}

multinomial_nb_grid = GridSearchCV(MultinomialNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)
multinomial_nb_grid.fit(X_digits,Y_digits)

print('Train Accuracy : %.3f'%multinomial_nb_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%multinomial_nb_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%multinomial_nb_grid.best_score_)
print('Best Parameters : ',multinomial_nb_grid.best_params_)
```
```Fitting 5 folds for each of 5 candidates, totalling 25 fits
Train Accuracy : 0.903
Test Accuracy : 0.922
Best Accuracy Through Grid Search : 0.875
Best Parameters :  {'alpha': 10.0}
CPU times: user 68.4 ms, sys: 12.1 ms, total: 80.5 ms
Wall time: 160 ms
```
```[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  25 | elapsed:    0.1s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    0.1s finished
```

### Plotting Confusion Matrix

Below we are plotting the confusion matrix again with the best estimator that we found using grid search.

In [ ]:
```plot_confusion_matrix(Y_test, multinomial_nb_grid.best_estimator_.predict(X_test))
```

This ends our small tutorial introducing the various naive Bayes implementations available in scikit-learn. Please feel free to share your views in the comments section.

Sunny Solanki