Regression is the task of predicting a continuous target variable based on one or more independent variables. Scikit-Learn offers various regression models for this purpose.
Below we'll work through several of scikit-learn's regression models.
Scikit-Learn also provides a few datasets built into the package that we can load directly into memory and use for our purpose. We'll be using one such dataset, the Boston Housing dataset, and predicting house prices based on the other attributes in the dataset.
We start by importing the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import warnings
import sys
print("Python Version : ",sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)
warnings.filterwarnings("ignore") ## We'll silence future warnings using this command.
np.set_printoptions(precision=3)
## Below magic function fits plots inside of the current notebook.
## There is another option (%matplotlib notebook) which renders interactive plots in the notebook.
%matplotlib inline
In the Linear Regression model, we try to fit a line through the data that has the minimum distance from all points in the dataset. Once we have found the line with the minimum overall distance from the data points and no further optimization is possible, we use that line to make predictions on unseen data in the future.
It's also known as Ordinary Least Squares, because the optimization tries to minimize the sum of squared distances between the line and all points in the training set.
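Concretely, if $x_1, x_2, ..., x_n$ are the features and $w_1, w_2, ..., w_n, b$ are the learned weights and intercept, the fitted line predicts
$\hat{y} = w_1x_1 + w_2x_2 + ... + w_nx_n + b$
and ordinary least squares chooses the weights that minimize the sum of squared errors over the training data:
$\sum_{i}(y_i - \hat{y}_i)^2$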
We'll load the Boston housing data provided by scikit-learn. It returns a Bunch object, which behaves almost the same as a dictionary. We'll also print a few details about the dataset.
from sklearn.datasets import load_boston ## function for loading boston data.
boston = load_boston()
#print(type(boston)) ## It returns Bunch object which is similar to dictionary.
#print(boston.DESCR) ## DESCR attribute describes dataset.
print('Feature Names : ' + str(boston.feature_names))
print('Dataset shape : ' + str(boston.data.shape))
print('Target shape : ' + str(boston.target.shape))
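For a quick look at the data itself, we can also wrap the features in a pandas dataframe (an optional sketch; boston_df is just an illustrative name):
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names) ## Feature matrix as a dataframe
boston_df['Price'] = boston.target ## Add target column for reference
boston_df.head() ## Print first few rows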
We'll split the dataset into two parts: training data, which will be used to train the model, and test data, against which the accuracy of the trained model will be checked. The train_test_split function of the model_selection module of sklearn will help us split the data into two sets, with 80% for training and 20% for testing. We are also using a seed (random_state=123) with train_test_split so that we always get the same split and can reproduce the results in the future.
from sklearn.model_selection import train_test_split # Function for splitting dataset into train/test set.
X = boston.data
Y = boston.target
## We can specify either one of train_size and test_size; sklearn figures out the other by itself. We included both for explanation purposes.
## random_state is used to reproduce the same data splits again. If we don't set random_state then it generates different splits every time.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, test_size = 0.2, random_state = 123)
print('Train & Test sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
Below we initialize the LinearRegression model, the basic model used extensively for regression tasks.
from sklearn.linear_model import LinearRegression ## Linear Regression Implementation
linear_regressor = LinearRegression()
linear_regressor
We can train the model by passing it the train data and train labels. The fit() method also returns the trained model object itself.
linear_regressor.fit(X_train,Y_train)
Almost all models in the Scikit-Learn API provide a predict() method which can be used to predict the target variable for the data passed to it.
Below we compare the housing prices predicted by our model with the actual house prices of the test data and the train data.
y_test_pred = linear_regressor.predict(X_test)
print('First Few Actual Housing Prices(Test Data) : ' + str(Y_test[:5]))
print('First Few Predicted Housing Prices(Test Data) : ' + str(y_test_pred[:5]))
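A similar quick check on the training data might look like this (a small sketch using the already-fitted linear_regressor):
y_train_pred = linear_regressor.predict(X_train) ## Predictions on train set
print('First Few Actual Housing Prices(Train Data) : ' + str(Y_train[:5]))
print('First Few Predicted Housing Prices(Train Data) : ' + str(y_train_pred[:5]))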
Scikit-Learn's LinearRegression model has a score() method which returns the coefficient of determination $R^2$ for the dataset and target variables passed to it. It returns a value of at most 1, with 1 being best; a negative value means that the model performed quite badly.
Note: Do not confuse $R^2$ with MSE, as they are quite different. One can calculate MSE using the mean_squared_error function provided by the metrics module of sklearn.
Formula of $R^2$:
$R^2 = 1 - u/v$
where
$u = ((y_{true} - y_{pred})^2).sum()$ is the residual sum of squares, and
$v = ((y_{true} - y_{true}.mean())^2).sum()$ is the total sum of squares.
Note that $u$ is the sum of squared errors, not their mean, so it is not the same as MSE.
print('R^2 Score on Test Data : %.3f'%linear_regressor.score(X_test, Y_test))
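As a cross-check, the same quantities can be computed directly with the metrics module (a short sketch, using y_test_pred from above):
from sklearn.metrics import mean_squared_error, r2_score
print('MSE on Test Data : %.3f'%mean_squared_error(Y_test, y_test_pred)) ## Mean squared error
print('R^2 on Test Data : %.3f'%r2_score(Y_test, y_test_pred)) ## Same value as score() above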
As we discussed above, linear regression tries to fit a line through the data such that the squared error between the actual and predicted values is least. This is also why it is referred to as Ordinary Least Squares by many ML practitioners, as it minimizes the squared differences between predicted and actual labels. We can access the coefficients of that line through the coef_ and intercept_ attributes of the regressor.
print('Weight Coefficients : '+ str(linear_regressor.coef_))
print('\nY-Axis Intercept : '+ str(linear_regressor.intercept_))
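These two attributes fully describe the fitted line; for example, we can reproduce the model's predictions by hand (a quick sketch):
manual_preds = np.dot(X_test, linear_regressor.coef_) + linear_regressor.intercept_ ## y = Xw + b
print('Manual predictions match predict() : ', np.allclose(manual_preds, linear_regressor.predict(X_test)))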
sorted_labels_acc_to_test_y = list(sorted(zip(Y_test, y_test_pred), key=lambda x: x[1]))
sorted_test_y, sorted_test_preds = zip(*sorted_labels_acc_to_test_y)
with plt.style.context(('ggplot', 'seaborn')):
    plt.scatter(range(len(sorted_test_y)), sorted_test_y, s=75, alpha=0.7, label='Actual')
    plt.scatter(range(len(sorted_test_preds)), sorted_test_preds, s=75, alpha=0.7, label='Prediction')
    plt.ylabel('House Price')
    plt.title('Actual vs Predicted House Prices of Test Data')
    plt.legend(loc='best')
Ridge regression is another estimator where we introduce regularization (L2 regularization) into the cost minimization function. This regularization pushes all weights towards zero without making them exactly zero, keeping all of the weights quite small.
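Roughly speaking, ridge adds a penalty proportional to the squared magnitude of the weights to the least-squares cost, with the strength of the penalty controlled by the alpha hyperparameter:
$\sum_{i}(y_i - \hat{y}_i)^2 + \alpha\sum_{j}w_j^2$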
from sklearn.linear_model import Ridge ## Ridge Regression Implementation
ridge_regressor = Ridge()
ridge_regressor
ridge_regressor.fit(X_train,Y_train)
y_test_pred = ridge_regressor.predict(X_test)
print('First Few Actual Housing Prices(Test Data) : ' + str(Y_test[:5]))
print('First Few Predicted Housing Prices(Test Data) : ' + str(y_test_pred[:5]))
print('\nR^2 Score on Test Data : %.3f'%ridge_regressor.score(X_test, Y_test))
sorted_labels_acc_to_test_y = list(sorted(zip(Y_test, y_test_pred), key=lambda x: x[1]))
sorted_test_y, sorted_test_preds = zip(*sorted_labels_acc_to_test_y)
with plt.style.context(('ggplot', 'seaborn')):
    plt.scatter(range(len(sorted_test_y)), sorted_test_y, s=75, alpha=0.7, label='Actual')
    plt.scatter(range(len(sorted_test_preds)), sorted_test_preds, s=75, alpha=0.7, label='Prediction')
    plt.ylabel('House Price')
    plt.title('Actual vs Predicted House Prices of Test Data')
    plt.legend(loc='best')
Below is a list of hyperparameters that we can tune to get the best estimator for our data (an example of setting them directly follows the list).
fit_intercept - Whether or not to include an intercept in the model ($y = mx + c$ - here c is referring to the intercept). default=True
alpha - Regularization strength. default=1.0
solver - Solver to use in the computational routines. default='auto'
max_iter - Maximum number of iterations for the solver. default=1000
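These can be passed directly when constructing the estimator; for example (an illustrative sketch with untuned values; custom_ridge is just a throwaway name):
custom_ridge = Ridge(alpha=10.0, fit_intercept=True, solver='cholesky') ## Illustrative hyperparameter values
custom_ridge.fit(X_train, Y_train)
print('R^2 Score on Test Data : %.3f'%custom_ridge.score(X_test, Y_test))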
GridSearchCV is a wrapper class provided by sklearn which loops through all combinations of the parameters provided as the param_grid parameter, using the number of cross-validation folds provided as the cv parameter, evaluates model performance on each combination, and stores all results in the cv_results_ attribute. It also stores the model which performs best across all cross-validation folds in the best_estimator_ attribute and the best score in the best_score_ attribute.
Note: The n_jobs parameter is provided by many estimators. It accepts the number of cores to use for parallelization; if the value -1 is given then it uses all cores. We are also using %%time, a Jupyter notebook cell magic command which prints the time taken by that cell to complete running. The time will differ on different computers based on their configuration.
%%time
from sklearn.model_selection import GridSearchCV
params = {'alpha' : [500, 200, 100, 50,10, 1, 0.1, 0.01],
'fit_intercept': [True, False],
'solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}
grid = GridSearchCV(Ridge(random_state=1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
The GridSearchCV object maintains all of the different parameters tried and the results generated for each split of the data in the cv_results_ attribute as a dictionary. Below we load those cross-validation results into a pandas dataframe and print the first few entries.
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
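To focus on the most useful columns, one can also sort the results by their cross-validation rank (an optional sketch):
cross_val_results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score').head()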
Lasso Regression is another estimator, where we introduce an L1 type of regularization into the cost minimization function. L1 regularization drives the coefficients of features that do not have much influence on the target variable to exactly zero.
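Roughly speaking (ignoring the sample-size scaling sklearn applies internally), lasso adds a penalty proportional to the absolute magnitude of the weights:
$\sum_{i}(y_i - \hat{y}_i)^2 + \alpha\sum_{j}|w_j|$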
from sklearn.linear_model import Lasso
lasso_regressor = Lasso()
lasso_regressor
lasso_regressor.fit(X_train,Y_train)
y_test_pred = lasso_regressor.predict(X_test)
print('First Few Actual Housing Prices(Test Data) : ' + str(Y_test[:5]))
print('First Few Predicted Housing Prices(Test Data) : ' + str(y_test_pred[:5]))
print('\nR^2 Score on Test Data : %.3f'%lasso_regressor.score(X_test, Y_test))
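We can see the sparsity induced by the L1 penalty by counting how many coefficients the fitted model has driven exactly to zero (a quick check on lasso_regressor):
print('Lasso Coefficients : ' + str(lasso_regressor.coef_))
print('Number of Zero Coefficients : %d'%np.sum(lasso_regressor.coef_ == 0))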
sorted_labels_acc_to_test_y = list(sorted(zip(Y_test, y_test_pred), key=lambda x: x[1]))
sorted_test_y, sorted_test_preds = zip(*sorted_labels_acc_to_test_y)
with plt.style.context(('ggplot', 'seaborn')):
    plt.scatter(range(len(sorted_test_y)), sorted_test_y, s=75, alpha=0.7, label='Actual')
    plt.scatter(range(len(sorted_test_preds)), sorted_test_preds, s=75, alpha=0.7, label='Prediction')
    plt.ylabel('House Price')
    plt.title('Actual vs Predicted House Prices of Test Data')
    plt.legend(loc='best')
Lasso has the same hyperparameters to tune as ridge regression, except that it does not offer a choice of different solvers like ridge.
%%time
from sklearn.model_selection import GridSearchCV
params = {'alpha' : [500, 200, 100, 50,10, 1, 0.1, 0.01],
'fit_intercept': [True, False],
}
grid = GridSearchCV(Lasso(random_state=1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
ElasticNet is another estimator that uses both the L1 and L2 penalties. It's useful in cases where several features are correlated with one another.
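Roughly speaking (again ignoring sklearn's internal scaling constants), its cost combines both penalties, weighted by the l1_ratio parameter:
$\sum_{i}(y_i - \hat{y}_i)^2 + \alpha \cdot l1\_ratio \sum_{j}|w_j| + \frac{\alpha(1 - l1\_ratio)}{2}\sum_{j}w_j^2$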
from sklearn.linear_model import ElasticNet
elasticnet_regressor = ElasticNet()
elasticnet_regressor
elasticnet_regressor.fit(X_train,Y_train)
y_test_pred = elasticnet_regressor.predict(X_test)
print('First Few Actual Housing Prices(Test Data) : ' + str(Y_test[:5]))
print('First Few Predicted Housing Prices(Test Data) : ' + str(y_test_pred[:5]))
print('\nR^2 Score on Test Data : %.3f'%elasticnet_regressor.score(X_test, Y_test))
sorted_labels_acc_to_test_y = list(sorted(zip(Y_test, y_test_pred), key=lambda x: x[1]))
sorted_test_y, sorted_test_preds = zip(*sorted_labels_acc_to_test_y)
with plt.style.context(('ggplot', 'seaborn')):
    plt.scatter(range(len(sorted_test_y)), sorted_test_y, s=75, alpha=0.7, label='Actual')
    plt.scatter(range(len(sorted_test_preds)), sorted_test_preds, s=75, alpha=0.7, label='Prediction')
    plt.ylabel('House Price')
    plt.title('Actual vs Predicted House Prices of Test Data')
    plt.legend(loc='best')
ElasticNet has the same parameters as Ridge and Lasso, with one extra parameter, l1_ratio, which controls the proportion of L1 and L2 penalty used in the regression model. default=0.5
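An l1_ratio of 1.0 corresponds to a pure L1 (lasso-like) penalty and 0.0 to a pure L2 (ridge-like) penalty. A sketch of passing it explicitly (custom_enet is just a throwaway name):
custom_enet = ElasticNet(alpha=1.0, l1_ratio=0.7) ## 70% L1 penalty, 30% L2 penalty (illustrative values)
custom_enet.fit(X_train, Y_train)
print('R^2 Score on Test Data : %.3f'%custom_enet.score(X_test, Y_test))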
%%time
from sklearn.model_selection import GridSearchCV
params = {'alpha' : [500, 200, 100, 50,10, 1, 0.1, 0.01],
'fit_intercept': [True, False],
'l1_ratio': [0,0.3, 0.5, 0.7, 1.0]
}
grid = GridSearchCV(ElasticNet(random_state=1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.
Please make a note of the $R^2$ calculated by each model on the train and test data. Ridge performs better than Linear Regression, Lasso performs better than Ridge and Linear Regression, and Elastic Net performs almost the same as Lasso or a little better. One can try these models on various datasets to check the performance of each one.