Before applying ML models to a dataset, we generally perform several preprocessing steps such as imputation (filling in missing values), scaling, and feature extraction. We can perform each of these steps individually, or we can use scikit-learn's make_pipeline API, which combines all of them into a single pipeline that applies the steps to the data in sequence.
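To make the difference concrete, here is a minimal sketch that runs the same steps by hand and through make_pipeline. The synthetic data and the choice of LinearRegression as the final model are assumptions made only to keep the example small and fast; they are not part of this tutorial's dataset or model.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for illustration; not the tutorial's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X[::10, 0] = np.nan  # introduce some missing values

# Manual approach: fit and apply each step one after another.
imputer = SimpleImputer()
scaler = RobustScaler()
model = LinearRegression()
X_imputed = imputer.fit_transform(X)
X_scaled = scaler.fit_transform(X_imputed)
model.fit(X_scaled, y)
manual_preds = model.predict(X_scaled)

# Pipeline approach: the same steps chained into one estimator.
pipeline = make_pipeline(SimpleImputer(), RobustScaler(), LinearRegression())
pipeline.fit(X, y)
pipeline_preds = pipeline.predict(X)

# Both approaches should produce the same predictions.
print(np.allclose(manual_preds, pipeline_preds))
```

The pipeline version needs a single fit() call and cannot accidentally skip or reorder a step, which is the main convenience it offers.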
We'll start by importing necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
np.set_printoptions(precision=2)
%matplotlib inline
We'll download the California housing dataset from the internet. sklearn's datasets module provides fetch_california_housing, which will be used to download the data.
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()
X_calif, Y_calif = california.data, california.target
print('Dataset Sizes : ', X_calif.shape, Y_calif.shape)
We'll split the dataset into two parts:
Training data, which will be used to train the model.
Test data, against which the accuracy of the trained model will be checked.
The train_test_split function of sklearn's model_selection module will help us split the data into two sets, with 80% for training and 20% for testing. We also pass a seed (random_state=123) to train_test_split so that we always get the same split and can reproduce the results in the future.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_calif, Y_calif, train_size=0.80, test_size=0.20, random_state=123)
print('Train/Test Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
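The effect of the seed can be checked with a small sketch. The toy arrays below are an assumption standing in for the real dataset; the point is only that two calls with the same random_state return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (an assumption for illustration).
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Two calls with the same seed produce identical splits.
split_a = train_test_split(X, y, train_size=0.80, test_size=0.20, random_state=123)
split_b = train_test_split(X, y, train_size=0.80, test_size=0.20, random_state=123)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_tr, X_te, y_tr, y_te = split_a
print('Train/Test Sizes : ', X_tr.shape, X_te.shape)
print('Same split with the same seed : ', same_split)
```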
Below we'll create a processing pipeline consisting of the 3 main steps described here.
For imputation, we'll use SimpleImputer from scikit-learn, which by default replaces missing values with the mean of their column.
For scaling, we'll use RobustScaler from scikit-learn, which centers and scales features using quantile ranges (the median and interquartile range by default), making it less sensitive to outliers.
For the ML model, we'll use SVR, which is a regression model based on the Support Vector Machine.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline, make_union
processing_pipeline = make_pipeline(SimpleImputer(), RobustScaler(), SVR())
processing_pipeline
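To see what the two preprocessing steps do with their default settings, here is a small sketch on a single toy column (illustrative data only, not taken from the housing dataset). It compares RobustScaler's output with a manual median and interquartile-range calculation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

# A single toy column with one missing value (illustrative data only).
col = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])

# SimpleImputer's default strategy is 'mean': the NaN becomes mean(1, 2, 4, 5) = 3.
imputed = SimpleImputer().fit_transform(col)

# RobustScaler's defaults subtract the median and divide by the interquartile range.
scaled = RobustScaler().fit_transform(imputed)

# The same scaling computed by hand for comparison.
median = np.median(imputed)
q25, q75 = np.percentile(imputed, [25, 75])
manual = (imputed - median) / (q75 - q25)

print(imputed.ravel())
print(scaled.ravel())
print(np.allclose(scaled, manual))
```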
A Pipeline object has the same API as the ML models available in scikit-learn. It provides fit(), predict(), and score(), each of which passes the data through all the preprocessing steps before calling the corresponding method on the final estimator. We'll check the performance of the pipeline on both the train and test data.
processing_pipeline.fit(X_train, Y_train)
print('Train R^2 Score : %.3f'%processing_pipeline.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%processing_pipeline.score(X_test, Y_test))
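Because the last step of the pipeline is a regressor, score() returns the R^2 of the pipeline's predictions. The sketch below checks this against sklearn.metrics.r2_score on synthetic data (an assumption used here so the example runs quickly without downloading anything).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

pipe = make_pipeline(SimpleImputer(), RobustScaler(), SVR())
pipe.fit(X, y)

# predict() runs imputation and scaling, then calls SVR.predict().
preds = pipe.predict(X)

# score() does the same and returns R^2, matching r2_score on the predictions.
score_from_pipeline = pipe.score(X, y)
score_from_metric = r2_score(y, preds)
print(np.isclose(score_from_pipeline, score_from_metric))
```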
As discussed above, Pipeline has the same API as ML models, so we can pass it to GridSearchCV to search over a grid of parameter values with cross-validation and find the best-fitting model. We can provide parameters for any step of the pipeline by writing the step name followed by the parameter name, separated by a double underscore. With make_pipeline, each step is named after its class in lowercase (for example, svr__C).
Note: The n_jobs=-1 parameter makes use of all available cores on the computer.
params = {'simpleimputer__strategy': ['mean', 'median'],
          'robustscaler__quantile_range': [(25., 75.), (30., 70.), (40., 60.)],
          'svr__C': [0.1, 1.0, 10.],
          'svr__gamma': ['auto', 0.1, 0.3]}
grid = GridSearchCV(processing_pipeline, param_grid=params, n_jobs=-1, cv=3, verbose=3)
grid.fit(X_train, Y_train);
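The keys of a parameter grid like the one above come from the step names that make_pipeline generates automatically. A short sketch of how to inspect those names and the parameters they expose is shown here; the variable names (pipe, step_names, svr_params) are chosen for illustration.

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(SimpleImputer(), RobustScaler(), SVR())

# make_pipeline names each step after its class, in lowercase.
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['simpleimputer', 'robustscaler', 'svr']

# Every tunable parameter is exposed as '<step name>__<parameter name>'.
svr_params = sorted(p for p in pipe.get_params() if p.startswith('svr__'))
print(svr_params)

# The same naming works with set_params, which is what GridSearchCV relies on.
pipe.set_params(svr__C=10.0, simpleimputer__strategy='median')
print(pipe.named_steps['svr'].C, pipe.named_steps['simpleimputer'].strategy)
```

Calling get_params() on a pipeline is a quick way to find the exact key to use before writing a parameter grid.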
Below we evaluate the performance of the best pipeline found by the grid search. It seems to perform considerably better than the pipeline with default parameter values.
print('Train R^2 Score : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
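Note that best_score_ is the mean cross-validated score of the best parameter combination, so it will usually differ from the train and test scores of best_estimator_. As a rough sketch of how to look at every combination that was tried, cv_results_ can be loaded into a pandas DataFrame. The synthetic data and the reduced grid below are assumptions chosen to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))
y = 2.0 * X[:, 0] + X[:, 2] + rng.normal(scale=0.1, size=150)

pipe = make_pipeline(SimpleImputer(), RobustScaler(), SVR())

# A deliberately small grid so the search finishes quickly.
small_params = {'svr__C': [0.1, 1.0, 10.0],
                'simpleimputer__strategy': ['mean', 'median']}
search = GridSearchCV(pipe, param_grid=small_params, cv=3)
search.fit(X, y)

# One row per parameter combination, with its mean cross-validated score.
results = pd.DataFrame(search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']])

# best_score_ is the mean_test_score of the top-ranked row.
top_score = results.loc[search.best_index_, 'mean_test_score']
print(np.isclose(top_score, search.best_score_))
```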