Scikit-Learn - Creating ML Processing Pipeline

Pipelining Estimators

Before fitting an ML model to a dataset, we usually perform several preprocessing steps such as imputation (filling in missing values), scaling, and feature extraction. We can run each step individually, or we can use scikit-learn's make_pipeline API, which combines the steps into a single pipeline that applies them to the data in sequence.

We'll start by importing necessary libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

np.set_printoptions(precision=2)
%matplotlib inline

Loading Data

We'll use the California housing dataset. The fetch_california_housing function in sklearn's datasets module downloads it from the internet and returns the feature matrix and the target values.

In [2]:
from sklearn.datasets import fetch_california_housing

california = fetch_california_housing()
X_calif, Y_calif = california.data, california.target
print('Dataset Sizes : ', X_calif.shape, Y_calif.shape)
Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to /home/sunny/scikit_learn_data
Dataset Sizes :  (20640, 8) (20640,)

Splitting Data Into Train/Test Sets

We'll split the dataset into two parts:

  • Training data, which will be used to train the model.
  • Test data, against which the performance of the trained model will be checked.

The train_test_split function of sklearn's model_selection module splits the data into two sets, with 80% for training and 20% for testing. We also pass a seed (random_state=123) to train_test_split so that we always get the same split and can reproduce the results later.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_calif, Y_calif, train_size=0.80, test_size=0.20, random_state=123)
print('Train/Test Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
Train/Test Sizes :  (16512, 8) (4128, 8) (16512,) (4128,)
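
To illustrate why the seed matters, here is a minimal sketch on a small synthetic array. The data and the names X_demo, y_demo, split_a and split_b are made up for illustration and are not part of the housing dataset. Two calls with the same random_state should return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed.
split_a = train_test_split(X_demo, y_demo, train_size=0.80, random_state=123)
split_b = train_test_split(X_demo, y_demo, train_size=0.80, random_state=123)

# Compare every returned array (X_train, X_test, y_train, y_test) pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print('Same split with same seed :', same_split)
print('Train/Test Sizes :', split_a[0].shape, split_a[1].shape)
```

Omitting random_state would shuffle differently on each call, so scores reported later could vary from run to run.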

Processing Pipeline Creation

Below we'll create a processing pipeline consisting of 3 main steps:

  1. Imputation: Handles NA entries in the dataset.
  2. Scaling: Brings the features onto a comparable scale so that the model converges faster.
  3. ML Model: The actual ML model for the regression task.

For imputation, we'll use SimpleImputer from scikit-learn, which by default replaces missing (NaN) values with the mean of their column.

For scaling, we'll use RobustScaler from scikit-learn, which centers each feature on its median and scales it by a quantile range (by default the interquartile range), making it less sensitive to outliers.

For the ML model, we'll use SVR, a regression model based on the Support Vector Machine.
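
As a rough illustration of what the first two steps do to the data, the sketch below applies them to a tiny made-up column (X_demo is hypothetical) and compares the scaler's output with the same computation done by hand in NumPy.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

# Hypothetical single-feature column with one missing value.
X_demo = np.array([[1.0], [2.0], [np.nan], [3.0]])

# Step 1: the NaN is replaced by the mean of the observed values.
imputed = SimpleImputer(strategy='mean').fit_transform(X_demo)

# Step 2: subtract the median, then divide by the 25%-75% quantile range.
scaled = RobustScaler(quantile_range=(25.0, 75.0)).fit_transform(imputed)

# The same scaling computed by hand for comparison.
q25, q75 = np.percentile(imputed, [25, 75])
manual = (imputed - np.median(imputed)) / (q75 - q25)

print('Imputed :', imputed.ravel())
print('Scaled  :', scaled.ravel())
print('Matches manual computation :', np.allclose(scaled, manual))
```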

In [4]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

processing_pipeline = make_pipeline(SimpleImputer(), RobustScaler(), SVR())
processing_pipeline
Out[4]:
Pipeline(memory=None,
         steps=[('simpleimputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('robustscaler',
                 RobustScaler(copy=True, quantile_range=(25.0, 75.0),
                              with_centering=True, with_scaling=True)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                     shrinking=True, tol=0.001, verbose=False))],
         verbose=False)
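
The step names in the output above ('simpleimputer', 'robustscaler', 'svr') are generated by make_pipeline from the class names. If custom names are preferred, the Pipeline class accepts explicit (name, estimator) pairs. The names 'imputer', 'scaler' and 'model' below are arbitrary choices for this sketch.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

# make_pipeline derives step names from the lowercased class names.
auto_named = make_pipeline(SimpleImputer(), RobustScaler(), SVR())
print(list(auto_named.named_steps))

# Pipeline takes explicit (name, estimator) pairs; these names are hypothetical.
custom_named = Pipeline(steps=[('imputer', SimpleImputer()),
                               ('scaler', RobustScaler()),
                               ('model', SVR())])
print(list(custom_named.named_steps))

# Individual steps can be looked up by name.
print(type(custom_named.named_steps['model']).__name__)
```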

Fitting Processing Pipeline To Train Data & Evaluating On Test Data

A Pipeline object exposes the same API as the ML models available in scikit-learn. It has fit(), predict(), and score() methods, each of which runs the data through all preprocessing steps before calling the corresponding method of the final estimator. We'll check the performance of the pipeline on both train and test data.

In [5]:
processing_pipeline.fit(X_train, Y_train)

print('Train R^2 Score : %.3f'%processing_pipeline.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%processing_pipeline.score(X_test, Y_test))
Train R^2 Score : 0.766
Test R^2 Score : 0.759
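
To make explicit what fit() and predict() do behind the scenes, the following sketch chains the same three steps by hand and checks that the predictions agree with the pipeline's. It uses small synthetic data (X_demo, y_demo and the split names are assumptions for illustration) rather than the housing data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

# Hypothetical synthetic regression data.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = X_demo[:150], X_demo[150:], y_demo[:150], y_demo[150:]

# Pipeline: one fit call, one predict call.
pipe = make_pipeline(SimpleImputer(), RobustScaler(), SVR())
pipe_pred = pipe.fit(X_tr, y_tr).predict(X_te)

# Manual equivalent: fit each transformer on the training data only,
# then reuse the fitted transformers on the test data.
imputer, scaler, model = SimpleImputer(), RobustScaler(), SVR()
X_tr_prep = scaler.fit_transform(imputer.fit_transform(X_tr))
model.fit(X_tr_prep, y_tr)
manual_pred = model.predict(scaler.transform(imputer.transform(X_te)))

print('Same predictions :', np.allclose(pipe_pred, manual_pred))
```

The pipeline version is shorter and makes it harder to accidentally fit a transformer on the test data.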

Grid Search On Train Data (Hyperparameter Tuning)

As discussed above, a Pipeline has the same API as the ML models, so we can pass it to GridSearchCV to search over a grid of parameter values and find the best-performing combination using cross-validation. Parameters for any step of the pipeline are specified as the step name followed by the parameter name, separated by a double underscore (for example, svr__C). make_pipeline names each step after its lowercased class name.

Note: The n_jobs=-1 parameter makes use of all available cores on the computer.

In [6]:
params = {'simpleimputer__strategy':['mean','median'],
          'robustscaler__quantile_range': [(25.,75.), (30.,70.), (40.,60.)],
          'svr__C': [0.1,1.0,10.],
          'svr__gamma': ['auto', 0.1,0.3]}

grid = GridSearchCV(processing_pipeline, param_grid=params, n_jobs=-1, cv=3, verbose=3)
grid.fit(X_train, Y_train);
Fitting 3 folds for each of 54 candidates, totalling 162 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:   56.5s
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done 162 out of 162 | elapsed:  7.9min finished
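
The valid parameter names, and the candidate and fit counts reported in the log above, can be checked without running the search. The sketch below rebuilds the pipeline and the grid from the previous cells; ParameterGrid is used here only to count the combinations, and n_folds mirrors cv=3.

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

pipe = make_pipeline(SimpleImputer(), RobustScaler(), SVR())

# get_params() lists every tunable name in <step>__<parameter> form.
nested_names = sorted(name for name in pipe.get_params() if '__' in name)
print(nested_names)

# The same grid as in the cell above.
params = {'simpleimputer__strategy': ['mean', 'median'],
          'robustscaler__quantile_range': [(25., 75.), (30., 70.), (40., 60.)],
          'svr__C': [0.1, 1.0, 10.],
          'svr__gamma': ['auto', 0.1, 0.3]}

# Each combination is fitted once per cross-validation fold.
n_candidates = len(ParameterGrid(params))   # 2 * 3 * 3 * 3 combinations
n_folds = 3
print('Candidates :', n_candidates)
print('Total fits :', n_candidates * n_folds)
```

Because the number of fits grows multiplicatively with each added parameter list, it is worth checking this count before launching a long search.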

Evaluating Pipeline On Test Data

Below we evaluate the performance of the tuned pipeline. It performs slightly better on the test data than the pipeline with default parameter values (test R^2 of 0.768 versus 0.759).

In [7]:
print('Train R^2 Score : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
Train R^2 Score : 0.796
Test R^2 Score : 0.768
Best R^2 Score Through Grid Search : 0.755
Best Parameters :  {'robustscaler__quantile_range': (25.0, 75.0), 'simpleimputer__strategy': 'mean', 'svr__C': 10.0, 'svr__gamma': 'auto'}
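
The grid search score printed above is the mean cross-validated score of the best candidate, which is why it differs from the train and test scores of best_estimator_. A minimal sketch of how the per-candidate scores can be inspected through cv_results_ follows. It uses synthetic data and a deliberately tiny grid (demo_grid, X_demo and y_demo are hypothetical names) so that it runs in seconds.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

# Hypothetical synthetic data and a small grid to keep the run short.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

demo_grid = GridSearchCV(make_pipeline(RobustScaler(), SVR()),
                         param_grid={'svr__C': [0.1, 1.0, 10.0]}, cv=3)
demo_grid.fit(X_demo, y_demo)

# cv_results_ holds one entry per candidate, e.g. its mean validation score.
results = demo_grid.cv_results_
for candidate, mean_score in zip(results['params'], results['mean_test_score']):
    print(candidate, '%.3f' % mean_score)

# best_score_ is the mean cross-validated score of the best candidate.
best_mean = results['mean_test_score'][demo_grid.best_index_]
print('best_score_ matches :', np.isclose(demo_grid.best_score_, best_mean))
```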


Sunny Solanki