Scikit-Learn - Data Preprocessing [Scaling, Imputation & One-Hot Encoding]

Introduction

Unsupervised learning techniques such as data transformations, dimensionality reduction, and manifold learning find a new representation of data. Unlike supervised learning, unsupervised learning does not have a target variable to predict. Transformations like scaling, imputation, and one-hot encoding are generally referred to as data preprocessing. They prepare data to be fed into supervised or unsupervised machine learning algorithms.

We'll start with simple rescaling and then proceed to imputation and one-hot encoding. Rescaling is generally regarded as a preprocessing step rather than a learning step.

Types of Data Preprocessing

Below is a list of common data preprocessing steps that are generally used to prepare data. We'll discuss all of them one by one in detail.

  • Scaling: It rescales the values of each column so that all columns fall into a comparable range, which helps many machine learning algorithms converge faster.
  • Imputation: It fills in missing (NA) values in the data.
  • One Hot Encoding: It transforms a categorical column into several binary columns, where each new column represents the presence/absence of one value of the categorical column.

We'll start by importing all the necessary libraries.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

import sys

import sklearn

print("Scikit-Learn Version : ",sklearn.__version__)

np.set_printoptions(precision=2)

%matplotlib inline
Scikit-Learn Version :  0.21.2

1. Scaling

Scaling is a commonly used technique that brings all columns of data onto a common scale, and hence into a comparable range. If different columns of data are in very different ranges, it can slow down the convergence of gradient-based machine learning algorithms. Scaling is generally used when different columns of your data have values in ranges that vary a lot (0-1, 0-1000000, etc).

Scikit-Learn provides various scalers which we can use for our purpose.

  • sklearn.preprocessing.StandardScaler: Scales each feature by subtracting its mean and dividing by its standard deviation, so every feature ends up centered at zero with unit variance.
  • sklearn.preprocessing.MinMaxScaler: Scales each feature into the range given by the feature_range parameter, a (min, max) tuple.

    • Formula:

      X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

      X_scaled = X_std * (max - min) + min

  • sklearn.preprocessing.RobustScaler: Scales each feature using statistics that are robust to outliers. It removes the median and then scales by a quantile range (by default the interquartile range, between the 1st and 3rd quartiles).
  • sklearn.preprocessing.Normalizer: Normalizes each sample (row) to unit norm. The l2 norm is the default; l1 is also available.
  • sklearn.preprocessing.MaxAbsScaler: Scales each feature by its maximum absolute value.

StandardScaler

A very commonly used scaling technique is called "standardization". Here, we rescale data by subtracting its mean and then dividing by its standard deviation. This centers the data around zero with unit variance (standard deviation=1).

In [2]:
data = np.arange(1,6)
scaled_data = (data - data.mean()) / data.std()
data, scaled_data
Out[2]:
(array([1, 2, 3, 4, 5]), array([-1.41, -0.71,  0.  ,  0.71,  1.41]))

Scikit-learn provides the StandardScaler class, which implements this functionality. It standardizes each feature independently so that it has zero mean and unit standard deviation on the data it was fit on. It does not bound values to a fixed range.

We'll load the iris data provided by scikit-learn and split it into training and test sets. Once the data is split, we'll fit the standard scaler on the train data and later apply it to both train and test data.

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, Y = iris.data, iris.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, random_state=2)
X.shape, Y.shape, X_train.shape, Y_train.shape, X_test.shape, Y_test.shape
Out[3]:
((150, 4), (150,), (120, 4), (120,), (30, 4), (30,))
In [4]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

Note in the output below that after scaling, the mean of the train data is almost zero and the standard deviation is 1.

In [5]:
print('Mean of train data : ', X_train.mean(axis=0))
print('Standard Deviation of train data : ', X_train.std(axis=0))

print('\nMean of train data through standard scaler: ', scaler.mean_)
print('Standard Deviation of train data through standard scaler : ', scaler.scale_)

print('\nMean of scaled train data : ', X_train_scaled.mean(axis=0))
print('Standard Deviation of scaled train data : ', X_train_scaled.std(axis=0))
Mean of train data :  [5.9  3.06 3.86 1.24]
Standard Deviation of train data :  [0.81 0.44 1.74 0.75]

Mean of train data through standard scaler:  [5.9  3.06 3.86 1.24]
Standard Deviation of train data through standard scaler :  [0.81 0.44 1.74 0.75]

Mean of scaled train data :  [1.16e-15 1.87e-15 4.03e-16 6.00e-16]
Standard Deviation of scaled train data :  [1. 1. 1. 1.]

Note that we apply the transformation that was fit on the train data to the test data, so the scaled test data will not have exactly zero mean like the scaled train data does. This is because the test data is scaled with the mean and standard deviation calculated on the train data.

In [6]:
X_test_scaled = scaler.transform(X_test)

print("Mean of scaled test data: %s" % X_test_scaled.mean(axis=0))
Mean of scaled test data: [-0.33 -0.03 -0.3  -0.3 ]
NOTE

Please note that we apply the scaler trained on the train data to the test data rather than training it again on the test data. A common mistake is to fit the scaler on the test data as well and then transform the test data with it. Let's visualize both cases below to make the difference clear.

In [7]:
def plot_data(row,col,i, X_train,X_test,title):
    # Scatter the first two features of the train (green) and test (red) sets in subplot i.
    with plt.style.context('ggplot'):
        plt.subplot(row,col,i)
        plt.scatter(X_train[:, 0],X_train[:, 1], color='green', marker= 's', label='Train Data')
        plt.scatter(X_test[:, 0],X_test[:, 1], color='red', marker= 's', label='Test Data')
        plt.legend(loc = 'best')
        plt.title(title)

Please pay attention to the position of train/test data points relative to each other in the original, truly scaled, and falsely scaled data visualizations. You'll notice that the relative position is the same in both the Original and Truly Scaled data but has shifted noticeably in the Falsely Scaled data.

In [ ]:
# Fit on train data only and reuse the same scaler for both splits.
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

X_test_scaled_true = scaler.transform(X_test)

# Mistake shown on purpose: a separate scaler fit on the test data itself.
X_test_scaled_false = StandardScaler().fit_transform(X_test)

plt.figure(figsize=(20,6))
plot_data(1,3,1,X_train,X_test, 'Original Data')
plot_data(1,3,2,X_train_scaled,X_test_scaled_false, 'Falsely Scaled Data')
plot_data(1,3,3,X_train_scaled,X_test_scaled_true, 'Truly Scaled Data')

plt.show()
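
The same effect can be checked numerically. Below is a minimal sketch that reloads iris with the same split settings under new names (X_tr, X_te, train_scaler are illustrative and not part of the notebook above) and compares the per-feature mean of the test data under both approaches.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, Y_all = load_iris(return_X_y=True)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_all, Y_all, train_size=0.80, test_size=0.20, random_state=2
)

# Correct: the mean and standard deviation come from the train split only.
train_scaler = StandardScaler().fit(X_tr)
X_te_correct = train_scaler.transform(X_te)

# Mistake: a second scaler fit on the test split itself.
X_te_wrong = StandardScaler().fit_transform(X_te)

print("Test mean, scaler fit on train :", X_te_correct.mean(axis=0))
print("Test mean, scaler fit on test  :", X_te_wrong.mean(axis=0))
```

The scaler fit on the test data forces the test mean to zero, which hides the real offset between the train and test distributions that a model trained on the train data would see.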

MinMaxScaler

The main argument of MinMaxScaler is feature_range, a (min, max) tuple specifying the range into which the values of each column are scaled. The default is (0, 1).

In [9]:
from sklearn.preprocessing import MinMaxScaler, RobustScaler, MaxAbsScaler, Normalizer

minmax_scaler = MinMaxScaler(feature_range=(0,1))
In [10]:
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)

print('Max value seen per feature : ',minmax_scaler.data_max_)
print('\nMinimum value seen per feature : ', minmax_scaler.data_min_)
print('\nRange seen per feature (data_max_ - data_min_) : ', minmax_scaler.data_range_)
print('\nPer feature adjustment for minimum (min - X.min(axis=0) * self.scale_)', minmax_scaler.min_)
print('\nPer feature relative scaling of the data ((max - min) / (X.max(axis=0) - X.min(axis=0)))',minmax_scaler.scale_)
Max value seen per feature :  [7.9 4.4 6.9 2.5]

Minimum value seen per feature :  [4.3 2.  1.  0.1]

Range seen per feature (data_max_ - data_min_) :  [3.6 2.4 5.9 2.4]

Per feature adjustment for minimum (min - X.min(axis=0) * self.scale_) [-1.19 -0.83 -0.17 -0.04]

Per feature relative scaling of the data ((max - min) / (X.max(axis=0) - X.min(axis=0))) [0.28 0.42 0.17 0.42]
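
To connect the MinMaxScaler formula to these attributes, here is a small sketch on a toy array (the array, the range (-1, 1), and the names X_toy, lo, hi, demo_minmax are made up for illustration). It applies the formula by hand and compares the result with what the scaler returns.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [4.0, 40.0]])
lo, hi = -1.0, 1.0  # target feature_range

# Formula applied by hand, column by column.
X_std = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))
X_manual = X_std * (hi - lo) + lo

demo_minmax = MinMaxScaler(feature_range=(lo, hi))
X_sklearn = demo_minmax.fit_transform(X_toy)

print(np.allclose(X_manual, X_sklearn))
# The fitted attributes express the same mapping as X * scale_ + min_.
print(np.allclose(X_toy * demo_minmax.scale_ + demo_minmax.min_, X_sklearn))
```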

RobustScaler

The RobustScaler is based on the median and the interquartile range of the data: it subtracts the median from each feature and divides by the interquartile range (the distance between the 25th and 75th percentiles). It is less sensitive to outliers than scalers based on the mean and standard deviation or on the minimum and maximum, because a few extreme values can shift those statistics a lot.

In [11]:
robust_scaler = RobustScaler()
In [12]:
X_train_robust = robust_scaler.fit_transform(X_train)
X_test_robust = robust_scaler.transform(X_test)

print('The median value for each feature : ',robust_scaler.center_)
print('The (scaled) interquartile range for each feature in the training set : ',robust_scaler.scale_)
The median value for each feature :  [5.8  3.   4.45 1.4 ]
The (scaled) interquartile range for each feature in the training set :  [1.2  0.6  3.5  1.43]
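
Iris has few extreme values, so the benefit is easier to see on a toy column with one obvious outlier. The sketch below uses made-up data and names (x_out, demo_robust, demo_standard) to compare RobustScaler with StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four ordinary values and one outlier.
x_out = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

demo_robust = RobustScaler().fit(x_out)
demo_standard = StandardScaler().fit(x_out)

print("Median / IQR used by RobustScaler :", demo_robust.center_, demo_robust.scale_)
print("Mean / std used by StandardScaler :", demo_standard.mean_, demo_standard.scale_)

print("Robust scaled   :", demo_robust.transform(x_out).ravel())
print("Standard scaled :", demo_standard.transform(x_out).ravel())
```

With the robust scaler the four ordinary values stay spread out around zero, while with the standard scaler the outlier inflates the mean and standard deviation and squeezes the ordinary values close together.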

MaxAbsScaler

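The MaxAbsScaler, as its name suggests, scales each feature by its maximum absolute value, so that the largest absolute value of each feature in the training data becomes 1. It does not shift or center the data.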

In [13]:
maxabs_scaler = MaxAbsScaler()
In [14]:
X_train_maxabs = maxabs_scaler.fit_transform(X_train)
X_test_maxabs = maxabs_scaler.transform(X_test)

print('Per feature maximum absolute value : ',maxabs_scaler.max_abs_)
print('Per feature relative scaling of the data : ',maxabs_scaler.scale_)
Per feature maximum absolute value :  [7.9 4.4 6.9 2.5]
Per feature relative scaling of the data :  [7.9 4.4 6.9 2.5]
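
All iris values are positive, so the output above does not show what happens with negative values. Here is a short sketch on a made-up array (X_signed and demo_maxabs are illustrative names) showing that the result lies in [-1, 1] and that zeros stay zero.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_signed = np.array([[ 1.0, -50.0],
                     [-4.0,  25.0],
                     [ 2.0,   0.0]])

demo_maxabs = MaxAbsScaler()
X_signed_scaled = demo_maxabs.fit_transform(X_signed)

print("Max absolute value per feature :", demo_maxabs.max_abs_)
print(X_signed_scaled)
```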

Normalizer

It normalizes individual samples of data. Unlike the other scalers, which work on a column basis, this one works on a row basis: each row is divided by its own norm so that it has unit norm. Scikit-learn provides a class named Normalizer for this. We'll use two of the norms it supports.

  • L1 Normalization: Each row is divided by the sum of the absolute values of its entries.
  • L2 Normalization: Each row is divided by the square root of the sum of squares of its entries.
In [15]:
normalizer_l2 = Normalizer() ## Default is l2
normalizer_l1 = Normalizer(norm='l1')
In [16]:
X_train_normalized_l1 = normalizer_l1.fit_transform(X_train)
X_test_normalized_l1 = normalizer_l1.transform(X_test)

print(normalizer_l1.get_params())
{'copy': True, 'norm': 'l1'}
In [17]:
X_train_normalized_l2 = normalizer_l2.fit_transform(X_train)
X_test_normalized_l2 = normalizer_l2.transform(X_test)

print(normalizer_l2.get_params())
{'copy': True, 'norm': 'l2'}
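
As a quick check of the row-wise behaviour, the sketch below normalizes a small made-up array (X_rows is an illustrative name) and verifies that every row ends up with norm 1 under the chosen norm.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X_rows = np.array([[ 3.0, 4.0],
                   [ 1.0, 1.0],
                   [-2.0, 6.0]])

X_rows_l1 = Normalizer(norm="l1").fit_transform(X_rows)
X_rows_l2 = Normalizer(norm="l2").fit_transform(X_rows)

# Each row sums to 1 in absolute value (l1) or has Euclidean length 1 (l2).
print("L1 norm of each row :", np.abs(X_rows_l1).sum(axis=1))
print("L2 norm of each row :", np.linalg.norm(X_rows_l2, axis=1))
```

Normalizer keeps no statistics from fit(), so transforming the test data depends only on each test row itself.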

Visualize Scaled Data

Below we'll visualize the scaled data generated by all the scalers above (Normalizer appears twice, once for each norm). We'll use the first 2 features of the IRIS dataset for visualization purposes.


NOTE

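Please look at the scale on the X and Y axes for each scaler. Also notice that the column-wise scalers (Standard, MinMax, MaxAbs, Robust) only shift and stretch the axes, so the relative positions of points are the same as in the original dataset, whereas the Normalizer rescales each row and therefore changes the shape of the point cloud.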

In [ ]:
with plt.style.context('ggplot'):
    plt.figure(figsize=(20,15))

    plot_data(3,3,1,X_train,X_test, 'Original Data')
    plot_data(3,3,2,X_train_scaled,X_test_scaled, 'Standard Scaler')
    plot_data(3,3,3,X_train_minmax,X_test_minmax, 'MinMax Scaler')
    plot_data(3,3,4,X_train_maxabs,X_test_maxabs, 'MaxAbs Scaler')
    plot_data(3,3,5,X_train_robust,X_test_robust, 'Robust Scaler')
    plot_data(3,3,6,X_train_normalized_l1,X_test_normalized_l1, 'L1 Normalizer')
    plot_data(3,3,7,X_train_normalized_l2,X_test_normalized_l2, 'L2 Normalizer')

    plt.show()

2. Imputation

Real-world datasets generally have missing values, which can be due to many reasons: data not collected, lost in the process, legal restrictions, etc. Most machine learning algorithms need input data without any missing values, and most scikit-learn estimators fail if they are given data with missing values. Hence we either need to remove samples with missing data or we need some way to fill in the missing data. Missing data is generally encoded as blanks, NaNs, or some other placeholder value, depending on the dataset.

SimpleImputer

Scikit-Learn provides the SimpleImputer class, which offers several strategies to fill in missing values. We'll explain its usage below with examples. We'll be creating dummy data with NaNs for explanation purposes.

In [19]:
rng = np.random.RandomState(123)
data = rng.rand(10,10)
data[data>0.9] = np.nan
data
Out[19]:
array([[0.7 , 0.29, 0.23, 0.55, 0.72, 0.42,  nan, 0.68, 0.48, 0.39],
       [0.34, 0.73, 0.44, 0.06, 0.4 , 0.74, 0.18, 0.18, 0.53, 0.53],
       [0.63, 0.85, 0.72, 0.61, 0.72, 0.32, 0.36, 0.23, 0.29, 0.63],
       [0.09, 0.43, 0.43, 0.49, 0.43, 0.31, 0.43, 0.89,  nan, 0.5 ],
       [0.62, 0.12, 0.32, 0.41, 0.87, 0.25, 0.48,  nan, 0.52, 0.61],
       [0.12, 0.83, 0.6 , 0.55, 0.34, 0.3 , 0.42, 0.68, 0.88, 0.51],
       [0.67, 0.59, 0.62, 0.67, 0.84, 0.08, 0.76, 0.24, 0.19, 0.57],
       [0.1 , 0.89, 0.63, 0.72, 0.02, 0.59, 0.56, 0.16, 0.15, 0.7 ],
       [0.32, 0.69, 0.55, 0.39,  nan, 0.84, 0.36, 0.04, 0.3 , 0.4 ],
       [0.7 ,  nan, 0.36, 0.76, 0.59, 0.69, 0.15, 0.4 , 0.24, 0.34]])

With default settings, SimpleImputer replaces every np.nan value with the mean of its column. We can directly call the fit_transform() method on an instance of SimpleImputer and it'll return the data with NaNs replaced.

In [20]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer()
In [21]:
imputer.fit_transform(data)
Out[21]:
array([[0.7 , 0.29, 0.23, 0.55, 0.72, 0.42, 0.41, 0.68, 0.48, 0.39],
       [0.34, 0.73, 0.44, 0.06, 0.4 , 0.74, 0.18, 0.18, 0.53, 0.53],
       [0.63, 0.85, 0.72, 0.61, 0.72, 0.32, 0.36, 0.23, 0.29, 0.63],
       [0.09, 0.43, 0.43, 0.49, 0.43, 0.31, 0.43, 0.89, 0.4 , 0.5 ],
       [0.62, 0.12, 0.32, 0.41, 0.87, 0.25, 0.48, 0.39, 0.52, 0.61],
       [0.12, 0.83, 0.6 , 0.55, 0.34, 0.3 , 0.42, 0.68, 0.88, 0.51],
       [0.67, 0.59, 0.62, 0.67, 0.84, 0.08, 0.76, 0.24, 0.19, 0.57],
       [0.1 , 0.89, 0.63, 0.72, 0.02, 0.59, 0.56, 0.16, 0.15, 0.7 ],
       [0.32, 0.69, 0.55, 0.39, 0.55, 0.84, 0.36, 0.04, 0.3 , 0.4 ],
       [0.7 , 0.6 , 0.36, 0.76, 0.59, 0.69, 0.15, 0.4 , 0.24, 0.34]])

We can access the statistics_ attribute of the fitted imputer, which holds the fill value computed for each column.

In [22]:
imputer.statistics_
Out[22]:
array([0.43, 0.6 , 0.49, 0.52, 0.55, 0.46, 0.41, 0.39, 0.4 , 0.52])
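
As a sanity check, these values should match the per-column means computed while ignoring NaNs. The sketch below rebuilds the same dummy data under new names (demo_rng, demo_data, demo_imputer are illustrative) and compares statistics_ with np.nanmean.

```python
import numpy as np
from sklearn.impute import SimpleImputer

demo_rng = np.random.RandomState(123)
demo_data = demo_rng.rand(10, 10)
demo_data[demo_data > 0.9] = np.nan

demo_imputer = SimpleImputer()  # default strategy is "mean"
demo_filled = demo_imputer.fit_transform(demo_data)

# Column means computed by NumPy while skipping NaNs.
col_means = np.nanmean(demo_data, axis=0)
print(np.allclose(demo_imputer.statistics_, col_means))

# Every position that was NaN now holds the mean of its column.
nan_rows, nan_cols = np.where(np.isnan(demo_data))
print(np.allclose(demo_filled[nan_rows, nan_cols], col_means[nan_cols]))
```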

Important Parameters of SimpleImputer

Below we have given a list of important parameters of SimpleImputer which can be used to change the default behavior.

  • missing_values: It lets us specify which value is treated as missing. The default is np.nan but we can specify another placeholder such as an integer, a string, or None.
  • strategy: It accepts a string specifying which strategy to use to fill in missing values. Below is a list of possible values for this parameter.

    • mean: It takes the mean of the column and replaces missing values with the mean of that column.
    • median: It takes the median of the column and replaces missing values with the median of that column.
    • most_frequent: It takes the most frequently occurring value of the column and replaces missing values with that value.
    • constant: It takes the value specified in the fill_value parameter and replaces all missing values with that one value.
  • fill_value: It accepts a number or string which will be used to replace all missing values when the strategy is constant. If we do not specify this value with strategy constant, it'll use 0 for numerical data and the string "missing_value" for string or object data.
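
As a small illustration of the most_frequent strategy, and of reusing a fitted imputer on new data, here is a sketch with made-up arrays (toy_train, toy_new, and mode_imputer are illustrative names).

```python
import numpy as np
from sklearn.impute import SimpleImputer

toy_train = np.array([[1.0, 7.0],
                      [1.0, np.nan],
                      [np.nan, 7.0],
                      [3.0, 9.0]])
toy_new = np.array([[np.nan, np.nan]])

mode_imputer = SimpleImputer(strategy="most_frequent")
mode_imputer.fit(toy_train)

# Fill values learned from toy_train: the most common value in each column.
print(mode_imputer.statistics_)

# New rows are filled with the values learned during fit.
print(mode_imputer.transform(toy_new))
```

As with scalers, the fill values are learned during fit(), so the imputer should be fit on train data and then applied to test data.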

Below we have transformed data again but using median as a strategy.

In [23]:
imputer = SimpleImputer(strategy="median")
imputer.fit_transform(data)
Out[23]:
array([[0.7 , 0.29, 0.23, 0.55, 0.72, 0.42, 0.42, 0.68, 0.48, 0.39],
       [0.34, 0.73, 0.44, 0.06, 0.4 , 0.74, 0.18, 0.18, 0.53, 0.53],
       [0.63, 0.85, 0.72, 0.61, 0.72, 0.32, 0.36, 0.23, 0.29, 0.63],
       [0.09, 0.43, 0.43, 0.49, 0.43, 0.31, 0.43, 0.89, 0.3 , 0.5 ],
       [0.62, 0.12, 0.32, 0.41, 0.87, 0.25, 0.48, 0.24, 0.52, 0.61],
       [0.12, 0.83, 0.6 , 0.55, 0.34, 0.3 , 0.42, 0.68, 0.88, 0.51],
       [0.67, 0.59, 0.62, 0.67, 0.84, 0.08, 0.76, 0.24, 0.19, 0.57],
       [0.1 , 0.89, 0.63, 0.72, 0.02, 0.59, 0.56, 0.16, 0.15, 0.7 ],
       [0.32, 0.69, 0.55, 0.39, 0.59, 0.84, 0.36, 0.04, 0.3 , 0.4 ],
       [0.7 , 0.69, 0.36, 0.76, 0.59, 0.69, 0.15, 0.4 , 0.24, 0.34]])
In [24]:
imputer.statistics_
Out[24]:
array([0.48, 0.69, 0.5 , 0.55, 0.59, 0.37, 0.42, 0.24, 0.3 , 0.52])

Below we have created another artificial dataset where missing values are represented by -1. We'll replace these missing values using the constant strategy with a fill value of 0.0.

In [25]:
data = rng.rand(10,10)
data[data>0.9] = -1
data
Out[25]:
array([[ 0.51,  0.67,  0.11,  0.13,  0.32,  0.66,  0.85,  0.55,  0.85,
         0.38],
       [ 0.32,  0.35,  0.17,  0.83,  0.34,  0.55,  0.58,  0.52,  0.  ,
        -1.  ],
       [-1.  ,  0.21,  0.29,  0.52, -1.  , -1.  ,  0.26,  0.56,  0.81,
         0.39],
       [ 0.73,  0.16,  0.6 ,  0.87, -1.  ,  0.08,  0.43,  0.2 ,  0.45,
         0.55],
       [ 0.09,  0.3 , -1.  ,  0.57,  0.46,  0.75,  0.74,  0.05,  0.71,
         0.84],
       [ 0.17,  0.78,  0.29,  0.31,  0.67,  0.11,  0.66,  0.89,  0.7 ,
         0.44],
       [ 0.44,  0.77,  0.57,  0.08,  0.58,  0.81,  0.34, -1.  ,  0.75,
         0.57],
       [ 0.75,  0.08,  0.86,  0.82, -1.  ,  0.13,  0.08,  0.14,  0.4 ,
         0.42],
       [ 0.56,  0.12,  0.2 ,  0.81,  0.47,  0.81,  0.01,  0.55, -1.  ,
         0.58],
       [ 0.21,  0.72,  0.38,  0.67,  0.03,  0.64,  0.03,  0.74,  0.47,
         0.12]])
In [26]:
imputer = SimpleImputer(missing_values=-1, strategy="constant", fill_value=0.0)
imputer.fit_transform(data)
Out[26]:
array([[0.51, 0.67, 0.11, 0.13, 0.32, 0.66, 0.85, 0.55, 0.85, 0.38],
       [0.32, 0.35, 0.17, 0.83, 0.34, 0.55, 0.58, 0.52, 0.  , 0.  ],
       [0.  , 0.21, 0.29, 0.52, 0.  , 0.  , 0.26, 0.56, 0.81, 0.39],
       [0.73, 0.16, 0.6 , 0.87, 0.  , 0.08, 0.43, 0.2 , 0.45, 0.55],
       [0.09, 0.3 , 0.  , 0.57, 0.46, 0.75, 0.74, 0.05, 0.71, 0.84],
       [0.17, 0.78, 0.29, 0.31, 0.67, 0.11, 0.66, 0.89, 0.7 , 0.44],
       [0.44, 0.77, 0.57, 0.08, 0.58, 0.81, 0.34, 0.  , 0.75, 0.57],
       [0.75, 0.08, 0.86, 0.82, 0.  , 0.13, 0.08, 0.14, 0.4 , 0.42],
       [0.56, 0.12, 0.2 , 0.81, 0.47, 0.81, 0.01, 0.55, 0.  , 0.58],
       [0.21, 0.72, 0.38, 0.67, 0.03, 0.64, 0.03, 0.74, 0.47, 0.12]])
In [27]:
imputer.statistics_
Out[27]:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

3. One Hot Encoding

Many real-world structured datasets have categorical columns, which hold a limited set of values that repeat across rows. These categorical columns can have values as strings or integers. Most machine learning algorithms only accept numeric values as input, hence we need to convert these categorical columns to a numeric representation.


NOTE

The main reason behind one-hot encoding is that it lets us see how each individual value of a categorical column contributes to the prediction of the target variable. If we just map the values of a categorical column to plain integers, many models will treat them as ordered numbers, and we won't get a clear idea of how each individual value impacts the prediction.

OneHotEncoder

Scikit-Learn provides the OneHotEncoder estimator, which converts categorical columns of data to one-hot encoded columns. The encoded output has one column per unique value of each categorical column, so its width is the total number of unique values across all categorical columns. We have created a dummy dataset for explanation purposes.

In [28]:
from sklearn.preprocessing import OneHotEncoder
In [29]:
data  = np.array([ ["High", "Asia"],
          ["Low", "Africa"],
          ["High", "North America"],
          ["Medium", "South America"],
          ["High", "Europe"],
          ["Medium", "Atlantic"],
          ["Low", "Arctic"],
        ])

By default, the transformed data returned by OneHotEncoder is a SciPy sparse matrix in CSR format. We can convert it to a NumPy array with toarray(), or further to a Python list of lists.

In [30]:
one_hot = OneHotEncoder()

transformed_data = one_hot.fit_transform(data)
transformed_data
Out[30]:
<7x10 sparse matrix of type '<class 'numpy.float64'>'
	with 14 stored elements in Compressed Sparse Row format>
In [31]:
print("Transformed Data Shape : ", transformed_data.shape)
transformed_data.toarray()
Transformed Data Shape :  (7, 10)
Out[31]:
array([[1., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0., 0., 0., 0.]])

We can access the categories that the encoder found for each column using the categories_ attribute of the fitted OneHotEncoder.

In [32]:
one_hot.categories_
Out[32]:
[array(['High', 'Low', 'Medium'], dtype='<U13'),
 array(['Africa', 'Arctic', 'Asia', 'Atlantic', 'Europe', 'North America',
        'South America'], dtype='<U13')]
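
To make the mapping between categories_ and the output columns concrete, the sketch below encodes a single made-up column by hand and compares it with the encoder's output (levels, level_encoder, and manual are illustrative names). It also uses inverse_transform() to recover the original labels.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

levels = np.array([["High"], ["Low"], ["High"], ["Medium"]])

level_encoder = OneHotEncoder()
encoded = level_encoder.fit_transform(levels).toarray()

# Output column j corresponds to categories_[0][j] (sorted unique values).
categories = level_encoder.categories_[0]
manual = np.array([[float(value == cat) for cat in categories]
                   for value in levels[:, 0]])

print(categories)
print(np.array_equal(encoded, manual))

# The encoding can be reversed to get the original labels back.
print(level_encoder.inverse_transform(encoded).ravel())
```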

Important Parameters of OneHotEncoder

Below is a list of important parameters of OneHotEncoder which we can modify in order to change the default behavior of the transformer.

  • categories: We can provide a list of lists, where each inner list holds the values to encode for one column. This is useful when we want to consider only a particular subset of the values of a categorical column.
  • handle_unknown: In the version used here it accepts two values.

    • error: It raises an error when a column contains a value that is not among the known categories (those given in the categories parameter or learned during fit). This is the default.
    • ignore: It prevents encoding from failing. When it finds a value that is not among the known categories, it sets all one-hot columns of that feature to 0 for that row.

    We have explained the usage of the above parameters in the below example. The categories we pass leave out Arctic and Atlantic, hence all continent columns will be set to 0 in the rows where they occur.

In [33]:
profiles = ["High", "Low", "Medium" ]
continents = ['Africa', 'Asia', 'Europe', 'North America', 'South America']

one_hot = OneHotEncoder(categories=[profiles, continents], handle_unknown="ignore")

transformed_data = one_hot.fit_transform(data)
In [34]:
print("Transformed Data Shape : ", transformed_data.shape)
transformed_data.toarray()
Transformed Data Shape :  (7, 8)
Out[34]:
array([[1., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0.]])
In [35]:
one_hot.categories_
Out[35]:
[array(['High', 'Low', 'Medium'], dtype='<U13'),
 array(['Africa', 'Asia', 'Europe', 'North America', 'South America'],
       dtype='<U13')]
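
The same parameter matters when categories are learned from train data and a new value shows up later. Below is a minimal sketch with made-up data (seen, unseen, strict_encoder, and lenient_encoder are illustrative names) contrasting the default error behaviour with ignore.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

seen = np.array([["High"], ["Low"], ["Medium"]])
unseen = np.array([["Low"], ["Very High"]])  # "Very High" never appeared during fit

# Default handle_unknown="error": transform refuses the unknown value.
strict_encoder = OneHotEncoder().fit(seen)
try:
    strict_encoder.transform(unseen)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# handle_unknown="ignore": the unknown value becomes an all-zero row.
lenient_encoder = OneHotEncoder(handle_unknown="ignore").fit(seen)
lenient_out = lenient_encoder.transform(unseen).toarray()
print(lenient_out)
```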

This ends our small tutorial on data preprocessing using scikit-learn. Please feel free to let us know your views in the comments section.

Sunny Solanki