Unsupervised learning techniques like transformations, dimensionality reduction, manifold learning, etc. find a new representation of the data. Unlike supervised learning, unsupervised learning does not have a target variable to predict. Unsupervised techniques like scaling, imputation, and one-hot encoding are generally referred to as data preprocessing: they prepare data to be fed into supervised or unsupervised machine learning algorithms.
We'll start with simple rescaling and then proceed to dimensionality reduction techniques like PCA, manifold learning, etc. Rescaling is generally regarded as a preprocessing step rather than learning.
In this tutorial we'll cover the common data preprocessing steps used to prepare data: scaling/rescaling of features, imputation of missing values, and one-hot encoding of categorical columns. We'll discuss them one by one in detail.
We'll start by importing all the necessary libraries.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import sys
import sklearn
print("Scikit-Learn Version : ",sklearn.__version__)
np.set_printoptions(precision=2)
%matplotlib inline
Scaling is a commonly used technique that brings all columns of the data onto a particular scale, hence into a particular range. If different columns of the data span very different ranges, it can prevent a machine-learning algorithm from converging quickly to the global minimum. Scaling is generally used when different columns of your data have values in ranges that vary a lot (0-1, 0-1000000, etc.).
Scikit-Learn provides various scalers which we can use for our purpose.
sklearn.preprocessing.StandardScaler: Standardizes each feature by subtracting the mean and dividing by the standard deviation.
sklearn.preprocessing.MinMaxScaler: Scales each feature to the range given by the feature_range parameter, a (min, max) tuple.
Formula:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
sklearn.preprocessing.RobustScaler: Scales each feature using statistics that are robust to outliers (the median and the interquartile range).
sklearn.preprocessing.MaxAbsScaler: Scales each feature by its maximum absolute value.
sklearn.preprocessing.Normalizer: Normalizes each sample (row) to unit norm using either the l1 or l2 norm; l2 is the default.
A quite commonly used scaling technique is called "standardization". Here, we'll rescale the data by subtracting the mean from it and then dividing by the standard deviation. This makes the data centered around the mean with unit variance (standard deviation = 1).
data = np.arange(1,6)
scaled_data = (data - data.mean()) / data.std()
data, scaled_data
Scikit-learn provides the class StandardScaler which implements this functionality. It standardizes each feature by removing the mean and scaling to unit variance, so each scaled feature has a mean of 0 and a standard deviation of 1.
We'll load the iris dataset provided by scikit-learn and split it into train and test sets. Once the data is split, we'll fit the standard scaler on the train data and later apply it to both the train and test data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, Y = iris.data, iris.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, random_state=2)
X.shape, Y.shape, X_train.shape, Y_train.shape, X_test.shape, Y_test.shape
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
Please note below that after scaling, the mean of the train data is almost zero and the standard deviation is 1.
print('Mean of train data : ', X_train.mean(axis=0))
print('Standard Deviation of train data : ', X_train.std(axis=0))
print('\nMean of train data through standard scaler: ', scaler.mean_)
print('Standard Deviation of train data through standard scaler : ', scaler.scale_)
print('\nMean of scaled train data : ', X_train_scaled.mean(axis=0))
print('Standard Deviation of scaled train data : ', X_train_scaled.std(axis=0))
Note here that we apply the same transformation that was fit on the train data to the test data, hence the scaled test data will not have exactly zero mean and unit standard deviation. This is because the mean and standard deviation calculated on the train data are used to scale the test data.
X_test_scaled = scaler.transform(X_test)
print("Mean of scaled test data: %s" % X_test_scaled.mean(axis=0))
Please note that we apply the scaler trained on the train data to the test data, rather than fitting it again on the test data. Developers sometimes make the mistake of fitting the scaler on the test data as well before transforming it. Let's visualize below to clarify this point further.
def plot_data(row, col, i, X_train, X_test, title):
    with plt.style.context(('seaborn', 'ggplot')):
        plt.subplot(row, col, i)
        plt.scatter(X_train[:, 0], X_train[:, 1], color='green', marker='s', label='Train Data')
        plt.scatter(X_test[:, 0], X_test[:, 1], color='red', marker='s', label='Test Data')
        plt.legend(loc='best')
        plt.title(title)
Please pay attention to the position of the train/test data points relative to each other in the original, truly scaled, and falsely scaled data visualizations. You'll notice that the relative positions are the same in the original and truly scaled data but have changed quite a bit in the falsely scaled data.
scaler.fit(X_train) ## Correct: fit the scaler on train data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled_true = scaler.transform(X_test) ## Correct: transform test data with the train-fitted scaler
scaler.fit(X_test) ## Wrong: refitting the scaler on test data
X_test_scaled_false = scaler.transform(X_test)
plt.figure(figsize=(20,6))
plot_data(1,3,1,X_train,X_test, 'Original Data')
plot_data(1,3,2,X_train_scaled,X_test_scaled_false, 'Falsely Scaled Data')
plot_data(1,3,3,X_train_scaled,X_test_scaled_true, 'Truly Scaled Data')
plt.show()
The MinMaxScaler accepts a single argument, feature_range, which takes a two-value tuple specifying the minimum and maximum of the range into which each column's values are scaled.
from sklearn.preprocessing import MinMaxScaler, RobustScaler, MaxAbsScaler, Normalizer
minmax_scaler = MinMaxScaler(feature_range=(0,1))
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)
print('Max value seen per feature : ',minmax_scaler.data_max_)
print('\nMinimum value seen per feature : ', minmax_scaler.data_min_)
print('\nRange seen per feature (data_max_ - data_min_) : ', minmax_scaler.data_range_)
print('\nPer feature adjustment for minimum (min - X.min(axis=0) * self.scale_)', minmax_scaler.min_)
print('\nPer feature relative scaling of the data ((max - min) / (X.max(axis=0) - X.min(axis=0)))',minmax_scaler.scale_)
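As a quick sanity check (a short sketch using the arrays computed above), the scaled train data should lie exactly in the requested (0, 1) range, while the scaled test data may fall slightly outside it because it is transformed with the minimum and maximum learned from the train data.
## Train data is guaranteed to fall in feature_range; test data may not be,
## because it is scaled using the min/max learned from the train data.
print('Scaled train min per feature : ', X_train_minmax.min(axis=0))
print('Scaled train max per feature : ', X_train_minmax.max(axis=0))
print('Scaled test min per feature  : ', X_test_minmax.min(axis=0))
print('Scaled test max per feature  : ', X_test_minmax.max(axis=0))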
The RobustScaler is based on the median and quartile range of the data. It scales data using the interquartile range (25th-75th percentile). This scaler handles outliers better than the others because the others rely on the mean and standard deviation, which outliers can easily skew.
robust_scaler = RobustScaler()
X_train_robust = robust_scaler.fit_transform(X_train)
X_test_robust = robust_scaler.transform(X_test)
print('The median value for each feature : ',robust_scaler.center_)
print('The (scaled) interquartile range for each feature in the training set : ',robust_scaler.scale_)
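As a quick check (a sketch using the arrays above), the median of each scaled train feature should be effectively zero, since RobustScaler subtracts the per-feature median before dividing by the interquartile range.
print('Median of scaled train data per feature : ', np.median(X_train_robust, axis=0))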
The MaxAbsScaler, as its name suggests, scales each feature by its maximum absolute value.
maxabs_scaler = MaxAbsScaler()
X_train_maxabs = maxabs_scaler.fit_transform(X_train)
X_test_maxabs = maxabs_scaler.transform(X_test)
print('Per feature maximum absolute value : ',maxabs_scaler.max_abs_)
print('Per feature relative scaling of the data : ',maxabs_scaler.scale_)
The Normalizer normalizes individual samples of the data. Unlike the other scalers, which work column-wise, this one works row-wise. Scikit-learn provides a class named Normalizer which supports the l1, l2, and max norms; we use l1 and l2 below.
normalizer_l2 = Normalizer() ## Default is l2
normalizer_l1 = Normalizer(norm='l1')
X_train_normalized_l1 = normalizer_l1.fit_transform(X_train)
X_test_normalized_l1 = normalizer_l1.transform(X_test)
print(normalizer_l1.get_params())
X_train_normalized_l2 = normalizer_l2.fit_transform(X_train)
X_test_normalized_l2 = normalizer_l2.transform(X_test)
print(normalizer_l2.get_params())
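As a quick check (a sketch using the arrays above), every row of the l2-normalized data should have unit Euclidean length, and the absolute values of every row of the l1-normalized data should sum to 1.
print('L2 norm of first 5 l2-normalized rows : ', np.linalg.norm(X_train_normalized_l2, axis=1)[:5])
print('L1 norm of first 5 l1-normalized rows : ', np.abs(X_train_normalized_l1).sum(axis=1)[:5])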
Below we'll visualize scaled data generated using all 6 scalers above. We'll use the first 2 features of the IRIS dataset for visualization purposes.
Please look at the scales of the X and Y axes for each scaler below. Also notice that the relative positions of the points in the original and scaled data do not change.
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(20,15))
    plot_data(3,3,1,X_train,X_test, 'Original Data')
    plot_data(3,3,2,X_train_scaled,X_test_scaled, 'Standard Scaler')
    plot_data(3,3,3,X_train_minmax,X_test_minmax, 'MinMax Scaler')
    plot_data(3,3,4,X_train_maxabs,X_test_maxabs, 'MaxAbs Scaler')
    plot_data(3,3,5,X_train_robust,X_test_robust, 'Robust Scaler')
    plot_data(3,3,6,X_train_normalized_l1,X_test_normalized_l1, 'L1 Normalizer')
    plot_data(3,3,7,X_train_normalized_l2,X_test_normalized_l2, 'L2 Normalizer')
    plt.show()
Real-world datasets generally have missing values, which can arise for many reasons: data not collected, lost in processing, legal restrictions, etc. Most machine learning algorithms need input data without any missing values, and scikit-learn ML algorithms fail if data with missing values is provided. Hence we either need to remove samples with missing data or we need some way to fill in the missing values. Missing data is generally encoded as empty values, NaNs, or other sentinel values in many datasets.
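To illustrate this point, below is a minimal sketch with a tiny made-up array (not part of the original example): fitting a scikit-learn estimator such as LogisticRegression on data containing NaNs raises a ValueError.
from sklearn.linear_model import LogisticRegression

X_with_nan = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y_dummy = np.array([0, 1, 0])

try:
    LogisticRegression().fit(X_with_nan, y_dummy)  ## Fails because of the NaN in X_with_nan
except ValueError as error:
    print('Error : ', error)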
Scikit-Learn provides the SimpleImputer class, which offers various approaches to fill in missing values. We'll explain its usage below with examples. We'll be creating dummy data with NaNs for explanation purposes.
rng = np.random.RandomState(123)
data = rng.rand(10,10) ## 10x10 array of random values in [0, 1)
data[data>0.9] = np.nan ## Mark values above 0.9 as missing
data
The default version of SimpleImputer
will replace all np.nan
values with the average value of that column. We can directly call the fit_transform()
method on an instance of SimpleImputer
and it'll transform the data, replacing NaNs.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
imputer.fit_transform(data)
We can access the statistics_ attribute of the imputer, which tells us which value was used to fill the missing entries of each column.
imputer.statistics_
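As a quick check (a sketch using the arrays above), these values should match the per-column means computed while ignoring NaNs:
print('Column means ignoring NaNs : ', np.nanmean(data, axis=0))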
Important Parameters of SimpleImputer
Below we have given a list of important parameters of SimpleImputer which can be used to change its default behavior.
missing_values: The value that is treated as missing. The default is np.nan, but we can specify any other string, integer, None, etc.
strategy: It accepts a string specifying which strategy to use to fill in missing values. Possible values are mean (the default), median, most_frequent, and constant. The constant strategy uses fill_value and replaces all missing values with that one value.
fill_value: It accepts an integer or string value which will be used to replace all missing values when the strategy is constant. If we do not specify this value with the constant strategy, then it'll take 0 for numerical columns and "missing_value" for string columns.
Below we have transformed the data again, but using median as the strategy.
imputer = SimpleImputer(strategy="median")
imputer.fit_transform(data)
imputer.statistics_
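The most_frequent strategy is not demonstrated on the numeric data above, so here is a minimal sketch with a small made-up categorical array showing how it fills missing entries with the most common value of each column. Note that most_frequent and constant are the only strategies that also work on string data.
cat_data = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)
imputer_mf = SimpleImputer(strategy="most_frequent")
print(imputer_mf.fit_transform(cat_data)) ## NaN is replaced by the most frequent value ("red")
print(imputer_mf.statistics_)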
Below we have created another artificial dataset where missing values are represented by -1. We'll replace these missing values using the constant strategy with a fill value of 0.0.
data = rng.rand(10,10) ## Another 10x10 random dataset
data[data>0.9] = -1 ## This time, mark missing values with -1
data
imputer = SimpleImputer(missing_values=-1, strategy="constant", fill_value=0.0)
imputer.fit_transform(data)
imputer.statistics_
Many real-world structured datasets have categorical columns whose values repeat across rows. These categorical columns can hold strings or integers. Most machine learning algorithms only accept numeric (float) values as input, hence we need to convert these categorical columns to a suitable representation.
The main reason behind one-hot encoding is that it lets us see how each individual value of a categorical column contributes to the prediction of the target variable. If we simply map the values of a categorical column to plain integers, we impose an artificial order on them and won't get a clear idea of how each individual value impacts the prediction.
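To make the difference concrete, here is a minimal sketch using a small made-up column and scikit-learn's OrdinalEncoder (which is not otherwise used in this tutorial), comparing a plain integer mapping with one-hot encoding.
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

sizes = np.array([["Low"], ["Medium"], ["High"]])

## Integer mapping imposes an artificial order and distance between values.
print(OrdinalEncoder().fit_transform(sizes))

## One-hot encoding gives each value its own independent 0/1 column.
print(OneHotEncoder().fit_transform(sizes).toarray())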
Scikit-Learn provides us with an estimator that converts categorical columns of data to one-hot encoded columns. The encoded output has one column for each unique value of each categorical column. We have created a dummy dataset for explanation purposes.
from sklearn.preprocessing import OneHotEncoder
data = np.array([ ["High", "Asia"],
["Low", "Africa"],
["High", "North America"],
["Medium", "South America"],
["High", "Europe"],
["Medium", "Atlantic"],
["Low", "Arctic"],
])
The transformed data returned by OneHotEncoder is a SciPy CSR sparse matrix. We can convert it to a Python list of lists or a NumPy array as well.
one_hot = OneHotEncoder()
transformed_data = one_hot.fit_transform(data)
transformed_data
print("Transformed Data Shape : ", transformed_data.shape)
transformed_data.toarray()
We can access the categories that the algorithm found for each column using the categories_ attribute of the OneHotEncoder instance.
one_hot.categories_
Important Parameters of OneHotEncoder
Below is a list of important parameters of OneHotEncoder which we can modify in order to change the default behavior of the transformer.
categories: It accepts a list holding, for each column, the list of categories to encode. By default ('auto'), the categories are determined from the training data.
handle_unknown: It accepts two values.
error: It raises an error when encoding fails. The encoding can fail when a column has more categorical values than specified in the categories parameter.
ignore: It prevents encoding from failing. When it finds a value in a column that is not specified in the categories parameter, it sets all of the one-hot columns for that value to 0.
We have explained the usage of the above parameters in the example below. We are leaving the Atlantic and Arctic continents out of the categories list, hence their one-hot columns will all be set to 0 whenever those values occur.
profiles = ["High", "Low", "Medium" ]
continents = ['Africa', 'Asia', 'Europe', 'North America', 'South America']
one_hot = OneHotEncoder(categories=[profiles, continents], handle_unknown="ignore")
transformed_data = one_hot.fit_transform(data)
print("Transformed Data Shape : ", transformed_data.shape)
transformed_data.toarray()
one_hot.categories_
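Notice how the ignored values behave below (a quick check using the arrays computed above): the training rows containing 'Atlantic' (row 5) and 'Arctic' (row 6) keep their profile encoding, but all five continent columns are set to 0 because those values are not in the categories list.
encoded = transformed_data.toarray()
print("Encoding of row with 'Atlantic' : ", encoded[5]) ## Profile columns set, continent columns all zero
print("Encoding of row with 'Arctic'   : ", encoded[6]) ## Profile columns set, continent columns all zero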
This ends our small tutorial on data preprocessing using scikit-learn. Please feel free to let us know your views in the comments section.