Unsupervised learning techniques such as transformations, dimensionality reduction, and manifold learning find a new representation of data. Unlike supervised learning, unsupervised learning has no target variable to predict. Transformations like scaling, imputation, and one-hot encoding are generally referred to as data preprocessing: they prepare data to be fed into supervised or unsupervised machine learning algorithms.

We'll start with simple rescaling and then proceed to dimensionality reduction techniques like PCA, manifold learning, etc. Rescaling is generally regarded as a preprocessing step rather than learning.

Below is a list of common data preprocessing steps that are generally used to prepare data. We'll discuss all of them one by one in detail.

- **Scaling:** It scales the values of the columns and brings them into the same range, which helps many machine learning algorithms converge faster.
- **Imputation:** It fills in missing (NA) values in the data.
- **One Hot Encoding:** It transforms a categorical column into several binary columns, each representing the presence/absence of one category of the original column.

We'll start by importing all the necessary libraries.

```
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import sys
import sklearn
print("Scikit-Learn Version : ",sklearn.__version__)
np.set_printoptions(precision=2)
%matplotlib inline
```

Scaling is a commonly used technique that brings all columns of data onto a common scale, and hence into a comparable range. If different columns have very different ranges, it can prevent a machine learning algorithm from converging quickly. Scaling is generally used when the columns of your data have values in ranges that vary a lot (0-1, 0-1000000, etc.).

Scikit-Learn provides various scalers which we can use for our purpose.

- **sklearn.preprocessing.StandardScaler:** Scales data by subtracting the mean and dividing by the standard deviation, so that each feature is centered at zero with unit variance.
- **sklearn.preprocessing.MinMaxScaler:** Scales each feature into the range given by the `feature_range` parameter, a tuple holding the min and max values. **Formula:** `X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))`, followed by `X_scaled = X_std * (max - min) + min`
- **sklearn.preprocessing.RobustScaler:** Scales each feature using statistics that are robust to outliers. It removes the median and then scales according to a quantile range (by default the interquartile range, which lies between the 1st and 3rd quartiles).
- **sklearn.preprocessing.Normalizer:** Normalizes each sample according to the `l1` or `l2` norm. `l2` is the default.
- **sklearn.preprocessing.MaxAbsScaler:** Scales each feature by its maximum absolute value.
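To make the `MinMaxScaler` formula concrete, here is a minimal sketch that applies the two steps by hand and compares the result with the scaler's output. The array `X_demo` and the target range `(-1, 1)` are made-up values chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small made-up dataset: 3 samples, 2 features on different scales.
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [4.0, 40.0]])
range_min, range_max = -1.0, 1.0  # arbitrary target feature_range

# Manual version of the two-step formula.
X_std = (X_demo - X_demo.min(axis=0)) / (X_demo.max(axis=0) - X_demo.min(axis=0))
X_manual = X_std * (range_max - range_min) + range_min

# Same transformation through scikit-learn.
X_sklearn = MinMaxScaler(feature_range=(range_min, range_max)).fit_transform(X_demo)

print(np.allclose(X_manual, X_sklearn))
print(X_sklearn)
```

Both features end up spanning exactly the requested range even though their original scales differ by a factor of ten.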

A very commonly used scaling technique is `"standardization"`. Here, we rescale data by subtracting its mean and then dividing by its standard deviation. This centers the data around zero with unit variance (standard deviation=1).

```
data = np.arange(1,6)
scaled_data = (data - data.mean()) / data.std()
data, scaled_data
```

Scikit-learn provides the `StandardScaler` class for this. It transforms each feature independently so that it has zero mean and unit standard deviation on the data it was fit on. It does not restrict values to a fixed range.
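As a quick sanity check, the sketch below compares the manual standardization formula with `StandardScaler` on the same toy values. The variable names (`toy`, `manual`, `via_sklearn`) are ours and used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same toy values as the manual example, reshaped to a single column because
# scikit-learn transformers expect 2D input (samples x features).
toy = np.arange(1, 6, dtype=float).reshape(-1, 1)

# np.std uses ddof=0 by default (population standard deviation).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)
via_sklearn = StandardScaler().fit_transform(toy)

print(np.allclose(manual, via_sklearn))
print(via_sklearn.ravel())
```

The two results agree because `StandardScaler` also uses the population standard deviation.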

We'll load the iris dataset provided by scikit-learn and split it into training and test sets. Once the data is split, we'll fit the standard scaler on the train data and later apply it to both train and test data.

```
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, Y = iris.data, iris.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, random_state=2)
X.shape, Y.shape, X_train.shape, Y_train.shape, X_test.shape, Y_test.shape
```

```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
```

Please note below that after scaling, the mean of the train data is almost zero and the standard deviation is 1.

```
print('Mean of train data : ', X_train.mean(axis=0))
print('Standard Deviation of train data : ', X_train.std(axis=0))
print('\nMean of train data through standard scaler: ', scaler.mean_)
print('Standard Deviation of train data through standard scaler : ', scaler.scale_)
print('\nMean of scaled train data : ', X_train_scaled.mean(axis=0))
print('Standard Deviation of scaled train data : ', X_train_scaled.std(axis=0))
```

Note that we apply the same transformation that was fit on the train data to the test data, so the scaled test data will not have an exactly zero mean the way the scaled train data does. This is because the test data is scaled with the mean and standard deviation calculated on the train data.

```
X_test_scaled = scaler.transform(X_test)
print("Mean of scaled test data: %s" % X_test_scaled.mean(axis=0))
```

NOTE

Please note that we apply the scaler fitted on the train data to the test data rather than fitting it again on the test data. A common mistake is to fit the scaler on the test data as well and then transform the test data with it. Let's visualize this below to make the point clearer.

```
def plot_data(row, col, i, X_train, X_test, title):
    # 'ggplot' ships with matplotlib; the bare 'seaborn' style name may not exist in newer releases.
    with plt.style.context('ggplot'):
        plt.subplot(row, col, i)
        plt.scatter(X_train[:, 0], X_train[:, 1], color='green', marker='s', label='Train Data')
        plt.scatter(X_test[:, 0], X_test[:, 1], color='red', marker='s', label='Test Data')
        plt.legend(loc='best')
        plt.title(title)

Please pay attention to the position of train/test data points relative to each other in the original, truly scaled, and falsely scaled visualizations. You'll notice that the relative position is the same in the original and truly scaled data but has changed noticeably in the falsely scaled data.

```
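# Correct: fit on train data only, then transform both train and test data.
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled_true = scaler.transform(X_test)
# Incorrect (for illustration only): a separate scaler fit on the test data.
# Using a separate instance keeps `scaler` fitted on the train data.
X_test_scaled_false = StandardScaler().fit(X_test).transform(X_test)
plt.figure(figsize=(20,6))
plot_data(1,3,1,X_train,X_test, 'Original Data')
plot_data(1,3,2,X_train_scaled,X_test_scaled_false, 'Falsely Scaled Data')
plot_data(1,3,3,X_train_scaled,X_test_scaled_true, 'Truly Scaled Data')
plt.show()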
```
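One way to avoid fitting a scaler on test data by accident is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below is only an illustration: the choice of `LogisticRegression` as the model and the `max_iter` value are our assumptions, not part of the tutorial.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, train_size=0.80, test_size=0.20, random_state=2
)

# fit() learns the scaling statistics from the train data only;
# score()/predict() reuse those statistics on whatever data they receive.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, Y_train)

# make_pipeline names each step after its lowercased class name.
fitted_scaler = model.named_steps["standardscaler"]
test_accuracy = model.score(X_test, Y_test)

print(np.allclose(fitted_scaler.mean_, X_train.mean(axis=0)))
print("Test accuracy:", test_accuracy)
```

Because the scaler lives inside the pipeline, there is no separate `fit` call that could be pointed at the test data by mistake.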

The `MinMaxScaler` takes a `feature_range` argument, a tuple of two values specifying the minimum and maximum of the range that each column is scaled into.

```
from sklearn.preprocessing import MinMaxScaler, RobustScaler, MaxAbsScaler, Normalizer
minmax_scaler = MinMaxScaler(feature_range=(0,1))
```

```
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)
print('Max value seen per feature : ',minmax_scaler.data_max_)
print('\nMinimum value seen per feature : ', minmax_scaler.data_min_)
print('\nRange seen per feature (data_max_ - data_min_) : ', minmax_scaler.data_range_)
print('\nPer feature adjustment for minimum (min - X.min(axis=0) * self.scale_)', minmax_scaler.min_)
print('\nPer feature relative scaling of the data ((max - min) / (X.max(axis=0) - X.min(axis=0)))',minmax_scaler.scale_)
```
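Because the minimum and maximum are learned from the train data, values in new data that fall outside the train range are mapped outside `feature_range`. Here is a small sketch with made-up numbers (`train_col` and `new_col` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_col = np.array([[1.0], [3.0], [5.0]])   # train range is [1, 5]
new_col = np.array([[0.0], [3.0], [7.0]])     # 0 and 7 lie outside that range

demo_scaler = MinMaxScaler(feature_range=(0, 1)).fit(train_col)
new_scaled = demo_scaler.transform(new_col)

print(new_scaled.ravel())
```

With these numbers, 0 maps to -0.25 and 7 maps to 1.5, so a model trained on values in `[0, 1]` may still see values slightly outside that interval at prediction time.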

The `RobustScaler` is based on the median and a quantile range of the data. By default it scales data using the interquartile range (between the 0.25 and 0.75 quantiles). It tends to behave better than the other scalers on data with outliers, because those scalers rely on statistics such as the mean, standard deviation, minimum, or maximum, which are easily influenced by outliers.

```
robust_scaler = RobustScaler()
```

```
X_train_robust = robust_scaler.fit_transform(X_train)
X_test_robust = robust_scaler.transform(X_test)
print('The median value for each feature : ',robust_scaler.center_)
print('The (scaled) interquartile range for each feature in the training set : ',robust_scaler.scale_)
```
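The effect of an outlier is easier to see on a tiny example. The sketch below uses a made-up column (`col`) with a single extreme value and compares the statistics learned by `RobustScaler` and `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single large outlier (100).
col = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit(col)
standard = StandardScaler().fit(col)

print("median / IQR :", robust.center_, robust.scale_)
print("mean / std   :", standard.mean_, standard.scale_)
print("robust   :", robust.transform(col).ravel())
print("standard :", standard.transform(col).ravel())
```

The median (3) and interquartile range (2) are untouched by the outlier, so the four ordinary values stay evenly spread, whereas the mean (22) and standard deviation are pulled towards the outlier and squeeze the ordinary values together.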

The `MaxAbsScaler`, as its name suggests, scales values by the maximum absolute value of each feature.

```
maxabs_scaler = MaxAbsScaler()
```

```
X_train_maxabs = maxabs_scaler.fit_transform(X_train)
X_test_maxabs = maxabs_scaler.transform(X_test)
print('Per feature maximum absolute value : ',maxabs_scaler.max_abs_)
print('Per feature relative scaling of the data : ',maxabs_scaler.scale_)
```
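As a small illustration with made-up signed data (`X_signed` is hypothetical), dividing each column by its maximum absolute value reproduces what `MaxAbsScaler` does:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_signed = np.array([[-4.0, 1.0],
                     [2.0, 0.5],
                     [1.0, -2.0]])

# Manual version: divide every column by its largest absolute value.
X_maxabs_manual = X_signed / np.abs(X_signed).max(axis=0)
X_maxabs_sklearn = MaxAbsScaler().fit_transform(X_signed)

print(np.allclose(X_maxabs_manual, X_maxabs_sklearn))
print(X_maxabs_sklearn)
```

The data is not shifted, so signs are preserved and every value ends up in `[-1, 1]`.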

The `Normalizer` normalizes individual samples of data. Unlike the other scalers, which work on a column basis, this one works on a row basis: each sample is rescaled independently so that its norm equals 1. The scikit-learn `Normalizer` class supports the `l1` and `l2` norms described below (as well as a `max` norm).

- **L1 Normalization:** Each sample is divided by its L1 norm, which is calculated as the sum of the absolute values of the vector.
- **L2 Normalization:** Each sample is divided by its L2 norm, which is calculated as the square root of the sum of squares of the values of the vector.

```
normalizer_l2 = Normalizer() ## Default is l2
normalizer_l1 = Normalizer(norm='l1')
```

```
X_train_normalized_l1 = normalizer_l1.fit_transform(X_train)
X_test_normalized_l1 = normalizer_l1.transform(X_test)
print(normalizer_l1.get_params())
```

```
X_train_normalized_l2 = normalizer_l2.fit_transform(X_train)
X_test_normalized_l2 = normalizer_l2.transform(X_test)
print(normalizer_l2.get_params())
```
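To confirm that normalization really works per row, the following sketch normalizes a small made-up array (`rows`) and checks the norm of each row afterwards.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

rows = np.array([[3.0, 4.0],
                 [1.0, 1.0],
                 [-2.0, 6.0]])

rows_l1 = Normalizer(norm="l1").fit_transform(rows)
rows_l2 = Normalizer(norm="l2").fit_transform(rows)

# Each row is divided by its own norm, so every row norm becomes 1.
print(np.abs(rows_l1).sum(axis=1))
print(np.sqrt((rows_l2 ** 2).sum(axis=1)))
print(rows_l2[0])  # [3, 4] has an L2 norm of 5
```

Note that `fit` learns nothing here: the normalizer is stateless, so train and test rows are treated in exactly the same way.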

Below we'll visualize the scaled data generated by all of the scalers above (six scaled versions, counting the L1 and L2 normalizers separately). We'll use the first 2 features of the IRIS dataset for visualization purposes.

NOTE

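Please look at the scale on the X and Y axes for each scaler. Also notice that for the column-wise scalers the relative positions of points are the same in the original and scaled data; only the axis scale changes. The normalizers are the exception, because they rescale each row separately.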

```
with plt.style.context('ggplot'):
    plt.figure(figsize=(20,15))
    plot_data(3,3,1,X_train,X_test, 'Original Data')
    plot_data(3,3,2,X_train_scaled,X_test_scaled, 'Standard Scaler')
    plot_data(3,3,3,X_train_minmax,X_test_minmax, 'MinMax Scaler')
    plot_data(3,3,4,X_train_maxabs,X_test_maxabs, 'MaxAbs Scaler')
    plot_data(3,3,5,X_train_robust,X_test_robust, 'Robust Scaler')
    plot_data(3,3,6,X_train_normalized_l1,X_test_normalized_l1, 'L1 Normalizer')
    plot_data(3,3,7,X_train_normalized_l2,X_test_normalized_l2, 'L2 Normalizer')
    plt.show()
```

Real-world datasets generally have missing values, which can be due to many reasons like data not being collected, being lost in the process, legal restrictions, etc. Most machine learning algorithms need input data without any missing values, and most scikit-learn estimators fail if they are given data with missing values. Hence we either need to remove the samples with missing data or we need some way to fill in the missing data. Missing data is generally encoded as blanks, NaNs, or some other placeholder value in many datasets.

Scikit-Learn provides the `SimpleImputer` class, which offers various approaches to fill in missing values. We'll explain its usage below with examples. We'll be creating dummy data with NaNs for explanation purposes.

```
rng = np.random.RandomState(123)
data = rng.rand(10,10)
data[data>0.9] = np.nan
data
```

The default version of `SimpleImputer` will replace all `np.nan` values with the average value of that column. We can directly call the `fit_transform()` method on an instance of `SimpleImputer` and it'll transform the data, replacing the NaNs.

```
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
```

```
imputer.fit_transform(data)
```

We can access the `statistics_` attribute of the fitted imputer, which tells us the fill value that was used for each column.

```
imputer.statistics_
```
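On a smaller array the relationship between `statistics_` and the filled values is easy to verify by hand. The array `small` below is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

small = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [np.nan, 8.0]])

mean_imputer = SimpleImputer()        # default strategy is "mean"
filled = mean_imputer.fit_transform(small)

print(mean_imputer.statistics_)       # per-column fill values
print(np.nanmean(small, axis=0))      # column means computed while ignoring NaNs
print(filled)
```

The column means are computed from the non-missing entries only (2 and 6 here), and those are exactly the values written into the gaps.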

Below we have given a list of important parameters of `SimpleImputer` which can be used to change the default behavior.

- **missing_values:** It lets us specify the placeholder for missing values. The default is `nan` but we can specify any other string, integer, None, etc.
- **strategy:** It accepts a string specifying which strategy to use to fill in missing values. Below is a list of possible values for this parameter.
  - **mean:** It replaces missing values with the mean of the column.
  - **median:** It replaces missing values with the median of the column.
  - **most_frequent:** It replaces missing values with the most frequently occurring value of the column.
  - **constant:** It replaces all missing values with the value specified in the `fill_value` parameter.
- **fill_value:** It accepts an integer or string value which will be used to replace all missing values when the strategy is `constant`. If we do not specify this value with strategy constant then it'll take 0 for numerical columns and `"missing_value"` for string columns.
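The `most_frequent` and `constant` strategies also work on string data. The sketch below uses a made-up one-column object array (`colors`); the fill value `"unknown"` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Object array so that strings and NaN can live in the same column.
colors = np.array([["red"], ["blue"], ["red"], [np.nan]], dtype=object)

most_frequent = SimpleImputer(strategy="most_frequent").fit_transform(colors)
constant_default = SimpleImputer(strategy="constant").fit_transform(colors)
constant_custom = SimpleImputer(strategy="constant", fill_value="unknown").fit_transform(colors)

print(most_frequent.ravel())      # gap filled with the most common value
print(constant_default.ravel())   # gap filled with the default string placeholder
print(constant_custom.ravel())    # gap filled with our own placeholder
```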

Below we have transformed the data again, this time using `median` as the strategy.

```
imputer = SimpleImputer(strategy="median")
imputer.fit_transform(data)
```

```
imputer.statistics_
```

Below we have created another artificial dataset where missing values are represented by `-1`. We'll replace these missing values using the `constant` strategy with a fill value of `0.0`.

```
data = rng.rand(10,10)
data[data>0.9] = -1
data
```

```
imputer = SimpleImputer(missing_values=-1, strategy="constant", fill_value=0.0)
imputer.fit_transform(data)
```

```
imputer.statistics_
```

Many real-world structured datasets have categorical columns whose values come from a limited set that repeats across rows. These categorical columns can have values as strings or integers. Most machine learning algorithms only accept numeric values as input, hence we need to convert these categorical columns to a numeric representation.

NOTE

The main reason behind one-hot encoding is that it lets a model learn how each individual value of a categorical column contributes to the prediction of the target variable. If we just map the values of a categorical column to plain integers, we impose an arbitrary order on them and won't get an exact idea of how each individual value impacts the prediction.
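The difference between the two representations can be sketched with plain NumPy. The `levels` array is made up, and the manual encoding below is only meant to show the idea, not how any library implements it.

```python
import numpy as np

levels = np.array(["High", "Low", "High", "Medium"])

# Integer mapping: np.unique sorts the categories and returns one code per value.
categories, codes = np.unique(levels, return_inverse=True)
print(categories)
print(codes)

# One-hot encoding: one binary column per category, with a single 1 in each row.
one_hot_manual = np.eye(len(categories), dtype=int)[codes]
print(one_hot_manual)
```

With integer codes, `Medium` (2) looks "twice as large" as `Low` (1) purely because of alphabetical order, while the one-hot columns treat every category as its own independent indicator.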

Scikit-Learn provides us with an estimator for this which will help us convert categorical columns of data to one-hot encoded columns. The final dataset has as many columns as there were different unique values in the categorical column. We have created a dummy dataset for explanation purposes.

```
from sklearn.preprocessing import OneHotEncoder
```

```
data = np.array([
    ["High", "Asia"],
    ["Low", "Africa"],
    ["High", "North America"],
    ["Medium", "South America"],
    ["High", "Europe"],
    ["Medium", "Atlantic"],
    ["Low", "Arctic"],
])
```

The transformed data returned by `OneHotEncoder` is a SciPy sparse matrix in CSR format by default. We can convert it to a NumPy array (or a Python list of lists) as well.

```
one_hot = OneHotEncoder()
transformed_data = one_hot.fit_transform(data)
transformed_data
```

```
print("Transformed Data Shape : ", transformed_data.shape)
transformed_data.toarray()
```

We can access the categories that the encoder found for each column using the `categories_` attribute of the fitted `OneHotEncoder`.

```
one_hot.categories_
```

Below is a list of important parameters of `OneHotEncoder` which we can modify in order to change the default behavior of the transformer.

- **categories:** We can provide a list of lists where each inner list holds the unique values for one column. If we want to consider only a particular subset of values for the categorical columns, we can use this parameter.
- **handle_unknown:** It controls what happens when the encoder meets a value that is not among the known categories. Two commonly used values are:
  - `error`: It raises an error when encoding fails. The encoding can fail when a column contains values other than those specified in the `categories` parameter.
  - `ignore`: It prevents encoding from failing. When it finds a value in a column that is not specified in the `categories` parameter, it sets all the one-hot columns for that feature to `0`.

We have explained the usage of the above parameters in the example below. `Atlantic` and `Arctic` are left out of the list of continents, hence all the continent columns will be set to `0` whenever they occur.

```
profiles = ["High", "Low", "Medium" ]
continents = ['Africa', 'Asia', 'Europe', 'North America', 'South America']
one_hot = OneHotEncoder(categories=[profiles, continents], handle_unknown="ignore")
transformed_data = one_hot.fit_transform(data)
```

```
print("Transformed Data Shape : ", transformed_data.shape)
transformed_data.toarray()
```

```
one_hot.categories_
```
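The same behavior matters when new data contains a category that was never seen during `fit`. The sketch below contrasts `ignore` and `error` on made-up arrays (`seen` and `unseen` are hypothetical names).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

seen = np.array([["Asia"], ["Europe"], ["Africa"]])
unseen = np.array([["Europe"], ["Arctic"]])   # "Arctic" was never seen during fit

# handle_unknown="ignore": unknown values become an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore").fit(seen)
encoded = lenient.transform(unseen).toarray()
print(lenient.categories_)
print(encoded)

# handle_unknown="error": unknown values raise a ValueError at transform time.
strict = OneHotEncoder(handle_unknown="error").fit(seen)
try:
    strict.transform(unseen)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

Which option is preferable depends on the use case: `ignore` keeps a pipeline running on unexpected input, while `error` surfaces data problems early.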

This ends our small tutorial on data preprocessing using scikit-learn. Please feel free to let us know your views in the comments section.
