Scikit-Learn - Feature Selection

Introduction

Feature selection is the process of choosing the subset of available features that matters most for prediction. Often, not every collected feature is useful for prediction, and if the number of features is too large it can even hurt the performance of the model. It's generally a good idea to keep only the features that contribute most to the prediction, which improves performance and helps the model generalize.

Ideally, one would try every possible combination of features and keep the one that gives the best performance, but with the large feature sets generally available, trying all subsets isn't feasible: n features have 2^n - 1 non-empty subsets.
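To get a feel for how quickly exhaustive search blows up, here is a small sketch that counts the non-empty feature subsets. The helper name count_feature_subsets is made up for this illustration, and 26 is used because that is the number of columns the noisy datasets in this tutorial end up with.

```python
from itertools import combinations

def count_feature_subsets(n_features):
    """Number of non-empty subsets that can be formed from n_features features."""
    return 2 ** n_features - 1

# Sanity check on a tiny case by enumerating the subsets explicitly.
tiny = ['a', 'b', 'c']
enumerated = [c for r in range(1, len(tiny) + 1) for c in combinations(tiny, r)]

n_subsets_26 = count_feature_subsets(26)
print('Subsets of 3 features  :', len(enumerated))
print('Subsets of 26 features :', n_subsets_26)
```

Training one model per subset is clearly out of reach even at this modest size, which is why the estimators below score features in cheaper ways.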

sklearn provides several ways to select a subset of features from a list of all features. We'll start by importing necessary libraries.

In [1]:
import numpy as np
import pandas as pd

import sklearn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

Loading Data

We'll be loading the two datasets mentioned below for our purpose.

  • Wine Dataset: We'll be using the wine dataset, which has 13 chemical measurements (alcohol, malic acid, etc.) for 178 wine samples belonging to 3 classes. We'll use it for classification tasks below.
  • Boston Housing Dataset: We'll be using the Boston housing dataset, which has information about various house properties like the average number of rooms, per capita crime rate in town, etc. We'll be using it for regression tasks.

Sklearn provides both of these datasets as a part of the datasets module. We can load them by calling the load_wine() and load_boston() functions. Each returns a dictionary-like Bunch object which can be used to retrieve features and target. Note that load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2, so the regression cells below need an older release.

1. Classification

In [2]:
from sklearn.datasets import load_wine, load_boston

wine = load_wine()
X_wine, Y_wine= wine.data, wine.target
print('Dataset Sizes : ', X_wine.shape, Y_wine.shape)
Dataset Sizes :  (178, 13) (178,)

2. Regression

In [3]:
boston = load_boston()
X_boston, Y_boston = boston.data, boston.target
print('Dataset Sizes : ', X_boston.shape, Y_boston.shape)
Dataset Sizes :  (506, 13) (506,)

Adding Noise

We'll generate random data of the same shape as the original data and append it to the original data to create our final datasets. The noise is added to demonstrate feature selection: a good feature selection estimator should keep the features that are closely related to the target and ignore the noise features we added. We'll check this below with various examples.

1. Classification

In [4]:
rng = np.random.RandomState(123)
noise = rng.normal(size=(X_wine.shape[0], X_wine.shape[1]))

X_wine = np.hstack([X_wine, noise])
print('Dataset Sizes : ', X_wine.shape, Y_wine.shape)
Dataset Sizes :  (178, 26) (178,)

2. Regression

In [5]:
rng = np.random.RandomState(123)
noise = rng.normal(size=(X_boston.shape[0], X_boston.shape[1]))

X_boston = np.hstack([X_boston, noise])
print('Dataset Sizes : ', X_boston.shape, Y_boston.shape)
Dataset Sizes :  (506, 26) (506,)

Train/Test Splits

We are splitting both wine and Boston housing datasets into train and test sets.

  • Train Set (80%)
  • Test Set (20%)


NOTE

Please make a note that we are also using the stratify parameter for the wine dataset, which keeps the class proportions the same in the train and test sets. For each class, roughly 80% of the samples go to the train set and 20% to the test set. This makes sure that no class dominates either set.

In [6]:
X_train_wine, X_test_wine, Y_train_wine, Y_test_wine = train_test_split(X_wine, Y_wine,
                                                    train_size=0.80, test_size=0.20,
                                                    stratify=Y_wine, random_state=123)

print('Train/Test Sizes : ',X_train_wine.shape, X_test_wine.shape, Y_train_wine.shape, Y_test_wine.shape)
Train/Test Sizes :  (142, 26) (36, 26) (142,) (36,)
In [7]:
X_train_boston, X_test_boston, Y_train_boston, Y_test_boston = train_test_split(X_boston, Y_boston,
                                                    train_size=0.80, test_size=0.20,
                                                    random_state=123)

print('Train/Test Sizes : ',X_train_boston.shape, X_test_boston.shape, Y_train_boston.shape, Y_test_boston.shape)
Train/Test Sizes :  (404, 26) (102, 26) (404,) (102,)
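The effect of the stratify parameter can be checked directly by comparing class fractions. The sketch below is a standalone illustration on the original wine data (without the noise columns); the helper class_fractions and the *_demo variable names are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_wine(return_X_y=True)

# train_test_split returns X_train, X_test, y_train, y_test; we only keep y_train.
_, _, y_tr_strat, _ = train_test_split(X_demo, y_demo, train_size=0.80,
                                       stratify=y_demo, random_state=123)
_, _, y_tr_plain, _ = train_test_split(X_demo, y_demo, train_size=0.80,
                                       random_state=123)

def class_fractions(y):
    """Fraction of samples belonging to each of the 3 wine classes."""
    return np.bincount(y, minlength=3) / len(y)

overall = class_fractions(y_demo)
strat = class_fractions(y_tr_strat)
plain = class_fractions(y_tr_plain)

print('Overall           :', overall.round(3))
print('Stratified train  :', strat.round(3))
print('Unstratified train:', plain.round(3))
```

With stratification the train fractions track the overall fractions as closely as the integer sample counts allow; without it they depend on the random shuffle.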

Correlation Between Features And Target Variable

Correlation tells us how strongly two arrays are linearly related. We'll first check the correlation between the features of each dataset and its target variable. We'll plot a bar chart of the correlation between each feature and the target, and a heatmap which shows the relation between features and target as well as the relations among the features themselves.

1. Regression

We are creating a pandas dataframe consisting of the features and the target variable. The pandas dataframe provides an easy-to-use method, corr(), which returns the correlation between all pairs of columns.

In [8]:
df_regression = pd.DataFrame(X_boston, columns=list(boston.feature_names) + ['Noise'+str(i) for i in range(1,14)])
df_regression['House Price'] = Y_boston
df_regression.head()
Out[8]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX ... Noise5 Noise6 Noise7 Noise8 Noise9 Noise10 Noise11 Noise12 Noise13 House Price
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 ... -0.578600 1.651437 -2.426679 -0.428913 1.265936 -0.866740 -0.678886 -0.094709 1.491390 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 ... 2.186786 1.004054 0.386186 0.737369 1.490732 -0.935834 1.175829 -1.253881 -0.637752 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 ... -0.255619 -2.798589 -1.771533 -0.699877 0.927462 -0.173636 0.002846 0.688223 -0.879536 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 ... 0.573806 0.338589 -0.011830 2.392365 0.412912 0.978736 2.238143 -1.294085 -1.038788 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 ... 0.890706 1.754886 1.495644 1.069393 -0.772709 0.794863 0.314272 -1.326265 1.417299 36.2

5 rows × 27 columns

Below we have plotted a bar chart showing the correlation between the features of the dataset and the target variable. We can notice from the chart that the original features have a much stronger correlation (positive or negative) with the target than the noise features added later, which have little or no relation with the target.

In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(15,8))
    df_regression.corr()['House Price'].drop('House Price').plot(kind='bar', width=0.8,
                                                                 title="Correlation between house price and various features of data");


Below we have plotted a heatmap showing the relationship between the various features of the dataset as well as between features and target. This gives a bigger picture of how the features relate to one another.

In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    # Compute the correlation matrix once instead of on every loop iteration.
    corr = df_regression.corr().values
    labels = list(df_regression.columns)
    fig = plt.figure(figsize=(18,18))
    plt.matshow(corr, fignum=1, cmap=plt.cm.Blues)
    plt.xticks(range(len(labels)), labels, rotation='vertical')
    plt.yticks(range(len(labels)), labels, rotation='horizontal')
    plt.colorbar()
    plt.grid(False)
    for i in range(len(labels)):
        for j in range(len(labels)):
            # Shift negative values slightly left to make room for the minus sign.
            x_offset = 0.4 if corr[i, j] < 0 else 0.3
            plt.text(i - x_offset, j + 0.1, '%.1f' % corr[i, j], color='tab:red', fontsize=12);


2. Classification

We'll now create a dataframe for the wine dataset exactly like the previous Boston housing dataset. We'll be using this dataframe for finding out the correlation between features and the target variable.

In [11]:
df_classif = pd.DataFrame(X_wine, columns=list(wine.feature_names) + ['Noise'+str(i) for i in range(1,14)])
df_classif['Wine Type'] = Y_wine
df_classif.head()
Out[11]:
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity ... Noise5 Noise6 Noise7 Noise8 Noise9 Noise10 Noise11 Noise12 Noise13 Wine Type
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 ... -0.578600 1.651437 -2.426679 -0.428913 1.265936 -0.866740 -0.678886 -0.094709 1.491390 0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 ... 2.186786 1.004054 0.386186 0.737369 1.490732 -0.935834 1.175829 -1.253881 -0.637752 0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 ... -0.255619 -2.798589 -1.771533 -0.699877 0.927462 -0.173636 0.002846 0.688223 -0.879536 0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 ... 0.573806 0.338589 -0.011830 2.392365 0.412912 0.978736 2.238143 -1.294085 -1.038788 0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 ... 0.890706 1.754886 1.495644 1.069393 -0.772709 0.794863 0.314272 -1.326265 1.417299 0

5 rows × 27 columns

In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(15,8))
    df_classif.corr()['Wine Type'].drop('Wine Type').plot(kind='bar', width=0.8,
                                                          title="Correlation between wine category and other features of data.");


In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    # Compute the correlation matrix once instead of on every loop iteration.
    corr = df_classif.corr().values
    labels = list(df_classif.columns)
    fig = plt.figure(figsize=(18,18))
    plt.matshow(corr, fignum=1, cmap=plt.cm.Blues)
    plt.xticks(range(len(labels)), labels, rotation='vertical')
    plt.yticks(range(len(labels)), labels, rotation='horizontal')
    plt.colorbar()
    plt.grid(False)
    for i in range(len(labels)):
        for j in range(len(labels)):
            # Shift negative values slightly left to make room for the minus sign.
            x_offset = 0.4 if corr[i, j] < 0 else 0.3
            plt.text(i - x_offset, j + 0.1, '%.1f' % corr[i, j], color='tab:red', fontsize=12);


We'll now start with an explanation of various feature selection estimators available with scikit-learn and use them for classification and regression datasets to select appropriate features.

Univariate Statistics

Univariate feature selection looks at each feature individually and uses a statistical test to select the features that are most closely related to the target. For example, the f_classif score function used below is based on the Analysis of Variance (ANOVA) F-test. Below is a list of estimators that sklearn provides which select features from data based on univariate statistics.

  • SelectPercentile
  • SelectKBest
  • SelectFpr
  • SelectFdr
  • SelectFwe

SelectPercentile

The SelectPercentile estimator, available as a part of the feature_selection module of sklearn, lets us select a given percentage of the highest-scoring features according to a univariate statistical test.

Below are important parameters of SelectPercentile:

  • score_func - It accepts a callable (function). The function should accept (features, target) as input and return two arrays (scores, p-values) or a single array with scores for each feature. Feature selection is based on these scores. The default value is the f_classif function available in the feature_selection module of sklearn.
  • percentile - It lets us specify the percentage of features to keep from the original feature set.
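Since score_func can be any callable with the (features, target) signature, a custom scorer can be plugged in. The sketch below uses a hypothetical abs_correlation function (not part of sklearn) that returns only a scores array, and runs it on synthetic data where columns 0 and 2 drive the target by construction.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile

def abs_correlation(X, y):
    """Hypothetical score function: |Pearson correlation| of each column with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(Xc.T @ yc) / denom

# Synthetic data (an assumption for this demo): only columns 0 and 2 matter.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

selector = SelectPercentile(score_func=abs_correlation, percentile=50)
X_demo_selected = selector.fit_transform(X_demo, y_demo)
demo_mask = selector.get_support()

print('Scores        :', selector.scores_.round(3))
print('Selected mask :', demo_mask)
```

Because the function returns a single array, the fitted selector has scores but no p-values, which is fine for SelectPercentile since it ranks by score only.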

We'll now try SelectPercentile on the classification and regression datasets that we created above. We'll also check the performance of LinearRegression and LogisticRegression estimators on features that were selected by SelectPercentile. We'll plot features that were selected by SelectPercentile to check how many features it selected from original data and whether it was able to get rid of all noise features that were added to original data.

1. Classification - f_classif


NOTE

Please make a note that we have tried all univariate-statistics-based feature selection estimators with f_classif for classification problems and f_regression for regression problems. Scikit-Learn also provides the chi2 and mutual_info_classif functions for classification problems and mutual_info_regression for regression problems. The mutual info functions estimate the mutual information (statistical dependency) between each feature and the target variable, which is then used as the feature score.

In [14]:
from sklearn.feature_selection import f_regression, f_classif, chi2, mutual_info_classif, mutual_info_regression
from sklearn.feature_selection import SelectPercentile


print('Train/Test Sizes Before Feature Selection : ',X_train_wine.shape, X_test_wine.shape)

select_percentile_classif = SelectPercentile(score_func=f_classif, percentile=50)
select_percentile_classif.fit(X_train_wine, Y_train_wine)

X_train_selected = select_percentile_classif.transform(X_train_wine)
X_test_selected  = select_percentile_classif.transform(X_test_wine)

print('Train/Test Sizes After 50 Percentile Feature Selection: ',X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (142, 26) (36, 26)
Train/Test Sizes After 50 Percentile Feature Selection:  (142, 13) (36, 13)
In [15]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train_selected, Y_train_wine)

print('Test Accuracy : ', lr.score(X_test_selected, Y_test_wine))
print('Train Accuracy : ', lr.score(X_train_selected, Y_train_wine))
Test Accuracy :  1.0
Train Accuracy :  0.9647887323943662
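The select-then-fit steps above can also be chained in a Pipeline, which is a different arrangement from the separate cells used in this tutorial. The sketch below rebuilds the noisy wine data so that it runs on its own; max_iter=5000 is an assumption added to give the lbfgs solver room to converge on unscaled data.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Rebuild the wine data with 13 appended noise columns.
X, y = load_wine(return_X_y=True)
rng = np.random.RandomState(123)
X = np.hstack([X, rng.normal(size=X.shape)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.80,
                                          stratify=y, random_state=123)

# The selector is fitted on the training data only, then applied to the test data.
pipe = make_pipeline(
    SelectPercentile(score_func=f_classif, percentile=50),
    LogisticRegression(solver='lbfgs', max_iter=5000),
)
pipe.fit(X_tr, y_tr)

pipe_test_accuracy = pipe.score(X_te, y_te)
n_selected = pipe.named_steps['selectpercentile'].get_support().sum()
print('Test Accuracy     :', pipe_test_accuracy)
print('Features selected :', n_selected)
```

A pipeline keeps the selection step inside any later cross-validation, so the selector never sees held-out samples.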

Features Recovered

Feature selection estimators have a method named get_support() which returns a boolean array with one entry per input feature. Each entry indicates whether that feature was selected or not. Below we print that array and then plot it.

In [16]:
feature_selection_mask_classif = select_percentile_classif.get_support()
print('Classification Mask : ', feature_selection_mask_classif)
Classification Mask :  [ True  True  True  True  True  True  True  True  True  True  True  True
  True False False False False False False False False False False False
 False False]
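A boolean mask is easier to read once it is mapped back to feature names. The sketch below is a standalone illustration that refits a selector on the full noisy wine data (not the train split used above); get_support(indices=True) returns integer positions instead of booleans.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectPercentile, f_classif

wine_demo = load_wine()
rng = np.random.RandomState(123)
X_demo = np.hstack([wine_demo.data, rng.normal(size=wine_demo.data.shape)])
names = np.array(list(wine_demo.feature_names) + ['Noise' + str(i) for i in range(1, 14)])

selector = SelectPercentile(score_func=f_classif, percentile=50)
selector.fit(X_demo, wine_demo.target)

mask = selector.get_support()                # boolean array, one entry per feature
indices = selector.get_support(indices=True) # integer positions of selected features
selected_names = names[mask]

print('Selected indices :', indices)
print('Selected names   :', list(selected_names))
```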

Below we have plotted the feature selection array returned by the get_support() method. Here dark blue represents features that were selected and light blue represents features that were ignored.

In [ ]:
plt.figure(figsize=(10,11))
plt.imshow(feature_selection_mask_classif[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(wine.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Classification: Wine Dataset Feature Recovery');


Visualising P-Values For Classification (f_classif)

sklearn provides the f_classif and f_regression functions, which return F-values and p-values for each feature of a dataset. Lower p-values generally indicate informative features. The F-values, generally referred to as feature scores, are what the feature selection estimators use to rank features by importance. We can notice from the graphs below that the original 13 features of the classification dataset, which come first, have low p-values compared to the 13 noise features we added later.

In [18]:
F_classif, p_value_classif = f_classif(X_wine, Y_wine)
In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(14,5))
    plt.subplot(121)
    plt.plot(p_value_classif, 'o', c = 'tab:green')
    plt.xticks(range(26), list(wine.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')

    plt.title('P Values Of Classification')
    plt.subplot(122)
    plt.plot(F_classif, 'o', c = 'tab:red')
    plt.xticks(range(26), list(wine.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
    plt.title('F Values Of Classification');
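The F-values returned by f_classif are one-way ANOVA F-statistics, computed separately for each feature with the samples grouped by class. As a cross-check, the sketch below recomputes them with scipy.stats.f_oneway on the original wine data; using SciPy here is an addition for illustration and is not part of the tutorial's workflow.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_wine
from sklearn.feature_selection import f_classif

X_demo, y_demo = load_wine(return_X_y=True)
F_sklearn, p_sklearn = f_classif(X_demo, y_demo)

F_scipy, p_scipy = [], []
for j in range(X_demo.shape[1]):
    # Split column j into one group of values per wine class.
    groups = [X_demo[y_demo == c, j] for c in np.unique(y_demo)]
    stat, p = f_oneway(*groups)
    F_scipy.append(stat)
    p_scipy.append(p)

F_scipy, p_scipy = np.array(F_scipy), np.array(p_scipy)
print('F-values agree :', np.allclose(F_sklearn, F_scipy))
print('p-values agree :', np.allclose(p_sklearn, p_scipy))
```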


2. Classification - chi2

Below we are using the chi2 score function as a part of the SelectPercentile estimator for feature selection. It's commonly used for classification problems.


NOTE

Please make a note that features passed to the chi2 function should be non-negative. It raises an error if negative values are present in the array. We have hence used MinMaxScaler to scale all values into the [0, 1] range before passing them to chi2.
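The restriction is easy to reproduce on a small synthetic array (random data made up for this demo): chi2 raises a ValueError on the raw values and works once they are rescaled.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 3))       # standard normal draws include negatives
y_demo = rng.randint(0, 2, size=50)

try:
    chi2(X_demo, y_demo)
    chi2_failed = False
except ValueError as err:
    chi2_failed = True
    print('chi2 rejected the raw data:', err)

# After min-max scaling every value lies in [0, 1], so chi2 accepts the data.
X_scaled = MinMaxScaler().fit_transform(X_demo)
scores, p_values = chi2(X_scaled, y_demo)
print('chi2 scores after scaling:', scores.round(3))
```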

In [20]:
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler

print('Train/Test Sizes Before Feature Selection : ',X_train_wine.shape, X_test_wine.shape)

mms = MinMaxScaler()
X_wine_scaled = mms.fit_transform(X_wine, Y_wine)
X_train_mms = mms.transform(X_train_wine)
X_test_mms = mms.transform(X_test_wine)

select_percentile_classif_chi2 = SelectPercentile(score_func=chi2, percentile=50)
select_percentile_classif_chi2.fit(X_train_mms, Y_train_wine)

X_train_selected = select_percentile_classif_chi2.transform(X_train_mms)
X_test_selected  = select_percentile_classif_chi2.transform(X_test_mms)

print('Train/Test Sizes After 50 Percentile Feature Selection: ',X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (142, 26) (36, 26)
Train/Test Sizes After 50 Percentile Feature Selection:  (142, 13) (36, 13)
In [21]:
lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train_selected, Y_train_wine)

print('Test Accuracy : ', lr.score(X_test_selected, Y_test_wine))
print('Train Accuracy : ', lr.score(X_train_selected, Y_train_wine))
Test Accuracy :  0.9722222222222222
Train Accuracy :  0.9929577464788732

Features Recovered

In [22]:
feature_selection_mask_classif_chi2 = select_percentile_classif_chi2.get_support()
print('\nClassification Mask(Chi2) : ', feature_selection_mask_classif_chi2)
Classification Mask(Chi2) :  [ True  True  True  True  True  True  True  True  True  True  True  True
  True False False False False False False False False False False False
 False False]
In [ ]:
plt.figure(figsize=(10,11))
plt.imshow(feature_selection_mask_classif_chi2[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(wine.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Classification: Wine Dataset Feature Recovery');


Visualising P-Values For Classification (chi2)

In [24]:
chi_2, p_value_classif_chi2 = chi2(X_wine_scaled, Y_wine)
In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(14,5))

    plt.subplot(121)
    plt.plot(p_value_classif_chi2, 'o', c = 'tab:green')
    plt.xticks(range(26), list(wine.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
    plt.title('P Values Of Classification')

    plt.subplot(122)
    plt.plot(chi_2, 'o', c = 'tab:red')
    plt.xticks(range(26), list(wine.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
    plt.title('Chi2 Values Of Classification');


3. Regression - f_regression

Below we are using the f_regression score function as a part of the SelectPercentile estimator for feature selection. It's commonly used for regression problems.

In [26]:
print('Train/Test Sizes Before Feature Selection : ',X_train_boston.shape, X_test_boston.shape)

select_percentile_regression = SelectPercentile(score_func=f_regression, percentile=50)
select_percentile_regression.fit(X_train_boston, Y_train_boston)

X_train_selected = select_percentile_regression.transform(X_train_boston)
X_test_selected  = select_percentile_regression.transform(X_test_boston)

print('Train/Test Sizes After 50 Percentile Feature Selection: ',X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (404, 26) (102, 26)
Train/Test Sizes After 50 Percentile Feature Selection:  (404, 13) (102, 13)
In [27]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train_selected, Y_train_boston)

print('Test R^2 Score : ', lr.score(X_test_selected, Y_test_boston))
print('Train R^2 Score : ', lr.score(X_train_selected, Y_train_boston))
Test R^2 Score :  0.6324950662083493
Train R^2 Score :  0.7603356609103591

Features Recovered

In [28]:
feature_selection_mask_regression = select_percentile_regression.get_support()
print('\nRegression Mask : ',feature_selection_mask_regression)
Regression Mask :  [ True  True  True False  True  True  True  True  True  True  True  True
  True False  True False False False False False False False False False
 False False]
In [ ]:
plt.figure(figsize=(10,11))
plt.imshow(feature_selection_mask_regression[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(boston.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Regression: Boston Dataset Feature Recovery');


Visualising P-Values For Regression (f_regression)

In [30]:
F_regression, p_value_regression = f_regression(X_boston, Y_boston)
In [ ]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(14,5))
    plt.subplot(121)

    plt.plot(p_value_regression, 'o', c = 'tab:green')
    plt.xticks(range(26), list(boston.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
    plt.title('P Values Of Regression')

    plt.subplot(122)
    plt.plot(F_regression, 'o', c = 'tab:red')
    plt.xticks(range(26), list(boston.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
    plt.title('F Values Of Regression');
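The mutual information score functions can be used with SelectPercentile in the same way as f_classif. The sketch below is a standalone illustration on the noisy wine data; functools.partial is used only to fix random_state, because the mutual information estimate involves some randomness. Which features get picked is not asserted here, only that half of them are kept.

```python
import numpy as np
from functools import partial
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

X_demo, y_demo = load_wine(return_X_y=True)
rng = np.random.RandomState(123)
X_demo = np.hstack([X_demo, rng.normal(size=X_demo.shape)])

# Fix the seed so that repeated runs give the same scores.
mi_score = partial(mutual_info_classif, random_state=0)

select_mi = SelectPercentile(score_func=mi_score, percentile=50)
X_demo_mi = select_mi.fit_transform(X_demo, y_demo)
mi_mask = select_mi.get_support()

print('Shape after selection :', X_demo_mi.shape)
print('Mutual info mask      :', mi_mask)
```

Unlike the F-test functions, the mutual information functions return scores only, so the fitted selector has no p-values.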


SelectKBest

Sklearn provides SelectKBest estimator as a part of the feature_selection module which lets us select K best features based on scores from univariate statistical functions.

Below are important parameters of SelectKBest:

  • score_func - It accepts a callable (function). The function should accept (features, target) as input and return two arrays (scores, p-values) or a single array with scores for each feature. Feature selection is based on these scores. The default value is the f_classif function available in the feature_selection module of sklearn.
  • k - It accepts an int or 'all' as its value. We pass the number of features to select as an integer; 'all' keeps every feature.
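In effect, SelectKBest keeps the k features with the largest values in its fitted scores_ attribute. The sketch below checks this on the noisy wine data by comparing get_support() with a manual top-k taken via np.argsort; it is a standalone illustration that assumes no tied scores.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif

X_demo, y_demo = load_wine(return_X_y=True)
rng = np.random.RandomState(123)
X_demo = np.hstack([X_demo, rng.normal(size=X_demo.shape)])

k = 5
select_k = SelectKBest(score_func=f_classif, k=k).fit(X_demo, y_demo)

# Positions of the k largest scores, sorted so they can be compared directly.
manual_top_k = np.sort(np.argsort(select_k.scores_)[-k:])
selected_idx = select_k.get_support(indices=True)

print('SelectKBest indices :', selected_idx)
print('Manual top-k indices:', manual_top_k)
```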

We'll now try SelectKBest on the classification and regression datasets that we created above. We'll also check the performance of LinearRegression and LogisticRegression estimators on features that were selected by SelectKBest. We'll plot features that were selected by SelectKBest to check how many features it selected from original data and whether it was able to get rid of all noise features that were added to original data.

1. Classification

In [32]:
from sklearn.feature_selection import SelectKBest

print('Train/Test Sizes Before Feature Selection : ',X_train_wine.shape, X_test_wine.shape)

select_k_best_classif = SelectKBest(score_func=f_classif, k=X_wine.shape[1]//2)
select_k_best_classif.fit(X_train_wine, Y_train_wine)

X_train_selected = select_k_best_classif.transform(X_train_wine)
X_test_selected  = select_k_best_classif.transform(X_test_wine)

print('Train/Test Sizes After K Best Feature Selection: ', X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (142, 26) (36, 26)
Train/Test Sizes After K Best Feature Selection:  (142, 13) (36, 13)
In [33]:
lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train_selected, Y_train_wine)

print('Test Accuracy : ', lr.score(X_test_selected, Y_test_wine))
print('Train Accuracy : ', lr.score(X_train_selected, Y_train_wine))
Test Accuracy :  1.0
Train Accuracy :  0.9647887323943662

Features Recovered

In [34]:
feature_selection_mask_classif = select_k_best_classif.get_support()
print('Classification Mask : ', feature_selection_mask_classif)
Classification Mask :  [ True  True  True  True  True  True  True  True  True  True  True  True
  True False False False False False False False False False False False
 False False]
In [ ]:
plt.figure(figsize=(10,8))

plt.imshow(feature_selection_mask_classif[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(wine.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Classification: Wine Dataset Feature Recovery');


2. Regression

In [36]:
print('Train/Test Sizes Before Feature Selection : ',X_train_boston.shape, X_test_boston.shape)

select_k_best_regressor = SelectKBest(score_func=f_regression, k=X_boston.shape[1]//2)
select_k_best_regressor.fit(X_train_boston, Y_train_boston)

X_train_selected = select_k_best_regressor.transform(X_train_boston)
X_test_selected  = select_k_best_regressor.transform(X_test_boston)

print('Train/Test Sizes After K Best Feature Selection: ', X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (404, 26) (102, 26)
Train/Test Sizes After K Best Feature Selection:  (404, 13) (102, 13)
In [37]:
lr = LinearRegression()
lr.fit(X_train_selected, Y_train_boston)
print('Test R^2 Score: ', lr.score(X_test_selected, Y_test_boston))
print('Train R^2 Score : ', lr.score(X_train_selected, Y_train_boston))
Test R^2 Score:  0.6324950662083493
Train R^2 Score :  0.7603356609103591

Features Recovered

In [38]:
feature_selection_mask_regression = select_k_best_regressor.get_support()
print('Regression Mask : ', feature_selection_mask_regression)
Regression Mask :  [ True  True  True False  True  True  True  True  True  True  True  True
  True False  True False False False False False False False False False
 False False]
In [ ]:
plt.figure(figsize=(10,8))

plt.imshow(feature_selection_mask_regression[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(boston.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Regression: Boston Dataset Feature Recovery');


SelectFpr

The SelectFpr estimator is provided by the feature_selection module of sklearn. It lets us select features based on a false positive rate test: it keeps every feature whose p-value is below a threshold, which controls the total amount of false detections.

Below we have given important parameters of the SelectFpr estimator:

  • score_func - It accepts a callable (function). The function should accept (features, target) as input and return two arrays (scores, p-values) or a single array with scores for each feature. Feature selection is based on these scores. The default value is the f_classif function available in the feature_selection module of sklearn.
  • alpha - It lets us specify the p-value threshold. All features whose p-value is below it are selected. The default value is 0.05.
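The alpha rule can be verified directly from the fitted pvalues_ attribute. The sketch below uses synthetic regression data (made up for this demo) in which only the first two of 40 columns influence the target.

```python
import numpy as np
from sklearn.feature_selection import SelectFpr, f_regression

# Synthetic data: columns 0 and 1 are informative, the other 38 are pure noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 40))
y_demo = 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(size=300)

alpha = 0.05
select_fpr = SelectFpr(score_func=f_regression, alpha=alpha).fit(X_demo, y_demo)

fpr_mask = select_fpr.get_support()
manual_mask = select_fpr.pvalues_ < alpha   # the same rule applied by hand

print('Masks identical   :', np.array_equal(fpr_mask, manual_mask))
print('Features selected :', fpr_mask.sum())
```

Because each feature is tested on its own at level alpha, a few noise features can pass by chance when there are many of them; roughly alpha times the number of uninformative features on average.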

We'll now try SelectFpr on the classification and regression datasets that we created above. We'll also check the performance of LinearRegression and LogisticRegression estimators on features that were selected by SelectFpr. We'll plot features that were selected by SelectFpr to check how many features it selected from original data and whether it was able to get rid of all noise features that were added to original data.

1. Classification

In [40]:
from sklearn.feature_selection import SelectFpr

print('Train/Test Sizes Before Feature Selection : ',X_train_wine.shape, X_test_wine.shape)

select_fpr_classif = SelectFpr(score_func=f_classif)
select_fpr_classif.fit(X_train_wine, Y_train_wine)

X_train_selected = select_fpr_classif.transform(X_train_wine)
X_test_selected  = select_fpr_classif.transform(X_test_wine)

print('Train/Test Sizes After Feature Selection: ', X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (142, 26) (36, 26)
Train/Test Sizes After Feature Selection:  (142, 13) (36, 13)
In [41]:
lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train_selected, Y_train_wine)

print('Test Accuracy : ', lr.score(X_test_selected, Y_test_wine))
print('Train Accuracy : ', lr.score(X_train_selected, Y_train_wine))
Test Accuracy :  1.0
Train Accuracy :  0.9647887323943662

Features Recovered

In [42]:
feature_selection_mask_classif = select_fpr_classif.get_support()
print('Classification Mask : ', feature_selection_mask_classif)
Classification Mask :  [ True  True  True  True  True  True  True  True  True  True  True  True
  True False False False False False False False False False False False
 False False]
In [ ]:
plt.figure(figsize=(10,8))

plt.imshow(feature_selection_mask_classif[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(wine.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Classification: Wine Dataset Feature Recovery');


2. Regression

In [44]:
print('Train/Test Sizes Before Feature Selection : ',X_train_boston.shape, X_test_boston.shape)

select_fpr_regressor = SelectFpr(score_func=f_regression)
select_fpr_regressor.fit(X_train_boston, Y_train_boston)

X_train_selected = select_fpr_regressor.transform(X_train_boston)
X_test_selected  = select_fpr_regressor.transform(X_test_boston)

print('Train/Test Sizes After Feature Selection: ', X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (404, 26) (102, 26)
Train/Test Sizes After Feature Selection:  (404, 14) (102, 14)
In [45]:
lr = LinearRegression()
lr.fit(X_train_selected, Y_train_boston)
print('Test R^2 Score: ', lr.score(X_test_selected, Y_test_boston))
print('Train R^2 Score : ', lr.score(X_train_selected, Y_train_boston))
Test R^2 Score:  0.6474577804048167
Train R^2 Score :  0.7614162346590613

Features Recovered

In [46]:
feature_selection_mask_regression = select_fpr_regressor.get_support()
print('Regression Mask : ', feature_selection_mask_regression)
Regression Mask :  [ True  True  True  True  True  True  True  True  True  True  True  True
  True False  True False False False False False False False False False
 False False]
In [ ]:
plt.figure(figsize=(10,8))

plt.imshow(feature_selection_mask_regression[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(boston.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Regression: Boston Dataset Feature Recovery');


SelectFdr

The SelectFdr estimator is provided by the feature_selection module of sklearn. It lets us select features based on an estimated false discovery rate, using the Benjamini-Hochberg procedure. It controls the expected proportion of selected features that are false detections.

Below we have given important parameters of the SelectFdr estimator:

  • score_func - It accepts a callable (function). The function should accept (features, target) as input and return two arrays (scores, p-values) or a single array with scores for each feature. Feature selection is based on these scores. The default value is the f_classif function available in the feature_selection module of sklearn.
  • alpha - It lets us specify the highest uncorrected p-value for features to keep. The default value is 0.05.
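The Benjamini-Hochberg procedure sorts the p-values, finds the largest one that falls under a threshold which rises linearly with rank, and keeps every feature at or below it. The sketch below is a hand-written version (the function benjamini_hochberg_mask is made up for this illustration, not an sklearn API) compared against SelectFdr on synthetic data.

```python
import numpy as np
from sklearn.feature_selection import SelectFdr, f_regression

def benjamini_hochberg_mask(p_values, alpha):
    """Boolean mask of features kept by the Benjamini-Hochberg procedure."""
    p_values = np.asarray(p_values)
    n = len(p_values)
    sorted_p = np.sort(p_values)
    # The i-th smallest p-value is compared against alpha * i / n.
    thresholds = alpha / n * np.arange(1, n + 1)
    passing = sorted_p[sorted_p <= thresholds]
    if passing.size == 0:
        return np.zeros(n, dtype=bool)
    return p_values <= passing.max()

# Synthetic data: columns 0 and 1 are informative, the other 38 are pure noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 40))
y_demo = 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(size=300)

alpha = 0.05
select_fdr = SelectFdr(score_func=f_regression, alpha=alpha).fit(X_demo, y_demo)

fdr_mask = select_fdr.get_support()
manual_fdr_mask = benjamini_hochberg_mask(select_fdr.pvalues_, alpha)
print('Masks identical   :', np.array_equal(fdr_mask, manual_fdr_mask))
print('Features selected :', fdr_mask.sum())
```

Since the threshold for the smallest p-values is far below alpha, this rule is stricter than a plain per-feature cutoff at alpha when many features are tested.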

We'll now try SelectFdr on the classification and regression datasets that we created above. We'll also check the performance of LinearRegression and LogisticRegression estimators on features that were selected by SelectFdr. We'll plot features that were selected by SelectFdr to check how many features it selected from original data and whether it was able to get rid of all noise features that were added to original data.

1. Classification

In [48]:
from sklearn.feature_selection import SelectFdr

print('Train/Test Sizes Before Feature Selection : ',X_train_wine.shape, X_test_wine.shape)

select_fdr_classif = SelectFdr(score_func=f_classif)
select_fdr_classif.fit(X_train_wine, Y_train_wine)

X_train_selected = select_fdr_classif.transform(X_train_wine)
X_test_selected  = select_fdr_classif.transform(X_test_wine)

print('Train/Test Sizes After Feature Selection: ', X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (142, 26) (36, 26)
Train/Test Sizes After Feature Selection:  (142, 13) (36, 13)
In [49]:
lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train_selected, Y_train_wine)

print('Test Accuracy : ', lr.score(X_test_selected, Y_test_wine))
print('Train Accuracy : ', lr.score(X_train_selected, Y_train_wine))
Test Accuracy :  1.0
Train Accuracy :  0.9647887323943662

Features Recovered

In [50]:
feature_selection_mask_classif = select_fdr_classif.get_support()
print('Classification Mask : ', feature_selection_mask_classif)
Classification Mask :  [ True  True  True  True  True  True  True  True  True  True  True  True
  True False False False False False False False False False False False
 False False]
In [ ]:
plt.figure(figsize=(10,8))

plt.imshow(feature_selection_mask_classif[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(wine.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Classification: Wine Dataset Feature Recovery');

2. Regression

In [52]:
print('Train/Test Sizes Before Feature Selection : ',X_train_boston.shape, X_test_boston.shape)

select_fdr_regressor = SelectFdr(score_func=f_regression)
select_fdr_regressor.fit(X_train_boston, Y_train_boston)

X_train_selected = select_fdr_regressor.transform(X_train_boston)
X_test_selected  = select_fdr_regressor.transform(X_test_boston)

print('Train/Test Sizes After Feature Selection: ', X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (404, 26) (102, 26)
Train/Test Sizes After Feature Selection:  (404, 14) (102, 14)
In [53]:
lr = LinearRegression()
lr.fit(X_train_selected, Y_train_boston)
print('Test R^2 Score: ', lr.score(X_test_selected, Y_test_boston))
print('Train R^2 Score : ', lr.score(X_train_selected, Y_train_boston))
Test R^2 Score:  0.6474577804048167
Train R^2 Score :  0.7614162346590613

Features Recovered

In [54]:
feature_selection_mask_regression = select_fdr_regressor.get_support()
print('Regression Mask : ', feature_selection_mask_regression)
Regression Mask :  [ True  True  True  True  True  True  True  True  True  True  True  True
  True False  True False False False False False False False False False
 False False]
In [ ]:
plt.figure(figsize=(10,8))

plt.imshow(feature_selection_mask_regression[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(boston.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Regression: Boston Dataset Feature Recovery');

SelectFwe

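The SelectFwe estimator is provided by the feature_selection module of sklearn. It lets us select features based on the family-wise error rate, that is, the probability of keeping at least one feature that is actually unrelated to the target.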

Below we have given important parameters of the SelectFwe estimator:

  • score_func - It accepts a callable (function). The function should accept (features, target) as input and return two arrays (scores, p-values). The selection of features is made based on the p-values. The default value is the f_classif function available in the feature_selection module of sklearn.
  • alpha - It lets us specify the highest uncorrected p-value for features to keep. The default value is 0.05.
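
As a rough illustration of how a family-wise correction tightens the cut-off, below is a small NumPy sketch of a Bonferroni-style rule, which as far as we can tell is what SelectFwe applies: each p-value is compared against alpha divided by the number of features. The helper name bonferroni_mask and the p-values are hypothetical.

```python
import numpy as np

def bonferroni_mask(pvalues, alpha=0.05):
    """Keep features whose p-value is below alpha / number_of_features."""
    pvalues = np.asarray(pvalues, dtype=float)
    return pvalues < alpha / pvalues.size

# Hypothetical p-values for six features (not taken from the datasets above).
example_pvalues = [0.001, 0.012, 0.020, 0.041, 0.30, 0.74]
fwe_mask = bonferroni_mask(example_pvalues, alpha=0.05)
print('FWE Mask : ', fwe_mask)
```

Here the effective cut-off is 0.05 / 6, so only the first feature survives. This is why a family-wise correction tends to keep fewer features than an uncorrected p-value threshold at the same alpha.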

We'll now try SelectFwe on the classification and regression datasets that we created above. We'll also check the performance of LinearRegression and LogisticRegression estimators on features that were selected by SelectFwe. We'll plot features that were selected by SelectFwe to check how many features it selected from original data and whether it was able to get rid of all noise features that were added to original data.

1. Classification

In [56]:
from sklearn.feature_selection import SelectFwe

print('Train/Test Sizes Before Feature Selection : ',X_train_wine.shape, X_test_wine.shape)

select_fwe_classif = SelectFwe(score_func=f_classif)
select_fwe_classif.fit(X_train_wine, Y_train_wine)

X_train_selected = select_fwe_classif.transform(X_train_wine)
X_test_selected  = select_fwe_classif.transform(X_test_wine)

print('Train/Test Sizes After Feature Selection: ', X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (142, 26) (36, 26)
Train/Test Sizes After Feature Selection:  (142, 13) (36, 13)
In [57]:
lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train_selected, Y_train_wine)

print('Test Accuracy : ', lr.score(X_test_selected, Y_test_wine))
print('Train Accuracy : ', lr.score(X_train_selected, Y_train_wine))
Test Accuracy :  1.0
Train Accuracy :  0.9647887323943662

Features Recovered

In [58]:
feature_selection_mask_classif = select_fwe_classif.get_support()
print('Classification Mask : ', feature_selection_mask_classif)
Classification Mask :  [ True  True  True  True  True  True  True  True  True  True  True  True
  True False False False False False False False False False False False
 False False]
In [ ]:
plt.figure(figsize=(10,8))

plt.imshow(feature_selection_mask_classif[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(wine.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Classification: Wine Dataset Feature Recovery');

2. Regression

In [60]:
print('Train/Test Sizes Before Feature Selection : ',X_train_boston.shape, X_test_boston.shape)

select_fwe_regressor = SelectFwe(score_func=f_regression)
select_fwe_regressor.fit(X_train_boston, Y_train_boston)

X_train_selected = select_fwe_regressor.transform(X_train_boston)
X_test_selected  = select_fwe_regressor.transform(X_test_boston)

print('Train/Test Sizes After Feature Selection: ', X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (404, 26) (102, 26)
Train/Test Sizes After Feature Selection:  (404, 13) (102, 13)
In [61]:
lr = LinearRegression()
lr.fit(X_train_selected, Y_train_boston)
print('Test R^2 Score: ', lr.score(X_test_selected, Y_test_boston))
print('Train R^2 Score : ', lr.score(X_train_selected, Y_train_boston))
Test R^2 Score:  0.6324950662083493
Train R^2 Score :  0.7603356609103591

Features Recovered

In [62]:
feature_selection_mask_regression = select_fwe_regressor.get_support()
print('Regression Mask : ', feature_selection_mask_regression)
Regression Mask :  [ True  True  True False  True  True  True  True  True  True  True  True
  True False  True False False False False False False False False False
 False False]
In [ ]:
plt.figure(figsize=(10,8))

plt.imshow(feature_selection_mask_regression[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(boston.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Regression: Boston Dataset Feature Recovery');

Model Based Feature Selection

Model-based feature selection selects features by training a model that exposes feature importances or coefficients, such as a random forest or a linear model, and keeping the features that are important from that trained model's perspective. A well-trained model of this kind will generally give high importance only to features that have a strong relationship with the target variable.

SelectFromModel

Scikit-Learn provides an estimator named SelectFromModel as a part of the feature_selection module for selecting features based on the importance weights of a trained model. It takes another machine learning model as input, and the feature selection decision is based on that model. It only works with a model that exposes a coef_ or feature_importances_ attribute after fitting, as it selects based on the values returned by these attributes.

Below is a list of important parameters of SelectFromModel:

  • estimator - It accepts a sklearn estimator which has a coef_ or feature_importances_ attribute available once the model is trained.
  • threshold - It accepts a string or float value as input. All features whose importance is greater than or equal to the threshold are selected. The string values it accepts are median and mean, which set the threshold to the median or mean of the feature importances derived from the coef_ or feature_importances_ attribute of the estimator. The default value is None, which takes the mean of the feature importances as the threshold, unless the estimator uses an l1 penalty, in which case a threshold of 1e-5 is used.
  • max_features - It accepts int or None. It sets the maximum number of features to select from those that are above the threshold. We can disable the threshold parameter by setting it to -np.inf; then SelectFromModel will select only the number of features specified as max_features.
  • prefit - It accepts a boolean value. If set to True, we can pass an already trained model to SelectFromModel.
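
The threshold rule described above can be checked by hand. Below is a minimal sketch on hypothetical random data (the names X_demo, y_demo, selector, and manual_mask are ours): it compares the mask returned by SelectFromModel with a mask built directly from the feature importances of the fitted estimator.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical toy data: only the first two of six features determine the class.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 6))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

selector = SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=0),
                           threshold='median')
selector.fit(X_demo, y_demo)

# Rebuild the mask by hand from the importances of the fitted estimator.
importances = selector.estimator_.feature_importances_
manual_mask = importances >= selector.threshold_

print('Threshold Used : ', selector.threshold_)
print('Selector Mask  : ', selector.get_support())
print('Manual Mask    : ', manual_mask)
```

The fitted threshold is available as threshold_, so the same comparison should also work for 'mean' or for a float threshold.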

We'll now try SelectFromModel on the classification and regression datasets that we created above. We'll also check the performance of LinearRegression and LogisticRegression estimators on features that were selected by SelectFromModel. We'll plot features that were selected by SelectFromModel to check how many features it selected from original data and whether it was able to get rid of all noise features that were added to original data.

1. Classification

In [64]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

print('Train/Test Sizes Before Feature Selection : ',X_train_wine.shape, X_test_wine.shape)

select_from_model_classif = SelectFromModel(ExtraTreesClassifier(max_depth=6, n_estimators=200, random_state=1), threshold='median')
select_from_model_classif.fit(X_train_wine, Y_train_wine)

X_train_selected = select_from_model_classif.transform(X_train_wine)
X_test_selected  = select_from_model_classif.transform(X_test_wine)

print('Train/Test Sizes After 50 Percentile Feature Selection: ', X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (142, 26) (36, 26)
Train/Test Sizes After 50 Percentile Feature Selection:  (142, 13) (36, 13)
In [65]:
lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train_selected, Y_train_wine)
print('Test Accuracy : ', lr.score(X_test_selected, Y_test_wine))
print('Train Accuracy : ', lr.score(X_train_selected, Y_train_wine))
Test Accuracy :  1.0
Train Accuracy :  0.9647887323943662

Features Recovered

In [66]:
feature_selection_mask_classif = select_from_model_classif.get_support()
print('Classification Mask : ', feature_selection_mask_classif)
Classification Mask :  [ True  True  True  True  True  True  True  True  True  True  True  True
  True False False False False False False False False False False False
 False False]
In [ ]:
plt.figure(figsize=(10,8))

plt.imshow(feature_selection_mask_classif[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(wine.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Classification: Wine Dataset Feature Recovery');

2. Regression

In [68]:
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor

X_train, X_test, Y_train, Y_test = train_test_split(X_boston, Y_boston, train_size=0.80, test_size=0.20, random_state=123)
print('Train/Test Sizes Before Feature Selection : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

select_from_model_regressor = SelectFromModel(ExtraTreesRegressor(max_depth=3, n_estimators=150, random_state=1), threshold='median')
select_from_model_regressor.fit(X_train, Y_train)

X_train_selected = select_from_model_regressor.transform(X_train)
X_test_selected  = select_from_model_regressor.transform(X_test)

print('Train/Test Sizes After 50 Percentile Feature Selection: ', X_train_selected.shape, X_test_selected.shape, Y_train.shape, Y_test.shape)

lr = LinearRegression()
lr.fit(X_train_selected, Y_train)
print('Test R^2 Score: ', lr.score(X_test_selected, Y_test))
print('Train R^2 Score : ', lr.score(X_train_selected, Y_train))
Train/Test Sizes Before Feature Selection :  (404, 26) (102, 26) (404,) (102,)
Train/Test Sizes After 50 Percentile Feature Selection:  (404, 13) (102, 13) (404,) (102,)
Test R^2 Score:  0.641916280579303
Train R^2 Score :  0.7542232765994495

Features Recovered

In [69]:
feature_selection_mask_regression = select_from_model_regressor.get_support()
print('Regression Mask : ', feature_selection_mask_regression)
Regression Mask :  [ True False  True  True  True  True  True  True  True  True  True  True
  True False False False False False False  True False False False False
 False False]
In [ ]:
plt.figure(figsize=(10,8))

plt.imshow(feature_selection_mask_regression[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(boston.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Regression: Boston Dataset Feature Recovery');

Recursive Feature Elimination

Recursive Feature Elimination starts with the full set of features and repeatedly eliminates the least important ones, a single feature at a time by default. It keeps eliminating features until the desired number of features remains.

RFE

Scikit-Learn provides an estimator named RFE as a part of the feature_selection module for performing recursive feature elimination to select features. It takes another machine learning model as input, and the feature selection decision is based on that model. It only works with a model that exposes a coef_ or feature_importances_ attribute after fitting, as it selects based on the values returned by these attributes.

Below is a list of important parameters of RFE estimator:

  • estimator - It accepts a scikit-learn estimator which has a coef_ or feature_importances_ attribute, because the selection of features is made based on these attributes. The estimator is trained, and the features which the model considers important are kept.
  • n_features_to_select - It accepts int or None as value. We can specify the number of features to select from the original dataset as an integer. If we don't pass the count of features to select, it'll select half of the features from the original dataset.
  • step - It accepts an int or float value. If an integer greater than or equal to 1 is passed, that many features are eliminated at each iteration of RFE. If a float between 0 and 1 is passed, that fraction of the features is eliminated at each iteration of RFE. The default value is 1.

Unlike SelectFromModel, which trains the estimator only once and then selects features, RFE trains the estimator, drops features based on the step parameter, and then repeats the same process on the remaining features until only n_features_to_select features are left.
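
The train, drop, and repeat loop can be sketched by hand as shown below. This is a simplified illustration on hypothetical toy data, assuming step=1 and using squared coefficients of LinearRegression as the importance measure; it is not the library implementation. The names X_demo, y_demo, remaining, and rfe_demo are ours.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: only the first two of six features drive the target.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

n_keep = 2
remaining = np.arange(X_demo.shape[1])  # indices of features still in play
while remaining.size > n_keep:
    # Refit on the surviving features only.
    model = LinearRegression().fit(X_demo[:, remaining], y_demo)
    # The feature with the smallest squared coefficient is the weakest one.
    weakest = np.argmin(model.coef_ ** 2)
    # Drop it and go around again (this mimics step=1).
    remaining = np.delete(remaining, weakest)

rfe_demo = RFE(LinearRegression(), n_features_to_select=n_keep, step=1)
rfe_demo.fit(X_demo, y_demo)

print('Manual Loop Kept : ', remaining)
print('RFE Kept         : ', np.flatnonzero(rfe_demo.support_))
```

Refitting after every elimination is what makes RFE more expensive than SelectFromModel, since the estimator is trained once per round rather than once overall.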

We'll now try RFE on the classification and regression datasets that we created above. We'll also check the performance of LinearRegression and LogisticRegression estimators on features that were selected by RFE. We'll plot features that were selected by RFE to check how many features it selected from original data and whether it was able to get rid of all noise features that were added to original data.

1. Classification

In [71]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

print('Train/Test Sizes Before Feature Selection : ', X_train_wine.shape, X_test_wine.shape)

rfe_classif = RFE(ExtraTreesClassifier(max_depth=6, n_estimators=200, random_state=1), n_features_to_select=13)
rfe_classif.fit(X_train_wine, Y_train_wine)

X_train_selected = rfe_classif.transform(X_train_wine)
X_test_selected  = rfe_classif.transform(X_test_wine)

print('Train/Test Sizes After 50 Percentile Feature Selection: ', X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (142, 26) (36, 26)
Train/Test Sizes After 50 Percentile Feature Selection:  (142, 13) (36, 13)
In [72]:
lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train_selected, Y_train_wine)
print('Test Accuracy : ', lr.score(X_test_selected, Y_test_wine))
print('Train Accuracy : ', lr.score(X_train_selected, Y_train_wine))
Test Accuracy :  1.0
Train Accuracy :  0.9647887323943662

Features Recovered

In [73]:
feature_selection_mask_classif = rfe_classif.get_support()
print('Classification Mask : ', feature_selection_mask_classif)
Classification Mask :  [ True  True  True  True  True  True  True  True  True  True  True  True
  True False False False False False False False False False False False
 False False]
In [ ]:
plt.figure(figsize=(10,8))

plt.imshow(feature_selection_mask_classif[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(wine.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Classification: Wine Dataset Feature Recovery');

2. Regression

In [75]:
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor

print('Train/Test Sizes Before Feature Selection : ',X_train_boston.shape, X_test_boston.shape)

rfe_regressor = RFE(ExtraTreesRegressor(max_depth=3, n_estimators=150, random_state=1), n_features_to_select=13)
rfe_regressor.fit(X_train_boston, Y_train_boston)

X_train_selected = rfe_regressor.transform(X_train_boston)
X_test_selected  = rfe_regressor.transform(X_test_boston)

print('Train/Test Sizes After 50 Percentile Feature Selection: ', X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (404, 26) (102, 26)
Train/Test Sizes After 50 Percentile Feature Selection:  (404, 13) (102, 13)
In [76]:
lr = LinearRegression()
lr.fit(X_train_selected, Y_train_boston)
print('Test R^2 Score: ', lr.score(X_test_selected, Y_test_boston))
print('Train R^2 Score : ', lr.score(X_train_selected, Y_train_boston))
Test R^2 Score:  0.65924665103541
Train R^2 Score :  0.7559380876016175

Features Recovered

In [77]:
feature_selection_mask_regression = rfe_regressor.get_support()
print('Regression Mask : ', feature_selection_mask_regression)
Regression Mask :  [ True  True  True  True  True  True  True  True  True  True  True  True
  True False False False False False False False False False False False
 False False]
In [ ]:
plt.figure(figsize=(10,8))

plt.imshow(feature_selection_mask_regression[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(boston.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Regression: Boston Dataset Feature Recovery');

VarianceThreshold

The VarianceThreshold feature selection estimator, as its name suggests, keeps only features whose variance exceeds a particular threshold. It removes features whose variance does not exceed that threshold. The VarianceThreshold estimator accepts only one parameter, threshold. If we do not provide a threshold value, it uses the default of 0 and removes only features which have zero variance (meaning all values are the same). Note that it looks only at the features and never at the target, so by itself it cannot tell informative features from noise.
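
Below is a minimal sketch of this rule on hypothetical toy data (the names X_demo, var_selector, and manual_mask are ours). It compares the mask returned by VarianceThreshold with a mask computed directly from the per-column variance.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: four columns with very different spreads.
rng = np.random.RandomState(0)
X_demo = np.column_stack([
    rng.normal(scale=0.1, size=100),  # variance around 0.01
    rng.normal(scale=1.0, size=100),  # variance around 1
    rng.normal(scale=5.0, size=100),  # variance around 25
    np.full(100, 3.0),                # constant column, zero variance
])

var_selector = VarianceThreshold(threshold=0.5).fit(X_demo)

# The same decision made by hand: keep columns whose variance exceeds the threshold.
manual_mask = X_demo.var(axis=0) > 0.5

print('Selector Mask : ', var_selector.get_support())
print('Manual Mask   : ', manual_mask)
```

Because the decision depends on the scale of each column, it generally makes sense to think about feature scaling before picking a threshold.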

We'll now try VarianceThreshold on the classification and regression datasets that we created above. We'll also check the performance of LinearRegression and LogisticRegression estimators on features that were selected by VarianceThreshold. We'll plot features that were selected by VarianceThreshold to check how many features it selected from original data and whether it was able to get rid of all noise features that were added to original data.

1. Classification

In [79]:
from sklearn.feature_selection import VarianceThreshold

print('Train/Test Sizes Before Feature Selection : ',X_train_wine.shape, X_test_wine.shape)

select_var_classif = VarianceThreshold(threshold=0.99)
select_var_classif.fit(X_train_wine, Y_train_wine)

X_train_selected = select_var_classif.transform(X_train_wine)
X_test_selected  = select_var_classif.transform(X_test_wine)

print('Train/Test Sizes After Feature Selection: ', X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (142, 26) (36, 26)
Train/Test Sizes After Feature Selection:  (142, 11) (36, 11)
In [80]:
lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train_selected, Y_train_wine)
print('Test Accuracy : ', lr.score(X_test_selected, Y_test_wine))
print('Train Accuracy : ', lr.score(X_train_selected, Y_train_wine))
Test Accuracy :  1.0
Train Accuracy :  0.9577464788732394

Features Recovered

In [81]:
feature_selection_mask_classif = select_var_classif.get_support()
print('Classification Mask : ', feature_selection_mask_classif)
Classification Mask :  [False  True False  True  True False  True False False  True False False
  True  True False False False  True False  True  True False False  True
 False False]
In [ ]:
plt.figure(figsize=(10,8))

plt.imshow(feature_selection_mask_classif[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(wine.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Classification: Wine Dataset Feature Recovery');

Important Attributes of VarianceThreshold

Below is a list of important attributes of VarianceThreshold:

  • variances_ - It represents variances for each feature of the dataset.
In [83]:
print("Variance Size : ", select_var_classif.variances_.shape)
select_var_classif.variances_
Variance Size :  (26,)
Out[83]:
array([6.49436783e-01, 1.20195942e+00, 7.83922486e-02, 1.12887840e+01,
       2.10210524e+02, 4.11172867e-01, 1.01636302e+00, 1.48658798e-02,
       3.37288137e-01, 5.33334251e+00, 4.97146219e-02, 4.81511268e-01,
       1.05406922e+05, 1.12194399e+00, 8.40849442e-01, 7.93189236e-01,
       9.40671253e-01, 1.08606246e+00, 9.22587379e-01, 1.04410512e+00,
       1.01761370e+00, 8.90917548e-01, 8.25530234e-01, 1.03736076e+00,
       7.93740522e-01, 8.27104581e-01])

2. Regression

In [84]:
print('Train/Test Sizes Before Feature Selection : ',X_train_boston.shape, X_test_boston.shape)

select_var_regressor = VarianceThreshold(threshold=0.99)
select_var_regressor.fit(X_train_boston, Y_train_boston)

X_train_selected = select_var_regressor.transform(X_train_boston)
X_test_selected  = select_var_regressor.transform(X_test_boston)

print('Train/Test Sizes After Feature Selection: ', X_train_selected.shape, X_test_selected.shape)
Train/Test Sizes Before Feature Selection :  (404, 26) (102, 26)
Train/Test Sizes After Feature Selection:  (404, 14) (102, 14)
In [85]:
lr = LinearRegression()
lr.fit(X_train_selected, Y_train_boston)
print('Test R^2 Score: ', lr.score(X_test_selected, Y_test_boston))
print('Train R^2 Score : ', lr.score(X_train_selected, Y_train_boston))
Test R^2 Score:  0.6419014946940487
Train R^2 Score :  0.681600904961011

Features Recovered

In [86]:
feature_selection_mask_regression = select_var_regressor.get_support()
print('Regression Mask : ', feature_selection_mask_regression)
Regression Mask :  [ True  True  True False False False  True  True  True  True  True  True
  True  True False False  True False  True False False False False  True
 False False]
In [ ]:
plt.figure(figsize=(10,8))

plt.imshow(feature_selection_mask_regression[np.newaxis,:], cmap=plt.cm.Blues)
plt.xticks(range(26), list(boston.feature_names) + ['Noise'+str(i) for i in range(1,14)], rotation='vertical')
plt.yticks([])
plt.title('Regression: Boston Dataset Feature Recovery');

This ends our small tutorial on feature selection using scikit-learn. Please feel free to let us know your views in the comments section.


Sunny Solanki