Updated On : Sep-29,2021 Tags sweetviz, EDA
Sweetviz: Automate Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is the process of analyzing datasets using visualizations and basic summary statistics to understand the relationships, distributions, etc. of data variables. It is generally the first step performed with a new dataset to get insights about the data. Doing EDA manually, where we create the visualizations and statistics ourselves, is error-prone and time-consuming, and that time could otherwise be spent on more important tasks.

Sweetviz is a Python library that generates the EDA of a given dataset with just 2 lines of code. It produces a standalone HTML report with interactive visualizations of the dataset. This saves the time that would otherwise be spent doing EDA manually and avoids the mistakes we could introduce when doing things by hand.

Functionalities Provided by Sweetviz

Sweetviz lets us perform the analyses listed below.

  • Single Dataset Analysis - It shows summary statistics (min, max, median, quantiles, etc.) about each data column, visualizations showing the distribution of it (histograms, quantile charts, etc), missing counts, correlation with other data columns, etc.
  • Target Variable Analysis - It includes all details mentioned in single dataset analysis along with the relationship of each data column with the target variable (column that we want to predict in ML). It highlights the target variable column separately in the application as well.
  • Compare two datasets (train vs validation, train vs test, test vs validation, etc) - It provides summary statistics, visualizations of relations, correlations, and other details for two datasets next to each other. It can help us understand how different data columns are distributed in two different datasets.
  • Divide Dataset using boolean variable and Compare them - This analysis works like comparing two datasets, but instead of giving two different datasets we give a series/list of boolean values, and the comparison happens between the two datasets generated by dividing the original dataset on these True/False values. The series/list of boolean values must have the same length as the original dataset. The boolean values generally come from a column of the original dataset that has two values, like a gender column (male vs female).

We'll now start explaining how to use sweetviz with examples.

We have imported the necessary libraries for our purpose. We'll be using various datasets available from scikit-learn for explanation purposes.

In [9]:
import pandas as pd

import sweetviz

print("SweetViz Version : {}".format(sweetviz.__version__))
SweetViz Version : 2.1.3

Load Datasets

Below we have loaded 3 datasets available from scikit-learn which we'll be using in our examples. We have loaded each dataset as a pandas dataframe and displayed the first few lines for each to give an idea about the contents of the datasets.

  • Wine Dataset - This is a classification dataset that has information about the chemical properties (alcohol, malic acid, magnesium, ash, etc) of 3 different types of wines.
  • Diabetes Dataset - This is a regression dataset that has ten baseline variables for each patient. The target variable is a quantitative measure of disease progression one year after baseline.
  • Boston Dataset - This dataset has information about a number of attributes related to housing in the Boston area. The target variable is the median value of homes in thousands of dollars.
In [10]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
In [11]:
wine = datasets.load_wine()

wine_df = pd.DataFrame(data=wine.data, columns=wine.feature_names)

#wine_df["WineType"] = [wine.target_names[typ] for typ in wine.target]
wine_df["WineType"] = wine.target

wine_df.head()
Out[11]:
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline WineType
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0 0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0 0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0 0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0 0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0 0
In [12]:
diabetes = datasets.load_diabetes()

diabetes_df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)

diabetes_df["Progression"] = diabetes.target

diabetes_df.head()
Out[12]:
age sex bmi bp s1 s2 s3 s4 s5 s6 Progression
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019908 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068330 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592 0.002864 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022692 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031991 -0.046641 135.0
In [13]:
# Note: load_boston() was removed in scikit-learn 1.2, so this cell needs an older release.
boston = datasets.load_boston()

boston_df = pd.DataFrame(data=boston.data, columns=boston.feature_names)

boston_df["Price"] = boston.target

boston_df.head()
Out[13]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT Price
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

EDA Using Sweetviz

As a part of this section, we'll explain how to perform EDA using the datasets loaded above. Sweetviz provides 3 different methods primarily for performing exploratory data analysis. We have given definitions of each so that it becomes easy to use them.


  1. analyze(source=None,target_feat=None,feat_cfg=None,pairwise_analysis='auto') - This method lets us perform EDA on a single dataset given as the first parameter (source).
    • The source parameter takes as input a pandas dataframe or a tuple of dataframe and dataset name. The name given will be used in the visualizations.
    • The target_feat parameter takes as input a string naming the target variable, for which the report shows the relation with all other data columns. This parameter is optional.
    • The feat_cfg parameter takes as input an instance of FeatureConfig. The FeatureConfig lets us declare which data columns should be treated as numeric, which as categorical, and which as text. This lets us explicitly specify data column types if we want to override the default settings.
    • The pairwise_analysis parameter takes as input one of three strings ('auto', 'on' and 'off'). It controls whether the associations between columns are shown. The default is 'auto', which will show them whenever possible. If we set the value to 'off', the associations are not shown.
  2. compare(source=None,compare=None,target_feat=None,feat_cfg=None,pairwise_analysis=None) - This method takes two datasets as input and lets us perform EDA on both at the same time. It shows the EDA for the two datasets next to each other for better comparison. This method can be useful for performing EDA on a combination of train/test, train/validation, and test/validation datasets.
    • The source and compare parameters each take as input a dataframe or a tuple of dataframe and dataframe name. The dataframe names provided will be used in the visualizations.
    • All other parameters work exactly as they do in the analyze() method.
  3. compare_intra(source_df=None,condition_series=None,names=(),target_feat=None,feat_cfg=None,pairwise_analysis=None) - This method takes as input a dataset and a boolean series of the same length as the dataset. It divides the dataset based on the boolean values of the series and performs EDA on the two datasets generated this way. It shows the EDA for both datasets next to each other so we can compare them. The series is generally one of the boolean columns of the dataset.
    • The source_df parameter takes as input a pandas dataframe.
    • The condition_series parameter takes as input the boolean series based on which the dataset is divided into two datasets that are then compared.
    • All other parameters work exactly like they do in the analyze() and compare() methods.
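
To make the split behind compare_intra() concrete, here is a minimal pandas sketch of the idea: a boolean series with one value per row selects one subset, and its negation selects the other. It only illustrates the concept and is not Sweetviz's internal implementation; the small dataframe and its column names are made up for this example.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "near_river": [1, 0, 0, 1, 0],
    "price": [30.1, 21.4, 18.9, 35.0, 24.2],
})

# A boolean series with one True/False value per row of df.
condition = df["near_river"].astype(bool)
assert len(condition) == len(df)

# Rows where the condition holds, and rows where it does not.
true_part = df[condition]
false_part = df[~condition]

print(len(true_part), len(false_part))
```

A condition can also come from a comparison, such as df["price"] > 25, as long as it yields one boolean per row.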

Please note that each of the above-mentioned methods returns an instance of DataframeReport. This instance has two important methods which let us show the interactive EDA report either as a standalone HTML page or inside a Jupyter notebook.

Each of the above-mentioned methods will show a progress bar when it's generating EDA.

Important Methods of DataframeReport Object

  1. show_html(filepath='SWEETVIZ_REPORT.html',open_browser=True,layout='widescreen',scale=None) - This method opens the report in a browser as a separate page. We can then interact with it and look at the EDA results.
  2. show_notebook(w=None,h=None,scale=None,layout=None,filepath=None) - This method displays the EDA report inside a Jupyter notebook. It lets us set the width and height using the w and h parameters.
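
On a remote server or in a scheduled job, opening a browser is usually not wanted. Based on the show_html() signature listed above, a report could be written to a chosen path without launching a browser, roughly as sketched below. The helper name save_report and the output file name are assumptions made for this example.

```python
def save_report(report, filepath="wine_report.html"):
    """Write a Sweetviz report to `filepath` without opening a browser.

    `report` is expected to be the object returned by analyze(), compare()
    or compare_intra(). The helper name and default path are hypothetical.
    """
    report.show_html(filepath=filepath, open_browser=False)
    return filepath


# Usage (requires sweetviz and the wine_df dataframe from this tutorial):
# report = sweetviz.analyze(wine_df)
# save_report(report, "wine_report.html")
```

The generated HTML file can then be copied elsewhere or opened later in any browser.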

1. Single Dataset Analysis

Below we have generated an EDA report for the wine dataset using analyze() method. It returns an instance of type DataframeReport. We can then use it to display reports.

In [ ]:
report = sweetviz.analyze(wine_df)
In [15]:
report
Out[15]:
<sweetviz.dataframe_report.DataframeReport at 0x7f2301946dd8>

Below we have called show_html() method on the DataframeReport object. It'll open an HTML report in a browser.

In [ ]:
report.show_html()

We'll now explain the individual parts of the report. The dashboard generated by all three methods (analyze(), compare(), and compare_intra()) is the same, with a few more details present depending on the method called.

Summary

The summary section gives summary stats about the dataset like the number of samples, number of features, duplicates, RAM usage, categorical features count, numerical features count, and text features count. It'll show counts for both datasets if we have called compare() or compare_intra(). In this case, it shows summary stats for our whole wine dataset. This section also has a button named Associations. If we click on it, it generates a heatmap showing the associations between the features of the dataset.
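
Several of these summary numbers can be cross-checked with plain pandas. The sketch below uses a tiny made-up dataframe; the figures Sweetviz reports (RAM usage in particular) may be computed differently, so treat this as a rough sanity check rather than a reproduction of the report.

```python
import pandas as pd

# Hypothetical toy data with one duplicated row (rows 0 and 2 are identical).
df = pd.DataFrame({
    "alcohol": [14.2, 13.2, 14.2, 12.9],
    "wine_type": ["a", "b", "a", "c"],
})

n_rows, n_features = df.shape
n_duplicates = int(df.duplicated().sum())             # rows identical to an earlier row
memory_bytes = int(df.memory_usage(deep=True).sum())  # approximate in-memory size
n_numeric = df.select_dtypes(include="number").shape[1]

print(n_rows, n_features, n_duplicates, n_numeric, memory_bytes)
```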

Associations

When we click on the Associations button in the summary section, the heatmap appears on the right-hand side of the screen. Each tile of the heatmap holds either a square or a circle. The circles represent Pearson correlation in the range [-1, 1]. The squares represent categorical associations in the range [0, 1]. The categorical associations read row-wise: they show how much association the feature named on the left of the row has with every other feature. The heatmap shows a circle for the relation between two numerical features and a square for the relation between two categorical features or between a numerical and a categorical feature. The diagonal of the chart is left blank as each feature is fully related to itself. In our example, the WineType feature is categorical, hence the row and column representing WineType have squares whereas all other cells have circles because all other features are numerical.
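
To make the two ranges concrete, the sketch below computes Pearson's correlation for two numerical columns and a correlation ratio between a categorical and a numerical column, using only the standard library. Sweetviz's README describes its categorical-numerical association as a correlation ratio, but the code here is an independent illustration on made-up values, not the library's implementation.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences, in [-1, 1]."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numeric column, in [0, 1]."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    overall_mean = sum(values) / len(values)
    # Spread of the group means around the overall mean, weighted by group size.
    between = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values())
    total = sum((v - overall_mean) ** 2 for v in values)
    return math.sqrt(between / total)


# Made-up values: y falls as x rises, so the correlation is negative.
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))
# Made-up values: each category maps to a single value, so it fully determines the number.
print(correlation_ratio(["a", "a", "b", "b"], [1.0, 1.0, 3.0, 3.0]))
```

Pearson's correlation carries a sign because it measures direction as well as strength, while the correlation ratio only measures how much of the numeric spread the categories explain, which is why it cannot go below zero.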

Individual Column Stats

Below the summary section, there is a tab for each feature of our dataset. There is also a tab for the target variable if we have provided a column name to be treated as the target. Each tab has basic stats about the feature like total values, missing count, min, max, median, average, quantiles, range, standard deviation, etc. It also has a histogram showing the distribution of the feature data. Clicking on a tab opens a panel on the right-hand side showing more details about the feature. If we have provided a target variable name, its tab appears first and is colored black to differentiate it from the other columns.
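
For a single column, the headline numbers on its tab can be reproduced approximately with Python's statistics module, which is handy for spot-checking a report. The values below are made up, and quantile conventions differ between tools, so small differences from the report are to be expected.

```python
import statistics

# Hypothetical column values; None marks a missing entry.
column = [14.2, 13.2, None, 13.7, 12.9, 14.4, 13.2]

present = [v for v in column if v is not None]
missing_count = len(column) - len(present)

stats = {
    "values": len(present),
    "missing": missing_count,
    "min": min(present),
    "max": max(present),
    "range": max(present) - min(present),
    "median": statistics.median(present),
    "mean": statistics.mean(present),
    "std": statistics.stdev(present),                 # sample standard deviation
    "quartiles": statistics.quantiles(present, n=4),  # three cut points
}

for name, value in stats.items():
    print(name, value)
```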

More Stats and Relation with Different Columns

The panel that is displayed when we click on a feature tab below the summary section shows the actual values of the numerical and categorical associations of the feature with all other features, along with its most frequent, largest, and smallest values. It also shows the histogram of the feature's distribution again.

Sweetviz also lets us show reports inside a Jupyter notebook using the show_notebook() method which we explained earlier. Below we have displayed the report inside the Jupyter notebook. We have set the height parameter h to 1500 pixels to increase the height of the displayed report.

In [ ]:
report.show_notebook(h=1500)

2. Target Variable Analysis

As a part of this section, we'll explain how we can use sweetviz to perform target variable analysis which can be useful to see the relationship between the target variable and all features of the dataset. We can do so by just providing a column name from the dataframe that we want to use as the target variable in analyze() method.

Below we have generated a report from our diabetes dataframe using analyze() method. We have instructed the method to use Progression column as the target variable.

In [ ]:
report = sweetviz.analyze([diabetes_df, "Diabetes Dataset"], target_feat="Progression")

After generating the report, we have called show_html() method on the DataframeReport object to open it in a browser.

In [ ]:
report.show_html()

Target Variable Details

We can notice from the output that the target variable tab is highlighted in black to differentiate it from other columns.

Apart from this, the target variable values are also plotted as a line inside the histogram of each feature. This can be helpful to understand how the target variable relates to the feature's values. The value of the target variable is represented by the Y-axis drawn on the right. When we click on the tab of any feature, we also see that the association of that feature with the target variable is highlighted in black.

Below we have generated another example of target variable analysis, this time using the wine dataset. We have skipped the columns proline and magnesium from the original dataset and instructed Sweetviz to treat the WineType column as numerical using the FeatureConfig constructor. This constructor lets us provide configuration for the features of the dataset: we can explicitly name the features that we want to exclude from the report and the ones that should be treated as categorical, numerical, or text. The skip parameter accepts a list of column names to leave out of the report. The force_cat, force_text, and force_num parameters accept lists of column names that should be treated as categorical, text, and numerical respectively.

In [ ]:
config = sweetviz.FeatureConfig(skip=["proline", "magnesium"], force_num=['WineType'])
In [ ]:
report = sweetviz.analyze(source=wine_df, feat_cfg=config, target_feat="WineType")

Below we have displayed the report generated for the wine dataset by providing WineType as the target variable.

In [ ]:
report.show_html()

Below we have created another example showing usage of the analyze() method. We have generated a report for the wine dataset, but this time we have told the method not to include pairwise relationships between features. The report will therefore omit the association details that were included in all reports so far.

In [ ]:
report = sweetviz.analyze(source=wine_df, pairwise_analysis="off")
In [ ]:
report.show_html()

3. Compare Two Datasets

As a part of this section, we'll explain how we can compare two datasets and generate EDA for both. This will help us better understand how the data is distributed in the two datasets. We can compare train/test, train/validation, and test/validation datasets.

Sweetviz lets us generate EDA for two datasets using the compare() method. It'll show the EDA for the datasets next to each other.

We have first divided our diabetes dataset into train (80%) and test (20%) sets using scikit-learn's train_test_split() method. We'll be comparing these two datasets.

In [ ]:
train_df, test_df = train_test_split(diabetes_df, train_size=0.8)

train_df.shape, test_df.shape
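
Because train_test_split() shuffles rows randomly, each run produces a different split and therefore a slightly different comparison report. Passing random_state, a standard scikit-learn parameter, makes the split repeatable. The sketch below rebuilds the diabetes dataframe so that it runs on its own; the seed value 42 is an arbitrary choice.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()
diabetes_df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
diabetes_df["Progression"] = diabetes.target

# Fixing random_state makes the shuffle, and hence the split, reproducible.
train_a, test_a = train_test_split(diabetes_df, train_size=0.8, random_state=42)
train_b, test_b = train_test_split(diabetes_df, train_size=0.8, random_state=42)

print(train_a.shape, test_a.shape)
print(train_a.index.equals(train_b.index))
```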

Below we have generated an EDA report comparing train and test datasets generated from the diabetes dataset. We have informed the method to use Progression column as the target variable. We have then called show_html() on the report to open it in a new window of the browser.

In [ ]:
report = sweetviz.compare(source=train_df, compare=test_df, target_feat="Progression")
In [ ]:
report.show_html()

The full EDA report now shows details for both datasets.

The individual sections of the report are covered below to give an idea of how each one changes.

Summary

We can notice that the summary section now has summary details for both datasets. There are two Associations buttons, each of which shows the associations heatmap for one of the datasets when clicked.

Associations (Test Set)

[Screenshot: associations heatmap for the test set]

Individual Column Stats

Individual column stats now have statistics for both datasets given as input, highlighted using different colors. The histogram of the distribution is also drawn for both datasets in a single chart with different colors. There are two lines in the histogram, one per dataset, based on the target variable values.

More Stats and Relation with Different Columns

[Screenshot: detailed stats and associations of a feature for both datasets]

Target Variable Details

[Screenshot: target variable tab for both datasets]

Below we have again generated a report using both datasets, but this time we have given names for both datasets when calling the compare() method. The names then appear in the summary section of the report.

In [ ]:
report = sweetviz.compare(source=[train_df,"Train Set"], compare=[test_df, "Validation Set"],
                          target_feat="Progression")
In [ ]:
report.show_html()

4. Divide Dataset using boolean variable and Compare them

There are situations when we need to understand the data distribution based on some boolean column of the dataset. For example, we may want to see the EDA for all rows with gender male vs all rows with gender female. We can do this kind of comparison EDA using the compare_intra() method. It generates results that are almost identical to those generated by the compare() method.

We'll be using our Boston housing dataset to generate the report using the compare_intra() method. We have provided the values of column CHAS as boolean values to the condition_series parameter of the method, which tells it to divide the dataset based on these boolean values and then generate the EDA report. The CHAS variable of the Boston housing dataset holds boolean information about whether the houses bound the river or not. The compare_intra() method will divide the Boston dataset into two datasets based on the boolean values of column CHAS and generate an EDA comparing those two datasets.

We have then displayed the generated report by calling show_html() method on it.

In [ ]:
report = sweetviz.compare_intra(source_df=boston_df,
                                condition_series=boston_df["CHAS"].astype(bool),
                                names=["Bounds River","Doesn't Bound River"])
In [ ]:
report.show_html()

5. Analysis Directly using DataframeReport Object

All three methods that we explained earlier generate a report and return an instance of type DataframeReport, which we can display by calling show_html() method on it. We can also directly create an instance of DataframeReport with datasets and it works just as well.

Below we have created a report of the wine dataset by creating an instance of DataframeReport using its constructor. We can then call show_html() method on it and it'll open the report in a new browser tab.

In [ ]:
report = sweetviz.DataframeReport(source=wine_df)

Below we have explained another example where we provide train and test sets generated earlier from the diabetes dataset.

In [ ]:
report = sweetviz.DataframeReport(source=train_df, compare=test_df, target_feature_name="Progression")

This ends our small tutorial explaining how to use sweetviz library. Please feel free to let us know your views in the comments section.

Sunny Solanki