missingno - Visualize Missing Data in Python

Introduction

Python has a rich set of data visualization libraries for analyzing data from various perspectives. Most data analysis tasks concentrate on the relationships between attributes, the distribution of attributes, and so on. But many real-world datasets have missing values. This can happen for many reasons, such as data not being available or data getting lost during processing. Missing data needs special handling before it is fed to machine learning algorithms, as many of them cannot handle missing values. We therefore need a way to understand the distribution of missing data in our datasets as well.
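Before reaching for any chart, we can get a first numeric impression of missing data with plain pandas. The snippet below is a minimal sketch on a small made-up dataframe (the name toy and its values are hypothetical and only used for illustration). It counts the missing entries per column and expresses them as a percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are for illustration only.
toy = pd.DataFrame({
    "city": ["Ajman", "Abu Dhabi", "Andorra la Vella", "Ajman"],
    "postcode": [np.nan, np.nan, "AD500", np.nan],
    "phone": [np.nan, "123", "376818720", np.nan],
})

# Number of missing entries in each column.
missing_counts = toy.isnull().sum()

# Share of missing entries in each column, as a percentage.
missing_pct = toy.isnull().mean() * 100

print(missing_counts)
print(missing_pct.round(1))
```

Numbers like these tell us how much is missing, but not where in the dataset the gaps are or whether gaps in different columns occur together.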

Python has a library named missingno which provides a few graphs that let us visualize missing data from different perspectives. This can help us a lot when deciding how to handle missing data. The missingno library is based on matplotlib, hence all graphs generated by it are static. We'll explain the usage of this library as a part of this tutorial.

Chart Types Available with missingno

missingno currently provides 4 plots for understanding the distribution of missing data in a dataset:

  1. Bar Chart: It displays the count of values present per column, ignoring missing values.
  2. Matrix: The nullity matrix chart shows where values are present and missing across all columns of the dataset at the same time, which helps us see patterns in the missing data. It also displays a sparkline which highlights the rows with maximum and minimum nullity in the dataset.
  3. Heatmap: The chart displays the nullity correlation between columns of the dataset. It lets us understand how missing values in one column are related to missing values in other columns.
  4. Dendrogram: Like the heatmap, the dendrogram groups columns based on the nullity relation between them. Columns with a stronger nullity relation are grouped together.

This ends our small introduction to the library. Let's get started with the coding part without further delay.

We'll start by importing all the necessary libraries.

In [2]:
import missingno

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline

1. Load Datasets

We'll start by loading the datasets that we'll use for analyzing the distribution of missing data. We'll load the 2 datasets mentioned below as pandas dataframes.

  • Starbucks Store Locations Dataset: The dataset has information about Starbucks locations worldwide.
  • London Housing Dataset: The dataset has information about London housing prices along with other information like crimes, sales, salaries, etc which are collected monthly and yearly.

Both datasets are available from Kaggle. We suggest that you download them to follow along with the tutorial.

In [3]:
starbucks_locations = pd.read_csv("datasets/starbucks_store_locations.csv")
starbucks_locations.head()
Out[3]:
  | Brand | Store Number | Store Name | Ownership Type | Street Address | City | State/Province | Country | Postcode | Phone Number | Timezone | Longitude | Latitude
0 | Starbucks | 47370-257954 | Meritxell, 96 | Licensed | Av. Meritxell, 96 | Andorra la Vella | 7 | AD | AD500 | 376818720 | GMT+1:00 Europe/Andorra | 1.53 | 42.51
1 | Starbucks | 22331-212325 | Ajman Drive Thru | Licensed | 1 Street 69, Al Jarf | Ajman | AJ | AE | NaN | NaN | GMT+04:00 Asia/Dubai | 55.47 | 25.42
2 | Starbucks | 47089-256771 | Dana Mall | Licensed | Sheikh Khalifa Bin Zayed St. | Ajman | AJ | AE | NaN | NaN | GMT+04:00 Asia/Dubai | 55.47 | 25.39
3 | Starbucks | 22126-218024 | Twofour 54 | Licensed | Al Salam Street | Abu Dhabi | AZ | AE | NaN | NaN | GMT+04:00 Asia/Dubai | 54.38 | 24.48
4 | Starbucks | 17127-178586 | Al Ain Tower | Licensed | Khaldiya Area, Abu Dhabi Island | Abu Dhabi | AZ | AE | NaN | NaN | GMT+04:00 Asia/Dubai | 54.54 | 24.51
In [4]:
london_housing = pd.read_csv("datasets/housing_in_london_yearly_variables.csv")
london_housing.head()
Out[4]:
  | code | area | date | median_salary | life_satisfaction | mean_salary | recycling_pct | population_size | number_of_jobs | area_size | no_of_houses | borough_flag
0 | E09000001 | city of london | 1999-12-01 | 33020.0 | NaN | 48922 | 0 | 6581.0 | NaN | NaN | NaN | 1
1 | E09000002 | barking and dagenham | 1999-12-01 | 21480.0 | NaN | 23620 | 3 | 162444.0 | NaN | NaN | NaN | 1
2 | E09000003 | barnet | 1999-12-01 | 19568.0 | NaN | 23128 | 8 | 313469.0 | NaN | NaN | NaN | 1
3 | E09000004 | bexley | 1999-12-01 | 18621.0 | NaN | 21386 | 18 | 217458.0 | NaN | NaN | NaN | 1
4 | E09000005 | brent | 1999-12-01 | 18532.0 | NaN | 20911 | 6 | 260317.0 | NaN | NaN | NaN | 1

We'll now explain each graph type one by one with examples.

2. Bar Chart

2.1 London Housing Dataset Missing Data Bar Chart

Below we are plotting the count of values present per column, ignoring missing values, for the London housing dataset. We have kept the default settings of missingno apart from the figure size and font size.

In [ ]:
missingno.bar(london_housing, figsize=(10,5), fontsize=12);

2.2 London Housing Dataset Missing Data Bar Chart [Sorted]

Below we are again plotting the count of values present per column, ignoring missing values, for the London housing dataset. This time we have also changed the bar color and sorted the columns based on their missing values.

In [ ]:
missingno.bar(london_housing, color="dodgerblue", sort="ascending", figsize=(10,5), fontsize=12);
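The numbers behind a bar chart like this can be checked with plain pandas. The sketch below assumes that the bar heights are the per-column counts of non-missing values, which is what DataFrame.count() returns, and orders them in ascending order. The dataframe toy and its columns are hypothetical and only used for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are for illustration only.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, np.nan, np.nan, np.nan],
    "c": [1, 2, np.nan, 4],
})

# Count of non-missing values per column (missing values are ignored).
present = toy.count()

# Order the columns from the least complete to the most complete.
ordered = present.sort_values()

print(ordered)
```

Comparing such counts with the chart is a quick sanity check that we are reading the bars correctly.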

2.3 Starbucks Locations Dataset Missing Data Bar Chart [Normal and Logarithmic Y-Axis]

Below we are plotting the count of values present per column, ignoring missing values, for the Starbucks locations dataset, once with a normal y-axis and once with a logarithmic y-axis. We have combined both charts into one figure by passing matplotlib axes to missingno. We have also changed the bar color and font size of the charts, as well as the figure size, to improve the aesthetics.

In [ ]:
fig = plt.figure(figsize=(15,7))

ax1 = fig.add_subplot(1,2,1)
missingno.bar(starbucks_locations, color="tomato", fontsize=12, ax=ax1);

ax2 = fig.add_subplot(1,2,2)
missingno.bar(starbucks_locations, log=True, color="tab:green", fontsize=12, ax=ax2);

plt.tight_layout()

3. Missing Data Matrix

3.1 Starbucks Locations Dataset Missing Data Matrix Chart

Below we are plotting our first matrix plot, showing the distribution of missing values for the Starbucks locations dataset. We can see that all columns except Postcode and Phone Number have data present in them.

In [ ]:
missingno.matrix(starbucks_locations, figsize=(10,5), fontsize=12);

3.2 London Housing Dataset Missing Data Matrix Chart

Below we are plotting a matrix plot showing the distribution of missing values for the London housing dataset. We can notice that the columns median_salary, life_satisfaction, recycling_pct, population_size, number_of_jobs, area_size and no_of_houses have missing values. We can also see that area_size and no_of_houses have almost the same distribution of missing values.

From the sparkline we can also see that the least complete rows have values present in only 5 columns, whereas the most complete rows have values present in all 12 columns.
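The two numbers annotated on the sparkline can be cross-checked by counting non-missing values per row. Below is a minimal sketch of that idea on a made-up dataframe. The column names are borrowed from the London housing dataset, but the values are hypothetical and only used for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are for illustration only.
toy = pd.DataFrame({
    "code": ["E1", "E2", "E3"],
    "median_salary": [33020.0, np.nan, 19568.0],
    "area_size": [np.nan, np.nan, 8650.0],
})

# Number of non-missing values in each row (row completeness).
row_completeness = toy.notnull().sum(axis=1)

# The least and most complete rows, as summarized by a sparkline.
least_complete = row_completeness.min()
most_complete = row_completeness.max()

print(least_complete, most_complete)
```

Running the same two lines on the real dataframe should give the minimum and maximum shown next to the sparkline.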

In [ ]:
missingno.matrix(london_housing, figsize=(10,5), fontsize=12, color=(1, 0.38, 0.27));

3.3 London Housing Dataset Missing Data Matrix Chart [Without Sparkline]

Below we are plotting the matrix plot showing the distribution of missing values for the London housing dataset again, but this time without the sparkline. We have also changed the color of the plot, along with the figure size and font size.

In [ ]:
missingno.matrix(london_housing, sparkline=False, figsize=(10,5), fontsize=12, color=(0.27, 0.52, 1.0));

4. Missing Data Heatmap

4.1 Starbucks Locations Dataset Missing Data Heatmap

Below we are plotting a heatmap showing the nullity correlation between various columns of the Starbucks locations dataset. The majority of entries in the heatmap are empty because the Starbucks locations dataset has few missing values.

The nullity correlation ranges from -1 to 1.

  • -1 - Exact negative correlation: if the value of one variable is present, then the value of the other variable is definitely absent.
  • 0 - No correlation: the presence or absence of a value in one variable has no effect on the other.
  • 1 - Exact positive correlation: if the value of one variable is present, then the value of the other is definitely present.
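To make these three cases concrete, we can compute a nullity correlation by hand with pandas. The sketch below assumes that nullity correlation is the ordinary Pearson correlation between the 0/1 "is missing" indicators of two columns, and that columns which are never (or always) missing are left out because their correlation is undefined. The dataframe toy and its values are hypothetical and only used for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are for illustration only.
toy = pd.DataFrame({
    "longitude": [1.53, np.nan, 55.47, np.nan, 54.38, 54.54],
    "latitude": [42.51, np.nan, 25.39, np.nan, 24.48, 24.51],
    "postcode": [np.nan, "A1", np.nan, "B2", np.nan, np.nan],
    "brand": ["Starbucks"] * 6,
})

# 1 where a value is missing, 0 where it is present.
nullity = toy.isnull().astype(int)

# Keep only columns whose indicator varies; a constant column
# has no defined correlation with anything.
nullity = nullity.loc[:, nullity.nunique() > 1]

# Pairwise correlation between the missing-value indicators.
corr = nullity.corr()

print(corr.round(2))
```

Here longitude and latitude are always missing together, so their correlation is 1, while postcode is present exactly when longitude is missing, so their correlation is -1.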

We can see from the heatmap that Longitude and Latitude have a correlation of 1.0, which highlights that if the Longitude value is missing then the Latitude value will be missing as well. There is also a weak correlation between Postcode and Phone Number, which we had noticed above when visualizing the matrix chart.

In [ ]:
missingno.heatmap(starbucks_locations, figsize=(10,5), fontsize=12);

4.2 London Housing Dataset Missing Data Heatmap

Below we are plotting a heatmap showing the nullity correlation between various columns of the London housing dataset. We have changed the colormap of the graph this time, along with the figure size and font size.

We can notice from this heatmap that area_size and no_of_houses have a correlation of 1.0, which means that if the value in one column is missing then the value in the other column will be missing as well. We can also notice a good correlation of 0.6 between population_size and number_of_jobs, which indicates the same kind of missing value relationship. There are a few other nullity correlations in the heatmap which agree with what we saw in the missing value matrix plotted above for this dataset.

In [ ]:
missingno.heatmap(london_housing, cmap="RdYlGn", figsize=(10,5), fontsize=12);

5. Missing Data Dendrogram

5.1 London Housing Dataset Missing Data Dendrogram

Below we are plotting a dendrogram, which shows hierarchical clusters created based on the missing value correlation between the columns of a dataset. Columns whose missing values are strongly related are kept in the same cluster.

The dendrogram displays the clusters as a tree-like structure that shows the hierarchy of clusters.

Below we are creating the dendrogram of the London housing dataset. We can notice that area_size and no_of_houses form one cluster. We noticed above in the nullity correlation heatmap as well that area_size and no_of_houses have a nullity correlation of 1.0. The columns borough_flag, mean_salary, code and area create another cluster. The next cluster is created by adding median_salary to the cluster of borough_flag, mean_salary, code and area. The cluster after that is created by adding population_size to the cluster of median_salary, borough_flag, mean_salary, code and area. It proceeds like that and keeps on creating bigger clusters until we have all columns in one cluster.

The dendrogram method uses the scipy.cluster.hierarchy module to create clusters based on the data. By default, it uses the average linkage method of that module. We can also pass other methods available in that scipy module, like ward, centroid, etc., for creating clusters.
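The clustering step can be sketched directly with scipy. The code below is a rough approximation of the idea, not the library's exact implementation: we assume each column is represented by its vector of 0/1 "is missing" indicators and that these vectors are clustered with the average linkage method. The dataframe toy is hypothetical. Its column names are borrowed from the London housing dataset but the values are made up.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

# Hypothetical toy data; the values are for illustration only.
toy = pd.DataFrame({
    "area_size": [np.nan, np.nan, 1.0, 2.0, 3.0],
    "no_of_houses": [np.nan, np.nan, 10.0, 20.0, 30.0],
    "population_size": [np.nan, 5.0, 6.0, 7.0, 8.0],
    "code": ["a", "b", "c", "d", "e"],
})

# One row per column of the dataframe, one 0/1 entry per record.
nullity_by_column = toy.isnull().astype(int).values.T

# Agglomerative clustering of the columns with average linkage.
links = hierarchy.linkage(nullity_by_column, method="average")

# Leaf order of the resulting tree, computed without drawing it.
tree = hierarchy.dendrogram(links, labels=list(toy.columns), no_plot=True)

print(links)
print(tree["ivl"])
```

In this toy example area_size and no_of_houses have identical missing value patterns, so they are merged first at a distance of 0, which mirrors what we see for these two columns in the chart.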

In [ ]:
missingno.dendrogram(london_housing, figsize=(10,5), fontsize=12);

5.2 London Housing Dataset & Starbucks Locations Dataset Missing Data Dendrogram

Below we are plotting dendrograms of the Starbucks locations and London housing datasets. We have used different methods for creating clusters this time: the centroid method for the columns of the Starbucks locations dataset and the ward method for the London housing dataset. We suggest that you try the various clustering methods available from scipy.

We can notice from the graph below that Longitude and Latitude are kept in one cluster. We had noticed in the heatmap above as well that they have a nullity correlation of 1.0. We can also notice that State/Province, Ownership Type, Store Number, Brand, Store Name, Timezone, and Country are kept together because they don't have any missing values.

In [ ]:
fig = plt.figure(figsize=(15,7))

ax1 = fig.add_subplot(1,2,1)
missingno.dendrogram(starbucks_locations, orientation="right", method="centroid", fontsize=12, ax=ax1);

ax2 = fig.add_subplot(1,2,2)
missingno.dendrogram(london_housing, orientation="top", method="ward", fontsize=12, ax=ax2);

plt.tight_layout()

This ends our small tutorial explaining how to use the missingno library to plot various visualizations for understanding the distribution of missing data in our datasets. Please feel free to let us know your views in the comments section.

Sunny Solanki