How to Plot Sankey Diagram in Python Jupyter Notebook [holoviews & plotly]?

Introduction

Sankey diagrams are commonly used to display the flow of some quantity from sources to destinations. Each flow is drawn as an arrow whose width is proportional to the amount flowing from its source to its destination. Typical uses include population migration, website user journeys, energy flows, and flows of other resources such as oil and gas. Python is a popular choice for data analysis and has a rich set of libraries for visualizing results. In this tutorial, we'll explain how to create Sankey diagrams in Python using the holoviews and plotly libraries. We'll also explain various ways to change the styling of the plot and improve its aesthetics.

So without further delay, let’s get started with the coding part.

We'll first import all the libraries needed for this tutorial: pandas, numpy, holoviews, and plotly. We'll also set bokeh as the holoviews backend. If you are not familiar with holoviews, we suggest going through this simple tutorial about holoviews basic plotting, which will help you in the future as well.

In [1]:
import pandas as pd
import numpy as np

import holoviews as hv
import plotly.graph_objects as go
import plotly.express as pex
In [ ]:
hv.extension('bokeh')

Load Dataset

We'll be using the New Zealand migration dataset for our plots. It's available on kaggle for download.

It records the number of people who arrived in and departed from New Zealand, by continent and country, from 1979 to 2016. We'll aggregate this data in various ways to create different Sankey diagrams. We suggest that you download the dataset to follow along. We'll start by loading it as a pandas dataframe.

In [2]:
nz_migration = pd.read_csv("datasets/migration_nz.csv")
nz_migration.head()
Out[2]:
    Measure     Country             Citizenship  Year    Value
0  Arrivals     Oceania     New Zealand Citizen  1979  11817.0
1  Arrivals     Oceania      Australian Citizen  1979   4436.0
2  Arrivals     Oceania  Total All Citizenships  1979  19965.0
3  Arrivals  Antarctica     New Zealand Citizen  1979     10.0
4  Arrivals  Antarctica      Australian Citizen  1979      0.0

We'll then perform a few steps of data cleaning as mentioned below.

  • We'll remove entries other than arrivals and departures.
  • We'll remove entries where a proper country name is not present.
  • We'll then group the dataframe by the Measure and Country attributes and sum up all entries.

After these steps, we'll have a dataset with the all-time arrival and departure counts for each country and continent.

In [3]:
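# Keep only arrivals and departures (drop the "Net" rows)
nz_migration = nz_migration[nz_migration["Measure"]!="Net"]
# Drop rows without a proper country name
nz_migration = nz_migration[~nz_migration["Country"].isin(["Not stated", "All countries"])]
# Select "Value" before summing so that Year and Citizenship are not aggregated
nz_migration_grouped = nz_migration.groupby(by=["Measure","Country"])[["Value"]].sum()
nz_migration_grouped = nz_migration_grouped.reset_index()
nz_migration_grouped.head()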
Out[3]:
    Measure                     Country     Value
0  Arrivals                 Afghanistan    1644.0
1  Arrivals  Africa and the Middle East  149784.0
2  Arrivals                     Albania     178.0
3  Arrivals                     Algeria     143.0
4  Arrivals              American Samoa    2412.0

1. Sankey Diagram Using Holoviews

We'll first introduce plotting Sankey diagrams using holoviews.

1.1 Sankey Diagram of Population Migration between New Zealand & Various Continents

For our first Sankey diagram, we need to keep only the dataframe entries that hold continent-level counts. Below we filter the dataset by continent name to remove all other entries.

In [4]:
continents = ["Asia", "Australia","Africa and the Middle East","Europe", "Americas", "Oceania"]
continent_wise_migration = nz_migration_grouped[nz_migration_grouped.Country.isin(continents)]
continent_wise_migration
Out[4]:
        Measure                     Country      Value
  1    Arrivals  Africa and the Middle East   149784.0
  5    Arrivals                    Americas   267137.0
 14    Arrivals                        Asia   795697.0
 15    Arrivals                   Australia  1057127.0
 74    Arrivals                      Europe  1044693.0
166    Arrivals                     Oceania  1331987.0
252  Departures  Africa and the Middle East    63555.0
256  Departures                    Americas   245915.0
265  Departures                        Asia   317603.0
266  Departures                   Australia  2325398.0
325  Departures                      Europe   877240.0
417  Departures                     Oceania  2534100.0

We can plot a Sankey diagram with holoviews simply by passing it the dataframe created above. Holoviews needs a dataframe with at least three columns: source, destination, and the flow value from source to destination. If we don't specify which column serves which purpose, it treats the first column as the source, the second as the destination, and the last as the flow value.

In [ ]:
hv.Sankey(continent_wise_migration)

Sankey Diagram of Population Migration between New Zealand & Various Continents
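
Because these defaults depend on column position, a dataframe whose columns arrive in a different order needs to be reordered first (or given explicit column names, as in the next example). Here is a minimal sketch of the reordering; the `flows` frame and its numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose columns are NOT in source/target/value order.
flows = pd.DataFrame({
    "Value": [120.0, 80.0, 45.0],
    "Country": ["Asia", "Europe", "Asia"],
    "Measure": ["Arrivals", "Arrivals", "Departures"],
})

# Reorder so the first column is the source, the second the target,
# and the last the flow value, matching the positional defaults.
edges = flows[["Measure", "Country", "Value"]]

print(list(edges.columns))
```

A frame shaped like `edges` could then be passed to hv.Sankey() without naming any columns.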

Below we create the same Sankey plot again, but this time we specify which dataframe columns to use as source and destination with the kdims parameter, and which column determines the flow arrow sizes with the vdims parameter.

We have also passed various parameters to the opts() method of the Sankey plot object to improve the styling of the diagram. They set the colormap for nodes and edges, the label position, the column used for edge color, the edge line width, the node opacity, the graph width and height, the background color, and the title.

In [ ]:
sankey1 = hv.Sankey(continent_wise_migration, kdims=["Measure", "Country"], vdims=["Value"])

sankey1.opts(cmap='Colorblind',label_position='left',
                                 edge_color='Country', edge_line_width=0,
                                 node_alpha=1.0, node_width=40, node_sort=True,
                                 width=800, height=600, bgcolor="snow",
                                 title="Population Migration between New Zealand and Other Continents")

Sankey Diagram of Population Migration between New Zealand & Various Continents

From the chart above, we can see that most people leaving New Zealand departed to Oceania, Australia, and Europe, whereas most arrivals came from Oceania, Australia, Europe, and Asia. Comparatively few people departed to Asia or to Africa and the Middle East. Migration to and from the Americas is also small compared with most other regions.
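
To put numbers on these observations, here is a small sketch that subtracts departures from arrivals for each region. The totals are copied from the continent-level table above; the "net" figure is our own arithmetic, not a column of the dataset.

```python
# Totals copied from the continent-level table above (Out[4]).
arrivals = {
    "Africa and the Middle East": 149784,
    "Americas": 267137,
    "Asia": 795697,
    "Australia": 1057127,
    "Europe": 1044693,
    "Oceania": 1331987,
}
departures = {
    "Africa and the Middle East": 63555,
    "Americas": 245915,
    "Asia": 317603,
    "Australia": 2325398,
    "Europe": 877240,
    "Oceania": 2534100,
}

# Net migration = arrivals minus departures; negative means more people left.
net = {region: arrivals[region] - departures[region] for region in arrivals}

# Print regions from the largest net outflow to the largest net inflow.
for region, value in sorted(net.items(), key=lambda item: item[1]):
    print(f"{region:<28} {value:>12,d}")
```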

1.2 Sankey Diagram of Population Migration between New Zealand & Various European Countries

We'll now try to plot another Sankey Diagram depicting population migration between New Zealand and various European countries.

To plot it, we first need to filter our dataframe to keep only the entries for European countries.

In [5]:
european_countries = ['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Denmark', 'Finland', 'France',
                      'Germany', 'Greece', 'Ireland', 'Italy', 'Netherlands', 'Norway', 'Poland',
                      'Romania', 'Russia', 'Spain', 'Sweden', 'Switzerland', 'Ukraine']

european_countries_migration = nz_migration_grouped[nz_migration_grouped["Country"].isin(european_countries)]
european_countries_migration.head()
Out[5]:
     Measure   Country   Value
16  Arrivals   Austria  3867.0
23  Arrivals   Belgium  4710.0
35  Arrivals  Bulgaria   792.0
55  Arrivals   Croatia  2150.0
62  Arrivals   Denmark  5333.0

Below we plot our second Sankey diagram, showing migration flow between New Zealand and European countries. Note that we have swapped the source and target columns compared with the previous diagram. We have also changed various styling attributes, this time using the holoviews %%opts Jupyter notebook cell magic. This is another way to modify the styling and other attributes of holoviews graphs.

In [ ]:
%%opts Sankey (cmap='Category10' edge_color='Country' edge_line_width=0 node_alpha=1.0)
%%opts Sankey [label_position='left' bgcolor="snow" node_width=40 node_sort=True ]
%%opts Sankey [width=900 height=800 title="Population Migration between New Zealand and European Countries"]
%%opts Sankey [margin=0 padding=0]

hv.Sankey(european_countries_migration, kdims=["Country", "Measure"], vdims=["Value"])

Sankey Diagram of Population Migration between New Zealand & Various European Countries
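
The options set through the magic lines can also be kept in an ordinary dictionary and applied with the opts() method used in section 1.1, which makes it easy to reuse one style across several plots. This is a sketch under assumptions: the names `SANKEY_STYLE` and `styled_sankey` are our own, and only options that already appeared in the earlier opts() call are included.

```python
# Shared style; the dictionary name and the helper below are illustrative
# and not part of holoviews itself.
SANKEY_STYLE = dict(
    cmap="Category10",
    edge_color="Country",
    edge_line_width=0,
    node_alpha=1.0,
    node_width=40,
    node_sort=True,
    label_position="left",
    bgcolor="snow",
    width=900,
    height=800,
    title="Population Migration between New Zealand and European Countries",
)


def styled_sankey(data, **overrides):
    """Build a Sankey element and apply the shared style plus any overrides."""
    # Imported lazily so the style dictionary can be inspected without holoviews.
    # hv.extension('bokeh') is assumed to have been called already.
    import holoviews as hv

    options = {**SANKEY_STYLE, **overrides}
    sankey = hv.Sankey(data, kdims=["Country", "Measure"], vdims=["Value"])
    return sankey.opts(**options)
```

Calling styled_sankey(european_countries_migration, height=600) would then override only the height.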

We can see from the above graph that countries like Germany, France, Netherlands, Ireland, and Switzerland have a high flow of population migration with New Zealand. Many people arrive in New Zealand from these countries.

1.3 Sankey Diagram of Users Journey on a Website

Our third Sankey diagram will have more than two layers. We'll display the journey of users on our website by simulating user flow between its major pages. Below we first create a dataset of the number of users moving between various pages of the website.

In [6]:
source_dest = [
    ["Google", "Home"],
    ["Google", "Tutorials Page"],
    ["Google", "Blogs Page"],
    ["Google", "Contact Page"],
    ["Google", "About Page"],

    ["Facebook", "Home"],
    ["Facebook", "Tutorials Page"],
    ["Facebook", "Blogs Page"],
    ["Facebook", "Contact Page"],
    ["Facebook", "About Page"],

    ["Twitter", "Home"],
    ["Twitter", "Tutorials Page"],
    ["Twitter", "Blogs Page"],
    ["Twitter", "Contact Page"],
    ["Twitter", "About Page"],

    ["Bing", "Home"],
    ["Bing", "Tutorials Page"],
    ["Bing", "Blogs Page"],
    ["Bing", "Contact Page"],
    ["Bing", "About Page"],

    ["Direct", "Home"],
    ["Direct", "Tutorials Page"],
    ["Direct", "Blogs Page"],
    ["Direct", "Contact Page"],
    ["Direct", "About Page"],

    ["Home", "Exit"],
    ["Home", "Tutorials Page"],
    ["Home", "Blogs Page"],
    ["Home", "Contact Page"],
    ["Home", "About Page"],

    ["Tutorials Page", "Exit"],
    ["Tutorials Page", "Python Tutorial"],
    ["Tutorials Page", "ML Tutorial"],
    ["Tutorials Page", "AI Tutorial"],
    ["Tutorials Page", "Data Science Tutorial"],
    ["Tutorials Page", "Digital Marketing Tutorial"],
    ["Tutorials Page", "Android Tutorial"],

    ["Blogs Page", "Exit"],
    ["Blogs Page", "Python Blog"],
    ["Blogs Page", "ML Blog"],
    ["Blogs Page", "AI Blog"],
    ["Blogs Page", "Data Science Blog"],
    ["Blogs Page", "Digital Marketing Blog"],
    ["Blogs Page", "Android Blog"],

    ["Python Blog", "Exit"],
    ["Python Blog", "ML Blog"],
    ["Python Blog", "AI Blog"],
    ["Python Blog", "Data Science Blog"],

    ["ML Tutorial", "Python Tutorial"],
    ["ML Tutorial", "Exit"],
    ["ML Tutorial", "AI Tutorial"],
    ["ML Tutorial", "Data Science Tutorial"],

    ["Digital Marketing Blog", "Exit"],
    ["Digital Marketing Blog", "Android Blog"],
]

website_vists = pd.DataFrame(source_dest, columns=["Source", "Dest"])
website_vists["Count"] = np.random.randint(1,1000, size=website_vists.shape[0])
website_vists.head()
Out[6]:
   Source            Dest  Count
0  Google            Home    330
1  Google  Tutorials Page    405
2  Google      Blogs Page    586
3  Google    Contact Page     58
4  Google      About Page    353
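
A layered diagram like this generally assumes that the flows move forward only: a loop such as Home to Blogs Page and back to Home has no natural left-to-right layer order. Here is a minimal sketch of checking an edge list for cycles with a depth-first search; `has_cycle` is a hypothetical helper, not a holoviews or plotly function.

```python
from collections import defaultdict


def has_cycle(edges):
    """Return True if the directed (source, dest) edge list contains a cycle."""
    graph = defaultdict(list)
    for source, dest in edges:
        graph[source].append(dest)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = defaultdict(int)

    def visit(node):
        state[node] = GREY
        for nxt in graph[node]:
            if state[nxt] == GREY:  # reached a node already on the current path
                return True
            if state[nxt] == WHITE and visit(nxt):
                return True
        state[node] = BLACK
        return False

    # Snapshot the keys because visit() may add leaf nodes to the graph.
    return any(state[node] == WHITE and visit(node) for node in list(graph))


sample = [("Google", "Home"), ("Home", "Blogs Page"), ("Blogs Page", "Exit")]
print(has_cycle(sample))
print(has_cycle(sample + [("Blogs Page", "Home")]))
```

Running has_cycle(source_dest) on the list defined above is a quick check to make before plotting.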

Below we create our third Sankey diagram, showing the journey of users on our website based on the simulated data created above. We have also modified various styling and graph configuration attributes to improve its aesthetics.

In [ ]:
%%opts Sankey (edge_color="Source"  edge_line_width=2 node_cmap="tab20")
%%opts Sankey (node_alpha=1.0 edge_hover_fill_color="red")
%%opts Sankey [label_position='right' node_width=30 node_sort=True ]
%%opts Sankey [title="User Journey on a Website" width=900 height=700]
%%opts Sankey [margin=0 padding=0 bgcolor="grey"]

hv.Sankey(website_vists, kdims=["Source", "Dest"], vdims=["Count"])

Sankey Diagram of Users Journey on a Website

2. Sankey Diagram using Plotly

We'll now explain how to generate the same Sankey diagrams using plotly.

2.1 Sankey Diagram of Population Migration between New Zealand & Various Continents

Generating a Sankey diagram with plotly is a bit different from holoviews and requires some data processing before plotting. Plotly expects a list of node names, plus separate lists of source node indexes, destination node indexes, and flow values. We'll need to perform the steps below to generate a Sankey diagram using plotly.

  • First, we create a list of all possible nodes.
  • Then we generate the index lists of source nodes and target nodes, based on each node's position in the list of all nodes created in the previous step.
  • We pass the list of all nodes to the node parameter of the Sankey() method of the graph_objects module.
  • We pass the source indices, the destination indices, and the flow values between those nodes to the link parameter of the same Sankey() method.
  • We can then update various plot attributes by calling the update_layout() method on the figure object.
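
The first two steps can be sketched without plotly on a few made-up rows. The notebook cells in this section build the node list by concatenating two columns; this variant removes duplicate labels first, which is a choice of the sketch rather than something plotly requires.

```python
# Toy rows in (source, target, value) form; the numbers are made up.
rows = [
    ("Arrivals", "Asia", 120.0),
    ("Arrivals", "Europe", 80.0),
    ("Departures", "Asia", 45.0),
    ("Departures", "Europe", 60.0),
]

# Step 1: every distinct node label once, in first-seen order.
all_nodes = list(dict.fromkeys(
    label for source, target, _ in rows for label in (source, target)
))

# Step 2: map each label to its position, then translate rows to indices.
node_index = {label: i for i, label in enumerate(all_nodes)}
source_indices = [node_index[source] for source, _, _ in rows]
target_indices = [node_index[target] for _, target, _ in rows]
values = [value for _, _, value in rows]

print(all_nodes)
print(source_indices, target_indices, values)
```

The dictionary lookup also avoids rescanning the node list for every row, which list.index() does.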
In [ ]:
all_nodes = continent_wise_migration.Measure.values.tolist() + continent_wise_migration.Country.values.tolist()
source_indices = [all_nodes.index(measure) for measure in continent_wise_migration.Measure]
target_indices = [all_nodes.index(country) for country in continent_wise_migration.Country]

fig = go.Figure(data=[go.Sankey(
    # Define nodes
    node = dict(
      label =  all_nodes,
      color =  "red"
    ),

    # Add links
    link = dict(
      source =  source_indices,
      target =  target_indices,
      value =  continent_wise_migration.Value,
))])

fig.update_layout(title_text="Population Migration between New Zealand and Other Continents",
                  font_size=10)
fig.show()

Sankey Diagram of Population Migration between New Zealand & Various Continents

The graph above shows every node in red because we set the node color attribute to "red". If we omit that attribute, plotly uses its default colors for the different nodes of the graph.

In [ ]:
all_nodes = continent_wise_migration.Measure.values.tolist() + continent_wise_migration.Country.values.tolist()
source_indices = [all_nodes.index(measure) for measure in continent_wise_migration.Measure]
target_indices = [all_nodes.index(country) for country in continent_wise_migration.Country]


fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 20,
      thickness = 20,
      line = dict(color = "black", width = 1.0),
      label =  all_nodes,
    ),

    link = dict(
      source =  source_indices,
      target =  target_indices,
      value =  continent_wise_migration.Value,
))])

fig.update_layout(title_text="Population Migration between New Zealand and Other Continents",
                  font_size=10)
fig.show()

Sankey Diagram of Population Migration between New Zealand & Various Continents

2.2 Sankey Diagram of Population Migration between New Zealand & Various European Countries

Below we plot the second Sankey diagram using plotly. This time we add logic to color the nodes and edges. We maintain a dictionary that maps each node name to a color picked at random from a plotly palette, and we use it to set the color of each node and of each edge (based on the edge's source country). Because the colors are chosen randomly, the plot will look different on each run. We suggest that you try various color combinations, including different colors for nodes and edges.

In [ ]:
all_nodes = european_countries_migration.Country.values.tolist() + european_countries_migration.Measure.values.tolist()
source_indices = [all_nodes.index(country) for country in european_countries_migration.Country]
target_indices = [all_nodes.index(measure) for measure in european_countries_migration.Measure]

colors = pex.colors.qualitative.D3

node_colors_mappings = dict([(node,np.random.choice(colors)) for node in all_nodes])
node_colors = [node_colors_mappings[node] for node in all_nodes]
edge_colors = [node_colors_mappings[node] for node in european_countries_migration.Country]

fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 20,
      thickness = 20,
      line = dict(color = "black", width = 1.0),
      label =  all_nodes,
      color =  node_colors,
    ),

    link = dict(
      source =  source_indices,
      target =  target_indices,
      value =  european_countries_migration.Value,
      color = edge_colors,
))])

fig.update_layout(title_text="Population Migration between New Zealand and European Countries",
                  height=600,
                  font_size=10)
fig.show()

Sankey Diagram of Population Migration between New Zealand & Various European Countries
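
If you prefer colors that stay the same between runs, one option is to cycle through a palette instead of sampling from it, and to give the links a translucent version of their source node's color. The sketch below uses a hard-coded ten-color list so that it needs no third-party imports; `hex_to_rgba` and the sample labels are our own illustrations.

```python
from itertools import cycle

# Hard-coded ten-colour categorical palette (hex strings); an assumption of
# this sketch, used in place of a palette imported from plotly.
PALETTE = ["#1F77B4", "#FF7F0E", "#2CA02C", "#D62728", "#9467BD",
           "#8C564B", "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF"]


def hex_to_rgba(hex_color, alpha):
    """Convert '#RRGGBB' to an 'rgba(r, g, b, a)' colour string."""
    hex_color = hex_color.lstrip("#")
    red, green, blue = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgba({red}, {green}, {blue}, {alpha})"


nodes = ["Austria", "Belgium", "Arrivals", "Departures"]  # sample labels

# Cycling through the palette makes the mapping repeatable between runs.
node_colors_mapping = dict(zip(nodes, cycle(PALETTE)))
node_colors = [node_colors_mapping[name] for name in nodes]

# Each link reuses its source node's colour, made translucent.
link_sources = ["Austria", "Austria", "Belgium", "Belgium"]
edge_colors = [hex_to_rgba(node_colors_mapping[name], 0.4) for name in link_sources]

print(node_colors)
print(edge_colors)
```

The resulting lists can be passed as the color attributes of node and link, in the same way as node_colors and edge_colors above.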

2.3 Sankey Diagram of Users Journey on a Website

We now regenerate the Sankey diagram of the user journey on a website, this time using plotly. We have also tried to improve the look of the graph by modifying its background and other configuration attributes.

In [ ]:
all_nodes = website_vists.Source.values.tolist() + website_vists.Dest.values.tolist()
source_indices = [all_nodes.index(page) for page in website_vists.Source]
target_indices = [all_nodes.index(page) for page in website_vists.Dest]

colors = pex.colors.qualitative.D3
node_colors = [np.random.choice(colors) for node in all_nodes]

fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 20,
      thickness = 20,
      line = dict(color = "black", width = 1.0),
      label =  all_nodes,
      color =  node_colors,
    ),

    link = dict(
      source =  source_indices,
      target =  target_indices,
      value =  website_vists.Count,
))])

fig.update_layout(title_text="User Journey on Website",
                  height=600,
                  font=dict(size = 10, color = 'white'),
                  plot_bgcolor='black', paper_bgcolor='black')

fig.show()

Sankey Diagram of Users Journey on a Website
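
Since plotly pairs the source, target, and value lists position by position, it can help to check that they line up before building the figure. Here is a minimal sketch of such a check; `check_links` is a hypothetical helper, not part of plotly.

```python
def check_links(node_labels, source, target, value):
    """Raise ValueError if the link lists cannot describe a consistent diagram."""
    if not (len(source) == len(target) == len(value)):
        raise ValueError(
            f"link lists differ in length: {len(source)}, {len(target)}, {len(value)}"
        )
    for name, indices in (("source", source), ("target", target)):
        for position, index in enumerate(indices):
            # Every index must point at an existing entry of node_labels.
            if not 0 <= index < len(node_labels):
                raise ValueError(f"{name}[{position}] = {index} is not a valid node index")
    return True


labels = ["Google", "Home", "Exit"]
print(check_links(labels, [0, 1], [1, 2], [330, 120]))
```

For the plot above, the call would be check_links(all_nodes, source_indices, target_indices, list(website_vists.Count)).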

This ends our small tutorial on plotting Sankey diagrams in Python using holoviews and plotly. Please feel free to let us know your views in the comments section.

Sunny Solanki