Graphs or commonly referred to as networks are interesting data structure which is used to show relationships between various entities. Graphs are used to represent many real-life scenarios like social networks, airports, and flights between them, recipe and ingredients, and many more. The analysis of the graph data structure requires a different set of algorithms some of which we'll be discussing in this tutorial. We'll also be explaining various properties of graphs which will be useful when doing analysis to find out meaningful insights.
Graphs are data structure which has two main entities:
Graphs are generally represented as G(V, E) where V represents a list of vertices/nodes and E represents a list of edges between those nodes. When representing graphs as visually each node is represented as a circle and each edge is shown as a line connecting nodes labeling relation between that nodes.
Graphs are generally divided into 2 categories:
1. Directed Graphs: Directed graphs have edges with direction from one node to another hence highlighting that data can be flown in direction of arrow only and not reverse. It represents that the relationship between two nodes has direction.
2. Undirected Graphs: Undirected graphs do not have any kind of direction information associated with it meaning that data can be flown in both directions. It represents that the relationship exists between two nodes mutually.
Python provides a library called networkx
for managing and manipulating graph data structure as well as various methods to analyze the properties of networks. We have already covered small tutorials explaining the usage of networkx which we suggest that you go through as it'll give you a basic idea about network creation using its API. We'll further explain graph creation and networkx API as a part of this tutorial as well. Networkx lets us create both directed and undirected graphs. We'll be using the nxviz
library for visualizing various network plots. We'll be using network datasets available from Konect when explaining various graph concepts using networkx.
So without further delay, let’s get started with the tutorial.
We'll start by importing all the necessary libraries.
# https://coderzcolumn.com/tutorials/python/networkx-basic
# https://snap.stanford.edu/data/
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
import warnings
warnings.filterwarnings("ignore")
print('Python Version : '+sys.version)
print('NetworkX version : '+nx.__version__)
%matplotlib inline
The first dataset that we'll load is seventh grader dataset of students from Victoria. It has information about 29 seventh grade students who were asked about their favorite partner for three activities. It's a directed graph dataset. Each student will be represented as node and edge between two students represents that left student preferred the right student for some activity. The edge between students also has a weight between 1-3 which represents how often left students picked the right student as his partner. The graph created out of this dataset will be directed graph
due to edge directions from student to his/her preferred partner.
We'll start loading the dataset and create a graph from it. We'll use this dataset for explaining the graph creation process as well as various properties of graphs available through networkx API.
df = pd.read_csv('datasets/moreno_seventh/out.moreno_seventh_seventh',
skiprows=2, header=None, sep=' ')
df.columns = ['student1', 'student2', 'count']
df.head()
metadata = pd.read_csv('datasets/moreno_seventh/ent.moreno_seventh_seventh.student.gender',header=None)
metadata.columns=["Gender"]
metadata.head()
We can create a directed graph by using DiGraph()
method of networkx. We can then loop through rows of our dataset and add edges to the graph. Directed graph object has method named add_edge()
and add_node()
which can be used to add edge and node respectively to graph. We can also add metadata about each edge and node using these methods. Whatever parameters that we pass after node name and edge names are added as metadata of that node and edge. Below we are adding weight
of each edge as metadata which represents how often left students selected the right student for activities.
directed_G = nx.DiGraph()
for s1, s2, cnt in df.values:
directed_G.add_edge(s1, s2, weight=cnt)
Below we'll now add metadata for each node which will have information about whether the student is male or female.
We can access list nodes and edges of the graph by calling methods nodes()
and edges()
respectively on graph objects.
We can access individual node and edge by attribute nodes
and edges
available in the graph.
for node in directed_G.nodes():
directed_G.node[node]["Gender"] = metadata.loc[node-1]["Gender"]
directed_G.nodes[1], directed_G.edges[1,5]
We can even access node and edge list with their metadata by passing data=True
to nodes()
and edges()
methods.
list(directed_G.nodes())[:5], list(directed_G.nodes(data=True))[:5]
list(directed_G.edges(data=True))[:5]
We can modify existing metadata of nodes and edges of graphs as well as we can add more nodes and edges to the graph. We'll below explain how to add new nodes and edges to the existing graph. We have already explained above how to add metadata to nodes and edges.
directed_G.add_node(30, Gender="male")
directed_G.add_node(31, Gender="female")
directed_G.add_edge(30,31, weight=3)
directed_G.add_edge(31,1, weight=2)
directed_G.add_edge(31,2, weight=1)
directed_G.nodes[30], directed_G.nodes[31]
directed_G.edges[30,31], directed_G.edges[31,1]
We'll now try various visualizations which will help us with looking at our graph from a different perspective. Networkx
has an in-built method to visualize data called draw()
which can be used to visualize networks. Apart from that, we'll be using nxviz
library to visualize other network plots like arc plot, circus plot, and matrix plots. We'll also use the hiveplot
library to visualize hive plots as well.
nx.draw(directed_G, with_labels=True, font_weight='bold', node_color="lime", font_color="red")
The draw()
method available with netowrkx is non-deterministic and gives different representations each time method is called. If there are too many points then plotting them through draw()
does not give good representation as well as insight.
It has other methods like draw_circular()
and draw_shell()
which represents nodes in a circle.
nx.draw_circular(directed_G, with_labels=True, font_weight='bold', node_color="lime", font_color="red")
nx.draw_shell(directed_G, with_labels=True, font_weight='bold', node_color="lime", font_color="red")
We can plot matrix plot available from nxviz which draws a matrix where each row and column is nodes and entry of the matrix is whether there is an edge between them or not.
from nxviz import MatrixPlot
m = MatrixPlot(directed_G)
m.draw()
Arc plot lays down each node of the graph according to groups and then draws arc between nodes if there exists an edge between them. It can be useful to analyze the relationship between nodes. It can let us highlight if some nodes are highly related to other nodes. We'll be using the method available from nxviz
for plotting arc plot.
from nxviz import ArcPlot
node_labels = [node[1]["Gender"] for node in directed_G.nodes(data=True)]
a = ArcPlot(directed_G, node_color='Gender', node_grouping='Gender')
a.draw()
plt.tight_layout()
plt. autoscale()
Circos plot is exactly like arc plot but nodes are drawn in a circle instead of on a single line. We'll be using the method available from nxviz
to plot the circos plot.
from nxviz import CircosPlot
c = CircosPlot(directed_G, node_color='Gender', node_grouping='Gender')
c.draw()
Hive plots are used to highlight groups within a network and their relationships. It can help understand the relationship between groups as well as within-group itself. Below we have divided nodes into two groups (male and female). We also have created color encodings for nodes of each group.
from hiveplot import HivePlot
nodes = {}
nodes['male'] = [n for n,d in directed_G.nodes(data=True) if d['Gender'] == 'male']
nodes['female'] = [n for n,d in directed_G.nodes(data=True) if d['Gender'] == 'female']
edges = {}
edges['group'] = directed_G.edges(data=True)
nodes_cmap = {}
nodes_cmap['male'] = 'dodgerblue'
nodes_cmap['female'] = 'tomato'
edges_cmap = {}
edges_cmap['group'] = 'lime'
h = HivePlot(nodes=nodes, edges=edges, node_colormap=nodes_cmap, edge_colormap=edges_cmap)
h.draw()
Node importance represents how connected a node is to other networks. A node that is connected to many other nodes is generally considered an important node. Let’s take an example of air flights where nodes with many flights will be considered very busy airports and a company can think about moving some flights from there to lessen some burden of these airports.
We can interpret the importance of node by counting its neighbors
or using degree centrality
. We'll try to explain these concepts with examples below.
We'll be using Social Patterns dataset available from konect. This dataset has information about the interaction of people during the exhibition INFECTIOUS: STAY AWAY in 2009 organized at the Science Gallery in Dublin. Here nodes represent individual visitors and edges represents if two individual were facing each other for more than 20 seconds. There can be more than one edges between individuals if there exists contact more than once. The graph created out of this dataset will be undirected graph
as when one person is facing another then another one will be facing first as well.
df = pd.read_csv('datasets/sociopatterns-infectious/out.sociopatterns-infectious', sep=' ', skiprows=2, header=None)
df = df[[0, 1, 2]]
df.columns = ['person1', 'person2', 'weight']
df.head()
undirected_G = nx.Graph()
## Adding edges to graph as well as updating its weight if interaction is more than once
for idx in df.index:
p1 = df.loc[idx]['person1']
p2 = df.loc[idx]['person2']
if undirected_G.has_edge(p1, p2):
undirected_G.edges[p1, p2]['weight'] += 1
else:
undirected_G.add_edge(p1, p2, weight=1)
## Adding node number as attribute to node by sorting them.
for n in sorted(undirected_G.nodes()):
undirected_G.node[n]['order'] = float(n)
list(undirected_G.nodes(data=True))[:5]
list(undirected_G.edges(data=True))[:5]
c = CircosPlot(undirected_G, node_order='order', node_color='order', figsize=(7,7))
c.draw()
The first approach to finding out the importance of any node in a network is by counting its number of nodes. For the directed graph, it'll be a number of nodes pointing to that node whereas, for undirected graph, it'll be a number of edges connecting to it.
Let's count the number of neighbors of an above undirected graph. We'll be making use of a method named neighbors()
available through graph object. We'll pass it node and it'll return an iterator with neighbors of that node. We'll then convert mapping from node to a number of its neighbor to pandas series. We'll then sort it to get the node with the highest neighbors.
node_to_neighbors_mapping = [(node, len(list(undirected_G.neighbors(node)))) for node in undirected_G.nodes()]
node_to_neighbors_mapping[:5]
node_to_neighbors_ser = pd.Series(data=dict(node_to_neighbors_mapping))
node_to_neighbors_ser.sort_values(ascending=False).head()
We can even use degree()
method available with networkx. We can pass it graph object and it'll return an iterator with node and count of its neighbors. We can then sort it according to a number of neighbors to get nodes with the highest neighbors.
sorted(nx.degree(undirected_G), key = lambda x: x[1], reverse=True)[:5]
The degree centrality is another measure for finding the importance of a node in a network. Networkx provides a method named degree_centrality()
which can be used to find out-degree centrality for each node. It returns a dictionary of the node to its degree centrality mapping. We can then sort it to find out nodes with high centrality.
sorted(nx.degree_centrality(undirected_G).items(), key=lambda x : x[1], reverse=True)[:5]
The betweenness centrality refers to a number of shortest paths that pass through that node. This node as important as removing such nodes can sometimes break a network into more networks as well.
sorted(nx.betweenness_centrality(undirected_G, normalized=False).items(), key=lambda x : x[1], reverse=True)[:5]
sorted(nx.betweenness_centrality(undirected_G).items(), key=lambda x : x[1], reverse=True)[:5]
The load centrality refers to a fraction of the shortest paths that pass through that node.
sorted(nx.load_centrality(undirected_G).items(), key=lambda x : x[1], reverse=True)[:5]
There are many other types of centrality related to a node as well which can be explored further from this link.
Traversing through a network and finding a path between nodes is an important exercise. We need to find the shortest path between two nodes of a graph in many situations. The graph traversal helps in understanding the structure of the graph and helps to find a route between nodes of the graph.
We can use graph traversal algorithms like breadth-first search
and depth-first search
to find paths between all nodes of the network. It can let us know whether there exists a path between two nodes of a graph.
Networkx
has a method named has_path()
which can be used to check whether there exists as a path between two nodes or not.
nx.has_path(directed_G, 1, 10)
nx.has_path(directed_G, 1, 31)
nx.has_path(undirected_G, 1, 10)
We can even use depth-first search
and breadth-first search modules
available in networkx to create graphs from the output of that algorithms and then plot it. It'll help us understand how both algorithms traversed our graph.
nx.draw_circular(nx.breadth_first_search.bfs_tree(directed_G, 1), with_labels=True, node_color="lime", font_color="red")
nx.draw_circular(nx.depth_first_search.dfs_tree(directed_G, 1), with_labels=True, node_color="lime", font_color="red")
Finding the shortest path requires many applications of network like finding the shortest route to a particular destination in car, the shortest path for a packet to travel in network, etc. Networkx provides a list of methods to find the shortest path between nodes of the graph. It also lets us choose between Dijkstra
and bellman-ford
algorithms when finding the shortest path.
Below we are trying various methods available in networkx to find the shortest paths between nodes.
nx.shortest_path(directed_G, source=1, target=22)
nx.shortest_path(directed_G, source=1, target=22, method="bellman-ford")
nx.bellman_ford_path(directed_G, source=1, target=22)
nx.dijkstra_path(directed_G, source=1, target=22)
nx.shortest_path_length(directed_G, source=1, target=22)
print(list(nx.shortest_paths.all_pairs_dijkstra(directed_G))[0])
print(list(nx.shortest_paths.all_pairs_dijkstra_path_length(directed_G))[0])
print(list(nx.shortest_paths.all_pairs_bellman_ford_path(directed_G))[0])
print(list(nx.shortest_paths.all_pairs_bellman_ford_path_length(directed_G))[0])
print(list(nx.shortest_paths.all_pairs_shortest_path(directed_G))[0])
print(list(nx.shortest_paths.all_pairs_shortest_path_length(directed_G))[0])
Networkx provides another method named subgraph()
which can be called on graph object by passing a list of nodes. It'll then return another graph which consists of that nodes and edges between those nodes. We'll use this method to create a graph of the shortest path and visualize it.
shortest_path = nx.shortest_path(undirected_G, 1, 400)
print(shortest_path)
subgraph = undirected_G.subgraph(shortest_path)
nx.draw(subgraph, with_labels=True, node_color="lime", font_color="red")
shortest_path = nx.shortest_path(directed_G, 1, 25)
print(shortest_path)
subgraph = directed_G.subgraph(shortest_path)
nx.draw(subgraph, with_labels=True, node_color="lime", font_color="red")
This ends our small tutorial on basic graph analysis. We tried to cover below-mentioned points:
networkx
.Please feel free to let us know your views in the comments section.
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.
If you want to