Seaborn - Using relplot() API to Understand Statistical Relations between Multiple Data Variables

Introduction

Python has many libraries for data visualization, each offering a different set of features. Seaborn is one of the best known and is preferred by many data scientists and visualization experts for statistical plots. It was designed with the statistical analysis of data and its visualization in mind. Seaborn is built on top of matplotlib and frees the developer from coding many of the mundane details that matplotlib requires, so we can concentrate on the analysis rather than on getting the visualization right. Seaborn also integrates closely with pandas, the Python library that many developers prefer for working with structured data.

As a part of this tutorial, we'll explore seaborn's functionality for understanding the statistical relationship between multiple variables. We'll use several example datasets available through seaborn's load_dataset() function, which downloads them on first use and therefore needs an internet connection.

So without further delay, let’s get started with coding to understand seaborn usage better.

We'll start by importing necessary libraries.

In [1]:
import seaborn as sns

import pandas as pd
import numpy as np

1. Load Dataset

The first dataset that we'll load is the dots dataset available from seaborn. We'll load it and print its size and first few rows to better understand its contents.

In [2]:
dots = sns.load_dataset("dots")
print("Dataset Size : ", dots.shape)
dots.head()
Dataset Size :  (848, 5)
Out[2]:
  align choice  time  coherence  firing_rate
0  dots     T1   -80        0.0    33.189967
1  dots     T1   -80        3.2    31.691726
2  dots     T1   -80        6.4    34.279840
3  dots     T1   -80       12.8    32.631874
4  dots     T1   -80       25.6    35.060487

Our second dataset is the well-known auto mpg dataset, which has information about car models produced over the years by various manufacturers. We'll load it and display its size and first few rows to check its contents.

In [3]:
auto_mpg = sns.load_dataset("mpg")
print("Dataset Size : ", auto_mpg.shape)
auto_mpg.head()
Dataset Size :  (398, 9)
Out[3]:
    mpg  cylinders  displacement  horsepower  weight  acceleration  model_year origin                       name
0  18.0          8         307.0       130.0    3504          12.0          70    usa  chevrolet chevelle malibu
1  15.0          8         350.0       165.0    3693          11.5          70    usa          buick skylark 320
2  18.0          8         318.0       150.0    3436          11.0          70    usa         plymouth satellite
3  16.0          8         304.0       150.0    3433          12.0          70    usa              amc rebel sst
4  17.0          8         302.0       140.0    3449          10.5          70    usa                ford torino

The third dataset that we'll load is Apple OHLC (open, high, low, close) price data downloaded from Yahoo Finance. To follow this step, we suggest that you download it as a CSV file from the Yahoo Finance website as well.

In [4]:
apple_df = pd.read_csv("datasets/AAPL.csv")
apple_df["Date"] = pd.to_datetime(apple_df.Date)
apple_df = apple_df.set_index("Date")
print("Dataset Size : ", apple_df.shape)
apple_df.head()
Dataset Size :  (252, 6)
Out[4]:
                  Open        High         Low       Close   Adj Close    Volume
Date
2019-04-05  196.449997  197.100006  195.929993  197.000000  194.454758  18526600
2019-04-08  196.419998  200.229996  196.339996  200.100006  197.514709  25881700
2019-04-09  200.320007  202.850006  199.229996  199.500000  196.922470  35768200
2019-04-10  198.679993  200.740005  198.179993  200.619995  198.027985  21695300
2019-04-11  200.850006  201.000000  198.440002  198.949997  196.379578  20900800
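
As a side note, the date parsing and indexing steps can also be folded into the read_csv() call itself. The sketch below is only an illustration: instead of the real datasets/AAPL.csv file, it reads an in-memory CSV built from two rows of the output shown above, and the variable names csv_text and sample are ours.

```python
import io

import pandas as pd

# Two rows copied from the output above, standing in for datasets/AAPL.csv.
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2019-04-05,196.449997,197.100006,195.929993,197.000000,194.454758,18526600
2019-04-08,196.419998,200.229996,196.339996,200.100006,197.514709,25881700
"""

# parse_dates converts the Date column to datetimes and index_col makes it
# the index, which replaces the separate to_datetime() and set_index() calls.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

print("Dataset Size : ", sample.shape)
print(type(sample.index).__name__)
```

With the real file, the same keyword arguments would be passed along with the file path instead of the StringIO object.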

2. Visualizing Relations as Scatter Plots

We'll be using seaborn's relplot() function to visualize the relationship between multiple variables as either a scatter plot or a line plot. Seaborn also provides separate functions named scatterplot() for scatter plots and lineplot() for line plots. We'll use relplot() in most of our examples because it provides a very easy-to-use interface for plotting relationships and exploring data further.

Below is a list of important parameters of the relplot() function which we'll be exploring further:

  • x - The variable shown on the x-axis. We need to pass the column name from the pandas dataframe.
  • y - The variable shown on the y-axis. We need to pass the column name from the pandas dataframe.
  • kind - Either scatter for a scatter plot or line for a line plot. The default value is scatter.
  • data - The pandas dataframe containing the data.
  • hue - The column whose values decide the color of points in scatter plots and of lines in line plots. It is typically categorical, although numeric columns are accepted as well. We need to pass the column name from the pandas dataframe.
  • style - The categorical column whose values decide the marker (+, ^, o, etc.) of points in scatter plots and the dash pattern of lines in line plots. We need to pass the column name from the pandas dataframe.
  • size - The column whose values decide the size of points in scatter plots and the width of lines in line plots. It can be categorical or numeric. We need to pass the column name from the pandas dataframe.
  • sizes - The range within which sizes are scaled, commonly given as a (min, max) tuple.
  • palette - The color scheme used to color the different points/lines of data.
  • alpha - The opacity of the points/lines of the chart, between 0 and 1. Lower values make them more transparent and higher values make them more opaque.
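
To make the parameter list more concrete, here is a minimal sketch that uses most of these parameters in a single call. It runs on a small made-up dataframe rather than on the tutorial's datasets, so the column names (x, y, group, weight) and the values are purely illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for a real dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.normal(size=60),
    "y": rng.normal(size=60),
    "group": rng.choice(["a", "b", "c"], size=60),  # categorical column for hue and style
    "weight": rng.integers(1, 5, size=60),          # numeric column for size
})

grid = sns.relplot(
    x="x", y="y", data=toy,
    kind="scatter",                  # the default, spelled out for clarity
    hue="group", style="group",      # color and marker per category
    size="weight", sizes=(30, 150),  # point sizes scaled into this range
    palette="Set1", alpha=0.6,       # color scheme and opacity
)
```

The returned object is a seaborn FacetGrid, which can be used afterwards to adjust titles, axis labels, or the figure size.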

We'll now draw various scatter plots that demonstrate the usage of the above-mentioned parameters of the relplot() function.

2.1 Mpg vs Displacement Scatter Plot

In [ ]:
sns.relplot(x="mpg", y="displacement", data=auto_mpg);

Below we draw the same plot as above using the scatterplot() function instead of relplot(), as both have almost the same API.

In [ ]:
sns.scatterplot(x="mpg", y="displacement", data=auto_mpg);
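
Although the two calls look almost identical, they differ in what they return: relplot() is a figure-level function that creates its own figure and returns a FacetGrid, while scatterplot() draws onto a matplotlib axes and returns it. The sketch below shows this on a tiny made-up dataframe (the toy data and variable names are ours).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, just enough to draw something.
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [4, 3, 2, 1]})

# Figure-level: relplot() builds a new figure and returns a FacetGrid.
grid = sns.relplot(x="x", y="y", data=toy)

# Axes-level: scatterplot() draws on the axes we hand it and returns that axes.
fig, ax = plt.subplots()
returned_ax = sns.scatterplot(x="x", y="y", data=toy, ax=ax)

print(type(grid).__name__)
print(returned_ax is ax)
```

This is why scatterplot() fits more naturally when composing a custom matplotlib layout, whereas relplot() is convenient when seaborn should manage the whole figure.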

2.2 Mpg vs Displacement Scatter Plot Color-Encoded by Origin.

We can also change the plot style in seaborn using the set() function, passing it the name of one of the available styles.

Below is a list of styles available with seaborn:

  • dark
  • darkgrid
  • white
  • whitegrid
  • ticks
In [ ]:
# white, dark, whitegrid, darkgrid, ticks
sns.set(style="dark")
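
To get a feel for what each style name changes, we can inspect the settings behind it with seaborn's axes_style() function, which returns a dictionary of matplotlib parameters for a given style. The short sketch below prints only two of those parameters; which ones to look at is our own choice.

```python
import seaborn as sns

for name in ["dark", "darkgrid", "white", "whitegrid", "ticks"]:
    style = sns.axes_style(name)  # dict of matplotlib rc parameters for this style
    print(name, "| grid:", style["axes.grid"], "| facecolor:", style["axes.facecolor"])
```

The same function can also be used as a context manager (with sns.axes_style("whitegrid"): ...) when a style should apply to a single plot only.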

We can see from the plot below that a different color is used for each origin of the cars.

In [ ]:
sns.relplot(x="mpg", y="displacement", hue="origin", data=auto_mpg);

In [ ]:
sns.scatterplot(x="mpg", y="displacement", hue="origin", data=auto_mpg);

2.3 Mpg vs Displacement Scatter Plot Color-encoded by Origin and Marker-encoded by Cylinders.

Below we draw the mpg vs displacement scatter plot with a different color for each origin category. We also bring in a fourth variable, cylinders, whose values decide the marker style of each point according to the number of cylinders in the car.

In [ ]:
sns.relplot(x="mpg", y="displacement", hue="origin", style="cylinders", data=auto_mpg);

In [ ]:
sns.scatterplot(x="mpg", y="displacement", hue="origin", style="cylinders", data=auto_mpg);

2.4 Mpg vs Displacement Scatter Plot Color-encoded by Cylinders [Different Palette for Colors]

In [ ]:
sns.relplot(x="mpg", y="displacement", hue="cylinders", palette="Oranges", data=auto_mpg);
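
A palette name such as Oranges can also be inspected on its own with seaborn's color_palette() function, which returns the colors as RGB tuples. The sketch below samples five colors; the number 5 is only an illustrative choice and is not something relplot() requires.

```python
import seaborn as sns

oranges = sns.color_palette("Oranges", 5)  # five RGB tuples sampled from the palette

for rgb in oranges:
    # Each color is a tuple of three floats between 0 and 1.
    print(tuple(round(channel, 2) for channel in rgb))
```

In a notebook, displaying the returned palette object (or calling sns.palplot(oranges)) shows the colors as swatches, which is a quick way to compare palettes before choosing one.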

2.5 Mpg vs Displacement Scatter Plot Color-encoded by Origin and Size-encoded by Cylinders [Dark2 Palette for Colors]

Below we have used the cylinders variable to give the points of the scatter plot different sizes.

In [ ]:
sns.relplot(x="mpg", y="displacement",
            hue="origin", size="cylinders",
            palette="Dark2",
            data=auto_mpg);

2.6 Mpg vs Displacement Scatter Plot Color-encoded by Origin and Size-encoded by Cylinders [Different Size Settings for Points]

In [ ]:
sns.relplot(x="mpg", y="displacement",
            hue="origin", size="cylinders", sizes=(50,200),
            palette="Set1", alpha=0.5,
            data=auto_mpg);

3. Visualizing Relations as Line Plots

We'll now use the relplot() API to plot various line charts. We need to pass kind="line" for it to draw a line plot instead of a scatter plot, because the kind parameter defaults to scatter.

We'll first set our plot styling to darkgrid which is the same as dark style but with grids present.

In [ ]:
sns.set(style="darkgrid")

3.1 Time vs Coherence Line Plot

Below we draw a time vs coherence line plot. Because the dataset has several observations for each time value, seaborn aggregates them: by default, the line shows the mean at each x value, and the lighter shaded band around it shows the 95% confidence interval for that mean.

In [ ]:
sns.relplot(x= "time", y="coherence", kind="line", data=dots);

Below we draw the same plot as above using the lineplot() function instead of relplot().

In [ ]:
sns.lineplot(x= "time", y="coherence", data=dots);
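
The aggregation behind the central line can be reproduced by hand with pandas. The sketch below uses a tiny made-up dataframe with three observations per time value (the column names and numbers are ours) and computes the mean per time value, which is the default estimator. It does not reproduce the shaded confidence band.

```python
import pandas as pd

# Hypothetical toy data: three repeated observations at each time value.
toy = pd.DataFrame({
    "time":  [0, 0, 0, 1, 1, 1],
    "value": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
})

# One mean per distinct x value: these are the points the aggregated line connects.
mean_line = toy.groupby("time")["value"].mean()

print(mean_line.to_dict())  # {0: 2.0, 1: 6.0}
```

Six raw observations collapse into two points here, which is why an aggregated line looks much smoother than the raw data.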

3.2 Time vs Coherence Line Plot without Aggregating Line Data

Below we draw the same line without aggregating its data by passing estimator=None. Every observation is then plotted as it is, which is why the line looks wiggly.

In [ ]:
sns.relplot(x= "time", y="coherence", kind="line", estimator=None, data=dots);

In [ ]:
sns.lineplot(x= "time", y="coherence", estimator=None, data=dots);

3.3 Date vs Open Price Line Plot of Apple Stocks.

In [ ]:
line = sns.relplot(x= "Date", y="Open", kind="line", data=apple_df.reset_index());
line.fig.autofmt_xdate()

3.4 Multi-line Plot of Date vs Open, High, Low and Close Prices of Apple Stock.

In [ ]:
line = sns.relplot(kind="line",
            dashes=False,
            aspect=1.77,
            data=apple_df[["Open", "High", "Low", "Close"]]);

line.fig.autofmt_xdate()
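
The call above passes no x or y because the dataframe is in wide form: each column becomes its own line, plotted against the index. For comparison, the sketch below reshapes a small wide dataframe into the long form used by the other examples. It uses two dates and two price columns taken from the output shown earlier, and the names price_type and price are our own.

```python
import pandas as pd

# Small wide-form frame: one column per line, dates as the index.
wide = pd.DataFrame(
    {"Open": [196.449997, 196.419998], "Close": [197.000000, 200.100006]},
    index=pd.to_datetime(["2019-04-05", "2019-04-08"]),
)
wide.index.name = "Date"

# Long form: one row per (date, price type) pair.
long_df = wide.reset_index().melt(
    id_vars="Date", var_name="price_type", value_name="price"
)

print(long_df)
```

With the long-form frame, a roughly equivalent plot would name the columns explicitly, for example x="Date", y="price" and hue="price_type" together with kind="line".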

3.5 Time vs Coherence Line Plot with Different Lines for Choice Variable of Data.

In [ ]:
sns.relplot(x= "time", y="coherence", hue="choice", kind="line", data=dots);

3.6 Time vs Coherence Line Plot with Different Lines for Each Choice Category and Different Line Style per Choice Category.

In [ ]:
sns.relplot(x= "time", y="coherence", hue="choice", style="choice",  kind="line", data=dots);

3.7 Time vs Firing Rate with Different Line for Coherence Categories and Different Line Style for Align Categories.

Below we have used coherence for the line colors and align for the line styles.

In [ ]:
sns.relplot(x= "time", y="firing_rate", hue="coherence", style="align", kind="line", data=dots);

3.8 Time vs Firing Rate with Different Line Size for Coherence Categories and Different Line Style for Align Categories.

Below we have used coherence for the line widths and align for the line styles.

In [ ]:
sns.relplot(x= "time", y="firing_rate", size="coherence", style="align", kind="line", data=dots);

4. Visualizing Relations with Multiple Plots Based on Categorical Variable.

So far we have analyzed the relationship between multiple variables within a single plot. We can add one more facet of exploration by drawing a separate plot for each category of a categorical variable, so that one figure holds several charts, each showing the relationship for one category. We'll explain this below with a few examples where we draw multiple scatter and line charts per figure.

As a part of this exploration, we'll explore a few more parameters of relplot() which we had not discussed above:

  • col - The categorical variable that defines the columns of the figure. The data is split into subsets by the categories of this column, and one plot is drawn per category from its subset.
  • row - The categorical variable that defines the rows of the figure. It has the same interpretation as col. If both col and row are given, the figure contains one plot for each combination of a row category and a col category. We'll clarify this below with an example.
  • col_wrap - An integer giving the maximum number of plots per row; the col plots wrap onto a new row after this many. It cannot be combined with row.
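
The effect of these three parameters on the layout can be checked through the axes attribute of the returned FacetGrid. The sketch below uses a made-up dataframe with a five-category column (panel) and a two-category column (side); both names and the data are ours and only serve to show the resulting grid shapes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data with two categorical columns to facet on.
rng = np.random.default_rng(1)
n = 120
toy = pd.DataFrame({
    "x": rng.normal(size=n),
    "y": rng.normal(size=n),
    "panel": np.tile(["p1", "p2", "p3", "p4", "p5"], n // 5),  # 5 categories
    "side": np.tile(["left", "right"], n // 2),                # 2 categories
})

# col + col_wrap: five plots, wrapped so that at most three sit in a row.
wrapped = sns.relplot(x="x", y="y", col="panel", col_wrap=3, data=toy)

# row + col: one plot per (panel, side) combination.
crossed = sns.relplot(x="x", y="y", row="panel", col="side", data=toy)

print(wrapped.axes.size)   # number of plots in the wrapped figure
print(crossed.axes.shape)  # (rows, columns) of the crossed figure
```

In the wrapped case the axes are stored as a flat sequence of five plots, while in the crossed case they form a 5 x 2 grid.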

4.1 Mpg vs Displacement Scatter Plot Color-encoded by Origin with One Plot per Cylinder Category.

In the figure below, we have one plot per cylinder count. Each plot is an mpg vs displacement scatter plot, color-encoded by origin, with the marker style of the points also decided by cylinders.

In [ ]:
sns.set(style="whitegrid")

sns.relplot(x="mpg", y="displacement",
            hue="origin", style="cylinders",
            col="cylinders",
            sizes=(50,200), palette="Dark2",
            col_wrap=3, data=auto_mpg);

4.2 Mpg vs Displacement Scatter Plot Color-encoded by Origin, Different Markers per Cylinder Categories, Different Marker Size per Cylinder Categories with One Plot per Cylinder Category.

In [ ]:
sns.set(style="darkgrid")

sns.relplot(x="mpg", y="displacement",
            hue="origin", style="cylinders", size="cylinders",
            col="cylinders",
            sizes=(50,200), palette="Dark2",
            col_wrap=3, data=auto_mpg);

4.3 Time vs Firing Rate Line Plot Having Different Color and Line Style Per Choice Categories with One Plot per Align Category.

In [ ]:
sns.set(style="ticks")

sns.relplot(x="time", y="firing_rate",
            hue="choice", style="choice",
            kind="line",
            col="align",
            palette="Set1",
            data=dots);

4.4 Time vs Firing Rate Line Plot Having Different Color and Line Style Per Choice Categories with One Plot per different Align and Choice Category combinations.

The example below demonstrates the usage of the col and row parameters at the same time. The choice column has two values and the align column has two values, resulting in 4 (2x2) charts in the figure.

In [ ]:
sns.set(style="darkgrid")

sns.relplot(x="time", y="firing_rate",
            hue="choice", style="choice",
            dashes=False, markers=True,
            kind="line", linewidth=3, height=4, aspect=1.77,
            col="align", row="choice",
            palette="Dark2",
            data=dots);

This ends our short tutorial on exploring the relplot() API of seaborn to understand the statistical relationship between multiple variables of a dataset. Please feel free to let us know your views in the comments section.


Sunny Solanki