Altair is a Python visualization library based on Vega and Vega-Lite. Vega and Vega-Lite are declarative visualization grammars: you specify the properties of a graph as JSON, and it is rendered using Canvas or SVG. Because Altair is built on top of these libraries, it offers almost the same functionality from Python. Altair's API is simple and easy to use, which lets the developer spend more time on data analysis than on getting visualizations right. In this tutorial, we'll explain basic plotting using Altair.
We'll first import all the necessary libraries to get started.
import altair as alt
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
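We'll be using 3 datasets while explaining how to plot various charts using Altair: the wine dataset that ships with scikit-learn, the Apple OHLC dataset, and the Starbucks store locations dataset.
We suggest that you download the Apple OHLC dataset from Yahoo Finance and the Starbucks store locations dataset from Kaggle to follow along with the tutorial.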
wine = load_wine()
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df["Category"] = ["Category_%d"%(cat+1) for cat in wine.target]
wine_df.head()
apple_df = pd.read_csv("datasets/AAPL.csv")
apple_df["Date"] = pd.to_datetime(apple_df["Date"])
apple_df = apple_df.set_index("Date")
apple_df.head()
starbucks_locations = pd.read_csv("datasets/starbucks_store_locations.csv")
starbucks_locations.head()
Generating a chart with Altair generally follows the steps below.
1. Create a Chart object, passing the dataframe to it.
2. Call a marker method (mark_point(), mark_bar(), etc.) on the chart object to select the chart type that will be plotted.
3. Call the encode() method on the output of the 2nd step, passing it various plot properties. In this step, we specify which column of the dataset is used for what purpose. E.g., x='alcohol' puts the alcohol values on the x-axis.
The first chart type that we'll plot using Altair is a scatter plot. Below we plot a scatter plot showing the relation between the alcohol and malic_acid properties of the wine dataset. This is the simplest way to generate a plot using Altair.
alt.Chart(wine_df).mark_point().encode(x="alcohol", y="malic_acid")
Below we again plot a scatter chart of alcohol against malic_acid, but this time we have also color-encoded the points by wine category.
This time we have built the X and Y axes from Altair's X and Y objects, which let us modify axis properties; here we have changed the default axis titles. Altair generally starts quantitative x and y axes at 0, which we override by passing Scale(zero=False). We have also introduced the tooltip property, which accepts a list of dataset columns whose values are displayed when the mouse hovers over a point of the scatter plot.
We have also used the properties() method, which lets us modify the plot size (height and width) and title.
Finally, we have called the interactive() method, which converts the static plot into an interactive one that can be panned and zoomed.
alt.Chart(wine_df).mark_circle(
size=100
).encode(
alt.X("alcohol", title="Alcohol", scale=alt.Scale(zero=False)),
alt.Y("malic_acid", title="Malic Acid", scale=alt.Scale(zero=False)),
color="Category",
tooltip=["alcohol", "malic_acid"]
).properties(
height=300,
width=300,
title="Alcohol vs Malic Acid Color-encoded by Wine Category").interactive()
Below we generate another scatter plot that is almost the same as the previous one, but uses different marker shapes for the different wine categories. We have used the shape attribute for this purpose, which accepts the name of a dataframe column with categorical data.
alt.Chart(wine_df).mark_point(
size=50
).encode(
alt.X("alcohol", title="Alcohol", scale=alt.Scale(zero=False)),
alt.Y("malic_acid", title="Malic Acid", scale=alt.Scale(zero=False)),
color="Category",
shape="Category",
tooltip=["alcohol", "malic_acid"]
).properties(
height=300,
width=300,
title="Alcohol vs Malic Acid Color-encoded by Wine Category").interactive()
The second chart type that we'll introduce is the bar chart.
We first create a dataframe with the average of each wine dataframe column per wine category, as it'll be used by several of the following charts.
avg_wine_df = wine_df.groupby(by="Category").mean().reset_index()
avg_wine_df
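To see what the groupby step produces, here is a small sketch on a made-up dataframe (`toy_df` and its values are hypothetical). It shows that the result has one row per category and that reset_index() turns the category back into an ordinary column, which is what lets us refer to it by name in encode().

```python
import pandas as pd

# Hypothetical miniature version of the wine dataframe.
toy_df = pd.DataFrame({
    "Category": ["A", "A", "B"],
    "alcohol": [12.0, 14.0, 13.0],
    "proline": [500.0, 700.0, 900.0],
})

# Average every numeric column within each category, then move the
# group label from the index back into a regular column.
avg_toy_df = toy_df.groupby(by="Category").mean().reset_index()
print(avg_toy_df)
```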
Below we have created our first bar chart using the mark_bar() method, encoding the x-axis as the wine category and the y-axis as the average malic acid. We have also set the chart width, height, and title as usual.
alt.Chart(avg_wine_df).mark_bar(
color='tomato'
).encode(
x = 'Category', y = 'malic_acid'
).properties(
width=300, height=300,
title="Avg Malic Acid per Wine Category"
)
Below we have created another bar chart, which shows the average proline per wine category. This time we have swapped the X and Y axes to make the bar chart horizontal.
alt.Chart(avg_wine_df).mark_bar(
color='dodgerblue'
).encode(
x = 'proline', y = 'Category'
).properties(
width=300, height=300,
title="Avg Proline per Wine Category"
)
The third chart type that we'll introduce is the histogram. We have again used the mark_bar() method, which draws bar charts. We have passed proline as the x-axis column with the bin attribute set to True to tell Altair to bin the values of this column. We have also passed count() as the y-axis value, which counts the number of records that fall into each bin.
alt.Chart(wine_df).mark_bar(
color='lawngreen'
).encode(
x =alt.X('proline', bin=True, title="Proline"),
y="count()"
).properties(
width=300,
height=300,
title="Proline Histogram")
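Conceptually, bin=True together with count() buckets the values and counts the rows in each bucket. The sketch below does the same thing with NumPy on made-up numbers. It is only an analogy: Altair picks its own "nice" bin boundaries, so the edges it chooses will generally differ from the equal-width edges NumPy computes here.

```python
import numpy as np

# Hypothetical values standing in for a numeric column.
values = np.array([1, 2, 2, 3, 8, 9])

# Split the value range into 3 equal-width bins and count entries per bin.
counts, edges = np.histogram(values, bins=3)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:.2f}, {right:.2f}): {n}")
```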
The fourth chart type that we would like to introduce is the line chart.
We are using mark_line() to plot a line chart showing the close price of Apple stock from April-2019 to March-2020.
alt.Chart(apple_df.reset_index()).mark_line(
color='red'
).encode(
x = 'Date:T', y = alt.Y('Close:Q', scale=alt.Scale(zero=False))
).properties(
width=500,
height=300,
title="Apple Close Price from May-2019 to Mar-2020")
When we created the line chart above, we specified the data type of each column with a one-character code after the column name, separated by a colon (Date:T, Close:Q). This hints to Altair that the Date column should be treated as a datetime column and that the Close column holds quantitative data. We can specify the column type explicitly like this when Altair fails to infer the correct type.
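Below we have listed the commonly used data type characters in Altair:
Q - quantitative (continuous numeric values)
O - ordinal (discrete, ordered values)
N - nominal (discrete, unordered categories)
T - temporal (date or time values)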
The fifth chart type that we'll introduce is the area chart. We can plot an area chart using the mark_area() method of Altair. We are highlighting the area below the close price of Apple stock from April-2019 till March-2020.
alt.Chart(apple_df.reset_index()).mark_area(
color='green'
).encode(
x = 'Date:T', y = alt.Y('Close:Q', scale=alt.Scale(zero=False))
).properties(
width=300,
height=300,
title="Apple Close Price from May-2019 to Mar-2020")
The sixth chart type we would like to introduce is the box plot. We plot a box plot exploring the distribution of alcohol per wine category using the mark_boxplot() method.
alt.Chart(wine_df).mark_boxplot(color="tomato").encode(
x=alt.X('Category:N'),
y=alt.Y('alcohol:Q', scale=alt.Scale(zero=False))
).properties(
width=300,
height=300,
title="Distribution of Alcohol per Wine Category")
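A box plot summarizes each category by its quartiles. As a rough sketch of the numbers behind the boxes, the snippet below computes the 25th, 50th, and 75th percentiles per category with pandas on made-up data (`toy_df` is hypothetical). Whisker and outlier rules are decided by the plotting library and are not reproduced here.

```python
import pandas as pd

# Hypothetical data: five values per category.
toy_df = pd.DataFrame({
    "Category": ["A"] * 5 + ["B"] * 5,
    "alcohol": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# One row per category, one column per quantile (0.25, 0.5, 0.75).
quartiles = (
    toy_df.groupby("Category")["alcohol"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```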
The seventh chart type that we'll introduce is the scatter matrix. We explore the relationships between three columns (alcohol, malic_acid, and proline).
We have used the repeat() method, which accepts lists of row and column names to repeat the chart over. It works like a loop inside a loop, exploring the relationship between all possible combinations of columns. We have also color-encoded the scatter plots by wine category.
alt.Chart(wine_df).mark_circle().encode(
alt.X(alt.repeat("column"), type='quantitative', scale=alt.Scale(zero=False)),
alt.Y(alt.repeat("row"), type='quantitative', scale=alt.Scale(zero=False)),
color='Category:N'
).properties(
width=150,
height=150,
).repeat(
row=['alcohol', 'malic_acid', 'proline'],
column=['alcohol', 'malic_acid', 'proline']
).properties(
title="ScatterMatrix of 'alcohol', 'malic_acid', 'proline'"
).interactive()
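The "loop inside a loop" behavior of repeat() can be sketched in plain Python. The snippet below only enumerates which column pair each panel of the grid shows; it does not draw anything, and the variable names are chosen for illustration.

```python
from itertools import product

rows = ["alcohol", "malic_acid", "proline"]      # fields used for the y-axis
columns = ["alcohol", "malic_acid", "proline"]   # fields used for the x-axis

# Outer loop over rows, inner loop over columns: one panel per pair.
panels = [(y_field, x_field) for y_field, x_field in product(rows, columns)]

for y_field, x_field in panels:
    print(f"y={y_field:<11} x={x_field}")
```

Three rows and three columns give nine panels, including the diagonal ones where a column is plotted against itself.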
The eighth chart type that we'll introduce is the candlestick chart. We plot a candlestick chart of Apple stock prices for the month of March-2020.
The candlestick chart is built in 3 steps. First, we create a base chart with the x-axis and the color encoding. We then extend the base chart into a rule chart based on the Low and High columns. Next, we extend the base chart into a bar chart based on the Open and Close columns. Finally, we layer the rule and bar charts to create the candlestick chart.
apple_mar_2020 = apple_df.loc["2020-03"].reset_index()
open_close_color = alt.condition("datum.Open <= datum.Close",
alt.value("lawngreen"),
alt.value("tomato"))
base = alt.Chart(apple_mar_2020).encode(
alt.X('Date:T',
axis=alt.Axis(
format='%m/%d',
labelAngle=-45,
title='Date in 2020'
)
),
color=open_close_color,
)
rule = base.mark_rule().encode(
alt.Y('Low:Q', title='Price',scale=alt.Scale(zero=False)),
alt.Y2('High:Q')
)
bar = base.mark_bar().encode(
alt.Y('Open:Q'),
alt.Y2('Close:Q')
).properties(
width=500,
height=300,
title="Apple Candlestick Chart for March 2020")
rule + bar
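The color condition used above paints a candle green when the stock closed at or above its opening price and red otherwise. As a sketch, the same rule can be checked row by row with NumPy on a made-up dataframe (`toy_ohlc` and its prices are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical open/close prices for three trading days.
toy_ohlc = pd.DataFrame({
    "Open": [100.0, 105.0, 103.0],
    "Close": [104.0, 101.0, 103.0],
})

# Same rule as the Altair condition "datum.Open <= datum.Close".
toy_ohlc["color"] = np.where(
    toy_ohlc["Open"] <= toy_ohlc["Close"], "lawngreen", "tomato"
)
print(toy_ohlc)
```

Note that a day with equal open and close prices falls on the green side because the comparison uses `<=`.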
The last chart type that we would like to introduce is the scatter map. We'll use the Starbucks store locations dataset for this purpose. We'll also need the vega_datasets library installed, as it provides world map data.
Below we create a world map without any markers on top of it. We first create a data source using the topo_feature() method, passing it the URL from which the world map data is downloaded along with the name of the feature to extract, here the country borders.
We then use this data source to plot the world map with the mark_geoshape() method. The stroke property passed to mark_geoshape() sets the color of the country borders.
from vega_datasets import data
source = alt.topo_feature(data.world_110m.url, 'countries')
background = alt.Chart(source).mark_geoshape(
fill='lightgray',
stroke='white'
).properties(
width=500,
height=300
).project('naturalEarth1')
background
Below we first prepare the dataset for the scatter map. We group the original dataset by state to get the count of stores per state. We then create another dataframe that holds the average latitude and longitude of the stores in each state. We join both dataframes to create the final dataframe, which has the Starbucks store count per state as well as a latitude and longitude for each state. We'll use this information to plot the scatter map.
mean_long_lat = starbucks_locations.groupby(by="State/Province")[["Longitude", "Latitude"]].mean()
count_per_state = starbucks_locations.groupby(by="State/Province").count()[["Store Number"]].rename(columns={"Store Number":"Count"})
count_per_state = count_per_state.join(mean_long_lat).reset_index()
count_per_state.head()
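The count-and-average preparation can be tried out on a few made-up rows. In the sketch below, `toy_stores` and its values are hypothetical; only the column names match the Starbucks dataset used in this tutorial.

```python
import pandas as pd

# Hypothetical store rows: two in "CA", one in "NY".
toy_stores = pd.DataFrame({
    "State/Province": ["CA", "CA", "NY"],
    "Store Number": ["s1", "s2", "s3"],
    "Longitude": [-120.0, -118.0, -74.0],
    "Latitude": [36.0, 34.0, 41.0],
})

grouped = toy_stores.groupby(by="State/Province")

# Average position of the stores in each state.
mean_long_lat = grouped[["Longitude", "Latitude"]].mean()

# Number of stores in each state.
counts = grouped[["Store Number"]].count().rename(columns={"Store Number": "Count"})

# Both frames are indexed by state, so join() lines them up row by row.
summary = counts.join(mean_long_lat).reset_index()
print(summary)
```

Selecting the numeric columns before calling mean() keeps text columns such as the store number out of the average.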
Below we create a scatter plot that places one circle per state at its average longitude and latitude. We use the store count column of the dataset to set the size of each marker. We then layer this scatter plot on top of the world map created earlier to create the scatter map.
We can easily see from the scatter map that California has the highest number of Starbucks stores of any state, at more than 2.5k.
points = alt.Chart(count_per_state).mark_circle(
color="tomato"
).encode(
longitude="Longitude:Q", latitude="Latitude:Q", size="Count:Q",
tooltip = ["State/Province", "Count"]
).interactive()
background + points
This ends our small tutorial introducing the basic Altair API for plotting basic charts. Please feel free to let us know your views in the comments section.