Pandas is the go-to library when for data analysis when working with tabular datasets. It is the best solution available for working with tabular datasets which fit in the main memory of the computer. Pandas provide two data structures for working with data (1. Series 2. DataFrame). Pandas let us give labels to each row by providing index values and name columns of data as well. These labels can then be used to filter pandas’ dataframe/series in a particular dimension. We can use index values to filter rows of a data frame as well as perform operations on them. If we don't provide index values then pandas will internally assign integers from 0 till the end of the data. It creates RangeIndex objects by default for both row labels and column labels. Internally pandas use various kinds of Index objects based on the label's data type to store labels for axes. We can access Index object for rows (axis=0) by calling '.index' property of dataframe and for columns (axis=1) by calling '.columns' property of dataframe. Pandas internally represent labels of both rows and columns using Index objects of various types based on the data type of labels.
The majority of uses case of using Pandas dataframe/series requires a single value per label. We generally have a single label for entry in a particular axis (single label for a particular row / single label for a column). But there can be situations where we want more than one value to be labels of row or column of data. This lets us represent high-dimensional data in our 2D data structure dataframe or 1D data structure series. This kind of indexing is generally referred to as Multi-Level Indexing where we have more than one label value per row/column. We have more than one level in indexing. The values inside a single label can be unrelated or they can be related to each other where the value at a lower level is a sub-value of higher-level value (E.g - In multi-level index,first-level values can be a year, second-level values can be months and third-level values can be the day. Here, lower-level values are sub-values of higher level ). When values in the index at lower levels are sub-values of higher-level values then the index is generally referred to as Hierarchical Index.
As a part of this tutorial, we'll be concentrating on how to create Multi-Level Index using pandas and how to index pandas dataframe if multi-level indexing is present for both rows and columns. Below we have listed important sections of our tutorial to give an overview of the material covered.
Below we have imported pandas and printed the version of them that we have used in this tutorial. We have also imported numpy as we'll be using it to create data for our pandas dataframe.
import pandas as pd
print("Pandas Version : {}".format(pd.__version__))
import numpy as np
In this section, we'll explain how we can create MultiIndex object which is used by pandas to represent an index that has more than one value per label of data. We can use MultiIndex object to represent row labels as well as columns labels. Pandas provide 4 different methods which are available as factory methods from MultiIndex class to create MultiIndex objects.
Below we have created two lists with a list of labels. We have then combined both lists using zip() python method. We'll be creating MultiIndex object using this list of tuples.
index1 = ["A", "B", "C", "D", "E"]
index2 = [1,2,3,4,5]
combined_idx = list(zip(index1, index2))
combined_idx
Below we have created MultiIndex object using from_tuples() method by providing a list of tuples that we created in the previous cell as input. We can also provide names to individual levels of MultiIndex object using names parameter.
combined_idx_obj = pd.MultiIndex.from_tuples(combined_idx, names=["Index1", "Index2"])
combined_idx_obj
Below, we have created MultiIndex object using from_arrays() method by providing two lists we had declared earlier as input. We can notice that the results are exactly the same as from_tuples() method.
combined_idx_obj = pd.MultiIndex.from_arrays([index1, index2], names=["Index1", "Index2"])
combined_idx_obj
Below, we have created a MultiIndex object using from_product() method by providing two lists we had defined earlier. We can notice that from_product() method multiplied all values of the first list with all values of the second list (cartesian product).
combined_idx_obj = pd.MultiIndex.from_product([index1, index2], names=["Index1", "Index2"])
combined_idx_obj
Below, we have created a dataframe using two lists that we had defined earlier. We'll be using this dataframe to create MultiIndex object next.
df = pd.DataFrame(zip(index1,index2), columns=["Index1", "index2"])
df
Below, we have created a MultiIndex object using from_frame() method by providing the dataframe we created in the previous cell.
combined_idx_obj = pd.MultiIndex.from_frame(df)
combined_idx_obj
In this section, we have created a MultiIndex object which represents a hierarchical indexing example. We have created MultiIndex object with 3 levels. We have given three lists as input to from_product() method. The first list is a single value list of the year, the second list is a 2-value list of months and the third list is integers in the range 1-31 representing days.
year = [2021,]
months = [1,2]
days = list(range(1,32))
combined_idx_obj = pd.MultiIndex.from_product([year,months, days], names=["Year", "Month","Day"])
combined_idx_obj[:5]
We can retrieve unique values per level of MultiIndex object using levels property of MultiIndex object.
combined_idx_obj.levels
We can retrieve a list of values per level using get_level_values() method by providing integer or level name as input.
combined_idx_obj.get_level_values(0)
combined_idx_obj.get_level_values("Month")
In this section, we'll create pandas DataFrames which will have either index or column labels or both represented with MultiIndex object. We'll be using various dataframes created in this section, later on, to explain indexing dataframe with multi-level indexing.
np.random.seed(123)
Below we have created aMultiIndex object using from_arrays() method by providing two lists to it. Then in the next cell, we have created a pandas dataframe with random numbers whose index is set as MultiIndex object we created.
index1 = ["A", "B", "C", "D", "E"]
index2 = [1,2,3,4,5]
combined_idx_obj = pd.MultiIndex.from_arrays([index1, index2], names=["Index1", "Index2"])
combined_idx_obj
data= np.random.rand(len(combined_idx_obj), 5)
df1 = pd.DataFrame(data=data, columns=["A1", "A2", "A3", "A4", "A5"], index=combined_idx_obj)
df1
In this section, we have created a hierarchical index using from_product() method by providing three lists to it. We have then created a pandas dataframe whose index is set as MultiIndex object representing hierarchical index.
year = [2021,]
months = [1,2]
days = list(range(1,32))
combined_idx_obj = pd.MultiIndex.from_product([year,months, days], names=["Year", "Month","Day"])
combined_idx_obj[:5]
data = np.random.rand(len(combined_idx_obj), 5)
df2 = pd.DataFrame(data=data, index=combined_idx_obj, columns=["C1", "C2", "C3", "C4", "C5"])
df2
In this section, we have created examples where both row and column headings will be represented using MultiIndex object.
First, we have created hierarchical MultiIndex just like our previous example to represent rows of dataframe. Then, we have created another hierarchical MultiIndex object using from_product() method which will be used to represent column labels.
year = [2021,]
months = [1,2]
days = list(range(1,32))
combined_idx_obj = pd.MultiIndex.from_product([year,months, days], names=["Year", "Month","Day"])
columns_idx = pd.MultiIndex.from_product([["Location1","Location2"],
["Item1", "Item2", "Item3", "Item4", "Item5"]],
names=["Location", "Item"]
)
columns_idx
Below, we have created a pandas dataframe of random numbers whose index and columns are represented using MultiIndex objects.
data = np.random.randint(1, 1000, size=(len(combined_idx_obj), len(columns_idx)))
df3 = pd.DataFrame(data=data, index=combined_idx_obj, columns=columns_idx)
df3
In this section, we have created another example where we are using MultiIndex object to represent labels of rows and columns. We have this time used a list of dates as a second-level index for rows.
locations = ["Location1", "Location2"]
rng = pd.date_range(start="2021-1-1", end="2021-2-28", freq="D")
rows_idx = pd.MultiIndex.from_product([locations,rng], names=["Location", "Date"])
columns_idx = pd.MultiIndex.from_product([["Category1","Category2"],
["Item1", "Item2", "Item3", "Item4", "Item5"]],
names=["Category", "Item"])
columns_idx
data = np.random.randint(1, 1000, size=(len(rows_idx), len(columns_idx)))
df4 = pd.DataFrame(data=data, index=rows_idx, columns=columns_idx)
df4
Below, we have retrieved single level from MultiIndex object using get_level_values() method and we can notice that it's DatetimeIndex object. Internally, MultiIndex object is composed of more than one Index object.
df4.index.get_level_values(1)
In this section, we'll explain how we can perform indexing on a dataframe whose index or column labels are represented using MultiIndex objects. We'll explain various ways of indexing multi-level index in this section. We'll be using '.loc' property of pandas dataframe to perform indexing as it accepts actual index values which can be of any type to perform the indexing. We won't be covering indexing using '.iloc' property which accepts integers for indexing. We'll also explain indexing using xs() function.
We can provide labels of 1st levels of MultiIndex object directly to '.loc' property and it'll retrieve all rows/columns that satisfy that 1st level.
Below we have retrieved rows of pandas dataframe whose 1st level value for MultiIndex object is 'A'.
df1.loc["A"]
Below, we have provided a list of 1st level values to '.loc' property to retrieve rows where these three values are 1st level values of the index.
df1.loc[["A","B","D"]]
Below, we have provided indexing values for both rows and columns to '.loc' property. The values provided for index is the same as our previous example.
df1.loc[["A","B","D"], ["A1","A3","A5"]]
Below, we have created an example to explain how we can give a tuple of values to an index dataframe. Each tuple has two values representing values for both levels of MultiIndex object. This kind of indexing will retrieve rows/columns which exactly match the combinations of values provided.
df1.loc[[("A",1),("B",2),("D",4)]]
Below we have again used only the 1st level value of MultiIndex object to the index dataframe.
df2.loc[2021]
Below, we have provided a two-value tuple to index dataframe which has 3-level index.
df2.loc[(2021,1),]
Below we have provided indexing values for all three levels to index the dataframe.
df2.loc[(2021,1, 1),]
In the below cell, we have explained an example where the second level of the index has date-time values. We can provide date-time values as either string or as date-time objects.
df4.loc[("Location1","2021-1-1"),]
The slice(start,end[,step]) function is normally used to create range of values. It has the same format as integer indexing on numpy arrays but slice() can also accept any data type as input and is used to represent a range of values for that data type based on values present in an index of dataframe/series.
We can call slice() function with just None value and it will represent all values in a particular level of MultiIndex object.
Below we have filtered our dataframe with a 2-level index. We have provided a tuple of two values where the first value is slice(None) and the second value is a list of three integers. This will inform the dataframe to take all entries where second-level values belong to one of the provided three integers. It'll consider all first-level values which have specified second-level values.
df1.loc[(slice(None), [1,2,3]),]
In the below cell, we have filtered our data set at row level using the same logic as our previous example. This time, we have provided filters for columns as well.
row_filter = (slice(None), [1,2,3])
column_filter = ["A1","A3","A5"]
df1.loc[row_filter, column_filter]
In the below cell, we have filtered rows of our dataframe which has a three-level index. For the first level, we have provided a single value. For the second level, we have provided slice(None) which will consider all values based on the values of other levels. For the third level, we have provided five values. This will consider all second-level values where the first-level value is 2021 and the third-level value is one of the five input integers.
df3.loc[(2021, slice(None), [1,2,3,4,5]),]
Below, we have created another example demonstrating usage of slice() function.
df3.loc[(2021, [2,], slice(None)),]
Below, we have created an example, where we are indexing a dataframe whose both rows and columns labels are MultiIndex objects. For filtering rows, we have used the same tuple of three values that we had used in one of our earlier examples. For filtering columns, we have used a tuple of two values where the first value is a single string and the second value is a list of three strings.
df3.loc[(2021, slice(None), [1,2,3,4,5]), (["Location1"], ["Item1","Item2", "Item3"])]
In the below cell, we have tried to reproduce our example from the previous cell but this time instead of providing 5 integers for the last level of filtering rows, we have provided slice(1, 5) function. It'll have the same effect as our previous example.
df3.loc[(2021,slice(None), slice(1,5)), (["Location1"], ["Item1","Item2", "Item3"])]
In the below cell, we have filtered rows of our dataframe where we had datetime index as a part of MultiIndex object.
df4.loc[(slice(None), ["2021-1-1", "2021-1-2", "2021-1-3"]),]
In this section, we'll explain how we can specify a range of values using ':' operator for filtering rows/columns of our dataframe.
Below we have filtered our first dataframe by providing value '"A":"C"' for indexing. This informs the '.loc' property that keep rows of dataframe where first level index values ranges from 'A' to 'C' which is three values ['A','B','C'].
df1.loc["A":"C",]
Below we have created another example where we have provided tuples to filter rows of dataframe. This will keep all entries in the dataframe starting from the first tuple index values till the last tuple index values.
df1.loc[("A",1): ("C",3),]
In the below cell, we have created another example demonstrating the usage of range filtering where we have filtered rows of dataframe as well as columns of dataframe using range operator (':').
df2.loc[(2021,1,1):(2021,1,15), "C3":"C5"]
In the below cell, we have created another example demonstrating the usage of the range operator for filtering rows and columns of the dataframe. This time we have used tuples for columns as well.
df3.loc[(2021,1,1):(2021,1,15), ("Location1", "Item4"): ("Location2","Item1")]
In the below cell, we have used range indexing on our dataframe where values of one level are of datetime data type.
df4.loc["Location1", "2021-1-1":"2021-1-5", :]
In the below cell, we have created one more example demonstrating usage of range operator to filter rows of dataframe where one level values are of datetime data type. We have provided datetime values as strings.
df4.loc["Location1":"Location2", "2021-1-1":"2021-1-5", :]
In this section, we'll explain how we can use xs() function available from the pandas dataframe to filter rows/columns of the dataframe.
The benefit of using this method is that we can provide information about individual levels and all other level labels will be handled internally without we need to explicitly handle them.
Below we have retrieved rows from our dataframe where values for level 'Day' is 1. There are two rows in the dataframe where the value of level 'Day' is 1.
df2.xs(1, level="Day")
We can also provide slice() function call to xs() method.
Below we have retrieved rows from our dataframe where values for level 'Day' are in the range [1-3]. There are six values that satisfies this condition because we have 'Day' values [1,2,3] present for month values 1 and 2 both.
df2.xs(slice(1,3), level="Day")
In the below cell, we have explained how we can use xs() method with columns of our dataframe.
We have filtered columns of our dataframe where level named 'Location' has values 'Location1'. There are 5 columns that satisfy this condition.
df3.xs("Location1", level="Location", axis=1)
In the below cell, we have filtered columns of our dataframe using xs() function where label values are provided by calling slice() function. We have filtered columns of the dataframe where labels of level 'Item' are one of ['Item1','Item2','Item3].
df3.xs(slice("Item1", "Item3"), level="Item", axis=1)
In the below cell, we have created another example demonstrating usage of xs() function. This time we have explained how we can give labels for more than one level. We have provided values for three levels of our dataframe and only one row satisfies those labels.
df3.xs((2021,1,1), level=("Year", "Month","Day"))
In the below cell, we have created another example demonstrating how we can provide labels for multiple levels to xs() method.
df4.xs(("Location1","2021-1-1"), level=("Location", "Date"))
In the below cell, we have again created an example demonstrating how to give labels for multiple levels to xs() method but this time we have used slice() call as one of the level's labels.
df3.xs((2021,slice(None),1), level=("Year", "Month","Day"))
In the next cell, we have created another example demonstrating usage of xs() method. This time we have used two slice() function call. One of the calls refers to all labels at that level and another call refers to the range of values.
df3.xs((2021,slice(None),slice(1,5)), level=("Year", "Month","Day"))
In our last example, we have explained how we can use xs() function when one of the level values is of datetime data type.
df4.xs(("Location1",slice("2021-1-1", "2021-1-5")), level=("Location", "Date"))
This ends our small tutorial explaining how we can create, index, and work with multi-level /hierarchical indexing (MultiIndex) with pandas dataframe. Please feel free to let us know your views in the comments section.
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.
If you want to