Xarray is an open-source Python library that works almost like numpy but lets us name the dimensions of an array. Unlike numpy, where the majority of methods require us to specify an axis argument to perform operations on a particular axis, xarray lets us specify string dimension names, which can be more intuitive. This generally helps a lot when you are looking at an old codebase: operations based on an axis number say little about intent, whereas operations based on a string dimension name will easily remind you of your coding decisions. Apart from dimension names, xarray also lets us specify coordinates and attributes for arrays. The coordinates are just like a pandas index but can be present for every dimension of our array. The attributes are overall details about an array and are not associated with any dimension or coordinate.
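To make the difference between axis numbers and dimension names concrete, below is a minimal sketch that computes the same mean once with numpy and once with xarray. The dimension names "row" and "col" are made up for this illustration.

```python
import numpy as np
import xarray as xr

values = np.arange(6).reshape(2, 3)

# Plain numpy: the reader has to remember that axis 0 means "rows".
numpy_mean = values.mean(axis=0)

# xarray: the same reduction, addressed by a dimension name.
# The names "row" and "col" are hypothetical, chosen for this example only.
labeled = xr.DataArray(values, dims=("row", "col"))
xarray_mean = labeled.mean(dim="row")

print(numpy_mean)
print(xarray_mean.values)
print(xarray_mean.dims)
```

Both calls reduce over the same axis; only the xarray version states in the code which dimension disappears.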
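Xarray provides two main data structures as a part of its API: DataArray, a labeled N-dimensional array, and Dataset, a dictionary-like collection of DataArray objects that share dimensions.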
As a part of this tutorial, we'll be primarily concentrating on the Dataset data structure of the xarray library. We have already covered the DataArray data structure in detail in a separate tutorial. Please feel free to check it if you want to know about it in detail.
Below we have highlighted important sections of our tutorial to give an overview of the topics that we'll be covering.
Below we have imported the necessary libraries that we'll be using as a part of our tutorial.
import xarray as xr
print("Xarray Version : {}".format(xr.__version__))
import numpy as np
import pandas as pd
In this section, we'll explain various ways of creating xarray Dataset using different methods available from the API. We'll be using numpy for data creation purposes. We'll be using datasets created in this section in all of our upcoming sections to explain indexing and other methods of xarray API.
Below we have first set seed for numpy so that all random numbers generated after this code will be the same on different computers for reproducibility purposes.
np.random.seed(123)
The simplest way to create xarray Dataset is by using Dataset constructor available from the library.
Below we have created our first xarray Dataset. We have first created two numpy arrays of random numbers with the same shape. We have then given these arrays to the Dataset constructor as a dictionary through the data_vars parameter. The dictionary keys are the names of the arrays and each dictionary value is a tuple of dimension names and data. As we have one-dimensional arrays, we can provide the dimension name as a single string. We have then specified coordinates by giving a dictionary to the coords parameter. We have given a simple range that goes from 0-4 as the coordinate values of the single dimension 'x'. These values 0-4 will be the coordinates used to index the Dataset along the x dimension.
arr1 = np.random.randn(5)
arr2 = np.random.randn(5)
dataset1 = xr.Dataset(data_vars={"Array1": ('x', arr1),
"Array2": ('x', arr2)},
coords={"x": np.arange(5)})
dataset1
Below we have created another xarray Dataset using Dataset() constructor. This time we have provided two-dimensional arrays as data of our dataset. We have created both arrays using numpy. One of the arrays is an array of integers and the other is an array of random floats drawn from a standard normal distribution (np.random.randn()). This example shows that we can combine different kinds of data using Dataset.
As our arrays are two-dimensional, we'll have two dimensions in our data. We have declared dimensions as tuple (('x','y')) in dictionary provided to data_vars parameter of the constructor. The coords parameter is provided with a dictionary where we have simply used a range of integers to represent coordinates.
arr1 = np.random.randint(1,100,size=(3, 5))
arr2 = np.random.randn(3, 5)
dataset2 = xr.Dataset(data_vars={"Array1": (("x","y"), arr1),
"Array2": (("x","y"), arr2)},
coords={"x": np.arange(3),
"y": np.arange(5)})
dataset2
Below we have created another xarray Dataset which has almost the same code as our previous dataset, with only a change in the coordinate values for both dimensions. We have provided a list of strings as coordinate values.
arr1 = np.random.randint(1,100,size=(3, 5))
arr2 = np.random.randn(3, 5)
dataset3 = xr.Dataset(data_vars={"Array1": (("x","y"), arr1),
"Array2": (("x","y"), arr2)},
coords={"x": ["x1","x2","x3"],
"y": ["y1","y2","y3","y4","y5"]})
dataset3
In the below cell, we have created another Dataset in which we have provided 3-dimensional arrays. We have specified the three dimension names as a tuple (('x','y','z')). The coordinate values are simply ranges of integers.
arr1 = np.random.randint(1,100,size=(3, 5,7))
arr2 = np.random.randn(3, 5, 7)
dataset4 = xr.Dataset(data_vars={"Array1": (("x","y","z"), arr1),
"Array2": (("x","y","z"), arr2)},
coords={"x": np.arange(3),
"y": np.arange(5),
"z": np.arange(7)})
dataset4
In the next cell, we have created a Dataset with a coordinate that spans two dimensions, 'x' and 'y'.
Please make a NOTE of how we have provided the value of the 'index1' coordinate. The value in the dictionary is a tuple with two elements. The first element is again a tuple of two strings that specifies which dimensions the coordinate spans. The second element is an array whose shape matches the combined shape of dimensions 'x' and 'y'. We have created an array of integers in the range 0-14 and reshaped it to a (3,5) array to be used as the coordinate values. When we perform indexing on this dataset, values of the 'index1' coordinate will be selected based on the 'x' and 'y' positions used for indexing (e.g. x=0, y=0 gives index1=0; x=0:2, y=0:2 gives index1=0,1,5,6). This example explains how we can store some extra information inside of coordinates, which can be useful to link more related data. This will become more clear when we explain our next example, which is modeled on real-life datasets.
In order to perform indexing on this Dataset, we'll still need to provide the three dimensions 'x', 'y', and 'z'. But we are storing extra details as the 'index1' coordinate, which can be a requirement in some situations. When we explain indexing/slicing datasets, it'll become more clear how coordinates with values other than plain integer positions can be used to store more information.
arr1 = np.random.randint(1,100,size=(3, 5,7))
arr2 = np.random.randn(3, 5, 7)
dataset5 = xr.Dataset(data_vars={"Array1": (("x","y","z"), arr1),
"Array2": (("x","y","z"), arr2)},
coords={"index1": (("x","y"), np.arange(15).reshape(3,5)),
"z": np.arange(7)
})
dataset5
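To see how a two-dimensional coordinate such as 'index1' follows a selection, below is a minimal sketch. It builds a small stand-in dataset (the name demo and the zero-filled data are assumptions made for illustration) and indexes it by position with isel().

```python
import numpy as np
import xarray as xr

# Stand-in dataset: same layout as the one above, but filled with zeros.
demo = xr.Dataset(
    data_vars={"Array1": (("x", "y", "z"), np.zeros((3, 5, 7)))},
    coords={
        "index1": (("x", "y"), np.arange(15).reshape(3, 5)),
        "z": np.arange(7),
    },
)

# A single (x, y) position leaves one value of index1.
single = demo.isel(x=0, y=0)
print(single.index1.item())

# A 2x2 block of (x, y) positions leaves the matching 2x2 block of index1.
block = demo.isel(x=slice(0, 2), y=slice(0, 2))
print(block.index1.values)
```

The 'z' dimension is untouched by the selection, so both results still hold all 7 entries along it.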
Our next example explains the kind of dataset that we can face in real-life situations. It shows how we can combine a different kind of data with Dataset object.
Our dataset consists of 6 different arrays of shape (3,5,7). They all represent measurements of different attributes which are used in weather forecasting.
The dataset has 3 dimensions which are named 'x,y and time'. All dimension names are specified in the dictionary given to data_vars parameter.
The dictionary given to the coords parameter creates two new coordinates named lon and lat, each of which spans dimensions 'x' and 'y'. The value of the 'lon' coordinate is a tuple of two values where the first value is a tuple of two strings representing dimensions and the second value is an array of shape (3,5) representing coordinate values. The value of the 'lat' coordinate follows the same structure. The 'time' dimension is used as it is to represent coordinates in that dimension. We have specified seven dates as the values of the time coordinate using pandas.date_range() function.
When we index our dataset by specifying values for the 'x', 'y', and 'time' dimensions, we'll get unique measurements of temperature, humidity, pressure, wind speed, precipitation, and PM25 measured at a particular time and particular location (longitude, latitude). The location is represented using longitude and latitude, which are specified as coordinates and not as part of the data dictionary provided to data_vars parameter.
Apart from data and coordinates, we have also specified attributes of the dataset for the first time. We have given a dictionary of strings to the attrs parameter where we have specified more information explaining what the dataset holds and how to interpret coordinates and dimensions.
This example is inspired by an example in the xarray documentation, so the diagram shown there can be helpful for understanding how a Dataset is organized.
temperature = np.random.randint(1,100, size=(3,5,7))
humidity = np.random.randn(3, 5, 7)
pressure = np.random.randn(3, 5, 7)
windspeed = np.random.randn(3, 5, 7)
precipitation = np.random.randn(3, 5, 7)
pm25 = np.random.randn(3, 5, 7)
dataset6 = xr.Dataset(data_vars={"Temperature": (("x","y","time"), temperature),
"Humidity": (("x","y","time"), humidity),
"Pressure": (("x","y","time"), pressure),
"WindSpeed": (("x","y","time"), windspeed),
"Precipitation": (("x","y","time"), precipitation),
"PM25": (("x","y","time"), pm25),
},
coords={"lon": (("x","y"), np.linspace(1,15,15).reshape(3,5)),
"lat": (("x","y"), np.linspace(15,30,15).reshape(3,5)),
"time": pd.date_range(start="2021-01-01", periods=7)
},
attrs={"Summary": "Dataset holds information like temperature, humidity, pressure, windspeed, precipitation and pm 2.5 particle presence based on location (lon, lat) and time.",
"lon": "Longitude",
"lat": "Latitude",
"time": "Date of Record"
}
)
dataset6
The ones_like() function works like its counterpart in numpy. It takes a Dataset object as input and returns another Dataset object which has the same dimensions and coordinates as the input Dataset but with all values replaced with 1s.
xr.ones_like(dataset6)
The zeros_like() function works exactly like ones_like() but all the values of the Dataset are 0s.
xr.zeros_like(dataset6)
The full_like() function takes a Dataset object and a fill value as input. It then returns another Dataset object which has the same dimensions and coordinates as the input Dataset but with all values replaced with the fill value given as the second input.
xr.full_like(dataset6, 101)
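The three functions above can be compared side by side on a tiny dataset. The sketch below uses a made-up one-variable dataset (the names template and "a" are assumptions for illustration) and shows that the coordinates are carried over while only the data values change.

```python
import numpy as np
import xarray as xr

# Hypothetical template dataset with one float variable and labeled x values.
template = xr.Dataset(
    data_vars={"a": ("x", np.array([1.5, 2.5, 3.5]))},
    coords={"x": [10, 20, 30]},
)

ones = xr.ones_like(template)          # every value becomes 1
zeros = xr.zeros_like(template)        # every value becomes 0
sevens = xr.full_like(template, 7.0)   # every value becomes the fill value

print(ones.a.values, zeros.a.values, sevens.a.values)
print(sevens.x.values)                 # coordinates are unchanged
```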
In this section, we'll explain a few useful attributes of Dataset objects and the information stored in them.
The attrs attribute returns dictionary of Dataset attributes.
dataset6.attrs
dataset6.attrs["Summary"]
The coords attribute returns coordinates of the dataset. We can extract individual coordinates values by treating the output of coords attribute as a dictionary. Each individual coordinate is represented using xarray DataArray object.
dataset6.coords
dataset6.coords["lon"]
The data_vars attribute returns data that we provided when creating a Dataset.
dataset6.data_vars
We can access individual data from a list of data by treating the output of data_vars as a dictionary. The result will be xarray DataArray object.
pm25 = dataset6.data_vars["PM25"]
type(pm25)
We can access dimensions of Dataset using dims attribute.
dataset6.dims
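The indexes attribute returns the pandas Index objects of the Dataset, one for each dimension that has an index coordinate. For our weather dataset, for example, only 'time' has an index because 'x' and 'y' were not given coordinate values of their own.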
dataset3.indexes
dataset1.indexes
dataset6.indexes
The nbytes attribute returns the total number of bytes of memory used by the data of the Dataset object.
dataset6.nbytes
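The attributes covered in this section can be checked quickly on a small dataset. Below is a minimal sketch with a made-up dataset (the names small, "a" and "note" are assumptions for illustration). It also uses the sizes attribute, which maps each dimension name to its length.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset: only "x" gets coordinate values, "y" does not.
small = xr.Dataset(
    data_vars={"a": (("x", "y"), np.zeros((3, 5)))},
    coords={"x": [10, 20, 30]},
    attrs={"note": "demo"},
)

print(small.attrs)             # dictionary of attributes
print(list(small.data_vars))   # names of the data variables
print(dict(small.sizes))       # dimension name -> length
print(list(small.indexes))     # only dimensions that have an index coordinate
print(small.nbytes)            # bytes used by all variables, coordinates included
```

Note how 'y' appears in sizes but not in indexes, because it has no coordinate values attached.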
In this section, we'll explain how we can index Dataset objects. We'll first explain how we can access and index individual DataArray from Dataset and then explain indexing of Dataset as a whole using sel() and isel() methods.
Please make a NOTE that all methods in this section return a new Dataset object based on indexing operation. It does not modify any Dataset object in place.
In this section, we'll explain how we can access individual DataArray and perform indexing on it. If you are interested in learning about indexing on DataArray in detail then please feel free to check our tutorial on it.
Below we have retrieved DataArray which is stored in our Dataset object by Array1 name. We can retrieve it by treating our Dataset object as dictionary-like.
dataset3["Array1"]
We can also retrieve individual DataArray by calling its name as an attribute of Dataset object. The below statement will return the same result as our previous cell.
dataset3.Array1
We can treat a DataArray object just like a numpy array and perform integer indexing on it. Below we have retrieved a 2x2 subset from our Array1 DataArray.
dataset3.Array1[:2, :2]
We can also use the .loc property on our DataArray object, just like on a pandas series/dataframe, to retrieve a subset of an array by specifying actual coordinate values, which can be of a data type other than integer.
dataset3.Array1.loc[["x1", "x2", "x3"]]
dataset3.Array1.loc[["x1", "x2", "x3"], ["y1", "y2"]]
In this section, we'll explain how we can use isel() method to index Dataset objects.
The isel() method lets us use integer indexing to index our Dataset and provides two different ways to specify indexing details.
Below we have called isel() method on one of our Dataset objects. We have used the dimension name 'x' of the Dataset object as a parameter of isel() method and provided a single index value to it. We have basically retrieved the 0th element from the Dataset. This indexing is applied to all DataArray objects and coordinates of the Dataset. We can notice from the result that coordinate 'x' holds the single value 0 and the DataArray objects 'Array1' and 'Array2' also hold single values, which are the 0th entries of both.
x = dataset1.isel(x=0)
x
In the below cell, we have explained how we can provide indexing details to isel() method as a dictionary. The below method call will have the same impact as our previous cell.
x = dataset1.isel({"x":0})
x
In the below cell, we have again called isel() on one of our Dataset objects. This time we have provided a list of integers as indexing values for dimension 'x' of our Dataset object. This will retrieve the first two elements from the Dataset. We can notice from the results how coordinate x is populated with the first two values and both DataArray objects 'Array1' and 'Array2' are populated with the first two values as per indexing details.
x = dataset1.isel(x=[0,1])
x
In the below cell, we have explained again how we can provide indexing details as a dictionary. The below method call will return the same results as our previous cell method call.
x = dataset1.isel({'x':[0,1]})
x
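One detail worth seeing in code is the difference between a single integer and a list (or slice) given to isel(). The sketch below uses a made-up dataset (the names ds and "a" are assumptions for illustration): a single integer removes the dimension from the result, while a list or slice keeps it.

```python
import numpy as np
import xarray as xr

# Hypothetical one-dimensional dataset.
ds = xr.Dataset(
    data_vars={"a": ("x", np.array([1.5, 2.5, 3.5]))},
    coords={"x": np.arange(3)},
)

scalar = ds.isel(x=0)             # single integer: "x" is no longer a dimension
listed = ds.isel(x=[0])           # list with one entry: "x" stays, with length 1
sliced = ds.isel(x=slice(0, 2))   # slice: first two positions

print(dict(scalar.sizes), scalar.a.item())
print(dict(listed.sizes), listed.a.values)
print(dict(sliced.sizes), sliced.a.values)
```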
In the below cell, we have called isel() method on one of our Dataset objects which has two dimensions ('x' and 'y'). We have asked it to select the 0th and 1st values from dimension 'x' and the 1st and 2nd values from dimension 'y'. It'll return a subset of our original Dataset based on these indexing details. We can notice from the results how 'x' and 'y' coordinate values are retrieved based on indexing. The DataArray objects 'Array1' and 'Array2' both hold a 2x2 array.
x = dataset3.isel(x=[0,1], y=[1,2])
x
In the below cell, we have explained indexing on our Dataset with 3 dimensions using isel() method. We have retrieved a subset of the Dataset formed by the first and second values along all three dimensions.
x = dataset4.isel(x=[0,1], y=[0,1], z=[0,1])
x
In the below cell, we have also displayed the contents of two DataArray objects present inside of our Dataset we got using isel() method.
x.Array1
x.Array2
In the below cell, we have again called isel() method on our Dataset which had details about temperature, humidity, pressure, etc. The dimension names in that Dataset were x, y, and time.
We can notice from the results how a subset of coordinates and DataArray objects is retrieved based on the indexing details given to the method.
x = dataset6.isel(x=[0, 1], y=[0,1], time=[0,1])
x
x.Temperature
x.PM25
In this section, we have explained how we can use sel() method to perform indexing on our Dataset object.
The sel() method works exactly like isel() method but it accepts actual values of dimension to index Dataset object. The isel() method only accepts integer indexing values to index Dataset objects but sel() method accepts actual values of dimensions which can be of any data type (integer, string, datetime, etc).
Just like isel() method, it also lets us specify indexing details in two ways.
Below we have used sel() method to retrieve a subset of one of our Dataset objects. The Dataset object used in this example has integers as coordinate values, hence integers are used for indexing. When the integer coordinate values of a dimension are exactly the positions 0, 1, 2, ... (as they are here), isel() and sel() return the same result. They differ as soon as the coordinate values differ from the positions or are of another data type.
x = dataset2.sel(x=[0,1], y=[0,1,2])
x
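To show when sel() and isel() stop agreeing, below is a minimal sketch with integer coordinate values that are not the positions 0, 1, 2. The dataset and its names (ds, "a") are made up for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset whose x labels are 10, 20, 30 rather than 0, 1, 2.
ds = xr.Dataset(
    data_vars={"a": ("x", np.array([100, 200, 300]))},
    coords={"x": [10, 20, 30]},
)

by_position = ds.isel(x=0)   # position 0 -> label 10
by_label = ds.sel(x=20)      # label 20  -> position 1

print(by_position.a.item(), by_position.x.item())
print(by_label.a.item(), by_label.x.item())
```

Here isel(x=0) and sel(x=20) pick different entries, and sel(x=0) would fail because no coordinate value equals 0.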
In the below example, we have explained how we can provide indexing details as a dictionary to sel() method. The output of the below cell will be the same as our previous cell because the indexing details are the same.
x = dataset2.sel({'x':[0,1], 'y':[0,1,2]})
x
In the below cell, we have provided actual string values of dimensions to sel() method to subset our Dataset object.
x = dataset3.sel(x=["x1", "x2"], y=["y1", "y2", "y3"])
x
In the below cell, we have tried to use sel() method to subset a Dataset object whose one dimension values are of type datetime. We have asked it to retrieve a subset of Dataset with a single date in time dimension.
Please make a NOTE how we provided datetime details as a string. We can provide datetime details as a string or original datetime type as well.
Then in the next few cells after the below cell, we have displayed coordinate and DataArray object detail of subset Dataset object that we got using sel() method.
x = dataset6.sel(x=0, y=0, time="2021-1-1")
x
x.lon
x.Precipitation
In the below cell, we have created another example demonstrating usage of sel() method on Dataset whose one dimension values are of datetime type. This time we have provided a list of two strings specifying two different dates as values of time dimension inside sel() method function call.
Then in the next cell after the below cells, we have also displayed coordinate and DataArray object details of subset Dataset that we got through sel() method call.
x = dataset6.sel(x=[0,1], y=[0,1], time=["2021-1-1","2021-1-2"])
x
x.lon
x.Humidity
In the below cell, we have created another example demonstrating usage of sel() method on Dataset with datetime dimension. This time we have provided datetime values as a list of datetime type values created using pd.date_range() function. We can perform indexing on the dimension with datetime type values in this way as well.
x = dataset6.sel(x=[0,1], y=[0,1], time=pd.date_range(start="2021-1-1", periods=3))
x
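Besides single dates and lists of dates, sel() also accepts a slice of labels. The sketch below is an illustration on a made-up dataset (the names ds and "t" are assumptions); note that label-based slices include both end points, unlike positional slices.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical dataset with one value per day for a week.
ds = xr.Dataset(
    data_vars={"t": ("time", np.arange(7))},
    coords={"time": pd.date_range(start="2021-01-01", periods=7)},
)

# Select every day from 2 January up to and including 4 January.
subset = ds.sel(time=slice("2021-01-02", "2021-01-04"))

print(subset.time.values)
print(subset.t.values)
```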
In this section, we'll explain commonly performed operations on Dataset objects like transpose, copy, change coordinate details, change attribute details, fill NaNs, add new attributes, etc. We'll explain various methods available from xarray to perform these operations.
Please make a NOTE that all methods in this section return a new Dataset object with details modified. It does not modify any Dataset object in place.
The assign() method lets us add new DataArray to our Dataset object. The method takes as input dictionary in the same format which we provide to Dataset() constructor to add new DataArray objects.
Below we have first created an array of random numbers with shape (3,5). We have then added this array to our Dataset object using assign() method. We have provided array name as key and value is a tuple of dimension names and actual data. We can add more than one DataArray to our Dataset object using this method.
arr3 = np.random.randn(3,5)
dataset2.assign({"Array3": (["x","y"], arr3)})
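Because assign() returns a new Dataset, the result has to be stored if we want to keep it. The sketch below illustrates this on a made-up dataset (the names base and extended are assumptions for illustration).

```python
import numpy as np
import xarray as xr

# Hypothetical starting dataset with a single variable.
base = xr.Dataset({"Array1": (("x", "y"), np.zeros((3, 5)))})

# assign() returns a new Dataset; "base" itself is left as it was.
extended = base.assign({"Array3": (("x", "y"), np.ones((3, 5)))})

print(list(base.data_vars))
print(list(extended.data_vars))
```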
The assign_attrs() method takes a dictionary of attributes as input and returns a Dataset with those attributes added. Attributes that are already present in the Dataset are updated with the provided values and the remaining ones are added as new attributes.
dataset2.assign_attrs({"x": "Row-Dimension", "y": "Column-Dimension"})
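The add/update behaviour of assign_attrs() can be seen in a minimal sketch. The dataset and attribute names below (ds, "source") are made up for illustration.

```python
import xarray as xr

# Hypothetical dataset that already carries two attributes.
ds = xr.Dataset(attrs={"x": "old description", "source": "demo"})

updated = ds.assign_attrs({"x": "Row-Dimension", "y": "Column-Dimension"})

print(updated.attrs)   # "x" is overwritten, "y" is added, "source" is kept
print(ds.attrs)        # the original Dataset keeps its old attributes
```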
The assign_coords() method lets us update the coordinate details of a Dataset object. It accepts a dictionary specifying coordinate details just like the one we provide to Dataset() constructor when creating a Dataset object.
In the below example, we have replaced the existing integer coordinates of Dataset object with a list of string coordinates.
In the next cell below, we have also tried to retrieve subset Dataset based on these new coordinate values.
dataset2_new_coords = dataset2.assign_coords(coords={"x": ["a1","a2","a3"], "y": ["b1","b2","b3","b4","b5"]})
dataset2_new_coords
dataset2_new_coords.sel(x=["a1","a2"], y=["b1", "b2"])
The clip() method takes range specified as the minimum and maximum number. It then replaces all values which are less than minimum with minimum value and all values greater than maximum with maximum value. All values in the range are kept unchanged.
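Below we have restricted values of our Dataset to the range 0.3 to 0.6. As all values of the integer array Array1 are 1 or greater, every one of them is replaced with 0.6.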
dataset2_clipped = dataset2.clip(min=0.3, max=0.6)
dataset2_clipped
dataset2_clipped.Array2
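The effect of clip() is easiest to read on a handful of values. Below is a minimal sketch on a made-up dataset (the names ds and "a" are assumptions for illustration) with one value below, one inside, and one above the range.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset: 0.1 is below the range, 0.45 inside, 0.9 above.
ds = xr.Dataset({"a": ("x", np.array([0.1, 0.45, 0.9]))})

clipped = ds.clip(min=0.3, max=0.6)

print(clipped.a.values)
```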
We can easily create a copy of Dataset object by calling copy() method on it.
dataset2_copy = dataset2.copy()
dataset2_copy
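One point to be aware of is that copy() makes a shallow copy unless deep=True is given, so the copy and the original share the underlying data. The sketch below illustrates the difference on a made-up dataset (the names original, shallow and deep are assumptions for illustration).

```python
import numpy as np
import xarray as xr

original = xr.Dataset({"a": ("x", np.array([1.0, 2.0, 3.0]))})

shallow = original.copy()          # shares the underlying numpy data
deep = original.copy(deep=True)    # gets its own copy of the data

# Writing into the shallow copy's data is visible through the original too.
shallow["a"][0] = 99.0

print(original.a.values)
print(deep.a.values)
```

If the copy is going to be modified in place, passing deep=True is the safer choice.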
We can change data type of DataArray present in Dataset object using astype() method. It accepts python or numpy data types as input to specify the data type.
dataset2_copy = dataset2.copy().astype(np.float32)
dataset2_copy
The fillna() method accepts single value as input and replaces all NaNs in Dataset object with that value. It'll replace NaN values present inside DataArray objects of our Dataset object.
dataset2_copy.Array1[0,0] = np.nan
dataset2_copy.Array1
dataset2_copy = dataset2_copy.fillna(value=9999.)
dataset2_copy.Array1
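Below is a minimal sketch of fillna() on a made-up dataset (the names ds and "a" are assumptions for illustration). The data uses a float dtype because integer numpy arrays cannot store NaN, which is also why the example above converts the Dataset with astype() first.

```python
import numpy as np
import xarray as xr

# Hypothetical float dataset with one missing value.
ds = xr.Dataset({"a": ("x", np.array([np.nan, 2.0, 3.0]))})

filled = ds.fillna(0.0)

print(ds.a.values)       # the original still contains the NaN
print(filled.a.values)   # NaN replaced by the fill value
```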
We can easily drop DataArray objects from Dataset object using drop_vars() method. We need to provide a list of DataArray object names as input to the method and it'll return a new Dataset object with those DataArray objects removed.
dataset6.drop_vars(names=["Temperature", "Pressure", "PM25"])
The drop_isel() method can be used to remove a subset of our Dataset object using integer indexing. It works exactly like isel() indexing method but it removes values that satisfy indexing details provided to it.
The method takes indexing details either as parameters of the method or as a dictionary, just like isel() method.
Below we have dropped 0th value of dimension 'x', 0th & 1st value of dimension 'y' and 0th & 1st value of dimension 'time' of our dataset. This will then remove a subset of DataArray objects which satisfies these indexing details as well.
dataset6.drop_isel(x=[0,], y=[0,1], time=[0,1])
In the below cell, we have created an example which is a copy of our previous cell example with the only change that indexing details are provided as a dictionary.
dataset6.drop_isel({'x':[0,], 'y':[0,1], 'time':[0,1]})
The drop_sel() method works exactly like drop_isel() method but it accepts actual coordinate values instead of integer positions. These values can be of any data type like string, integer, float, datetime, etc.
The drop_sel() method is based on sel() indexing method but it removes the entries which satisfy the indexing details provided to it.
dataset3.drop_sel(x=["x1",], y=["y4","y5"])
In the below cell, we have created another example demonstrating usage of drop_sel() method which explains how we can give indexing details as a dictionary. The indexing details are the same as our previous cell.
dataset3.drop_sel({'x':["x1",], 'y':["y4","y5"]})
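To check what is left after dropping, the sketch below applies drop_isel() and drop_sel() to the same made-up dataset (the names ds and "a" are assumptions for illustration) with matching positions and labels, and compares the results.

```python
import numpy as np
import xarray as xr

# Hypothetical 3x5 dataset with string labels on both dimensions.
ds = xr.Dataset(
    data_vars={"a": (("x", "y"), np.arange(15).reshape(3, 5))},
    coords={"x": ["x1", "x2", "x3"], "y": ["y1", "y2", "y3", "y4", "y5"]},
)

# Drop the first row and the last two columns, once by position, once by label.
by_position = ds.drop_isel(x=[0], y=[3, 4])
by_label = ds.drop_sel(x=["x1"], y=["y4", "y5"])

print(dict(by_position.sizes))
print(by_position.a.values)
print(by_position.equals(by_label))
```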
The get() method can be used to access individual DataArray or coordinate by providing names for them.
Below we have retrieved DataArray object using get() method.
dataset2.get("Array1")
In the below cell, we have retrieved coordinate values for coordinate 'lon' using get() method.
dataset6.get("lon")
The isnull() method checks each value of every DataArray object and returns True where the value is NaN and False otherwise.
dataset2_copy = dataset2.copy().astype(np.float32)
dataset2_copy.Array1[0,0] = np.nan
dataset2_copy.isnull()
The isin() method takes an input array of elements. It then checks values of all DataArray objects and returns True/False based on the presence/absence of values in the given input array.
dataset2.isin([37,14,33, 84, 0.3861864])
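Below is a minimal sketch of isin() on a made-up dataset (the names ds, "ints" and "floats" are assumptions for illustration). It also shows that membership is tested by exact equality, so a float that is only approximately equal to a listed value, such as a value copied from rounded printed output, does not match.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset with one integer and one float variable.
ds = xr.Dataset(
    data_vars={
        "ints": ("x", np.array([37, 14, 5])),
        "floats": ("x", np.array([0.25, 0.1 + 0.2, 0.75])),
    }
)

mask = ds.isin([37, 14, 0.25, 0.3])

print(mask.ints.values)
print(mask.floats.values)   # 0.1 + 0.2 is not exactly 0.3, so it is False
```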
The map() method takes another function as input and applies it to each DataArray (data variable) of the Dataset. The input function receives a whole DataArray object and should return an array-like result; it is not called once per individual value. In that sense it resembles calling apply() on a pandas dataframe, where the function receives each column in turn.
Below we have multiplied all values by 10 using map() function.
dataset2.map(lambda x : x*10)
In the below cell, we have again called map() method to multiply all values by 10 but this time we have also asked it explicitly to keep all Dataset attributes.
dataset6.map(lambda x : x*10, keep_attrs=True)
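To see what the function given to map() actually receives, the sketch below records the type of each input it is called with. The dataset, the helper scale() and the list seen are all made up for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset with two variables and one attribute.
ds = xr.Dataset(
    data_vars={
        "a": ("x", np.array([1, 2, 3])),
        "b": ("x", np.array([10.0, 20.0, 30.0])),
    },
    attrs={"note": "demo"},
)

seen = []  # records the type of every object passed to the function

def scale(variable):
    # "variable" is one whole data variable, not a single number.
    seen.append(type(variable).__name__)
    return variable * 10

scaled = ds.map(scale, keep_attrs=True)

print(seen)
print(scaled.a.values, scaled.b.values)
print(scaled.attrs)
```

The function is called once per data variable, so two calls are recorded here even though the Dataset holds six values.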
The apply() method is a backward-compatible alias of map() and works exactly the same. The xarray documentation encourages using map() in new code.
dataset2.apply(lambda x : x * 10)
dataset6.apply(lambda x : x*10, keep_attrs=True)
We can retrieve transpose of our Dataset object using transpose() method. This will transpose all DataArray objects.
dataset2_transpose = dataset2.transpose()
dataset2_transpose
dataset2_transpose.Array1
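Below is a minimal sketch of transpose() on a made-up dataset (the names ds and "a" are assumptions for illustration). Called without arguments it reverses the dimension order of each array; dimension names can also be passed to request an explicit order.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset with a 3x5 array over dimensions ("x", "y").
ds = xr.Dataset({"a": (("x", "y"), np.zeros((3, 5)))})

reversed_dims = ds.transpose()           # reverses the order: ("y", "x")
explicit = ds.transpose("y", "x")        # same order, stated explicitly

print(ds.a.dims, ds.a.shape)
print(reversed_dims.a.dims, reversed_dims.a.shape)
print(explicit.a.dims)
```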
The query() method works much like the query() method of a pandas dataframe. It lets us specify python expressions as strings to filter a Dataset object. We need to provide an expression for each dimension separately.
If you want to know about how query() method works with pandas dataframe with examples then please feel free to check our tutorial on the same.
Below we have asked to keep only coordinate values that are greater than 0.5 along dimension 'x' and greater than 1.5 along dimension 'y' of our dataset. The returned Dataset object will hold only the entries whose coordinate values satisfy the input conditions.
dataset2.query(x="x > 0.5", y="y > 1.5")
In the below cell, we have created another example explaining the usage of query() method on xarray Dataset object.
dataset3.query(x="x in ['x1','x2']", y="y in ['y4','y5']")