Updated On : Dec-02,2021 Time Investment : ~30 mins

xarray: Multi-Dimensional Labelled Arrays (Dataset)

Xarray is an open-source Python library that works much like numpy but lets us name the dimensions of an array. In numpy, most methods require an axis argument to operate along a particular axis; xarray lets us specify string dimension names instead, which can be more intuitive. This helps a lot when revisiting an old codebase: an operation on a numbered axis says little about its intent, whereas an operation on a named dimension reminds you of the decision behind it. Apart from dimension names, xarray also lets us specify coordinates and attributes for arrays. The coordinates are like a pandas index, but available for every dimension of the array. The attributes are general details about an array and are not associated with any dimension or coordinate.
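To make the difference between axis numbers and dimension names concrete, below is a minimal sketch that computes the same reduction both ways. The dimension names "row" and "col" are illustrative choices for this sketch and are not used elsewhere in the tutorial.

```python
import numpy as np
import xarray as xr

values = np.arange(6).reshape(2, 3)

# NumPy: the reader has to remember that axis 0 means "rows".
numpy_mean = values.mean(axis=0)

# xarray: the same reduction, addressed by a dimension name.
labelled = xr.DataArray(values, dims=("row", "col"))
xarray_mean = labelled.mean(dim="row")

print(numpy_mean)
print(xarray_mean.values)
print(xarray_mean.dims)  # only the "col" dimension remains
```

Both calls reduce over the same axis; the xarray version simply states which one by name.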

Xarray provides two main data structures as a part of its API.

DataArray - This is a numpy-like N-dimensional labelled array.
Dataset - This is a collection of DataArray objects that can be worked on together. The DataArrays in a Dataset share dimensions and coordinates by name, although each one may use only some of the Dataset's dimensions. When we perform an operation based on dimensions or coordinates, it is applied to every DataArray in the Dataset that has them. This will become clear when we explain with examples.
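As a small illustration of a Dataset-wide operation, the sketch below reduces over one dimension and shows that every variable is affected. The variable names "a" and "b" are hypothetical; "b" deliberately has an extra "y" dimension to show that variables need not use identical sets of dimensions.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={
        "a": ("x", np.array([1.0, 2.0, 3.0])),
        "b": (("x", "y"), np.arange(6.0).reshape(3, 2)),
    },
    coords={"x": [10, 20, 30]},
)

# Reducing over "x" is applied to every variable that has that dimension.
reduced = ds.mean(dim="x")

print(reduced["a"].values)  # mean of [1, 2, 3]
print(reduced["b"].values)  # column means of the 3x2 array
```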

As a part of this tutorial, we'll be primarily concentrating on Dataset data structure of xarray library. We have already covered DataArray data structure in detail in a separate tutorial. Please feel free to check it from the below link if you want to know about it in detail.

xarray: Simple Guide to Labeled N-Dimensional Array (DataArray)

Below we have highlighted important sections of our tutorial to give an overview of the topics that we'll be covering.

Important Sections of Tutorial

Dataset Creation
Dataset Attributes
Dataset Indexing/Slicing
- Indexing By Accessing Individual Arrays
- Indexing using isel() Function
- Indexing using sel() Function
Normal Operations on Dataset
Simple Statistics

Below we have imported the necessary libraries that we'll be using as a part of our tutorial.

import xarray as xr

print("Xarray Version : {}".format(xr.__version__))

Xarray Version : 0.20.1

import numpy as np
import pandas as pd

1. Dataset Creation

In this section, we'll explain various ways of creating xarray Dataset using different methods available from the API. We'll be using numpy for data creation purposes. We'll be using datasets created in this section in all of our upcoming sections to explain indexing and other methods of xarray API.

Below we have first set seed for numpy so that all random numbers generated after this code will be the same on different computers for reproducibility purposes.

np.random.seed(123)

The simplest way to create xarray Dataset is by using Dataset constructor available from the library.

Dataset()

Dataset(data_vars={}, coords={}, attrs={}) - This constructor accepts data given as a dictionary to the data_vars parameter, coordinates given as a dictionary to the coords parameter, and attributes given as a dictionary to the attrs parameter. It then creates an instance of Dataset, a multi-dimensional labeled container holding many DataArray instances that can be worked on together. Below, we describe how to provide values for these parameters.
- The data_vars parameter accepts a dictionary whose keys are the names of the individual DataArrays and whose values are tuples of the form (dimensions, data[, attributes]).
  - The dimensions can be a single string if the input DataArray has only one dimension; otherwise it is a tuple of strings naming the dimensions of the N-dimensional DataArray.
  - The data can be a numpy array, a Python list, etc.
  - The attributes entry is an optional dictionary that can hold information about an individual DataArray.
- The coords parameter accepts a dictionary whose keys are dimension names and whose values are the index values of those dimensions that we'll use to index the Dataset. This dictionary can also create coordinates that combine several dimensions of the dataset. We'll explain this through our examples below to make it clearer.
- The attrs parameter accepts a dictionary of attributes that hold information about the Dataset as a whole.
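The optional third element of a data_vars tuple is not used in the examples that follow, so here is a minimal sketch of it. All names and attribute values ("speed", "grid", "units", "title") are illustrative assumptions made for this sketch.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={
        # (dimension name, data, per-variable attributes)
        "speed": ("t", [1.5, 2.5, 3.5], {"units": "m/s"}),
        # A tuple of names is needed for more than one dimension.
        "grid": (("t", "k"), np.zeros((3, 2))),
    },
    coords={"t": [0, 1, 2]},
    attrs={"title": "constructor sketch"},
)

print(ds["speed"].attrs)  # attributes of one variable
print(ds.attrs)           # attributes of the whole Dataset
```

Per-variable attributes live on the individual DataArray, while the attrs parameter describes the Dataset as a whole.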

Below we have created our first xarray Dataset. We have first created two numpy arrays of random numbers with the same shape. We have then given these arrays to the Dataset constructor through the dictionary passed to the data_vars parameter. The dictionary keys are the names of the arrays and the dictionary values are a combination of dimension names and data. As our arrays are one-dimensional, we can provide the dimension name as a single string. We have then specified coordinates by giving a dictionary to the coords parameter. We have given a simple range that goes from 0-4 as the values of the single dimension of the data. These values 0-4 will be the coordinates used to index the Dataset along the x dimension.

arr1 = np.random.randn(5)
arr2 = np.random.randn(5)

dataset1 = xr.Dataset(data_vars={"Array1": ('x', arr1),
                                 "Array2": ('x', arr2)},
                      coords={"x": np.arange(5)})

dataset1

Below we have created another xarray Dataset using Dataset() constructor. This time we have provided two-dimensional arrays as data of our dataset. We have created both arrays using numpy. One is an array of integers and the other is an array of random floats drawn from the standard normal distribution (np.random.randn()). This example shows that we can combine different kinds of data using Dataset.

As our arrays are two-dimensional, we'll have two dimensions in our data. We have declared dimensions as tuple (('x','y')) in dictionary provided to data_vars parameter of the constructor. The coords parameter is provided with a dictionary where we have simply used a range of integers to represent coordinates.

arr1 = np.random.randint(1,100,size=(3, 5))
arr2 = np.random.randn(3, 5)

dataset2 = xr.Dataset(data_vars={"Array1": (("x","y"), arr1),
                                 "Array2": (("x","y"), arr2)},
                      coords={"x": np.arange(3),
                              "y": np.arange(5)})

dataset2

Below we have created another xarray Dataset with almost the same code as our previous dataset; only the coordinate values of both dimensions have changed. We have provided lists of strings as coordinate values.

arr1 = np.random.randint(1,100,size=(3, 5))
arr2 = np.random.randn(3, 5)

dataset3 = xr.Dataset(data_vars={"Array1": (("x","y"), arr1),
                                 "Array2": (("x","y"), arr2)},
                      coords={"x": ["x1","x2","x3"],
                              "y": ["y1","y2","y3","y4","y5"]})

dataset3

In the below cell, we have created another Dataset in which we have provided 3-dimensional arrays. We have specified the three dimension names as a tuple (('x','y','z')). The coordinate values are simply ranges of integers.

arr1 = np.random.randint(1,100,size=(3, 5,7))
arr2 = np.random.randn(3, 5, 7)

dataset4 = xr.Dataset(data_vars={"Array1": (("x","y","z"), arr1),
                                 "Array2": (("x","y","z"), arr2)},
                      coords={"x": np.arange(3),
                              "y": np.arange(5),
                              "z": np.arange(7)})

dataset4

In the next cell, we have created a Dataset in which a coordinate is created by combining the two dimensions 'x' and 'y'.

Please make a NOTE of how we have provided the value of the 'index1' coordinate. The value in the dictionary is a tuple with two elements. The first element is itself a tuple of two strings that specifies which dimensions the coordinate combines. The second element is an array whose shape matches the combined shape of dimensions 'x' and 'y'. We have created an array of integers in the range 0-14 and reshaped it to (3,5) to be used as the coordinate value. When we perform indexing on this dataset, values from the 'index1' coordinate are selected based on the 'x' and 'y' positions used for indexing (e.g. x=0, y=0 gives index1=0; x=0:2, y=0:2 gives index1=0,1,5,6). This example explains how we can store some extra information inside coordinates, which can be useful for linking related data. This will become clearer in our next example, which resembles real-life datasets.

In order to perform indexing on this Dataset, we'll still need to provide all three dimensions 'x', 'y', and 'z'. The 'index1' coordinate simply stores extra details, which can be a requirement in some situations. When we explain indexing/slicing of datasets, it'll become clearer how coordinates whose values differ from plain integer positions can be used to store more information.

arr1 = np.random.randint(1,100,size=(3, 5,7))
arr2 = np.random.randn(3, 5, 7)

dataset5 = xr.Dataset(data_vars={"Array1": (("x","y","z"), arr1),
                                 "Array2": (("x","y","z"), arr2)},
                      coords={"index1": (("x","y"), np.arange(15).reshape(3,5)),
                              "z": np.arange(7)
                            })

dataset5
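To check how a two-dimensional coordinate follows the indexing of its dimensions, the sketch below rebuilds a dataset of the same shape (with zeros instead of random data, an assumption made to keep the sketch deterministic) and takes the first two positions along 'x' and 'y'.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={"Array1": (("x", "y", "z"), np.zeros((3, 5, 7)))},
    coords={
        "index1": (("x", "y"), np.arange(15).reshape(3, 5)),
        "z": np.arange(7),
    },
)

# Take the first two positions along x and y; index1 is subset the same way.
subset = ds.isel(x=slice(0, 2), y=slice(0, 2))

print(subset["index1"].values)
```

Because each row of the (3,5) coordinate array holds five values, the second row starts at 5, so the selected block contains 0, 1, 5 and 6.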

Our next example explains the kind of dataset that we can face in real-life situations. It shows how we can combine a different kind of data with Dataset object.

Our dataset consists of 6 different arrays of shape (3,5,7). They all represent measurements of different attributes which are used in weather forecasting.

The dataset has 3 dimensions which are named 'x,y and time'. All dimension names are specified in the dictionary given to data_vars parameter.

The dictionary given to the coords parameter creates two new coordinates named lon and lat, each of which combines dimensions 'x' and 'y'. The value of the 'lon' coordinate is a tuple of two values where the first value is a tuple of two strings representing dimensions and the second value is an array of shape (3,5) representing coordinate values. The value of the 'lat' coordinate follows the same structure. The 'time' dimension is used as it is to represent coordinates in that dimension. We have specified seven dates as the value of the time coordinate using the pandas.date_range() function.

When we index our dataset by specifying values for the 'x', 'y', and 'time' dimensions, we'll get unique measurements of temperature, humidity, pressure, wind speed, precipitation, and PM25 recorded at a particular time and a particular location (longitude, latitude). The location is represented using longitude and latitude, which are specified as coordinates and not as part of the data dictionary provided to the data_vars parameter.

Apart from data and coordinates, we have also specified attributes of the dataset for the first time. We have given a dictionary of strings to the attrs parameter, where we have specified more information explaining what the dataset holds and how to interpret its coordinates and dimensions.

This example is inspired by an example in the xarray documentation; the diagram of a Dataset shown there can help in understanding how to read one.

temperature = np.random.randint(1,100, size=(3,5,7))
humidity = np.random.randn(3, 5, 7)
pressure = np.random.randn(3, 5, 7)
windspeed = np.random.randn(3, 5, 7)
precipitation = np.random.randn(3, 5, 7)
pm25 = np.random.randn(3, 5, 7)

dataset6 = xr.Dataset(data_vars={"Temperature": (("x","y","time"), temperature),
                                 "Humidity": (("x","y","time"), humidity),
                                 "Pressure": (("x","y","time"), pressure),
                                 "WindSpeed": (("x","y","time"), windspeed),
                                 "Precipitation": (("x","y","time"), precipitation),
                                 "PM25": (("x","y","time"), pm25),
                                },
                      coords={"lon": (("x","y"), np.linspace(1,15,15).reshape(3,5)),
                              "lat": (("x","y"), np.linspace(15,30,15).reshape(3,5)),
                              "time": pd.date_range(start="2021-01-01", periods=7)
                                },
                      attrs={"Summary": "Dataset holds information like temperature, humidity. pressure, windspeed, precipitation and pm 2.5 particle presence based on location (lon, lat) and time.",
                             "lon": "Longitude",
                             "lat": "Latitude",
                             "time": "Date of Record"
                            }
                     )

dataset6

ones_like()

The ones_like() function works like its counterpart in numpy. It takes a Dataset object as input and returns another Dataset object with the same dimensions and coordinates, but with every value in its data variables replaced with 1.

xr.ones_like(dataset6)

zeros_like()

The zeros_like() function works exactly like ones_like(), but all the values of the returned Dataset are 0.

xr.zeros_like(dataset6)

full_like()

The full_like() function takes a Dataset object and a fill value as input. It returns another Dataset object with the same dimensions as the input Dataset, but with all values replaced by the fill value given as the second argument.

xr.full_like(dataset6, 101)
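The three functions can be compared side by side on a tiny Dataset. The sketch below uses a hypothetical one-variable template rather than dataset6, so that the results are easy to read.

```python
import numpy as np
import xarray as xr

template = xr.Dataset(
    data_vars={"a": ("x", np.array([3.5, 4.5, 5.5]))},
    coords={"x": [10, 20, 30]},
)

ones = xr.ones_like(template)
zeros = xr.zeros_like(template)
sevens = xr.full_like(template, 7)

print(ones["a"].values, zeros["a"].values, sevens["a"].values)
print(sevens["x"].values)  # coordinates are kept, not overwritten
print(sevens["a"].dtype)   # the template's dtype is kept
```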

2. Dataset Attributes

In this section, we'll explain a few useful attributes of Dataset objects and the information stored in them.

The attrs attribute returns dictionary of Dataset attributes.

dataset6.attrs

{'Summary': 'Dataset holds information like temperature, humidity. pressure, windspeed, precipitation and pm 2.5 particle presence based on location (lon, lat) and time.',
 'lon': 'Longitude',
 'lat': 'Latitude',
 'time': 'Date of Record'}

dataset6.attrs["Summary"]

'Dataset holds information like temperature, humidity. pressure, windspeed, precipitation and pm 2.5 particle presence based on location (lon, lat) and time.'

The coords attribute returns coordinates of the dataset. We can extract individual coordinates values by treating the output of coords attribute as a dictionary. Each individual coordinate is represented using xarray DataArray object.

dataset6.coords

Coordinates:
    lon      (x, y) float64 1.0 2.0 3.0 4.0 5.0 6.0 ... 11.0 12.0 13.0 14.0 15.0
    lat      (x, y) float64 15.0 16.07 17.14 18.21 ... 26.79 27.86 28.93 30.0
  * time     (time) datetime64[ns] 2021-01-01 2021-01-02 ... 2021-01-07

dataset6.coords["lon"]

<xarray.DataArray 'lon' (x: 3, y: 5)>
array([[ 1.,  2.,  3.,  4.,  5.],
       [ 6.,  7.,  8.,  9., 10.],
       [11., 12., 13., 14., 15.]])
Coordinates:
    lon      (x, y) float64 1.0 2.0 3.0 4.0 5.0 6.0 ... 11.0 12.0 13.0 14.0 15.0
    lat      (x, y) float64 15.0 16.07 17.14 18.21 ... 26.79 27.86 28.93 30.0
Dimensions without coordinates: x, y

The data_vars attribute returns data that we provided when creating a Dataset.

dataset6.data_vars

Data variables:
    Temperature    (x, y, time) int64 82 3 36 70 38 65 83 ... 15 12 71 24 70 41
    Humidity       (x, y, time) float64 1.093 -0.8485 0.1826 ... -1.53 1.676
    Pressure       (x, y, time) float64 1.176 -1.544 -0.6974 ... -0.9302 -0.5022
    WindSpeed      (x, y, time) float64 0.07895 -0.1061 ... 0.2155 -1.026
    Precipitation  (x, y, time) float64 -0.5524 0.5605 0.3806 ... -0.7978 -1.678
    PM25           (x, y, time) float64 1.149 -1.045 ... 0.01034 -0.07389

We can access an individual data variable by treating the output of data_vars as a dictionary. The result is an xarray DataArray object.

pm25 = dataset6.data_vars["PM25"]

type(pm25)

xarray.core.dataarray.DataArray

We can access dimensions of Dataset using dims attribute.

dataset6.dims

Frozen({'x': 3, 'y': 5, 'time': 7})
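Dataset objects also expose a sizes attribute that maps each dimension name to its length. The sketch below uses it on a small hypothetical Dataset; as far as we know, newer xarray releases recommend sizes when a name-to-length mapping is needed, but treat that as an assumption to verify against the version you use.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": (("x", "y"), np.zeros((3, 5)))})

# sizes maps each dimension name to its length.
print(dict(ds.sizes))
print(ds["a"].shape)
```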

The indexes attribute returns the pandas indexes of the Dataset's dimensions. Only dimensions that have coordinate labels appear in it.

dataset3.indexes

x: Index(['x1', 'x2', 'x3'], dtype='object', name='x')
y: Index(['y1', 'y2', 'y3', 'y4', 'y5'], dtype='object', name='y')

dataset1.indexes

x: Int64Index([0, 1, 2, 3, 4], dtype='int64', name='x')

dataset6.indexes

time: DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
                     '2021-01-05', '2021-01-06', '2021-01-07'],
                    dtype='datetime64[ns]', name='time', freq='D')

The nbytes attribute returns the total number of bytes of memory used by the Dataset object.

dataset6.nbytes
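To see where the number comes from, the sketch below builds a small Dataset with explicit dtypes and compares nbytes with a hand count. It assumes that both data variables and coordinates are counted, which is our reading of how the attribute behaves.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={
        "a": ("x", np.zeros(5, dtype="float64")),
        "b": ("x", np.zeros(5, dtype="float64")),
    },
    coords={"x": np.arange(5, dtype="int64")},
)

# Two float64 variables plus one int64 coordinate, 5 elements of 8 bytes each.
expected = 3 * 5 * 8

print(ds.nbytes, expected)
```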

3. Dataset Indexing/Slicing

In this section, we'll explain how we can index Dataset objects. We'll first explain how we can access and index individual DataArray from Dataset and then explain indexing of Dataset as a whole using sel() and isel() methods.

Please make a NOTE that all methods in this section return a new Dataset object based on the indexing operation. They do not modify any Dataset object in place.

Indexing By Accessing Individual Arrays

In this section, we'll explain how we can access individual DataArray and perform indexing on it. If you are interested in learning about indexing on DataArray in detail then please feel free to check our tutorial on it.

xarray: Simple Guide to Labeled N-Dimensional Array (DataArray)

Below we have retrieved DataArray which is stored in our Dataset object by Array1 name. We can retrieve it by treating our Dataset object as dictionary-like.

dataset3["Array1"]

<xarray.DataArray 'Array1' (x: 3, y: 5)>
array([[28, 31, 53, 71, 27],
       [81,  7, 15, 76, 55],
       [72,  2, 44, 59, 56]])
Coordinates:
  * x        (x) <U2 'x1' 'x2' 'x3'
  * y        (y) <U2 'y1' 'y2' 'y3' 'y4' 'y5'

We can also retrieve individual DataArray by calling its name as an attribute of Dataset object. The below statement will return the same result as our previous cell.

dataset3.Array1

<xarray.DataArray 'Array1' (x: 3, y: 5)>
array([[28, 31, 53, 71, 27],
       [81,  7, 15, 76, 55],
       [72,  2, 44, 59, 56]])
Coordinates:
  * x        (x) <U2 'x1' 'x2' 'x3'
  * y        (y) <U2 'y1' 'y2' 'y3' 'y4' 'y5'

We can treat a DataArray object just like a numpy array and perform integer indexing on it. Below we have retrieved a 2x2 subset of our Array1 DataArray.

dataset3.Array1[:2, :2]

<xarray.DataArray 'Array1' (x: 2, y: 2)>
array([[28, 31],
       [81,  7]])
Coordinates:
  * x        (x) <U2 'x1' 'x2'
  * y        (y) <U2 'y1' 'y2'

We can also use the .loc property on our DataArray object, just like on a pandas Series/DataFrame, to retrieve a subset of an array by specifying actual index values, which can be of a data type other than integer.

dataset3.Array1.loc[["x1", "x2", "x3"]]

<xarray.DataArray 'Array1' (x: 3, y: 5)>
array([[28, 31, 53, 71, 27],
       [81,  7, 15, 76, 55],
       [72,  2, 44, 59, 56]])
Coordinates:
  * x        (x) <U2 'x1' 'x2' 'x3'
  * y        (y) <U2 'y1' 'y2' 'y3' 'y4' 'y5'

dataset3.Array1.loc[["x1", "x2", "x3"], ["y1", "y2"]]

<xarray.DataArray 'Array1' (x: 3, y: 2)>
array([[28, 31],
       [81,  7],
       [72,  2]])
Coordinates:
  * x        (x) <U2 'x1' 'x2' 'x3'
  * y        (y) <U2 'y1' 'y2'
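The .loc property also accepts a dictionary keyed by dimension name, and a Dataset has a .loc property of its own that takes such a dictionary. The sketch below rebuilds a dataset with the same string coordinates as dataset3 but with deterministic values (an assumption made so the output is predictable).

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={"Array1": (("x", "y"), np.arange(15).reshape(3, 5))},
    coords={"x": ["x1", "x2", "x3"], "y": ["y1", "y2", "y3", "y4", "y5"]},
)

# Labels given in dimension order.
positional = ds["Array1"].loc[["x1", "x2"], ["y1", "y2"]]

# The same selection with the dimensions named explicitly.
by_name = ds["Array1"].loc[{"x": ["x1", "x2"], "y": ["y1", "y2"]}]

# Dataset.loc takes a dictionary and subsets every variable.
whole = ds.loc[{"x": ["x1", "x2"]}]

print(positional.values)
print(dict(whole.sizes))
```

The dictionary form avoids relying on the order of dimensions, which is in keeping with the reason for naming them in the first place.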

Indexing using isel() Function

In this section, we'll explain how we can use isel() method to index Dataset objects.

The isel() method lets us use integer indexing to index our Dataset and provides two different ways to specify indexing details.

- We can provide indexing details as if dimension names were parameters of isel() method. The parameter name is a dimension name of the Dataset and the parameter value can be a single integer or a list of integers specifying index positions along that dimension.
- We can provide a dictionary where keys are dimension names and values are a single integer or a list of integer indexes for that dimension.

Below we have called isel() method on one of our Dataset objects. We have treated dimension name 'x' of the Dataset object as a parameter of isel() method and provided a single index value to it. We have basically retrieved the 0th element from the Dataset. This indexing is applied to all DataArrays and coordinates of the Dataset. We can notice from the result that coordinate 'x' holds the single value 0 and DataArray objects 'Array1' and 'Array2' also hold single values, the 0th entry of each.

x = dataset1.isel(x=0)

x

In the below cell, we have explained how we can provide indexing details to isel() method as a dictionary. The below method call will have the same impact as our previous cell.

x = dataset1.isel({"x":0})

x

In the below cell, we have again called isel() on one of our Dataset objects. This time we have provided a list of integers as indexing values for dimension 'x' of our Dataset object. This will retrieve the first two elements from the Dataset. We can notice from the results how coordinate x is populated with the first two values and both DataArray objects 'Array1' and 'Array2' are populated with the first two values as per indexing details.

x = dataset1.isel(x=[0,1])

x

In the below cell, we have explained again how we can provide indexing details as a dictionary. The below method call will return the same results as our previous cell method call.

x = dataset1.isel({'x':[0,1]})

x
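The sketch below checks on a small hypothetical Dataset that the keyword form and the dictionary form of isel() give the same result. It also shows one difference worth knowing: a single integer drops the dimension, while a list or a slice keeps it.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={"Array1": ("x", np.arange(5) * 10)},
    coords={"x": np.arange(5)},
)

by_keyword = ds.isel(x=[0, 1])
by_dict = ds.isel({"x": [0, 1]})

# A single integer drops the "x" dimension; a list or slice keeps it.
scalar = ds.isel(x=0)
sliced = ds.isel(x=slice(0, 2))

print(by_keyword.identical(by_dict))
print(scalar["Array1"].dims, sliced["Array1"].dims)
```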

In the below cell, we have called isel() method on one of our Dataset objects which has two dimensions ('x' and 'y'). We have asked it to select the 0th and 1st values from dimension 'x' and the 1st and 2nd values from dimension 'y'. It'll return a subset of our original Dataset based on these indexing details. We can notice from the results how the 'x' and 'y' coordinate values are retrieved based on indexing. The DataArray objects 'Array1' and 'Array2' both hold a 2x2 array.

x = dataset3.isel(x=[0,1], y=[1,2])

x

In the below cell, we have explained indexing on our Dataset with 3 dimensions using isel() method. We have retrieved a subset of the Dataset formed by the first and second values along all three dimensions.

x = dataset4.isel(x=[0,1], y=[0,1], z=[0,1])

x

In the below cell, we have also displayed the contents of two DataArray objects present inside of our Dataset we got using isel() method.

x.Array1

<xarray.DataArray 'Array1' (x: 2, y: 2, z: 2)>
array([[[ 7, 10],
        [55, 28]],

       [[66, 77],
        [71, 14]]])
Coordinates:
  * x        (x) int64 0 1
  * y        (y) int64 0 1
  * z        (z) int64 0 1

x.Array2

<xarray.DataArray 'Array2' (x: 2, y: 2, z: 2)>
array([[[ 1.01273905,  0.27874086],
        [-0.55210807,  0.12074736]],

       [[ 0.14330773,  0.25381648],
        [ 0.55385617, -0.53067456]]])
Coordinates:
  * x        (x) int64 0 1
  * y        (y) int64 0 1
  * z        (z) int64 0 1

In the below cell, we have again called isel() method on our Dataset which had details about temperature, humidity, pressure, etc. The dimension names in that Dataset were x, y, and time.

We can notice from the results how a subset of coordinates and DataArray objects is retrieved based on the indexing details given to the method.

x = dataset6.isel(x=[0, 1], y=[0,1], time=[0,1])

x

x.Temperature

<xarray.DataArray 'Temperature' (x: 2, y: 2, time: 2)>
array([[[82,  3],
        [38, 28]],

       [[51, 28],
        [93, 28]]])
Coordinates:
    lon      (x, y) float64 1.0 2.0 6.0 7.0
    lat      (x, y) float64 15.0 16.07 20.36 21.43
  * time     (time) datetime64[ns] 2021-01-01 2021-01-02
Dimensions without coordinates: x, y

x.PM25

<xarray.DataArray 'PM25' (x: 2, y: 2, time: 2)>
array([[[ 1.14942133, -1.04513297],
        [ 0.82658676, -0.01658263]],

       [[ 1.70862604,  0.83200933],
        [-0.8371662 ,  2.12666983]]])
Coordinates:
    lon      (x, y) float64 1.0 2.0 6.0 7.0
    lat      (x, y) float64 15.0 16.07 20.36 21.43
  * time     (time) datetime64[ns] 2021-01-01 2021-01-02
Dimensions without coordinates: x, y

Indexing using sel() Function

In this section, we have explained how we can use sel() method to perform indexing on our Dataset object.

The sel() method works like isel() method, but it indexes the Dataset object by the actual coordinate values of a dimension rather than by integer positions. The isel() method only accepts integer positions, whereas sel() method accepts coordinate values, which can be of any data type (integer, string, datetime, etc).

Just like isel() method, it also lets us specify indexing details in two ways.

- We can provide indexing details as if dimension names were parameters of sel() method. The parameter name is a dimension name of the Dataset and the parameter value can be a single coordinate value or a list of values for that dimension.
- We can provide a dictionary where keys are dimension names and values are a single coordinate value or a list of values for that dimension.

Below we have used sel() method to retrieve a subset of one of our Dataset objects. The Dataset object used in this example has the integers 0, 1, 2, ... as coordinate values, so the labels coincide with the positions and isel() and sel() return the same result. The two methods differ whenever the coordinate values are not simply the positions, for example strings, dates, or integers that do not start at 0.

x = dataset2.sel(x=[0,1], y=[0,1,2])

x

In the below example, we have explained how we can provide indexing details as a dictionary to sel() method. The output of the below cell will be the same as our previous cell because the indexing details are the same.

x = dataset2.sel({'x':[0,1], 'y':[0,1,2]})

x
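To illustrate when sel() and isel() stop being interchangeable, the sketch below uses a hypothetical Dataset whose integer coordinate values are 10, 20, 30 rather than 0, 1, 2.

```python
import xarray as xr

ds = xr.Dataset(
    data_vars={"Array1": ("x", [1.0, 2.0, 3.0])},
    coords={"x": [10, 20, 30]},
)

by_label = ds.sel(x=20)     # the entry whose coordinate value is 20
by_position = ds.isel(x=1)  # the entry at position 1

print(by_label.equals(by_position))

# The value 1 is a valid position but not a coordinate value, so sel() fails.
missing_label = False
try:
    ds.sel(x=1)
except KeyError:
    missing_label = True

print(missing_label)
```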

In the below cell, we have provided actual string values of dimensions to sel() method to subset our Dataset object.

x  = dataset3.sel(x=["x1", "x2"], y=["y1", "y2", "y3"])

x

In the below cell, we have used sel() method to subset a Dataset object in which one dimension has values of type datetime. We have asked it to retrieve a subset of the Dataset with a single date in the time dimension. The 'x' and 'y' dimensions of this Dataset have no coordinate labels of their own, so the values given for them are treated as integer positions.

Please make a NOTE of how we provided the datetime details as a string. We can provide datetime details either as a string or as an actual datetime value.

In the next few cells, we have displayed the coordinate and DataArray details of the subset Dataset object that we got using sel() method.

x = dataset6.sel(x=0, y=0, time="2021-1-1")

x

x.lon

<xarray.DataArray 'lon' ()>
array(1.)
Coordinates:
    lon      float64 1.0
    lat      float64 15.0
    time     datetime64[ns] 2021-01-01

x.Precipitation

<xarray.DataArray 'Precipitation' ()>
array(-0.55236716)
Coordinates:
    lon      float64 1.0
    lat      float64 15.0
    time     datetime64[ns] 2021-01-01
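The sketch below reproduces this kind of selection on a small hypothetical Dataset: the 'x' dimension has no coordinate labels and the 'time' dimension holds dates. It assumes that sel() falls back to positional indexing for a dimension without coordinates, which matches the behaviour seen in the cell above.

```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    data_vars={"v": (("x", "time"), np.arange(6).reshape(2, 3))},
    coords={"time": pd.date_range("2021-01-01", periods=3)},
)

# "x" has no coordinate labels, so x=1 is used as a position.
picked = ds.sel(x=1, time="2021-01-02")

# The same selection with an actual datetime value instead of a string.
same = ds.sel(x=1, time=pd.Timestamp("2021-01-02"))

print(picked["v"].values)
print(same["v"].values)
```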

In the below cell, we have created another example demonstrating usage of sel() method on a Dataset in which one dimension has values of datetime type. This time we have provided a list of two strings specifying two different dates as values of the time dimension in the sel() call.

In the cells after that, we have also displayed the coordinate and DataArray details of the subset Dataset that we got through the sel() call.

x = dataset6.sel(x=[0,1], y=[0,1], time=["2021-1-1","2021-1-2"])

x

x.lon

<xarray.DataArray 'lon' (x: 2, y: 2)>
array([[1., 2.],
       [6., 7.]])
Coordinates:
    lon      (x, y) float64 1.0 2.0 6.0 7.0
    lat      (x, y) float64 15.0 16.07 20.36 21.43
Dimensions without coordinates: x, y

x.Humidity

<xarray.DataArray 'Humidity' (x: 2, y: 2, time: 2)>
array([[[ 1.09285914, -0.84854387],
        [ 2.09665566,  0.15985731]],

       [[-0.37605972, -0.54091566],
        [ 0.92056561, -0.32548426]]])
Coordinates:
    lon      (x, y) float64 1.0 2.0 6.0 7.0
    lat      (x, y) float64 15.0 16.07 20.36 21.43
  * time     (time) datetime64[ns] 2021-01-01 2021-01-02
Dimensions without coordinates: x, y

In the below cell, we have created another example demonstrating usage of sel() method on Dataset with datetime dimension. This time we have provided datetime values as a list of datetime type values created using pd.date_range() function. We can perform indexing on the dimension with datetime type values in this way as well.

x = dataset6.sel(x=[0,1], y=[0,1], time=pd.date_range(start="2021-1-1", periods=3))

x
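
As a small self-contained sketch of datetime-based selection, the cell below builds a tiny made-up Dataset (the name demo and its values are our own assumptions, not the tutorial's dataset6). It selects dates once with a DatetimeIndex and once with a slice of date strings, which includes both end points.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for dataset6: one variable over a daily 'time' coordinate.
times = pd.date_range(start="2021-01-01", periods=5)
demo = xr.Dataset(
    {"Temperature": (("time",), np.arange(5))},
    coords={"time": times},
)

# Select with an explicit DatetimeIndex, as in the cell above.
by_index = demo.sel(time=pd.date_range(start="2021-01-01", periods=3))

# Select a range of dates with slice(); both end points are included.
by_slice = demo.sel(time=slice("2021-01-02", "2021-01-04"))

print(by_index["Temperature"].values)
print(by_slice["Temperature"].values)
```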

4. Normal Operations on Dataset ¶

In this section, we'll explain commonly performed operations on Dataset objects such as transposing, copying, changing coordinates, changing attributes, filling NaNs, and adding new attributes. We'll cover the xarray methods available for each of these operations.

Please note that every method in this section returns a new Dataset object with the change applied. None of them modifies the Dataset object in place.

assign()¶

The assign() method lets us add new DataArray objects to our Dataset. It takes a dictionary in the same format that we provide to the Dataset() constructor.

Below we have first created an array of random numbers with shape (3,5) and then added it to our Dataset using assign(). The dictionary key is the array name and the value is a tuple of dimension names and the actual data. We can add more than one DataArray in a single call.

arr3  = np.random.randn(3,5)

dataset2.assign({"Array3": (["x","y"], arr3)})
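
A minimal sketch of adding two arrays in one assign() call is shown below. The Dataset demo and the names Array3 and Array4 are made up for illustration. It also shows that the original Dataset is left untouched.

```python
import numpy as np
import xarray as xr

# Hypothetical 2x3 Dataset standing in for the tutorial's dataset2.
demo = xr.Dataset(
    {"Array1": (("x", "y"), np.arange(6).reshape(2, 3))},
    coords={"x": [0, 1], "y": [0, 1, 2]},
)

# assign() returns a new Dataset; several variables can be added at once.
extended = demo.assign(
    {
        "Array3": (("x", "y"), np.ones((2, 3))),
        "Array4": (("x", "y"), np.zeros((2, 3))),
    }
)

print(sorted(extended.data_vars))
print(sorted(demo.data_vars))  # the original still holds only Array1
```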

assign_attrs()¶

The assign_attrs() method takes a dictionary of attributes and adds them to the Dataset object. Attributes that already exist under the same name are overwritten and new ones are added. A Dataset without attributes simply receives the given ones.

dataset2.assign_attrs({"x": "Row-Dimension", "y": "Column-Dimension"})
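
The add/update behaviour can be sketched on a small made-up Dataset (demo and its 'source' attribute are assumptions for illustration). One attribute is overwritten, one is added, and the original Dataset keeps its old attributes.

```python
import xarray as xr

# Hypothetical Dataset that already carries one attribute.
demo = xr.Dataset({"Array1": (("x",), [1, 2, 3])}, attrs={"source": "demo"})

# 'source' is overwritten, 'x' is added.
updated = demo.assign_attrs({"source": "updated", "x": "Row-Dimension"})

print(updated.attrs)
print(demo.attrs)  # unchanged
```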

assign_coords()¶

The assign_coords() method lets us update the coordinates of a Dataset object. It accepts a dictionary of coordinate details, just like the one we provide to the Dataset() constructor when creating a Dataset object.

In the below example, we have replaced the existing integer coordinates of Dataset object with a list of string coordinates.

In the next cell below, we have also tried to retrieve subset Dataset based on these new coordinate values.

dataset2_new_coords = dataset2.assign_coords(coords={"x": ["a1","a2","a3"], "y": ["b1","b2","b3","b4","b5"]})

dataset2_new_coords

dataset2_new_coords.sel(x=["a1","a2"], y=["b1", "b2"])
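
Below is a small runnable sketch of the same relabel-then-select pattern. The Dataset demo and its labels are made up for illustration. Here the coordinates are passed as keyword arguments, which assign_coords() also accepts.

```python
import numpy as np
import xarray as xr

# Hypothetical 2x3 Dataset with integer coordinates.
demo = xr.Dataset(
    {"Array1": (("x", "y"), np.arange(6).reshape(2, 3))},
    coords={"x": [0, 1], "y": [0, 1, 2]},
)

# Replace the integer coordinates with string labels.
relabeled = demo.assign_coords(x=["a1", "a2"], y=["b1", "b2", "b3"])

# The new labels can now be used with sel().
subset = relabeled.sel(x=["a1"], y=["b1", "b3"])

print(subset["Array1"].values)
```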

clip()¶

The clip() method takes a range given as a minimum and a maximum. It replaces all values below the minimum with the minimum and all values above the maximum with the maximum. Values inside the range are left unchanged.

Below we have restricted the values of our Dataset to the range [0.3, 0.6].

dataset2_clipped = dataset2.clip(min=0.3, max=0.6)

dataset2_clipped

dataset2_clipped.Array2

<xarray.DataArray 'Array2' (x: 3, y: 5)>
array([[0.6      , 0.3861864, 0.6      , 0.6      , 0.3      ],
       [0.6      , 0.3      , 0.3      , 0.6      , 0.3      ],
       [0.3      , 0.3      , 0.3      , 0.3      , 0.3      ]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) int64 0 1 2 3 4
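
A tiny sketch with hand-picked values makes the clipping rule easy to check by eye. The Dataset demo is made up for illustration. It also shows that only one of the two bounds needs to be given.

```python
import xarray as xr

# Hypothetical Dataset with one value below, one inside and one above the range.
demo = xr.Dataset({"A": (("x",), [0.1, 0.45, 0.9])})

both_bounds = demo.clip(min=0.3, max=0.6)
lower_only = demo.clip(min=0.3)  # no upper bound

print(both_bounds["A"].values)
print(lower_only["A"].values)
```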

copy()¶

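We can create a copy of a Dataset object by calling the copy() method on it. By default the copy is shallow (deep=False), so the new Dataset shares the underlying array data with the original. Pass deep=True to get an independent copy of the data.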

dataset2_copy = dataset2.copy()

dataset2_copy

astype()¶

We can change the data type of the DataArray objects in a Dataset using the astype() method. It accepts a Python or NumPy data type.

dataset2_copy = dataset2.copy().astype(np.float32)

dataset2_copy
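
The difference between a shallow and a deep copy, together with astype(), can be sketched as below. The Dataset demo is a made-up example backed by a plain NumPy array, which is the case this sketch assumes.

```python
import numpy as np
import xarray as xr

# Hypothetical integer Dataset.
demo = xr.Dataset({"Array1": (("x",), np.array([1, 2, 3]))})

shallow = demo.copy()        # default deep=False: shares the underlying data
deep = demo.copy(deep=True)  # independent copy of the data

# Modify the original's underlying NumPy array in place.
demo["Array1"].values[0] = 100

print(int(shallow["Array1"][0]))  # follows the original
print(int(deep["Array1"][0]))     # keeps the old value

# astype() returns a new Dataset whose data variables have the new dtype.
as_float = demo.astype(np.float32)
print(as_float["Array1"].dtype)
```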

fillna()¶

The fillna() method replaces all NaNs inside the DataArray objects of our Dataset with the value we provide. Below we have first set one entry of Array1 to NaN and then filled it with 9999.

dataset2_copy.Array1[0,0] = np.nan

dataset2_copy.Array1

<xarray.DataArray 'Array1' (x: 3, y: 5)>
array([[nan, 79., 37., 97., 81.],
       [69., 50., 56., 68.,  3.],
       [85., 40., 67., 85., 48.]], dtype=float32)
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) int64 0 1 2 3 4

dataset2_copy = dataset2_copy.fillna(value=9999.)

dataset2_copy.Array1

<xarray.DataArray 'Array1' (x: 3, y: 5)>
array([[9.999e+03, 7.900e+01, 3.700e+01, 9.700e+01, 8.100e+01],
       [6.900e+01, 5.000e+01, 5.600e+01, 6.800e+01, 3.000e+00],
       [8.500e+01, 4.000e+01, 6.700e+01, 8.500e+01, 4.800e+01]],
      dtype=float32)
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) int64 0 1 2 3 4
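
Besides a single fill value, fillna() also accepts a dictionary that maps data variable names to fill values. The sketch below uses a made-up Dataset demo with two variables to show both forms.

```python
import numpy as np
import xarray as xr

# Hypothetical Dataset with NaNs in both variables.
demo = xr.Dataset(
    {
        "A": (("x",), [1.0, np.nan, 3.0]),
        "B": (("x",), [np.nan, 5.0, 6.0]),
    }
)

filled_all = demo.fillna(0.0)                        # same value everywhere
filled_per_var = demo.fillna({"A": -1.0, "B": -2.0})  # one value per variable

print(filled_all["A"].values, filled_all["B"].values)
print(filled_per_var["A"].values, filled_per_var["B"].values)
```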

drop_vars(names)¶

We can drop DataArray objects from a Dataset object using the drop_vars() method. It takes a list of DataArray names and returns a new Dataset object without them.

dataset6.drop_vars(names=["Temperature", "Pressure", "PM25"])
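
A short sketch on a made-up Dataset (demo with variables A, B and C, all assumptions) shows the call and the errors="ignore" option, which skips names that are not present instead of raising an error.

```python
import xarray as xr

# Hypothetical Dataset with three variables.
demo = xr.Dataset(
    {
        "A": (("x",), [1, 2]),
        "B": (("x",), [3, 4]),
        "C": (("x",), [5, 6]),
    }
)

trimmed = demo.drop_vars(["B", "C"])

# A missing name raises by default; errors="ignore" skips it silently.
untouched = demo.drop_vars(["Missing"], errors="ignore")

print(list(trimmed.data_vars))
print(sorted(untouched.data_vars))
```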

drop_isel()¶

The drop_isel() method removes a subset of our Dataset object using integer indexing. It takes the same indexers as the isel() method, but drops the selected positions instead of keeping them.

The indexers can be given either as keyword arguments or as a dictionary, just like with isel().

Below we have dropped the 0th position of dimension 'x', the 0th and 1st positions of dimension 'y', and the 0th and 1st positions of dimension 'time'. Every DataArray object in the Dataset loses the corresponding entries.

dataset6.drop_isel(x=[0,], y=[0,1], time=[0,1])

In the below cell, we have created an example which is a copy of our previous cell example with the only change that indexing details are provided as a dictionary.

dataset6.drop_isel({'x':[0,], 'y':[0,1], 'time':[0,1]})

drop_sel()¶

The drop_sel() method works like drop_isel(), but it takes actual coordinate values (labels) instead of integer positions. The labels can be of any data type, such as string, integer, float, or datetime.

It is the counterpart of the sel() indexing method: it removes the entries that match the given labels instead of selecting them.

dataset3.drop_sel(x=["x1",], y=["y4","y5"])

In the below cell, we have created another example demonstrating usage of drop_sel() method which explains how we can give indexing details as a dictionary. The indexing details are the same as our previous cell.

dataset3.drop_sel({'x':["x1",], 'y':["y4","y5"]})
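
To see how drop_isel() and drop_sel() relate, the sketch below drops the same entries once by position and once by label. The Dataset demo and its labels are made up for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical 3x4 Dataset with string coordinates.
demo = xr.Dataset(
    {"Array1": (("x", "y"), np.arange(12).reshape(3, 4))},
    coords={"x": ["x1", "x2", "x3"], "y": ["y1", "y2", "y3", "y4"]},
)

by_position = demo.drop_isel(x=[0], y=[0, 1])         # integer positions
by_label = demo.drop_sel(x=["x1"], y=["y1", "y2"])    # coordinate labels

print(by_position["Array1"].values)
print(by_position.equals(by_label))
```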

get(key)¶

The get() method returns an individual DataArray or coordinate given its name.

Below we have retrieved DataArray object using get() method.

dataset2.get("Array1")

<xarray.DataArray 'Array1' (x: 3, y: 5)>
array([[84, 79, 37, 97, 81],
       [69, 50, 56, 68,  3],
       [85, 40, 67, 85, 48]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) int64 0 1 2 3 4

In the below cell, we have retrieved coordinate values for coordinate 'lon' using get() method.

dataset6.get("lon")

<xarray.DataArray 'lon' (x: 3, y: 5)>
array([[ 1.,  2.,  3.,  4.,  5.],
       [ 6.,  7.,  8.,  9., 10.],
       [11., 12., 13., 14., 15.]])
Coordinates:
    lon      (x, y) float64 1.0 2.0 3.0 4.0 5.0 6.0 ... 11.0 12.0 13.0 14.0 15.0
    lat      (x, y) float64 15.0 16.07 17.14 18.21 ... 26.79 27.86 28.93 30.0
Dimensions without coordinates: x, y
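
Since a Dataset behaves like a Python mapping, get() can also return a fallback for names that are not present. The sketch below uses a made-up Dataset demo with one data variable and one coordinate.

```python
import xarray as xr

# Hypothetical Dataset with a data variable and a non-index coordinate.
demo = xr.Dataset(
    {"Array1": (("x",), [1, 2, 3])},
    coords={"lon": (("x",), [10.0, 20.0, 30.0])},
)

array1 = demo.get("Array1")             # data variable
lon = demo.get("lon")                   # coordinate
missing = demo.get("Missing")           # None, no KeyError
fallback = demo.get("Missing", "n/a")   # explicit default

print(lon.values, missing, fallback)
```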

isnull()¶

The isnull() method checks every value of each DataArray object and returns True where the value is NaN and False otherwise.

dataset2_copy = dataset2.copy().astype(np.float32)
dataset2_copy.Array1[0,0] = np.nan

dataset2_copy.isnull()
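
A common follow-up is to count the missing values per variable by summing the boolean mask. The sketch below does this on a made-up Dataset demo.

```python
import numpy as np
import xarray as xr

# Hypothetical Dataset with a few NaNs.
demo = xr.Dataset(
    {
        "A": (("x",), [1.0, np.nan, 3.0]),
        "B": (("x",), [np.nan, np.nan, 6.0]),
    }
)

mask = demo.isnull()       # boolean Dataset, True where a value is NaN
nan_counts = mask.sum()    # True counts as 1, so this is the NaN count

print(mask["A"].values)
print(int(nan_counts["A"]), int(nan_counts["B"]))
```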

isin(list_of_elements)¶

The isin() method takes an array of elements. It checks every value of each DataArray object and returns True where the value is present in the given array and False where it is not.

dataset2.isin([37,14,33, 84, 0.3861864])
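
The sketch below runs isin() on a made-up Dataset demo whose values are easy to verify. Note that the comparison is exact, so a float copied from a rounded printout (such as 0.3861864 above) may not match the value actually stored in the array.

```python
import xarray as xr

# Hypothetical Dataset with exactly representable values.
demo = xr.Dataset(
    {
        "Array1": (("x",), [84, 79, 37]),
        "Array2": (("x",), [0.5, 0.25, 0.75]),
    }
)

mask = demo.isin([37, 84, 0.25])

print(mask["Array1"].values)
print(mask["Array2"].values)
```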

map(func)¶

The map() method takes a function and applies it to each DataArray (data variable) of the Dataset. The function receives a whole DataArray, not a single value, and the results are collected into a new Dataset. This is similar to calling apply() on the columns of a pandas DataFrame.

Below we have multiplied all values by 10 using map() function.

dataset2.map(lambda x : x*10)

In the below cell, we have again called map() method to multiply all values by 10 but this time we have also asked it explicitly to keep all Dataset attributes.

dataset6.map(lambda x : x*10, keep_attrs=True)
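
Because the function passed to map() receives a whole DataArray, it can use per-variable statistics. As a small sketch on a made-up Dataset demo, the cell below subtracts each variable's own mean and keeps the Dataset attributes.

```python
import xarray as xr

# Hypothetical Dataset with two variables and one attribute.
demo = xr.Dataset(
    {
        "A": (("x",), [1, 2, 3]),
        "B": (("x",), [10.0, 20.0, 60.0]),
    },
    attrs={"Summary": "demo"},
)

# The lambda is called once per data variable with that DataArray.
centered = demo.map(lambda da: da - da.mean(), keep_attrs=True)

print(centered["A"].values)
print(centered["B"].values)
print(centered.attrs)
```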

apply(func)¶

The apply() method is a backward-compatible alias of map() and works the same way. xarray encourages using map() in new code.

dataset2.apply(lambda x : x * 10)

dataset6.apply(lambda x : x*10, keep_attrs=True)

transpose()¶

We can transpose our Dataset object using the transpose() method. Called without arguments, it reverses the dimension order of every DataArray object.

dataset2_transpose = dataset2.transpose()

dataset2_transpose

dataset2_transpose.Array1

<xarray.DataArray 'Array1' (y: 5, x: 3)>
array([[84, 69, 85],
       [79, 50, 40],
       [37, 56, 67],
       [97, 68, 85],
       [81,  3, 48]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) int64 0 1 2 3 4
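
With more than two dimensions we can also pass the desired dimension order explicitly. The sketch below uses a made-up three-dimensional Dataset demo to compare the default reversal with an explicit order.

```python
import numpy as np
import xarray as xr

# Hypothetical Dataset with dimensions (x, y, time).
demo = xr.Dataset({"A": (("x", "y", "time"), np.zeros((2, 3, 4)))})

reversed_dims = demo.transpose()                 # reverse the order
explicit = demo.transpose("time", "x", "y")      # choose the order

print(reversed_dims["A"].dims, reversed_dims["A"].shape)
print(explicit["A"].dims, explicit["A"].shape)
```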

query()¶

The query() method works much like the query() method of a pandas DataFrame. It lets us filter a Dataset object with Python expressions given as strings. We need to provide one expression per dimension.

If you want to know about how query() method works with pandas dataframe with examples then please feel free to check our tutorial on the same.

Pandas query(): Query Pandas DataFrame using Python Expressions

Below we have kept only the entries whose 'x' coordinate is greater than 0.5 and whose 'y' coordinate is greater than 1.5. The returned Dataset object contains only the coordinate values that satisfy both conditions.

dataset2.query(x="x > 0.5", y="y > 1.5")

In the below cell, we have created another example explaining the usage of query() method on xarray Dataset object.

dataset3.query(x="x in ['x1','x2']", y="y in ['y4','y5']")
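
As a rough sketch of what the numeric query above selects, the cell below runs the same expressions on a made-up Dataset demo and compares the result with an explicit sel() call. The expected labels passed to sel() were worked out by hand for this example.

```python
import numpy as np
import xarray as xr

# Hypothetical 3x4 Dataset with integer coordinates.
demo = xr.Dataset(
    {"Array1": (("x", "y"), np.arange(12).reshape(3, 4))},
    coords={"x": [0, 1, 2], "y": [0, 1, 2, 3]},
)

# One expression per dimension; each is evaluated against the coordinate.
queried = demo.query(x="x > 0.5", y="y > 1.5")

# The same subset selected by listing the matching labels.
selected = demo.sel(x=[1, 2], y=[2, 3])

print(queried["Array1"].values)
print(queried.equals(selected))
```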

info()¶

The info() method prints a concise summary of our Dataset object, including dimension sizes, variable data types, and global attributes.

dataset6.info()

xarray.Dataset {
dimensions:
	x = 3 ;
	y = 5 ;
	time = 7 ;

variables:
	int64 Temperature(x, y, time) ;
	float64 Humidity(x, y, time) ;
	float64 Pressure(x, y, time) ;
	float64 WindSpeed(x, y, time) ;
	float64 Precipitation(x, y, time) ;
	float64 PM25(x, y, time) ;
	float64 lon(x, y) ;
	float64 lat(x, y) ;
	datetime64[ns] time(time) ;

// global attributes:
	:Summary = Dataset holds information like temperature, humidity. pressure, windspeed, precipitation and pm 2.5 particle presence based on location (lon, lat) and time. ;
	:lon = Longitude ;
	:lat = Latitude ;
	:time = Date of Record ;
}

dataset3.info()

xarray.Dataset {
dimensions:
	x = 3 ;
	y = 5 ;

variables:
	int64 Array1(x, y) ;
	float64 Array2(x, y) ;
	<U2 x(x) ;
	<U2 y(y) ;

// global attributes:
}
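
If we want the summary as a string instead of printed output, info() accepts a writable buffer through its buf parameter. The sketch below captures the text for a made-up Dataset demo.

```python
import io

import numpy as np
import xarray as xr

# Hypothetical Dataset with one attribute.
demo = xr.Dataset(
    {"Array1": (("x", "y"), np.arange(6).reshape(2, 3))},
    attrs={"Summary": "demo dataset"},
)

# Write the summary into an in-memory text buffer instead of stdout.
buffer = io.StringIO()
demo.info(buf=buffer)
summary = buffer.getvalue()

print(summary)
```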

5. Simple Statistics ¶

In this section, we'll explain how we can compute simple statistics such as min, max, mean, standard deviation, variance, and rolling window functions on a Dataset object.

Please note that every method in this section returns a new Dataset object holding the result of the operation. None of them modifies the Dataset object in place.

min(dim)¶

The min() method finds the minimum element of each DataArray object in our Dataset. We can also provide a dimension name to compute the minimum along that dimension.

dataset2.min()

dataset2.min(dim="x")

argmin(dim)¶

The argmin() method finds the index of the minimum element of each DataArray object in our Dataset. We provide a dimension name to retrieve the indices of the minimum elements along that dimension.

dataset2.argmin(dim="x")

max(dim)¶

The max() method finds the maximum element of each DataArray object in our Dataset. We can also provide a dimension name to compute the maximum along that dimension.

dataset3.max()

dataset3.max(dim="y")

dataset6.max(dim="time")

argmax(dim)¶

The argmax() method returns the index of the maximum element along the given dimension for each DataArray object in our Dataset.

dataset2.argmax(dim="x")
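
The relationship between min()/max() and argmin()/argmax() is easiest to see on a tiny example. The Dataset demo below is made up so that the results can be checked by hand.

```python
import numpy as np
import xarray as xr

# Hypothetical 2x3 Dataset.
demo = xr.Dataset({"A": (("x", "y"), np.array([[3, 1, 2], [0, 5, 4]]))})

col_min = demo.min(dim="x")        # smallest value in each column
col_argmin = demo.argmin(dim="x")  # row position of that value
col_max = demo.max(dim="x")
col_argmax = demo.argmax(dim="x")

print(col_min["A"].values, col_argmin["A"].values)
print(col_max["A"].values, col_argmax["A"].values)
```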

sum(dim)¶

The sum() method returns the sum of all elements of each DataArray object in our Dataset. We can provide a dimension name as an argument to sum along that dimension only.

dataset2.sum()

<xarray.Dataset>
Dimensions:  ()
Data variables:
    Array1   int64 949
    Array2   float64 -4.382

dataset2.sum(dim="x")

mean(dim)¶

The mean() method returns the average of all elements of each DataArray object in our Dataset. We can provide a dimension name as an argument to compute the mean along that dimension only.

dataset2.mean()

<xarray.Dataset>
Dimensions:  ()
Data variables:
    Array1   float64 63.27
    Array2   float64 -0.2922

dataset2.mean(dim="y")

median(dim)¶

The median() method returns the median of each DataArray object in our Dataset. We can provide a dimension name as an argument to compute the median along that dimension only.

dataset2.median()

<xarray.Dataset>
Dimensions:  ()
Data variables:
    Array1   float64 68.0
    Array2   float64 -0.2556

dataset2.median(dim="x")

std(dim)¶

The std() method returns the standard deviation of all elements of each DataArray object in our Dataset. We can provide a dimension name as an argument to compute the standard deviation along that dimension only.

dataset2.std()

<xarray.Dataset>
Dimensions:  ()
Data variables:
    Array1   float64 23.76
    Array2   float64 1.198

dataset2.std(dim="x")

var(dim)¶

The var() method returns the variance of each DataArray object in our Dataset. We can provide a dimension name as an argument to compute the variance along that dimension only.

dataset2.var()

<xarray.Dataset>
Dimensions:  ()
Data variables:
    Array1   float64 564.6
    Array2   float64 1.436

dataset2.var(dim="x")
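
The effect of the dim argument on these reductions can be sketched on a made-up 2x3 Dataset demo whose results are easy to verify. The last two lines also pass ddof=1 to get the sample (rather than population) standard deviation and variance.

```python
import numpy as np
import xarray as xr

# Hypothetical 2x3 Dataset.
demo = xr.Dataset({"A": (("x", "y"), np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))})

total = demo.sum()               # reduce over every dimension
col_sum = demo.sum(dim="x")      # one value per 'y'
row_mean = demo.mean(dim="y")    # one value per 'x'
row_std = demo.std(dim="y", ddof=1)
row_var = demo.var(dim="y", ddof=1)

print(float(total["A"]))
print(col_sum["A"].values, row_mean["A"].values)
print(row_std["A"].values, row_var["A"].values)
```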

cumsum(dim)¶

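The cumsum() method returns the cumulative sum along a given dimension for each DataArray object in our Dataset. When no dimension is given, the sum is accumulated along every dimension, so the last entry equals the total sum of the array.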

dataset2.cumsum()

<xarray.Dataset>
Dimensions:  (x: 3, y: 5)
Dimensions without coordinates: x, y
Data variables:
    Array1   (x, y) int64 84 163 200 297 378 153 282 ... 624 238 407 567 817 949
    Array2   (x, y) float64 1.004 1.39 2.128 3.618 ... 0.1544 -0.2464 -4.382

Expanded values of the data variables:

Array1 (x, y) int64

array([[ 84, 163, 200, 297, 378],
       [153, 282, 375, 540, 624],
       [238, 407, 567, 817, 949]])

Array2 (x, y) float64

array([[ 1.0040539 ,  1.3902403 ,  2.12760887,  3.6183409 ,  2.68250703],
       [ 2.17988294,  1.31218867,  1.41180575,  3.80964297,  1.4451284 ],
       [ 2.03981422,  0.31036506,  0.15436276, -0.24638912, -4.38243679]])

dataset2.cumsum(dim="x")

cumprod(dim)¶

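The cumprod() method returns the cumulative product along a given dimension for each DataArray object in our Dataset. When no dimension is given, the product is accumulated along every dimension. In the output below, the largest Array1 products exceed the int64 range and wrap around, so those entries are not the true products.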

dataset2.cumprod()

<xarray.Dataset>
Dimensions:  (x: 3, y: 5)
Dimensions without coordinates: x, y
Data variables:
    Array1   (x, y) int64 84 6636 ... 8015426386742269952 3834409401829130240
    Array2   (x, y) float64 1.004 0.3878 0.2859 ... -0.008295 0.03139 -0.07435

Expanded values of the data variables:

Array1 (x, y) int64

array([[                 84,                6636,              245532,
                   23816604,          1929144924],
       [               5796,            22894200,         47436782400,
            312893016710400,   76033003060627200],
       [             492660,         77840280000,   10806099030720000,
        8015426386742269952, 3834409401829130240]])

Array2 (x, y) float64

array([[ 1.0040539 ,  0.38775196,  0.28591611,  0.4262243 , -0.39887514],
       [ 1.18059574, -0.57168183,  0.26883791,  0.36353627,  0.48605082],
       [-0.16536453, -0.0690048 , -0.00829486,  0.03139103, -0.07435134]])

dataset2.cumprod(dim="y")
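
Accumulating along a single dimension can be sketched on a made-up 2x3 Dataset demo with small integers, so the running sums and products can be checked by hand.

```python
import numpy as np
import xarray as xr

# Hypothetical 2x3 Dataset with small integers.
demo = xr.Dataset({"A": (("x", "y"), np.array([[1, 2, 3], [4, 5, 6]]))})

running_sum = demo.cumsum(dim="y")    # accumulate along each row
running_prod = demo.cumprod(dim="y")

print(running_sum["A"].values)
print(running_prod["A"].values)
```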

rolling()¶

The rolling() method lets us apply rolling window functions to each DataArray object in our Dataset. It takes the dimension name along which to roll and the window size. We can provide them as a dictionary or as a keyword argument of the method. The rolling object it returns supports various aggregate functions such as min, max, mean, std, and var.

Below we have created a rolling window of size 2 along our 'y' dimension and then calculated the mean of the samples in each window. The first position along 'y' is NaN because its window does not contain 2 samples.

If you want to know how to perform moving window functions in pandas then please feel free to check our tutorial on the same where we cover the topic in detail.

Time Series - Resampling & Moving Window Functions in Python using Pandas

rolling_mean = dataset2.rolling({"y": 2}).mean()

rolling_mean

rolling_mean.Array1

<xarray.DataArray 'Array1' (x: 3, y: 5)>
array([[ nan, 81.5, 58. , 67. , 89. ],
       [ nan, 59.5, 53. , 62. , 35.5],
       [ nan, 62.5, 53.5, 76. , 66.5]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) int64 0 1 2 3 4
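
The leading NaN can be avoided with the min_periods parameter of rolling(), which sets how many samples a window needs before a result is produced. The sketch below compares both settings on a made-up one-dimensional Dataset demo.

```python
import numpy as np
import xarray as xr

# Hypothetical 1-D Dataset.
demo = xr.Dataset({"A": (("y",), [1.0, 2.0, 3.0, 4.0])})

# Default: a window needs all 2 samples, so the first result is NaN.
strict = demo.rolling(y=2).mean()

# min_periods=1: a partial window is enough.
relaxed = demo.rolling(y=2, min_periods=1).mean()

print(strict["A"].values)
print(relaxed["A"].values)
```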

Below we have created another example demonstrating usage of rolling() function. This time we have performed a rolling window on 'time' dimension with a window size of 5. We have then applied the mean aggregate function on samples of each window.

rolling_mean = dataset6.rolling({"time": 5}).mean()

rolling_mean

rolling_mean.Temperature

<xarray.DataArray 'Temperature' (x: 3, y: 5, time: 7)>
array([[[ nan,  nan,  nan,  nan, 45.8, 42.4, 58.4],
        [ nan,  nan,  nan,  nan, 46.6, 41. , 53. ],
        [ nan,  nan,  nan,  nan, 53.2, 42. , 32.6],
        [ nan,  nan,  nan,  nan, 33. , 40.8, 31.6],
        [ nan,  nan,  nan,  nan, 43.2, 42.8, 54.4]],

       [[ nan,  nan,  nan,  nan, 53.2, 57. , 67. ],
        [ nan,  nan,  nan,  nan, 49.2, 48.2, 43. ],
        [ nan,  nan,  nan,  nan, 64.8, 55. , 56.8],
        [ nan,  nan,  nan,  nan, 54.8, 42.2, 33.6],
        [ nan,  nan,  nan,  nan, 57.8, 60.2, 50. ]],

       [[ nan,  nan,  nan,  nan, 56.2, 56.4, 57.2],
        [ nan,  nan,  nan,  nan, 46.6, 50. , 53.2],
        [ nan,  nan,  nan,  nan, 50.6, 37. , 51.8],
        [ nan,  nan,  nan,  nan, 36. , 37.2, 47.6],
        [ nan,  nan,  nan,  nan, 29.4, 38.4, 43.6]]])
Coordinates:
    lon      (x, y) float64 1.0 2.0 3.0 4.0 5.0 6.0 ... 11.0 12.0 13.0 14.0 15.0
    lat      (x, y) float64 15.0 16.07 17.14 18.21 ... 26.79 27.86 28.93 30.0
  * time     (time) datetime64[ns] 2021-01-01 2021-01-02 ... 2021-01-07
Dimensions without coordinates: x, y

resample()¶

The resample() method is useful when one of the dimensions of our Dataset object is a datetime dimension and we want to change its frequency. The frequency can be changed in two ways.

Up Sampling - We increase the frequency, i.e. move from a lower to a higher frequency. E.g - Daily to 6 hourly.
Down Sampling - We decrease the frequency. E.g - Daily to monthly.

The resample() method takes a dimension name and the new frequency as input. We can provide them as a dictionary or as a keyword argument of the resample() method.

If you are interested in learning about resampling using pandas then please feel free to check our tutorial where we discuss resampling in detail.

Time Series - Resampling & Moving Window Functions in Python using Pandas

Down Sampling¶

Below we have down sampled our Dataset from daily frequency to the lower 2-day frequency along the 'time' dimension. After resampling, we have called mean() on the resampled object to average the values that fall into each 2-day bin.

two_day_sampled = dataset6.resample({"time": "2D"})

two_day_sampled

DatasetResample, grouped over '__resample_dim__'
4 groups with labels 2021-01-01, ..., 2021-01-07.

for dt, dset in two_day_sampled:
    print(dt, dset.dims)

2021-01-01T00:00:00.000000000 Frozen({'x': 3, 'y': 5, 'time': 2})
2021-01-03T00:00:00.000000000 Frozen({'x': 3, 'y': 5, 'time': 2})
2021-01-05T00:00:00.000000000 Frozen({'x': 3, 'y': 5, 'time': 2})
2021-01-07T00:00:00.000000000 Frozen({'x': 3, 'y': 5, 'time': 1})

two_day_sampled_mean = two_day_sampled.mean()

two_day_sampled_mean

two_day_sampled_mean.Temperature

<xarray.DataArray 'Temperature' (time: 4, x: 3, y: 5)>
array([[[42.5, 33. , 81. , 54. , 36.5],
        [39.5, 60.5, 46.5, 73. , 70. ],
        [61.5, 35.5, 43. , 12.5, 20. ]],

       [[53. , 60. ,  9. ,  5.5, 60. ],
        [64. , 60.5, 92. , 56. , 60. ],
        [70. , 69.5, 82.5, 64. , 41.5]],

       [[51.5, 28.5, 60. , 54. , 36. ],
        [64.5, 46. , 31. , 25. , 64. ],
        [58. , 31.5,  3. , 21. , 47. ]],

       [[83. , 88. , 25. , 39. , 80. ],
        [78. ,  2. , 38. ,  6. ,  2. ],
        [30. , 64. , 88. , 68. , 41. ]]])
Coordinates:
  * time     (time) datetime64[ns] 2021-01-01 2021-01-03 2021-01-05 2021-01-07
    lon      (x, y) float64 1.0 2.0 3.0 4.0 5.0 6.0 ... 11.0 12.0 13.0 14.0 15.0
    lat      (x, y) float64 15.0 16.07 17.14 18.21 ... 26.79 27.86 28.93 30.0
Dimensions without coordinates: x, y
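
A minimal sketch of reducing a daily series to 2-day means is shown below. The Dataset demo and its six daily values are made up so that each bin average can be checked by hand.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily series with six days of data.
times = pd.date_range("2021-01-01", periods=6)
demo = xr.Dataset(
    {"Temperature": (("time",), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])},
    coords={"time": times},
)

# Each 2-day bin holds two daily values, which mean() averages.
two_day_mean = demo.resample(time="2D").mean()

print(two_day_mean["time"].values)
print(two_day_mean["Temperature"].values)
```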

Up Sampling¶

In this example, we have up sampled our Dataset from daily frequency to the higher 12-hourly frequency. Up sampling generally introduces NaN/Null entries in the dataset because the new datetime entries were not present earlier. We can fill NaN/Null entries using xarray functions like fillna(), ffill(), bfill(), etc.

After up sampling, we have taken an average of the resampled entries. The new 12:00 time stamps contain no original data, so their averages are NaN.

twelve_hour_sampled_mean = dataset6.resample({"time": "12H"}).mean()

twelve_hour_sampled_mean

twelve_hour_sampled_mean["time"]

<xarray.DataArray 'time' (time: 13)>
array(['2021-01-01T00:00:00.000000000', '2021-01-01T12:00:00.000000000',
       '2021-01-02T00:00:00.000000000', '2021-01-02T12:00:00.000000000',
       '2021-01-03T00:00:00.000000000', '2021-01-03T12:00:00.000000000',
       '2021-01-04T00:00:00.000000000', '2021-01-04T12:00:00.000000000',
       '2021-01-05T00:00:00.000000000', '2021-01-05T12:00:00.000000000',
       '2021-01-06T00:00:00.000000000', '2021-01-06T12:00:00.000000000',
       '2021-01-07T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2021-01-01 2021-01-01T12:00:00 ... 2021-01-07

twelve_hour_sampled_mean["Temperature"]

<xarray.DataArray 'Temperature' (time: 13, x: 3, y: 5)>
array([[[82., 38., 90., 23., 51.],
        [51., 93., 64., 97., 87.],
        [97., 23., 72.,  9., 25.]],

       [[nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan]],

       [[ 3., 28., 72., 85., 22.],
        [28., 28., 29., 49., 53.],
        [26., 48., 14., 16., 15.]],

       [[nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan]],

       [[36., 47.,  5.,  3., 45.],
        [99., 74., 89., 72., 21.],
        [44., 55., 87., 96., 12.]],

...

       [[38., 47., 86., 46., 23.],
        [59.,  4., 47., 16., 29.],
        [18., 23.,  2., 27., 24.]],

       [[nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan]],

       [[65., 10., 34., 62., 49.],
        [70., 88., 15., 34., 99.],
        [98., 40.,  4., 15., 70.]],

       [[nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan]],

       [[83., 88., 25., 39., 80.],
        [78.,  2., 38.,  6.,  2.],
        [30., 64., 88., 68., 41.]]])
Coordinates:
  * time     (time) datetime64[ns] 2021-01-01 2021-01-01T12:00:00 ... 2021-01-07
    lon      (x, y) float64 1.0 2.0 3.0 4.0 5.0 6.0 ... 11.0 12.0 13.0 14.0 15.0
    lat      (x, y) float64 15.0 16.07 17.14 18.21 ... 26.79 27.86 28.93 30.0
Dimensions without coordinates: x, y
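
The gaps created by moving to a higher frequency can also be filled directly on the resampled object. The sketch below uses a made-up three-day Dataset demo. It writes the frequency as lowercase "12h", on the assumption that the installed pandas accepts it (recent pandas releases deprecate the uppercase "H" alias).

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily series with three days of data.
times = pd.date_range("2021-01-01", periods=3)
demo = xr.Dataset(
    {"Temperature": (("time",), [1.0, 2.0, 3.0])},
    coords={"time": times},
)

# asfreq() leaves the new 12:00 time stamps empty (NaN).
with_gaps = demo.resample(time="12h").asfreq()

# ffill() carries the last known value forward into the gaps.
forward_filled = demo.resample(time="12h").ffill()

print(with_gaps["Temperature"].values)
print(forward_filled["Temperature"].values)
```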

This ends our short tutorial on how to use the Dataset data structure available in the xarray library. Please feel free to let us know your views in the comments section.

References¶

xarray: Simple Guide to Labeled N-Dimensional Array (DataArray)

Sunny Solanki
