Xarray is a Python library that lets us create N-dimensional arrays just like numpy, but it also lets us name the dimensions of the array. Apart from naming dimensions, it lets us specify coordinate data for each dimension and record attributes with the array. All the operations that we perform on a numpy array using integer indexing can be performed on an xarray array as well, and they can also be performed using dimension names. Code written using xarray becomes more intuitive because we use dimension names instead of integer axis positions. The concepts of dimensions, coordinates, and attributes will become clearer when we explain arrays with examples below.
Xarray provides two important data structures to store data: DataArray and Dataset.
As a part of this tutorial, we'll be discussing only the DataArray data structure. We'll explain with simple examples how to create DataArray objects and how to perform indexing, normal array operations, and simple statistics on them. If you have come to learn about the Dataset data structure, then please feel free to check the below tutorial where we have covered it in detail with examples.
Below we have highlighted important sections of the tutorial to give an overview of the material covered.
We have imported all necessary libraries at the beginning of our tutorial.
import xarray as xr
print("Xarray Version : {}".format(xr.__version__))
import numpy as np
print("Numpy Version : {}".format(np.__version__))
import pandas as pd
print("Pandas Version : {}".format(pd.__version__))
In this section, we'll explain various ways of creating an xarray DataArray object. We'll explore different methods available from xarray to create arrays.
The first and simplest way to create a DataArray is by using the DataArray() constructor available from xarray. We can provide a numpy array, python list, pandas series object, or pandas dataframe object to this constructor to create a DataArray object. Below we have highlighted the signature of DataArray() constructor for reference purposes.
Below we have created our first xarray DataArray using a random numpy array of shape (5,). As it is a 1D array, we have given the dims parameter a single name. We have named the single dimension of our array 'index'.
arr = xr.DataArray(data=np.random.rand(5), dims=["index"])
arr
Below we have created a 2D DataArray of shape 4x5 using a numpy array of random numbers. We have specified two dimension names this time as we have a 2D array.
arr = xr.DataArray(data=np.random.rand(4,5), dims=["index", "columns"])
arr
We can access the data of our array anytime using the data attribute of the DataArray object.
arr.data
Other array attributes like dtype, shape, size, ndim, and nbytes, which are available for a numpy array, are also available for DataArray. The nbytes attribute returns the total number of bytes taken by an array, which is 160 (20*8) in this case (20 float elements, each of size 8 bytes). The sizes attribute is specific to xarray and maps each dimension name to its length.
arr.dtype
arr.nbytes
arr.ndim
arr.shape
arr.size
arr.sizes
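As a quick check of the attributes above, the following sketch builds a small 4x5 array and inspects its dimension names, per-dimension sizes, and memory footprint. The variable name demo is introduced here only for illustration, and the byte count assumes the default float64 dtype returned by np.random.rand.

```python
import numpy as np
import xarray as xr

# Hypothetical 4x5 array used only for this illustration.
demo = xr.DataArray(data=np.random.rand(4, 5), dims=["index", "columns"])

print(demo.dims)         # tuple of dimension names
print(dict(demo.sizes))  # mapping from dimension name to length
print(demo.nbytes)       # 20 float64 elements * 8 bytes each
```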
Below we have created another example explaining how we can create DataArray of 3D shape.
arr = xr.DataArray(data=np.random.rand(2,3,4), dims=["index", "columns", "items"])
arr
In all our previous examples, we only specified dimension names of DataArray but we did not specify coordinates for those dimensions. Now, we'll explain how we can include coordinates for the dimensions of an array.
Below we have created a 2D DataArray using a random numpy array. We have specified coordinates of our array by providing a dictionary to coords parameter. We have defined two dimensions of data (index, columns). The index represents the first dimension of size 4 and columns represents the second dimension of size 5. We have provided a simple python list of size 4 for index dimension and a list of strings for columns dimension. Apart from specifying coordinates, we have also specified the name of an array using name parameter.
When we define an array using dimension values like this, we can access subarray and elements of an array using these values for indexing. We'll be explaining how we can use these values to perform indexing in the upcoming section of the tutorial.
arr1 = xr.DataArray(data=np.random.rand(4,5), dims=['index','columns'],
coords={"index": [0,1,2,3], "columns": list("ABCDE")},
name="Array1"
)
arr1
Below we have created another DataArray using a random numpy array. This time we have specified index dimension values as a list of strings, unlike our previous examples where values were a list of integers.
We'll be using these arrays during the indexing section to explain indexing in different ways using these coordinate values.
arr2 = xr.DataArray(data=np.random.rand(4,5),
dims=['index','columns'],
coords={"index": ['0','1','2','3'], "columns": list("ABCDE")},
name="Array2"
)
arr2
Below we have created another DataArray of shape 4x5 whose data is a random numpy array. This time we have specified the index dimension values as a list of dates. We have used the pandas date_range() function to create a list of 4 daily dates starting from 2021-01-01.
arr3 = xr.DataArray(data=np.random.rand(4,5),
dims=['index','columns'],
coords={"index": pd.date_range(start="2021-01-01", freq="D", periods=4),
"columns": list("ABCDE")},
name="Array3"
)
arr3
In this section, we have explained how we can create an array with attributes.
We have created a DataArray of shape 4x5 using a random numpy array. We have specified dimensions and coordinates as before. Apart from that, we have provided a dictionary to the attrs parameter that describes our dataset. We can describe our data, dimensions, and coordinates in this dictionary.
arr = xr.DataArray(
data=np.random.rand(4,5),
dims=['index','columns'],
coords={"index": ['0','1','2','3'], "columns": list("ABCDE")},
attrs={"index": "X-Dimension of Data",
"columns": "Y-Dimension of Data",
"info": "Pandas DataFrame",
"long_name": "Random Data",
"units": "Unknown"
},
name="Array"
)
arr
We can access the attributes of our DataArray anytime using the attrs attribute, which behaves like a dictionary.
arr.attrs
arr.attrs["index"]
arr.attrs["long_name"]
In this section, we have explained how we can create DataArray from the pandas series.
Below we have first created a pandas series with index and data.
ser = pd.Series([1,2,3,4], index=list("ABCD"),name="col")
ser
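We can create a DataArray by just giving the pandas series as input. It'll take coordinate values from the index of the series. The dimension is named after the index if the index has a name; otherwise a default name such as dim_0 is used.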
arr_ser = xr.DataArray(ser)
arr_ser
In this section, we have explained how we can create DataArray from pandas dataframe.
Below we have created pandas dataframe with random data. We have also provided dataframe index values and column names.
df = pd.DataFrame(np.random.rand(4,5), index=[0,1,2,3], columns=list("ABCDE"))
df
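We can create a DataArray from a pandas dataframe directly. It'll take coordinate values from the index and column labels of the dataframe. The dimensions are named after the index and columns if they have names; otherwise default names such as dim_0 and dim_1 are used.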
arr_df = xr.DataArray(df)
arr_df
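The naming behaviour described above can be sketched as follows. The names demo_df, unnamed, and named are introduced here only for illustration, and the example assumes a dataframe whose index and columns start out without names.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical dataframe used only for this illustration.
demo_df = pd.DataFrame(np.random.rand(4, 5), index=[0, 1, 2, 3], columns=list("ABCDE"))

# Unnamed index/columns: xarray falls back to generic dimension names.
unnamed = xr.DataArray(demo_df)
print(unnamed.dims)

# Naming the index and the columns carries those names over as dimension names.
named_df = demo_df.rename_axis(index="index", columns="columns")
named = xr.DataArray(named_df)
print(named.dims)
```

Naming the axes on the pandas side is one way to get meaningful dimension names; renaming the dimensions after conversion is another.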
In this section, we'll explain how we can perform indexing operations on xarray DataArray. We can do normal numpy indexing using integers as well as indexing using coordinate values that we specified when creating arrays. We'll be performing indexing on arrays that we created during the array creation section earlier.
In this section, we have performed normal numpy-like integer indexing on our xarray DataArray.
Below we have accessed the 0th element of our 2D array which we created earlier.
arr1[0]
Below we have accessed all elements of the first dimension and the 0th elements of the second dimension. This will be like accessing 1 column of 2D array.
arr1[:, 0]
Below we have accessed the 0th and 1st row of our data.
arr1[[0,1]]
Below we have accessed the 0th and 1st column of our 2D array.
arr1[:,[0,1]]
Below we have accessed 2D array of shape 2x2 from our original 4x5 array.
arr1[[1,2],[0,1]]
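One difference from numpy is worth noting for the last example: when several lists are used as indices, xarray treats each list independently (one list per dimension), while numpy pairs the lists up element by element. The sketch below illustrates this with a small deterministic array; the names values, demo, block, and paired are introduced here only for illustration.

```python
import numpy as np
import xarray as xr

values = np.arange(20).reshape(4, 5)
demo = xr.DataArray(values, dims=["index", "columns"])

# xarray: rows 1-2 crossed with columns 0-1, giving a 2x2 block.
block = demo[[1, 2], [0, 1]]

# numpy: the lists are paired up, giving elements (1, 0) and (2, 1).
paired = values[[1, 2], [0, 1]]

print(block.shape, paired.shape)
```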
The xarray DataArray provides a loc property which we can use to index arrays as we do with a pandas dataframe. The loc property lets us specify the coordinate values that we provided when we created the array. The coordinate values can be of any type (string, date, time, etc.), not only integer.
Below we have accessed the first element of the first dimension of our DataArray which we created earlier.
arr1.loc[0]
Below we have accessed the sub-array by using loc property. We have accessed the sub-array which crosses the 0th element of the first dimension and the first two values of the second dimension. We have used string values for indexing DataArray this time.
arr1.loc[0, ["A","B"]]
Below we have accessed the first value of the 0th dimension of our DataArray which we created earlier using the loc property. We have used a string value to access it.
arr2.loc['0']
Below we have accessed another sub-array from our original DataArray using all indices as string values inside of loc property.
arr2.loc['0', ["A","B","C"]]
Below we have accessed the sub-array from our array where we had first dimension coordinates specified as date values. We have specified the date value as a string.
arr3.loc["2021-1-1", ['A','B']]
Below we have created another example where we access a sub-array from our array with the date dimension. We have specified a list of dates as strings this time to access the sub-array.
arr3.loc[["2021-1-1","2021-1-3"], ['A','B']]
In this example, we have accessed sub-array from our date dimension array by providing date dimension coordinates as a list of dates. We have created a list of 3 dates using date_range() function and provided it to filter first dimension values.
three_days = pd.date_range(start="2021-1-1",periods=3)
arr3.loc[three_days, ["A","B","C"]]
The xarray DataArray has a method named isel() which lets us specify integer positions along each dimension and access the sub-array of the original array at those positions.
In order to perform indexing using isel() method, we can provide dimension names and their values either as a dictionary or we can provide them as if they are parameters of the methods as well. We'll explain with examples below how we can use this method to perform indexing to make things clear.
Below we have retrieved the 0th element of the 'index' dimension of the array using isel() method. We have provided value to the dimension as if it is a parameter of the method.
arr1.isel(index=0)
Below we have recreated our previous example by providing the integer position for the dimension as a dictionary. This has the same effect as the previous cell.
arr1.isel({'index':0})
Below we have retrieved a 2D array of shape 2x4 using the isel() method. We have provided two integer positions for the 'index' dimension and four integer positions for the 'columns' dimension.
arr1.isel(index=[0,1], columns=[0,1,2,3])
Below we have recreated our previous example by providing the integer positions as a dictionary.
arr1.isel({'index':[0,1], 'columns':[0,1,2,3]})
The xarray DataArray provides a method named sel() which works like isel() but it can accept the actual value of coordinates to access sub-arrays rather than integer indexing. We can provide values as either dictionary or as if they are parameters of the method.
Below we have retrieved a sub-array of shape 3x5 from our original array using sel() method. The 'index' dimension has coordinate values as integers hence we have provided them as integers.
arr1.sel(index=[0,1,2])
Below we have accessed a sub-array of shape 3x5 from our original array using the sel() method. This time we have provided coordinate values as a list of strings because this array has 'index' dimension values stored as strings.
arr2.sel(index=['0','1','2'])
Below we have accessed another 3x3 array from our original array using sel() method. We have provided coordinate values for both dimensions as a list of strings.
arr2.sel(index=['0','1','2'], columns=['A','C','E'])
Below we have created another example demonstrating the use of the sel() method. This time we are accessing a sub-array of the array whose 'index' dimension holds dates.
arr3.sel(index=["2021-1-1","2021-1-2", "2021-1-3"], columns=['A','B'])
Below we have created one more example demonstrating the use of sel() method. We have created a list of dates using the pandas date_range() function to access the sub-array based on it. We have provided this list of dates to the 'index' dimension of an array. For other 'columns' dimension, we have provided a list of 3 strings.
three_days = pd.date_range(start="2021-1-1",periods=3)
arr3.sel(index=three_days, columns=['A','B', 'C'])
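Besides lists of labels, sel() also accepts Python slice objects built from labels. The sketch below, which uses a small deterministic array named demo introduced only for illustration, shows that label-based slices include both endpoints, whereas the integer slices passed to isel() exclude the stop position.

```python
import numpy as np
import pandas as pd
import xarray as xr

demo = xr.DataArray(
    np.arange(20).reshape(4, 5),
    dims=["index", "columns"],
    coords={
        "index": pd.date_range(start="2021-01-01", freq="D", periods=4),
        "columns": list("ABCDE"),
    },
)

# Label slices include both endpoints.
by_label = demo.sel(index=slice("2021-01-01", "2021-01-03"), columns=slice("B", "D"))

# Integer slices exclude the stop position, as in numpy.
by_position = demo.isel(index=slice(0, 3), columns=slice(1, 4))

print(by_label.shape, by_position.shape)
```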
In this section, we'll explain some of the commonly performed operations with arrays like addition, multiplication with scalar, transpose, dot product, null elements check, etc. We'll try to explain as many simple operations as possible with simple examples.
We can retrieve the transpose of an array by calling T attribute on the array or by calling transpose() method on it.
arr1_transpose = arr1.T # arr1.transpose() works same
arr1_transpose
We can easily multiply, add, subtract and perform many other operations using scalar.
arr1 * 10
When we add two arrays, xarray aligns them by dimension names and coordinate values, and the result contains only the coordinate labels present in both arrays. Adding arrays therefore gives a full-sized result only when their coordinate values match.
Below we first add our first array to itself, and then add arr and arr2, which share the same dimension names and coordinate values.
arr1 + arr1
arr + arr2
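To illustrate what happens when coordinates only partly overlap, here is a minimal sketch with two small 1D arrays. The names a, b, and total are introduced only for illustration, and the sketch relies on xarray's default alignment behaviour for arithmetic.

```python
import xarray as xr

a = xr.DataArray([1.0, 2.0, 3.0], dims=["x"], coords={"x": [0, 1, 2]})
b = xr.DataArray([10.0, 20.0, 30.0], dims=["x"], coords={"x": [1, 2, 3]})

# Only labels 1 and 2 exist in both arrays, so only they survive the addition.
total = a + b
print(total)
```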
We can retrieve an index of the maximum element in the array using argmax() method.
Below we have retrieved the index of the maximum element of one of our arrays.
max_index = arr1.argmax()
max_index
We can call item() method on an array with one element to access it.
We can use the same item() method with index to retrieve an element at that index value. Below we are retrieving the maximum element using item() method.
arr1.item(max_index.item())
The item() method can also accept a tuple of indices for arrays with more than one dimension to extract the individual element.
arr1.item((0,0))
As we had said earlier, the majority of array operations that we perform on a numpy array can be performed on an xarray DataArray as well. The major difference is that DataArray lets us perform those operations based on either dimension name or axis index, whereas a numpy array lets us perform them based only on axis index.
Below we have tried to get indices of maximum values across 'index' dimension of an array.
max_indices = arr1.argmax(dim='index', skipna=True)
max_indices
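The idxmax() method works like argmax() with one difference: it returns the coordinate label of the maximum along the dimension rather than its integer position. Because the 'index' coordinates of arr1 are the integers 0 to 3, the labels match the positions here, although they may be displayed as floats.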
max_indices = arr1.idxmax(dim='index',skipna=True)
max_indices
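The difference between positions and labels is easier to see when the coordinate values do not coincide with the integer positions. The sketch below uses a small 1D array named demo with coordinates 10, 20, and 30, all introduced only for illustration.

```python
import xarray as xr

demo = xr.DataArray([0.1, 0.9, 0.4], dims=["x"], coords={"x": [10, 20, 30]})

position = demo.argmax(dim="x")  # integer position of the maximum
label = demo.idxmax(dim="x")     # coordinate label at that position

print(position.item(), label.item())
```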
The argmin() method can be used to retrieve an index of minimum values.
Below we have retrieved indices of minimum values across 'columns' dimension.
There is an idxmin() method as well, which returns the coordinate labels of the minimum values instead of their integer positions.
min_indices = arr1.argmin(dim='columns')
min_indices
The isnull() method detects NaN/None values in an array. It returns an array of the same shape as the original array with boolean values indicating the presence/absence of NaN/None values.
arr1.isnull()
The where() method lets us perform the conditional operation on an array. Its first argument is condition and the second argument is a value that should be taken in the case where the condition evaluates to False.
Below we have printed two of our earlier arrays as a reference as we'll be testing where() function on them.
arr, arr2
Below we have called where() method on arr array checking for a condition where the value of an array is greater than 0.5. Whenever value is greater than 0.5 take value from arr else take value from arr2.
arr.where(arr > 0.5, arr2)
Below we have explained the usage of where() method with another example.
arr.where(arr2 > 0.5, arr)
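Since the arrays above are random, the effect of where() can be easier to follow on fixed values. The following sketch uses two small arrays a and b, introduced only for illustration, and also shows what happens when the second argument is left out.

```python
import xarray as xr

a = xr.DataArray([0.2, 0.8, 0.6], dims=["x"])
b = xr.DataArray([1.0, 2.0, 3.0], dims=["x"])

# Keep values of a where a > 0.5, otherwise take the value from b.
mixed = a.where(a > 0.5, b)

# With no second argument, positions where the condition is False become NaN.
masked = a.where(a > 0.5)

print(mixed.values, masked.values)
```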
We can perform the dot product of two arrays using the xr.dot() function. We can perform dot products based on dimension names as well.
Below we have performed dot product of two arrays based on dimension 'columns' present in both.
xr.dot(arr, arr2, dims=["columns"])
Below we have performed the dot product of two arrays based on dimension 'index' present in both.
xr.dot(arr, arr2, dims=["index"])
Below we have performed the dot product without specifying any dimension name. In that case, it sums over all dimensions the arrays have in common.
xr.dot(arr,arr2)
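A dot product along a named dimension is an elementwise multiplication followed by a sum over that dimension. The sketch below makes this explicit on small deterministic arrays; the names a, b, full, and along_columns are introduced only for illustration.

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=["index", "columns"])
b = xr.DataArray(np.ones((2, 3)), dims=["index", "columns"])

# With no dimension given, all shared dimensions are summed: a scalar here.
full = xr.dot(a, b)

# Multiplying elementwise and summing over one dimension gives the
# dot product along that dimension only.
along_columns = (a * b).sum(dim="columns")

print(full.item(), along_columns.values)
```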
The drop() method can be used to drop values in an array based on a dimension and coordinate labels of that dimension. It accepts two values as input. The first value is a list of coordinate labels and the second value is the dimension name; it then drops the entries of that dimension that have the specified labels. Recent xarray releases recommend drop_sel() (covered below) for this purpose and keep drop() for backward compatibility.
Below we have dropped the entries of the 'index' dimension that have coordinate values [0,1].
arr1.drop(labels=[0,1], dim="index")
Below we have created another example demonstrating the use of drop() method in cases where coordinate values are not integers.
arr2.drop(labels=['0','1'], dim="index")
Below we have created another example demonstrating the use of drop() method. This time we are dropping values across 'columns' dimension of our array.
arr1.drop(labels=["D","E"], dim="columns")
The drop_isel() method works like the drop() method, but it lets us specify integer positions along a dimension instead of the original coordinate values, which can be of other data types as well.
Like isel(), it lets us specify the positions for each dimension either as a dictionary or as if they are parameters of the method.
Below we have dropped the element at integer position 0 of dimension 'index'.
arr1.drop_isel({"index":0})
Below we have dropped the elements at integer positions 0 and 1 of dimension 'index'.
arr1.drop_isel({"index":[0,1]})
Below we have created another example demonstrating the use of drop_isel() method to drop values across multiple dimensions of the array.
arr1.drop_isel({"index":[0,1], "columns": [2,3,4]})
The drop_sel() method works exactly like drop_isel(), with the only difference that it accepts the original coordinate values of a dimension instead of integer positions.
Below we have dropped elements from the array whose coordinate values are [0,1] for dimension 'index' and ["C","D","E"] for dimension 'columns'.
arr1.drop_sel({"index":[0,1], "columns": ["C","D","E"]})
Below we have created another example demonstrating the use of drop_sel() method across multiple dimensions.
arr2.drop_sel({"index":['0','1'], "columns": ["C","D","E"]})
We can call the copy() method on an xarray DataArray to create a copy of it. By default, this creates a new array with its own memory, so any modification to the new array won't be reflected in the original array from which it was copied.
arr1_copy = arr1.copy()
arr1_copy
The dropna() method lets us drop entries along a dimension of an array that contain NaN values. It accepts the dimension name as the first parameter (dim) and the drop strategy as the second parameter (how). There are two strategies: 'any' (the default) drops an entry if any of its values is NaN, and 'all' drops an entry only if all of its values are NaN.
Below we have set a few entries to NaN in our array which we created by copying one of our existing arrays.
arr1_copy[0,3] = np.nan
arr1_copy[2,4] = np.nan
arr1_copy
Below we have called the dropna() method to drop entries along the 'index' dimension. It'll drop every row that contains even a single NaN.
arr1_copy.dropna(dim="index")
Below we have called the dropna() method to drop entries along the 'columns' dimension. It'll drop every column that contains even a single NaN.
arr1_copy.dropna(dim="columns")
We can use the fillna() method to fill NaN values in the array. It accepts a value as input which replaces all NaNs.
arr1_copy.fillna(value=9.99999)
The drop_duplicates() method lets us drop entries with duplicate coordinate values along a dimension. We need to provide the dimension name along which we want to drop duplicates. Note that it compares coordinate labels, not the data stored in the array.
Below we have first created a copy of one of our existing arrays and then copied the data of one column into another. We can notice from the array printed below that the 1st and 3rd columns have the same data. Because the column labels 'A' to 'E' are still unique, the drop_duplicates() call below returns the array unchanged.
arr1_copy = arr1.copy()
arr1_copy[:, 2] = arr1_copy[:, 0]
arr1_copy
arr1_copy.drop_duplicates(dim='columns')
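To see drop_duplicates() actually remove something, the coordinate labels themselves need to repeat. The sketch below builds a small array named demo whose 'x' coordinate contains the label 0 twice, and a second array named same_data with repeated data but unique labels; both are introduced only for illustration.

```python
import xarray as xr

# The label 0 appears twice along "x".
demo = xr.DataArray([1, 2, 3], dims=["x"], coords={"x": [0, 0, 1]})
deduped = demo.drop_duplicates(dim="x")  # keeps the first occurrence by default

# Repeated data with unique labels is left untouched.
same_data = xr.DataArray([5, 5, 5], dims=["x"], coords={"x": [0, 1, 2]})
untouched = same_data.drop_duplicates(dim="x")

print(deduped.values, untouched.values)
```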
The clip() method lets us restrict the values of an array to a range between minimum and maximum values that we specify. It accepts two values as input, where the first value is the minimum and the second value is the maximum. It then replaces all values less than the minimum with the minimum and all values greater than the maximum with the maximum.
Below we have tried to restrict values of our array in the range [0.3,0.6] using clip() method.
arr1.clip(min=0.3, max=0.6)
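On fixed values the effect of clip() is easy to verify by eye. The sketch below uses a three-element array named demo, introduced only for illustration, with one value below the range, one inside it, and one above it.

```python
import xarray as xr

demo = xr.DataArray([0.1, 0.45, 0.9], dims=["x"])

# Values below 0.3 are raised to 0.3; values above 0.6 are lowered to 0.6.
clipped = demo.clip(min=0.3, max=0.6)

print(clipped.values)
```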
We can combine arrays along a dimension using the xr.concat() function. It accepts a list of arrays as the first parameter and a dimension name as the second parameter. It then joins the arrays along that dimension.
Below we have combined two arrays across 'index' dimension.
xr.concat((arr,arr1), dim="index")
Below we have combined two arrays across 'columns' dimension.
xr.concat((arr,arr2), dim="columns")
The round() method rounds the float values of an array, to zero decimal places by default.
arr1.round()
In this section, we'll explain how we can perform simple statistics like sum, mean, variance, standard deviation, cumulative sum, cumulative product, etc.
The sum() function can calculate sum across dimensions. If we don't provide dimension then it'll calculate the sum of all elements of the array.
Below we have first calculated the sum of all elements of the array. Then in the next cell, we have calculated the sum across 'index' dimension.
arr1.sum()
arr1.sum(dim="index")
The min() function returns minimum values across dimensions.
Below we have first retrieved the minimum value of the whole array. Then in the next cell, we have retrieved minimum values across 'columns' dimension of the array.
arr1.min()
arr1.min(dim="columns")
The max() method works exactly like min() but returns maximum values instead.
arr1.max()
arr1.max(dim="index")
The std() method helps us calculate a standard deviation across different dimensions of an array. Below we have explained the usage with simple examples.
arr1.std()
arr1.std(dim="columns")
The var() function helps us calculate variance across dimensions of array.
arr1.var()
arr1.var(dim="index")
The median() function helps us find the median across different dimensions of the array.
arr1.median()
arr1.median(dim="index")
The count() method counts the number of non-NaN elements across dimensions of the array.
arr1.count()
arr1.count(dim="index")
The cumprod() function helps us calculate cumulative product across different dimensions of the array.
Below we have first calculated cumulative product across 'index' dimension of the array and then in the next cell, we have calculated cumulative product across 'columns' dimension of the array.
arr1.cumprod(dim='index')
arr1.cumprod(dim='columns')
The cumsum() function helps us find cumulative sum across different dimensions of the array and works exactly like cumprod() function.
arr1.cumsum(dim='index')
arr1.cumsum(dim='columns')
The corr() function helps us find the Pearson correlation coefficient across different dimensions of an array.
Below we have calculated the correlation between two arrays of the same dimensions. Then we have calculated correlation across 'index' dimension and 'columns' dimensions respectively. It'll take 1D arrays from 2D arrays based on dimensions and find out the correlation between them.
xr.corr(arr, arr2)
xr.corr(arr, arr2, dim="index")
xr.corr(arr, arr2, dim="columns")