Updated On : Nov-11,2021
Time Investment : ~30 mins
xarray: Simple Guide to Labeled N-Dimensional Array (DataArray)¶
Xarray is a Python library that lets us create N-dimensional arrays just like numpy, but it also lets us name the dimensions of the array. Apart from naming dimensions, it lets us specify coordinate data for each dimension and record attributes alongside the array. All the operations that we perform on a numpy array using integer indexing can be performed on an xarray array as well, and they can also be performed using dimension names. Code written using xarray becomes more intuitive because we use dimension names instead of integer positions. The concepts of dimensions, coordinates, and attributes will become clearer when we explain arrays with examples below.
Xarray provides two important data structures to store data.
DataArray - It's a data structure that is used to represent an N-dimensional array.
Dataset - It's a dict-like container holding multiple DataArray objects that are aligned across shared dimensions.
As a part of this tutorial, we'll be discussing only the DataArray data structure. We'll explain with simple examples how to create arrays, perform indexing, apply normal array operations, and compute simple statistics. If you have come to learn about the Dataset data structure then please feel free to check our separate tutorial where we have covered it in detail with examples.
In this section, we'll explain various ways of creating an xarray DataArray object. We'll explore different methods available from xarray to create arrays.
The first and simplest way to create a DataArray is by using the DataArray() constructor available from xarray. We can provide a numpy array, python list, pandas series, or pandas dataframe to this constructor. Below we have highlighted the signature of the DataArray() constructor for reference purposes.
DataArray(data, coords=None, dims=None, name=None, attrs=None) - This constructor takes as input a numpy array, python list, pandas series, or pandas dataframe and creates an instance of DataArray. All other parameters are optional, and in the examples of this tutorial they are passed by keyword.
The dims parameter accepts a list of strings that names each dimension of the array. For a 1D array we need to provide a list with one name, for a 2D array a list with 2 names, for a 3D array a list with 3 names, and so on.
The coords parameter accepts a dictionary specifying values for each dimension which will be used when indexing the array. The key of the dictionary is the name of the dimension and the value is a list whose length equals the size of that dimension. E.g. - For a 2D array of shape 3x5, we can provide a dictionary with 2 dimensions where one has a list of 3 values and the other has a list of 5 values.
The attrs parameter accepts a dictionary of attributes that we want to attach to the array to describe it.
The name parameter accepts a string specifying the name of the array.
Below we have created our first xarray DataArray using a random numpy array of shape (5,). As it is a 1D array, we have given the dims parameter a single name. We have named the single dimension of our array index.
Below we have created another example where we have created a 2D DataArray of shape 4x5 using a numpy array of random numbers. We have specified two dimension names this time as we have a 2D array.
Other array attributes like dtype, shape, size, ndim, and nbytes which are available for a numpy array are also available for a DataArray. The nbytes attribute returns the total number of bytes taken by an array, which is 160 (20*8) in this case (20 float elements, each of size 8 bytes). The sizes attribute additionally maps each dimension name to its length.
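A minimal sketch of such a 1D array is given below. The seeded generator, the variable name arr_1d, and the values passed to attrs and name are illustrative assumptions, not taken from the original example.

```python
import numpy as np
import xarray as xr

# Seeded generator so reruns produce the same numbers (illustrative choice).
rng = np.random.default_rng(0)

# 1D data needs exactly one dimension name.
arr_1d = xr.DataArray(
    data=rng.random(5),
    dims=["index"],
    attrs={"units": "Unknown"},  # hypothetical attribute
    name="Array1D",              # hypothetical name
)

print(arr_1d.dims, arr_1d.shape, arr_1d.name)
```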
arr.dtype
dtype('float64')
arr.nbytes
160
arr.ndim
2
arr.shape
(4, 5)
arr.size
20
arr.sizes
Frozen({'index': 4, 'columns': 5})
Below we have created another example explaining how we can create DataArray of 3D shape.
In all our previous examples, we only specified dimension names of DataArray but we did not specify coordinates for those dimensions. Now, we'll explain how we can include coordinates for the dimensions of an array.
Below we have created a 2D DataArray using a random numpy array. We have specified coordinates of our array by providing a dictionary to the coords parameter. We have defined two dimensions of data (index, columns). The index represents the first dimension of size 4 and columns represents the second dimension of size 5. We have provided a simple python list of 4 integers for the index dimension and a list of 5 strings for the columns dimension. Apart from specifying coordinates, we have also specified the name of the array using the name parameter.
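The description above can be sketched roughly as follows. The variable name arr1, the array name "Array1", and the seed are assumptions made for illustration.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

arr1 = xr.DataArray(
    data=rng.random((4, 5)),
    dims=["index", "columns"],
    # One list per dimension; lengths must match the dimension sizes (4 and 5).
    coords={"index": [0, 1, 2, 3], "columns": list("ABCDE")},
    name="Array1",
)

print(arr1.coords)
```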
When we define an array using dimension values like this, we can access subarray and elements of an array using these values for indexing. We'll be explaining how we can use these values to perform indexing in the upcoming section of the tutorial.
Below we have created another DataArray using a random numpy array. This time we have specified index dimension values as a list of strings, unlike our previous examples where values were a list of integers.
We'll be using these arrays during the indexing section to explain indexing in different ways using these coordinate values.
Below we have created another DataArray of shape 4x5 whose data is a random numpy array. This time we have specified index dimension value as a list of dates. We have used the pandas date_range() function to create a list of dates starting from 2020-1-1.
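A possible version of the date-indexed array is sketched below. The name arr3 and the daily frequency are assumptions; only the start date 2020-1-1 comes from the text.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)  # illustrative seed

# Four consecutive days starting 2020-01-01 (daily frequency is assumed).
dates = pd.date_range(start="2020-01-01", periods=4, freq="D")

arr3 = xr.DataArray(
    data=rng.random((4, 5)),
    dims=["index", "columns"],
    coords={"index": dates, "columns": list("ABCDE")},
    name="Array3",  # hypothetical name
)

print(arr3["index"].values)
```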
In this section, we have explained how we can create an array with attributes.
We have created a DataArray of shape 4x5 using a random numpy array. We have specified dimensions and coordinates like we were doing till now. Apart from that, we have provided a dictionary to attrs parameter explaining our dataset. We can describe our data, dimensions, and coordinates in this dictionary.
arr = xr.DataArray(data=np.random.rand(4, 5),
                   dims=['index', 'columns'],
                   coords={"index": ['0', '1', '2', '3'], "columns": list("ABCDE")},
                   attrs={"index": "X-Dimension of Data", "columns": "Y-Dimension of Data",
                          "info": "Pandas DataFrame", "long_name": "Random Data", "units": "Unknown"},
                   name="Array")
arr
<xarray.DataArray 'Array' (index: 4, columns: 5)>
array([[0.38733228, 0.23109638, 0.66964265, 0.6708009 , 0.95829975],
[0.7713564 , 0.1166787 , 0.6483082 , 0.75409353, 0.76900532],
[0.54839661, 0.72950701, 0.65034097, 0.92334631, 0.70863973],
[0.0655017 , 0.56941354, 0.59030199, 0.5371372 , 0.45977435]])
Coordinates:
* index (index) <U1 '0' '1' '2' '3'
* columns (columns) <U1 'A' 'B' 'C' 'D' 'E'
Attributes:
index: X-Dimension of Data
columns: Y-Dimension of Data
info: Pandas DataFrame
long_name: Random Data
units: Unknown
We can create a DataArray from a pandas dataframe directly. It'll take the dimension names and coordinate values from the index and columns of the pandas dataframe.
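As a small sketch of this, the dataframe below is hypothetical and uses np.arange values so the result is easy to inspect. Naming the index and columns axes is an assumption made here so that the resulting dimensions get readable names.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical dataframe with a named index and named columns axis.
df = pd.DataFrame(
    np.arange(20).reshape(4, 5),
    index=pd.Index([0, 1, 2, 3], name="index"),
    columns=pd.Index(list("ABCDE"), name="columns"),
)

# Coordinates are inferred from the dataframe's index and columns.
arr_df = xr.DataArray(df)

print(arr_df.dims)
print(arr_df.coords)
```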
In this section, we'll explain how we can perform indexing operations on xarray DataArray. We can do normal numpy indexing using integers as well as indexing using coordinate values that we specified when creating arrays. We'll be performing indexing on arrays that we created during the array creation section earlier.
Below we have accessed all elements of the first dimension and the 0th element of the second dimension. This is like accessing one column of a 2D array.
The xarray DataArray provides a loc property which we can use to index arrays as we do with a pandas dataframe. The loc property lets us specify the coordinate values that we provided when we created the array. The coordinate values can be of any type (string, date, time, etc.), not only integers.
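Positional indexing works the same way as in numpy. The sketch below uses a small np.arange array (an assumption, chosen so the selected values are predictable) instead of random data.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(20).reshape(4, 5), dims=["index", "columns"])

# All entries of the first dimension, 0th entry of the second: one column.
first_col = arr[:, 0]

print(first_col.dims)    # only the 'index' dimension remains
print(first_col.values)  # [ 0  5 10 15]
```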
Below we have accessed the first element of the first dimension of our DataArray which we created earlier.
Below we have accessed the sub-array by using loc property. We have accessed the sub-array which crosses the 0th element of the first dimension and the first two values of the second dimension. We have used string values for indexing DataArray this time.
Below we have accessed the first value of the 0th dimension of our DataArray, which we created earlier, using the loc property. We have used a string value to access it.
Below we have accessed the sub-array from our array where we had first dimension coordinates specified as date values. We have specified the date value as a string.
Below we have created another example where we are accessing a sub-array from our array with a date dimension. We have specified a list of dates as strings this time to access the sub-array.
In this example, we have accessed sub-array from our date dimension array by providing date dimension coordinates as a list of dates. We have created a list of 3 dates using date_range() function and provided it to filter first dimension values.
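The label-based lookups described above can be sketched as below. The arrays arr_str and arr_dt and their coordinate values are hypothetical stand-ins for the arrays created earlier, filled with np.arange values for predictability.

```python
import numpy as np
import pandas as pd
import xarray as xr

data = np.arange(20).reshape(4, 5)

# String labels on both dimensions.
arr_str = xr.DataArray(
    data, dims=["index", "columns"],
    coords={"index": list("abcd"), "columns": list("ABCDE")},
)
# Row 'a', first two columns.
sub = arr_str.loc["a", ["A", "B"]]

# Date labels on the first dimension.
arr_dt = xr.DataArray(
    data, dims=["index", "columns"],
    coords={"index": pd.date_range("2020-01-01", periods=4, freq="D"),
            "columns": list("ABCDE")},
)
# A single date given as a string.
one_day = arr_dt.loc["2020-01-02"]
# Several dates given as a DatetimeIndex built with date_range().
three_days = arr_dt.loc[pd.date_range("2020-01-01", periods=3, freq="D")]

print(sub.values, one_day.values, three_days.shape)
```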
The xarray DataArray has a method named isel() which lets us specify integer positions per dimension name and returns the sub-array of the original array at those positions.
To index with isel(), we can provide dimension names and their positions either as a dictionary or as keyword arguments of the method. We'll explain with examples below how we can use this method to perform indexing.
Below we have retrieved the 0th element of the 'index' dimension of the array using isel() method. We have provided value to the dimension as if it is a parameter of the method.
Below we have recreated our previous example by providing coordinate value for dimension as a dictionary. This has the same effect as the previous cell.
Below we have retrieved a 2D array of shape 2x4 using the isel() method. We have provided two positions for the 'index' dimension and 4 positions for the 'columns' dimension.
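A short sketch of the three isel() calls described above, using a hypothetical np.arange array so that shapes and values are easy to verify:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(20).reshape(4, 5), dims=["index", "columns"])

# Keyword form: 0th position along 'index'.
row0 = arr.isel(index=0)

# Dictionary form: same selection.
row0_dict = arr.isel({"index": 0})

# Two positions along 'index', four along 'columns' -> shape (2, 4).
sub = arr.isel(index=[0, 1], columns=[0, 1, 2, 3])

print(row0.values, sub.shape)
```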
Indexing Based on Dimension Data using sel() Function¶
The xarray DataArray provides a method named sel() which works like isel() but accepts the actual coordinate values to access sub-arrays rather than integer positions. We can provide values either as a dictionary or as keyword arguments of the method.
Below we have retrieved a sub-array of shape 3x5 from our original array using the sel() method. The 'index' dimension has integer coordinate values, hence we have provided them as integers.
Below we have accessed a sub-array of shape 3x5 from our original array using the sel() method. This time we have provided coordinate values as a list of strings because this array has its 'index' dimension values stored as strings.
Below we have accessed another 3x3 array from our original array using sel() method. We have provided coordinate values for both dimensions as a list of strings.
Below we have created one more example demonstrating the use of sel() method. We have created a list of dates using the pandas date_range() function to access the sub-array based on it. We have provided this list of dates to the 'index' dimension of an array. For other 'columns' dimension, we have provided a list of 3 strings.
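The sel() calls above might look like the following sketch. The arrays arr_int and arr_str and their labels are illustrative assumptions.

```python
import numpy as np
import xarray as xr

data = np.arange(20).reshape(4, 5)
dims = ["index", "columns"]

arr_int = xr.DataArray(data, dims=dims,
                       coords={"index": [0, 1, 2, 3], "columns": list("ABCDE")})
arr_str = xr.DataArray(data, dims=dims,
                       coords={"index": list("abcd"), "columns": list("ABCDE")})

# Integer labels, keyword form -> shape (3, 5).
sub_int = arr_int.sel(index=[0, 1, 2])

# String labels on both dimensions, dictionary form -> shape (3, 3).
sub_str = arr_str.sel({"index": ["a", "b", "c"], "columns": ["A", "B", "C"]})

print(sub_int.shape, sub_str.values)
```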
In this section, we'll explain some of the commonly performed operations with arrays like addition, multiplication with scalar, transpose, dot product, null elements check, etc. We'll try to explain as many simple operations as possible with simple examples.
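When we add arrays, xarray aligns them on dimension names and coordinate values. By default only the coordinate values present in both arrays are kept, so adding arrays whose coordinates differ gives a smaller (possibly empty) result rather than an element-by-element sum.
That's the reason we are adding our first array to itself below to demonstrate array addition: all our arrays created earlier have different coordinate values.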
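The effect of coordinate alignment on addition can be seen in the small sketch below. The arrays a and b are hypothetical; b reuses the data of a but shifts its 'index' labels so that only the label 1 is shared.

```python
import numpy as np
import xarray as xr

a = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=["index", "columns"],
    coords={"index": [0, 1], "columns": list("ABC")},
)

# Same coordinates on both sides: plain element-wise addition.
total = a + a

# Same data, but 'index' labels are now [1, 2].
b = a.assign_coords(index=[1, 2])

# Only index label 1 exists in both arrays, so one row survives.
overlap = a + b

print(total.values)
print(overlap.values)  # [[3. 5. 7.]]
```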
We can retrieve the index of the maximum element in the array using the argmax() method.
Below we have retrieved the index of the maximum element of one of our arrays.
max_index = arr1.argmax()
max_index
<xarray.DataArray 'Array1' ()>
array(8)
We can call item() method on an array with one element to access it.
We can use the same item() method with index to retrieve an element at that index value. Below we are retrieving the maximum element using item() method.
arr1.item(max_index.item())
0.9947187341846935
The item() method can also accept a tuple of indices for arrays with more than one dimension to extract the individual element.
arr1.item((0,0))
0.5786850732755588
As we said earlier, the majority of array operations which we perform on a numpy array can be performed on an xarray DataArray as well. The major difference is that a DataArray lets us perform those operations based on either the dimension name or the axis index, whereas a numpy array lets us perform an operation based only on the axis index.
Below we have tried to get indices of maximum values across 'index' dimension of an array.
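A minimal sketch of argmax() along a named dimension, using a small hand-written array (an assumption, so the positions of the maxima are obvious):

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.array([[1, 9, 3],
                             [7, 2, 8]]), dims=["index", "columns"])

# Position of the maximum along 'index', one result per column.
per_column = arr.argmax(dim="index")

# Position of the maximum along 'columns', one result per row.
per_row = arr.argmax(dim="columns")

print(per_column.values)  # [1 0 1]
print(per_row.values)     # [1 2]
```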
The isnull() method detects NaN/None values in an array. It returns an array of the same shape as the original array with boolean values indicating the presence/absence of NaN/None values.
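For instance, with a small hypothetical array containing one missing value:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.array([[1.0, np.nan],
                             [3.0, 4.0]]), dims=["index", "columns"])

# Boolean mask with True where a value is missing.
mask = arr.isnull()

print(mask.values)
print(int(mask.sum()))  # number of missing values
```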
The where() method lets us perform a conditional operation on an array. Its first argument is the condition and the second argument is the value to use wherever the condition evaluates to False.
Below we have printed two of our earlier arrays as a reference as we'll be testing the where() method on them.
Below we have called the where() method on the arr array with the condition that the value is greater than 0.5. Wherever the value is greater than 0.5 the result takes the value from arr, otherwise it takes the value from arr2.
arr.where(arr>0.5,arr2)
<xarray.DataArray 'Array' (index: 4, columns: 5)>
array([[0.07511355, 0.60393655, 0.66964265, 0.6708009 , 0.95829975],
[0.7713564 , 0.7067142 , 0.6483082 , 0.75409353, 0.76900532],
[0.54839661, 0.72950701, 0.65034097, 0.92334631, 0.70863973],
[0.36282616, 0.56941354, 0.59030199, 0.5371372 , 0.46026918]])
Coordinates:
* index (index) <U1 '0' '1' '2' '3'
* columns (columns) <U1 'A' 'B' 'C' 'D' 'E'
Attributes:
index: X-Dimension of Data
columns: Y-Dimension of Data
info: Pandas DataFrame
long_name: Random Data
units: Unknown
The drop() method can be used to drop values from an array based on a dimension and its coordinates. It accepts two values as input: the first is a list of coordinates and the second is the dimension name. It then drops the entries of that dimension that have the specified coordinates. Recent xarray releases encourage using drop_sel() or drop_vars() instead of drop().
Below we have dropped the entries of the 'index' dimension which have coordinate values [0,1].
Below we have created another example demonstrating the use of drop() method. This time we are dropping values across 'columns' dimension of our array.
The drop_isel() method works like the drop() method but it lets us specify integer positions instead of the original coordinate values, which can be of other data types as well.
Like the isel() method, drop_isel() lets us specify the positions per dimension either as a dictionary or as keyword arguments of the method.
Below we have dropped elements from the array at position 0 of the 'index' dimension.
The drop_sel() method works exactly like drop_isel() with the only difference that it accepts the original coordinate values of a dimension instead of integer positions.
Below we have dropped elements from the array whose coordinate values are [0,1] for dimension 'index' and ["C","D","E"] for dimension 'columns'.
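Both variants can be sketched on a small array as below. The np.arange data is an assumption made so the surviving values are easy to check.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(20).reshape(4, 5),
    dims=["index", "columns"],
    coords={"index": [0, 1, 2, 3], "columns": list("ABCDE")},
)

# Drop by coordinate labels: rows 0 and 1, columns C, D and E.
by_label = arr.drop_sel(index=[0, 1], columns=["C", "D", "E"])

# Drop by integer position: the 0th entry along 'index'.
by_position = arr.drop_isel(index=[0])

print(by_label.values)    # [[10 11] [15 16]]
print(by_position.shape)  # (3, 5)
```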
We can call the copy() method on an xarray DataArray to create a copy of it. This creates a new array, and modifications to the new array are not reflected in the original array from which it was copied because the new array is stored in its own memory.
The dropna() method lets us drop entries containing missing values along a dimension of the array. It accepts the dimension name as the first parameter and the drop strategy through the how parameter. There are two different strategies.
'any' - This is the default value. It'll drop entries of the dimension where even a single value is NaN.
'all' - It'll drop entries of the dimension where all values are NaN.
Below we have set a few entries to NaN in our array which we created by copying one of our existing arrays.
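The difference between the two strategies is illustrated by the sketch below, which uses a hypothetical 3x4 array with one partially missing row and one fully missing row.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(12.0).reshape(3, 4), dims=["index", "columns"])
arr[0, 1] = np.nan   # row 0: one missing value
arr[2, :] = np.nan   # row 2: all values missing

# 'any': drop rows with at least one NaN -> only row 1 remains.
any_dropped = arr.dropna(dim="index", how="any")

# 'all': drop rows where every value is NaN -> rows 0 and 1 remain.
all_dropped = arr.dropna(dim="index", how="all")

print(any_dropped.shape, all_dropped.shape)  # (1, 4) (2, 4)
```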
The drop_duplicates() method lets us drop duplicate values across a dimension. We need to provide the name of the dimension across which we want to drop duplicates. Note that xarray detects duplicates on the coordinate values of that dimension.
Below we have first created a copy of one of our existing arrays and then we have copied one of the second axis data to another to create duplicate data. We can notice from the dataset printed below that the 1st and 3rd columns have the same data.
The clip() method lets us restrict the values of an array to the range between a minimum and a maximum value specified by us. It accepts two values as input where the first is the minimum and the second is the maximum. It then replaces all values less than the minimum with the minimum and all values greater than the maximum with the maximum.
Below we have tried to restrict values of our array in the range [0.3,0.6] using clip() method.
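A small sketch of clipping to the range [0.3, 0.6], with hand-picked input values (an assumption) so each case is visible:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.array([0.1, 0.3, 0.45, 0.6, 0.9]), dims=["index"])

# Values below 0.3 become 0.3, values above 0.6 become 0.6.
clipped = arr.clip(0.3, 0.6)

print(clipped.values)  # [0.3  0.3  0.45 0.6  0.6 ]
```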
We can combine arrays across a dimension using the concat() function of xarray. It accepts a list of arrays as the first parameter and the dimension name as the second parameter. It then combines the arrays across that dimension.
Below we have combined two arrays across 'index' dimension.
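Concatenation along 'index' might be sketched as follows. The two input arrays are hypothetical and share the same 'columns' labels.

```python
import numpy as np
import xarray as xr

dims = ["index", "columns"]
a = xr.DataArray(np.zeros((2, 3)), dims=dims,
                 coords={"index": [0, 1], "columns": list("ABC")})
b = xr.DataArray(np.ones((2, 3)), dims=dims,
                 coords={"index": [2, 3], "columns": list("ABC")})

# Stack the two arrays along the existing 'index' dimension.
combined = xr.concat([a, b], dim="index")

print(combined.shape)            # (4, 3)
print(combined["index"].values)  # [0 1 2 3]
```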
In this section, we'll explain how we can perform simple statistics like sum, mean, variance, standard deviation, cumulative sum, cumulative product, etc.
The min() method returns the minimum value of the whole array, or the minimum values across a given dimension.
Below we have first retrieved the minimum value of the whole array. Then in the next cell, we have retrieved minimum values across the 'columns' dimension of the array.
The std() method helps us calculate the standard deviation across different dimensions of an array. Below we have explained the usage with simple examples.
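A compact sketch of min() and std(), computed on a hypothetical np.arange array so the results can be checked by hand or against numpy:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(20.0).reshape(4, 5), dims=["index", "columns"])

# Minimum over the whole array (a 0-dimensional DataArray).
overall_min = arr.min()

# Minimum across 'columns': one value per 'index' entry.
min_over_columns = arr.min(dim="columns")

# Standard deviation across 'index': one value per column.
std_over_index = arr.std(dim="index")

print(float(overall_min))       # 0.0
print(min_over_columns.values)  # [ 0.  5. 10. 15.]
print(std_over_index.values)
```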
The cumprod() function helps us calculate cumulative product across different dimensions of the array.
Below we have first calculated cumulative product across 'index' dimension of the array and then in the next cell, we have calculated cumulative product across 'columns' dimension of the array.
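The two cumulative products can be sketched on a tiny integer array (an assumption, chosen to keep the products small):

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.array([[1, 2, 3],
                             [4, 5, 6]]), dims=["index", "columns"])

# Running product down each column.
cp_index = arr.cumprod(dim="index")

# Running product along each row.
cp_columns = arr.cumprod(dim="columns")

print(cp_index.values)    # [[ 1  2  3] [ 4 10 18]]
print(cp_columns.values)  # [[  1   2   6] [  4  20 120]]
```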
The corr() function helps us find the Pearson correlation coefficient across different dimensions of an array.
Below we have calculated the correlation between two arrays of the same dimensions. Then we have calculated correlation across 'index' dimension and 'columns' dimensions respectively. It'll take 1D arrays from 2D arrays based on dimensions and find out the correlation between them.
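A sketch of xr.corr() is shown below. The second array b is constructed as a linear function of a (an illustrative assumption), so every correlation should come out as 1 up to floating point error.

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.array([[1.0, 2.0, 3.0],
                           [2.0, 1.0, 4.0],
                           [4.0, 5.0, 9.0]]), dims=["index", "columns"])
b = 2 * a + 1  # perfectly linearly related to a

# Correlation over all elements (0-dimensional result).
overall = xr.corr(a, b)

# Correlation computed along 'index': one coefficient per column.
per_column = xr.corr(a, b, dim="index")

# Correlation computed along 'columns': one coefficient per row.
per_row = xr.corr(a, b, dim="columns")

print(float(overall), per_column.values, per_row.values)
```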
The rolling() method lets us perform rolling window functions on xarray DataArray objects. It accepts the dimension along which to roll and the window size as input. We can provide the dimension name and window size as a dictionary or as keyword arguments of the method. After creating the rolling windows, we can apply various aggregate functions like mean, standard deviation, sum, variance, etc. to each window of data.
Below we have performed the rolling window function on our array at 'index' dimension with a window size of 2. We have then taken the average of windows.
If you want to know how to perform moving window functions in pandas then please feel free to check our tutorial on the same where we cover the topic in detail.
Below we have created another example where we are performing a rolling window function on our array at 'columns' dimension with a window size of 3. We have then taken standard deviation on data windows.
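A rolling mean with a window of 2 along 'index' can be sketched as below, on a hypothetical np.arange array. The first row of the result is NaN because a full window is not yet available there.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(8.0).reshape(4, 2), dims=["index", "columns"])

# Keyword form: window of size 2 along 'index', then average each window.
rolled_mean = arr.rolling(index=2).mean()

# Dictionary form of the same call.
rolled_mean_dict = arr.rolling({"index": 2}).mean()

print(rolled_mean.values)
```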
The resample() method is useful in situations where a dimension holds datetime values and we want to resample it at a different frequency than the current one. The resampling can be of two types.
Up Sampling - We increase the sample frequency from lower to higher. E.g. - daily to 6 hourly.
Down Sampling - We decrease the sample frequency. E.g. - daily frequency to monthly.
The resample() method takes the dimension name and the new frequency as input to resample an xarray DataArray. We can provide the dimension and frequency either as a dictionary or as keyword arguments of the method.
If you are interested in learning about resampling using pandas then please feel free to check our tutorial where we discuss resampling in detail.
Below we have taken one of our arrays which had an 'index' dimension with datetime coordinates and resampled it from daily frequency to a 2-day frequency, which downsamples it. After resampling, we have called the mean() function so that each entry of the new array is the average of the values falling in its 2-day window.
In this section, we have upsampled our DataArray from daily frequency to 12 hourly frequency. Upsampling introduces new entries that were not present earlier, so it also introduces NaNs in the places where we don't have data. Our data has one entry per day and not one every 12 hours. We can fill the NaNs by calling xarray methods like ffill(), bfill(), fillna(), etc.
After upsampling, we have taken an average of the resampled entries. We have also displayed the 'index' dimension data for verification purposes.
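Both directions of resampling are sketched below on a hypothetical array with four daily entries. The frequency strings "2D" and "12h" are pandas offset aliases; the data values are an assumption chosen so the averages are easy to verify.

```python
import numpy as np
import pandas as pd
import xarray as xr

arr = xr.DataArray(
    np.arange(8.0).reshape(4, 2),
    dims=["index", "columns"],
    coords={"index": pd.date_range("2020-01-01", periods=4, freq="D")},
)

# Daily -> every 2 days (lower frequency): each window averages two days.
two_day = arr.resample(index="2D").mean()

# Daily -> every 12 hours (higher frequency): new timestamps have no data,
# so their averages are NaN.
half_day = arr.resample(index="12h").mean()

print(two_day.values)           # [[1. 2.] [5. 6.]]
print(half_day["index"].values)
print(half_day.values)
```

The NaN rows of half_day could then be filled with, for example, half_day.ffill(dim="index").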
This ends our small tutorial explaining the DataArray data structure of xarray to hold and manipulate data. Please feel free to let us know your views in the comments section.
About: Sunny Solanki holds a bachelor's degree in Information Technology (2006-2010) from L.D. College of Engineering. Post completion of his graduation, he has 8.5+ years of experience (2011-2019) in the IT Industry (TCS). His IT experience involves working on Python & Java Projects with US/Canada banking clients. Since 2020, he’s primarily concentrating on growing CoderzColumn.
His main areas of interest are AI, Machine Learning, Data Visualization, and Concurrent Programming. He has good hands-on with Python and its ecosystem libraries.
Apart from his tech life, he prefers reading biographies and autobiographies. And yes, he spends his leisure time taking care of his plants and a few pre-Bonsai trees.
Contact: sunny.2309@yahoo.in