Updated On : Nov-26,2021 Tags dask-array, parallel-computing, distributed-computing
Dask Array: Guide to Work with Large Arrays in Parallel [Python]

Dask Array: Guide to Work with Large Arrays in Parallel

Dask is one of the most commonly used parallel computing frameworks of the Python ecosystem. It provides many different sub-modules to perform parallel computing on a single computer or cluster of computers with ease. It let us work with big datasets which do not fit into main memory by dividing datasets and bringing a small part of them into the main memory at the time. It can easily distribute work on the cores of a single computer to multiple computers on the cluster. Below we have listed the most commonly used dask sub-modules designed for various purposes.

  • dask.distributed (Futures) - This submodule has the same API as that of concurrent.future module of Python and lets us submit our functions to dask cluster for execution. The code designed using this module works in eager mode compared to all other modules (mentioned below) which work in lazy mode (need to call compute() method to actually start execution). We don't need to call compute() tasks submitted as they will start executing as soon as we submit them to cluster based on the availability of free core/computer/node. If there is no free node then it'll be added to the pending tasks queue and will be executed as soon as free core/computer/node is available for it. The operations are completed immediately returning Future object which will be populated with results when the task is complete. This won't block the normal execution of code.
  • dask.bag - This module is designed to work with collection of python objects (lists/iterators) in parallel and perform operations like map(), filter(), groupby(), etc on them. This package is similar to PySpark. This module works in lazy mode hence we need to call compute() method at last to actually perform operations. The execution will wait for completion of task until compute() method returns with results.
  • dask.delayed - This module is designed to work with problems when you can't use arrays or dataframes to represent your data. It let us parallelize our existing python functions. This module works in lazy mode hence we need to call compute() method, at last, to actually perform operations. The execution will wait for the completion of the task until compute() method returns with results.
  • dask.dataframe - This sub-module let us work on large pandas dataframes in parallel. This module works in lazy mode hence we need to call compute() method, at last, to actually perform operations. The execution will wait for the completion of the task until compute() method returns with results.
  • dask.array - This module lets us work on large numpy arrays in parallel. This module works in lazy mode hence we need to call compute() method, at last, to actually perform operations. The execution will wait for the completion of the task until compute() method returns with results.

As a part of this tutorial, we'll be concentrating on dask.array module. We'll explain how we can create arrays and work with them in parallel using dask. The dask.array module implements many functionalities provided by numpy arrays. It implements all functionalities which can be parallelized. Internally dask array itself is composed of many numpy arrays where each array holds part of the main array.

Dask Array: Guide to work with Arrays in Distributed Environment

Below we have listed important sections of our tutorial to give an overview of the material covered.

Important Sections of Tutorial

  1. Create Arrays
  2. Indexing/Slicing Arrays
  3. Array Attributes
  4. Normal Array Operations
  5. Random Numbers Array
  6. Simple Statistics

Below we have imported the necessary libraries that we'll use in our tutorial. We have also imported dask.array separately.

In [6]:
import dask

print("Dask Version : {}".format(dask.__version__))
Dask Version : 2021.11.0
In [7]:
import dask.array as da
In [8]:
import numpy as np

Create Dask Cluster (Optional Step)

In order to use dask, we'll be creating a cluster of dask workers. This is not a compulsory step as we can continue working with dask arrays without creating dask clusters. But we are creating it to limit memory usage and the number of workers. This cluster also gives us a URL of the dashboard which dask provides to analyze parallel processing it performs behind the scenes. This can help us better understand operations and the time taken by them. The dask cluster we have created below is on a single computer but the dask cluster can be spread across clusters of computers as well.

If you want to know in detail about dask clusters then please feel free to check our below tutorial which explains the architecture of dask along with the cluster creation process.

Below we have created a dask cluster of 4 processes where each process can use 0.75 GB of main memory. We can open the dashboard URL in a separate tab to see the progress of tasks when we execute some tasks on the dask cluster by calling compute() method.

In [9]:
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=4, memory_limit='0.75GB')

client

Dask Array: Guide to Work with Large Arrays in Parallel

1. Create Arrays

In this section, we'll explain various ways to create dask arrays. The API of dask.array is almost the same as that of numpy hence the majority of functions will work exactly the same as numpy.

arange()

The arange() method works exactly like python range() function but returns dask array. When creating a dask array by calling this function, it does not actually create an array. It'll only create an array when we call compute() method on it or on any operations performed on it. The dask array are lazy by default just like the majority of dask modules which requires us to call compute() method to actually create and bring data in memory and perform some operations on them.

Below we have created an array using arange() function and displayed it. The below function call completes immediately and does not use any memory as it's just a function call. We have then displayed the array. The jupyter notebook creates a nice representation of the dask array. We can notice that it shows the size of an array, memory usage (when we actually bring it in memory by calling compute()), and data type as well.

It also displays information about chunks. The chunk is a small subset of our original dask array. When we perform any operations on the dask array, it'll bring a single chunk of an array in memory and work on it. At last, it'll combine the results of all chunks to create a final resulting array.

The majority of dask array creation methods accept an argument named 'chunks' which accepts a tuple of integers with the same rank as that of the original array to let dask know about the size of the individual chunks. The integer specifies the number of elements to keep in each dimension of the array. If an array is one dimensional then chunks will have a tuple of size 1, a two-dimensional array will have a tuple of two integers for chunks, and so on. We'll explain through our examples how we can use 'chunks' argument with various methods.

By default, chunks argument is None and dask will create an array with only one chunk. We should divide the dask array into chunks using 'chunks' argument if we want to distribute work across different cores of computer/computers of the cluster for faster computation in parallel. If we keep only one chunk then there won't be parallel computation and operations will be almost like a normal numpy array on a single-core/computer which can be slower for large datasets that can benefit from parallelism.

In [10]:
x = da.arange(100_000)

x
Out[10]:
Array Chunk
Bytes 781.25 kiB 781.25 kiB
Shape (100000,) (100000,)
Count 1 Tasks 1 Chunks
Type int64 numpy.ndarray
100000 1

In the next cell, we have created another array using arange() but this time, we have provided chunks argument with tuple (100,). This will keep 100 elements per chunk. As this is a one-dimensional array, we have provided chunk size with a tuple of a single integer. As our main array is of size 100_000, we'll have 1000 chunks which we can easily get by dividing array size by elements per chunk (100,000 / 100 = 1000).

We can notice the representation created of an array this time includes chunk size, shape, and count differently from the previous one chunk array. We can notice the difference in the size of the total array and chunk as well.

In [11]:
x = da.arange(100_000, chunks=(100,))

x
Out[11]:
Array Chunk
Bytes 781.25 kiB 800 B
Shape (100000,) (100,)
Count 1000 Tasks 1000 Chunks
Type int64 numpy.ndarray
100000 1

By default, arange() method creates an array of integers but we can ask it to create an array of floats as well by providing dtype argument.

Below we have created a float array using arange() method.

Please make a NOTE that we have provided numpy data type to dtype argument as dask does not have any data type defined in them. Dask is just a wrapper library that performs array operations in parallel on underlying numpy arrays.

In [12]:
x = da.arange(100_000, chunks=(100,), dtype=np.float32)

x
Out[12]:
Array Chunk
Bytes 390.62 kiB 400 B
Shape (100000,) (100,)
Count 1000 Tasks 1000 Chunks
Type float32 numpy.ndarray
100000 1

ones()

In the next cell, we have created an array of all 1s using ones() method of dask array. It takes as input shape of an array as the first argument. We have provided chunks as 10 which will use chunk size as 10 in all dimensions. If we provide only a single integer for 'chunks' for a multi-dimensional array then it'll use the same chunk size in all dimensions. In the below example, the chunk size of 10 will be treated as (10,10).

We can notice that our array has total 1_000_000 elements (1000x1000) and single chunk (10x10) is composed of 100 elements (10x10), the total number of chunks will be 10_000 which we can get by dividing total elements by number of elements per chunk (1_000_000 / 100 = 10_000).

In [13]:
x = da.ones((1000,1000), chunks=10, dtype=np.float32)

x
Out[13]:
Array Chunk
Bytes 3.81 MiB 400 B
Shape (1000, 1000) (10, 10)
Count 10000 Tasks 10000 Chunks
Type float32 numpy.ndarray
1000 1000

In the next cell, we have again created an array of 1s with dimension 1000x1000 but this time we have given a tuple of two integers to specify chunk size. The tuple (10,100) indicates that 10 elements in first axis and 100 elements in second axis should create one chunk (10 x 100 = 1000 elements per chunk).

In [14]:
x = da.ones((1000,1000), chunks=(10, 100))

x
Out[14]:
Array Chunk
Bytes 7.63 MiB 7.81 kiB
Shape (1000, 1000) (10, 100)
Count 1000 Tasks 1000 Chunks
Type float64 numpy.ndarray
1000 1000

ones_like()

The ones_like() method takes as input another array and creates an array of 1s with the same dimension as the dimension of the input array.

In [15]:
x = da.arange(100_000, chunks=100)

y = da.ones_like(x)

y
Out[15]:
Array Chunk
Bytes 781.25 kiB 800 B
Shape (100000,) (100,)
Count 1000 Tasks 1000 Chunks
Type int64 numpy.ndarray
100000 1

zeros()

The zeros() method works exactly like ones() with only difference that it creates an array of 0s. There is also zeros_like() method which works exactly like ones_like() but creates an array of 0s.

In [16]:
x = da.zeros((10_000,10_000), chunks=(100,100))

x
Out[16]:
Array Chunk
Bytes 762.94 MiB 78.12 kiB
Shape (10000, 10000) (100, 100)
Count 10000 Tasks 10000 Chunks
Type float64 numpy.ndarray
10000 10000

diag()

The diag() function takes as input another dask array and sets its elements on the diagonal of the square array with all other elements set to 0.

Below we have created a diagonal array where numbers 0-9 are set on diagonal of 2-dimensional array. We have the first time called compute() method on a dask array to retrieve the actual array and display it.

In [17]:
x = da.diag(da.arange(10))

x
Out[17]:
Array Chunk
Bytes 800 B 800 B
Shape (10, 10) (10, 10)
Count 2 Tasks 1 Chunks
Type int64 numpy.ndarray
10 10
In [18]:
x.compute()
Out[18]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 3, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 4, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 5, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 6, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 7, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 8, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 9]])

empty()

We can create an empty array using the method empty(). The data inside the array generated by this method will be garbage. It accepts the shape of an array as the first argument and should be specified using a tuple. It also accepts chunks argument like other methods.

In [19]:
x = da.empty((100,100), chunks=(10,10))

x.compute()
Out[19]:
array([[800., 800., 800., ...,   0.,   0.,   0.],
       [800., 800., 800., ...,   0.,   0.,   0.],
       [800., 800., 800., ...,   0.,   0.,   0.],
       ...,
       [  0.,   0.,   0., ...,   0.,   0.,   0.],
       [  0.,   0.,   0., ...,   0.,   0.,   0.],
       [  0.,   0.,   0., ...,   0.,   0.,   0.]])

eye()

The eye() method creates an identity array where all elements on the diagonal will be 1 and all other elements of the array will be 0. It accepts a single integer specifying the size of the array.

In [20]:
x = da.eye(100, chunks=10)

x.compute()
Out[20]:
array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

full()

The full() method can create an array whose all elements have the same value. It accepts array shape specified as tuple as the first argument. The second argument should be the element that should be present in an array.

In [21]:
x = da.full((10,10), 12.3)

x.compute()
Out[21]:
array([[12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3, 12.3]])

full_like()

The full_like() method can create an array whose shape can be taken from another array. It takes as input another array and a scalar value.

In [22]:
x = da.zeros((10,5))

y = da.full_like(x, 12.3)

y.compute()
Out[22]:
array([[12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3],
       [12.3, 12.3, 12.3, 12.3, 12.3]])

from_array()

We can create a dask array from any existing python list/tuple or numpy array using from_array() method.

Below we have created a dask array from a numpy array of random numbers.

In [23]:
x = np.random.rand(10,10)

y = da.from_array(x)

y
Out[23]:
Array Chunk
Bytes 800 B 800 B
Shape (10, 10) (10, 10)
Count 1 Tasks 1 Chunks
Type float64 numpy.ndarray
10 10
In [24]:
x[0], y[0].compute()
Out[24]:
(array([0.55492532, 0.90895822, 0.94007155, 0.06401506, 0.40150513,
        0.75877353, 0.47479847, 0.42534337, 0.23331959, 0.31552919]),
 array([0.55492532, 0.90895822, 0.94007155, 0.06401506, 0.40150513,
        0.75877353, 0.47479847, 0.42534337, 0.23331959, 0.31552919]))

array()

The array() method works exactly like from_array() method and can create a dask array from a python sequence or numpy arrays.

In [25]:
x = np.random.rand(10,10)

y = da.array(x)

y
Out[25]:
Array Chunk
Bytes 800 B 800 B
Shape (10, 10) (10, 10)
Count 1 Tasks 1 Chunks
Type float64 numpy.ndarray
10 10
In [26]:
x[0], y[0].compute()
Out[26]:
(array([0.75881025, 0.33603925, 0.42797113, 0.79104277, 0.41889755,
        0.24763189, 0.43119328, 0.41908326, 0.10440219, 0.07803936]),
 array([0.75881025, 0.33603925, 0.42797113, 0.79104277, 0.41889755,
        0.24763189, 0.43119328, 0.41908326, 0.10440219, 0.07803936]))

asarray()

The asarray() is another method which works like methods array() and from_array().

In [27]:
x = np.random.rand(10,10)

y = da.asarray(x)

y
Out[27]:
Array Chunk
Bytes 800 B 800 B
Shape (10, 10) (10, 10)
Count 1 Tasks 1 Chunks
Type float64 numpy.ndarray
10 10
In [28]:
x[0], y[0].compute()
Out[28]:
(array([0.26025194, 0.64982115, 0.04991184, 0.55478964, 0.69155262,
        0.85442688, 0.71524462, 0.90400504, 0.10351039, 0.07825381]),
 array([0.26025194, 0.64982115, 0.04991184, 0.55478964, 0.69155262,
        0.85442688, 0.71524462, 0.90400504, 0.10351039, 0.07825381]))

repeat()

The repeat() method takes as input an array and repeats its elements based on the axis to create a new array.

Below we have first created a dask array of five integers from 0-4. We have then called repeat() method asking it to repeat elements 3 times.

In [29]:
x = da.arange(5)

y = da.repeat(x, 3, axis=0)

y.compute()
Out[29]:
array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

Below we have first created a dask array of integers in the range 0-6 and then reshaped it as an array of size (2,3).

We have then called repeat() method with our dask array as input and asked it to repeat elements 3 times at axis 1.

We have then called repeat() method again with our dask array as input and asked it to repeat elements 2 times at axis 0.

We have then at last called compute() on both of our new dask arrays to see the results. We can notice from the results that how elements are repeated across different axes.

In [30]:
x = da.arange(6).reshape(2,3)

y = da.repeat(x, 3, axis=1)

z = da.repeat(x, 2, axis=0)

x.compute(), y.compute(), z.compute()
Out[30]:
(array([[0, 1, 2],
        [3, 4, 5]]),
 array([[0, 0, 0, 1, 1, 1, 2, 2, 2],
        [3, 3, 3, 4, 4, 4, 5, 5, 5]]),
 array([[0, 1, 2],
        [0, 1, 2],
        [3, 4, 5],
        [3, 4, 5]]))

linspace()

The linspace() method can create a dask array by dividing a range into evenly spaced values. We need to provide start and end values of range as well as the number of evenly spaced values to create in that range.

In [31]:
x = da.linspace(1,9,10)

x.compute()
Out[31]:
array([1.        , 1.88888889, 2.77777778, 3.66666667, 4.55555556,
       5.44444444, 6.33333333, 7.22222222, 8.11111111, 9.        ])

copy()

The copy() method actually creates a copy of the dask array. The new array will be stored at different memory locations hence change in the original array from which it was created, won't affect it.

In [32]:
x = np.random.rand(10,10)

y = da.asarray(x)

z = y.copy()
In [33]:
x[0], y[0].compute(), z[0].compute()
Out[33]:
(array([0.39467889, 0.00304257, 0.66207215, 0.72179242, 0.00886892,
        0.35795217, 0.32104309, 0.23563315, 0.55624762, 0.74580858]),
 array([0.39467889, 0.00304257, 0.66207215, 0.72179242, 0.00886892,
        0.35795217, 0.32104309, 0.23563315, 0.55624762, 0.74580858]),
 array([0.39467889, 0.00304257, 0.66207215, 0.72179242, 0.00886892,
        0.35795217, 0.32104309, 0.23563315, 0.55624762, 0.74580858]))

2. Indexing/Slicing Arrays

In this section, we'll explain with simple examples how we can perform integer indexing/slicing on our dask array. It exactly follows the same rules as that of numpy indexing hence any background with numpy indexing will be useful to follow along this section.

Below we have created an array of random numbers using numpy first. We have then created a dask array from this numpy array using from_array() method. We'll be using this array to explain indexing.

In [34]:
x = np.random.rand(100,100)

arr = da.from_array(x, chunks=(10,10))

arr
Out[34]:
Array Chunk
Bytes 78.12 kiB 800 B
Shape (100, 100) (10, 10)
Count 100 Tasks 100 Chunks
Type float64 numpy.ndarray
100 100

Below we have performed integer indexing on our array asking it to take every 5th element from the first dimension and all elements from the second dimension.

As we are taking every 5th element from 100 elements, there will be only 20 elements in our first dimension. We can notice below from array representation that chunk size is also adjusted according to indexing.

In [35]:
arr[::5,:]
Out[35]:
Array Chunk
Bytes 15.62 kiB 160 B
Shape (20, 100) (2, 10)
Count 200 Tasks 100 Chunks
Type float64 numpy.ndarray
100 20

Below we have created another example where we are taking every 5th element in both dimensions of our array. This will reduce our 100x100 array to 20x20. The chunk size is also changed from 10x10 to 2x2.

In [36]:
arr[::5, ::5]
Out[36]:
Array Chunk
Bytes 3.12 kiB 32 B
Shape (20, 20) (2, 2)
Count 200 Tasks 100 Chunks
Type float64 numpy.ndarray
20 20
In [37]:
arr_part = arr[::5, ::5].compute()

arr_part.shape
Out[37]:
(20, 20)

3. Array Attributes

In this section, we'll be explaining the attributes of the dask array.

Below we have created a simple array of integers using arange() method and then reshaped it to shape 100x100.

In [38]:
arr = da.arange(10_000).reshape(100,100)

arr
Out[38]:
Array Chunk
Bytes 78.12 kiB 78.12 kiB
Shape (100, 100) (100, 100)
Count 2 Tasks 1 Chunks
Type int64 numpy.ndarray
100 100

The shape of the array can be retrieved by calling shape attribute on an array.

In [39]:
arr.shape
Out[39]:
(100, 100)

The size of an array represents a total number of elements stored in an array and can be retrieved using size attribute.

In [40]:
arr.size
Out[40]:
10000

The chunks returns a tuple of the same size as the dimensions of the array. It'll return a tuple of size 1 for a one-dimensional array, the tuple of size 2 for a two-dimensional array, etc. Each individual element of the tuple will be again tuple which will have the number of elements present in the particular chunk of an array.

In [41]:
arr.chunks
Out[41]:
((100,), (100,))

The chunksize attribute returns the chunk shape of the dask array.

In [42]:
arr.chunksize
Out[42]:
(100, 100)

The itemsize attribute returns the number of bytes of memory used by a single element of an array.

In [43]:
arr.itemsize
Out[43]:
8

The nbytes attribute returns the total number of bytes of memory that will be used by this array if we bring it in memory.

In [44]:
arr.nbytes
Out[44]:
80000

The dtype represents the type of elements stored in an array.

In [45]:
arr.dtype
Out[45]:
dtype('int64')

We can also name a dask array using name attribute.

In [46]:
arr.name
Out[46]:
'reshape-6ded2bfa80b8da037471e001721bdf00'
In [47]:
arr._name = "Array1"

arr.name
Out[47]:
'Array1'

The npartitions attribute returns the number of partitions of an array. This attribute returns the number of chunks of an array.

In [48]:
arr.npartitions
Out[48]:
1

Below we have created a two-dimensional array again to explain a few attributes of the dask array again.

In [49]:
arr = da.zeros((100,100), chunks=(10,10))

arr
Out[49]:
Array Chunk
Bytes 78.12 kiB 800 B
Shape (100, 100) (10, 10)
Count 100 Tasks 100 Chunks
Type float64 numpy.ndarray
100 100

We can notice here that how chunks attribute returns a tuple of size 2 where both elements are again tuples specifying elements present in the chunk of an array.

In [50]:
arr.chunks
Out[50]:
((10, 10, 10, 10, 10, 10, 10, 10, 10, 10),
 (10, 10, 10, 10, 10, 10, 10, 10, 10, 10))

The chunksize is the same as what we provided to chunks argument of the method.

In [51]:
arr.chunksize
Out[51]:
(10, 10)
In [52]:
arr.npartitions
Out[52]:
100
In [53]:
arr.partitions[0].compute().shape
Out[53]:
(10, 100)
In [54]:
arr.partitions[0,0].compute().shape
Out[54]:
(10, 10)

4. Normal Array Operations

In this section, we'll explain different operations that we can perform on dask array like transpose, dot product, array multiplication, addition, flattening array, changing dimensions, etc. We'll be exploring many methods of dask array in this section.

reshape()

Below we have first created a dask array of 10_000 integers and then reshaped it from one dimension to a two-dimensional array using reshape() method. We have reshaped the array to shape 100x100.

In [55]:
arr = da.arange(10_000).reshape(100,100)

arr
Out[55]:
Array Chunk
Bytes 78.12 kiB 78.12 kiB
Shape (100, 100) (100, 100)
Count 2 Tasks 1 Chunks
Type int64 numpy.ndarray
100 100

rechunk()

The rechunk() method can be called on any dask array to change the existing chunk size or add chunk size to an array without chunking. It accepts chunk size specified as a tuple or single integer. If a single integer is specified for a multi-dimensional array then the same chunk size will be used in all dimensions.

Below we have re-chunked our array from a single chunk to 100 chunks of size (10,10).

In [56]:
arr.rechunk(chunks=(10,10))
Out[56]:
Array Chunk
Bytes 78.12 kiB 800 B
Shape (100, 100) (10, 10)
Count 202 Tasks 100 Chunks
Type int64 numpy.ndarray
100 100

transpose()

The transpose() function can transpose an array.

In [57]:
arr = da.arange(10).reshape(2,5)

arr.transpose()
Out[57]:
Array Chunk
Bytes 80 B 80 B
Shape (5, 2) (5, 2)
Count 3 Tasks 1 Chunks
Type int64 numpy.ndarray
2 5

astype()

The astype() function takes as input a numpy data type and transforms the input array type to the given.

In [58]:
arr.astype(np.float32)
Out[58]:
Array Chunk
Bytes 40 B 40 B
Shape (2, 5) (2, 5)
Count 3 Tasks 1 Chunks
Type float32 numpy.ndarray
5 2

flatten()

The flatten() method flattens multi-dimensional array.

In [59]:
arr.flatten()
Out[59]:
Array Chunk
Bytes 80 B 80 B
Shape (10,) (10,)
Count 3 Tasks 1 Chunks
Type int64 numpy.ndarray
10 1

dot()

The dot() method calculates the dot product of the current array and another array given as input.

In [60]:
arr.dot(arr.T)
Out[60]:
Array Chunk
Bytes 32 B 32 B
Shape (2, 2) (2, 2)
Count 4 Tasks 1 Chunks
Type int64 numpy.ndarray
2 2

matmul()

The matmul() method is available from dask.array module and can multiply two dask arrays.

In [61]:
da.matmul(arr, arr.T)
Out[61]:
Array Chunk
Bytes 32 B 32 B
Shape (2, 2) (2, 2)
Count 5 Tasks 1 Chunks
Type int64 numpy.ndarray
2 2

Array Addition

We can perform array addition using + operator or add() method available through dask.array module.

In [62]:
arr = da.arange(10).reshape(2,5)

(arr + arr).compute()
Out[62]:
array([[ 0,  2,  4,  6,  8],
       [10, 12, 14, 16, 18]])
In [63]:
da.add(arr, arr).compute()
Out[63]:
array([[ 0,  2,  4,  6,  8],
       [10, 12, 14, 16, 18]])

ravel()

The ravel() method works like flatten() method and flattens dask array.

In [64]:
arr.ravel().compute()
Out[64]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

squeeze()

The squeeze() method removes any extra dimension present in the dask array.

Below we have reshaped our array to shape 1x2x5 where the first dimension is extra and can be removed by calling squeeze() method on the array.

In [65]:
x = arr.reshape((1,2,5))

x
Out[65]:
Array Chunk
Bytes 80 B 80 B
Shape (1, 2, 5) (1, 2, 5)
Count 3 Tasks 1 Chunks
Type int64 numpy.ndarray
5 2 1
In [66]:
x.squeeze()
Out[66]:
Array Chunk
Bytes 80 B 80 B
Shape (2, 5) (2, 5)
Count 4 Tasks 1 Chunks
Type int64 numpy.ndarray
5 2

round()

The round() method will round float elements as its name suggest.

In [67]:
arr = da.random.random((5,2)) * 10

arr.compute(), da.round(arr).compute(), arr.round().compute()
Out[67]:
(array([[3.97296649, 5.61008888],
        [4.17565619, 8.45053221],
        [8.37451698, 8.76292295],
        [5.04055009, 7.35127857],
        [5.1559902 , 2.16625279]]),
 array([[4., 6.],
        [4., 8.],
        [8., 9.],
        [5., 7.],
        [5., 2.]]),
 array([[4., 6.],
        [4., 8.],
        [8., 9.],
        [5., 7.],
        [5., 2.]]))

clip()

The clip() method limits the elements of the array in a particular range. It'll replace all elements which fall outside of the specified range with upper and lower limits of the range.

The clip() method is available directly on the dask array as well as from dask.array module.

In [68]:
arr = da.random.random((5,2)) * 10

x = arr.clip(min=0.5, max=5.0)

y = da.clip(arr, 0.5, 5.0)

arr.compute(), x.compute(), y.compute()
Out[68]:
(array([[8.21493101, 6.05107939],
        [4.14450236, 4.9375196 ],
        [8.39922484, 6.11072614],
        [0.89582841, 5.06048699],
        [5.02786728, 4.3837421 ]]),
 array([[5.        , 5.        ],
        [4.14450236, 4.9375196 ],
        [5.        , 5.        ],
        [0.89582841, 5.        ],
        [5.        , 4.3837421 ]]),
 array([[5.        , 5.        ],
        [4.14450236, 4.9375196 ],
        [5.        , 5.        ],
        [0.89582841, 5.        ],
        [5.        , 4.3837421 ]]))

all()

The all() method returns True if all elements evaluates to True. If some array elements are 0 then this method will return False.

In [69]:
arr = da.random.random((5,2)) * 10

arr.all().compute()
Out[69]:
True

any()

This method returns True if any element of the dask array evaluates to True. This method will return False if all array elements are 0.

In [70]:
arr = da.random.random((5,2)) * 10

arr.any().compute()
Out[70]:
True

allclose()

The allclose() method available from dask.array module takes as input two dask arrays and returns True if both are equal element-wise else it returns False.

In [71]:
arr1 = da.random.random((5,2)) * 10
arr2 = da.random.random((5,2)) * 10

da.allclose(arr1, arr1).compute(), da.allclose(arr1, arr2).compute()
Out[71]:
(True, False)

append()

The append() method available from dask.array module let us append elements to dask array.

In [72]:
arr = da.arange(10)

da.append(arr, [5,10]).compute()
Out[72]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9,  5, 10])
In [73]:
arr = da.arange(10).reshape(2,5)

da.append(arr, arr, axis=0).compute(), da.append(arr, arr, axis=1).compute()
Out[73]:
(array([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]]),
 array([[0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9, 5, 6, 7, 8, 9]]))

ceil()

The ceil() method returns ceiling values of the input dask array of floats.

In [74]:
arr = da.random.random((5,2)) * 10

da.ceil(arr).compute()
Out[74]:
array([[5., 9.],
       [4., 5.],
       [8., 1.],
       [2., 6.],
       [4., 4.]])

floor()

The floor() method returns floor values of the input dask array of floats.

In [75]:
arr = da.random.random((5,2)) * 10

da.floor(arr).compute()
Out[75]:
array([[7., 9.],
       [5., 5.],
       [4., 9.],
       [8., 5.],
       [9., 8.]])

concatenate()

The concatenate() method available from dask.array module can be used to concatenate more than one array. We can perform concatenation at a particular axis as well. By default, it performs concatenation at the first dimension of the array.

In [76]:
arr1 = da.random.random((5,2)) * 10
arr2 = da.random.random((5,2)) * 10

arr3 = da.concatenate((arr1, arr2))
arr4 = da.concatenate((arr1, arr2), axis=1)

arr3.shape, arr4.shape
Out[76]:
((10, 2), (5, 4))

isnan() | isnull()

The isnan() or isnull() method available from dask.array module can be used to check presence of NaN / None in dask array. It returns array of same shape as input array with boolean values specifying presence or absence of NaN / None.

In [77]:
arr1 = da.random.random((5,2)) * 10

arr1[1,0] = np.nan
arr1[2,1] = np.nan
arr1[3,1] = np.nan

da.isnan(arr1).compute(), da.isnull(arr1).compute()
Out[77]:
(array([[False, False],
        [ True, False],
        [False,  True],
        [False,  True],
        [False, False]]),
 array([[False, False],
        [ True, False],
        [False,  True],
        [False,  True],
        [False, False]]))

where()

The where() method available from dask.array module can let us make conditional decisions on an array. The array elements are selected based on condition. The first argument is a condition, the second argument is value to use when the condition evaluates to True and the third argument is value to use when the condition evaluates to False.

Below we have first created a dask of random numbers with the shape of (5,2). We have then checked for the condition of elements greater than 5 using method where(). We return an array of the same shape as the input array where elements greater than 5 will be True and elements less than that will be False.

In [78]:
arr1 = da.random.random((5,2)) * 10

x = da.where(arr1 > 5., True, False)

arr1.compute(), x.compute()
Out[78]:
(array([[6.13513664, 2.43234731],
        [6.22041661, 3.82143705],
        [3.51092878, 6.35506368],
        [1.39135289, 7.56378462],
        [0.14717421, 1.67162379]]),
 array([[ True, False],
        [ True, False],
        [False,  True],
        [False,  True],
        [False, False]]))

Below we have created another example where we are checking for the condition greater than 5 again with where() method. This time we have provided the second and third argument dask array of the same shape as the input array. The array provided for cases when the condition evaluates to True is an array of 1s and the array provided for cases when the condition evaluates to False is an array of 0s.

In [79]:
arr1 = da.random.random((5,2)) * 10
arr2 = da.ones_like(arr1)
arr3 = da.zeros_like(arr1)

x = da.where(arr1 > 5., arr2, arr3)

arr1.compute(), x.compute()
Out[79]:
(array([[7.27397543, 0.83751119],
        [3.88807007, 4.75386433],
        [1.85187972, 9.49306162],
        [6.83441209, 2.84605967],
        [3.12498255, 9.33403342]]),
 array([[1., 0.],
        [0., 0.],
        [0., 1.],
        [1., 0.],
        [0., 1.]]))

unique()

The unique() method available from dask.array takes as input dask array and removes duplicate elements from the array.

In [80]:
arr = da.array([1,2,1,3,4,5,1,2])

da.unique(arr).compute()
Out[80]:
array([1, 2, 3, 4, 5])

pad()

The pad() function available from dask.array module can be used to add padding of specified size around the array. It takes as input dask array and padding size specified as single integer or tuple of integers.

In [81]:
arr = da.random.random((3,2))

x = da.pad(arr, 2)

x.compute()
Out[81]:
array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.40764197, 0.62426414, 0.        ,
        0.        ],
       [0.        , 0.        , 0.9333562 , 0.33195563, 0.        ,
        0.        ],
       [0.        , 0.        , 0.88170949, 0.94119391, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ]])

hstack() | vstack()

The hstack() and vstack() methods are used to horizontally and vertically merge input arrays.

Below we have explained the usage of both methods with simple and easy-to-understand examples.

In [82]:
arr1 = da.arange(10).reshape(5,2)
arr2 = da.arange(10,20).reshape(5,2)

x = da.vstack((arr1, arr2))
y = da.hstack((arr1, arr2))

arr1.compute(), arr2.compute(), x.compute(), y.compute()
Out[82]:
(array([[0, 1],
        [2, 3],
        [4, 5],
        [6, 7],
        [8, 9]]),
 array([[10, 11],
        [12, 13],
        [14, 15],
        [16, 17],
        [18, 19]]),
 array([[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7],
        [ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15],
        [16, 17],
        [18, 19]]),
 array([[ 0,  1, 10, 11],
        [ 2,  3, 12, 13],
        [ 4,  5, 14, 15],
        [ 6,  7, 16, 17],
        [ 8,  9, 18, 19]]))

stack()

The stack() method can be used to merge input arrays at a particular axis.

Below we have explained its usage with a simple example. Please take a look at how arrays are merged at a particular axis and compare it with an output of methods hstack() and vstack() from the previous cell. The stack() method creates one extra dimension when merging arrays.

In [83]:
arr1 = da.arange(10).reshape(5,2)
arr2 = da.arange(10,20).reshape(5,2)

x = da.stack((arr1,arr2), axis=0)

y = da.stack((arr1,arr2), axis=1)

x = x.compute()
y = y.compute()

print(x.shape, y.shape)

x, y
(2, 5, 2) (5, 2, 2)
Out[83]:
(array([[[ 0,  1],
         [ 2,  3],
         [ 4,  5],
         [ 6,  7],
         [ 8,  9]],

        [[10, 11],
         [12, 13],
         [14, 15],
         [16, 17],
         [18, 19]]]),
 array([[[ 0,  1],
         [10, 11]],

        [[ 2,  3],
         [12, 13]],

        [[ 4,  5],
         [14, 15]],

        [[ 6,  7],
         [16, 17]],

        [[ 8,  9],
         [18, 19]]]))

absolute()

The absolute() method available from dask.array module returns an array where values of input array are replaced with absolute values.

In [84]:
arr1 = da.random.random((5,2)) * 10

arr1[0,0] = -1.5
arr1[1,1] = -2.5

da.absolute(arr1).compute()
Out[84]:
array([[1.5       , 9.31290159],
       [8.14690158, 2.5       ],
       [7.13987101, 8.58824419],
       [0.54188034, 2.10244167],
       [2.81535325, 1.99407251]])

Few other Methods

Below we have listed a few other methods available from dask.array module that can be used for various purposes.

In [85]:
da.log, da.log2, da.exp, da.sin, da.sinh, da.tan, da.tanh, da.cos, da.cosh, da.power, da.floor
Out[85]:
(<ufunc 'log'>,
 <ufunc 'log2'>,
 <ufunc 'exp'>,
 <ufunc 'sin'>,
 <ufunc 'sinh'>,
 <ufunc 'tan'>,
 <ufunc 'tanh'>,
 <ufunc 'cos'>,
 <ufunc 'cosh'>,
 <ufunc 'power'>,
 <ufunc 'floor'>)

5. Random Numbers Array

In this section, we'll explain various methods available through dask.array.random module to create dask arrays of random numbers.

random()

This method takes shape of an array specified as tuple and chunk size specified as integer/tuple as input and creates a dask array of random numbers in the range 0-1.

In [86]:
arr = da.random.random(size=(5,5), chunks=(2,2))

arr
Out[86]:
Array Chunk
Bytes 200 B 32 B
Shape (5, 5) (2, 2)
Count 9 Tasks 9 Chunks
Type float64 numpy.ndarray
5 5
In [87]:
arr.compute()
Out[87]:
array([[0.18412775, 0.96109041, 0.77225942, 0.59004179, 0.64410131],
       [0.44181774, 0.83411212, 0.21478979, 0.37770885, 0.37615642],
       [0.32440576, 0.6731673 , 0.73394615, 0.85531019, 0.03491276],
       [0.07013881, 0.68162293, 0.02277348, 0.76746369, 0.87788743],
       [0.28976463, 0.48732985, 0.22298156, 0.0548691 , 0.32101117]])

randint()

The randint() method lets us create a dask array of random integers in the specified range. We can specify the range as the first two values of the method followed by array shape specified as a tuple of integers. We can also provide chunk size with chunks parameter.

In [88]:
arr = da.random.randint(1,100,size=(5,5), chunks=(2,2))

arr.compute()
Out[88]:
array([[39, 56, 74, 36,  7],
       [92, 90, 87, 46, 15],
       [12, 36, 98, 68, 65],
       [65, 19, 23, 59, 70],
       [22, 11, 80,  2, 88]])

choice()

The choice() method creates an array of numbers randomly selected from the input array. It takes another 1D array of numbers and the shape of an array as input and creates an array by randomly selecting elements from the input array.

In [89]:
arr = da.arange(100)

x = da.random.choice(arr, size=(5,5), chunks=(2,2))

x.compute()
Out[89]:
array([[25, 88, 35, 47, 41],
       [77, 99, 12, 35, 82],
       [ 6, 60, 53, 20, 17],
       [32, 72, 55, 78, 95],
       [92, 18, 59, 45, 37]])

normal()

This method lets us create a dask array of random numbers selected from normal (gaussian) distribution. It takes the mean of the distribution as the first parameter and standard deviation as the second parameter followed by the array shape specified as a tuple. It then creates normal distribution and draws numbers randomly from it.

In [90]:
arr = da.random.normal(loc=0, scale=2, size=(5,5), chunks=(2,2))

arr.compute()
Out[90]:
array([[-1.26548106,  0.14975707,  0.08736993,  1.38149374,  1.75811442],
       [-4.02089205, -1.77809869, -1.58828286, -1.00317334,  2.84262564],
       [ 2.25401257,  4.74017485,  0.83628624, -0.23630734,  2.1252103 ],
       [-0.37234091, -1.23919658,  0.47145937,  1.15060709,  1.84107895],
       [-1.06455756, -0.71221938,  2.77714304, -4.26206608,  3.3650857 ]])

uniform()

This method creates a dask array of random numbers selected from a uniform distribution. It takes a lower limit of distribution as the first parameter and a higher limit of distribution as the second parameter followed by an array shape specified as a tuple. It then creates uniform distribution and draws numbers randomly from it.

In [91]:
arr  = da.random.uniform(low=0.0, high=2.0, size=(5,5), chunks=(2,2))

arr.compute()
Out[91]:
array([[0.17734124, 1.93364168, 1.67618978, 0.02611602, 0.02062228],
       [0.44312922, 0.5530165 , 1.41437035, 1.52908091, 1.83680722],
       [0.2787521 , 1.85084017, 1.59852119, 1.56458952, 1.2153871 ],
       [1.83990031, 1.939919  , 0.89418917, 0.66308575, 0.759346  ],
       [0.69806728, 0.17951115, 0.78675073, 0.57949471, 1.83652048]])

binomial()

This method creates a dask array of random numbers drawn from a binomial distribution. The first two parameters are input parameters for binomial distribution followed by the shape of the array specified as a tuple. The first parameter is a number of experiments and should be integer >=0. The second parameter is probability and should float in the range (0,1).

In [92]:
arr = da.random.binomial(2,0.5, size=(5,5), chunks=(2,2))

arr.compute()
Out[92]:
array([[1, 2, 0, 1, 1],
       [2, 0, 2, 1, 1],
       [0, 0, 2, 0, 0],
       [1, 1, 0, 2, 1],
       [1, 1, 1, 2, 2]])

multinomial()

This method creates a dask array of random numbers drawn from a multinomial distribution. The first two parameters are input parameters for multinomial distribution followed by the shape of the array specified as a tuple. The first parameter is a number of experiments and should be integer >=0. The second parameter is a list of p-values (specified as floats in the range 0-1) for different outcomes of experiments. The sum of all p-values should be 1. The method creates multinomial distribution using parameters and draws numbers randomly from it to create an array.

In [93]:
arr = da.random.multinomial(2, [1/4,]*4, size=(5,5), chunks=(2,2))

arr.compute()
Out[93]:
array([[[0, 1, 0, 1],
        [0, 0, 1, 1],
        [1, 0, 1, 0],
        [1, 0, 1, 0],
        [1, 0, 1, 0]],

       [[0, 0, 2, 0],
        [0, 1, 0, 1],
        [0, 2, 0, 0],
        [0, 2, 0, 0],
        [0, 0, 1, 1]],

       [[0, 1, 1, 0],
        [0, 0, 0, 2],
        [1, 1, 0, 0],
        [1, 0, 1, 0],
        [1, 1, 0, 0]],

       [[1, 0, 1, 0],
        [0, 0, 2, 0],
        [1, 0, 1, 0],
        [1, 1, 0, 0],
        [1, 1, 0, 0]],

       [[0, 1, 0, 1],
        [0, 1, 0, 1],
        [1, 1, 0, 0],
        [2, 0, 0, 0],
        [1, 0, 0, 1]]])

chisquare()

This method creates a dask array of random numbers drawn from the chi-squared distribution. This method accepts one parameter for distribution creation followed by an array shape specified as a tuple of integers. The first parameter is a degree of freedom of chi-squared distribution and should be integer >0. The method creates chi-squared distribution from it and draws numbers randomly from it.

In [94]:
arr = da.random.chisquare(2, size=(5,5), chunks=(2,2))

arr.compute()
Out[94]:
array([[ 3.85880911,  0.95762527,  0.54774722,  0.48285694,  2.34967049],
       [ 0.63420116,  0.55609132, 11.62313532,  0.2236012 ,  1.84753214],
       [ 0.10985597,  6.54892141,  0.35822403,  0.89918416,  1.99046679],
       [ 2.73109502,  1.106645  ,  3.64011672,  3.16725357,  1.7029928 ],
       [ 2.80125197,  0.2200947 ,  0.72569047,  0.74647333,  2.62392184]])

poisson()

This method creates a dask array of random numbers drawn from the Poisson distribution. This method accepts one parameter for distribution creation followed by an array shape specified as a tuple of integers. The first parameter is the expectation of interval of Poisson distribution and should be float value >=0. The method creates Poisson distribution from it and draws numbers randomly from it.

In [95]:
arr = da.random.poisson(lam=1.,size=(5,5), chunks=(2,2))

arr.compute()
Out[95]:
array([[3, 0, 2, 0, 0],
       [0, 0, 0, 3, 2],
       [1, 1, 3, 0, 5],
       [1, 0, 4, 1, 1],
       [1, 2, 1, 0, 0]])

6. Simple Statistics

In this section, we'll explain various methods available through dask.array to perform various statistical operations like min, max, mean, median, cumulative sum, cumulative product, etc.

Below we have created a dask array of random numbers of shape 3x5. We'll be using this array to explain various statistical operations available through dask.array.

In [96]:
arr = da.random.random(size=(3,5), chunks=(2,2))

arr
Out[96]:
Array Chunk
Bytes 120 B 32 B
Shape (3, 5) (2, 2)
Count 6 Tasks 6 Chunks
Type float64 numpy.ndarray
5 3
In [97]:
arr.compute()
Out[97]:
array([[0.45901057, 0.06182941, 0.32918735, 0.67186194, 0.69771056],
       [0.9872385 , 0.85582724, 0.31476628, 0.89864173, 0.95558797],
       [0.32251347, 0.96381529, 0.72896677, 0.42282161, 0.50371302]])

min() | max()

The min() function returns minimum element of the dask array and max() function returns maximum element of dask array.

We can also provide axis details to both methods to retrieve minimums/maximums at a particular axis of our dask array.

In [98]:
arr.min().compute()
Out[98]:
0.06182940564309336
In [99]:
arr.max().compute()
Out[99]:
0.9872384957192029
In [100]:
arr.min(axis=0).compute(), arr.min(axis=1).compute(),
Out[100]:
(array([0.32251347, 0.06182941, 0.31476628, 0.42282161, 0.50371302]),
 array([0.06182941, 0.31476628, 0.32251347]))
In [101]:
arr.max(axis=0).compute(), arr.max(axis=1).compute(),
Out[101]:
(array([0.9872385 , 0.96381529, 0.72896677, 0.89864173, 0.95558797]),
 array([0.69771056, 0.9872385 , 0.96381529]))

argmin() | argmax()

The argmin() and argmax() methods works like min() and max() methods but returns index of minimum and maximum elements.

In [102]:
arr.argmin(axis=0).compute(), arr.argmin(axis=1).compute(),
Out[102]:
(array([2, 0, 1, 2, 2]), array([1, 2, 0]))
In [103]:
arr.argmax(axis=0).compute(), arr.argmax(axis=1).compute(),
Out[103]:
(array([1, 2, 2, 1, 1]), array([4, 0, 1]))

sum()

The sum() method returns the sum of all values of the dask array. We can retrieve the sum at a particular axis of an array as well using axis argument.

In [104]:
arr.sum().compute()
Out[104]:
9.173491704885786
In [105]:
arr.sum(axis=0).compute(), arr.sum(axis=1).compute()
Out[105]:
(array([1.76876253, 1.88147194, 1.3729204 , 1.99332529, 2.15701155]),
 array([2.21959982, 4.01206171, 2.94183017]))

mean()

The mean() method returns an average of array elements. We can retrieve the mean at a particular axis of the dask array using axis argument.

In [106]:
arr.mean().compute()
Out[106]:
0.6115661136590523
In [107]:
arr.mean(axis=0).compute(), arr.mean(axis=1).compute()
Out[107]:
(array([0.58958751, 0.62715731, 0.45764013, 0.66444176, 0.71900385]),
 array([0.44391996, 0.80241234, 0.58836603]))

std()

The std() method returns standard deviation of array elements. We can retrieve standard deviation at a particular axis of the dask array using axis argument.

In [108]:
arr.std().compute()
Out[108]:
0.2798327227692256
In [109]:
arr.std(axis=0).compute(), arr.std(axis=1).compute()
Out[109]:
(array([0.2866503 , 0.40217085, 0.19194722, 0.19432359, 0.1850906 ]),
 array([0.2348411 , 0.24800621, 0.23064232]))

cumsum()

The cumsum() method cumulative sum of array elements. We can retrieve the cumulative sum at a particular axis of the dask array using axis argument.

In [110]:
arr.cumsum(axis=0).compute()
Out[110]:
array([[0.45901057, 0.06182941, 0.32918735, 0.67186194, 0.69771056],
       [1.44624906, 0.91765664, 0.64395363, 1.57050367, 1.65329853],
       [1.76876253, 1.88147194, 1.3729204 , 1.99332529, 2.15701155]])
In [111]:
arr.cumsum(axis=1).compute()
Out[111]:
array([[0.45901057, 0.52083997, 0.85002732, 1.52188927, 2.21959982],
       [0.9872385 , 1.84306573, 2.15783202, 3.05647374, 4.01206171],
       [0.32251347, 1.28632876, 2.01529553, 2.43811715, 2.94183017]])

cumprod()

The cumprod() method returns cumulative product of array elements. We can retrieve cumulative product at the particular axis of the dask array using axis argument.

In [112]:
arr.cumprod(axis=0).compute()
Out[112]:
array([[0.45901057, 0.06182941, 0.32918735, 0.67186194, 0.69771056],
       [0.4531529 , 0.05291529, 0.10361708, 0.60376318, 0.66672382],
       [0.14614791, 0.05100057, 0.07553341, 0.25528412, 0.33583747]])
In [113]:
arr.cumprod(axis=1).compute()
Out[113]:
array([[0.45901057, 0.02838035, 0.00934245, 0.00627684, 0.00437942],
       [0.9872385 , 0.8449056 , 0.26594779, 0.23899178, 0.22837767],
       [0.32251347, 0.31084341, 0.22659452, 0.09580906, 0.04826027]])

var()

The var() method returns variance of array elements. We can retrieve variance at a particular axis of the dask array using axis argument.

In [114]:
arr.var().compute()
Out[114]:
0.07830635273243829
In [115]:
arr.var(axis=0).compute(), arr.var(axis=1).compute()
Out[115]:
(array([0.0821684 , 0.16174139, 0.03684373, 0.03776166, 0.03425853]),
 array([0.05515034, 0.06150708, 0.05319588]))

corrcoef()

We can calculate correlation between two dask arrays using corrcoef() method.

In [116]:
arr1 = da.linspace(1,10,5)
arr2 = da.linspace(1,20,5)

arr1.compute(), arr2.compute()
Out[116]:
(array([ 1.  ,  3.25,  5.5 ,  7.75, 10.  ]),
 array([ 1.  ,  5.75, 10.5 , 15.25, 20.  ]))
In [117]:
da.corrcoef(arr1, arr2).compute()
Out[117]:
array([[1., 1.],
       [1., 1.]])

maximum()

The maximum() method takes two dask arrays as input and returns an array that has the same length as input arrays but elements are element-wise maximum.

In [118]:
arr1 = da.linspace(1,10,5)
arr2 = da.random.random(5) *5

arr1.compute(), arr2.compute()
Out[118]:
(array([ 1.  ,  3.25,  5.5 ,  7.75, 10.  ]),
 array([0.09015038, 2.23540179, 3.52211549, 1.74988675, 3.04323073]))
In [119]:
da.maximum(arr1, arr2).compute()
Out[119]:
array([ 1.  ,  3.25,  5.5 ,  7.75, 10.  ])

This ends our small tutorial explaining how we can use dask.array module to create, index/slice, manipulate and calculate simple statistics on dask arrays. Please feel free to let us know your views in the comments section.

References


  Support Us to Make a Difference

Thank You for visiting our website. If you like our work, please support us so that we can keep on creating new tutorials/blogs on interesting topics (like AI, ML, Data Science, Python, Digital Marketing, SEO, etc.) that can help people learn new things faster. You can support us by clicking on the Coffee button at the bottom right corner. We would appreciate even if you can give a thumbs-up to our article in the comments section below.

 Want to Share Your Views? Have Any Suggestions?

If you want to

  • provide some suggestions on topic
  • share your views
  • include some details in tutorial
  • suggest some new topics on which we should create tutorials/blogs
Please feel free to let us know in the comments section below (Guest Comments are allowed). We appreciate and value your feedbacks.



Sunny Solanki  Sunny Solanki