Updated On : Dec-22,2021 Tags numba, stencil-kernel
Numba @stencil Decorator: Guide to Improve Performance of Code involving Stencil Kernels

Numba @stencil Decorator: Guide to Improve Performance of Code involving Stencil Kernels

Numba is one of the most commonly used libraries nowadays to speed-up python code. It can speed up your existing python code by a big margin by simply decorating your existing functions with numba decorators. Numba provides various decorators to speed up the python code. Below is a list of some of the decorators on which we have already covered tutorials explaining decorators in depth.

  • @jit &@njit - Decorators to speed up almost any python function.
  • @vectorize - Decorator to speed up numpy-like universal functions.
  • @guvectorize - Decorator which is an extended version of @vectorize decorator.
  • @stencil - Decorator which speeds up function performing stencil kernel operations like convolution, correlation, etc.

As a part of this tutorial, we'll be concentrating on @stencil kernel. The stencil kernel is a kind of numerical computation where each element of the array is updated according to some fixed pattern. This fixed pattern can be performing operations on each array element based on neighbors of the same array or different array. One common example of a stencil kernel is a convolution operation where we convolve a fixed dimension kernel on each element of the array. Numba let us speed up these functions which perform stencil operations on input arrays using @stencil decorator.

Below we have highlighted important sections of the tutorial.

Important Sections of Tutorial

Below we have imported Numba and printed the version of it. We have also imported @stencil decorator.

In [1]:
import numba

print("Numba Version : {}".format(numba.__version__))
Numba Version : 0.54.1
In [2]:
import numpy as np
In [3]:
from numba import stencil

Example 1: Simple Example with 1-Dimensional Array

As a part of our first example, we'll explain how we can create a stencil kernel that works with 1D array. We'll then decorate it with Numba @stencil decorator to speed up the execution.

Below we have created a function that takes 1D array as input and creates a new 1D array where all elements are replaced by the addition of the element and its neighbors. Our function takes as input 2 1D arrays. The first array will have data and 2nd array will be empty which will be filled with the result. We have put an if-else condition to handle boundary conditions.

In [3]:
def conv_op(a, b):
    for i in range(a.shape[0]):
        if i-1 < 0 or i+1 >= a.shape[0]:
            b[i] = 0
        else:
            b[i] = a[i-1] + a[i] + a[i+1]

In the below cell, we have created a 1D array of 1M numbers and applied our function to it. We have then recorded the time taken by function to execute using %time magic command of jupyter notebook. It records the time of a single python statement. We have then printed the first few elements of the resulted array as well.

If you want to know about other jupyter notebook magic commands like %time then please feel free to check our tutorial on the same which covers the majority of the commands available.

In [4]:
input_arr = np.arange(1_000_000)
output_arr = np.empty_like(input_arr)

%time conv_op(input_arr,output_arr)
CPU times: user 474 ms, sys: 8.11 ms, total: 482 ms
Wall time: 482 ms
In [5]:
output_arr[:10]
Out[5]:
array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27])

In the below cell, we have re-created our function again using @stencil decorator. We have already imported stencil from Numba earlier. We have removed if-else conditions needed to handle boundary conditions because now boundary conditions will be handled by @stencil decorator itself. We also don't need to provide a second array as Numba will internally create a new array and return it. We only need to provide logic to apply stencil kernel. Internally, @stencil decorator will loop through all elements of the input array and apply the stencil kernel logic on all of them. It'll then return the final array with the result. In the below function, 'a[-1], a[0] and a[1]' elements refers to 'a[i-1],a[i] and a[i+1]' elements where 'i' will change from 0 till end index of the array. So for each element, we'll add its previous and next elements to it. When handling boundary conditions for first and last elements where there is no previous and next element, it'll evaluate it to 0.

Then, in the next cell, we have executed our @stencil decorated function 3 times and recorded the time taken by it each time. We have also stored the output array and printed it for comparison with the result of the original python function without a decorator. We can notice from the time that it takes quite less time to execute @stencil decorated function compared to the python function without it.

When designing a function that applies stencil kernel to the input array, we just need to include kernel logic that will be applied to the input array. We don't need to worry about boundary conditions. We can also provide more than one array to stencil function which we'll explain in our upcoming examples.

In [13]:
@stencil
def conv_op(a):
    return a[-1] + a[0] + a[1]
In [57]:
%time output_arr = conv_op(input_arr)

%time output_arr = conv_op(input_arr)

%time output_arr = conv_op(input_arr)
CPU times: user 129 ms, sys: 0 ns, total: 129 ms
Wall time: 133 ms
CPU times: user 92.6 ms, sys: 3.39 ms, total: 95.9 ms
Wall time: 95.2 ms
CPU times: user 93.7 ms, sys: 4.08 ms, total: 97.8 ms
Wall time: 97.3 ms
In [59]:
output_arr[:10]
Out[59]:
array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27])

Example 2: Working with 2-Dimensional Array

In this example, we have explained how we can create a stencil kernel function that works on 2D array. Below we have created a python function that loops through all elements of the input 2D array. It then adds the elements that come in the same column but one row before and after it and elements that come in the same row but one column before and after it. It adds elements that are on a plus sign formed in an array on the given element. Our function adds results to another array given as input to the function.

In the next cell, we have first created an array of (1000,1000) integers and executed our python function on it. We have also recorded the time taken by the function.

In [86]:
def conv_op(a, b):
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if i+1 == a.shape[0] or j+1 == a.shape[1]:
                b[i,j] = 0
            elif i-1 < 0 or j-1 < 0:
                b[i,j] = 0
            else:
                b[i,j] = a[i, j+1] + a[i+1, j] + a[i, j-1] + a[i-1, j]
In [87]:
input_arr = np.arange(1000_000).reshape((1000, 1000))

input_arr[:5,:5]
Out[87]:
array([[   0,    1,    2,    3,    4],
       [1000, 1001, 1002, 1003, 1004],
       [2000, 2001, 2002, 2003, 2004],
       [3000, 3001, 3002, 3003, 3004],
       [4000, 4001, 4002, 4003, 4004]])
In [89]:
output_arr = np.empty_like(input_arr)

%time conv_op(input_arr, output_arr)

output_arr[:5,:5]
CPU times: user 852 ms, sys: 7.22 ms, total: 859 ms
Wall time: 859 ms
Out[89]:
array([[    0,     0,     0,     0,     0],
       [    0,  4004,  4008,  4012,  4016],
       [    0,  8004,  8008,  8012,  8016],
       [    0, 12004, 12008, 12012, 12016],
       [    0, 16004, 16008, 16012, 16016]])

In the below cell, we have recreated the function but using Numba @stencil decorator. We can now remove boundary conditions check and loop as it'll be handled by Numba internally.

After creating the function, we have executed it three times with the array we created in the previous cell. We also recorded the time taken by function to compare it with the normal python function we created earlier. We can notice from the results that @stencil decorated function performs quite better compared to our normal python function.

In [90]:
from numba import stencil

@stencil
def conv_op(a):
    return a[0, 1] + a[1, 0] + a[0, -1] + a[-1, 0]
In [92]:
%time output_arr = conv_op(input_arr)

%time output_arr = conv_op(input_arr)

%time output_arr = conv_op(input_arr)

output_arr[:5,:5]
CPU times: user 173 ms, sys: 0 ns, total: 173 ms
Wall time: 174 ms
CPU times: user 143 ms, sys: 0 ns, total: 143 ms
Wall time: 143 ms
CPU times: user 156 ms, sys: 0 ns, total: 156 ms
Wall time: 156 ms
Out[92]:
array([[    0,     0,     0,     0,     0],
       [    0,  4004,  4008,  4012,  4016],
       [    0,  8004,  8008,  8012,  8016],
       [    0, 12004, 12008, 12012, 12016],
       [    0, 16004, 16008, 16012, 16016]])

Example 3: Constant Value as One of the Input

Both of our previous examples demonstrate usage of @stencil decorator took single array as input and worked on it. We can also provide another array of scalar as input to the function. In this example, we'll explain how we can give scalar as input to function.

Below we have first created a simple python function that takes three arguments as input. The first argument is an array, the second argument is scalar and the third argument is again an array of the same shape as the first argument. We then loop through each element of the first array and add its neighbors to it like our first example. This time after adding neighbors, we are also multiplying it with a scalar. We are storing results in another input array.

In the next cell, we have created an array of 1M integers and initialized scalar value to be given as input to function. Then, we have executed our function with this array and scalar value recording time taken for the execution.

In [188]:
def conv_op(a, b, c):
    for i in range(a.shape[0]):
        if i-1 < 0 or i+1 >= a.shape[0]:
            c[i] = 0.0
        else:
            c[i] = b * (a[i-1] + a[i] + a[i+1])
In [189]:
input_arr1 = np.arange(1000_000)

input2 = 0.25
In [190]:
output_arr = np.empty_like(input_arr1)

%time conv_op(input_arr1, input2, output_arr)

output_arr[:10]
CPU times: user 1.76 s, sys: 0 ns, total: 1.76 s
Wall time: 1.75 s
Out[190]:
array([0, 0, 1, 2, 3, 3, 4, 5, 6, 6])

In the below cell, we have recreated the function using @stencil decorator. Our function now takes two inputs where the first argument is an array and the second argument is a scalar value.

Then, in the next cell, we have executed our function three times on an array from the previous cell and recorded the time taken by it. We can notice that it outperforms the original python function without a decorator by a big margin.

In [191]:
@stencil
def conv_op(a, b):
    return b * (a[-1] + a[0] + a[+1])
In [193]:
%time conv_op(input_arr1, input2)

%time conv_op(input_arr1, input2)

%time conv_op(input_arr1, input2)

output_arr[:10]
CPU times: user 122 ms, sys: 0 ns, total: 122 ms
Wall time: 125 ms
CPU times: user 100 ms, sys: 1.53 ms, total: 102 ms
Wall time: 101 ms
CPU times: user 101 ms, sys: 0 ns, total: 101 ms
Wall time: 101 ms
Out[193]:
array([0, 0, 1, 2, 3, 3, 4, 5, 6, 6])

Example 4: Working with More than One Arrays as Input

All our previous examples worked on single input arrays but we can create a stencil kernel function that works on more than one array. In this example, we'll create an example that works on two input arrays.

Below we have created a python function that takes as input three input 1D arrays of the same size. It then loops through each element of the first and second array, takes the average of elements at the same index, previous index, and next index. It then adds all averaged elements and stores results in the third array. The logic is almost the same as our first example with the only change being that it's taking an average of two array elements this time.

Then in the next cell, we have created two 1D arrays of 1M elements and execute our function on them. We have also recorded the time taken for execution.

In [4]:
def conv_op(a, b, c):
    for i in range(a.shape[0]):
        if i-1 < 0 or i+1 >= a.shape[0]:
            c[i] = 0.0
        else:
            c[i] = (a[i-1] + b[i-1])/2 + (a[i]+b[i])/2 + (a[i+1]+b[i+1])/2
In [5]:
input_arr1 = np.arange(1000_000)

input_arr2 = np.ones_like(input_arr1)
In [6]:
output_arr = np.empty_like(input_arr1, dtype=np.float32)

%time conv_op(input_arr1, input_arr2, output_arr)

output_arr[:10]
CPU times: user 1.26 s, sys: 0 ns, total: 1.26 s
Wall time: 1.26 s
Out[6]:
array([ 0. ,  3. ,  4.5,  6. ,  7.5,  9. , 10.5, 12. , 13.5, 15. ],
      dtype=float32)

In the below cell, we have reimplemented the function using @stencil decorator. We have created a function that takes two 1D arrays as input. We have then executed the function three times and recorded the time taken for execution. We can notice from the time recorded that it takes quite less time for @stencil decorated function to execute compared to the normal python function without it.

In [197]:
@stencil
def conv_op(a, b):
    return (a[-1]+b[-1])/2 + (a[0]+b[0])/2 + (a[1]+b[1])/2
In [198]:
input_arr1 = np.arange(1000_000)

input_arr2 = np.ones_like(input_arr1)

%time output_arr = conv_op(input_arr1, input_arr2)

%time output_arr = conv_op(input_arr1, input_arr2)

%time output_arr = conv_op(input_arr1, input_arr2)

output_arr[:10]
CPU times: user 166 ms, sys: 0 ns, total: 166 ms
Wall time: 165 ms
CPU times: user 142 ms, sys: 3.63 ms, total: 145 ms
Wall time: 145 ms
CPU times: user 150 ms, sys: 0 ns, total: 150 ms
Wall time: 149 ms
Out[198]:
array([ 0. ,  3. ,  4.5,  6. ,  7.5,  9. , 10.5, 12. , 13.5, 15. ])

Example 5: Using neighborhood Argument for Rolling Operations

In this example, we'll explain how we can use neighborhood argument of @stencil decorator to work on many elements of the input array. This can be useful in situations when we want to perform rolling window functions like rolling average over 5 days, rolling standard deviation over 10 days, etc.

Below we have created a function that takes as input two 1D arrays. It then loops through elements of the first array one by one. It takes an average of elements and four previous elements before it. We have given another array of the same shape as the first array in which results will be recorded.

In [247]:
def rolling_mean(a, b):
    for i in range(a.shape[0]):
        idx1, idx2, idx3, idx4, idx5 = i-4,i-3,i-2,i-1,i
        elem1 = np.nan if idx1 < 0 else a[idx1]
        elem2 = np.nan if idx2 < 0 else a[idx2]
        elem3 = np.nan if idx3 < 0 else a[idx3]
        elem4 = np.nan if idx4 < 0 else a[idx4]
        elem5 = a[idx5]

        b[i] = (elem1+elem2+elem3+elem4+elem5) / 5
In [248]:
input_arr = np.linspace(0, 500, num=1_000_000)

output_arr = np.empty_like(input_arr)
In [249]:
%time rolling_mean(input_arr, output_arr)

output_arr[:10]
CPU times: user 854 ms, sys: 0 ns, total: 854 ms
Wall time: 853 ms
Out[249]:
array([   nan,    nan,    nan,    nan, 0.001 , 0.0015, 0.002 , 0.0025,
       0.003 , 0.0035])

In the below cell, we have recreated the python function using @stencil decorator. We have provided argument neighborhood to @stencil decorator with the value of (-4,0) which suggests indexing to go from -4 to 0. Inside of the function, we are looping from -4 to 0 and adding elements of the array. We are then dividing the total by 5 to calculate the average. This function will be executed for each array element taking an average of the current element and four previous elements.

In the next cell, we have executed our @stencil decorated function three times with an array created earlier and recorded time for each execution. We can notice that it takes quite less time for @stencil decorated function to execute.

In [250]:
@stencil(neighborhood=((-4,0), ))
def rolling_mean(a):
    tot = 0.0
    for i in range(-4,1):
        tot += a[i]
    return tot / 5
In [251]:
%time output_arr = rolling_mean(input_arr)

%time output_arr = rolling_mean(input_arr)

%time output_arr = rolling_mean(input_arr)

output_arr[:10]
CPU times: user 233 ms, sys: 3.32 ms, total: 236 ms
Wall time: 237 ms
CPU times: user 102 ms, sys: 0 ns, total: 102 ms
Wall time: 102 ms
CPU times: user 102 ms, sys: 0 ns, total: 102 ms
Wall time: 102 ms
Out[251]:
array([0.    , 0.    , 0.    , 0.    , 0.001 , 0.0015, 0.002 , 0.0025,
       0.003 , 0.0035])

Example 6: Specifying Constant Value for Border Cases

Till now, for all our examples, we were not able to specify boundary values for boundary conditions. It was set to 0 internally by Numba but we can specify constant value for border cases using func_or_mode and cval parameters of @stencil decorators. In this example, we'll explain how to use them.

Below we have recreated our function from the first example, where we are setting boundary values to -1 instead of 0 which we were setting till now. We can notice from the results when we run the below function.

In [272]:
def conv_op(a, b):
    for i in range(a.shape[0]):
        if i-1 < 0 or i+1 >= a.shape[0]:
            b[i] = -1
        else:
            b[i] = a[i-1] + a[i] + a[i+1]
In [273]:
input_arr = np.arange(1_000_000)

output_arr = np.empty_like(input_arr)

%time conv_op(input_arr, output_arr)

output_arr
CPU times: user 473 ms, sys: 3.21 ms, total: 476 ms
Wall time: 475 ms
Out[273]:
array([     -1,       3,       6, ..., 2999991, 2999994,      -1])

In the below cell, we have created our python function using @stencil decorator. We have provided a two-parameter to stencil kernel. The value of func_or_mode parameter is set to 'constant' and cval parameter is set to -1. This will inform numba to use this constant value for boundary conditions.

In [274]:
@stencil(func_or_mode="constant", cval=-1)
def conv_op(a):
    return a[-1] + a[0] + a[1]
In [275]:
%time output_arr = conv_op(input_arr)

%time output_arr = conv_op(input_arr)

%time output_arr = conv_op(input_arr)

output_arr
CPU times: user 135 ms, sys: 7.89 ms, total: 143 ms
Wall time: 141 ms
CPU times: user 299 ms, sys: 0 ns, total: 299 ms
Wall time: 297 ms
CPU times: user 114 ms, sys: 0 ns, total: 114 ms
Wall time: 114 ms
Out[275]:
array([     -1,       3,       6, ..., 2999991, 2999994,      -1])

Example 7: Using Standard Indexing on Specified Arrays

As we noticed through all our examples till now, @stencil decorator internally uses relative indexing to loop through all elements and apply stencil kernel function to them. But there can be situations where we have given more than one array as input and we want to use standard indexing on some input arrays rather than using relative indexing for all. By default, @stencil decorator uses relative indexing for all input arrays. In this example, we'll explain how we can use standard indexing with @stencil decorator.

In the below cell, we have created a function that takes three arrays as input. The first and last arrays are of the same size whereas the second array has three elements. We have modified our function from the first example. We are looping through each element of the first array and multiplying the current element with the second element of the second array, a previous element with the first element of the second array, and the next element with the third element of the second array. We are storing the result in the third array. In this case, the second array is not following the same indexing as the first array.

In the next cell, we have executed the function on the input array of 1M numbers and the second array of three elements. We have also recorded the time taken by the function.

In [310]:
def conv_op(a, b, c):
    for i in range(a.shape[0]):
        if i-1 < 0 or i+1 >= a.shape[0]:
            c[i] = 0
        else:
            c[i] = (b[0] * a[i-1]) + (b[1] * a[i]) + (b[2] * a[i+1])
In [312]:
input_arr = np.arange(1_000_000)

weights = np.array([0.25, 0.5, 0.75])

output_arr = np.empty_like(input_arr, dtype=np.float64) ## Please make a note of data type change here.

%time conv_op(input_arr, weights, output_arr)

output_arr[:10]
CPU times: user 866 ms, sys: 0 ns, total: 866 ms
Wall time: 865 ms
Out[312]:
array([ 0. ,  2. ,  3.5,  5. ,  6.5,  8. ,  9.5, 11. , 12.5, 14. ])

In the below cell, we have recreated our earlier python function using @stencil decorator. This time we have provided parameter standard_indexing to @stencil decorator with value ('b',) telling it to use standard indexing for array b which is the second input array to function.

We have then executed our @stencil decorated function three times and recorded the time taken for each execution. We can notice from the results that it performs quite better compared to the normal loop-based python function.

In [313]:
@stencil(standard_indexing=("b",))
def conv_op(a,b):
    return b[0] * a[-1] + b[1] * a[0] + b[2] * a[1]
In [314]:
input_arr = np.arange(1_000_000)

weights = np.array([0.25, 0.5, 0.75])

%time output_arr = conv_op(input_arr, weights)

%time output_arr = conv_op(input_arr, weights)

%time output_arr = conv_op(input_arr, weights)

output_arr[:10]
CPU times: user 141 ms, sys: 3.95 ms, total: 145 ms
Wall time: 144 ms
CPU times: user 120 ms, sys: 3.65 ms, total: 123 ms
Wall time: 123 ms
CPU times: user 129 ms, sys: 0 ns, total: 129 ms
Wall time: 128 ms
Out[314]:
array([ 0. ,  2. ,  3.5,  5. ,  6.5,  8. ,  9.5, 11. , 12.5, 14. ])

Example 8: Specifying Output Array Explicitly

Till now, all our @stencil decorated functions returned the output array after performing stencil kernel on the input array. We can also create an array by ourselves and ask the results to be stored in it. We need to provide output array to out argument when calling @stencil decorated function and it'll add results to that array rather than returning it. Below we have explained with a simple example how we can use out argument.

In [294]:
@stencil
def conv_op(a):
    return a[-1] + a[0] + a[1]
In [295]:
input_arr = np.arange(1_000_000)

output_arr = np.empty_like(input_arr)

%time conv_op(input_arr, out=output_arr)

output_arr[:10]
CPU times: user 100 ms, sys: 3.97 ms, total: 104 ms
Wall time: 105 ms
Out[295]:
array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27])

This ends our small tutorial explaining how we can use @stencil decorator of Numba to perform stencil kernel on a given input array. Please feel free to let us know your views in the comments section.

References


  Support Us to Make a Difference

Thank You for visiting our website. If you like our work, please support us so that we can keep on creating new tutorials/blogs on interesting topics (like AI, ML, Data Science, Python, Digital Marketing, SEO, etc.) that can help people learn new things faster. You can support us by clicking on the Coffee button at the bottom right corner. We would appreciate even if you can give a thumbs-up to our article in the comments section below.

 Want to Share Your Views? Have Any Suggestions?

If you want to

  • provide some suggestions on topic
  • share your views
  • include some details in tutorial
  • suggest some new topics on which we should create tutorials/blogs
Please feel free to let us know in the comments section below (Guest Comments are allowed). We appreciate and value your feedbacks.



Sunny Solanki  Sunny Solanki