Python is an interpreted language, so it is slow compared to compiled languages like C/C++. Because of this, Python has historically been avoided for performance-intensive applications. The Numba library was developed to address this problem. Numba is a Just-In-Time (JIT) compiler for Python that can speed up parts (or all) of your Python code by converting it to low-level machine instructions. It uses the LLVM library for this conversion, and code compiled with Numba can run at speeds approaching that of C/C++.
The process of using Numba to speed up code is quite simple. Numba provides a set of decorators that we can apply to our functions; a decorated function is compiled the first time it is called, so that first call takes a little extra time to generate the faster machine code. Every subsequent call uses the already-compiled version and is therefore much faster.
Numba reads the Python bytecode of a decorated function, converts its input arguments and the other data used inside the function to Numba data types, optimizes various parts, and translates the result to machine code using the LLVM library. If a function is designed to work with multiple data types (a generic function), Numba will spend time compiling it each time it is called with a data type it has not seen before, because it creates a separate compiled version of the same function for each distinct set of data types.
Please make a NOTE that Numba can only translate a certain subset of Python code, chiefly loops and code that uses NumPy, to faster machine code. Not everything will run faster with Numba. One needs a basic understanding of what Numba can and cannot compile to make efficient use of it. We'll help you understand how to use Numba better in various situations in this tutorial.
As a part of this tutorial, we'll be covering how we can speed up our Python functions using Numba. We'll be explaining the @jit and @njit decorators available from Numba. Below we have highlighted the important sections to give an overview of the material covered in this tutorial.
We can easily install Numba using pip or conda.
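For example, either of the following commands should work (assuming pip or conda is already set up on your system):

pip install numba

conda install numba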
Below we have imported Numba and printed the version of it that we have used in this tutorial.
import numba
print("Numba Version : {}".format(numba.__version__))
import numpy as np
import pandas as pd
In this example, we'll be introducing the first decorator available from Numba, named @jit, to speed up our Python functions. We can decorate any Python function with the @jit decorator, and Numba will attempt to speed it up.
The @jit decorator compiles the Python code of the function it decorates. It generally works in one of two modes: nopython mode, where the whole function is compiled to machine code and runs without the Python interpreter, and object mode, where parts that cannot be compiled keep running as ordinary Python objects.
Users can first test whether their function runs in nopython mode; if it works, use that mode, otherwise fall back to object mode. If you know that your whole function can be converted by Numba, then nopython mode is preferred. If your function is designed such that only some parts can be converted by Numba while the rest runs as pure Python, then object mode is the better choice.
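As a quick illustration (a minimal sketch, not one of this tutorial's main examples), the two modes are selected through the decorator's arguments: nopython=True requests strict nopython mode, while forceobj=True forces object mode.

from numba import jit

# Strict nopython mode: compilation fails if any part cannot be converted.
@jit(nopython=True)
def add_nopython(a, b):
    return a + b

# Object mode: unsupported constructs keep running as regular Python objects.
@jit(forceobj=True)
def add_objmode(a, b):
    return a + b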
When testing the Numba @jit decorator, if it does not seem to improve performance, then it is better to remove the decorator, fall back to pure Python, and look for other ways to improve performance. Using @jit on functions that Numba cannot convert can even worsen performance: the first call still pays the compilation cost, but there is no speed-up afterwards, so the compilation time is pure overhead.
Please make a NOTE that Numba generally does not speed up code involving list comprehensions; it is suggested to rewrite functions that use comprehensions as explicit loops for faster performance, as illustrated below.
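For instance, a comprehension-based version of the nested-loop function used later in this tutorial would look like the sketch below; rewriting it with explicit loops (as we do later) is generally the better fit for Numba.

# A hypothetical comprehension-based variant: @jit typically will not speed this up,
# so it is better rewritten with explicit for loops as shown later in this tutorial.
def calculate_all_permutations_comprehension():
    return [(i, j) for i in range(int(1e4)) for j in range(int(1e3))]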
In this section, we have created two small examples to explain the usage of @jit decorator in object mode.
In our first example, we have created a simple function that takes an array of arbitrary size as input and applies a cube formula to each individual element. The perform_operation() function takes an array as input and executes the cube_formula() function on each element, recording the results.
def cube_formula(x):
    return x**3 + 3*x**2 + 3

def perform_operation(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out
After defining the functions, we have executed our main function with two different arrays of numbers, the first consisting of 1M numbers and the second of 10M numbers. We have also recorded the time taken, as we'll be comparing it against the @jit-decorated versions. We have used the Jupyter magic command %time to measure the execution time of a particular statement.
Please make a NOTE that the speed-up provided by Numba @jit-decorated functions will differ from computer to computer, as it depends on the low-level machine instructions available to the LLVM compiler on that particular machine.
If you are interested in learning about the magic commands (like %time, which we have used in this tutorial) available in Jupyter notebooks, then please feel free to check our tutorial on the same. It covers the majority of Jupyter notebook magic commands.
%time out = perform_operation(np.arange(1e6))
%time out = perform_operation(np.arange(1e7))
Below we have re-defined both of our functions, but this time decorated them with the @jit decorator.
from numba import jit
@jit
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out
Below we have executed the jit-decorated function with two arrays of different sizes. We have used the same arrays which we had used when testing the plain function.
We can notice from the timings that both calls take far less time than they did without @jit. The @jit decorator has improved the performance by quite a big margin.
%time out = perform_operation_jitted(np.arange(1e6))
%time out = perform_operation_jitted(np.arange(1e7))
In this section, we have defined one more function to explain the usage of the @jit decorator. The function simply executes a nested loop and records the index pairs of all combinations. The outer loop executes 10,000 times and the inner loop executes 1,000 times.
After defining the function, we have executed it 3 times and recorded the time taken by it each time for comparison purposes later.
def calculate_all_permutations():
    perms = []
    for i in range(int(1e4)):
        for j in range(int(1e3)):
            perms.append((i,j))
    return perms
%time perms = calculate_all_permutations()
%time perms = calculate_all_permutations()
%time perms = calculate_all_permutations()
Now, we have defined our function again, but this time decorated it with the @jit decorator. We have then rerun this @jit-decorated function 3 times and recorded the time taken each time. We can notice from the results that it takes considerably less time compared to the normal function. Also, subsequent calls to the @jit-decorated function take less time because they use the already compiled version.
@jit
def calculate_all_permutations():
    perms = []
    for i in range(int(1e4)):
        for j in range(int(1e3)):
            perms.append((i,j))
    return perms
%time perms = calculate_all_permutations()
%time perms = calculate_all_permutations()
%time perms = calculate_all_permutations()
In this section, we have run our examples in the nopython mode of the Numba @jit decorator. There are two ways in which we can force nopython mode: setting the nopython argument of @jit to True, or using the @njit decorator.
We'll be using both in our examples.
In this section, we have redefined our functions again and decorated them with @jit decorators. But this time, we have set the nopython argument of the @jit decorator to True (it is False by default). This forces Numba to run in strict nopython mode and convert all of the function's code to low-level machine code. This mode is generally preferred as it is faster than object mode.
Our current functions are designed in a way that they can be totally converted to low-level machine code using Numba.
If you use the @jit decorator in nopython mode, Numba will try to compile your whole function immediately, and if it cannot convert some parts it will fail with an error. If your function fails to compile in nopython mode, it is advisable to either use object mode or split the function into smaller functions and use nopython mode on the sub-parts wherever possible.
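Before moving on, here is a minimal sketch (with hypothetical function names) of that splitting strategy: jit only the numeric kernel in nopython mode and keep the unsupported or convenience parts in plain Python.

from numba import njit
import numpy as np

# Hypothetical numeric kernel: fully supported by nopython mode.
@njit
def numeric_kernel(x):
    return x ** 3 + 3 * x ** 2 + 3

# Hypothetical driver: keeps unsupported / convenience code in plain Python
# and delegates the heavy numeric work to the compiled kernel.
def driver(x):
    result = numeric_kernel(x)
    return list(result)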
from numba import jit
@jit(nopython=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit(nopython=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out
Below we have executed our function two times, once with an array of size 1M and once with an array of size 10M. We have also recorded the time taken by both. We can notice from the results that it takes almost the same time as our previous object mode runs. Though it does not seem to improve the performance of these particular functions much further, nopython mode is generally the preferred mode to use whenever possible, as it can speed up code more.
%time out = perform_operation_jitted(np.arange(1e6))
%time out = perform_operation_jitted(np.arange(1e7))
Below we have introduced another way of using nopython mode: decorating our functions with the @njit decorator. We have then run our @njit-decorated function two times, once with an array of size 1M and once with an array of size 10M. We can notice from the results that the time taken is almost the same as using nopython=True inside the @jit decorator.
from numba import njit
@njit
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@njit
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out
%time out = perform_operation_jitted(np.arange(1e6))
%time out = perform_operation_jitted(np.arange(1e7))
In this section, we have @njit-decorated the second example which we had run in the object mode section earlier. We have then executed the function three times to check performance. We can notice from the results that the time taken is almost the same or a little better compared to the object mode runs.
@njit
def calculate_all_permutations():
    perms = []
    for i in range(int(1e4)):
        for j in range(int(1e3)):
            perms.append((i,j))
    return perms
%time perms = calculate_all_permutations()
%time perms = calculate_all_permutations()
%time perms = calculate_all_permutations()
When Numba compiles code, it internally creates a separate version for each distinct data type the function is called with. Each time a @jit-decorated function is called with a new data type, Numba first needs to compile the function for that data type and store the new version for future use. All subsequent calls with that data type will then be fast.
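A small illustration of this behavior (our own sketch, not part of the main example): calling a lazily compiled function with two different data types produces two compiled versions, which can be inspected through the dispatcher's signatures attribute.

from numba import njit
import numpy as np

@njit
def add_one(x):
    return x + 1

add_one(np.int64(1))       # triggers compilation for int64
add_one(np.float64(1.0))   # triggers a second compilation for float64
print(add_one.signatures)  # lists both compiled signatures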
We can also specify the input and output data types of our @jit-decorated function explicitly. This creates a compiled version for the specified data type when the function is defined, rather than when the function is first called with that data type.
We can provide the data types as the first argument of the decorator. We specify the input and output data types of a function using the ret_type(param1_type, param2_type, ...) format: the input parameter data types go inside the parentheses and the return type goes outside, at the beginning. The data types used in the @jit decorator need to be imported from Numba. If an input or output is an array, we represent it by appending the string '[:]' to the data type (for example, int64[:]).
Please make a NOTE that when functions are declared with explicit data types, Numba will only allow us to execute them with those data types. Calls with any other data type will fail.
Below we have redefined the functions we have been using for the last few examples, but this time we have provided the input/output data types as well. We have declared both input and output as the int64 data type. A compiled version for this data type is created when we execute the cell below, so later calls with int64 data do not need compilation; they run immediately.
As we have declared our functions with integer input/output data types, calling the functions below with float data types will fail.
from numba import jit, int64, float32, float64
@jit(int64(int64), nopython=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit(int64[:](int64[:]), nopython=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out
Below we have run our jit-decorated function with data types, first with an array of 1M int64 numbers and then with an array of 10M int64 numbers. We have also recorded the time taken by both. We can notice from the time that it has improved further compared to all our previous versions.
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.int64))
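As a quick check of the note above (a hedged sketch of our own), calling the int64-only function with a float64 array should fail because no matching signature exists:

# Calling with float64 should raise an error since only int64[:] was declared.
try:
    perform_operation_jitted(np.arange(1e6, dtype=np.float64))
except TypeError as e:
    print("Call failed as expected:", e)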
If our function needs to work with different data types, we can specify more than one signature inside the @jit decorator. The data type signatures are specified as a list.
Below we have specified two different data type signatures for our functions. Numba will internally create compiled versions for both. Our functions can now run with these two data types; a call with any other data type will fail.
from numba import jit, int64, float32, float64
@jit([int64(int64), float64(float64)], nopython=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit([int64[:](int64[:]), float64[:](float64[:])], nopython=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out
Below we have executed our functions with arrays of sizes 1M and 10M respectively. We have first executed them with int64 data type and then with float64. We have also recorded the time taken by both. We can notice from the time taken that it has improved quite a lot compared to our examples where we had not declared data types.
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
When we call a @jit-decorated function with a particular data type, Numba generates machine code for it, and this compilation takes time. Within a single session the compiled version is reused automatically, but it is lost when the process exits. By setting the cache argument of the @jit decorator to True, we can avoid paying this compilation cost again.
Numba will internally use a file-based cache to maintain the compiled versions of functions across runs.
Below we have re-defined our functions with cache argument set to True.
from numba import jit, int32, int64, float32, float64
@jit([int32(int32), int64(int64), float64(float64)], nopython=True, cache=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit([int64[:](int64[:]), float64[:](float64[:])], nopython=True, cache=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out
Below we have executed our functions three times using the same array of 1M integer numbers. We have also recorded the time taken for executions. We can notice from the time taken by executions that they are the lowest of all our tries till now.
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
Below we have executed our functions three times using the same array of 1M float numbers. We have also recorded the time taken for executions. The time taken for executions is the least of all our tries till now.
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
Below we have executed our functions three times using the same array of 10M float numbers. We have also recorded the time taken for executions. The time taken is the least of all our tries of the same functions till now.
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
Numba can also parallelize our code on multi-core CPUs. It uses multi-threading to speed up code by running threads on different cores of the computer in parallel. In order to parallelize code, we need to set the parallel parameter of the @jit decorator to True. There are two types of parallelization available in Numba: automatic (implicit) parallelization, where Numba itself parallelizes supported array operations, and explicit parallelization, where we mark loops with the prange() function.
In our example, we'll use explicit parallelization by using prange() function.
Please make a NOTE that Python's Global Interpreter Lock (GIL) can prevent multi-threading from providing a speed-up. We'll explain in our upcoming examples how we can release the GIL and get around this problem.
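For contrast, here is a minimal sketch (with a hypothetical function of our own) of the implicit style, where Numba is left to parallelize a whole-array expression on its own; our main example below uses the explicit prange() style instead.

from numba import njit
import numpy as np

# Implicit (automatic) parallelization: with parallel=True, Numba may parallelize
# supported whole-array expressions like this one without any prange() loop.
@njit(parallel=True)
def scaled_sum(x, y):
    return x * 2.0 + y

out = scaled_sum(np.arange(1e6), np.arange(1e6))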
Below we have re-defined our functions and set the parallel parameter to True inside the @jit decorator. We have also modified the logic of our perform_operation_jitted() function to use the prange() function; the index returned by prange() is used to retrieve individual elements of the array.
from numba import jit, int64, float32, float64, prange
@jit([int64(int64), float64(float64)], nopython=True, cache=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit([int64[:](int64[:]), float64[:](float64[:])], nopython=True, cache=True, parallel=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i in prange(len(x)):
        res = cube_formula(x[i])
        out[i] = res
    return out
Now, we have run our parallelized function with arrays of size 1M and 10M to test their performance. We have also recorded the time taken by them. We have first used an array of 1M integers, then an array of 1M floats, and at last, an array of 10M floats.
We can notice from the time taken by the executions that performance has improved compared to the non-parallelized versions.
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
Numba can provide additional performance in some situations when the fastmath parameter of the @jit decorator is set to True. The fastmath option relaxes some strict numerical rules and allows approximate arithmetic and mathematical functions. If Intel's Short Vector Math Library (SVML) is installed on the system, Numba can also use it to improve performance when fastmath is set to True.
We can install Intel's SVML library using the conda command below. Please see this link for more details on SVML.
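A hedged example of such a command (assuming a conda environment; icc_rt is the package typically used to provide SVML):

conda install -c numba icc_rt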
In this section, we have first used fastmath on its own and then along with the parallel argument of the @jit decorator.
In this section, we have first re-defined our functions and decorated them with @jit decorator. We have set fastmath parameter to True along with nopython and cache parameters. We have also provided data types for inputs/outputs of functions.
from numba import jit, int64, float32, float64, prange
@jit([int64(int64), float64(float64)], nopython=True, cache=True, fastmath=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit([int64[:](int64[:]), float64[:](float64[:])], nopython=True, cache=True, fastmath=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out
Below we have tested our @jit-decorated, fastmath-enabled functions using different inputs.
First, we have executed them with 1M integers three times, followed by 1M floats three times and, at last, 10M floats three times. We can notice from the recorded times that fastmath seems to have improved performance a little.
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
In this section, we have re-defined the functions that we have been using for the last few examples. We have @jit decorated it along with options nopython, cache, fastmath, and parallel set to True.
from numba import jit, int64, float32, float64, prange
@jit([int64(int64), float64(float64)], nopython=True, cache=True, fastmath=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit([int64[:](int64[:]), float64[:](float64[:])], nopython=True, cache=True, fastmath=True, parallel=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i in prange(len(x)):
        res = cube_formula(x[i])
        out[i] = res
    return out
Below we have tested our fastmath-optimized and parallelized functions by executing them with different arrays three times each. We have also recorded the time taken by each for comparison. We can notice from the results that the time taken is almost the same as in the parallel section above.
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
One of Python's drawbacks when using multi-threading is the GIL, which prevents more than one thread from executing Python bytecode at a time. To work around this, Numba lets us release the GIL by setting the nogil parameter of the @jit decorator to True. When Numba can convert the majority of the Python code to low-level machine code, it is not necessary to hold Python's GIL.
Our functions for this example are exact copies of the functions we had defined in example 5 (with one minor change) when explaining how to use multi-threading with the Numba @jit decorator by setting parallel=True. This time we have also set the nogil parameter to True to let the compiled code release the GIL.
from numba import jit, int64, float32, float64, prange

@jit([int64(int64), float64(float64)], nopython=True, nogil=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit([int64[:](int64[:]), float64[:](float64[:])], nopython=True, nogil=True, parallel=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i in prange(len(x)):
        res = cube_formula(x[i])
        out[i] = res
    return out
Below we have executed our jit-decorated function three times with each input: first with an array of 1M integers, then with an array of 1M floats, and at last with an array of 10M floats. We have recorded the time taken by each call. We can notice from the recorded times that the function seems to be doing better compared to the majority of our previous trials.
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
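Releasing the GIL pays off mainly when the compiled function is called from several Python threads at once. The sketch below (our own illustration, not part of the original timings) runs the nogil function concurrently from a thread pool:

from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Because the jitted function releases the GIL, these calls can run truly in parallel.
arrays = [np.arange(1e6, dtype=np.float64) for _ in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(perform_operation_jitted, arrays))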
As we have highlighted several times, Numba works well with Python loops and NumPy. Although pandas is built on top of NumPy, Numba cannot improve code that manipulates pandas data structures through pandas operations. The reason is that Numba has no access to the lower-level code behind the pandas API, so there is nothing for it to optimize.
Below we have created a simple function that takes a pandas dataframe as input, performs some operations on its columns, and returns the modified dataframe. We have first run the function normally 3 times and recorded the time of each run.
We have then @jit-decorated the same function and run it again three times, recording the time taken as well. We can clearly see from the results that the @jit decorator does not improve the results; it even increases the time taken by the function.
The examples below show that using Numba on code that only involves pandas will not improve performance. It can even backfire and take extra time on the first run, as seen below, because Numba tries to compile the code for better performance, fails, and finally falls back to pure Python.
Though decorating functions that work on pandas dataframes with the @jit decorator does not improve results, there are ways to speed up such functions. We have discussed how to improve code involving pandas dataframes using Numba and its decorators in a separate tutorial; please feel free to check it. A minimal sketch of one such approach is also shown after the timings below.
def work_on_dataframe(df):
    df['Col1'] = (df.Col1 * 100)
    df['Col2'] = (df.Col1 * df.Col3)
    df = df.where((df > 100) & (df < 10000))
    df = df.dropna(how='any')
    return df
data = {'Col1': range(10000), 'Col2': range(10000), 'Col3': range(10000)}
df = pd.DataFrame(data=data)
%time df = work_on_dataframe(df)
%time df = work_on_dataframe(df)
%time df = work_on_dataframe(df)
from numba import jit
@jit
def work_on_dataframe(df):
    df['Col1'] = (df.Col1 * 100)
    df['Col2'] = (df.Col1 * df.Col3)
    df = df.where((df > 100) & (df < 10000))
    df = df.dropna(how='any')
    return df
data = {'Col1': range(1000), 'Col2': range(1000), 'Col3': range(1000)}
df = pd.DataFrame(data=data)
%time df = work_on_dataframe(df)
%time df = work_on_dataframe(df)
%time df = work_on_dataframe(df)
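As a teaser for that approach (a minimal sketch under the assumption that the column arithmetic is the bottleneck; this is not the method from the separate tutorial), we can pull the underlying NumPy arrays out of the dataframe and jit only the numeric part:

from numba import njit
import numpy as np
import pandas as pd

# Hypothetical helper: operates on plain NumPy arrays, which Numba handles well.
@njit
def scale_and_multiply(col1, col3):
    new_col1 = col1 * 100
    new_col2 = new_col1 * col3
    return new_col1, new_col2

df = pd.DataFrame({'Col1': range(10000), 'Col2': range(10000), 'Col3': range(10000)})
col1, col2 = scale_and_multiply(df['Col1'].to_numpy(), df['Col3'].to_numpy())
df['Col1'], df['Col2'] = col1, col2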
This ends our small tutorial explaining the Numba @jit decorator for speeding up Python code. Please feel free to let us know your views in the comments section.
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.
If you want to