Accelerate your Python code with Numba
Last updated on 20240312  Edit this page
Estimated time: 60 minutes
Overview
Questions
 “How can I run my own Python functions on the GPU?”
Objectives
 “Learn how to use Numba decorators to improve the performance of your Python code.”
 “Run your first application on the GPU.”
Using Numba to execute Python code on the GPU
Numba is a Python library that “translates Python functions to optimized machine code at runtime using the industrystandard LLVM compiler library”. You might want to try it to speed up your code on a CPU. However, Numba can also translate a subset of the Python language into CUDA, which is what we will be using here. So the idea is that we can do what we are used to, i.e. write Python code and still benefit from the speed that GPUs offer us.
We want to compute all prime numbers  i.e. numbers that have only 1 or themselves as exact divisors  between 1 and 10000 on the CPU and see if we can speed it up, by deploying a similar algorithm on a GPU. This is code that you can find on many websites. Small variations are possible, but it will look something like this:
PYTHON
def find_all_primes_cpu(upper):
all_prime_numbers = []
for num in range(0, upper):
prime = True
for i in range(2, (num // 2) + 1):
if (num % i) == 0:
prime = False
break
if prime:
all_prime_numbers.append(num)
return all_prime_numbers
Calling find_all_primes_cpu(10_000)
will return all
prime numbers between 1 and 10000 as a list. Let us time it:
You will probably find that find_all_primes_cpu
takes
several hundreds of milliseconds to complete:
OUTPUT
176 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
As a quick sidestep, add Numba’s JIT (Just in Time compilation)
decorator to the find_all_primes_cpu
function. You can
either add it to the function definition or to the call, so either in
this way:
PYTHON
from numba import jit
@jit(nopython=True)
def find_all_primes_cpu(upper):
all_prime_numbers = []
for num in range(0, upper):
prime = True
for i in range(2, (num // 2) + 1):
if (num % i) == 0:
prime = False
break
if prime:
all_prime_numbers.append(num)
return all_prime_numbers
%timeit n 10 r 1 find_all_primes_cpu(10_000)
or in this way:
PYTHON
from numba import jit
upper = 10_000
%timeit n 10 r 1 jit(nopython=True)(find_all_primes_cpu)(upper)
which can give you a timing result similar to this:
OUTPUT
69.5 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
So twice as fast, by using a simple decorator. The speedup is much
larger for upper = 100_000
, but that takes a little too
much waiting time for this course. Despite the
jit(nopython=True)
decorator the computation is still
performed on the CPU. Let us move the computation to the GPU. There are
a number of ways to achieve this, one of them is the usage of the
jit(device=True)
decorator, but it depends very much on the
nature of the computation. Let us write our first GPU kernel which
checks if a number is a prime, using the cuda.jit
decorator, so different from the jit
decorator for CPU
computations. It is essentially the inner loop of
find_all_primes_cpu
:
PYTHON
from numba import cuda
@cuda.jit
def check_prime_gpu_kernel(num, result):
result[0] = num
for i in range(2, (num // 2) + 1):
if (num % i) == 0:
result[0] = 0
break
A number of things are worth noting. CUDA kernels do not return anything, so you have to supply for an array to be modified. All arguments have to be arrays, if you work with scalars, make them arrays of length one. This is the case here, because we check if a single number is a prime or not. Let us see if this works:
PYTHON
import numpy as np
result = np.zeros((1), np.int32)
check_prime_gpu_kernel[1, 1](11, result)
print(result[0])
check_prime_gpu_kernel[1, 1](12, result)
print(result[0])
If we have not made any mistake, the first call should return “11”, because 11 is a prime number, while the second call should return “0” because 12 is not a prime:
OUTPUT
11
0
Note the extra arguments in square brackets  [1, 1]

that are added to the call of check_prime_gpu_kernel
: these
indicate the number of threads we want to run on the GPU. While this is
an important argument, we will explain it later and for now we can keep
using 1
.
One possible implementation of this function is the following one.
PYTHON
def find_all_primes_cpu_and_gpu(upper):
all_prime_numbers = []
for num in range(0, upper):
result = np.zeros((1), np.int32)
check_prime_gpu_kernel[1,1](num, result)
if result[0] != 0:
all_prime_numbers.append(num)
return all_prime_numbers
%timeit n 10 r 1 find_all_primes_cpu_and_gpu(10_000)
OUTPUT
6.21 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
As you may have noticed, find_all_primes_cpu_and_gpu
is
much slower than the original find_all_primes_cpu
. The
reason is that the overhead of calling the GPU, and transferring data to
and from it, for each number of the sequence is too large. To be
efficient the GPU needs enough work to keep all of its cores busy.
Let us give the GPU a work load large enough to compensate for the
overhead of data transfers to and from the GPU. For this example of
computing primes we can best use the vectorize
decorator
for a new check_prime_gpu
function that takes an array as
input instead of upper
in order to increase the work load.
This is the array we have to use as input for our new
check_prime_gpu
function, instead of upper, a single
integer:
So that input to the new check_prime_gpu
function is
simply the array of numbers we need to check for primes.
check_prime_gpu
looks similar to
check_prime_gpu_kernel
, but it is not a kernel, so it can
return values:
PYTHON
import numba as nb
@nb.vectorize(['int32(int32)'], target='cuda')
def check_prime_gpu(num):
for i in range(2, (num // 2) + 1):
if (num % i) == 0:
return 0
return num
where we have added the vectorize
decorator from Numba.
The argument of check_prime_gpu
seems to be defined as a
scalar (single integer in this case), but the vectorize
decorator will allow us to use an array as input. That array should
consist of 4B (bytes) or 32b (bit) integers, indicated by
(int32)
. The return array will also consist of 32b
integers, with zeros for the nonprimes. The nonzero values are the
primes.
Let us run it and record the elapsed time:
which should show you a significant speedup:
OUTPUT
5.9 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
This amounts to a speedup of our code of a factor 11 compared to the
jit(nopython=True)
decorated code on the CPU.