
GPU Programming

Introduction

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • What is a Graphics Processing Unit?

  • Can a GPU be used for anything else than graphics?

  • Are GPUs faster than CPUs?

Objectives
  • Learn how a GPU works

  • Understand the differences between CPU and GPU

Graphics Processing Unit

The Graphics Processing Unit (GPU) is one of the components of a computer’s video card, together with specialized memory and different Input/Output (I/O) units. In the context of the video card, the GPU fulfills a role similar to the one that the Central Processing Unit (CPU) has in a general purpose computing system: it processes input data to generate some kind of output. In the traditional context of video cards, GPUs process data in order to render images on an output device, such as a screen or monitor. However, current GPUs are general purpose computing devices able to perform any kind of computation.

Parallel by Design

But what is the reason to use GPUs to perform general purpose computation, when computers already have fast CPUs that are able to perform any kind of computation? One way to answer this question is to go back to the roots of what a GPU is designed to do.

An image can be seen as a matrix of points called pixels (a portmanteau of the words picture and element), with each pixel representing the color the image should have in that particular point. The traditional task performed by video cards is to produce the images a user will see on the screen, so GPUs are designed with this particular task in mind: rendering multiple pixels at the same time.

A single 4K UHD image contains more than 8 million pixels. If a GPU needs to generate a continuous stream of 25 4K frames (images) per second, enough for a user to not experience delays in a videogame, movie, or any other video output, it must process over 200 million pixels per second. So GPUs are not only designed to render multiple pixels at the same time, they are designed to do it efficiently.

This design principle results in the GPU being, from a hardware point of view, a very different device than a CPU. The CPU is a very general purpose device, good at many different tasks, whether they are parallel or sequential in nature; it is also designed for interaction with the user, so it has to be responsive and guarantee minimal latency. The result is a device where most of the silicon is used for memory caches and control-flow logic, not just compute units. By contrast, most of the silicon on a GPU is actually used for compute units.

The GPU does not need an overly complicated cache hierarchy, nor does it need complex control logic, because the overall goal is not to minimize the latency of any given thread, but to maximize the throughput of the whole computation. With many compute units available, the GPU can run massively parallel programs, programs in which thousands of threads are executed at the same time, while thousands more are ready for execution to hide the cost of memory operations.

Speed Benefits

So, GPUs are massively parallel devices that can execute thousands of threads at the same time. But what does this mean in practice for the user? Why would anyone need to use a GPU to compute something that can easily be computed on a CPU? We begin with an example: sorting a large array in Python.

First we need to create an array of random single precision floating point numbers.

import numpy as np
size = 4096 * 4096
input = np.random.random(size).astype(np.float32)

We then time the execution of the NumPy sort() function, to see how long sorting this array takes on the CPU.

%timeit output = np.sort(input)

While the timing of this operation will differ depending on the system on which you run the code, these are the results for one experiment running on a Jupyter notebook on Google Colab.

1 loop, best of 5: 1.84 s per loop

We now perform the same sorting operation, but this time we will be using CuPy to execute the sort() on the GPU. CuPy is an open-source library, compatible with NumPy, for GPU computing in Python.

import cupy as cp
input_gpu = cp.asarray(input)
%timeit output_gpu = cp.sort(input_gpu)

We also report the output, obtained on the same notebook on Google Colab; as always note that your result will vary based on the environment and GPU you are using.

100 loops, best of 5: 6.83 ms per loop

Sorting an array using CuPy, and therefore the GPU, is clearly much faster than using NumPy; but how much faster? Having recorded the average execution time of both operations, we can compute the speedup of using CuPy over NumPy. The speedup is defined as the ratio between the sequential (NumPy in our case) and parallel (CuPy in our case) execution times; beware that both execution times need to be in the same unit, which is why we convert the GPU execution time from milliseconds to seconds.

1.84 / 0.00683

With the result of the previous operation being the following.

269.39970717423137

We can therefore say that by using the GPU with CuPy for our sort() operation we obtained a speedup of 269, or simply put an improvement in performance of 269 times.

Key Points


Programming your GPU using CuPy

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can I copy my data to the GPU?

  • How can I do a calculation on a GPU?

  • How can I copy the result back to my computer?

Objectives
  • Be able to indicate if an array, represented by a variable in an iPython shell, is stored in host or device memory.

  • Be able to copy the contents of this array from host to device memory and vice versa.

  • Be able to select the appropriate function to convolve an image using either CPU or GPU compute power.

  • Be able to quickly estimate the speed benefits for a simple calculation by moving it from the CPU to the GPU.

Introduction to CuPy

CuPy is a GPU array library that implements a subset of the NumPy and SciPy interfaces. This makes it a very convenient tool to use the compute power of GPUs for people that have some experience with NumPy, without the need to write code in a GPU programming language such as CUDA, OpenCL, or HIP.
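
To give a first feeling of what this means in practice, here is a minimal sketch of our own (not part of the exercise that follows): the same array expression runs on the CPU or on the GPU depending only on which library created the array.

import numpy as np
import cupy as cp

x_cpu = np.linspace(0, 1, 10)      # lives in host memory, computed by the CPU
x_gpu = cp.linspace(0, 1, 10)      # lives in GPU memory, computed by the GPU

print(np.sum(x_cpu ** 2))
print(cp.sum(x_gpu ** 2))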

Convolution in Python

We start by generating an artificial “image” on the host using Python and NumPy; the host is the CPU on the laptop, desktop, or cluster node you are using right now, and from now on we may use host to refer to the CPU and device to refer to the GPU. The image will be all zeros, except for isolated pixels with value one, on a regular grid. We will convolve it with a Gaussian and inspect the result. We will also record the time it takes to execute this convolution on the host.

We can interactively write and execute the code in an iPython shell or a Jupyter notebook.

import numpy as np

# Construct a subimage with all zeros and a single one in the middle
primary_unit = np.zeros((16, 16))
primary_unit[8, 8] = 1

# Now duplicate this subimage many times to construct a larger image
deltas = np.tile(primary_unit, (128, 128))
print(deltas.shape)

The final print should show that you have indeed built a large image.

Out[7]: (2048, 2048)

To get a feeling for what the whole image looks like, we can display the top-left corner of it.

import pylab as pyl

# Display the image
# You can zoom in using the menu in the window that will appear
pyl.imshow(deltas[0:32, 0:32])
pyl.show()

The result of this should be four times the content of primary_unit.

Background

The computation we want to perform on this image is a convolution, once on the host and once on the device so we can compare the results and execution times. In computer vision applications, convolutions are often used to filter images and if you want to know more about them, we encourage you to check out this github repository by Vincent Dumoulin and Francesco Visin with some great animations. We have already seen that we can think of an image as a matrix of color values, when we convolve that image with a particular filter, we generate a new matrix with different color values. An example of convolution can be seen in the figure below (illustration by Michael Plotke, CC BY-SA 3.0, via Wikimedia Commons).

Example of convolution

In our example, we will convolve our image with a 2D Gaussian function shown below:

$$G(x,y) = \frac{1}{2\pi \sigma^2} e^{-\frac{x^2 + y^2}{2 \sigma^2}}$$

Where x and y are the coordinates in our matrix, i.e. our rows and columns, and \(\sigma\) controls the width of the Gaussian distribution. Convolving images with 2D Gaussian functions will change the value of each pixel to a weighted average of the pixels around it, thereby “smoothing” the image. Convolving images with a Gaussian function denoises the image, which is often required in edge detection, since most algorithms for that are sensitive to noise.

Convolution on the CPU Using SciPy

Let us first construct the Gaussian, and then display it. Remember that at this point we are still doing everything with standard Python, and not using the GPU yet.

x, y = np.meshgrid(np.linspace(-2, 2, 15), np.linspace(-2, 2, 15))
dst = np.sqrt(x*x + y*y)
sigma = 1
muu = 0.000
gauss = np.exp(-((dst-muu)**2/(2.0 * sigma**2)))
pyl.imshow(gauss)
pyl.show()

This should show you a symmetrical two-dimensional Gaussian. Now we are ready to do the convolution on the host. We do not have to write this convolution function ourselves, as it is very conveniently provided by SciPy. Let us also record the time it takes to perform this convolution and inspect the top left corner of the convolved image.

from scipy.signal import convolve2d as convolve2d_cpu

convolved_image_using_CPU = convolve2d_cpu(deltas, gauss)
%timeit convolve2d_cpu(deltas, gauss)
pyl.imshow(convolved_image_using_CPU[0:32, 0:32])
pyl.show()

Obviously, the time to perform this convolution will depend very much on the power of your CPU, but I expect you to find that it takes a couple of seconds.

1 loop, best of 5: 2.52 s per loop

When you display the corner of the image, you can see that the “ones” surrounded by zeros have actually been blurred by a Gaussian, so we end up with a regular grid of Gaussians.

Convolution on the GPU Using CuPy

This is part of a lesson on GPU programming, so let us use the GPU. Although there is a physical connection - i.e. a cable - between the CPU and the GPU, they do not share the same memory space. This means that an array created from e.g. an iPython shell using NumPy is physically located in the main memory of the host, and therefore available to the CPU but not to the GPU. It is not yet present in GPU memory, which means that we need to copy our data, the input image and the convolving function, to the GPU before we can execute any code on it. In practice, we have the arrays deltas and gauss in the host’s RAM, and we need to copy them to GPU memory using CuPy.

import cupy as cp

deltas_gpu = cp.asarray(deltas)
gauss_gpu = cp.asarray(gauss)

Now it is time to do the convolution on the GPU. SciPy does not offer functions that can use the GPU, so we need to import the convolution function from another library, called cupyx; cupyx.scipy contains a subset of all SciPy routines. You will see that the GPU convolution function from the cupyx library looks very much like the convolution function from SciPy we used previously. In general, NumPy and CuPy look very similar, as well as the SciPy and cupyx libraries, and this is on purpose to facilitate the use of the GPU by programmers that are already familiar with NumPy and SciPy. Let us again record the time to execute the convolution, so that we can compare it with the time it took on the host.

from cupyx.scipy.signal import convolve2d as convolve2d_gpu

convolved_image_using_GPU = convolve2d_gpu(deltas_gpu, gauss_gpu)
%timeit convolve2d_gpu(deltas_gpu, gauss_gpu)

Similar to what we had previously on the host, the execution time of the GPU convolution will depend very much on the GPU used, but you should find that it takes on the order of tens of milliseconds. This is what we got on a TITAN X (Pascal) GPU:

1000 loops, best of 5: 20.2 ms per loop

This is a lot faster than on the host, a performance improvement, or speedup, of 125 times. Impressive!

Challenge: Try a shortcut: convolution on the GPU without CuPy

Try to convolve the Numpy array deltas with the Numpy array gauss directly on the GPU, so without using CuPy arrays. If we succeed, this should save us the time and effort of transferring deltas and gauss to the GPU.

Solution

We can again use the GPU convolution function from the cupyx library: convolve2d_gpu and use deltas and gauss as input.

convolve2d_gpu(deltas, gauss)

However, this gives a long error message with this last line:

TypeError: Unsupported type <class 'numpy.ndarray'>

It is unfortunately not possible to access Numpy arrays from the GPU directly. Numpy arrays exist in the Random Access Memory (RAM) of the host and not in GPU memory. These are two separate memory spaces, but data can be transferred between them.
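
As a small illustration of these transfers (a sketch of our own, reusing the cp and np imports from above), the type of an array tells you where it lives, and cp.asarray() and cp.asnumpy() move data between the two memories.

a_host = np.arange(5, dtype=np.float32)   # lives in host RAM
a_dev = cp.asarray(a_host)                # copied from host to device
print(type(a_host), type(a_dev))          # numpy.ndarray versus cupy.ndarray
a_back = cp.asnumpy(a_dev)                # copied from device back to host
print(np.allclose(a_host, a_back))        # True: same values, different memories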

Compare the results. Copy the convolved image from the device back to the host

To check that we actually computed the same output on the host and the device we can compare the two output arrays convolved_image_using_GPU and convolved_image_using_CPU.

np.allclose(convolved_image_using_GPU, convolved_image_using_CPU)

As you may expect, the result of the comparison is positive, and in fact we computed the same results on the host and the device.

array(True)

Challenge: Fairer runtime comparison CPU vs. GPU

Compute the CPU vs GPU speedup while taking into account the transfers of data to the GPU and back. You should now find a lower speedup, because the overhead of transferring the arrays is included. Hint: to copy a CuPy array back to the host (CPU), use cp.asnumpy().

Solution

For timing, it is most convenient to define a function that completes all the steps.

def transfer_compute_transferback():
    deltas_gpu = cp.asarray(deltas)
    gauss_gpu = cp.asarray(gauss)
    convolved_image_using_GPU = convolve2d_gpu(deltas_gpu, gauss_gpu)
    convolved_image_using_GPU_copied_to_host = cp.asnumpy(convolved_image_using_GPU)
   
%timeit transfer_compute_transferback()
10 loops, best of 5: 35.1 ms per loop

This means that our speedup has decreased from 2520 ms/20.2 ms = 125 to 2520 ms/35.1 ms = 72. This is still a significant speedup, and it now properly accounts for the additional data transfers.

A shortcut: performing Numpy routines on the GPU.

We saw above that we cannot execute routines from the “cupyx” library directly on Numpy arrays. In fact, we need to first transfer the data from host to device memory. Vice versa, if we try to execute a regular Scipy routine (i.e. designed to run on the CPU) on a CuPy array, we will also encounter an error. Try the following:

convolve2d_cpu(deltas_gpu, gauss_gpu)

This results in

......
......
......
TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

So Scipy routines cannot have CuPy arrays as input. We can, however, execute a simpler command that does not require Scipy. Instead of 2D convolution, we can do 1D convolution. For that we can use a Numpy routine instead of a Scipy routine. The “convolve” routine from Numpy performs linear (1D) convolution. To generate some input for a linear convolution, we can flatten our image from 2D to 1D (using ravel()), but we also need a 1D kernel. For the latter we will take the diagonal elements of our 2D Gaussian kernel. Try the following three instructions for linear convolution on the CPU:

deltas_1d = deltas.ravel()
gauss_1d = gauss.diagonal()
%timeit np.convolve(deltas_1d, gauss_1d)

You could arrive at something similar to this timing result:

104 ms ± 32.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

We have performed a regular linear convolution using our CPU. Now let us try something bold. We will transfer the 1D arrays to the GPU and use the Numpy (!) routine to do the convolution. Again, we have to issue three commands:

deltas_1d_gpu = cp.asarray(deltas_1d)
gauss_1d_gpu = cp.asarray(gauss_1d)
%timeit np.convolve(deltas_1d_gpu, gauss_1d_gpu)

You may be surprised that we can issue these commands without error. Contrary to Scipy routines, Numpy accepts CuPy arrays, i.e. arrays that exist in GPU memory, as input. Here you can find some background on why Numpy routines can handle CuPy arrays.

Also, remember the np.allclose command above, which took a NumPy and a CuPy array as input? That worked for the same reason.
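
A quick way to convince yourself where such a mixed call is executed is to inspect the type of the result; this small check is our own addition and should show a CuPy array, confirming the work was done on the GPU.

result_1d = np.convolve(deltas_1d_gpu, gauss_1d_gpu)
print(type(result_1d))   # expected: <class 'cupy.ndarray'>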

The linear convolution is actually performed on the GPU, which is shown by a nice speedup:

731 µs ± 106 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

So this implies a speedup of a factor 104/0.731 = 142. Impressive.

Key Points


Programming your GPU using Numba

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can I copy my data to the GPU?

  • How can I do a calculation on a GPU?

  • How can I copy the result back to my computer?

Objectives
  • Be able to indicate if an array, represented by a variable in an iPython shell, is stored in host or device memory.

  • Be able to copy the contents of this array from host to device memory and vice versa.

  • Be able to select the appropriate function to convolve an image using either CPU or GPU compute power.

  • Be able to quickly estimate the speed benefits for a simple calculation by moving it from the CPU to the GPU.

Using Numba to execute Python code on the GPU

Numba is a Python library that “translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library”. You might want to try it to speed up your code on a CPU. However, Numba can also translate a subset of the Python language into CUDA, which is what we will be using here. So the idea is that we can do what we are used to, i.e. write Python code and still benefit from the speed that GPUs offer us.

We want to compute all prime numbers - i.e. numbers that have only 1 and themselves as divisors without a remainder - up to 10000 on the CPU, and see if we can speed it up by deploying a similar algorithm on a GPU. This is code that you can find on many websites. Small variations are possible, but it will look something like this:

def find_all_primes_cpu(upper):
    all_prime_numbers=[]
    for num in range(2, upper):
        # all prime numbers are greater than 1
        for i in range(2, num):
            if (num % i) == 0:
                break
        else:
            all_prime_numbers.append(num)
    return all_prime_numbers

Calling “find_all_primes_cpu(10000)” will return all prime numbers up to 10000 as a list. Let us time it:

%timeit find_all_primes_cpu(10000) 

You will probably find that “find_all_primes_cpu” takes several hundreds of milliseconds to complete:

378 ms ± 45.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As a quick sidestep, add Numba’s JIT (Just in Time compilation) decorator to the “find_all_primes_cpu” function. You can either add it to the function definition or to the call, so either in this way:

from numba import jit

@jit(nopython=True)
def find_all_primes_cpu(upper):
    all_prime_numbers=[]
    ....
    ....

or in this way:

%timeit jit(nopython=True)(find_all_primes_cpu)(10000)

which can give you a timing result similar to this:

165 ms ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So twice as fast, with a simple decorator. The speedup is much larger for upper = 100000, but that takes a little too much waiting time for this course. Despite the “jit(nopython=True)” decorator, the computation is still performed on the CPU. Let us move the computation to the GPU. There are a number of ways to achieve this; one of them is the use of the “cuda.jit(device=True)” decorator, but which approach works best depends very much on the nature of the computation. Let us write our first GPU kernel, which checks if a number is a prime, using the cuda.jit decorator (different from the jit decorator used for CPU computations). It is essentially the inner loop of “find_all_primes_cpu”:

from numba import cuda

@cuda.jit
def check_prime_gpu_kernel(num, result):
   # all prime numbers are greater than 1
   result[0] =  0
   for i in range(2, num):
       if (num % i) == 0:
           break
   else:
       result[0] = num

A number of things are worth noting. CUDA kernels do not return anything, so you have to supply an array that the kernel can modify to hold the output; if your output is a scalar, make it an array of length one. This is the case here, because we check whether a single number is a prime or not. Let us see if this works:

result = np.zeros((1), np.int32)
check_prime_gpu_kernel[1, 1](11, result)
print(result[0])
check_prime_gpu_kernel[1, 1](12, result)
print(result[0])

This should return “11”, because that is a prime and “0” because 12 is not a prime:

11
0

Note the extra arguments in square brackets - [1, 1] - that are added to the call of “check_prime_gpu_kernel”. These indicate the number of “blocks per grid” and the number of “threads per block”, respectively. These concepts will be explained in a later session. We will set both to 1 for now.

Challenge: Write a function find_all_primes_cpu_and_gpu that uses check_prime_gpu_kernel and the outer loop similar to find_all_primes_cpu.

How long does it take to find all primes up to 10000?

Solution

def find_all_primes_cpu_and_gpu(upper):
    all_prime_numbers=[]
    for num in range(2, upper):
        result = np.zeros((1), np.int32)
        # launch the kernel with one block of one thread to check this number
        check_prime_gpu_kernel[1, 1](num, result)
        if result[0] > 0:
            all_prime_numbers.append(num)
    return all_prime_numbers
   
%timeit find_all_primes_cpu_and_gpu(10000)
6.62 s ± 152 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Wow, that is slow! So much slower than find_all_primes_cpu. Clearly, we have not given the GPU enough work to do: the overhead of launching a kernel and transferring data for every single number is much larger than the actual workload.

Let us give the GPU a workload large enough to compensate for the overhead of data transfers to and from the GPU. For this example of computing primes, the best option is the “vectorize” decorator, applied to a new “check_prime_gpu” function that takes an array as input instead of the single integer “upper”, thereby increasing the workload. This is the array we will use as input for the new “check_prime_gpu” function:

np.arange(2, 10000, dtype=np.int32)

So the input to the new “check_prime_gpu” function is simply the array of numbers we need to check for primes. “check_prime_gpu” looks similar to “check_prime_gpu_kernel”, but it is not a kernel, so it can return values:

import numba as nb

@nb.vectorize(['int32(int32)'], target='cuda')
def check_prime_gpu(num):
    for i in range(2, num):
        if (num % i) == 0:
            return 0
    else:
        return num

where we have added the “vectorize” decorator from Numba. The argument of “check_prime_gpu” seems to be defined as a scalar (a single integer in this case), but the “vectorize” decorator will allow us to use an array as input. That array should consist of 32-bit (4-byte) integers, as indicated by “(int32)”. The returned array will also consist of 32-bit integers, with zeros for the non-primes; the nonzero values are the primes.

Let us run it and record the elapsed time:

%timeit check_prime_gpu(np.arange(2, 10000, dtype=np.int32))

which should show you a significant speedup:

3.25 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This amounts to an acceleration of our code by a factor of 165/3.25 = 50.8 compared to the “jit(nopython=True)” decorated code on the CPU.
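
As a quick aside: because the non-primes come back as zeros, extracting the actual list of primes from the returned array is a one-liner. This small sketch is our own addition and assumes, as is usually the case when passing host arrays, that the result comes back as a regular NumPy array.

gpu_result = check_prime_gpu(np.arange(2, 10000, dtype=np.int32))
primes = gpu_result[gpu_result > 0]   # keep only the nonzero entries, i.e. the primes
print(primes[:10])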

Key Points


Closure Day 1

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How do I organise the computations on my GPU in an efficient manner?

Objectives
  • Understand the building blocks of the CUDA programming model, i.e. threads, blocks and grids

Threads, blocks and grids

On Day 1, computations were done on a GPU using the CuPy and Numpy interfaces, i.e. from a high abstraction level. How the compute power of the GPU was applied remained ‘under the hood’, with one exception, where we used a single GPU thread to perform a computation. On Day 2 it will be shown that we have full control over how our computations are distributed over the GPU. The concepts that define that control are shown in this graph. It is meant to sink in overnight, so we will not discuss all of its aspects now. That will be done on the second day.

Threads, blocks and grids

Key Points

  • There is a large amount of freedom in distributing your computations over the GPU, but a lot of configurations will render your GPU mostly idle.


Your First GPU Kernel

Overview

Teaching: 45 min
Exercises: 25 min
Questions
  • How can I parallelize a Python application on a GPU?

  • How to write a GPU program?

  • What is CUDA?

Objectives
  • Recognize possible data parallelism in Python code

  • Understand the structure of a CUDA program

  • Execute a CUDA program in Python using CuPy

Summing Two Vectors in Python

We start by introducing a program that, given two input vectors of the same size, returns a third vector containing the sum of the corresponding elements of the two input vectors.

def vector_add(A, B, C, size):
    for item in range(0, size):
        C[item] = A[item] + B[item]
    
    return C

One of the characteristics of this program is that each iteration of the for loop is independent of the other iterations. In other words, we could reorder the iterations and still produce the same output, or even compute each iteration in parallel or on a different device, and still come up with the same output. These are the kinds of programs that we would call naturally parallel, and they are perfect candidates for being executed on a GPU.
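
To make this independence concrete, the same computation can be written as a single array expression, in which no iteration order is implied at all; this short aside is ours and is not needed for the CUDA version that follows.

import numpy as np

size = 1024
A = np.random.rand(size).astype(np.float32)
B = np.random.rand(size).astype(np.float32)
C = A + B   # element-wise sum: every element could be computed in parallel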

Summing Two Vectors in CUDA

While we could just use CuPy to run something equivalent to our vector_add on a GPU, our goal is to learn how to write code that can be executed by GPUs, therefore we now begin learning CUDA.

The CUDA-C language is a GPU programming language and API developed by NVIDIA. It is mostly equivalent to C/C++, with some special keywords, built-in variables, and functions.

We begin our introduction to CUDA by writing a small kernel, i.e. a GPU program, that computes the same function that we just described in Python.

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
    int item = threadIdx.x;
    C[item] = A[item] + B[item];
}
We are aware that CUDA is a proprietary solution, and that there are open-source alternatives such as OpenCL.
However, CUDA is the most used platform for GPU programming and therefore we decided to use it for our teaching material.

Running Code on the GPU with CuPy

Before delving deeper into the meaning of all lines of code, let us try to execute the code on a GPU. To compile the code and manage the GPU in Python we are going to use the interface provided by CuPy.

import cupy

# size of the vectors
size = 1024

# allocating and populating the vectors
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)

# CUDA vector_add
vector_add_cuda_code = r'''
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
    int item = threadIdx.x;
    C[item] = A[item] + B[item];
}
'''
vector_add_gpu = cupy.RawKernel(vector_add_cuda_code, "vector_add")

vector_add_gpu((1, 1, 1), (size, 1, 1), (a_gpu, b_gpu, c_gpu, size))

And to be sure that the CUDA code does exactly what we want, we can execute our sequential Python code and compare the results.

import numpy

a_cpu = cupy.asnumpy(a_gpu)
b_cpu = cupy.asnumpy(b_gpu)
c_cpu = numpy.zeros(size, dtype=numpy.float32)

vector_add(a_cpu, b_cpu, c_cpu, size)

# test
if numpy.allclose(c_cpu, c_gpu):
    print("Correct results!")
Correct results!

Understanding the CUDA Code

We can now move back to the CUDA code and analyze it line by line to highlight the differences between CUDA-C and standard C.

__global__ void vector_add(const float * A, const float * B, float * C, const int size)

This is the definition of our CUDA vector_add function. The __global__ keyword is an execution space identifier, and it is specific to CUDA. What this keyword means is that the defined function will be able to run on the GPU, but can also be called from the host (in our case the Python interpreter running on the CPU). All of our kernel definitions will be preceded by this keyword.

Other execution space identifiers in CUDA-C are __host__, and __device__. Functions annotated with the __host__ identifier will run on the host, and be only callable from the host, while functions annotated with the __device__ identifier will run on the GPU, but can only be called from the GPU itself. We are not going to use these identifiers as often as __global__.
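
As a small illustrative sketch of our own (not code we will need later), a kernel can call a __device__ helper function that only ever runs on the GPU:

import cupy

size = 256
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.zeros(size, dtype=cupy.float32)

add_one_cuda_code = r'''
__device__ float add_one(const float value)
{
    return value + 1.0f;
}

extern "C"
__global__ void add_one_kernel(const float * A, float * B, const int size)
{
    int item = threadIdx.x;
    if ( item < size )
    {
        B[item] = add_one(A[item]);  // add_one is callable only from GPU code
    }
}
'''
add_one_gpu = cupy.RawKernel(add_one_cuda_code, "add_one_kernel")
add_one_gpu((1, 1, 1), (size, 1, 1), (a_gpu, b_gpu, size))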

int item = threadIdx.x;
C[item] = A[item] + B[item];

This is the part of the code in which we do the actual work. As you may see, it looks similar to the body of the loop in our vector_add Python function, with the main difference being how the value of the item variable is evaluated.

In fact, while in Python the content of item is the result of the range function, in CUDA we are reading a special variable, i.e. threadIdx, containing a triplet that indicates the id of a thread inside a three-dimensional CUDA block. In this particular case we are working on a one-dimensional vector, and are therefore only interested in the first dimension, which is stored in the x field of this variable.

Challenge: Loose threads

We know enough now to pause for a moment and do a little exercise. Assume that in our vector_add kernel we replace the following line:

int item = threadIdx.x;

With this other line of code:

int item = 1;

What will the result of this change be?

1) Nothing changes

2) Only the first thread is working

3) Only C[1] is written

4) All elements of C are zero

Solution

The correct answer is number 3, only the element C[1] is written, and we do not even know by which thread!

Computing Hierarchy in CUDA

In the previous example we had a small vector of size 1024, and each of the 1024 threads we generated was working on one of the elements.

What would happen if we changed the size of the vector to a larger number, such as 2048? We modify the value of the variable size and try again.

# size of the vectors
size = 2048

# allocating and populating the vectors
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)

# CUDA vector_add
vector_add_gpu = cupy.RawKernel(r'''
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
    int item = threadIdx.x;
    C[item] = A[item] + B[item];
}
''', "vector_add")

vector_add_gpu((1, 1, 1), (size, 1, 1), (a_gpu, b_gpu, c_gpu, size))

This is what the output should look like when running the code in a Jupyter Notebook:

---------------------------------------------------------------------------

CUDADriverError                           Traceback (most recent call last)

<ipython-input-4-a26bc8acad2f> in <module>()
     19 ''', "vector_add")
     20 
---> 21 vector_add_gpu((1, 1, 1), (size, 1, 1), (a_gpu, b_gpu, c_gpu, size))
     22 
     23 print(c_gpu)

cupy/core/raw.pyx in cupy.core.raw.RawKernel.__call__()

cupy/cuda/function.pyx in cupy.cuda.function.Function.__call__()

cupy/cuda/function.pyx in cupy.cuda.function._launch()

cupy_backends/cuda/api/driver.pyx in cupy_backends.cuda.api.driver.launchKernel()

cupy_backends/cuda/api/driver.pyx in cupy_backends.cuda.api.driver.check_status()

CUDADriverError: CUDA_ERROR_INVALID_VALUE: invalid argument

The reason for this error is that most GPUs will not allow us to execute a block composed of more than 1024 threads. If we look at the parameters of our function, we see that the first two parameters are two triplets.

vector_add_gpu((1, 1, 1), (size, 1, 1), (a_gpu, b_gpu, c_gpu, size))

The first triplet specifies the size of the CUDA grid, while the second triplet specifies the size of the CUDA block. The grid is a three-dimensional structure in the CUDA programming model and it represents the organization of a whole kernel execution. A grid is made of one or more independent blocks, and in the case of our previous snippet of code we have a grid composed of a single block (1, 1, 1). The size of this block is specified by the second triplet, in our case (size, 1, 1). While blocks are independent of each other, the threads composing a block are not completely independent: they share resources and can also communicate with each other.

To go back to our example, we can modify the grid specification from (1, 1, 1) to (2, 1, 1), and the block specification from (size, 1, 1) to (size // 2, 1, 1). If we run the code again, we should now get the expected output.
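
For clarity, this is what the modified launch line looks like; only the launch configuration changes, the kernel code and the arrays stay the same.

vector_add_gpu((2, 1, 1), (size // 2, 1, 1), (a_gpu, b_gpu, c_gpu, size))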

We already introduced the special variable threadIdx when introducing the vector_add CUDA code, and we said it contains a triplet specifying the coordinates of a thread in a thread block. CUDA has other variables that are important to understand the coordinates of each thread and block in the overall structure of the computation.

These special variables are blockDim, blockIdx, and gridDim, and they are all triplets. The triplet contained in blockDim represents the size of the calling thread’s block in three dimensions. While the content of threadIdx is different for each thread in the same block, the content of blockDim is the same because the size of the block is the same for all threads. The coordinates of a block in the computational grid are contained in blockIdx, therefore the content of this variable will be the same for all threads in the same block, but different for threads in different blocks. Finally, gridDim contains the size of the grid in three dimensions, and it is again the same for all threads.

Challenge: Hidden variables

Given the following snippet of code:

size = 512
vector_add_gpu((4, 1, 1), (size, 1, 1), (a_gpu, b_gpu, c_gpu, size))

What is the content of the blockDim and gridDim variables inside the CUDA vector_add kernel?

Solution

The content of blockDim is (512, 1, 1) and the content of gridDim is (4, 1, 1), for all threads.

What happens if we run the code that we just modified to work on a vector of 2048 elements, and compare the results with our CPU version?

# size of the vectors
size = 2048

# allocating and populating the vectors
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)
a_cpu = cupy.asnumpy(a_gpu)
b_cpu = cupy.asnumpy(b_gpu)
c_cpu = numpy.zeros(size, dtype=numpy.float32)

# CPU code
def vector_add(A, B, C, size):
    for item in range(0, size):
        C[item] = A[item] + B[item]
    
    return C

# CUDA vector_add
vector_add_gpu = cupy.RawKernel(r'''
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
    int item = threadIdx.x;
    C[item] = A[item] + B[item];
}
''', "vector_add")

# execute the code
vector_add_gpu((2, 1, 1), (size // 2, 1, 1), (a_gpu, b_gpu, c_gpu, size))
vector_add(a_cpu, b_cpu, c_cpu, size)

# test
if numpy.allclose(c_cpu, c_gpu):
    print("Correct results!")
else:
    print("Wrong results!")
Wrong results!

The results are wrong! In fact, while we increased the number of threads we launch, we did not modify the kernel code to compute the correct results using the new built-in variables we just introduced.

Challenge: Scaling up

In the following code, fill in the blank to work with vectors that are larger than the largest CUDA block (i.e. 1024).

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = ______________;
   C[item] = A[item] + B[item];
}

Solution

The correct answer is (blockIdx.x * blockDim.x) + threadIdx.x.

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   C[item] = A[item] + B[item];
}

Vectors of Arbitrary Size

So far we have worked with a number of threads that is the same as the elements in the vector. However, in a real world scenario we may have to process vectors of arbitrary size, and to do this we need to modify both the kernel and the way it is launched.

Challenge: More work than necessary

We modified the vector_add kernel to include a check for the size of the vector, so that we only compute elements that are within the vector boundaries. However the code is not correct as it is written now. Can you reorder the lines of the source code to make it work?

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   if ( item < size )
   {
      int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   }
   C[item] = A[item] + B[item];
}

Solution

The correct way to modify the vector_add to work on vectors of arbitrary size is to first compute the coordinates of each thread, and then perform the sum only on elements that are within the vector boundaries.

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   if ( item < size )
   {
      C[item] = A[item] + B[item];
   }
}

To test our changes we can modify the size of the vectors from 2048 to 10000, and execute the code again.

---------------------------------------------------------------------------

CUDADriverError                           Traceback (most recent call last)

<ipython-input-20-00d938215d28> in <module>()
     31 
     32 # Execute the code
---> 33 vector_add_gpu((2, 1, 1), (size // 2, 1, 1), (a_gpu, b_gpu, c_gpu, size))
     34 vector_add(a_cpu, b_cpu, c_cpu, size)
     35 

cupy/core/raw.pyx in cupy.core.raw.RawKernel.__call__()

cupy/cuda/function.pyx in cupy.cuda.function.Function.__call__()

cupy/cuda/function.pyx in cupy.cuda.function._launch()

cupy/cuda/driver.pyx in cupy.cuda.driver.launchKernel()

cupy/cuda/driver.pyx in cupy.cuda.driver.check_status()

CUDADriverError: CUDA_ERROR_INVALID_VALUE: invalid argument

This error is telling us that CUDA cannot launch a block with size // 2 threads, because the maximum number of threads per block is 1024 and we are requesting 5000 threads.

What we need to do is to make grid and block more flexible, so that they can adapt to vectors of arbitrary size. To do that, we can replace the Python code to call vector_add_gpu with the following code.

import math

grid_size = (int(math.ceil(size / 1024)), 1, 1)
block_size = (1024, 1, 1)

vector_add_gpu(grid_size, block_size, (a_gpu, b_gpu, c_gpu, size))

With these changes we always have blocks composed of 1024 threads, but we adapt the number of blocks so that we always have enough threads to compute all elements in the vector. If we want to be able to easily modify the number of threads per block, we can even rewrite the code as follows:

threads_per_block = 1024
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)

vector_add_gpu(grid_size, block_size, (a_gpu, b_gpu, c_gpu, size))

We can now execute the code again.

Correct results!

Challenge: Compute prime numbers with CUDA

Given the following Python code, similar to what we have seen in the previous episode about Numba, write the missing CUDA kernel that computes all the prime numbers up to a certain upper bound.

# CPU
def all_primes_to(upper : int, prime_list : list):
    for num in range(2, upper):
        prime = True
        for i in range(2, (num // 2) + 1):
            if (num % i) == 0:
                prime = False
                break
        if prime:
            prime_list[num] = 1

upper_bound = 100000
all_primes_cpu = numpy.zeros(upper_bound, dtype=numpy.int32)
all_primes_cpu[0] = 1
all_primes_cpu[1] = 1
%timeit all_primes_to(upper_bound, all_primes_cpu)

# GPU
check_prime_gpu_code = r'''
extern "C"
__global__ void all_primes_to(int size, int * const all_prime_numbers)
{
}
'''
# Allocate memory
all_primes_gpu = cupy.zeros(upper_bound, dtype=cupy.int32)

# Compile and execute code
all_primes_to_gpu = cupy.RawKernel(check_prime_gpu_code, "all_primes_to")
grid_size = (int(math.ceil(upper_bound / 1024)), 1, 1)
block_size = (1024, 1, 1)
%timeit all_primes_to_gpu(grid_size, block_size, (upper_bound, all_primes_gpu))

# Test
if numpy.allclose(all_primes_cpu, all_primes_gpu):
    print("Correct results!")
else:
    print("Wrong results!")

Solution

One possible solution is the following one:

check_prime_gpu_code = r'''
extern "C"
__global__ void all_primes_to(int size, int * const all_prime_numbers)
{
    int number = (blockIdx.x * blockDim.x) + threadIdx.x;
    int result = 1;

    if ( number < size )
    {
        for ( int factor = 2; factor <= number / 2; factor++ )
        {
            if ( number % factor == 0 )
            {
                result = 0;
                break;
            }
        }

        all_prime_numbers[number] = result;
    }
}
'''

Key Points

  • Precede your kernel definition with the __global__ keyword

  • Use built-in variables threadIdx, blockIdx, gridDim and blockDim to identify each thread


Registers, Global, and Local Memory

Overview

Teaching: 25 min
Exercises: 20 min
Questions
  • What are registers?

  • How to share data between host and GPU?

  • Which memory is accessible to threads and thread blocks?

Objectives
  • Understanding the difference between registers and device memory

  • Understanding the difference between local and global memory

Now that we know how to write a CUDA kernel to run code on the GPU, and how to use the Python interface provided by CuPy to execute it, it is time to look at the different memory spaces in the CUDA programming model.

Registers

Registers are fast on-chip memories that are used to store operands for the operations executed by the computing cores.

Did we encounter registers in the vector_add code used in the previous episode? Yes we did! The variable item is, in fact, stored in a register for at least part, if not all, of a thread’s execution. In general all scalar variables defined in CUDA code are stored in registers.

Registers are local to a thread, and each thread has exclusive access to its own registers: values in registers cannot be accessed by other threads, even from the same block, and are not available for the host. Registers are also not permanent, therefore data stored in registers is only available during the execution of a thread.

Challenge: how many registers are we using?

How many registers are we using in the vector_add code?

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   
   if ( item < size )
   {
      C[item] = A[item] + B[item];
   }
}

Solution

In general, it is not possible to exactly know how many registers the compiler will use without examining the output generated by the compiler itself. However, we can roughly estimate the amount of necessary registers based on the variables used. We most probably need one register to store the variable item, two registers to store the content of A[item] and B[item], and one additional register to store the sum A[item] + B[item]. So the number of registers that vector_add probably uses is 4.

If we want to make registers use more explicit in the vector_add code, we can try to rewrite it in a slightly different, but equivalent, way.

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   float temp_a, temp_b, temp_c;

   if ( item < size )
   {
       temp_a = A[item];
       temp_b = B[item];
       temp_c = temp_a + temp_b;
       C[item] = temp_c;
   }
}

In this new version of vector_add we explicitly declare three float variables to store the values loaded from memory and the sum of our input items, making the estimation of used registers more obvious.

This is totally unnecessary in the case of our example, because the compiler will determine on its own the right amount of registers to allocate per thread, and what to store in them. However, explicit register usage can be important for reusing items already loaded from memory.

Registers are the fastest memory on the GPU, so using them to increase data reuse is an important performance optimization.
We will look at some examples of manually using registers to improve performance in future episodes.
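
As a small preview (this kernel is our own sketch, not one used elsewhere in the lesson), keeping a value loaded from global memory in a named variable lets us reuse it without reading global memory again:

reuse_cuda_code = r'''
extern "C"
__global__ void square_plus_itself(const float * A, float * B, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;

    if ( item < size )
    {
        float a = A[item];      // loaded from global memory once, kept in a register
        B[item] = (a * a) + a;  // reused twice without touching global memory again
    }
}
'''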

Small CUDA arrays, whose size is known at compile time, will also be allocated in registers by the compiler. We can rewrite the previous version of vector_add to work with an array of registers.

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   float temp[3];

   if ( item < size )
   {
       temp[0] = A[item];
       temp[1] = B[item];
       temp[2] = temp[0] + temp[1];
       C[item] = temp[2];
   }
}

Once again, this is not something that we would normally do, and it is provided only as an example of how to work with arrays of registers.

Global Memory

Global memory can be considered the main memory space of the GPU in CUDA. It is allocated, and managed, by the host, and it is accessible to both the host and the GPU, and for this reason the global memory space can be used to exchange data between the two. It is the largest memory space available, and therefore it can contain much more data than registers, but it is also slower to access. This memory space does not require any special memory space identifier.
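
In fact, every CuPy array we have passed to a kernel so far lives in global memory. The following sketch (our own, assuming the vector_add kernel from the previous episode is compiled as vector_add_gpu) makes the host-side allocation explicit.

size = 2048

# allocated by the host with CuPy; the data lives in the GPU's global memory
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)

# the pointers passed to the kernel refer to global memory
vector_add_gpu((2, 1, 1), (size // 2, 1, 1), (a_gpu, b_gpu, c_gpu, size))

# because global memory is accessible to the host, we can copy the result back
c_cpu = cupy.asnumpy(c_gpu)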

Challenge: identify when global memory is used

Observe the code of vector_add and identify where global memory is used.

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   
   if ( item < size )
   {
      C[item] = A[item] + B[item];
   }
}

Solution

The vectors A, B, and C are stored in global memory.

Memory allocated on the host, and passed as a parameter to a kernel, is by default allocated in global memory.

Global memory is accessible by all threads, from all thread blocks. This means that a thread can read and write any value in global memory.

While global memory is visible to all threads, remember that global memory is not coherent, and changes made by one thread block may not be available to other thread blocks during the kernel execution.
However, all memory operations are finalized when the kernel terminates.

Local Memory

Memory can also be statically allocated from within a kernel, and according to the CUDA programming model such memory will not be global but local memory. Local memory is only visible, and therefore accessible, by the thread allocating it. So all threads executing a kernel will have their own privately allocated local memory.

Challenge: use local memory

Modify the code of vector_add so that intermediate data products are stored in local memory, and only the final result is saved into global memory.

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   
   if ( item < size )
   {
      C[item] = A[item] + B[item];
   }
}

Hint: have a look at the example using an array of registers.

Solution

We need to pass the size of the local array as a new parameter to the kernel, because if we just specified 3 in the code, the compiler would allocate registers and not local memory.

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size, const int local_memory_size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   float local_memory[local_memory_size];
   
   if ( item < size )
   {
      local_memory[0] = A[item];
      local_memory[1] = B[item];
      local_memory[2] = local_memory[0] + local_memory[1];
      C[item] = local_memory[2];
   }
}

The host code could be modified adding one line and changing the way the kernel is called.

local_memory_size = 3
vector_add_gpu((2, 1, 1), (size // 2, 1, 1), (a_gpu, b_gpu, c_gpu, size, local_memory_size))

Local memory is not a particularly fast memory: in fact, it has throughput and latency similar to global memory, but it is much larger than registers. As an example, local memory is automatically used by the CUDA compiler to store spilled registers, i.e. to temporarily store variables that cannot be kept in registers anymore because there is not enough space in the register file, but that will be used again in the future and so cannot be erased.

Key Points

  • Registers can be used to locally store data and avoid repeated memory operations

  • Global memory is the main memory space and it is used to share data between host and GPU

  • Local memory is a particular type of memory that can be used to store data that does not fit in registers and is private to a thread


Shared Memory and Synchronization

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Question

Objectives
  • Learn how to share data between threads

  • Learn how to synchronize threads

So far we have looked at how to use CUDA to accelerate computations, but a common pattern in all the examples encountered so far is that threads worked in isolation. While having different threads perform the same operation on different data is a good pattern for working with GPUs, there are cases in which threads need to communicate. This communication may be necessary because of the way the algorithm we are trying to implement works, or it may derive from a performance goal we are trying to achieve.

Shared Memory

Shared memory is a CUDA memory space that is shared by all threads in a thread block. In this case shared means that all threads in a thread block can read from and write to the block-allocated shared memory, and all changes to this memory will eventually be available to all threads in the block.

To allocate an array in shared memory we need to preface the definition with the identifier __shared__.

Challenge: use of shared memory

Modify the following code to use shared memory for the temp array.

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
  int item = (blockIdx.x * blockDim.x) + threadIdx.x;
  float temp[3];

  if ( item < size )
  {
      temp[0] = A[item];
      temp[1] = B[item];
      temp[2] = temp[0] + temp[1];
      C[item] = temp[2];
  }
}

Solution

To use shared memory for the temp array add the identifier __shared__ to its definition, like in the following code.

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
  int item = (blockIdx.x * blockDim.x) + threadIdx.x;
  __shared__ float temp[3];

  if ( item < size )
  {
      temp[0] = A[item];
      temp[1] = B[item];
      temp[2] = temp[0] + temp[1];
      C[item] = temp[2];
  }
}

While syntactically correct, the previous example is functionally wrong. The reason is that the temp array is no longer private to the thread allocating it: it is now shared by the whole thread block.

Challenge: what is the result of the previous code block?

The previous code example is functionally wrong. Do you know what the result of its execution will be?

Solution

The result is non-deterministic, and definitely not the same as the previous versions of vector_add. Threads will overwrite each other's temporary values, and there will be no guarantee about which value is visible to each thread.

To fix the previous kernel we should allocate enough shared memory for each thread to store three values, so that each thread has its own section of the shared memory array to work with.

To allocate enough memory we need to replace the constant 3 in __shared__ float temp[3]; with something else. If we know that each thread block has 1024 threads, we can write something like the following:

__shared__ float temp[3 * 1024];

But we know by experience that having constants in the code is not a scalable and maintainable solution. The problem is that we need to have a constant value if we want to declare a shared memory array, because the compiler needs to know how much memory to allocate.

A solution to this problem is to declare our array as a pointer, such as:

extern __shared__ float temp[];

And then use CuPy to instruct the compiler about how much shared memory, in bytes, each thread block needs:

# execute the code
vector_add_gpu((2, 1, 1), (size // 2, 1, 1), (a_gpu, b_gpu, c_gpu, size), shared_mem=((size // 2) * 3 * cupy.dtype(cupy.float32).itemsize))

So when launching the kernel we need to specify, via the shared_mem argument, the number of bytes of shared memory that each thread block needs. As you may notice, we had to retrieve the size in bytes of the data type cupy.float32, and this is done with cupy.dtype(cupy.float32).itemsize.

After these changes, the body of the kernel needs to be modified to use the right indices:

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
  int item = (blockIdx.x * blockDim.x) + threadIdx.x;
  int offset = threadIdx.x * 3;
  extern __shared__ float temp[];

  if ( item < size )
  {
      temp[offset + 0] = A[item];
      temp[offset + 1] = B[item];
      temp[offset + 2] = temp[offset + 0] + temp[offset + 1];
      C[item] = temp[offset + 2];
  }
}

And for completeness, we present the full Python code.

import math
import numpy
import cupy

# vector size
size = 2048

# GPU memory allocation
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)
gpu_args = (a_gpu, b_gpu, c_gpu, size)

# CPU memory allocation
a_cpu = cupy.asnumpy(a_gpu)
b_cpu = cupy.asnumpy(b_gpu)
c_cpu = numpy.zeros(size, dtype=numpy.float32)

# CUDA code
vector_add_cuda_code = r'''
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
  int item = (blockIdx.x * blockDim.x) + threadIdx.x;
  int offset = threadIdx.x * 3;
  extern __shared__ float temp[];

  if ( item < size )
  {
      temp[offset + 0] = A[item];
      temp[offset + 1] = B[item];
      temp[offset + 2] = temp[offset + 0] + temp[offset + 1];
      C[item] = temp[offset + 2];
  }
}
'''

# compile and execute code
vector_add_gpu = cupy.RawKernel(vector_add_cuda_code, "vector_add")
threads_per_block = 32
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)
vector_add_gpu(grid_size, block_size, gpu_args, shared_mem=(threads_per_block * 3 * cupy.dtype(cupy.float32).itemsize))

# execute Python code and compare results
vector_add(a_cpu, b_cpu, c_cpu, size)
numpy.allclose(c_cpu, c_gpu)

The code is now correct, although it is still not very useful. We are definitely using shared memory, and we are using it the correct way, but we achieved no performance gain by doing so. In practice, we are making our code slower, not faster, because shared memory is slower than registers.

Let us, therefore, work on an example where using shared memory is actually useful. We start again with some Python code.

def histogram(input_array, output_array):
    for item in input_array:
        output_array[item] = output_array[item] + 1
    return output_array

The histogram function, as the name suggests, computes the histogram of an array of integers, i.e. counts how many instances of each integer are in input_array, and writes the count in output_array. We can now generate some data and run the code.

input_array = numpy.random.randint(256, size=2048, dtype=numpy.int32)
output_array = numpy.zeros(256, dtype=numpy.int32)
output_array = histogram(input_array, output_array)

Everything as expected. We can now write equivalent code in CUDA.

__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;

    output[input[item]] = output[input[item]] + 1;
}

Challenge: error in the histogram

If you look at the CUDA histogram code, there is a logical error that prevents the code from producing the right result. Can you spot it?

Solution

The GPU is a highly parallel device, executing multiple threads at the same time. In the previous code different threads could be updating the same output item at the same time, producing wrong results.

To solve this problem, we need to use a function from the CUDA library named atomicAdd. This function ensures that the increment of output_array happens in an atomic way, so that there are no conflicts in case multiple threads want to update the same item at the same time.

__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;

    atomicAdd(&(output[input[item]]), 1);
}

And the full Python code snippet.

import math
import numpy
import cupy

# input size
size = 2048

# allocate memory on CPU and GPU
input_gpu = cupy.random.randint(256, size=size, dtype=cupy.int32)
input_cpu = cupy.asnumpy(input_gpu)
output_gpu = cupy.zeros(256, dtype=cupy.int32)
output_cpu = cupy.asnumpy(output_gpu)

# CUDA code
histogram_cuda_code = r'''
extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;

    atomicAdd(&(output[input[item]]), 1);
}
'''

# compile and setup CUDA code
histogram_gpu = cupy.RawKernel(histogram_cuda_code, "histogram")
threads_per_block = 256
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)

# execute code on CPU and GPU
histogram_gpu(grid_size, block_size, (input_gpu, output_gpu))
histogram(input_cpu, output_cpu)

# compare results
numpy.allclose(output_cpu, output_gpu)

The CUDA code is now correct, and computes the same result as the Python code. However, we are accumulating the results directly in global memory, and the more conflicts we have in global memory, the lower the performance of our histogram will be. Moreover, the access pattern to the output array is very irregular, since it depends on the content of the input array. The best performance is obtained on the GPU when consecutive threads access consecutive addresses in memory, and this is not the case in our code.
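To illustrate the difference, the following sketch, which is not part of the lesson code, contrasts a kernel with a regular, coalesced access pattern with one whose write addresses depend on the data itself, as in our histogram.

extern "C" {
// Coalesced: thread i reads and writes element i, so consecutive threads
// touch consecutive memory addresses.
__global__ void copy_coalesced(const int * input, int * output, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    if ( item < size )
    {
        output[item] = input[item];
    }
}

// Data-dependent: the write address depends on the input values, so
// consecutive threads may touch addresses that are far apart.
__global__ void scatter_irregular(const int * input, int * output, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    if ( item < size )
    {
        atomicAdd(&(output[input[item]]), 1);
    }
}
}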

As you may expect, we can improve performance by using shared memory.

Challenge: use shared memory to speed up the histogram

Implement a new version of the histogram function that uses shared memory.

Hint: try to reduce conflicts, and improve the memory access pattern.

Hint: for this exercise, assume that the size of output is the same as the number of threads in a block.

Solution

__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    extern __shared__ int temp_histogram[];

    atomicAdd(&(temp_histogram[input[item]]), 1);
    atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}

Thread Synchronization

There is still one potentially big issue in the histogram code we just wrote: shared memory is not coherent without explicit synchronization. The problem lies in the following two lines of code:

atomicAdd(&(temp_histogram[input[item]]), 1);
atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);

In the first line each thread updates one arbitrary position in shared memory, depending on the value of the input, while in the second line each thread reads the element in shared memory corresponding to its thread ID. However, the changes to shared memory are not automatically available to all other threads, and therefore the final result may not be correct.

To solve this issue, we need to explicitly synchronize all threads in a block, so that memory operations are also finalized and visible to all. To synchronize threads in a block, we use the __syncthreads() CUDA function. Moreover, shared memory is not initialized, and the programmer needs to take care of that too.

__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    extern __shared__ int temp_histogram[];
 
    // Initialize shared memory and synchronize
    temp_histogram[threadIdx.x] = 0;
    __syncthreads();

    // Compute shared memory histogram and synchronize
    atomicAdd(&(temp_histogram[input[item]]), 1);
    __syncthreads();

    // Update global histogram
    atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}

And the full Python code snippet.

import math
import numpy
import cupy

# input size
size = 2048

# allocate memory on CPU and GPU
input_gpu = cupy.random.randint(256, size=size, dtype=cupy.int32)
input_cpu = cupy.asnumpy(input_gpu)
output_gpu = cupy.zeros(256, dtype=cupy.int32)
output_cpu = cupy.asnumpy(output_gpu)

# CUDA code
histogram_cuda_code = r'''
extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    extern __shared__ int temp_histogram[];
 
    // Initialize shared memory and synchronize
    temp_histogram[threadIdx.x] = 0;
    __syncthreads();

    // Compute shared memory histogram and synchronize
    atomicAdd(&(temp_histogram[input[item]]), 1);
    __syncthreads();

    // Update global histogram
    atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}
'''

# compile and setup CUDA code
histogram_gpu = cupy.RawKernel(histogram_cuda_code, "histogram")
threads_per_block = 256
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)

# execute code on CPU and GPU
histogram_gpu(grid_size, block_size, (input_gpu, output_gpu), shared_mem=(threads_per_block * cupy.dtype(cupy.int32).itemsize))
histogram(input_cpu, output_cpu)

# compare results
numpy.allclose(output_cpu, output_gpu)

Key Points


Constant Memory

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is constant memory, and when should it be used?

Objectives
  • Understanding when to use constant memory

Constant Memory

Constant memory is a read-only cache whose content can be broadcast to multiple threads in a block. A variable allocated in constant memory needs to be declared in CUDA using the special __constant__ identifier, and it must be a global variable, i.e. it must be declared in the scope that contains the kernel, not inside the kernel itself. If all of this sounds complex do not worry, we are going to see how this works with an example.

extern "C" {
#define BLOCKS 2

__constant__ float factors[BLOCKS];

__global__ void sum_and_multiply(const float * A, const float * B, float * C, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    C[item] = (A[item] + B[item]) * factors[blockIdx.x];
}
}

In the previous code snippet we implemented a kernel that, given two vectors A and B, stores their element-wise sum in a third vector, C, scaled by a factor that is the same for all threads in a thread block. Because these factors are shared, i.e. all threads in the same thread block use the same factor to scale their sums, it is a good idea to store the factors array in constant memory. In fact, you can see that the definition of factors is preceded by the __constant__ keyword, and that it appears in the global scope. It is important to note that the size of a constant array needs to be known at compile time, hence the use of the #define preprocessor statement. On the kernel side there is nothing else to do: the factors vector can be accessed inside the code like any other vector, and because it is a global variable it does not need to be passed to the kernel as a function argument.
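For contrast, here is a sketch, not part of the lesson, of how the same kernel could be written without constant memory, passing the factors as an ordinary global-memory argument; the constant-memory version avoids this extra kernel parameter and benefits from the constant cache when all threads in a block read the same element.

extern "C"
__global__ void sum_and_multiply_plain(const float * A, const float * B, float * C,
                                       const float * factors, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    C[item] = (A[item] + B[item]) * factors[blockIdx.x];
}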

The initialization of constant memory happens on the host side, and we show how this is done in the next code snippet.

# compile the code
module = cupy.RawModule(code=cuda_code)
# allocate and copy constant memory
factors_ptr = module.get_global("factors")
factors_gpu = cupy.ndarray(2, cupy.float32, factors_ptr)
factors_gpu[...] = cupy.random.random(2, dtype=cupy.float32)

From the previous code it is clear that dealing with constant memory is a slightly more verbose affair than usual. First, we need to compile the code, which in this case is contained in a Python string named cuda_code. This is necessary because constant memory is defined in the CUDA code, so we need CUDA to allocate the necessary memory and then provide us with a pointer to it. By calling the method get_global we ask the CUDA subsystem to provide us with the location of a global object, in this case the array factors. We can then create our own CuPy array that points to the object returned by get_global, so that we can use it in Python as we normally would. Note that we use the constant 2 for the size of the array, the same number used in the CUDA code; it is important to use the same number, or we may end up accessing memory outside the bounds of the array. Lastly, we initialize the array with some random floating point numbers.
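As an aside, if instead of random values we wanted to upload specific factors computed on the host, we could assign any array of matching shape and data type to the CuPy view; the following sketch assumes the factors_gpu view from above, and host_factors is a hypothetical host array introduced here for illustration.

import numpy

# hypothetical host-side factors to upload into the constant memory symbol
host_factors = numpy.array([0.5, 2.0], dtype=numpy.float32)
factors_gpu[...] = cupy.asarray(host_factors)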

Challenge: print the content of constant memory

What should be the output of the following line of code?

print(factors_gpu)

Solution

In our case the output of this line of code is two floating point numbers, e.g. [0.11390183 0.2585096 ]. However, we are not directly reading the GPU’s constant memory from the host: in order to print the array, CuPy copies its content from the device to the host.

We can now combine all the code together and execute it.

import cupy

# size of the vectors
size = 2048

# allocating and populating the vectors
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)
# prepare arguments
args = (a_gpu, b_gpu, c_gpu, size)

# CUDA code
cuda_code = r'''
extern "C" {
#define BLOCKS 2

__constant__ float factors[BLOCKS];

__global__ void sum_and_multiply(const float * A, const float * B, float * C, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    C[item] = (A[item] + B[item]) * factors[blockIdx.x];
}
}
'''

# compile and access the code
module = cupy.RawModule(code=cuda_code)
sum_and_multiply = module.get_function("sum_and_multiply")
# allocate and copy constant memory
factors_ptr = module.get_global("factors")
factors_gpu = cupy.ndarray(2, cupy.float32, factors_ptr)
factors_gpu[...] = cupy.random.random(2, dtype=cupy.float32)

sum_and_multiply((2, 1, 1), (size // 2, 1, 1), args)

As you can see, the code is not very general: it uses hardcoded constants and works only with two blocks. Nevertheless, it is a working example of how to use constant memory.
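As a quick sanity check, which is not part of the lesson code, we can compare the GPU result with a host-side reference. The snippet below continues from the code above and assumes that each of the two blocks processes one contiguous half of the vectors, which is how the kernel was launched.

import numpy

# host-side reference: each block of size // 2 consecutive items uses its own factor
factors_cpu = cupy.asnumpy(factors_gpu)
c_ref = (cupy.asnumpy(a_gpu) + cupy.asnumpy(b_gpu)) * numpy.repeat(factors_cpu, size // 2)
print(numpy.allclose(c_ref, cupy.asnumpy(c_gpu)))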

Challenge: generalize the previous code

Have a look again at the code using constant memory, and make it general enough to be able to run on input of arbitrary size. Experiment with some different input sizes.

Solution

One of the possible solutions is the following one.

import math
import cupy

# size of the vectors
size = 10**6

# allocating and populating the vectors
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)
# prepare arguments
args = (a_gpu, b_gpu, c_gpu, size)

# CUDA code
cuda_code = r'''
extern "C" {
__constant__ float factors[BLOCKS];

__global__ void sum_and_multiply(const float * A, const float * B, float * C, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    if ( item < size )
    {
        C[item] = (A[item] + B[item]) * factors[blockIdx.x];
    }
}
}
'''

# compute the number of blocks and replace "BLOCKS" in the CUDA code
threads_per_block = 1024
num_blocks = int(math.ceil(size / threads_per_block))
cuda_code = cuda_code.replace("BLOCKS", f"{num_blocks}") 

# compile and access the code
module = cupy.RawModule(code=cuda_code)
sum_and_multiply = module.get_function("sum_and_multiply")
# allocate and copy constant memory
factors_ptr = module.get_global("factors")
factors_gpu = cupy.ndarray(num_blocks, cupy.float32, factors_ptr)
factors_gpu[...] = cupy.random.random(num_blocks, dtype=cupy.float32)

sum_and_multiply((num_blocks, 1, 1), (threads_per_block, 1, 1), args)
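One caveat worth adding, not mentioned in the lesson: constant memory is a limited resource, typically 64 KB on current CUDA devices, so for very large inputs it is a good idea to check that the factors array still fits. The following sketch continues from the solution above.

# constant memory is typically limited to 64 KB on CUDA devices
const_bytes = num_blocks * cupy.dtype(cupy.float32).itemsize
assert const_bytes <= 64 * 1024, "factors does not fit in constant memory"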

Key Points