Content from Introduction


Last updated on 2024-03-12 | Edit this page

Overview

Questions

  • “What is a Graphics Processing Unit?”
  • “Can a GPU be used for anything else than graphics?”
  • “Are GPUs faster than CPUs?”

Objectives

  • “Understand the differences between CPU and GPU”
  • “See the possible performance benefits of GPU acceleration”

Graphics Processing Unit

The Graphics Processing Unit (GPU) is one of the components of a computer’s video card, together with specialized memory and different Input/Output (I/O) units. In the context of the video card, the GPU fulfills a role similar to the one that the Central Processing Unit (CPU) has in a general purpose computing system: it processes input data to generate some kind of output. In the traditional context of video cards, GPUs process data in order to render images on an output device, such as a screen. However, modern GPUs are general purpose computing devices that can be used to perform any kind of computation, and this is what we are going to do in this lesson.

Parallel by Design

But what is the reason to use GPUs to perform general purpose computation, when computers already have fast CPUs that are able to perform any kind of computation? One way to answer this question is to go back to the roots of what a GPU is designed to do: render images.

An image can be seen as a matrix of points called pixels (a portmanteau of the words picture and element), with each pixel representing the color the image should have in that particular point, and the traditional task performed by video cards is to produce the images a user will see on the screen. A single 4K UHD image contains more than 8 million pixels. For a GPU to render a continuous stream of 25 4K frames (images) per second, enough so that users not experience delay in a videogame, movie, or any other video output, it must process over 200 million pixels per second. So GPUs are designed to render multiple pixels at the same time, and they are designed to do it efficiently. This design principle results in the GPU being, from a hardware point of view, very different from a CPU.

The CPU is a very general purpose device, good at different tasks, being them parallel or sequential in nature; it is also designed for interaction with the user, so it has to be responsive and guarantee minimal latency. In practice, we want our CPU to be available whenever we sent it a new task. The result is a device where most of the silicon is used for memory caches and control-flow logic, not just compute units.

By contrast, most of the silicon on a GPU is actually used for compute units. The GPU does not need an overly complicated cache hierarchy, nor does it need complex control logic, because the overall goal is not to minimize the latency of any given thread, but to maximize the throughput of the whole computation. With many compute units available, the GPU can run massively parallel programs, programs in which thousands of threads are executed at the same time, while thousands more are ready for execution to hide the cost of memory operations.

A high-level introduction on the differences between CPU and GPU can also be found in the following YouTube video.

Screenshot of the YoutTube video showing a GPU

Speed Benefits

So, GPUs are massively parallel devices that can execute thousands of threads at the same time. But what does it mean in practice for the user? Why anyone would need to use a GPU to compute something that can be easily computed on a CPU? We can try to answer this question with an example.

Suppose we want to sort a large array in Python. Using NumPy, we first need to create an array of random single precision floating point numbers.

PYTHON

import numpy as np
size = 4096 * 4096
input = np.random.random(size).astype(np.float32)

We then time the execution of the NumPy sort() function, to see how long sorting this array takes on the CPU.

PYTHON

%timeit -n 1 -r 1 output = np.sort(input)

While the timing of this operation will differ depending on the system on which you run the code, these are the results for one experiment running on a Jupyter notebook on Google Colab.

OUTPUT

1.83 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

We now perform the same sorting operation, but this time we will be using CuPy to execute the sort() function on the GPU. CuPy is an open-source library, compatible with NumPy, for GPU computing in Python.

PYTHON

from cupyx.profiler import benchmark
import cupy as cp
input_gpu = cp.asarray(input)
execution_gpu = benchmark(cp.sort, (input_gpu,), n_repeat=10)
gpu_avg_time = np.average(execution_gpu.gpu_times)
print(f"{gpu_avg_time:.6f} s")

We also report the output, obtained on the same notebook on Google Colab; as always note that your result will vary based on the environment and GPU you are using.

OUTPUT

0.008949 s

Notice that the first thing we need to do, is to copy the input data to the GPU. The distinction between data on the GPU and that on the host is a very important one that we will get back to later.

Host vs. Device

From now on we may also call the GPU the device, while we refer to other computations as taking place on the host. We’ll also talk about host memory and device memory, but much more on memory in later episodes!

Sorting an array using CuPy, and therefore using the GPU, is clearly much faster than using NumPy, but can we quantify how much faster? Having recorded the average execution time of both operations, we can then compute the speedup of using CuPy over NumPy. The speedup is defined as the ratio between the sequential (NumPy in our case) and parallel (CuPy in our case) execution times; beware that both execution times need to be in the same unit, this is why we had to convert the GPU execution time from milliseconds to seconds.

PYTHON

speedup = 1.83 / 0.008949
print(speedup)

With the result of the previous operation being the following.

OUTPUT

204.49212202480723

We can therefore say that just by using the GPU with CuPy to sort an array of size 4096 * 4096 we achieved a performance improvement of 204 times.

Key Points

  • “CPUs and GPUs are both useful and each has its own place in our toolbox”
  • “In the context of GPU programming, we often refer to the GPU as the device and the CPU as the host
  • “Using GPUs to accelerate computation can provide large performance gains”
  • “Using the GPU with Python is not particularly difficult”

Content from Using your GPU with CuPy


Last updated on 2024-03-12 | Edit this page

Overview

Questions

  • “How can I increase the performance of code that uses NumPy?”
  • “How can I copy NumPy arrays to the GPU?”

Objectives

  • “Be able to indicate if an array, represented by a variable in an iPython shell, is stored in host or device memory.”
  • “Be able to copy the contents of this array from host to device memory and vice versa.”
  • “Be able to select the appropriate function to either convolve an image using either CPU or GPU compute power.”
  • “Be able to quickly estimate the speed benefits for a simple calculation by moving it from the CPU to the GPU.”

Introduction to CuPy

CuPy is a GPU array library that implements a subset of the NumPy and SciPy interfaces. This makes it a very convenient tool to use the compute power of GPUs for people that have some experience with NumPy, without the need to write code in a GPU programming language such as CUDA, OpenCL, or HIP.

Convolution in Python

We start by generating an artificial “image” on the host using Python and NumPy; the host is the CPU on the laptop, desktop, or cluster node you are using right now, and from now on we may use host to refer to the CPU and device to refer to the GPU. The image will be all zeros, except for isolated pixels with value one, on a regular grid. The plan is to convolve it with a Gaussian and inspect the result. We will also record the time it takes to execute this convolution on the host.

We can interactively write and executed the code in an iPython shell or a Jupyter notebook.

PYTHON

import numpy as np

# Construct an image with repeated delta functions
deltas = np.zeros((2048, 2048))
deltas[8::16,8::16] = 1

To get a feeling of how the whole image looks like, we can display the top-left corner of it.

PYTHON

import pylab as pyl
# Necessary command to render a matplotlib image in a Jupyter notebook.
%matplotlib inline

# Display the image
# You can zoom in using the menu in the window that will appear
pyl.imshow(deltas[0:32, 0:32])
pyl.show()

After executing the code, you should see the following image.

Deltas array
Deltas array

Background

The computation we want to perform on this image is a convolution, once on the host and once on the device so we can compare the results and execution times. In computer vision applications, convolutions are often used to filter images and if you want to know more about them, we encourage you to check out this github repository by Vincent Dumoulin and Francesco Visin with some great animations. We have already seen that we can think of an image as a matrix of color values, when we convolve that image with a particular filter, we generate a new matrix with different color values. An example of convolution can be seen in the figure below (illustration by Michael Plotke, CC BY-SA 3.0, via Wikimedia Commons).

Example of animated convolution
Example of animated convolution

In our example, we will convolve our image with a 2D Gaussian function shown below:

\[G(x,y) = \frac{1}{2\pi \sigma^2} \exp\left(-\frac{x^2 + y^2}{2 \sigma^2}\right),\]

where \(x\) and \(y\) are the “coordinates in our matrix, i.e. our row and columns, and \[\sigma\] controls the width of the Gaussian distribution. Convolving images with 2D Gaussian functions will change the value of each pixel to be a weighted average of the pixels around it, thereby”smoothing” the image. Convolving images with a Gaussian function reduces the noise in the image, which is often required in edge-detection since most algorithms to do this are sensitive to noise.

Callout

It is often useful to identify the dataflow inherent in a problem. Say, if we want to square a list of numbers, all the operations are independent. The dataflow of a one-to-one operation is called a map.

Data flow of a map operation
Data flow of a map operation

A convolution is slightly more complicated. Here we have a many-to-one data flow, which is also known as a stencil.

Data flow of a stencil operation
Data flow of a stencil operation

GPU’s are exceptionally well suited to compute algorithms that follow one of these patterns.

Convolution on the CPU Using SciPy

Let us first construct the Gaussian, and then display it. Remember that at this point we are still doing everything with standard Python, and not using the GPU yet.

PYTHON

x, y = np.meshgrid(np.linspace(-2, 2, 15), np.linspace(-2, 2, 15))
dst = np.sqrt(x*x + y*y)
sigma = 1
muu = 0.000
gauss = np.exp(-((dst-muu)**2/(2.0 * sigma**2)))
pyl.imshow(gauss)
pyl.show()

This should show you a symmetrical two-dimensional Gaussian.

Two-dimensional Gaussian
Two-dimensional Gaussian

Now we are ready to do the convolution on the host. We do not have to write this convolution function ourselves, as it is very conveniently provided by SciPy. Let us also record the time it takes to perform this convolution and inspect the top left corner of the convolved image.

PYTHON

from scipy.signal import convolve2d as convolve2d_cpu

convolved_image_using_CPU = convolve2d_cpu(deltas, gauss)
pyl.imshow(convolved_image_using_CPU[0:32, 0:32])
pyl.show()
%timeit -n 1 -r 1 convolve2d_cpu(deltas, gauss)

Obviously, the time to perform this convolution will depend very much on the power of your CPU, but I expect you to find that it takes a couple of seconds.

OUTPUT

2.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

When you display the corner of the image, you can see that the “ones” surrounded by zeros have actually been blurred by a Gaussian, so we end up with a regular grid of Gaussians.

Regular grid of Gaussians
Regular grid of Gaussians

Convolution on the GPU Using CuPy

This is part of a lesson on GPU programming, so let us use the GPU. Although there is a physical connection - i.e. a cable - between the CPU and the GPU, they do not share the same memory space. This image depicts the different components of CPU and GPU and how they are connected:

CPU and GPU are two separate entities, each with its own memory
CPU and GPU are two separate entities, each with its own memory

This means that an array created from e.g. an iPython shell using NumPy is physically located into the main memory of the host, and therefore available for the CPU but not the GPU. It is not yet present in GPU memory, which means that we need to copy our data, the input image and the convolving function to the GPU, before we can execute any code on it. In practice, we have the arrays deltas and gauss in the host’s RAM, and we need to copy them to GPU memory using CuPy.

PYTHON

import cupy as cp

deltas_gpu = cp.asarray(deltas)
gauss_gpu = cp.asarray(gauss)

Now it is time to do the convolution on the GPU. SciPy does not offer functions that can use the GPU, so we need to import the convolution function from another library, called cupyx; cupyx.scipy contains a subset of all SciPy routines. You will see that the GPU convolution function from the cupyx library looks very much like the convolution function from SciPy we used previously. In general, NumPy and CuPy look very similar, as well as the SciPy and cupyx libraries, and this is on purpose to facilitate the use of the GPU by programmers that are already familiar with NumPy and SciPy. Let us again record the time to execute the convolution, so that we can compare it with the time it took on the host.

PYTHON

from cupyx.scipy.signal import convolve2d as convolve2d_gpu

convolved_image_using_GPU = convolve2d_gpu(deltas_gpu, gauss_gpu)
%timeit -n 7 -r 1 convolved_image_using_GPU = convolve2d_gpu(deltas_gpu, gauss_gpu)

Similar to what we had previously on the host, the execution time of the GPU convolution will depend very much on the GPU used. These are the results using a NVIDIA Tesla T4 on Google Colab.

OUTPUT

98.2 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 7 loops each)

This is a lot faster than on the host, a performance improvement, or speedup, of 24439 times. Impressive, but is it true?

Measuring performance

So far we used timeit to measure the performance of our Python code, no matter if it was running on the CPU or was GPU accelerated. However, execution on the GPU is asynchronous: the control is given back to the Python interpreter immediately, while the GPU is still executing the task. Because of this, we cannot use timeit anymore: the timing would not be correct.

CuPy provides a function, benchmark that we can use to measure the time it takes the GPU to execute our kernels.

PYTHON

from cupyx.profiler import benchmark

execution_gpu = benchmark(convolve2d_gpu, (deltas_gpu, gauss_gpu), n_repeat=10)

The previous code executes convolve2d_gpu ten times, and stores the execution time of each run, in seconds, inside the gpu_times attribute of the variable execution_gpu. We can then compute the average execution time and print it, as shown.

PYTHON

gpu_avg_time = np.average(execution_gpu.gpu_times)
print(f"{gpu_avg_time:.6f} s")

Another advantage of the benchmark method is that it excludes the compile time, and warms up the GPU, so that measurements are more stable and representative.

With the new measuring code in place, we can measure performance again.

OUTPUT

0.020642 s

We now have a more reasonable, but still impressive, speedup of 116 times over the host code.

Challenge: convolution on the GPU without CuPy

Try to convolve the NumPy array deltas with the NumPy array gauss directly on the GPU, without using CuPy arrays. If this works, it should save us the time and effort of transferring deltas and gauss to the GPU.

We can directly try to use the GPU convolution function convolve2d_gpu with deltas and gauss as inputs.

PYTHON

convolve2d_gpu(deltas, gauss)

However, this gives a long error message ending with:

OUTPUT

TypeError: Unsupported type <class 'numpy.ndarray'>

It is unfortunately not possible to access NumPy arrays directly from the GPU because they exist in the Random Access Memory (RAM) of the host and not in GPU memory.

Validation

To check that we actually computed the same output on the host and the device we can compare the two output arrays convolved_image_using_GPU and convolved_image_using_CPU.

PYTHON

np.allclose(convolved_image_using_GPU, convolved_image_using_CPU)

As you may expect, the result of the comparison is positive, and in fact we computed the same results on the host and the device.

OUTPUT

array(True)

Challenge: fairer comparison of CPU vs. GPU

Compute again the speedup achieved using the GPU, but try to take also into account the time spent transferring the data to the GPU and back.

Hint: to copy a CuPy array back to the host (CPU), use the cp.asnumpy() function.

A convenient solution is to group both the transfers, to and from the GPU, and the convolution into a single Python function, and then time its execution, like in the following example.

PYTHON

def transfer_compute_transferback():
    deltas_gpu = cp.asarray(deltas)
    gauss_gpu = cp.asarray(gauss)
    convolved_image_using_GPU = convolve2d_gpu(deltas_gpu, gauss_gpu)
    convolved_image_using_GPU_copied_to_host = cp.asnumpy(convolved_image_using_GPU)
   
execution_gpu = benchmark(transfer_compute_transferback, (), n_repeat=10)
gpu_avg_time = np.average(execution_gpu.gpu_times)
print(f"{gpu_avg_time:.6f} s")

OUTPUT

0.035400 s

The speedup taking into account the data transfers decreased from 116 to 67. Taking into account the necessary data transfers when computing the speedup is a better, and more fair, way to compare performance. As a note, because data transfers force the GPU to sync with the host, this could also be measured with timeit and still provide correct measurements.

A shortcut: performing NumPy routines on the GPU

We saw earlier that we cannot execute routines from the cupyx library directly on NumPy arrays. In fact we need to first transfer the data from host to device memory. Vice versa, if we try to execute a regular SciPy routine (i.e. designed to run the CPU) on a CuPy array, we will also encounter an error. Try the following:

PYTHON

convolve2d_cpu(deltas_gpu, gauss_gpu)

This results in

OUTPUT

......
......
......
TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

So SciPy routines cannot have CuPy arrays as input. We can, however, execute a simpler command that does not require SciPy. Instead of 2D convolution, we can do 1D convolution. For that we can use a NumPy routine instead of a SciPy routine. The convolve routine from NumPy performs linear (1D) convolution. To generate some input for a linear convolution, we can flatten our image from 2D to 1D (using ravel()), but we also need a 1D kernel. For the latter we will take the diagonal elements of our 2D Gaussian kernel. Try the following three instructions for linear convolution on the CPU:

PYTHON

deltas_1d = deltas.ravel()
gauss_1d = gauss.diagonal()
%timeit -n 1 -r 1 np.convolve(deltas_1d, gauss_1d)

You could arrive at something similar to this timing result:

OUTPUT

270 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

We have performed a regular linear convolution using our CPU. Now let us try something bold. We will transfer the 1D arrays to the GPU and use the NumPy routine to do the convolution.

PYTHON

deltas_1d_gpu = cp.asarray(deltas_1d)
gauss_1d_gpu = cp.asarray(gauss_1d)

execution_gpu = benchmark(np.convolve, (deltas_1d_gpu, gauss_1d_gpu), n_repeat=10)
gpu_avg_time = np.average(execution_gpu.gpu_times)
print(f"{gpu_avg_time:.6f} s")

You may be surprised that we can issue these commands without error. Contrary to SciPy routines, NumPy accepts CuPy arrays, i.e. arrays that exist in GPU memory, as input. In the CuPy documentation you can find some background on why NumPy routines can handle CuPy arrays.

Also, remember when we used np.allclose with a NumPy and a CuPy array as input? That worked for the same reason.

The linear convolution is actually performed on the GPU, which also results in a lower execution time.

OUTPUT

0.014529 s

Without much effort, we obtained a 18 times speedup.

A real world example: image processing for radio astronomy

In this section, we will perform the four major steps in image processing for radio astronomy: determination of background characteristics, segmentation, connected component labelling and source measurements.

Import the FITS image

Start by importing a 2048² pixels image of the Galactic Center, an image made from observations by the Indian Giant Metrewave Radio Telescope (GMRT) at 150 MHz. The image is stored in this repository as a FITS file, and to read it we need the astropy Python package.

PYTHON

from astropy.io import fits

with fits.open("GMRT_image_of_Galactic_Center.fits") as hdul:
    data = hdul[0].data.byteswap().newbyteorder()

The latter two methods are needed to convert byte ordering from big endian to little endian.

Inspect the image

Let us have a look at part of this image.

PYTHON

from matplotlib.colors import LogNorm

maxim = data.max()

fig = pyl.figure(figsize=(50, 12.5))
ax = fig.add_subplot(1, 1, 1)
im_plot = ax.imshow(np.fliplr(data), cmap=pyl.cm.gray_r, norm=LogNorm(vmin = maxim/10, vmax=maxim/100))
pyl.colorbar(im_plot, ax=ax)
Image of the Galactic Center
Image of the Galactic Center

The data has been switched left to right because Right Ascension increases to the left, so now it adheres to this astronomical convention. We can see a few dozen sources, especially in the lower left and upper right, which both have relatively good image quality. The band across the image from upper left to lower right has the worst image quality because there are many extended sources - especially the Galactic Center itself, in the middle - which are hard to deconvolve with the limited spacings from this observation. You can, however, see a couple of ring-like structures. These are actually supernova shells, i.e. the remnants from massive stars that exploded at the end of their lifes.

Determine the background characteristics of the image

We want to identify all the sources - meaning e.g. stars, supernova remnants and distant galaxies - in this image and measure their positions and fluxes. How do we separate source pixels from background pixels? When do we know if a pixel with a high value belongs to a source or is simply a noise peak? We assume the background noise, which is a reflection of the limited sensitivity of the radio telescope, has a normal distribution. The chance of having a background pixel with a value above 5 times the standard deviation is 2.9e-7. We have 2²² = 4.2e6 pixels in our image, so the chance of catching at least one random noise peak by setting a threshold of 5 times the standard deviation is less than 50%. We refer to the standard deviation as \(\sigma\).

How do we measure the standard deviation of the background pixels? First we need to separate them from the source pixels, based on their values, in the sense that high pixel values more likely come from sources. The technique we use is called \(\kappa, \sigma\) clipping. First we take all pixels and compute the standard deviation (\(\sigma\)). Then we compute the median and clip all pixels larger than \(median + 3 * \sigma\) and smaller than \(median - 3 * \sigma\). From the clipped set, we compute the median and standard deviation again and clip again. Continue until no more pixels are clipped. The standard deviation from this final set of pixel values is the basis for the next step.

Before clipping, let us investigate some properties of our unclipped data.

PYTHON

mean_ = data.mean()
median_ = np.median(data)
stddev_ = np.std(data)
max_ = np.amax(data)
print(f"mean = {mean_:.3e}, median = {median_:.3e}, sttdev = {stddev_:.3e}, \
maximum = {max_:.3e}")

The maximum flux density is 2506 mJy/beam, coming from the Galactic Center itself, so from the center of the image, while the overall standard deviation is 19.9 mJy/beam:

OUTPUT

mean = 3.898e-04, median = 1.571e-05, sttdev = 1.993e-02, maximum = 2.506e+00

You might observe that \(\kappa, \sigma\) clipping is a compute intense task, that is why we want to do it on a GPU. But let’s first issue the algorithm on a CPU.

This is the NumPy code to do this:

PYTHON

# Flattening our 2D data first makes subsequent steps easier.
data_flat = data.ravel()
# Here is a kappa, sigma clipper for the CPU
def kappa_sigma_clipper(data_flat):
    while True:
         med = np.median(data_flat)
         std = np.std(data_flat)
         clipped_lower = data_flat.compress(data_flat > med - 3 * std)
         clipped_both = clipped_lower.compress(clipped_lower < med + 3 * std)
         if len(clipped_both) == len(data_flat):
             break
         data_flat = clipped_both  
    return data_flat

data_clipped = kappa_sigma_clipper(data_flat)
timing_ks_clipping_cpu = %timeit -o kappa_sigma_clipper(data_flat)
fastest_ks_clipping_cpu = timing_ks_clipping_cpu.best
print(f"Fastest CPU ks clipping time = \
       {1000 * fastest_ks_clipping_cpu:.3e} ms.")

OUTPUT

793 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Fastest CPU ks clipping time = 7.777e+02 ms.

So that is close to 1 second to perform these computations. Hopefully, we can speed this up using the GPU. How has \(\kappa, \sigma\) clipping influenced our statistics?

PYTHON

clipped_mean_ = data_clipped.mean()
clipped_median_ = np.median(data_clipped)
clipped_stddev_ = np.std(data_clipped)
clipped_max_ = np.amax(data_clipped)
print(f"mean of clipped = {clipped_mean_:.3e}, median of clipped = \
{clipped_median_:.3e} \n standard deviation of clipped = \
{clipped_stddev_:.3e}, maximum of clipped = {clipped_max_:.3e}")

All output statistics have become smaller which is reassuring; it seems data_clipped contains mostly background pixels:

OUTPUT

mean of clipped = -1.945e-06, median of clipped = -9.796e-06 
 standard deviation of clipped = 1.334e-02, maximum of clipped = 4.000e-02

Challenge:\(\kappa, \sigma\)clipping on the GPU

Now that you understand how the \(\kappa, \sigma\) clipping algorithm works, perform it on the GPU using CuPy and compute the speedup. Include the data transfer to and from the GPU in your calculation.

PYTHON

def ks_clipper_gpu(data_flat):
    data_flat_gpu = cp.asarray(data_flat)
    data_gpu_clipped = kappa_sigma_clipper(data_flat_gpu)
    return cp.asnumpy(data_gpu_clipped)

data_clipped_on_GPU = ks_clipper_gpu(data_flat)
timing_ks_clipping_gpu = benchmark(ks_clipper_gpu, \
                                   (data_flat, ), n_repeat=10)
fastest_ks_clipping_gpu = np.amin(timing_ks_clipping_gpu.gpu_times)
print(f"{1000 * fastest_ks_clipping_gpu:.3e} ms")

OUTPUT

6.329e+01 ms

PYTHON

speedup_factor = fastest_ks_clipping_cpu/fastest_ks_clipping_gpu
print(f"The speedup factor for ks clipping is: {speedup_factor:.3e}")

OUTPUT

The speedup factor for ks clipping is: 1.232e+01

Segment the image

We have seen that clipping at the \(5 \sigma\) level of an image this size (2048² pixels) will yield a chance of less than 50% that from all the sources we detect at least one will be a noise peak. So let us set the threshold at \(5 \sigma\) and segment it. First check that we find the same standard deviation from our clipper on the GPU:

PYTHON

stddev_gpu_ = np.std(data_clipped_on_GPU)
print(f"standard deviation of background_noise = {stddev_gpu_:.4f} Jy/beam")

OUTPUT

standard deviation of background_noise = 0.0133 Jy/beam

With the standard deviation computed we apply the \(5 \sigma\) threshold to the image.

PYTHON

threshold = 5 * stddev_gpu_
segmented_image = np.where(data > threshold, 1,  0)
timing_segmentation_CPU = %timeit -o np.where(data > threshold, 1,  0)
fastest_segmentation_CPU = timing_segmentation_CPU.best 
print(f"Fastest CPU segmentation time = {1000 * fastest_segmentation_CPU:.3e} \
ms.")

OUTPUT

6.41 ms ± 55.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Fastest CPU segmentation time = 6.294e+00 ms.

Labelling of the segmented data

This is called connected component labelling (CCL). It will replace pixel values in the segmented image - just consisting of zeros and ones - of the first connected group of pixels with the value 1 - so nothing changed, but just for that first group - the pixel values in the second group of connected pixels will all be 2, the third connected group of pixels will all have the value 3 etc.

This is a CPU code for connected component labelling.

PYTHON

from scipy.ndimage import label as label_cpu
labelled_image = np.empty(data.shape)
number_of_sources_in_image = label_cpu(segmented_image, output = labelled_image)
sigma_unicode = "\u03C3"
print(f"The number of sources in the image at the 5{sigma_unicode} level is \
{number_of_sources_in_image}.")

timing_CCL_CPU = %timeit -o label_cpu(segmented_image, output = labelled_image)
fastest_CCL_CPU = timing_CCL_CPU.best
print(f"Fastest CPU CCL time = {1000 * fastest_CCL_CPU:.3e} ms.")

This gives, on my machine:

OUTPUT

The number of sources in the image at the 5σ level is 185.
26.3 ms ± 965 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Fastest CPU CCL time = 2.546e+01 ms.

Let us not just accept the answer, but also do a sanity check. What are the values in the labelled image?

PYTHON

print(f"These are all the pixel values we can find in the labelled image: \
{np.unique(labelled_image)}")

This should show the following output:

OUTPUT

These are all the pixel values we can find in the labelled image: [  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
  14.  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.
  28.  29.  30.  31.  32.  33.  34.  35.  36.  37.  38.  39.  40.  41.
  42.  43.  44.  45.  46.  47.  48.  49.  50.  51.  52.  53.  54.  55.
  56.  57.  58.  59.  60.  61.  62.  63.  64.  65.  66.  67.  68.  69.
  70.  71.  72.  73.  74.  75.  76.  77.  78.  79.  80.  81.  82.  83.
  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.  94.  95.  96.  97.
  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.
 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
 126. 127. 128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139.
 140. 141. 142. 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153.
 154. 155. 156. 157. 158. 159. 160. 161. 162. 163. 164. 165. 166. 167.
 168. 169. 170. 171. 172. 173. 174. 175. 176. 177. 178. 179. 180. 181.
 182. 183. 184. 185.]

Source measurements

We are ready for the final step. We have been given observing time to make this beautiful image of the Galactic Center, we have determined its background statistics, we have separated actual cosmic sources from noise and now we want to measure these cosmic sources. What are their positions and what are their flux densities?

Again, the algorithms from scipy.ndimage help us to determine these quantities. This is the CPU code for measuring our sources.

PYTHON

from scipy.ndimage import center_of_mass as com_cpu
from scipy.ndimage import sum_labels as sl_cpu
all_positions = com_cpu(data, labelled_image, \
                               range(1, number_of_sources_in_image+1))
all_integrated_fluxes = sl_cpu(data, labelled_image, \
                               range(1, number_of_sources_in_image+1))

print (f'These are the ten highest integrated fluxes of the sources in my \n\
image: {np.sort(all_integrated_fluxes)[-10:]}')

which gives the Galactic Center as the most luminous source, which makes sense when we look at our image.

OUTPUT

These are the ten highest integrated fluxes of the sources in my image: [ 38.90615184  41.91485894  43.02203498  47.30590784  51.23707351
  58.07289425  68.85673917  70.31223921  95.16443585 363.58937774]

Now we can try to measure the execution times for both algorithms, like this:

PYTHON

%%timeit -o
all_positions = com_cpu(data, labelled_image, \
                               range(1, number_of_sources_in_image+1))
all_integrated_fluxes = sl_cpu(data, labelled_image, \
                               range(1, number_of_sources_in_image+1))

Which yields, on my machine:

OUTPUT

797 ms ± 9.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

<TimeitResult : 797 ms ± 9.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)>

To collect the result from that timing in our next cell block, we need a trick that uses the _ variable.

PYTHON

timing_source_measurements_CPU = _
fastest_source_measurements_CPU = timing_source_measurements_CPU.best
print(f"Fastest CPU set of source measurements = \
{1000 * fastest_source_measurements_CPU:.3e} ms.")

Which yields

OUTPUT

Fastest CPU set of source measurements = 7.838e+02 ms.

Challenge: putting it all together

Combine the first two steps of image processing for astronomy, i.e. determining background characteristics e.g. through \(\kappa, \sigma\) clipping and segmentation into a single function, that works for both CPU and GPU. Next, write a function for connected component labelling and source measurements on the GPU and calculate the overall speedup factor for the combined four steps of image processing in astronomy on the GPU relative to the CPU. Finally, verify your output by comparing with the previous output, using the CPU.

PYTHON

def first_two_steps_for_both_CPU_and_GPU(data):
    data_flat = data.ravel()
    data_clipped = kappa_sigma_clipper(data_flat)
    stddev_ = np.std(data_clipped)
    threshold = 5 * stddev_
    segmented_image = np.where(data > threshold, 1,  0)
    return segmented_image

def ccl_and_source_measurements_on_CPU(data_CPU, segmented_image_CPU):
    labelled_image_CPU = np.empty(data_CPU.shape)
    number_of_sources_in_image = label_cpu(segmented_image_CPU, 
                                       output= labelled_image_CPU)
    all_positions = com_cpu(data_CPU, labelled_image_CPU, 
                            np.arange(1, number_of_sources_in_image+1))
    all_fluxes = sl_cpu(data_CPU, labelled_image_CPU, 
                            np.arange(1, number_of_sources_in_image+1))
    return np.array(all_positions), np.array(all_fluxes)

CPU_output = ccl_and_source_measurements_on_CPU(data, \
                 first_two_steps_for_both_CPU_and_GPU(data))

timing_complete_processing_CPU =  \
    benchmark(ccl_and_source_measurements_on_CPU, (data, \
       first_two_steps_for_both_CPU_and_GPU(data)), \
       n_repeat=10)

fastest_complete_processing_CPU = \
    np.amin(timing_complete_processing_CPU.cpu_times)

print(f"The four steps of image processing for astronomy take \
{1000 * fastest_complete_processing_CPU:.3e} ms\n on our CPU.")

from cupyx.scipy.ndimage import label as label_gpu
from cupyx.scipy.ndimage import center_of_mass as com_gpu
from cupyx.scipy.ndimage import sum_labels as sl_gpu

def ccl_and_source_measurements_on_GPU(data_GPU, segmented_image_GPU):
    labelled_image_GPU = cp.empty(data_GPU.shape)
    number_of_sources_in_image = label_gpu(segmented_image_GPU, 
                                           output= labelled_image_GPU)
    all_positions = com_gpu(data_GPU, labelled_image_GPU, 
                            cp.arange(1, number_of_sources_in_image+1))
    all_fluxes = sl_gpu(data_GPU, labelled_image_GPU, 
                            cp.arange(1, number_of_sources_in_image+1))
    # This seems redundant, but we want to return ndarrays (Numpy)
    # and what we have are lists. These first have to be converted to
    # Cupy arrays before they can be converted to Numpy arrays.
    return cp.asnumpy(cp.asarray(all_positions)), \
           cp.asnumpy(cp.asarray(all_fluxes))

GPU_output = ccl_and_source_measurements_on_GPU(cp.asarray(data), \
                 first_two_steps_for_both_CPU_and_GPU(cp.asarray(data)))

timing_complete_processing_GPU =  \
    benchmark(ccl_and_source_measurements_on_GPU, (cp.asarray(data), \
       first_two_steps_for_both_CPU_and_GPU(cp.asarray(data))), \
       n_repeat=10)

fastest_complete_processing_GPU = \
    np.amin(timing_complete_processing_GPU.gpu_times)

print(f"The four steps of image processing for astronomy take \
{1000 * fastest_complete_processing_GPU:.3e} ms\n on our GPU.")

overall_speedup_factor = fastest_complete_processing_CPU/ \
                         fastest_complete_processing_GPU
print(f"This means that the overall speedup factor GPU vs CPU equals: \
{overall_speedup_factor:.3e}\n")

all_positions_agree = np.allclose(CPU_output[0], GPU_output[0])
print(f"The CPU and GPU positions agree: {all_positions_agree}\n")

all_fluxes_agree = np.allclose(CPU_output[1], GPU_output[1])
print(f"The CPU and GPU fluxes agree: {all_positions_agree}\n")

OUTPUT

The four steps of image processing for astronomy take 1.060e+03 ms 
 on our CPU.
The four steps of image processing for astronomy take 5.770e+01 ms 
 on our GPU.
This means that the overall speedup factor GPU vs CPU equals: 1.838e+01

The CPU and GPU positions agree: True

The CPU and GPU fluxes agree: True

Key Points

  • “CuPy provides GPU accelerated version of many NumPy and Scipy functions.”
  • “Always have CPU and GPU versions of your code so that you can compare performance, as well as validate your code.”

Content from Accelerate your Python code with Numba


Last updated on 2024-03-12 | Edit this page

Overview

Questions

  • “How can I run my own Python functions on the GPU?”

Objectives

  • “Learn how to use Numba decorators to improve the performance of your Python code.”
  • “Run your first application on the GPU.”

Using Numba to execute Python code on the GPU

Numba is a Python library that “translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library”. You might want to try it to speed up your code on a CPU. However, Numba can also translate a subset of the Python language into CUDA, which is what we will be using here. So the idea is that we can do what we are used to, i.e. write Python code and still benefit from the speed that GPUs offer us.

We want to compute all prime numbers - i.e. numbers that have only 1 or themselves as exact divisors - between 1 and 10000 on the CPU and see if we can speed it up, by deploying a similar algorithm on a GPU. This is code that you can find on many websites. Small variations are possible, but it will look something like this:

PYTHON

def find_all_primes_cpu(upper):
    all_prime_numbers = []
    for num in range(0, upper):
        prime = True
        for i in range(2, (num // 2) + 1):
            if (num % i) == 0:
                prime = False
                break
        if prime:
            all_prime_numbers.append(num)
    return all_prime_numbers

Calling find_all_primes_cpu(10_000) will return all prime numbers between 1 and 10000 as a list. Let us time it:

PYTHON

%timeit -n 10 -r 1 find_all_primes_cpu(10_000)

You will probably find that find_all_primes_cpu takes several hundreds of milliseconds to complete:

OUTPUT

176 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)

As a quick sidestep, add Numba’s JIT (Just in Time compilation) decorator to the find_all_primes_cpu function. You can either add it to the function definition or to the call, so either in this way:

PYTHON

from numba import jit

@jit(nopython=True)
def find_all_primes_cpu(upper):
    all_prime_numbers = []
    for num in range(0, upper):
        prime = True
        for i in range(2, (num // 2) + 1):
            if (num % i) == 0:
                prime = False
                break
        if prime:
            all_prime_numbers.append(num)
    return all_prime_numbers

%timeit -n 10 -r 1 find_all_primes_cpu(10_000)

or in this way:

PYTHON

from numba import jit

upper = 10_000
%timeit -n 10 -r 1 jit(nopython=True)(find_all_primes_cpu)(upper)

which can give you a timing result similar to this:

OUTPUT

69.5 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)

So twice as fast, by using a simple decorator. The speedup is much larger for upper = 100_000, but that takes a little too much waiting time for this course. Despite the jit(nopython=True) decorator the computation is still performed on the CPU. Let us move the computation to the GPU. There are a number of ways to achieve this, one of them is the usage of the jit(device=True) decorator, but it depends very much on the nature of the computation. Let us write our first GPU kernel which checks if a number is a prime, using the cuda.jit decorator, so different from the jit decorator for CPU computations. It is essentially the inner loop of find_all_primes_cpu:

PYTHON

from numba import cuda

@cuda.jit
def check_prime_gpu_kernel(num, result):
   result[0] =  num
   for i in range(2, (num // 2) + 1):
       if (num % i) == 0:
           result[0] = 0
           break

A number of things are worth noting. CUDA kernels do not return anything, so you have to supply for an array to be modified. All arguments have to be arrays, if you work with scalars, make them arrays of length one. This is the case here, because we check if a single number is a prime or not. Let us see if this works:

PYTHON

import numpy as np

result = np.zeros((1), np.int32)
check_prime_gpu_kernel[1, 1](11, result)
print(result[0])
check_prime_gpu_kernel[1, 1](12, result)
print(result[0])

If we have not made any mistake, the first call should return “11”, because 11 is a prime number, while the second call should return “0” because 12 is not a prime:

OUTPUT

11
0

Note the extra arguments in square brackets - [1, 1] - that are added to the call of check_prime_gpu_kernel: these indicate the number of threads we want to run on the GPU. While this is an important argument, we will explain it later and for now we can keep using 1.

Challenge: compute prime numbers

Write a new function find_all_primes_cpu_and_gpu that uses check_prime_gpu_kernel instead of the inner loop of find_all_primes_cpu. How long does this new function take to find all primes up to 10000?

One possible implementation of this function is the following one.

PYTHON

def find_all_primes_cpu_and_gpu(upper):
    all_prime_numbers = []
    for num in range(0, upper):
        result = np.zeros((1), np.int32)
        check_prime_gpu_kernel[1,1](num, result)
        if result[0] != 0:
            all_prime_numbers.append(num)
    return all_prime_numbers
   
%timeit -n 10 -r 1 find_all_primes_cpu_and_gpu(10_000)

OUTPUT

6.21 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)

As you may have noticed, find_all_primes_cpu_and_gpu is much slower than the original find_all_primes_cpu. The reason is that the overhead of calling the GPU, and transferring data to and from it, for each number of the sequence is too large. To be efficient the GPU needs enough work to keep all of its cores busy.

Let us give the GPU a work load large enough to compensate for the overhead of data transfers to and from the GPU. For this example of computing primes we can best use the vectorize decorator for a new check_prime_gpu function that takes an array as input instead of upper in order to increase the work load. This is the array we have to use as input for our new check_prime_gpu function, instead of upper, a single integer:

PYTHON

numbers = np.arange(0, 10_000, dtype=np.int32)

So that input to the new check_prime_gpu function is simply the array of numbers we need to check for primes. check_prime_gpu looks similar to check_prime_gpu_kernel, but it is not a kernel, so it can return values:

PYTHON

import numba as nb

@nb.vectorize(['int32(int32)'], target='cuda')
def check_prime_gpu(num):
    for i in range(2, (num // 2) + 1):
       if (num % i) == 0:
           return 0
    return num

where we have added the vectorize decorator from Numba. The argument of check_prime_gpu seems to be defined as a scalar (single integer in this case), but the vectorize decorator will allow us to use an array as input. That array should consist of 4B (bytes) or 32b (bit) integers, indicated by (int32). The return array will also consist of 32b integers, with zeros for the non-primes. The nonzero values are the primes.

Let us run it and record the elapsed time:

PYTHON

%timeit -n 10 -r 1 check_prime_gpu(numbers)

which should show you a significant speedup:

OUTPUT

5.9 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)

This amounts to a speedup of our code of a factor 11 compared to the jit(nopython=True) decorated code on the CPU.

Key Points

  • “Numba can be used to run your own Python functions on the GPU.”
  • “Functions may need to be changed to run correctly on a GPU.”

Content from A Better Look at the GPU


Last updated on 2024-03-12 | Edit this page

Overview

Questions

  • “How does a GPU work?”

Objectives

  • “Understand how the GPU is organized.”
  • “Understand the building blocks of the general GPU programming model.”

So far we have learned how to replace calls to NumPy and SciPy functions to equivalent ones running on the GPU using CuPy, and how to run some of our own Python functions on the GPU using Numba. This was possible even without much knowledge of how a GPU works. In fact, the only thing we mentioned in previous episodes about the GPU is that it is a device specialized in running parallel workloads, and that it is its own system, connected to our main memory and CPU by some kind of bus.

The connection between CPU and GPU
The connection between CPU and GPU

However, before moving to a programming language designed especially for GPUs, we need to introduce some concepts that will be useful to understand the next episodes.

The GPU, a High Level View at the Hardware

We can see the GPU like a collection of processors, sharing some common memory, akin to a traditional multi-processor system. Each processor executes code independently of the others, and internally it has tens to hundreds of cores, and some private memory space; in some GPUs the different processors can even execute different programs. The cores are often grouped in groups, and each group executes the same code, instruction by instruction, in the same order and at the same time. All cores have access to the processor’s private memory space.

How Programs are Executed

Let us assume we are sending our code to the GPU for execution; how is the code being executed by the different processors? First of all, we need to know that each processor will execute the same code. As an example, we can look back at some code that we executed on the GPU using Numba, like the following snippet.

PYTHON

import numba as nb

@nb.vectorize(['int32(int32)'], target='cuda')
def check_prime_gpu(num):
    for i in range(2, (num // 2) + 1):
       if (num % i) == 0:
           return 0
    return num

We did not need to write a different check_prime_gpu function for each processor, or core, on the GPU; actually, we have no idea how many processors and cores are available on the GPU we just used to execute this code!

So we can imagine that each processors receives its copy of the check_prime_gpu function, and executes it independently of the other processors. We also know that by executing the following Python snippet, we are telling the GPU to execute our function on all numbers between 0 and 100000.

PYTHON

check_prime_gpu(np.arange(0, 10_000, dtype=np.int32))

So each processor will get a copy of the code, and one subset of the numbers between 0 and 10000. If we assume that our GPU has 4 processors, each of them will get around 2500 numbers to process; the processing of these numbers will be split among the various cores that the processor has. Again, let us assume that each processor has 8 cores, divided in 2 groups of 4 cores. Therefore, the 2500 numbers to process will be divided inside the processors in sets of 4 elements, and these sets will be scheduled for execution on the 2 groups of cores that each processor has available. While the processors cannot communicate with each other, the cores of the same processor can; however, there is no communication in our example code.

While so far in the lesson we had no control over the way in which the computation is mapped to the GPU for execution, this is something that we will address soon.

Different Memories

Another detail that we need to understand is that GPUs have different memories. We have a main memory that is available to all processors on the GPU; this memory, as we already know, is often physically separate from the CPU memory, but copies to and from are possible. Using this memory we can send data to the GPU, and copy results back to the CPU. This memory is not coherent, meaning that there is no guarantee that code running on one GPU processor will see the results of code running on a different GPU processor.

Internally, each processor has its own memory. This memory is faster than the GPU main memory, but smaller in size, and it is also coherent, although we need to wait for all cores to finish their memory operations before the results produced by some cores are available to all other cores in the same processor. Therefore, this memory can be used for communication among cores.

Finally, each core has also a very small, but very fast, memory, that is used mainly to store the operands of the instructions executed by each core. This memory is private, and cannot generally be used for communication.

Additional Material

A short, but at the same time detailed, introduction to GPU hardware and programming model can be found in the following video, extracted from the University of Utah’s undergraduate course on Computer Organization and presented by Rajeev Balasubramonian.

Screenshot of the YouTube video showing a slide

Content from Your First GPU Kernel


Last updated on 2024-03-12 | Edit this page

Overview

Questions

  • “How can I parallelize a Python application on a GPU?”
  • “How to write a GPU program?”
  • “What is CUDA?”

Objectives

  • “Recognize possible data parallelism in Python code”
  • “Understand the structure of a CUDA program”
  • “Execute a CUDA program in Python using CuPy”
  • “Measure the execution time of a CUDA kernel with CuPy”

Summing Two Vectors in Python

We start by introducing a program that, given two input vectors of the same size, stores the sum of the corresponding elements of the two input vectors into a third one.

PYTHON

def vector_add(A, B, C, size):
    for item in range(0, size):
        C[item] = A[item] + B[item]

One of the characteristics of this program is that each iteration of the for loop is independent from the other iterations. In other words, we could reorder the iterations and still produce the same output, or even compute each iteration in parallel or on a different device, and still come up with the same output. These are the kind of programs that we would call naturally parallel, and they are perfect candidates for being executed on a GPU.

Summing Two Vectors in CUDA

While we could just use CuPy to run something equivalent to our vector_add on a GPU, our goal is to learn how to write code that can be executed by GPUs, therefore we now begin learning CUDA.

The CUDA-C language is a GPU programming language and API developed by NVIDIA. It is mostly equivalent to C/C++, with some special keywords, built-in variables, and functions.

We begin our introduction to CUDA by writing a small kernel, i.e. a GPU program, that computes the same function that we just described in Python.

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
    int item = threadIdx.x;
    C[item] = A[item] + B[item];
}

Callout

We are aware that CUDA is a proprietary solution, and that there are open-source alternatives such as OpenCL. However, CUDA is the most used platform for GPU programming and therefore we decided to use it for our teaching material.

Running Code on the GPU with CuPy

Before delving deeper into the meaning of all lines of code, and before starting to understand how CUDA works, let us execute the code on a GPU and check if it is correct or not. To compile the code and manage the GPU in Python we are going to use the interface provided by CuPy.

PYTHON

import cupy

# size of the vectors
size = 1024

# allocating and populating the vectors
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)

# CUDA vector_add
vector_add_cuda_code = r'''
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
    int item = threadIdx.x;
    C[item] = A[item] + B[item];
}
'''
vector_add_gpu = cupy.RawKernel(vector_add_cuda_code, "vector_add")

vector_add_gpu((1, 1, 1), (size, 1, 1), (a_gpu, b_gpu, c_gpu, size))

And to be sure that the CUDA code does exactly what we want, we can execute our sequential Python code and compare the results.

PYTHON

import numpy as np

a_cpu = cupy.asnumpy(a_gpu)
b_cpu = cupy.asnumpy(b_gpu)
c_cpu = np.zeros(size, dtype=np.float32)

vector_add(a_cpu, b_cpu, c_cpu, size)

# test
if np.allclose(c_cpu, c_gpu):
    print("Correct results!")

OUTPUT

Correct results!

Understanding the CUDA Code

We can now move back to the CUDA code and analyze it line by line to highlight the differences between CUDA-C and standard C.

C

__global__ void vector_add(const float * A, const float * B, float * C, const int size)

This is the definition of our CUDA vector_add function. The __global__ keyword is an execution space identifier, and it is specific to CUDA. What this keyword means is that the defined function will be able to run on the GPU, but can also be called from the host (in our case the Python interpreter running on the CPU). All of our kernel definitions will be preceded by this keyword.

Other execution space identifiers in CUDA-C are __host__, and __device__. Functions annotated with the __host__ identifier will run on the host, and be only callable from the host, while functions annotated with the __device__ identifier will run on the GPU, but can only be called from the GPU itself. We are not going to use these identifiers as often as __global__.

The following table offers a recapitulation of the keywords we just introduced.

Keyword Description
__global__ the function is visible to the host and the GPU, and runs on the GPU
__host__ the function is visible only to the host, and runs on the host
__device__ the function is visible only to the GPU, and runs on the GPU

The following is the part of the code in which we do the actual work.

C

int item = threadIdx.x;
C[item] = A[item] + B[item];

As you may see, it looks similar to the innermost loop of our vector_add Python function, with the main difference being in how the value of the item variable is evaluated.

In fact, while in Python the content of item is the result of the range function, in CUDA we are reading a special variable, i.e. threadIdx, containing a triplet that indicates the id of a thread inside a three-dimensional CUDA block. In this particular case we are working on a one dimensional vector, and therefore only interested in the first dimension, that is stored in the x field of this variable.

Challenge: loose threads

We know enough now to pause for a moment and do a little exercise. Assume that in our vector_add kernel we replace the following line:

C

int item = threadIdx.x;

With this other line of code:

C

int item = 1;

What will the result of this change be?

  1. Nothing changes
  2. Only the first thread is working
  3. Only C[1] is written
  4. All elements of C are zero

The correct answer is number 3, only the element C[1] is written, and we do not even know by which thread!

Computing Hierarchy in CUDA

In the previous example we had a small vector of size 1024, and each of the 1024 threads we generated was working on one of the element.

What would happen if we changed the size of the vector to a larger number, such as 2048? We modify the value of the variable size and try again.

PYTHON

# size of the vectors
size = 2048

# allocating and populating the vectors
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)

# CUDA vector_add
vector_add_gpu = cupy.RawKernel(r'''
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
    int item = threadIdx.x;
    C[item] = A[item] + B[item];
}
''', "vector_add")

vector_add_gpu((1, 1, 1), (size, 1, 1), (a_gpu, b_gpu, c_gpu, size))

This is how the output should look like when running the code in a Jupyter Notebook:

OUTPUT

---------------------------------------------------------------------------

CUDADriverError                           Traceback (most recent call last)

<ipython-input-4-a26bc8acad2fin <module>()
     19 ''', "vector_add")
     20 
---21 vector_add_gpu((1, 1, 1), (size, 1, 1), (a_gpu, b_gpu, c_gpu, size))
     22 
     23 print(c_gpu)

cupy/core/raw.pyx in cupy.core.raw.RawKernel.__call__()

cupy/cuda/function.pyx in cupy.cuda.function.Function.__call__()

cupy/cuda/function.pyx in cupy.cuda.function._launch()

cupy_backends/cuda/api/driver.pyx in cupy_backends.cuda.api.driver.launchKernel()

cupy_backends/cuda/api/driver.pyx in cupy_backends.cuda.api.driver.check_status()

CUDADriverError: CUDA_ERROR_INVALID_VALUE: invalid argument

The reason for this error is that most GPUs will not allow us to execute a block composed of more than 1024 threads. If we look at the parameters of our functions we see that the first two parameters are two triplets.

PYTHON

vector_add_gpu((1, 1, 1), (size, 1, 1), (a_gpu, b_gpu, c_gpu, size))

The first triplet specifies the size of the CUDA grid, while the second triplet specifies the size of the CUDA block. The grid is a three-dimensional structure in the CUDA programming model and it represent the organization of a whole kernel execution. A grid is made of one or more independent blocks, and in the case of our previous snippet of code we have a grid composed by a single block (1, 1, 1). The size of this block is specified by the second triplet, in our case (size, 1, 1). While blocks are independent of each other, the thread composing a block are not completely independent, they share resources and can also communicate with each other.

To go back to our example, we can modify che grid specification from (1, 1, 1) to (2, 1, 1), and the block specification from (size, 1, 1) to (size // 2, 1, 1). If we run the code again, we should now get the expected output.

We already introduced the special variable threadIdx when introducing the vector_add CUDA code, and we said it contains a triplet specifying the coordinates of a thread in a thread block. CUDA has other variables that are important to understand the coordinates of each thread and block in the overall structure of the computation.

These special variables are blockDim, blockIdx, and gridDim, and they are all triplets. The triplet contained in blockDim represents the size of the calling thread’s block in three dimensions. While the content of threadIdx is different for each thread in the same block, the content of blockDim is the same because the size of the block is the same for all threads. The coordinates of a block in the computational grid are contained in blockIdx, therefore the content of this variable will be the same for all threads in the same block, but different for threads in different blocks. Finally, gridDim contains the size of the grid in three dimensions, and it is again the same for all threads.

The following table offers a recapitulation of the keywords we just introduced.

Keyword Description
threadIdx the ID of a thread in a block
blockDim the size of a block, i.e. the number of threads per dimension
blockIdx the ID of a block in the grid
gridDim the size of the grid, i.e. the number of blocks per dimension

Challenge: hidden variables

Given the following snippet of code:

PYTHON

size = 512
vector_add_gpu((4, 1, 1), (size, 1, 1), (a_gpu, b_gpu, c_gpu, size))

What is the content of the blockDim and gridDim variables inside the CUDA vector_add kernel?

The content of blockDim is (512, 1, 1) and the content of gridDim is (4, 1, 1), for all threads.

What happens if we run the code that we just modified to work on an vector of 2048 elements, and compare the results with our CPU version?

PYTHON

# size of the vectors
size = 2048

# allocating and populating the vectors
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)
a_cpu = cupy.asnumpy(a_gpu)
b_cpu = cupy.asnumpy(b_gpu)
c_cpu = np.zeros(size, dtype=np.float32)

# CPU code
def vector_add(A, B, C, size):
    for item in range(0, size):
        C[item] = A[item] + B[item]

# CUDA vector_add
vector_add_gpu = cupy.RawKernel(r'''
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
    int item = threadIdx.x;
    C[item] = A[item] + B[item];
}
''', "vector_add")

# execute the code
vector_add_gpu((2, 1, 1), (size // 2, 1, 1), (a_gpu, b_gpu, c_gpu, size))
vector_add(a_cpu, b_cpu, c_cpu, size)

# test
if np.allclose(c_cpu, c_gpu):
    print("Correct results!")
else:
    print("Wrong results!")

OUTPUT

Wrong results!

The results are wrong! In fact, while we increased the number of threads we launch, we did not modify the kernel code to compute the correct results using the new builtin variables we just introduced.

Challenge: scaling up

In the following code, fill in the blank to work with vectors that are larger than the largest CUDA block (i.e. 1024).

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = ______________;
   C[item] = A[item] + B[item];
}

The correct answer is (blockIdx.x * blockDim.x) + threadIdx.x. The following code is the complete vector_add that can work with vectors larger than 1024 elements.

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   C[item] = A[item] + B[item];
}

Vectors of Arbitrary Size

So far we have worked with a number of threads that is the same as the elements in the vector. However, in a real world scenario we may have to process vectors of arbitrary size, and to do this we need to modify both the kernel and the way it is launched.

Challenge: more work than necessary

We modified the vector_add kernel to include a check for the size of the vector, so that we only compute elements that are within the vector boundaries. However the code is not correct as it is written now. Can you rewrite the code to make it work?

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   if ( item < size )
   {
      int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   }
   C[item] = A[item] + B[item];
}

The correct way to modify the vector_add to work on vectors of arbitrary size is to first compute the coordinates of each thread, and then perform the sum only on elements that are within the vector boundaries, as shown in the following snippet of code.

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   if ( item < size )
   {
      C[item] = A[item] + B[item];
   }
}

To test our changes we can modify the size of the vectors from 2048 to 10000, and execute the code again.

OUTPUT

---------------------------------------------------------------------------

CUDADriverError                           Traceback (most recent call last)

<ipython-input-20-00d938215d28in <module>()
     31 
     32 # Execute the code
---33 vector_add_gpu((2, 1, 1), (size // 2, 1, 1), (a_gpu, b_gpu, c_gpu, size))
     34 vector_add(a_cpu, b_cpu, c_cpu, size)
     35 

cupy/core/raw.pyx in cupy.core.raw.RawKernel.__call__()

cupy/cuda/function.pyx in cupy.cuda.function.Function.__call__()

cupy/cuda/function.pyx in cupy.cuda.function._launch()

cupy/cuda/driver.pyx in cupy.cuda.driver.launchKernel()

cupy/cuda/driver.pyx in cupy.cuda.driver.check_status()

CUDADriverError: CUDA_ERROR_INVALID_VALUE: invalid argument

This error is telling us that CUDA cannot launch a block with size // 2 threads, because the maximum amount of threads in a kernel is 1024 and we are requesting 5000 threads.

What we need to do is to make grid and block more flexible, so that they can adapt to vectors of arbitrary size. To do that, we can replace the Python code to call vector_add_gpu with the following code.

PYTHON

import math

grid_size = (int(math.ceil(size / 1024)), 1, 1)
block_size = (1024, 1, 1)

vector_add_gpu(grid_size, block_size, (a_gpu, b_gpu, c_gpu, size))

With these changes we always have blocks composed of 1024 threads, but we adapt the number of blocks so that we always have enough to threads to compute all elements in the vector. If we want to be able to easily modify the number of threads per block, we can even rewrite the code like the following:

PYTHON

threads_per_block = 1024
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)

vector_add_gpu(grid_size, block_size, (a_gpu, b_gpu, c_gpu, size))

So putting this all together in a full snippet we can execute the code again.

PYTHON

vector_add_cuda_code = r'''
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   if ( item < size )
   {
      C[item] = A[item] + B[item];
   }
}
'''
vector_add_gpu = cupy.RawKernel(vector_add_cuda_code, "vector_add")

threads_per_block = 1024
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)

vector_add_gpu(grid_size, block_size, (a_gpu, b_gpu, c_gpu, size))

if np.allclose(c_cpu, c_gpu):
    print("Correct results!")
else:
    print("Wrong results!")

OUTPUT

Correct results!

Challenge: compute prime numbers with CUDA

Given the following Python code, similar to what we have seen in the previous episode about Numba, write the missing CUDA kernel that computes all the prime numbers up to a certain upper bound.

PYTHON

import numpy as np
import cupy
import math
from cupyx.profiler import benchmark

# CPU version
def all_primes_to(upper : int, prime_list : list):
    for num in range(0, upper):
        prime = True
        for i in range(2, (num // 2) + 1):
            if (num % i) == 0:
                prime = False
                break
        if prime:
            prime_list[num] = 1

upper_bound = 100_000
all_primes_cpu = np.zeros(upper_bound, dtype=np.int32)

# GPU version
check_prime_gpu_code = r'''
extern "C"
__global__ void all_primes_to(int size, int * const all_prime_numbers)
{
   for ( int number = 0; number < size; number++ )
   {
       int result = 1;
       for ( int factor = 2; factor <= number / 2; factor++ )
       {
           if ( number % factor == 0 )
           {
               result = 0;
               break;
           }
       }
>
       all_prime_numbers[number] = result;
   }
}
'''
# Allocate memory
all_primes_gpu = cupy.zeros(upper_bound, dtype=cupy.int32)

# Setup the grid
all_primes_to_gpu = cupy.RawKernel(check_prime_gpu_code, "all_primes_to")
grid_size = (int(math.ceil(upper_bound / 1024)), 1, 1)
block_size = (1024, 1, 1)

# Benchmark and test
%timeit -n 1 -r 1 all_primes_to(upper_bound, all_primes_cpu)
execution_gpu = benchmark(all_primes_to_gpu, (grid_size, block_size, (upper_bound, all_primes_gpu)), n_repeat=10)
gpu_avg_time = np.average(execution_gpu.gpu_times)
print(f"{gpu_avg_time:.6f} s")
>
if np.allclose(all_primes_cpu, all_primes_gpu):
    print("Correct results!")
else:
    print("Wrong results!")

There is no need to modify anything in the code, except the body of the CUDA all_primes_to inside the check_prime_gpu_code string, as we did in the examples so far.

Be aware that the provided CUDA code is a direct port of the Python code, and therefore very slow. If you want to test it, user a lower value for upper_bound.

One possible solution for the CUDA kernel is provided in the following code.

C


extern "C"
__global__ void all_primes_to(int size, int * const all_prime_numbers)
{
    int number = (blockIdx.x * blockDim.x) + threadIdx.x;
    int result = 1;

    if ( number < size )
    {
        for ( int factor = 2; factor <= number / 2; factor++ )
        {
            if ( number % factor == 0 )
            {
                result = 0;
                break;
            }
        }

        all_prime_numbers[number] = result;
    }
}

The outermost loop in Python is replaced by having each thread testing for primeness a different number of the sequence. Having one number assigned to each thread via its ID, the kernel implements the innermost loop the same way it is implemented in Python.

Key Points

  • “Precede your kernel definition with the __global__ keyword”
  • “Use built-in variables threadIdx, blockIdx, gridDim and blockDim to identify each thread”

Content from Registers, Global, and Local Memory


Last updated on 2024-03-12 | Edit this page

Overview

Questions

  • “What are registers?”
  • “How to share data between host and GPU?”
  • “Which memory is accessible to threads and thread blocks?”

Objectives

  • “Understanding the difference between registers and device memory”
  • “Understanding the difference between local and global memory”

Now that we know how to write a CUDA kernel to run code on the GPU, and how to use the Python interface provided by CuPy to execute it, it is time to look at the different memory spaces in the CUDA programming model.

Registers

Registers are fast on-chip memories that are used to store operands for the operations executed by the computing cores.

Did we encounter registers in the vector_add code used in the previous episode? Yes we did! The variable item is, in fact, stored in a register for at least part, if not all, of a thread’s execution. In general all scalar variables defined in CUDA code are stored in registers.

Registers are local to a thread, and each thread has exclusive access to its own registers: values in registers cannot be accessed by other threads, even from the same block, and are not available for the host. Registers are also not permanent, therefore data stored in registers is only available during the execution of a thread.

Challenge: how many registers are we using?

Can you guess how many registers are we using in the following vector_add code?

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   
   if ( item < size )
   {
      C[item] = A[item] + B[item];
   }
}

In general, it is not possible to exactly know how many registers the compiler will use without examining the output generated by the compiler itself. However, we can roughly estimate the amount of necessary registers based on the variables used. We most probably need one register to store the variable item, two registers to store the content of A[item] and B[item], and one additional register to store the sum A[item] + B[item]. So the number of registers that vector_add probably uses is 4.

If we want to make registers use more explicit in the vector_add code, we can try to rewrite it in a slightly different, but equivalent, way.

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   float temp_a, temp_b, temp_c;

   if ( item < size )
   {
       temp_a = A[item];
       temp_b = B[item];
       temp_c = temp_a + temp_b;
       C[item] = temp_c;
   }
}

In this new version of vector_add we explicitly declare three float variables to store the values loaded from memory and the sum of our input items, making the estimation of used registers more obvious.

This it totally unnecessary in the case of our example, because the compiler will determine on its own the right amount of registers to allocate per thread, and what to store in them. However, explicit register usage can be important for reusing items already loaded from memory.

Callout

Registers are the fastest memory on the GPU, so using them to increase data reuse is an important performance optimization. We will look at some examples of manually using registers to improve performance in future episodes.

Small CUDA arrays, which size is known at compile time, will also be allocated in registers by the compiler. We can rewrite the previous version of vector_add to work with an array of registers.

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   float temp[3];

   if ( item < size )
   {
       temp[0] = A[item];
       temp[1] = B[item];
       temp[2] = temp[0] + temp[1];
       C[item] = temp[2];
   }
}

Once again, this is not something that we would normally do, and it is provided only as an example of how to work with arrays of registers.

Global Memory

Global memory can be considered the main memory space of the GPU in CUDA. It is allocated, and managed, by the host, and it is accessible to both the host and the GPU, and for this reason the global memory space can be used to exchange data between the two. It is the largest memory space available, and therefore it can contain much more data than registers, but it is also slower to access. This memory space does not require any special memory space identifier.

Challenge: identify when global memory is used

Observe the code of the following vector_add and identify where global memory is used.

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   
   if ( item < size )
   {
      C[item] = A[item] + B[item];
   }
}

The vectors A, B, and C are stored in global memory.

Memory allocated on the host, and passed as a parameter to a kernel, is by default allocated in global memory.

Global memory is accessible by all threads, from all thread blocks. This means that a thread can read and write any value in global memory.

Callout

While global memory is visible to all threads, remember that global memory is not coherent, and changes made by one thread block may not be available to other thread blocks during the kernel execution. However, all memory operations are finalized when the kernel terminates.

Local Memory

Memory can also be statically allocated from within a kernel, and according to the CUDA programming model such memory will not be global but local memory. Local memory is only visible, and therefore accessible, by the thread allocating it. So all threads executing a kernel will have their own privately allocated local memory.

Challenge: use local memory

Modify the following of vector_add so that intermediate data products are stored in local memory, and only the final result is saved into global memory.

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   
   if ( item < size )
   {
      C[item] = A[item] + B[item];
   }
}

Hint: have a look at the example using an array of registers, but find a way to use a variable and not a constant for the size.

We need to pass the size of the local array as a new parameter to the kernel, because if we just specified 3 in the code, the compiler would allocate registers and not local memory.

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size, const int local_memory_size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   float local_memory[local_memory_size];
   
   if ( item < size )
   {
      local_memory[0] = A[item];
      local_memory[1] = B[item];
      local_memory[2] = local_memory[0] + local_memory[1];
      C[item] = local_memory[2];
   }
}

The host code could be modified adding one line and changing the way the kernel is called.

PYTHON

local_memory_size = 3
vector_add_gpu((2, 1, 1), (size // 2, 1, 1), (a_gpu, b_gpu, c_gpu, size, local_memory_size))

Local memory is not not a particularly fast memory, and in fact it has similar throughput and latency of global memory, but it is much larger than registers. As an example, local memory is automatically used by the CUDA compiler to store spilled registers, i.e. to temporarily store variables that cannot be kept in registers anymore because there is not enough space in the register file, but that will be used again in the future and so cannot be erased.

Key Points

  • “Registers can be used to locally store data and avoid repeated memory operations”
  • “Global memory is the main memory space and it is used to share data between host and GPU”
  • “Local memory is a particular type of memory that can be used to store data that does not fit in registers and is private to a thread”

Content from Shared Memory and Synchronization


Last updated on 2024-03-12 | Edit this page

Overview

Questions

  • “Is there a way to share data between threads of a same block?”
  • “Can threads inside a block wait for other threads?”

Objectives

  • “Learn how to share data between threads”
  • “Learn how to synchronize threads”

So far we looked at how to use CUDA to accelerate the computation, but a common pattern in all the examples we encountered so far is that threads worked in isolation. While having different threads perform the same operation on different data is a good pattern for working with GPUs, there are cases in which threads need to communicate. This communication may be necessary because of the way the algorithm we are trying to implement works, or it may derive from a performance goal we are trying to achieve.

Shared Memory

Shared memory is a CUDA memory space that is shared by all threads in a thread block. In this case shared means that all threads in a thread block can write and read to block-allocated shared memory, and all changes to this memory will be eventually available to all threads in the block.

To allocate an array in shared memory we need to preface the definition with the identifier __shared__.

Challenge: use of shared memory

Modify the following code to allocate the temp array in shared memory.

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
  int item = (blockIdx.x * blockDim.x) + threadIdx.x;
  float temp[3];

  if ( item < size )
  {
      temp[0] = A[item];
      temp[1] = B[item];
      temp[2] = temp[0] + temp[1];
      C[item] = temp[2];
  }
}

To use shared memory for the temp array add the identifier __shared__ to its definition, like in the following code.

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
  int item = (blockIdx.x * blockDim.x) + threadIdx.x;
  __shared__ float temp[3];
>
  if ( item < size )
  {
      temp[0] = A[item];
      temp[1] = B[item];
      temp[2] = temp[0] + temp[1];
      C[item] = temp[2];
  }
}

While syntactically correct, the previous example is functionally wrong. The reason is that the temp array is not anymore private to the thread allocating it, but it is now shared by the whole thread block.

Challenge: what is the result of the previous code block?

The previous code example is functionally wrong. Can you guess what the result of its execution will be?

The result is non deterministic, and definitely not the same as the previous versions of vector_add. Threads will overwrite each other temporary values,and there will be no guarantee on which value is visible by each thread.

To fix the previous kernel we should allocate enough shared memory for each thread to store three values, so that each thread has its own section of the shared memory array to work with.

To allocate enough memory we need to replace the constant 3 in __shared__ float temp[3] with something else. If we know that each thread block has 1024 threads, we can write something like the following:

C

__shared__ float temp[3 * 1024];

But we know by experience that having constants in the code is not a scalable and maintainable solution. The problem is that we need to have a constant value if we want to declare a shared memory array, because the compiler needs to know how much memory to allocate.

A solution to this problem is to not specify the size of the array, and allocate the memory somewhere else.

C

extern __shared__ float temp[];

And then use CuPy to instruct CUDA about how much shared memory, in bytes, each thread block needs. This can be done by adding the named parameter shared_mem to the kernel call.

PYTHON

vector_add_gpu((2, 1, 1), (size // 2, 1, 1), (a_gpu, b_gpu, c_gpu, size), shared_mem=((size // 2) * 3 * cupy.dtype(cupy.float32).itemsize))

As you may have noticed, we had to retrieve the size in bytes of the data type cupy.float32, and this is done with cupy.dtype(cupy.float32).itemsize.

After these changes, the body of the kernel needs to be modified to use the right indices:

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
  int item = (blockIdx.x * blockDim.x) + threadIdx.x;
  int offset = threadIdx.x * 3;
  extern __shared__ float temp[];

  if ( item < size )
  {
      temp[offset + 0] = A[item];
      temp[offset + 1] = B[item];
      temp[offset + 2] = temp[offset + 0] + temp[offset + 1];
      C[item] = temp[offset + 2];
  }
}

And for completeness, we present the full Python code.

PYTHON

import math
import numpy as np
import cupy

# vector size
size = 2048

# GPU memory allocation
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)
gpu_args = (a_gpu, b_gpu, c_gpu, size)

# CPU memory allocation
a_cpu = cupy.asnumpy(a_gpu)
b_cpu = cupy.asnumpy(b_gpu)
c_cpu = np.zeros(size, dtype=np.float32)

# CUDA code
vector_add_cuda_code = r'''
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
  int item = (blockIdx.x * blockDim.x) + threadIdx.x;
  int offset = threadIdx.x * 3;
  extern __shared__ float temp[];

  if ( item < size )
  {
      temp[offset + 0] = A[item];
      temp[offset + 1] = B[item];
      temp[offset + 2] = temp[offset + 0] + temp[offset + 1];
      C[item] = temp[offset + 2];
  }
}
'''

# compile and execute code
vector_add_gpu = cupy.RawKernel(vector_add_cuda_code, "vector_add")
threads_per_block = 32
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)
vector_add_gpu(grid_size, block_size, gpu_args, shared_mem=(threads_per_block * 3 * cupy.dtype(cupy.float32).itemsize))

# execute Python code and compare results
vector_add(a_cpu, b_cpu, c_cpu, size)
np.allclose(c_cpu, c_gpu)

The code is now correct, although it is still not very useful. We are definitely using shared memory, and we are using it the correct way, but there is no performance gain we achieved by doing so. Actually, we are making our code slower, not faster, because shared memory is slower than registers.

Let us, therefore, work on an example where using shared memory is actually useful. We start again with some Python code.

PYTHON

def histogram(input_array, output_array):
    for item in input_array:
        output_array[item] = output_array[item] + 1

The histogram function, as the name suggests, computes the histogram of an array of integers, i.e. counts how many instances of each integer are in input_array, and writes the count in output_array. We can now generate some data and run the code.

PYTHON

input_array = np.random.randint(256, size=2048, dtype=np.int32)
output_array = np.zeros(256, dtype=np.int32)
histogram(input_array, output_array)

Everything as expected. We can now write the equivalent code in CUDA.

C

extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;

    output[input[item]] = output[input[item]] + 1;
}

Challenge: error in the histogram

If you look at the CUDA histogram code, there is a logical error that prevents it to produce the correct results. Can you find it?

C

extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;

    output[input[item]] = output[input[item]] + 1;
}

The GPU is a highly parallel device, executing multiple threads at the same time. In the previous code different threads could be updating the same output item at the same time, producing wrong results.

To solve this problem, we need to use a function from the CUDA library named atomicAdd. This function ensures that the increment of output_array happens in an atomic way, so that there are no conflicts in case multiple threads want to update the same item at the same time.

C

extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;

    atomicAdd(&(output[input[item]]), 1);
}

And the full Python code snippet.

PYTHON

import math
import numpy as np
import cupy
from cupyx.profiler import benchmark

def histogram(input_array, output_array):
    for item in input_array:
        output_array[item] = output_array[item] + 1

# input size
size = 2**25

# allocate memory on CPU and GPU
input_gpu = cupy.random.randint(256, size=size, dtype=cupy.int32)
input_cpu = cupy.asnumpy(input_gpu)
output_gpu = cupy.zeros(256, dtype=cupy.int32)
output_cpu = cupy.asnumpy(output_gpu)

# CUDA code
histogram_cuda_code = r'''
extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;

    atomicAdd(&(output[input[item]]), 1);
}
'''

# compile and setup CUDA code
histogram_gpu = cupy.RawKernel(histogram_cuda_code, "histogram")
threads_per_block = 256
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)

# check correctness
histogram(input_cpu, output_cpu)
histogram_gpu(grid_size, block_size, (input_gpu, output_gpu))
if np.allclose(output_cpu, output_gpu):
    print("Correct results!")
else:
    print("Wrong results!")

# measure performance
%timeit -n 1 -r 1 histogram(input_cpu, output_cpu)
execution_gpu = benchmark(histogram_gpu, (grid_size, block_size, (input_gpu, output_gpu)), n_repeat=10)
gpu_avg_time = np.average(execution_gpu.gpu_times)
print(f"{gpu_avg_time:.6f} s")

The CUDA code is now correct, and computes the same result as the Python code; it is also faster than the Python code, as you can see from measuring the execution time. However, we are accumulating the results directly in global memory, and the more conflicts we have in global memory, the lower the performance of our histogram will be. Moreover, the access pattern to the output array is very irregular, being dependent on the content of the input array. GPUs are designed for very regular computations, and so if we can make the histogram more regular we can hope in a further improvement in performance.

As you may expect, we can improve the memory access pattern by using shared memory.

Challenge: use shared memory to speed up the histogram

Implement a new version of the CUDA histogram function that uses shared memory to reduce conflicts in global memory. Modify the following code and follow the suggestions in the comments.

C

extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    // Declare temporary histogram in shared memory
    int temp_histogram[256];

    // Update the temporary histogram in shared memory
    atomicAdd();
    // Update the global histogram in global memory, using the temporary histogram
    atomicAdd();
}

Hint: for this exercise, you can safely assume that the size of output is the same as the number of threads in a block.

Hint: atomicAdd can be used on both global and shared memory.

The following code shows one of the possible solutions.

C

extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    __shared__ int temp_histogram[256];

    atomicAdd(&(temp_histogram[input[item]]), 1);
    atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}

The idea behind this solution is to reduce the expensive conflicts in global memory by having a temporary histogram in shared memory. After a block has finished processing its fraction of the input array, and the local histogram is populated, threads collaborate to update the global histogram. Not only this solution potentially reduces the conflicts in global memory, it also produces a better access pattern because threads read adjacent items of the input array, and write to adjacent elements of the output array during the second call to atomicAdd.

Thread Synchronization

There is still one potentially big issue in the histogram code we just wrote, and the issue is that shared memory is not coherent without explicit synchronization. The problem lies in the following two lines of code:

C

atomicAdd(&(temp_histogram[input[item]]), 1);
atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);

In the first line each thread updates one arbitrary position in shared memory, depending on the value of the input, while in the second line each thread reads the element in shared memory corresponding to its thread ID. However, the changes to shared memory are not automatically available to all other threads, and therefore the final result may not be correct.

To solve this issue, we need to explicitly synchronize all threads in a block, so that memory operations are also finalized and visible to all. To synchronize threads in a block, we use the __syncthreads() CUDA function. Moreover, shared memory is not initialized, and the programmer needs to take care of that too. So we need to first initialize temp_histogram, wait that all threads are done doing this, perform the computation in shared memory, wait again that all threads are done, and only then update the global array.

C

extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    __shared__ int temp_histogram[256];
 
    // Initialize shared memory and synchronize
    temp_histogram[threadIdx.x] = 0;
    __syncthreads();

    // Compute shared memory histogram and synchronize
    atomicAdd(&(temp_histogram[input[item]]), 1);
    __syncthreads();

    // Update global histogram
    atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}

And the full Python code snippet.

PYTHON

import math
import numpy as np
import cupy
from cupyx.profiler import benchmark

def histogram(input_array, output_array):
    for item in input_array:
        output_array[item] = output_array[item] + 1

# input size
size = 2**25

# allocate memory on CPU and GPU
input_gpu = cupy.random.randint(256, size=size, dtype=cupy.int32)
input_cpu = cupy.asnumpy(input_gpu)
output_gpu = cupy.zeros(256, dtype=cupy.int32)
output_cpu = cupy.asnumpy(output_gpu)

# CUDA code
histogram_cuda_code = r'''
extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    __shared__ int temp_histogram[256];
 
    // Initialize shared memory and synchronize
    temp_histogram[threadIdx.x] = 0;
    __syncthreads();

    // Compute shared memory histogram and synchronize
    atomicAdd(&(temp_histogram[input[item]]), 1);
    __syncthreads();

    // Update global histogram
    atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}
'''

# compile and setup CUDA code
histogram_gpu = cupy.RawKernel(histogram_cuda_code, "histogram")
threads_per_block = 256
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)

# check correctness
histogram(input_cpu, output_cpu)
histogram_gpu(grid_size, block_size, (input_gpu, output_gpu))
if np.allclose(output_cpu, output_gpu):
    print("Correct results!")
else:
    print("Wrong results!")

# measure performance
%timeit -n 1 -r 1 histogram(input_cpu, output_cpu)
execution_gpu = benchmark(histogram_gpu, (grid_size, block_size, (input_gpu, output_gpu)), n_repeat=10)
gpu_avg_time = np.average(execution_gpu.gpu_times)
print(f"{gpu_avg_time:.6f} s")

While both versions of the GPU histogram are correct, the one using shared memory is faster; but how fast? On a NVIDIA Tesla T4 accessed via Google Colab, the shared memory version is ten times faster than the version doing atomic operations on global memory.

Key Points

  • “Shared memory is faster than global memory and local memory”
  • “Shared memory can be used as a user-controlled cache to speedup code”
  • “Size of shared memory arrays must be known at compile time if allocated inside a thread”
  • “It is possible to declare extern shared memory arrays and pass the size during kernel invocation”
  • “Use __shared__ to allocate memory in the shared memory space”
  • “Use __syncthreads() to wait for shared memory operations to be visible to all threads in a block”

Content from Constant Memory


Last updated on 2024-03-12 | Edit this page

Overview

Questions

  • “Is there a way to have a read-only cache in CUDA?”

Objectives

  • “Understanding when and how to use constant memory”

Constant Memory

Constant memory is a read-only cache which content can be broadcasted to multiple threads in a block. A variable allocated in constant memory needs to be declared in CUDA by using the special __constant__ identifier, and it must be a global variable, i.e. it must be declared in the scope that contains the kernel, not inside the kernel itself. If all of this sounds complex do not worry, we are going to see how this works with an example.

C

extern "C" {
#define BLOCKS 2

__constant__ float factors[BLOCKS];

__global__ void sum_and_multiply(const float * A, const float * B, float * C, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    C[item] = (A[item] + B[item]) * factors[blockIdx.x];
}
}

In the previous code snippet we implemented a kernel that, given two vectors A and B, stores their element-wise sum in a third vector, C, scaled by a certain factor; this factor is the same for all threads in the same thread block. Because these factors are shared, i.e. all threads in the same thread block use the same factor for scaling their sums, it is a good idea to use constant memory for the factors array. In fact you can see that the definition of factors is preceded by the __constant__ keyword, and said definition is in the global scope. It is important to note that the size of the constant array needs to be known at compile time, therefore the use of the define preprocessor statement. On the kernel side there is no need to do more, the factors vector can be normally accessed inside the code as any other vector, and because it is a global variable it does not need to be passed to the kernel as a function argument.

The initialization of constant memory happens on the host side, and we show how this is done in the next code snippet.

PYTHON

# compile the code
module = cupy.RawModule(code=cuda_code)
# allocate and copy constant memory
factors_ptr = module.get_global("factors")
factors_gpu = cupy.ndarray(2, cupy.float32, factors_ptr)
factors_gpu[...] = cupy.random.random(2, dtype=cupy.float32)

From the previous code it is clear that dealing with constant memory is a slightly more verbose affair than usual. First, we need to compile the code, that in this case is contained in a Python string named cuda_code. This is necessary because constant memory is defined in the CUDA code, so we need CUDA to allocate the necessary memory, and then provide us with a pointer to this memory. By calling the method get_global we ask the CUDA subsystem to provide us with the location of a global object, in this case the array factors. We can then create our own CuPy array and point that to the object returned by get_global, so that we can use it in Python as we would normally do. Note that we use the constant 2 for the size of the array, the same number we are using in the CUDA code; it is important that we use the same number or we may end up accessing memory that is outside the bound of the array. Lastly, we initialize the array with some random floating point numbers.

Challenge: print the content of constant memory

What should be the output of the following line of code?

PYTHON

print(factors_gpu)

In our case the output of this line of code is two floating point numbers, e.g. [0.11390183 0.2585096 ]. However, we are not really accessing the content of the GPU’s constant memory from the host, we are simply accessing the host-side copy of the data maintained by CuPy.

We can now combine all the code together and execute it.

PYTHON

# size of the vectors
size = 2048

# allocating and populating the vectors
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)
# prepare arguments
args = (a_gpu, b_gpu, c_gpu, size)

# CUDA code
cuda_code = r'''
extern "C" {
#define BLOCKS 2

__constant__ float factors[BLOCKS];

__global__ void sum_and_multiply(const float * A, const float * B, float * C, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    C[item] = (A[item] + B[item]) * factors[blockIdx.x];
}
}
'''

# compile and access the code
module = cupy.RawModule(code=cuda_code)
sum_and_multiply = module.get_function("sum_and_multiply")
# allocate and copy constant memory
factors_ptr = module.get_global("factors")
factors_gpu = cupy.ndarray(2, cupy.float32, factors_ptr)
factors_gpu[...] = cupy.random.random(2, dtype=cupy.float32)

sum_and_multiply((2, 1, 1), (size // 2, 1, 1), args)

As you can see the code is not very general, it uses constants and works only with two blocks, but it is a working example of how to use constant memory.

Challenge: generalize the previous code

Have a look again at the code using constant memory, and make it general enough to be able to run on input of arbitrary size. Experiment with some different input sizes.

One of the possible solutions is the following one.

PYTHON

# size of the vectors
size = 10**6

# allocating and populating the vectors
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)
# prepare arguments
args = (a_gpu, b_gpu, c_gpu, size)

# CUDA code
cuda_code = r'''
extern "C" {
__constant__ float factors[BLOCKS];

__global__ void sum_and_multiply(const float * A, const float * B, float * C, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    if ( item < size )
    {
        C[item] = (A[item] + B[item]) * factors[blockIdx.x];
    }
}
}
'''

# compute the number of blocks and replace "BLOCKS" in the CUDA code
threads_per_block = 1024
num_blocks = int(math.ceil(size / threads_per_block))
cuda_code = cuda_code.replace("BLOCKS", f"{num_blocks}") 

# compile and access the code
module = cupy.RawModule(code=cuda_code)
sum_and_multiply = module.get_function("sum_and_multiply")
# allocate and copy constant memory
factors_ptr = module.get_global("factors")
factors_gpu = cupy.ndarray(num_blocks, cupy.float32, factors_ptr)
factors_gpu[...] = cupy.random.random(num_blocks, dtype=cupy.float32)

sum_and_multiply((num_blocks, 1, 1), (threads_per_block, 1, 1), args)

Key Points

  • “Globally scoped arrays, which size is known at compile time, can be stored in constant memory using the __constant__ identifier”

Content from Concurrent access to the GPU


Last updated on 2024-03-12 | Edit this page

Overview

Questions

  • “Is it possible to concurrently execute more than one kernel on a single GPU?”

Objectives

  • “Understand how to use CUDA streams and events”

Concurrently execute two kernels on the same GPU

So far we only focused on completing one operation at the time on the GPU, writing and executing a single CUDA kernel each time. However the GPU has enough resources to perform more than one task at the same time.

Let us assume that, for our program, we need to compute both a list of prime numbers, and a histogram, two kernels that we developed in this same lesson. We could write both kernels in CUDA, and then execute them on the GPU, as shown in the following code.

PYTHON

import numpy as np
import cupy
import math
from cupyx.profiler import benchmark

upper_bound = 100_000
histogram_size = 2**25

# GPU code
check_prime_gpu_code = r'''
extern "C"
__global__ void all_primes_to(int size, int * const all_prime_numbers)
{
    int number = (blockIdx.x * blockDim.x) + threadIdx.x;
    int result = 1;

    if ( number < size )
    {
        for ( int factor = 2; factor <= number / 2; factor++ )
        {
            if ( number % factor == 0 )
            {
                result = 0;
                break;
            }
        }

        all_prime_numbers[number] = result;
    }
}
'''
histogram_cuda_code = r'''
extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    __shared__ int temp_histogram[256];
 
    // Initialize shared memory and synchronize
    temp_histogram[threadIdx.x] = 0;
    __syncthreads();

    // Compute shared memory histogram and synchronize
    atomicAdd(&(temp_histogram[input[item]]), 1);
    __syncthreads();

    // Update global histogram
    atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}
'''

# Allocate memory
all_primes_gpu = cupy.zeros(upper_bound, dtype=cupy.int32)
input_gpu = cupy.random.randint(256, size=histogram_size, dtype=cupy.int32)
output_gpu = cupy.zeros(256, dtype=cupy.int32)

# Compile and setup the grid
all_primes_to_gpu = cupy.RawKernel(check_prime_gpu_code, "all_primes_to")
grid_size_primes = (int(math.ceil(upper_bound / 1024)), 1, 1)
block_size_primes = (1024, 1, 1)
histogram_gpu = cupy.RawKernel(histogram_cuda_code, "histogram")
threads_per_block_hist = 256
grid_size_hist = (int(math.ceil(histogram_size / threads_per_block_hist)), 1, 1)
block_size_hist = (threads_per_block_hist, 1, 1)

# Execute the kernels
all_primes_to_gpu(grid_size_primes, block_size_primes, (upper_bound, all_primes_gpu))
histogram_gpu(grid_size_hist, block_size_hist, (input_gpu, output_gpu))

# Save results
output_one = all_primes_gpu
output_two = output_gpu

In the previous code, after allocating memory and compiling, we execute and measure the performance of one kernel, and when we are done we do the same for the other kernel.

This is technically correct, but there is no reason why one kernel should be executed before the other, because there is no dependency between these two operations.

Therefore, while this is fine in our example, in a real application we may want to run the two kernels concurrently on the GPU to reduce the total execution time. To achieve this in CUDA we need to use CUDA streams.

A stream is a sequence of GPU operations that is executed in order, and so far we have been implicitly using the defaul stream. This is the reason why all the operations we issued, such as data transfers and kernel invocations, are performed in the order we specify them in the Python code, and not in any other.

Have you wondered why after requesting data transfers to and from the GPU, we do not stop to check if they are complete before performing operations on such data? The reason is that within a stream all operations are carried out in order, so the kernel calls in our code are always performed after the data transfer from host to device is complete, and so on.

If we want to create new CUDA streams, we can do it this way using CuPy.

PYTHON

stream_one = cupy.cuda.Stream()
stream_two = cupy.cuda.Stream()

We can then execute the kernels in different streams by using the Python with statement.

PYTHON

with stream_one:
    all_primes_to_gpu(grid_size_primes, block_size_primes, (upper_bound, all_primes_gpu))
with stream_two:
    histogram_gpu(grid_size_hist, block_size_hist, (input_gpu, output_gpu))

Using the with statement we implicitly execute the CUDA operations in the code block using that stream. The result of doing this is that the second kernel, i.e. histogram_gpu, does not need to wait for all_primes_to_gpu to finish before being executed.

Stream synchronization

If we need to wait for all operations on a certain stream to finish, we can call the synchronize method. Continuing with the previous example, in the following Python snippet we wait for the execution of all_primes_to_gpu on stream_one to finish.

PYTHON

stream_one.synchronize()

This synchronization primitive is useful when we need to be sure that all operations on a stream are finished, before continuing. It is, however, a bit coarse grained. Imagine to have a stream with a whole sequence of operations enqueued, and another stream with one data dependency on one of these operations. If we use synchronize, we wait until all operations of said stream are completed before executing the other stream, thus negating the whole reason of using streams in the first place.

A possible solution is to insert a CUDA event at a certain position in the stream, and then wait specifically for that event. Events are created in Python in the following way.

PYTHON

interesting_event = cupy.cuda.Event()

And can then be added to a stream by using the record method. In the following example we will create two streams: in the first we will execute histogram_gpu twice, while in the second one we will execute all_primes_to_gpu with the condition that the kernel in the second stream is executed only after the first histogram is computed.

PYTHON

stream_one = cupy.cuda.Stream()
stream_two = cupy.cuda.Stream()
sync_point = cupy.cuda.Event()

with stream_one:
    all_primes_to_gpu(grid_size_primes, block_size_primes, (upper_bound, all_primes_gpu))
    sync_point.record(stream=stream_one)
    all_primes_to_gpu(grid_size_primes, block_size_primes, (upper_bound, all_primes_gpu))
with stream_two:
    stream_two.wait_event(sync_point)
    histogram_gpu(grid_size_hist, block_size_hist, (input_gpu, output_gpu))

With streams and events, it is possible to build up arbitrary execution graphs for complex operations on the GPU.

Measure execution time using streams and events

We can also use streams and events to measure the execution time of kernels, without having to use the CuPy benchmark function. In the following example, we go back to the vector_add code and add code that uses events on the default stream to measure the execution time.

PYTHON

import math
import cupy
import numpy as np

# size of the vectors
size = 100_000_000

# allocating and populating the vectors
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)
a_cpu = cupy.asnumpy(a_gpu)
b_cpu = cupy.asnumpy(b_gpu)
c_cpu = np.zeros(size, dtype=np.float32)

# CPU code
def vector_add(A, B, C, size):
    for item in range(0, size):
        C[item] = A[item] + B[item]

# CUDA vector_add
vector_add_gpu = cupy.RawKernel(r'''
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
   int item = (blockIdx.x * blockDim.x) + threadIdx.x;
   if ( item < size )
   {
      C[item] = A[item] + B[item];
   }
}
''', "vector_add")

# execute the code and measure time
threads_per_block = 1024
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)
%timeit -n 1 -r 1 vector_add(a_cpu, b_cpu, c_cpu, size)
gpu_times = []
for _ in range(0, 10):
    start_gpu = cupy.cuda.Event()
    end_gpu = cupy.cuda.Event()
    start_gpu.record()
    vector_add_gpu(grid_size, block_size, (a_gpu, b_gpu, c_gpu, size))
    end_gpu.record()
    end_gpu.synchronize()
    gpu_times.append(cupy.cuda.get_elapsed_time(start_gpu, end_gpu))
gpu_avg_time = np.average(gpu_times)
print(f"{gpu_avg_time:.6f} s")

# test
if np.allclose(c_cpu, c_gpu):
    print("Correct results!")
else:
    print("Wrong results!")

Key Points

  • “”