Beta
This lesson is in the beta phase, which means that it is ready for teaching by instructors outside of the original author team.
Key Points
“GPUs achieve high throughput by running thousands of threads in
parallel, unlike CPUs, which prioritise low latency”
“In the context of GPU programming, we often refer to the GPU as the
device and the CPU as the host”
“Using GPUs to accelerate computation can provide large performance
gains”
“CuPy provides GPU-accelerated versions of many NumPy and SciPy
functions.”
“Always keep both CPU and GPU versions of your code, so that you can
compare their performance and validate the GPU results.”
“GPU execution is asynchronous; use
cupyx.profiler.benchmark() rather than timeit
for accurate measurements.”
“Numba can be used to run your own Python functions on the
GPU.”
“Functions may need to be changed to run correctly on a GPU.”
Your First GPU Kernel
“Precede your kernel definition with the __global__
keyword”
“Use built-in variables threadIdx,
blockIdx, gridDim and blockDim to
identify each thread”
“The global index of a thread is
(blockIdx.x * blockDim.x) + threadIdx.x”
“Add a bounds check (if (item < size)) in your
kernel when the total number of threads may exceed the input size”
“Registers are the fastest GPU memory; use them to store
intermediate values and avoid repeated reads from global memory”
“Global memory is the main GPU memory space; it is used to share data
between the host and the device”
“Local memory is private to each thread and has similar latency to
global memory; the compiler uses it automatically when a thread runs out
of registers”
“Shared memory is faster than global memory and local memory”
“Shared memory can be used as a user-controlled cache to speed up
code”
“The size of shared memory arrays must be known at compile time if
they are allocated inside a thread”
“It is possible to declare extern shared memory arrays
and pass the size during kernel invocation”
“Use __shared__ to allocate memory in the shared memory
space”
“Use __syncthreads() to wait for shared memory
operations to be visible to all threads in a block”
“Globally scoped arrays whose size is known at compile time can be
stored in constant memory using the __constant__
identifier”