Key Points

Introduction: Graphics Processing Unit · Parallel by Design · Speed Benefits


  • “GPUs achieve high throughput by running thousands of threads in parallel, unlike CPUs which prioritise low latency”
  • “In the context of GPU programming, we often refer to the GPU as the device and the CPU as the host”
  • “Using GPUs to accelerate computation can provide large performance gains”

Using your GPU with CuPy: Introduction to CuPy · Convolutions in Python · A scientific application: image processing for radio astronomy


  • “CuPy provides GPU-accelerated versions of many NumPy and SciPy functions.”
  • “Always keep CPU and GPU versions of your code so that you can compare performance and validate the results.”
  • “GPU execution is asynchronous; use cupyx.profiler.benchmark() rather than timeit for accurate measurements.”
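The CPU-and-GPU validation pattern from the points above can be sketched as follows. This is an illustrative example rather than code from the lesson: the function names are made up, and the GPU path assumes CuPy's NumPy-compatible `cp.convolve` is available, falling back to the CPU-only path on machines without a GPU.

```python
import numpy as np

# Hypothetical validation pattern: keep a CPU (NumPy) and, when available,
# a GPU (CuPy) version of the same computation and compare their results.
def convolve_cpu(signal, kernel):
    return np.convolve(signal, kernel, mode="same")

try:
    import cupy as cp  # only importable on machines with a CUDA GPU

    def convolve_gpu(signal, kernel):
        # cp.asarray copies host data to the device; cp.asnumpy copies back
        result = cp.convolve(cp.asarray(signal), cp.asarray(kernel), mode="same")
        return cp.asnumpy(result)
except ImportError:
    convolve_gpu = None

signal = np.random.default_rng(0).normal(size=1024)
kernel = np.ones(5) / 5.0  # simple moving-average kernel

reference = convolve_cpu(signal, kernel)
if convolve_gpu is not None:
    # Validate the GPU result against the CPU reference
    assert np.allclose(reference, convolve_gpu(signal, kernel))
```

Note that timing either path with `timeit` would be misleading on the GPU side, since kernel launches return before the work finishes; `cupyx.profiler.benchmark()` synchronizes before reading the clock.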

Accelerate your Python code with Numba: Using Numba to execute Python code on the GPU


  • “Numba can be used to run your own Python functions on the GPU.”
  • “Functions may need to be changed to run correctly on a GPU.”
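The second point, that functions may need restructuring for the GPU, can be sketched in plain Python: a loop-based CPU function is rewritten in kernel style, where the body runs once per thread index (as it would under a `numba.cuda.jit` kernel). The names and the thread-simulation loop are illustrative, not from the lesson.

```python
# A plain-Python sketch of how a function changes shape for the GPU.
# On a real GPU the kernel body runs once per thread; here a Python loop
# over `tid` stands in for the thread grid.

def axpy_cpu(a, x, y):
    # CPU style: one call processes the whole array with an explicit loop
    return [a * xi + yi for xi, yi in zip(x, y)]

def axpy_kernel(tid, a, x, y, out):
    # GPU style: the loop is gone; each "thread" handles one element,
    # selected by its thread index, and writes into a preallocated output
    if tid < len(x):  # bounds check, as in a real kernel
        out[tid] = a * x[tid] + y[tid]

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
out = [0.0] * len(x)
for tid in range(4):  # "launch" more threads than there are elements
    axpy_kernel(tid, 2.0, x, y, out)
```

The kernel returns nothing and takes the output buffer as an argument, which mirrors how real CUDA kernels communicate results.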

A Better Look at the GPU: The GPU, a High Level View at the Hardware · How Programs are Executed · Different Memories · Additional Material


Your First GPU Kernel: Summing Two Vectors in Python · Summing Two Vectors in CUDA · Running Code on the GPU · Understanding the CUDA Code · Computing Hierarchy in CUDA · Vectors of Arbitrary Size


  • “Precede your kernel definition with the __global__ keyword”
  • “Use built-in variables threadIdx, blockIdx, gridDim and blockDim to identify each thread”
  • “The global index of a thread is (blockIdx.x * blockDim.x) + threadIdx.x”
  • “Add a bounds check (if (item < size)) in your kernel when the total number of threads may exceed the input size”
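The indexing and bounds-check rules above can be modelled in plain Python; the nested loops stand in for the CUDA grid of blocks and threads, and all names here are illustrative.

```python
# Plain-Python model of the CUDA indexing rule quoted above:
# global index = (blockIdx.x * blockDim.x) + threadIdx.x

def vector_add(a, b, c, size, grid_dim, block_dim):
    for block_idx in range(grid_dim):          # blockIdx.x
        for thread_idx in range(block_dim):    # threadIdx.x
            item = (block_idx * block_dim) + thread_idx  # global thread index
            if item < size:  # bounds check for the last, partially filled block
                c[item] = a[item] + b[item]

size = 10
a = list(range(size))
b = [10 * v for v in a]
c = [0] * size
# 4 threads per block -> 3 blocks cover 10 elements (12 threads in total,
# so the bounds check stops the last two threads from writing out of range)
vector_add(a, b, c, size, grid_dim=3, block_dim=4)
```

Without the `if item < size` guard, the two surplus threads in the last block would access memory past the end of the vectors, which on a real GPU is undefined behaviour rather than a Python `IndexError`.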

Registers, Global, and Local Memory: Registers · Global Memory · Local Memory


  • “Registers are the fastest GPU memory; use them to store intermediate values and avoid repeated reads from global memory”
  • “Global memory is the main memory space of the GPU and is used to share data between the host and the device”
  • “Local memory is private to each thread and has similar latency to global memory; the compiler uses it automatically when a thread runs out of registers”
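The register idiom in the first point can be sketched in plain Python. In CUDA, reading `data[item]` once into a plain local variable keeps it in a register instead of issuing a global-memory load at every use; the Python local variable below plays the role of that register. The function names are illustrative.

```python
def scale_and_shift_slow(data, out, item):
    # three separate reads of data[item]
    # (three global-memory loads in the CUDA equivalent)
    out[item] = data[item] * data[item] + 2.0 * data[item]

def scale_and_shift_fast(data, out, item):
    value = data[item]  # a single load into a "register"
    out[item] = value * value + 2.0 * value

data = [1.0, 2.0, 3.0]
slow_out = [0.0] * 3
fast_out = [0.0] * 3
for i in range(3):
    scale_and_shift_slow(data, slow_out, i)
    scale_and_shift_fast(data, fast_out, i)
```

Both versions compute the same result; the second simply touches the slow memory space once per element instead of three times.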

Shared Memory and Synchronization: Shared Memory · Thread Synchronization


  • “Shared memory is faster than global memory and local memory”
  • “Shared memory can be used as a user-controlled cache to speed up code”
  • “The size of shared memory arrays must be known at compile time if they are allocated inside a thread”
  • “It is possible to declare extern shared memory arrays and pass the size during kernel invocation”
  • “Use __shared__ to allocate memory in the shared memory space”
  • “Use __syncthreads() to wait for shared memory operations to be visible to all threads in a block”
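The `__shared__` plus `__syncthreads()` pattern can be modelled in plain Python with `threading.Barrier`: the `shared` list plays the role of a `__shared__` array visible to every thread in the block, and `barrier.wait()` plays the role of `__syncthreads()`. This is an illustrative sketch with made-up names, with one Python thread standing in for each CUDA thread.

```python
import threading

def block_sum(data):
    """Each "thread" writes a partial value to shared memory; after the
    barrier, thread 0 can safely read every slot and combine them."""
    block_dim = len(data)
    shared = [0.0] * block_dim          # one slot per thread, like __shared__
    result = [0.0]
    barrier = threading.Barrier(block_dim)

    def thread(tid):
        shared[tid] = data[tid] * 2.0   # each thread fills its own slot
        barrier.wait()                  # __syncthreads(): all writes are done
        if tid == 0:                    # thread 0 combines the partial values
            result[0] = sum(shared)

    threads = [threading.Thread(target=thread, args=(t,)) for t in range(block_dim)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

Without the barrier, thread 0 could read `shared` slots that other threads had not written yet; the same race exists in a CUDA block that omits `__syncthreads()`.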

Constant Memory


  • “Globally scoped arrays whose size is known at compile time can be stored in constant memory using the __constant__ identifier”

Concurrent access to the GPU: Concurrently execute two kernels on the same GPU · Stream synchronization · Measure execution time using streams and events

