Key Points

Introduction (Graphics Processing Unit; Parallel by Design; Speed Benefits)


  • “GPUs achieve high throughput by running thousands of threads in parallel, unlike CPUs, which prioritise low latency”
  • “In the context of GPU programming, we often refer to the GPU as the device and the CPU as the host”
  • “Using GPUs to accelerate computation can provide large performance gains”

Using your GPU with CuPy (Introduction to CuPy; Convolutions in Python; A scientific application: image processing for radio astronomy)


  • “CuPy provides GPU-accelerated versions of many NumPy and SciPy functions.”
  • “Keep both CPU and GPU versions of your code, so that you can compare their performance and validate the GPU results.”
  • “GPU execution is asynchronous; use cupyx.profiler.benchmark() rather than timeit for accurate measurements.”
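A minimal sketch tying these points together, assuming a CUDA-capable GPU and a CuPy build that provides cupyx.scipy.signal (the array and kernel sizes are illustrative):

```python
import numpy as np
import cupy as cp
from cupyx.profiler import benchmark
from scipy.signal import convolve2d as convolve2d_cpu
from cupyx.scipy.signal import convolve2d as convolve2d_gpu

# CPU version: NumPy arrays, SciPy convolution
image = np.zeros((1024, 1024), dtype=np.float32)
image[::16, ::16] = 1.0
kernel = np.ones((5, 5), dtype=np.float32)
result_cpu = convolve2d_cpu(image, kernel, mode="same")

# GPU version: the same call, but on CuPy arrays in GPU memory
image_gpu = cp.asarray(image)
kernel_gpu = cp.asarray(kernel)
result_gpu = convolve2d_gpu(image_gpu, kernel_gpu, mode="same")

# Validate the GPU result against the CPU version
assert np.allclose(result_cpu, cp.asnumpy(result_gpu))

# Kernel launches return before the GPU finishes, so time them with
# benchmark(), which synchronizes the device for each repetition
print(benchmark(convolve2d_gpu, (image_gpu, kernel_gpu), n_repeat=10))
```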

Using PyTorch for GPU Computing (Introduction to PyTorch)


  • “PyTorch tensors can be created and manipulated directly on the GPU”
  • “Use .to(device) to move data between CPU and GPU, and .cpu().numpy() to convert back to NumPy”
  • “The @torch.compile decorator can significantly speed up tensor operations by fusing them into optimized GPU kernels”
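A short sketch of this workflow; scaled_sum is a made-up example function, and torch.compile requires PyTorch 2.0 or later:

```python
import torch

# Pick the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors can be created directly on the device...
x = torch.rand(1024, 1024, device=device)
# ...or created on the CPU and moved over with .to(device)
y = torch.rand(1024, 1024).to(device)

# torch.compile can fuse these elementwise operations into a single
# optimized kernel instead of launching one kernel per operation
@torch.compile
def scaled_sum(a, b):
    return (2.0 * a + b).sin()

z = scaled_sum(x, y)

# Move the result back to the CPU and convert it to NumPy
z_np = z.cpu().numpy()
print(type(z_np), z_np.shape)
```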

Accelerate your Python code with Numba (Using Numba to execute Python code on the GPU)


  • “Numba can be used to run your own Python functions on the GPU.”
  • “Functions may need to be changed to run correctly on a GPU.”
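For example, a Python loop over array elements has to be rewritten so that each GPU thread handles one element; a minimal sketch with a hypothetical add_vectors kernel:

```python
import numpy as np
from numba import cuda

# The decorated function becomes a GPU kernel: the loop disappears,
# and each thread computes a single element instead
@cuda.jit
def add_vectors(a, b, out):
    i = cuda.grid(1)      # this thread's global index
    if i < out.size:      # bounds check: the grid may be oversized
        out[i] = a[i] + b[i]

n = 100_000
a = np.arange(n, dtype=np.float32)
b = 2 * a
out = np.zeros_like(a)

# Launch enough blocks to cover all n elements; Numba copies the
# NumPy arrays to the GPU and back automatically
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_vectors[blocks, threads_per_block](a, b, out)

assert np.allclose(out, a + b)
```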

A Better Look at the GPU (The GPU, a High Level View at the Hardware; How Programs are Executed; Different Memories; Additional Material)


Your First GPU Kernel (Summing Two Vectors in Python; Summing Two Vectors in CUDA; Running Code on the GPU; Understanding the CUDA Code; Computing Hierarchy in CUDA; Vectors of Arbitrary Size)


  • “Precede your kernel definition with the __global__ keyword”
  • “Use built-in variables threadIdx, blockIdx, gridDim and blockDim to identify each thread”
  • “The global index of a thread is (blockIdx.x * blockDim.x) + threadIdx.x”
  • “Add a bounds check (if (item < size)) in your kernel when the total number of threads may exceed the input size”
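Putting these points together, a vector-addition kernel might look like the sketch below; the CUDA source is compiled and launched through CuPy's RawKernel here, but any launch mechanism works the same way:

```python
import numpy as np
import cupy as cp

# __global__ marks the kernel; the built-in variables give each thread
# a unique global index; the bounds check guards against a grid with
# more threads than elements
vector_add_code = r'''
extern "C" __global__ void vector_add(const float * A, const float * B,
                                      float * C, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (item < size)
    {
        C[item] = A[item] + B[item];
    }
}
'''
vector_add = cp.RawKernel(vector_add_code, "vector_add")

size = 2048
a = cp.random.rand(size, dtype=cp.float32)
b = cp.random.rand(size, dtype=cp.float32)
c = cp.zeros(size, dtype=cp.float32)

# Round the grid size up so the whole vector is covered
threads_per_block = 256
grid = ((size + threads_per_block - 1) // threads_per_block, 1, 1)
vector_add(grid, (threads_per_block, 1, 1), (a, b, c, np.int32(size)))

assert cp.allclose(c, a + b)
```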

Registers, Global, and Local Memory (Registers; Global Memory; Local Memory)


  • “Registers are the fastest GPU memory; use them to store intermediate values and avoid repeated reads from global memory”
  • “Global memory is the GPU's main memory space; it is used to share data between the host and the device”
  • “Local memory is private to each thread and has similar latency to global memory; the compiler uses it automatically when a thread runs out of registers”
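To illustrate the register point, the thread-private scalar x in the hypothetical kernel below lives in a register, so each thread touches global memory only twice (one read, one write) instead of re-reading A for every use:

```python
import numpy as np
import cupy as cp

# x is a thread-private scalar, so the compiler keeps it in a register;
# a large per-thread array would instead spill to (slow) local memory
poly_code = r'''
extern "C" __global__ void poly(const float * A, float * B, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (item < size)
    {
        float x = A[item];            // one read from global memory
        B[item] = x * x + 2.0f * x;   // reuse the register, one write
    }
}
'''
poly = cp.RawKernel(poly_code, "poly")

size = 1024
a = cp.random.rand(size, dtype=cp.float32)
b = cp.zeros(size, dtype=cp.float32)
poly((size // 256,), (256,), (a, b, np.int32(size)))
assert cp.allclose(b, a * a + 2.0 * a)
```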

Shared Memory and Synchronization (Shared Memory; Thread Synchronization)


  • “Shared memory is faster than global memory and local memory”
  • “Shared memory can be used as a user-controlled cache to speed up code”
  • “The size of shared memory arrays must be known at compile time, unless declared as extern”
  • “Use __shared__ to allocate shared memory; use extern __shared__ with shmem_size for dynamically sized arrays”
  • “Use __syncthreads() to wait for shared memory operations to be visible to all threads in a block”
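A sketch of a block-wise sum that uses all of these features; in CuPy's RawKernel the dynamic shared-memory size is passed as the shared_mem launch argument (the same quantity the shmem_size above refers to):

```python
import numpy as np
import cupy as cp

# Each block stages its slice of the input in dynamically sized shared
# memory, synchronizes, and then thread 0 sums the staged values
block_sum_code = r'''
extern "C" __global__ void block_sum(const float * in, float * out, const int size)
{
    extern __shared__ float temp[];   // size supplied at launch time
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;

    temp[threadIdx.x] = (item < size) ? in[item] : 0.0f;
    __syncthreads();                  // writes now visible to the block

    if (threadIdx.x == 0)
    {
        float total = 0.0f;
        for (int i = 0; i < blockDim.x; i++)
        {
            total += temp[i];
        }
        out[blockIdx.x] = total;
    }
}
'''
block_sum = cp.RawKernel(block_sum_code, "block_sum")

size, threads = 2048, 256
data = cp.random.rand(size, dtype=cp.float32)
partial = cp.zeros(size // threads, dtype=cp.float32)

# shared_mem is the dynamic shared-memory size in bytes
block_sum((size // threads,), (threads,), (data, partial, np.int32(size)),
          shared_mem=threads * 4)

assert cp.allclose(cp.sum(partial), cp.sum(data), rtol=1e-4)
```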

Constant Memory


  • “Globally scoped arrays, whose size is known at compile time, can be stored in constant memory using the __constant__ identifier”
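For example (a sketch; the coeffs array and apply kernel are illustrative), a __constant__ array can be declared in a CuPy RawModule and filled from the host through get_global:

```python
import numpy as np
import cupy as cp

# A globally scoped, fixed-size array in constant memory; every thread
# reads the same coefficients through the fast constant cache
const_code = r'''
__constant__ float coeffs[4];

extern "C" __global__ void apply(const float * in, float * out, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (item < size)
    {
        out[item] = in[item] * coeffs[item % 4];
    }
}
'''
module = cp.RawModule(code=const_code)
apply_kernel = module.get_function("apply")

# Wrap the symbol's device pointer in an ndarray to fill it from Python
coeffs = cp.ndarray((4,), cp.float32, module.get_global("coeffs"))
coeffs[...] = cp.asarray([1.0, 2.0, 3.0, 4.0], dtype=cp.float32)

size = 1024
data = cp.random.rand(size, dtype=cp.float32)
result = cp.zeros(size, dtype=cp.float32)
apply_kernel((size // 256,), (256,), (data, result, np.int32(size)))
```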

Concurrent access to the GPU (Concurrently execute two kernels on the same GPU; Stream synchronization; Measure execution time using streams and events)


  • “Use CUDA streams to run kernels concurrently on the same GPU”
  • “Use events to synchronize between streams at a fine-grained level”
  • “Use events with timing enabled to measure kernel execution time”
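A compact sketch combining the three points; cp.sort stands in for any pair of independent kernels, and CuPy events have timing enabled by default:

```python
import cupy as cp

# Kernels launched on different streams may execute concurrently
stream1 = cp.cuda.Stream()
stream2 = cp.cuda.Stream()

a = cp.random.rand(2**20, dtype=cp.float32)
b = cp.random.rand(2**20, dtype=cp.float32)

start = cp.cuda.Event()
end = cp.cuda.Event()

start.record(stream1)
with stream1:
    a_sorted = cp.sort(a)
with stream2:
    b_sorted = cp.sort(b)
end.record(stream1)

# Fine-grained synchronization: stream2 waits until stream1 reaches `end`
stream2.wait_event(end)

# After the event has completed, it reports elapsed GPU time in ms
end.synchronize()
print(cp.cuda.get_elapsed_time(start, end), "ms")
```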