Introduction: Graphics Processing Unit · Parallel by Design · Speed Benefits


  • “CPUs and GPUs are both useful and each has its own place in our toolbox”
  • “In the context of GPU programming, we often refer to the GPU as the device and the CPU as the host”
  • “Using GPUs to accelerate computation can provide large performance gains”
  • “Using the GPU with Python is not particularly difficult” (see the sketch below)
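
A minimal sketch of this host/device vocabulary in Python, assuming CuPy as the GPU library used later in the lesson; the array contents are illustrative:

    import numpy as np
    import cupy as cp

    # Arrays live on the host (CPU) as NumPy arrays and on the
    # device (GPU) as CuPy arrays.
    host_array = np.random.rand(10_000).astype(np.float32)  # on the host
    device_array = cp.asarray(host_array)                   # copy host -> device
    device_result = cp.sort(device_array)                   # computed on the device
    host_result = cp.asnumpy(device_result)                 # copy device -> host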

Using your GPU with CuPy: Introduction to CuPy · Convolution in Python · Convolution on the CPU Using SciPy · Convolution on the GPU Using CuPy · Measuring performance · Validation · A shortcut: performing NumPy routines on the GPU · A real world example: image processing for radio astronomy · Source measurements


  • “CuPy provides GPU-accelerated versions of many NumPy and SciPy functions.”
  • “Always have CPU and GPU versions of your code, so that you can compare performance and validate your results.” (see the sketch below)
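
A minimal sketch of this workflow, assuming SciPy's convolve2d on the host and its CuPy counterpart from cupyx.scipy.signal on the device; the image and kernel below are illustrative stand-ins:

    import numpy as np
    import cupy as cp
    from scipy.signal import convolve2d as convolve2d_cpu
    from cupyx.scipy.signal import convolve2d as convolve2d_gpu

    # Illustrative inputs: a grid of point sources and a Gaussian kernel.
    image = np.zeros((1024, 1024), dtype=np.float32)
    image[::32, ::32] = 1.0
    gauss = np.exp(-np.linspace(-2, 2, 15) ** 2).astype(np.float32)
    kernel = np.outer(gauss, gauss)

    # CPU version with SciPy.
    result_cpu = convolve2d_cpu(image, kernel, mode="same")

    # GPU version with CuPy: inputs must first be copied to the device.
    result_gpu = convolve2d_gpu(cp.asarray(image), cp.asarray(kernel), mode="same")

    # Validation: both versions must agree within floating-point tolerance.
    assert np.allclose(result_cpu, cp.asnumpy(result_gpu), atol=1e-5)

Timing the two calls, for example with Python's timeit module, then quantifies the speedup, while the final allclose check validates the GPU result against the CPU one.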

Accelerate your Python code with Numba: Using Numba to execute Python code on the GPU


  • “Numba can be used to run your own Python functions on the GPU.”
  • “Functions may need to be changed to run correctly on a GPU.” (see the sketch below)
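
A minimal sketch with Numba's cuda.jit decorator; the increment function and launch configuration are illustrative:

    import numpy as np
    from numba import cuda

    # The CPU version would loop over the array; on the GPU each thread
    # handles one element instead, with a bounds check because more
    # threads than elements may be launched.
    @cuda.jit
    def increment(array):
        item = cuda.grid(1)      # this thread's global index
        if item < array.size:    # guard threads past the end of the array
            array[item] += 1.0

    data = cuda.to_device(np.zeros(1000, dtype=np.float32))
    threads_per_block = 128
    blocks = (data.size + threads_per_block - 1) // threads_per_block
    increment[blocks, threads_per_block](data)
    print(data.copy_to_host()[:3])   # [1. 1. 1.]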

A Better Look at the GPU: The GPU, a High Level View at the Hardware · How Programs are Executed · Different Memories · Additional Material


Your First GPU Kernel: Summing Two Vectors in Python · Summing Two Vectors in CUDA · Running Code on the GPU with CuPy · Understanding the CUDA Code · Computing Hierarchy in CUDA · Vectors of Arbitrary Size


  • “Precede your kernel definition with the __global__ keyword”
  • “Use built-in variables threadIdx, blockIdx, gridDim and blockDim to identify each thread” (see the sketch below)
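
A minimal sketch of such a kernel, compiled and launched through CuPy's RawKernel; the vector_add kernel itself is illustrative:

    import cupy as cp

    # CUDA C kernel: __global__ marks it as a GPU entry point, and the
    # built-in variables blockIdx, blockDim and threadIdx give each
    # thread a unique element to work on.
    vector_add_code = r"""
    extern "C" __global__ void vector_add(const float *A, const float *B,
                                          float *C, const int size)
    {
        int item = blockIdx.x * blockDim.x + threadIdx.x;
        if (item < size)   // guard threads beyond the end of the vectors
        {
            C[item] = A[item] + B[item];
        }
    }
    """
    vector_add = cp.RawKernel(vector_add_code, "vector_add")

    size = 2048
    a = cp.random.rand(size, dtype=cp.float32)
    b = cp.random.rand(size, dtype=cp.float32)
    c = cp.zeros(size, dtype=cp.float32)

    threads_per_block = 256
    # Ceiling division handles vectors of arbitrary size.
    grid_size = (size + threads_per_block - 1) // threads_per_block
    vector_add((grid_size, 1, 1), (threads_per_block, 1, 1),
               (a, b, c, cp.int32(size)))
    assert cp.allclose(c, a + b)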

Registers, Global, and Local Memory: Registers · Global Memory · Local Memory


  • “Registers can be used to locally store data and avoid repeated memory operations” (see the sketch below)
  • “Global memory is the main memory space and it is used to share data between host and GPU”
  • “Local memory is a particular type of memory that can be used to store data that does not fit in registers and is private to a thread”
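
A minimal sketch of the distinction between registers and global memory, using an illustrative kernel launched through CuPy's RawKernel:

    import cupy as cp

    # A and B live in global memory, visible to the host and to every
    # thread; x is a register, private to one thread, so A[item] is read
    # from global memory only once.
    code = r"""
    extern "C" __global__ void triple(const float *A, float *B, const int size)
    {
        int item = blockIdx.x * blockDim.x + threadIdx.x;
        if (item < size)
        {
            float x = A[item];     // one global-memory read into a register
            B[item] = x + x + x;   // reuse the register, no further reads
        }
    }
    """
    triple = cp.RawKernel(code, "triple")

    a = cp.random.rand(1024, dtype=cp.float32)
    b = cp.zeros_like(a)
    triple((4, 1, 1), (256, 1, 1), (a, b, cp.int32(a.size)))
    assert cp.allclose(b, 3.0 * a)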

Shared Memory and Synchronization: Shared Memory · Thread Synchronization


  • “Shared memory is faster than global memory and local memory”
  • “Shared memory can be used as a user-controlled cache to speed up code”
  • “Size of shared memory arrays must be known at compile time if allocated inside a thread”
  • “It is possible to declare extern shared memory arrays and pass the size during kernel invocation”
  • “Use __shared__ to allocate memory in the shared memory space”
  • “Use __syncthreads() to wait for shared memory operations to be visible to all threads in a block” (see the sketch below)
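
A minimal sketch combining these points, with an illustrative block-reversal kernel launched through CuPy's RawKernel:

    import cupy as cp

    # temp is an extern __shared__ array, so its size is not fixed at
    # compile time but passed (in bytes) at kernel launch; __syncthreads()
    # makes every thread's write to temp visible to the whole block
    # before any thread reads from it.
    code = r"""
    extern "C" __global__ void reverse_block(const float *A, float *B)
    {
        extern __shared__ float temp[];
        int item = threadIdx.x;
        temp[item] = A[item];
        __syncthreads();
        B[item] = temp[blockDim.x - 1 - item];
    }
    """
    reverse_block = cp.RawKernel(code, "reverse_block")

    size = 256
    a = cp.arange(size, dtype=cp.float32)
    b = cp.zeros_like(a)
    # shared_mem sets the size, in bytes, of the extern shared array.
    reverse_block((1, 1, 1), (size, 1, 1), (a, b), shared_mem=size * 4)
    assert cp.allclose(b, a[::-1])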

Constant Memory


  • “Globally scoped arrays, whose size is known at compile time, can be stored in constant memory using the __constant__ identifier” (see the sketch below)
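
A minimal sketch using CuPy's RawModule, with an illustrative two-element factors array in constant memory:

    import cupy as cp

    # factors is a globally scoped array whose size (2) is known at
    # compile time, placed in constant memory with __constant__.
    code = r"""
    extern "C" {
    __constant__ float factors[2];

    __global__ void scale(const float *A, float *B, const int size)
    {
        int item = blockIdx.x * blockDim.x + threadIdx.x;
        if (item < size)
        {
            B[item] = factors[0] * A[item] + factors[1];
        }
    }
    }
    """
    module = cp.RawModule(code=code)
    scale = module.get_function("scale")

    # Fill the constant array through a CuPy view on its device pointer.
    factors = cp.ndarray((2,), cp.float32, module.get_global("factors"))
    factors[...] = cp.asarray([2.0, 1.0], dtype=cp.float32)

    a = cp.random.rand(1024, dtype=cp.float32)
    b = cp.zeros_like(a)
    scale((4, 1, 1), (256, 1, 1), (a, b, cp.int32(a.size)))
    assert cp.allclose(b, 2.0 * a + 1.0)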

Concurrent access to the GPU: Concurrently execute two kernels on the same GPU · Stream synchronization · Measure execution time using streams and events


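A minimal sketch of these topics with CuPy's stream and event API; the two element-wise operations stand in for arbitrary kernels:

    import cupy as cp

    # Two kernels are issued on separate streams, so the GPU is free to
    # overlap them; a pair of events brackets the work on stream_one to
    # measure its elapsed time.
    stream_one = cp.cuda.Stream()
    stream_two = cp.cuda.Stream()
    start = cp.cuda.Event()
    end = cp.cuda.Event()

    a = cp.random.rand(2**20, dtype=cp.float32)
    b = cp.random.rand(2**20, dtype=cp.float32)

    start.record(stream_one)
    with stream_one:
        c = a * b            # first kernel, queued on stream_one
    end.record(stream_one)
    with stream_two:
        d = a + b            # second kernel, queued on stream_two

    stream_one.synchronize()   # wait for stream_one to finish
    stream_two.synchronize()   # wait for stream_two to finish
    print(cp.cuda.get_elapsed_time(start, end), "ms")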