Key Points

Introduction (Graphics Processing Unit; Parallel by Design; Speed Benefits)


  • “GPUs achieve high throughput by running thousands of threads in parallel, unlike CPUs, which prioritise low latency”
  • “In the context of GPU programming, we often refer to the GPU as the device and the CPU as the host”
  • “Using GPUs to accelerate computation can provide large performance gains”

Using your GPU with CuPy (Introduction to CuPy; Convolutions in Python; A scientific application: image processing for radio astronomy)


  • “CuPy provides GPU-accelerated versions of many NumPy and SciPy functions.”
  • “Keep both CPU and GPU versions of your code, so that you can compare their performance and validate the GPU results.”
  • “GPU execution is asynchronous; use cupyx.profiler.benchmark() rather than timeit for accurate measurements.”
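A minimal sketch tying these points together, assuming a CUDA-capable GPU and a CuPy build that provides cupyx.scipy.signal (the array and kernel sizes are illustrative):

```python
import numpy as np
import cupy as cp
from cupyx.profiler import benchmark
from scipy.signal import convolve2d as convolve2d_cpu
from cupyx.scipy.signal import convolve2d as convolve2d_gpu

# CPU version: NumPy arrays, SciPy convolution
image = np.zeros((1024, 1024), dtype=np.float32)
image[::16, ::16] = 1.0
kernel = np.ones((5, 5), dtype=np.float32)
result_cpu = convolve2d_cpu(image, kernel, mode="same")

# GPU version: the same call, but on CuPy arrays in GPU memory
image_gpu = cp.asarray(image)
kernel_gpu = cp.asarray(kernel)
result_gpu = convolve2d_gpu(image_gpu, kernel_gpu, mode="same")

# Validate the GPU result against the CPU version
assert np.allclose(result_cpu, cp.asnumpy(result_gpu))

# Kernel launches return before the GPU finishes, so time them with
# benchmark(), which synchronizes the device for each repetition
print(benchmark(convolve2d_gpu, (image_gpu, kernel_gpu), n_repeat=10))
```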

Using PyTorch for GPU Computing (Introduction to PyTorch)


  • “PyTorch tensors can be created and manipulated directly on the GPU”
  • “Use .to(device) to move data between CPU and GPU, and .cpu().numpy() to convert back to NumPy”
  • “The @torch.compile decorator can significantly speed up tensor operations by fusing them into optimized GPU kernels”
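A short sketch of this workflow; scaled_sum is a made-up example function, and torch.compile requires PyTorch 2.0 or later:

```python
import torch

# Pick the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors can be created directly on the device...
x = torch.rand(1024, 1024, device=device)
# ...or created on the CPU and moved over with .to(device)
y = torch.rand(1024, 1024).to(device)

# torch.compile can fuse these elementwise operations into a single
# optimized kernel instead of launching one kernel per operation
@torch.compile
def scaled_sum(a, b):
    return (2.0 * a + b).sin()

z = scaled_sum(x, y)

# Move the result back to the CPU and convert it to NumPy
z_np = z.cpu().numpy()
print(type(z_np), z_np.shape)
```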

Accelerate your Python code with Numba (Using Numba to execute Python code on the GPU)


  • “Numba can be used to run your own Python functions on the GPU.”
  • “Functions may need to be changed to run correctly on a GPU.”
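For example, a Python loop over array elements has to be rewritten so that each GPU thread handles one element; a minimal sketch with a hypothetical add_vectors kernel:

```python
import numpy as np
from numba import cuda

# The decorated function becomes a GPU kernel: the loop disappears,
# and each thread computes a single element instead
@cuda.jit
def add_vectors(a, b, out):
    i = cuda.grid(1)      # this thread's global index
    if i < out.size:      # bounds check: the grid may be oversized
        out[i] = a[i] + b[i]

n = 100_000
a = np.arange(n, dtype=np.float32)
b = 2 * a
out = np.zeros_like(a)

# Launch enough blocks to cover all n elements; Numba copies the
# NumPy arrays to the GPU and back automatically
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_vectors[blocks, threads_per_block](a, b, out)

assert np.allclose(out, a + b)
```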

A Better Look at the GPU (The GPU, a High Level View at the Hardware; How Programs are Executed; Different Memories; Additional Material)


Your First GPU Kernel (Summing Two Vectors in Python; Summing Two Vectors in CUDA; Running Code on the GPU; Understanding the CUDA Code; Computing Hierarchy in CUDA; Vectors of Arbitrary Size)


  • “Precede your kernel definition with the __global__ keyword”
  • “Use built-in variables threadIdx, blockIdx, gridDim and blockDim to identify each thread”
  • “The global index of a thread is (blockIdx.x * blockDim.x) + threadIdx.x”
  • “Add a bounds check (if (item < size)) in your kernel when the total number of threads may exceed the input size”
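Putting these points together, a vector-addition kernel might look like the sketch below; the CUDA source is compiled and launched through CuPy's RawKernel here, but any launch mechanism works the same way:

```python
import numpy as np
import cupy as cp

# __global__ marks the kernel; the built-in variables give each thread
# a unique global index; the bounds check guards against a grid with
# more threads than elements
vector_add_code = r'''
extern "C" __global__ void vector_add(const float * A, const float * B,
                                      float * C, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (item < size)
    {
        C[item] = A[item] + B[item];
    }
}
'''
vector_add = cp.RawKernel(vector_add_code, "vector_add")

size = 2048
a = cp.random.rand(size, dtype=cp.float32)
b = cp.random.rand(size, dtype=cp.float32)
c = cp.zeros(size, dtype=cp.float32)

# Round the grid size up so the whole vector is covered
threads_per_block = 256
grid = ((size + threads_per_block - 1) // threads_per_block, 1, 1)
vector_add(grid, (threads_per_block, 1, 1), (a, b, c, np.int32(size)))

assert cp.allclose(c, a + b)
```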

Registers, Global, and Local Memory (Registers; Global Memory; Local Memory)


  • “Registers are the fastest GPU memory; use them to store intermediate values and avoid repeated reads from global memory”
  • “Global memory is the GPU's main memory space; it is used to share data between the host and the device”
  • “Local memory is private to each thread and has similar latency to global memory; the compiler uses it automatically when a thread runs out of registers”
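To illustrate the register point, the thread-private scalar x in the hypothetical kernel below lives in a register, so each thread touches global memory only twice (one read, one write) instead of re-reading A for every use:

```python
import numpy as np
import cupy as cp

# x is a thread-private scalar, so the compiler keeps it in a register;
# a large per-thread array would instead spill to (slow) local memory
poly_code = r'''
extern "C" __global__ void poly(const float * A, float * B, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (item < size)
    {
        float x = A[item];            // one read from global memory
        B[item] = x * x + 2.0f * x;   // reuse the register, one write
    }
}
'''
poly = cp.RawKernel(poly_code, "poly")

size = 1024
a = cp.random.rand(size, dtype=cp.float32)
b = cp.zeros(size, dtype=cp.float32)
poly((size // 256,), (256,), (a, b, np.int32(size)))
assert cp.allclose(b, a * a + 2.0 * a)
```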

Shared Memory and Synchronization (Shared Memory; Thread Synchronization)


  • “Shared memory is faster than global memory and local memory”
  • “Shared memory can be used as a user-controlled cache to speed up code”
  • “The size of shared memory arrays must be known at compile time, unless declared as extern”
  • “Use __shared__ to allocate shared memory; use extern __shared__ with shmem_size for dynamically sized arrays”
  • “Use __syncthreads() to wait for shared memory operations to be visible to all threads in a block”
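A sketch of a block-wise sum that uses all of these features; in CuPy's RawKernel the dynamic shared-memory size is passed as the shared_mem launch argument (the same quantity the shmem_size above refers to):

```python
import numpy as np
import cupy as cp

# Each block stages its slice of the input in dynamically sized shared
# memory, synchronizes, and then thread 0 sums the staged values
block_sum_code = r'''
extern "C" __global__ void block_sum(const float * in, float * out, const int size)
{
    extern __shared__ float temp[];   // size supplied at launch time
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;

    temp[threadIdx.x] = (item < size) ? in[item] : 0.0f;
    __syncthreads();                  // writes now visible to the block

    if (threadIdx.x == 0)
    {
        float total = 0.0f;
        for (int i = 0; i < blockDim.x; i++)
        {
            total += temp[i];
        }
        out[blockIdx.x] = total;
    }
}
'''
block_sum = cp.RawKernel(block_sum_code, "block_sum")

size, threads = 2048, 256
data = cp.random.rand(size, dtype=cp.float32)
partial = cp.zeros(size // threads, dtype=cp.float32)

# shared_mem is the dynamic shared-memory size in bytes
block_sum((size // threads,), (threads,), (data, partial, np.int32(size)),
          shared_mem=threads * 4)

assert cp.allclose(cp.sum(partial), cp.sum(data), rtol=1e-4)
```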

Constant Memory


  • “Globally scoped arrays, whose size is known at compile time, can be stored in constant memory using the __constant__ identifier”
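For example (a sketch; the coeffs array and apply kernel are illustrative), a __constant__ array can be declared in a CuPy RawModule and filled from the host through get_global:

```python
import numpy as np
import cupy as cp

# A globally scoped, fixed-size array in constant memory; every thread
# reads the same coefficients through the fast constant cache
const_code = r'''
__constant__ float coeffs[4];

extern "C" __global__ void apply(const float * in, float * out, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (item < size)
    {
        out[item] = in[item] * coeffs[item % 4];
    }
}
'''
module = cp.RawModule(code=const_code)
apply_kernel = module.get_function("apply")

# Wrap the symbol's device pointer in an ndarray to fill it from Python
coeffs = cp.ndarray((4,), cp.float32, module.get_global("coeffs"))
coeffs[...] = cp.asarray([1.0, 2.0, 3.0, 4.0], dtype=cp.float32)

size = 1024
data = cp.random.rand(size, dtype=cp.float32)
result = cp.zeros(size, dtype=cp.float32)
apply_kernel((size // 256,), (256,), (data, result, np.int32(size)))
```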

Concurrent access to the GPU (Concurrently execute two kernels on the same GPU; Stream synchronization; Measure execution time using streams and events)


  • “Use CUDA streams to run kernels concurrently on the same GPU”
  • “Use events to synchronize between streams at a fine-grained level”
  • “Use events with timing enabled to measure kernel execution time”
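A compact sketch combining the three points; cp.sort stands in for any pair of independent kernels, and CuPy events have timing enabled by default:

```python
import cupy as cp

# Kernels launched on different streams may execute concurrently
stream1 = cp.cuda.Stream()
stream2 = cp.cuda.Stream()

a = cp.random.rand(2**20, dtype=cp.float32)
b = cp.random.rand(2**20, dtype=cp.float32)

start = cp.cuda.Event()
end = cp.cuda.Event()

start.record(stream1)
with stream1:
    a_sorted = cp.sort(a)
with stream2:
    b_sorted = cp.sort(b)
end.record(stream1)

# Fine-grained synchronization: stream2 waits until stream1 reaches `end`
stream2.wait_event(end)

# After the event has completed, it reports elapsed GPU time in ms
end.synchronize()
print(cp.cuda.get_elapsed_time(start, end), "ms")
```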