Shared Memory and Synchronization
Last updated on 2024-11-19
Estimated time: 55 minutes
Overview
Questions
- “Is there a way to share data between threads of the same block?”
- “Can threads inside a block wait for other threads?”
Objectives
- “Learn how to share data between threads”
- “Learn how to synchronize threads”
So far we have looked at how to use CUDA to accelerate computation, but a common pattern in all the examples we have encountered is that threads work in isolation. While having different threads perform the same operation on different data is a good pattern for working with GPUs, there are cases in which threads need to communicate. This communication may be necessary because of the way the algorithm we are trying to implement works, or it may derive from a performance goal we are trying to achieve.
Shared Memory
Shared memory is a CUDA memory space that is shared by all threads in a thread block. In this case shared means that all threads in a thread block can read from and write to the block's shared memory allocation, and all changes to this memory will eventually be available to all threads in the block.
To allocate an array in shared memory we need to preface its definition with the identifier __shared__. To use shared memory for the temp array, add the identifier __shared__ to its definition, like in the following code.
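A sketch of the resulting kernel, reconstructed from the complete version shown later in this episode; the only change with respect to the earlier vector_add is the __shared__ identifier on temp.
C
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    // temp is now allocated in shared memory instead of being thread-private
    __shared__ float temp[3];

    if ( item < size )
    {
        temp[0] = A[item];
        temp[1] = B[item];
        temp[2] = temp[0] + temp[1];
        C[item] = temp[2];
    }
}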
While syntactically correct, the previous example is functionally wrong. The reason is that the temp array is no longer private to the thread allocating it, but is now shared by the whole thread block.
Challenge: what is the result of the previous code block?
The previous code example is functionally wrong. Can you guess what the result of its execution will be?
The result is non-deterministic, and definitely not the same as in the previous versions of vector_add. Threads will overwrite each other's temporary values, and there will be no guarantee on which value is visible to each thread.
To fix the previous kernel we should allocate enough shared memory for each thread to store three values, so that each thread has its own section of the shared memory array to work with.
To allocate enough memory we need to replace the constant 3 in __shared__ float temp[3] with something else. If we know that each thread block has 1024 threads, we can write something like the following:
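A possible declaration, assuming blocks of 1024 threads, so that each thread can use its own group of three elements:
C
__shared__ float temp[3 * 1024];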
But we know by experience that having hard-coded constants in the code is not a scalable and maintainable solution. The problem is that we need a constant value to declare the shared memory array this way, because the compiler needs to know how much memory to allocate.
A solution to this problem is to not specify the size of the array in the kernel, and to allocate the memory somewhere else.
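Inside the kernel, the array is then declared as extern and without a size, as in the full kernel shown later in this section:
C
extern __shared__ float temp[];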
We can then use CuPy to instruct CUDA about how much shared memory, in bytes, each thread block needs. This is done by adding the named parameter shared_mem to the kernel call.
PYTHON
vector_add_gpu((2, 1, 1), (size // 2, 1, 1), (a_gpu, b_gpu, c_gpu, size), shared_mem=((size // 2) * 3 * cupy.dtype(cupy.float32).itemsize))
As you may have noticed, we had to retrieve the size in bytes of the data type cupy.float32, and this is done with cupy.dtype(cupy.float32).itemsize.
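As an illustration of this computation, the following sketch (with illustrative variable names, not part of the lesson code) shows how the number of bytes for a block of 1024 threads, each storing three float values, is obtained.
PYTHON
import cupy

# illustrative names: threads in one block and temp values stored per thread
threads_per_block = 1024
floats_per_thread = 3
bytes_per_float = cupy.dtype(cupy.float32).itemsize  # 4 bytes for float32

shared_mem_bytes = threads_per_block * floats_per_thread * bytes_per_float
print(shared_mem_bytes)  # 12288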
After these changes, the body of the kernel needs to be modified to use the right indices:
C
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
int item = (blockIdx.x * blockDim.x) + threadIdx.x;
int offset = threadIdx.x * 3;
extern __shared__ float temp[];
if ( item < size )
{
temp[offset + 0] = A[item];
temp[offset + 1] = B[item];
temp[offset + 2] = temp[offset + 0] + temp[offset + 1];
C[item] = temp[offset + 2];
}
}
And for completeness, we present the full Python code.
PYTHON
import math
import numpy as np
import cupy
# vector size
size = 2048
# GPU memory allocation
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)
gpu_args = (a_gpu, b_gpu, c_gpu, size)
# CPU memory allocation
a_cpu = cupy.asnumpy(a_gpu)
b_cpu = cupy.asnumpy(b_gpu)
c_cpu = np.zeros(size, dtype=np.float32)
# CUDA code
vector_add_cuda_code = r'''
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    int offset = threadIdx.x * 3;
    extern __shared__ float temp[];

    if ( item < size )
    {
        temp[offset + 0] = A[item];
        temp[offset + 1] = B[item];
        temp[offset + 2] = temp[offset + 0] + temp[offset + 1];
        C[item] = temp[offset + 2];
    }
}
'''
# compile and execute code
vector_add_gpu = cupy.RawKernel(vector_add_cuda_code, "vector_add")
threads_per_block = 32
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)
vector_add_gpu(grid_size, block_size, gpu_args, shared_mem=(threads_per_block * 3 * cupy.dtype(cupy.float32).itemsize))
# CPU implementation of vector_add, needed here so the script is self-contained
def vector_add(A, B, C, size):
    for item in range(0, size):
        C[item] = A[item] + B[item]

# execute Python code and compare results
vector_add(a_cpu, b_cpu, c_cpu, size)
np.allclose(c_cpu, c_gpu)
The code is now correct, although it is still not very useful. We are definitely using shared memory, and we are using it in the correct way, but we have gained no performance by doing so. If anything, we have made our code slower, not faster, because shared memory is slower than registers.
Let us, therefore, work on an example where using shared memory is actually useful. We start again with some Python code.
PYTHON
def histogram(input_array, output_array):
    for item in input_array:
        output_array[item] = output_array[item] + 1
The histogram function, as the name suggests, computes the histogram of an array of integers, i.e. it counts how many instances of each integer are in input_array, and writes the counts to output_array. We can now generate some data and run the code.
PYTHON
input_array = np.random.randint(256, size=2048, dtype=np.int32)
output_array = np.zeros(256, dtype=np.int32)
histogram(input_array, output_array)
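If you want to convince yourself that the result is correct, one possible check (not part of the lesson code) is to compare it against NumPy's bincount:
PYTHON
# optional sanity check: the histogram should match NumPy's own bincount
assert np.array_equal(output_array, np.bincount(input_array, minlength=256))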
Everything as expected. We can now write the equivalent code in CUDA.
C
extern "C"
__global__ void histogram(const int * input, int * output)
{
int item = (blockIdx.x * blockDim.x) + threadIdx.x;
output[input[item]] = output[input[item]] + 1;
}
Challenge: error in the histogram
If you look at the CUDA histogram code, there is a logical error that prevents it from producing the correct results. Can you find it?
The GPU is a highly parallel device, executing multiple threads at the same time. In the previous code different threads could be updating the same output item at the same time, producing wrong results.
To solve this problem, we need to use a function from the CUDA library named atomicAdd. This function ensures that the increment of output_array happens atomically, so that there are no conflicts when multiple threads want to update the same item at the same time.
C
extern "C"
__global__ void histogram(const int * input, int * output)
{
int item = (blockIdx.x * blockDim.x) + threadIdx.x;
atomicAdd(&(output[input[item]]), 1);
}
And the full Python code snippet.
PYTHON
import math
import numpy as np
import cupy
from cupyx.profiler import benchmark
def histogram(input_array, output_array):
    for item in input_array:
        output_array[item] = output_array[item] + 1
# input size
size = 2**25
# allocate memory on CPU and GPU
input_gpu = cupy.random.randint(256, size=size, dtype=cupy.int32)
input_cpu = cupy.asnumpy(input_gpu)
output_gpu = cupy.zeros(256, dtype=cupy.int32)
output_cpu = cupy.asnumpy(output_gpu)
# CUDA code
histogram_cuda_code = r'''
extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    atomicAdd(&(output[input[item]]), 1);
}
'''
# compile and setup CUDA code
histogram_gpu = cupy.RawKernel(histogram_cuda_code, "histogram")
threads_per_block = 256
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)
# check correctness
histogram(input_cpu, output_cpu)
histogram_gpu(grid_size, block_size, (input_gpu, output_gpu))
if np.allclose(output_cpu, output_gpu):
    print("Correct results!")
else:
    print("Wrong results!")
# measure performance
%timeit -n 1 -r 1 histogram(input_cpu, output_cpu)
execution_gpu = benchmark(histogram_gpu, (grid_size, block_size, (input_gpu, output_gpu)), n_repeat=10)
gpu_avg_time = np.average(execution_gpu.gpu_times)
print(f"{gpu_avg_time:.6f} s")
The CUDA code is now correct, and computes the same result as the Python code; it is also faster than the Python code, as you can see by measuring the execution time. However, we are accumulating the results directly in global memory, and the more conflicts we have in global memory, the lower the performance of our histogram will be.
Moreover, the access pattern to the output array is very irregular, because it depends on the content of the input array. GPUs are designed for very regular computations, so if we can make the histogram more regular we can hope for a further improvement in performance.
As you may expect, we can improve the memory access pattern by using shared memory.
The following code shows one of the possible solutions.
C
extern "C"
__global__ void histogram(const int * input, int * output)
{
int item = (blockIdx.x * blockDim.x) + threadIdx.x;
__shared__ int temp_histogram[256];
atomicAdd(&(temp_histogram[input[item]]), 1);
atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}
The idea behind this solution is to reduce the expensive conflicts in global memory by building a temporary histogram in shared memory. After a block has finished processing its fraction of the input array, and the local histogram is populated, threads collaborate to update the global histogram. Not only does this solution potentially reduce the conflicts in global memory, it also produces a better access pattern, because threads read adjacent items of the input array and write to adjacent elements of the output array during the second call to atomicAdd.
Thread Synchronization
There is still one potentially big issue in the histogram code we just wrote: shared memory is not coherent without explicit synchronization. The problem lies in the following two lines of code:
C
atomicAdd(&(temp_histogram[input[item]]), 1);
atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
In the first line each thread updates one arbitrary position in shared memory, depending on the value of the input, while in the second line each thread reads the element in shared memory corresponding to its thread ID. However, the changes to shared memory are not automatically available to all other threads, and therefore the final result may not be correct.
To solve this issue, we need to explicitly synchronize all threads in a block, so that memory operations are also finalized and visible to all. To synchronize threads in a block, we use the __syncthreads() CUDA function. Moreover, shared memory is not initialized, and the programmer needs to take care of that too. So we need to first initialize temp_histogram, wait until all threads are done doing this, perform the computation in shared memory, wait again until all threads are done, and only then update the global array.
C
extern "C"
__global__ void histogram(const int * input, int * output)
{
int item = (blockIdx.x * blockDim.x) + threadIdx.x;
__shared__ int temp_histogram[256];
// Initialize shared memory and synchronize
temp_histogram[threadIdx.x] = 0;
__syncthreads();
// Compute shared memory histogram and synchronize
atomicAdd(&(temp_histogram[input[item]]), 1);
__syncthreads();
// Update global histogram
atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}
And the full Python code snippet.
PYTHON
import math
import numpy as np
import cupy
from cupyx.profiler import benchmark
def histogram(input_array, output_array):
    for item in input_array:
        output_array[item] = output_array[item] + 1
# input size
size = 2**25
# allocate memory on CPU and GPU
input_gpu = cupy.random.randint(256, size=size, dtype=cupy.int32)
input_cpu = cupy.asnumpy(input_gpu)
output_gpu = cupy.zeros(256, dtype=cupy.int32)
output_cpu = cupy.asnumpy(output_gpu)
# CUDA code
histogram_cuda_code = r'''
extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    __shared__ int temp_histogram[256];

    // Initialize shared memory and synchronize
    temp_histogram[threadIdx.x] = 0;
    __syncthreads();

    // Compute shared memory histogram and synchronize
    atomicAdd(&(temp_histogram[input[item]]), 1);
    __syncthreads();

    // Update global histogram
    atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}
'''
# compile and setup CUDA code
histogram_gpu = cupy.RawKernel(histogram_cuda_code, "histogram")
threads_per_block = 256
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)
# check correctness
histogram(input_cpu, output_cpu)
histogram_gpu(grid_size, block_size, (input_gpu, output_gpu))
if np.allclose(output_cpu, output_gpu):
    print("Correct results!")
else:
    print("Wrong results!")
# measure performance
%timeit -n 1 -r 1 histogram(input_cpu, output_cpu)
execution_gpu = benchmark(histogram_gpu, (grid_size, block_size, (input_gpu, output_gpu)), n_repeat=10)
gpu_avg_time = np.average(execution_gpu.gpu_times)
print(f"{gpu_avg_time:.6f} s")
While both versions of the GPU histogram are correct, the one using shared memory is faster; but how much faster? On an NVIDIA Tesla T4 accessed via Google Colab, the shared memory version is ten times faster than the version doing atomic operations in global memory.
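To measure the difference on your own GPU, one possible approach is sketched below: compile both kernels and compare their average execution times with benchmark. The variable names histogram_cuda_code_global and histogram_cuda_code_shared are assumptions standing in for the source strings of the two versions shown in this episode.
PYTHON
# sketch: compare both kernel versions; the *_cuda_code_* variables are assumed
# to hold the source of the global-memory and shared-memory kernels shown above
histogram_gpu_global = cupy.RawKernel(histogram_cuda_code_global, "histogram")
histogram_gpu_shared = cupy.RawKernel(histogram_cuda_code_shared, "histogram")

# time both kernels on the same input
results_global = benchmark(histogram_gpu_global, (grid_size, block_size, (input_gpu, output_gpu)), n_repeat=10)
results_shared = benchmark(histogram_gpu_shared, (grid_size, block_size, (input_gpu, output_gpu)), n_repeat=10)

time_global = np.average(results_global.gpu_times)
time_shared = np.average(results_shared.gpu_times)
print(f"global: {time_global:.6f} s, shared: {time_shared:.6f} s, speedup: {time_global / time_shared:.1f}x")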
Key Points
- “Shared memory is faster than global memory and local memory”
- “Shared memory can be used as a user-controlled cache to speed up code”
- “The size of shared memory arrays must be known at compile time if allocated inside a thread”
- “It is possible to declare extern shared memory arrays and pass the size during kernel invocation”
- “Use __shared__ to allocate memory in the shared memory space”
- “Use __syncthreads() to wait for shared memory operations to be visible to all threads in a block”