Shared Memory and Synchronization

Last updated on 2024-03-12 | Edit this page

Estimated time: 55 minutes

Overview

Questions

  • “Is there a way to share data between threads of a same block?”
  • “Can threads inside a block wait for other threads?”

Objectives

  • “Learn how to share data between threads”
  • “Learn how to synchronize threads”

So far we looked at how to use CUDA to accelerate the computation, but a common pattern in all the examples we encountered so far is that threads worked in isolation. While having different threads perform the same operation on different data is a good pattern for working with GPUs, there are cases in which threads need to communicate. This communication may be necessary because of the way the algorithm we are trying to implement works, or it may derive from a performance goal we are trying to achieve.

Shared Memory

Shared memory is a CUDA memory space that is shared by all threads in a thread block. In this case shared means that all threads in a thread block can write and read to block-allocated shared memory, and all changes to this memory will be eventually available to all threads in the block.

To allocate an array in shared memory we need to preface the definition with the identifier __shared__.

Challenge: use of shared memory

Modify the following code to allocate the temp array in shared memory.

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
  int item = (blockIdx.x * blockDim.x) + threadIdx.x;
  float temp[3];

  if ( item < size )
  {
      temp[0] = A[item];
      temp[1] = B[item];
      temp[2] = temp[0] + temp[1];
      C[item] = temp[2];
  }
}

To use shared memory for the temp array add the identifier __shared__ to its definition, like in the following code.

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
  int item = (blockIdx.x * blockDim.x) + threadIdx.x;
  __shared__ float temp[3];
>
  if ( item < size )
  {
      temp[0] = A[item];
      temp[1] = B[item];
      temp[2] = temp[0] + temp[1];
      C[item] = temp[2];
  }
}

While syntactically correct, the previous example is functionally wrong. The reason is that the temp array is not anymore private to the thread allocating it, but it is now shared by the whole thread block.

Challenge: what is the result of the previous code block?

The previous code example is functionally wrong. Can you guess what the result of its execution will be?

The result is non deterministic, and definitely not the same as the previous versions of vector_add. Threads will overwrite each other temporary values,and there will be no guarantee on which value is visible by each thread.

To fix the previous kernel we should allocate enough shared memory for each thread to store three values, so that each thread has its own section of the shared memory array to work with.

To allocate enough memory we need to replace the constant 3 in __shared__ float temp[3] with something else. If we know that each thread block has 1024 threads, we can write something like the following:

C

__shared__ float temp[3 * 1024];

But we know by experience that having constants in the code is not a scalable and maintainable solution. The problem is that we need to have a constant value if we want to declare a shared memory array, because the compiler needs to know how much memory to allocate.

A solution to this problem is to not specify the size of the array, and allocate the memory somewhere else.

C

extern __shared__ float temp[];

And then use CuPy to instruct CUDA about how much shared memory, in bytes, each thread block needs. This can be done by adding the named parameter shared_mem to the kernel call.

PYTHON

vector_add_gpu((2, 1, 1), (size // 2, 1, 1), (a_gpu, b_gpu, c_gpu, size), shared_mem=((size // 2) * 3 * cupy.dtype(cupy.float32).itemsize))

As you may have noticed, we had to retrieve the size in bytes of the data type cupy.float32, and this is done with cupy.dtype(cupy.float32).itemsize.

After these changes, the body of the kernel needs to be modified to use the right indices:

C

extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
  int item = (blockIdx.x * blockDim.x) + threadIdx.x;
  int offset = threadIdx.x * 3;
  extern __shared__ float temp[];

  if ( item < size )
  {
      temp[offset + 0] = A[item];
      temp[offset + 1] = B[item];
      temp[offset + 2] = temp[offset + 0] + temp[offset + 1];
      C[item] = temp[offset + 2];
  }
}

And for completeness, we present the full Python code.

PYTHON

import math
import numpy as np
import cupy

# vector size
size = 2048

# GPU memory allocation
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)
gpu_args = (a_gpu, b_gpu, c_gpu, size)

# CPU memory allocation
a_cpu = cupy.asnumpy(a_gpu)
b_cpu = cupy.asnumpy(b_gpu)
c_cpu = np.zeros(size, dtype=np.float32)

# CUDA code
vector_add_cuda_code = r'''
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
  int item = (blockIdx.x * blockDim.x) + threadIdx.x;
  int offset = threadIdx.x * 3;
  extern __shared__ float temp[];

  if ( item < size )
  {
      temp[offset + 0] = A[item];
      temp[offset + 1] = B[item];
      temp[offset + 2] = temp[offset + 0] + temp[offset + 1];
      C[item] = temp[offset + 2];
  }
}
'''

# compile and execute code
vector_add_gpu = cupy.RawKernel(vector_add_cuda_code, "vector_add")
threads_per_block = 32
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)
vector_add_gpu(grid_size, block_size, gpu_args, shared_mem=(threads_per_block * 3 * cupy.dtype(cupy.float32).itemsize))

# execute Python code and compare results
vector_add(a_cpu, b_cpu, c_cpu, size)
np.allclose(c_cpu, c_gpu)

The code is now correct, although it is still not very useful. We are definitely using shared memory, and we are using it the correct way, but there is no performance gain we achieved by doing so. Actually, we are making our code slower, not faster, because shared memory is slower than registers.

Let us, therefore, work on an example where using shared memory is actually useful. We start again with some Python code.

PYTHON

def histogram(input_array, output_array):
    for item in input_array:
        output_array[item] = output_array[item] + 1

The histogram function, as the name suggests, computes the histogram of an array of integers, i.e. counts how many instances of each integer are in input_array, and writes the count in output_array. We can now generate some data and run the code.

PYTHON

input_array = np.random.randint(256, size=2048, dtype=np.int32)
output_array = np.zeros(256, dtype=np.int32)
histogram(input_array, output_array)

Everything as expected. We can now write the equivalent code in CUDA.

C

extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;

    output[input[item]] = output[input[item]] + 1;
}

Challenge: error in the histogram

If you look at the CUDA histogram code, there is a logical error that prevents it to produce the correct results. Can you find it?

C

extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;

    output[input[item]] = output[input[item]] + 1;
}

The GPU is a highly parallel device, executing multiple threads at the same time. In the previous code different threads could be updating the same output item at the same time, producing wrong results.

To solve this problem, we need to use a function from the CUDA library named atomicAdd. This function ensures that the increment of output_array happens in an atomic way, so that there are no conflicts in case multiple threads want to update the same item at the same time.

C

extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;

    atomicAdd(&(output[input[item]]), 1);
}

And the full Python code snippet.

PYTHON

import math
import numpy as np
import cupy
from cupyx.profiler import benchmark

def histogram(input_array, output_array):
    for item in input_array:
        output_array[item] = output_array[item] + 1

# input size
size = 2**25

# allocate memory on CPU and GPU
input_gpu = cupy.random.randint(256, size=size, dtype=cupy.int32)
input_cpu = cupy.asnumpy(input_gpu)
output_gpu = cupy.zeros(256, dtype=cupy.int32)
output_cpu = cupy.asnumpy(output_gpu)

# CUDA code
histogram_cuda_code = r'''
extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;

    atomicAdd(&(output[input[item]]), 1);
}
'''

# compile and setup CUDA code
histogram_gpu = cupy.RawKernel(histogram_cuda_code, "histogram")
threads_per_block = 256
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)

# check correctness
histogram(input_cpu, output_cpu)
histogram_gpu(grid_size, block_size, (input_gpu, output_gpu))
if np.allclose(output_cpu, output_gpu):
    print("Correct results!")
else:
    print("Wrong results!")

# measure performance
%timeit -n 1 -r 1 histogram(input_cpu, output_cpu)
execution_gpu = benchmark(histogram_gpu, (grid_size, block_size, (input_gpu, output_gpu)), n_repeat=10)
gpu_avg_time = np.average(execution_gpu.gpu_times)
print(f"{gpu_avg_time:.6f} s")

The CUDA code is now correct, and computes the same result as the Python code; it is also faster than the Python code, as you can see from measuring the execution time. However, we are accumulating the results directly in global memory, and the more conflicts we have in global memory, the lower the performance of our histogram will be. Moreover, the access pattern to the output array is very irregular, being dependent on the content of the input array. GPUs are designed for very regular computations, and so if we can make the histogram more regular we can hope in a further improvement in performance.

As you may expect, we can improve the memory access pattern by using shared memory.

Challenge: use shared memory to speed up the histogram

Implement a new version of the CUDA histogram function that uses shared memory to reduce conflicts in global memory. Modify the following code and follow the suggestions in the comments.

C

extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    // Declare temporary histogram in shared memory
    int temp_histogram[256];

    // Update the temporary histogram in shared memory
    atomicAdd();
    // Update the global histogram in global memory, using the temporary histogram
    atomicAdd();
}

Hint: for this exercise, you can safely assume that the size of output is the same as the number of threads in a block.

Hint: atomicAdd can be used on both global and shared memory.

The following code shows one of the possible solutions.

C

extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    __shared__ int temp_histogram[256];

    atomicAdd(&(temp_histogram[input[item]]), 1);
    atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}

The idea behind this solution is to reduce the expensive conflicts in global memory by having a temporary histogram in shared memory. After a block has finished processing its fraction of the input array, and the local histogram is populated, threads collaborate to update the global histogram. Not only this solution potentially reduces the conflicts in global memory, it also produces a better access pattern because threads read adjacent items of the input array, and write to adjacent elements of the output array during the second call to atomicAdd.

Thread Synchronization

There is still one potentially big issue in the histogram code we just wrote, and the issue is that shared memory is not coherent without explicit synchronization. The problem lies in the following two lines of code:

C

atomicAdd(&(temp_histogram[input[item]]), 1);
atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);

In the first line each thread updates one arbitrary position in shared memory, depending on the value of the input, while in the second line each thread reads the element in shared memory corresponding to its thread ID. However, the changes to shared memory are not automatically available to all other threads, and therefore the final result may not be correct.

To solve this issue, we need to explicitly synchronize all threads in a block, so that memory operations are also finalized and visible to all. To synchronize threads in a block, we use the __syncthreads() CUDA function. Moreover, shared memory is not initialized, and the programmer needs to take care of that too. So we need to first initialize temp_histogram, wait that all threads are done doing this, perform the computation in shared memory, wait again that all threads are done, and only then update the global array.

C

extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    __shared__ int temp_histogram[256];
 
    // Initialize shared memory and synchronize
    temp_histogram[threadIdx.x] = 0;
    __syncthreads();

    // Compute shared memory histogram and synchronize
    atomicAdd(&(temp_histogram[input[item]]), 1);
    __syncthreads();

    // Update global histogram
    atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}

And the full Python code snippet.

PYTHON

import math
import numpy as np
import cupy
from cupyx.profiler import benchmark

def histogram(input_array, output_array):
    for item in input_array:
        output_array[item] = output_array[item] + 1

# input size
size = 2**25

# allocate memory on CPU and GPU
input_gpu = cupy.random.randint(256, size=size, dtype=cupy.int32)
input_cpu = cupy.asnumpy(input_gpu)
output_gpu = cupy.zeros(256, dtype=cupy.int32)
output_cpu = cupy.asnumpy(output_gpu)

# CUDA code
histogram_cuda_code = r'''
extern "C"
__global__ void histogram(const int * input, int * output)
{
    int item = (blockIdx.x * blockDim.x) + threadIdx.x;
    __shared__ int temp_histogram[256];
 
    // Initialize shared memory and synchronize
    temp_histogram[threadIdx.x] = 0;
    __syncthreads();

    // Compute shared memory histogram and synchronize
    atomicAdd(&(temp_histogram[input[item]]), 1);
    __syncthreads();

    // Update global histogram
    atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}
'''

# compile and setup CUDA code
histogram_gpu = cupy.RawKernel(histogram_cuda_code, "histogram")
threads_per_block = 256
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)

# check correctness
histogram(input_cpu, output_cpu)
histogram_gpu(grid_size, block_size, (input_gpu, output_gpu))
if np.allclose(output_cpu, output_gpu):
    print("Correct results!")
else:
    print("Wrong results!")

# measure performance
%timeit -n 1 -r 1 histogram(input_cpu, output_cpu)
execution_gpu = benchmark(histogram_gpu, (grid_size, block_size, (input_gpu, output_gpu)), n_repeat=10)
gpu_avg_time = np.average(execution_gpu.gpu_times)
print(f"{gpu_avg_time:.6f} s")

While both versions of the GPU histogram are correct, the one using shared memory is faster; but how fast? On a NVIDIA Tesla T4 accessed via Google Colab, the shared memory version is ten times faster than the version doing atomic operations on global memory.

Key Points

  • “Shared memory is faster than global memory and local memory”
  • “Shared memory can be used as a user-controlled cache to speedup code”
  • “Size of shared memory arrays must be known at compile time if allocated inside a thread”
  • “It is possible to declare extern shared memory arrays and pass the size during kernel invocation”
  • “Use __shared__ to allocate memory in the shared memory space”
  • “Use __syncthreads() to wait for shared memory operations to be visible to all threads in a block”