GPU Programming
Overview
Questions
- Can I have some CUDA, please?
Objectives
- See what the current state of GPU computing in Julia is.
State of things
There are separate packages for GPU computing, depending on the hardware you have.
| Brand | Package |
|---|---|
| NVIDIA | CUDA.jl |
| AMD | AMDGPU.jl |
| Intel | oneAPI.jl |
| Apple | Metal.jl |
Each of these offers similar abstractions for dealing with array-based
computations, though each with slightly different names. CUDA has by far
the most mature support. We can get reasonably portable code by
using the above packages in conjunction with
KernelAbstractions.jl.
Load the correct packages for your GPU as follows:
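For example, on a machine with an Intel GPU this could look as follows (substitute the package from the table above for other brands):
JULIA
using oneAPI
using KernelAbstractions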
Simple vector addition
Let us first consider some simple code to add two vectors (of length 1024) together:
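A minimal implementation could look like this:
JULIA
function simple_vector_add(a, b, c)
    # element-wise addition, one index at a time
    for i in eachindex(a, b, c)
        c[i] = a[i] + b[i]
    end
end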
And you can run this with some randomly initialised vectors:
JULIA
a = randn(Float32, 1024)
b = randn(Float32, 1024)
c = Vector{Float32}(undef, 1024)
simple_vector_add(a, b, c)
c # Print out c in the console
What part of the above simple_vector_add() function do
you think is the “kernel”? Do you already have an idea of how we might
modify it for use on a GPU?
A kernel for vector addition
We can adapt simple_vector_add() to make a new function,
vector_add(), where we remove the for-loop and keep only the core operation:
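A minimal sketch of what that kernel could look like (the @kernel and @index macros come from KernelAbstractions; more on these below):
JULIA
@kernel function vector_add(a, b, c)
    I = @index(Global)   # ask the backend which element this invocation handles
    c[I] = a[I] + b[I]
end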
We make use of KernelAbstractions macros to define this
as a kernel (more on this later) and to allow the function to ask which
index I it should act on. We no longer determine how, or in what
order, this kernel will be applied; that will be determined by the
relevant backend for your device.
Running on the host (CPU)
Before trying it on the GPU, we can first test our new kernel
function on the normal, plain CPU. We do this by choosing the
CPU() device:
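For instance, reusing the vectors a, b and c from before:
JULIA
using KernelAbstractions

dev = CPU()
vector_add(dev, 512)(a, b, c, ndrange=1024)   # workgroup size 512, 1024 work items
KernelAbstractions.synchronize(dev)
all(c .== a .+ b)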
Moving our local arrays onto the device (GPU)
We would like to try running our vector addition kernel on the GPU.
However, we run into a problem: the GPU has its own memory, separate
from the RAM on the host. Our vectors a, b
and c currently reside in the host’s memory, so the GPU
has no access to them and cannot perform any computations.
Therefore, we must first create (on-GPU) device arrays. As before, this depends on the GPU you have:
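For an Intel GPU, for example, oneAPI provides oneArray (CUDA.jl similarly provides CuArray, AMDGPU.jl ROCArray, and Metal.jl MtlArray):
JULIA
a_dev = oneArray(a)    # copy a from host memory to the device
b_dev = oneArray(b)
a_dev .+ b_dev         # this broadcast is performed on the GPU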
Note that a_dev and b_dev refer to arrays
on the GPU device itself. This means that the addition
(a_dev .+ b_dev) was actually performed on the GPU, and the
result copied back to the host for us to see.
Computations on the device
Earlier we used dev = CPU() to set the location on which
to run our kernel. We may now do the following to set that to be the
GPU:
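One portable option is to ask KernelAbstractions for the backend that manages our device arrays (a sketch; you can also construct the backend type from your GPU package directly):
JULIA
dev = KernelAbstractions.get_backend(a_dev)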
Now we should be able to run vector_add() on the
GPU.
Run the vector-add on the device
Depending on your machine, try to run the above vector addition on
your GPU. Most PC (Windows or Linux) laptops have an on-board Intel GPU
that can be exploited using oneAPI. If you have a Mac, give
it a shot with Metal. If you brought a gaming
laptop with a discrete GPU, CUDA or AMDGPU is
your choice.
Compare the run time of vector_add to the threaded CPU
version and the array implementation. Make sure to run
KernelAbstractions.synchronize(dev) when you time
results.
JULIA
using BenchmarkTools

c_dev = oneArray(zeros(Float32, 1024))
vector_add(dev, 512)(a_dev, b_dev, c_dev, ndrange=1024)
all(Array(c_dev) .== a .+ b)

function test_vector_add()
    vector_add(dev, 512)(a_dev, b_dev, c_dev, ndrange=1024)
    KernelAbstractions.synchronize(dev)
end

@benchmark test_vector_add()
@benchmark begin c_dev .= a_dev .+ b_dev; KernelAbstractions.synchronize(dev) end
@benchmark c .= a .+ b
Using Adapt.jl
In practice, we will often group variables together in structs (for convenience) and may wish to pass these to our kernel function. However, the struct, and its contents, must be fully converted to a GPU form (as we did with the arrays earlier).
Fortunately, we can do this easily using Adapt.jl:
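JULIA
using Adapt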
So, let us say we change our earlier vector_add kernel to
one that takes the vectors in a struct:
JULIA
@kernel function vector_add_struct(s)
    I = @index(Global)
    s.c[I] = s.a[I] + s.b[I]
end

struct VectorAddStruct{T}
    a::T
    b::T
    c::T
end
We can then make a new VectorAddStruct:
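For example, wrapping the host vectors from before:
JULIA
s = VectorAddStruct(a, b, c)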
But these types are all on the host and therefore the struct cannot be operated on by the GPU.
Using Adapt’s @adapt_structure macro, we can
automatically create a conversion function for recursively converting
this VectorAddStruct to something using GPU types:
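JULIA
Adapt.@adapt_structure VectorAddStruct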
Let’s quickly look at the function that is being generated for us:
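It should be roughly equivalent to this hand-written method (a sketch; the exact expansion may differ between Adapt versions):
JULIA
Adapt.adapt_structure(to, s::VectorAddStruct) =
    VectorAddStruct(adapt(to, s.a), adapt(to, s.b), adapt(to, s.c))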
So now, using this method, we can adapt our struct for use on (in this case) an Intel GPU:
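Here adapt accepts the target array type directly:
JULIA
s_dev = adapt(oneArray, VectorAddStruct(a, b, c))   # all fields become oneArrays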
And run our struct kernel:
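JULIA
vector_add_struct(dev, 512)(s_dev, ndrange=1024)
KernelAbstractions.synchronize(dev)
all(Array(s_dev.c) .== a .+ b)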
Generating the Julia Fractal on GPU
Implement the Julia fractal
Use the GPU library that is appropriate for your laptop. Do you manage to get any speedup?
JULIA
function julia_host(c::ComplexF32, s::Float32, maxit::Int, out::AbstractMatrix{Int})
    w, h = size(out)
    Threads.@threads for idx in CartesianIndices(out)
        x = (idx[1] - w ÷ 2) * s
        y = (idx[2] - h ÷ 2) * s
        z = x + 1f0im * y
        out[idx] = maxit
        for k = 1:maxit
            z = z*z + c
            if abs(z) > 2f0
                out[idx] = k
                break
            end
        end
    end
end

c = -0.7269f0+0.1889f0im
out_host = Matrix{Int}(undef, 1920, 1080)
julia_host(c, 1f0/600, 1024, out_host)
Hint 1: There exists an @index(Global, Cartesian) form of the index
macro.
Hint 2: On Intel we needed a workgroup size that divides the width of the
image; in this case 480 seemed to work.
JULIA
@kernel function julia_dev(c::ComplexF32, s::Float32, maxit::Int, out)
    w, h = size(out)
    idx = @index(Global, Cartesian)
    x = (idx[1] - w ÷ 2) * s
    y = (idx[2] - h ÷ 2) * s
    z = x + 1f0im * y
    out[idx] = maxit
    for k = 1:maxit
        z = z*z + c
        if abs(z) > 2f0
            out[idx] = k
            break
        end
    end
end
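To launch it, something like the following should work (assuming an Intel GPU and the backend dev from before; note the workgroup size of 480 from the second hint):
JULIA
out_dev = oneArray(zeros(Int, 1920, 1080))
julia_dev(dev, 480)(c, 1f0/600, 1024, out_dev, ndrange=size(out_dev))
KernelAbstractions.synchronize(dev)
all(Array(out_dev) .== out_host)   # compare against the host result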
Keypoints
- We can compile Julia code to GPU backends using KernelAbstractions.
- Even on smaller laptops, significant speedups can be achieved, given the right problem.