GPU Programming
Last updated on 2025-02-12
Overview
Questions
- Can I have some CUDA, please?
Objectives
- See what the current state of GPU computing in Julia is.
State of things
There are separate packages for GPU computing, depending on the hardware you have.
| Brand  | Lib       |
|--------|-----------|
| NVidia | CUDA.jl   |
| AMD    | AMDGPU.jl |
| Intel  | oneAPI.jl |
| Apple  | Metal.jl  |
Each of these offers similar abstractions for dealing with array-based computations, though each with slightly different names. CUDA has by far the best and most mature support. We can write reasonably portable code by using the above packages in conjunction with KernelAbstractions.
KernelAbstractions
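A minimal KernelAbstractions kernel looks like the sketch below. The lesson defines a similar `vector_add` earlier, so treat the exact signature here as illustrative. One nice property of KernelAbstractions is that the same kernel runs on any backend, including the plain `CPU()` backend, which is handy for testing without a GPU.

```julia
using KernelAbstractions

# Each work item computes one element of the output array.
@kernel function vector_add(a, b, c)
    i = @index(Global, Linear)
    c[i] = a[i] + b[i]
end

# Run on the CPU backend; on a GPU you would pass e.g. CUDABackend()
# and device arrays instead.
backend = CPU()
a = rand(Float32, 1024)
b = rand(Float32, 1024)
c = zeros(Float32, 1024)
vector_add(backend, 64)(a, b, c, ndrange=length(c))
KernelAbstractions.synchronize(backend)
all(c .== a .+ b)  # true
```

The second argument to `vector_add(backend, 64)` is the workgroup size; `ndrange` gives the total number of work items.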
Computations on the device
Run the vector-add on the device
Depending on your machine, try to run the above vector addition on your GPU. Most PC (Windows or Linux) laptops have an on-board Intel GPU that can be exploited using oneAPI. If you have a Mac, give it a shot with Metal. Some of you may have brought a gaming laptop with a real GPU; then CUDA or AMDGPU is your choice.
Compare the run time of vector_add to the threaded CPU version and the array implementation. Make sure to run KernelAbstractions.synchronize(backend) when you time the results.
JULIA
# Setup, assuming the oneAPI backend; adapt for CUDA, AMDGPU or Metal.
# `vector_add` is the kernel defined earlier in the lesson.
using BenchmarkTools
using KernelAbstractions
using oneAPI

dev = oneAPIBackend()
a = rand(Float32, 1024); b = rand(Float32, 1024); c = zeros(Float32, 1024)
a_dev = oneArray(a); b_dev = oneArray(b)
c_dev = oneArray(zeros(Float32, 1024))

vector_add(dev, 512)(a_dev, b_dev, c_dev, ndrange=1024)
all(Array(c_dev) .== a .+ b)

function test_vector_add()
    vector_add(dev, 512)(a_dev, b_dev, c_dev, ndrange=1024)
    KernelAbstractions.synchronize(dev)
end

@benchmark test_vector_add()
@benchmark begin c_dev .= a_dev .+ b_dev; KernelAbstractions.synchronize(dev) end
@benchmark c .= a .+ b
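For reference, the threaded CPU version mentioned above might look like the sketch below. The name `vector_add_threaded!` is ours; the lesson's own threaded implementation may differ in detail.

```julia
# Multi-threaded CPU vector addition for comparison.
# Start Julia with `julia -t auto` to use all cores.
function vector_add_threaded!(c, a, b)
    Threads.@threads for i in eachindex(c, a, b)
        @inbounds c[i] = a[i] + b[i]
    end
    return c
end

a = rand(Float32, 1024)
b = rand(Float32, 1024)
c = zeros(Float32, 1024)
vector_add_threaded!(c, a, b)
```

Benchmark it with `@benchmark vector_add_threaded!(c, a, b)` to compare against the device and broadcast versions.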
Implement the Julia fractal
Use the GPU library that is appropriate for your laptop. Do you manage to get any speedup?
JULIA
function julia_host(c::ComplexF32, s::Float32, maxit::Int, out::AbstractMatrix{Int})
    w, h = size(out)
    Threads.@threads for idx in CartesianIndices(out)
        # map the pixel index to a point in the complex plane
        x = (idx[1] - w ÷ 2) * s
        y = (idx[2] - h ÷ 2) * s
        z = x + 1f0im * y
        out[idx] = maxit
        for k = 1:maxit
            z = z*z + c
            if abs(z) > 2f0
                out[idx] = k
                break
            end
        end
    end
end
c = -0.7269f0+0.1889f0im
out_host = Matrix{Int}(undef, 1920, 1080)
julia_host(c, 1f0/600, 1024, out_host)
Hint 1: There exists an @index(Global, Cartesian) macro.
Hint 2: On Intel we needed a gang size that divides the width of the image; in this case 480 seemed to work.
JULIA
@kernel function julia_dev(c::ComplexF32, s::Float32, maxit::Int, out)
    w, h = size(out)
    idx = @index(Global, Cartesian)
    # map the pixel index to a point in the complex plane
    x = (idx[1] - w ÷ 2) * s
    y = (idx[2] - h ÷ 2) * s
    z = x + 1f0im * y
    out[idx] = maxit
    for k = 1:maxit
        z = z*z + c
        if abs(z) > 2f0
            out[idx] = k
            break
        end
    end
end
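Launching the kernel might look like the sketch below, which assumes the `julia_dev` kernel above is already defined and that `dev` is your KernelAbstractions backend (e.g. `oneAPIBackend()` with `oneArray`; substitute the matching types for CUDA, AMDGPU or Metal).

```julia
# Sketch: allocate the output on the device, launch, then copy back.
# Workgroup size 480 divides the image width 1920 (see Hint 2).
c = -0.7269f0 + 0.1889f0im
out_dev = oneArray(zeros(Int, 1920, 1080))
julia_dev(dev, 480)(c, 1f0/600, 1024, out_dev, ndrange=size(out_dev))
KernelAbstractions.synchronize(dev)
out = Array(out_dev)
```

Benchmark the launch (including the `synchronize` call) against `julia_host` to see whether your GPU gives a speedup.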
Key Points
- We can compile Julia code to GPU backends using KernelAbstractions.
- Even on smaller laptops, significant speedups can be achieved, given the right problem.