Threads And Processes
Last updated on 2025-03-31
Estimated time: 90 minutes
Overview
Questions
- What is the Global Interpreter Lock (GIL)?
- How do I use multiple threads in Python?
Objectives
- Understand the GIL.
- Understand the difference between the `threading` and `multiprocessing` libraries in Python.
Threading
Another way to parallelize code is to use the `threading` module, which is built into Python. We will use it to estimate \(\pi\) once again in this section.
An example of using threading to speed up your code is:
PYTHON
%%time
from threading import Thread

n = 10**7
t1 = Thread(target=calc_pi, args=(n,))
t2 = Thread(target=calc_pi, args=(n,))
t1.start()
t2.start()
t1.join()
t2.join()
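The benchmark above assumes that `calc_pi` is already in scope. For reference, a minimal pure-Python version of the Monte Carlo estimator used throughout this episode looks like this:

```python
import random

def calc_pi(N):
    """Estimate pi by sampling N random points in the unit square."""
    M = 0
    for _ in range(N):
        # Simulate impact coordinates
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        # True if the impact happens inside the circle
        if x**2 + y**2 < 1.0:
            M += 1
    return 4 * M / N
```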
Discussion: where’s the speed-up?
While mileage may vary, parallelizing `calc_pi`, `calc_pi_numpy` and `calc_pi_numba` in this way will not give the theoretical speed-up. `calc_pi_numba` should give some speed-up, but nowhere near the ideal scaling for the number of cores. This is because, at any given time, Python only allows a single thread to access the interpreter, a feature also known as the Global Interpreter Lock.
A few words about the Global Interpreter Lock
The Global Interpreter Lock (GIL) is an infamous feature of the Python interpreter. It both guarantees inner thread sanity, making programming in Python safer, and prevents us from using multiple cores from a single Python instance. This becomes an obvious problem when we want to perform parallel computations. Roughly speaking, there are two classes of solutions to circumvent/lift the GIL:
- Run multiple Python instances using `multiprocessing`.
- Keep important code outside Python using OS operations, C++ extensions, Cython, or Numba.
The downside of running multiple Python instances is that we need to share the program state between different processes. To this end, you need to serialize objects. Serialization entails converting a Python object into a stream of bytes that can then be sent to the other process or, for example, stored to disk. This is typically done using `pickle`, `json`, or similar, and creates a large overhead. The alternative is to bring parts of our code outside Python. NumPy has many routines that release the GIL for most of their runtime. Trying out and profiling your application is the only way to know for sure.
There are several options to free your own routines from the GIL; fortunately, `numba` makes this very easy. We can release the GIL in Numba code by setting `nogil=True` inside the `numba.jit` decorator:
PYTHON
import random
import numba

@numba.jit(nopython=True, nogil=True)
def calc_pi_nogil(N):
    M = 0
    for i in range(N):
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        if x**2 + y**2 < 1:
            M += 1
    return 4 * M / N
The `nopython` argument forces Numba to compile the code without referencing any Python objects, while the `nogil` argument releases the GIL during the execution of the function.
Use `nopython=True` or `@numba.njit`
It is generally good practice to use `nopython=True` with `@numba.jit` to make sure the entire function runs without referencing Python objects, because falling back to object mode will dramatically slow down most Numba code. The decorator `@numba.njit` even has `nopython=True` by default.
Now we can run the benchmark again, using `calc_pi_nogil` instead of `calc_pi`.
Exercise: try threading on a Numpy function
Many NumPy functions release the GIL. Try to sort two randomly generated arrays using `numpy.sort` in parallel.
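One possible solution sketch (array sizes and variable names are illustrative):

```python
import numpy as np
from threading import Thread

# Two independent random arrays; sorting them can proceed in
# parallel because numpy.sort releases the GIL.
rnd1 = np.random.random(10**6)
rnd2 = np.random.random(10**6)

t1 = Thread(target=np.sort, args=(rnd1,))
t2 = Thread(target=np.sort, args=(rnd2,))
t1.start()
t2.start()
t1.join()
t2.join()
```

Note that `np.sort` returns a new sorted array, which is discarded here; the point of the exercise is only to time the parallel execution (e.g. with `%%time`).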
Multiprocessing
Python also enables parallelisation with multiple processes via the `multiprocessing` module. It implements an API that is superficially similar to threading:
PYTHON
import random
from multiprocessing import Process

# function in plain Python
def calc_pi(N):
    M = 0
    for i in range(N):
        # Simulate impact coordinates
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        # True if impact happens inside the circle
        if x**2 + y**2 < 1.0:
            M += 1
    return (4 * M / N, N)  # result, iterations

if __name__ == '__main__':
    n = 10**7
    p1 = Process(target=calc_pi, args=(n,))
    p2 = Process(target=calc_pi, args=(n,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
However, under the hood, processes are very different from threads. A new process is created by generating a fresh “copy” of the Python interpreter that includes all the resources associated with the parent. There are three different ways of doing this (spawn, fork, and forkserver), whose availability depends on the platform. We will use spawn, as it is available on all platforms. You can read more about the others in the Python documentation. Since creating a process is resource-intensive, multiprocessing is beneficial only under limited circumstances: namely, when the resource utilisation (or runtime) of a function is measurably larger than the overhead of creating a new process.
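You can get a feel for this overhead by timing the start-up of a process that does essentially nothing. The sketch below uses the built-in `print` as the target so it stays picklable under the spawn start method; exact numbers vary per machine:

```python
import multiprocessing as mp
from timeit import timeit

if __name__ == "__main__":
    ctx = mp.get_context("spawn")

    def start_one():
        # Start and wait for a child process that only prints a message
        p = ctx.Process(target=print, args=("hello from child",))
        p.start()
        p.join()

    overhead = timeit(start_one, number=5) / 5
    print(f"average start-up cost: {overhead:.3f} s")
```

Compare that figure with the runtime of the function you intend to parallelize: if the function finishes faster than a process starts, multiprocessing will not pay off.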
The non-intrusive and safe way of starting a new process is to acquire a context and work within that context. This ensures that your application does not interfere with any other processes that might be in use:
PYTHON
import multiprocessing as mp

def calc_pi(N):
    ...

if __name__ == '__main__':
    # mp.set_start_method("spawn")  # if not using a context
    ctx = mp.get_context("spawn")
    ...
Passing objects and sharing state
We can pass objects between processes by using `Queue`s and `Pipe`s. Multiprocessing queues behave similarly to regular queues:
- FIFO: first in, first out.
- `queue_instance.put(<obj>)` to add.
- `queue_instance.get()` to retrieve.
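A minimal demonstration of the FIFO behaviour (no child process needed; a queue created from a context can be used directly):

```python
import multiprocessing as mp

ctx = mp.get_context("spawn")
que = ctx.Queue()

# Items come out in the same order they were put in (FIFO)
que.put("first")
que.put("second")
print(que.get())  # prints "first"
print(que.get())  # prints "second"
```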
Exercise: reimplement `calc_pi` to use a queue to return the result
PYTHON
import multiprocessing as mp
import random

def calc_pi(N, que):
    M = 0
    for i in range(N):
        # Simulate impact coordinates
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        # True if impact happens inside the circle
        if x**2 + y**2 < 1.0:
            M += 1
    que.put((4 * M / N, N))  # result, iterations

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    que = ctx.Queue()
    n = 10**7
    p1 = ctx.Process(target=calc_pi, args=(n, que))
    p2 = ctx.Process(target=calc_pi, args=(n, que))
    p1.start()
    p2.start()
    for i in range(2):
        print(que.get())
    p1.join()
    p2.join()
Process pool
The `Pool` API provides a pool of worker processes that can execute tasks. Methods of the `Pool` object offer various convenient ways to implement data parallelism in your program. The most convenient way to create a pool object is with a context manager, either using the top-level function `multiprocessing.Pool`, or by calling the `.Pool()` method on the context. With the `Pool` object, tasks can be submitted by calling methods like `.apply()`, `.map()`, `.starmap()`, or their `.*_async()` versions.
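As a small illustration of `.map()` and `.starmap()` (using built-in functions so the snippet stays picklable under the spawn start method):

```python
import multiprocessing as mp
from operator import mul

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        # .map applies a one-argument function to every item
        print(pool.map(abs, [-3, -2, -1, 0]))       # [3, 2, 1, 0]
        # .starmap unpacks each tuple into the function's arguments
        print(pool.starmap(mul, [(2, 3), (4, 5)]))  # [6, 20]
```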
Exercise: adapt the original exercise to submit tasks to a pool
- Use the original `calc_pi` function (without the queue).
- Submit batches of different sample sizes (different values of `N`).
- As mentioned earlier, creating a new process entails overhead. Try a wide range of sample sizes and check whether the runtime scales in keeping with that claim.
PYTHON
from itertools import repeat
import multiprocessing as mp
import random
from timeit import timeit

def calc_pi(N):
    M = 0
    for i in range(N):
        # Simulate impact coordinates
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        # True if impact happens inside the circle
        if x**2 + y**2 < 1.0:
            M += 1
    return (4 * M / N, N)  # result, iterations

def submit(ctx, N):
    with ctx.Pool() as pool:
        pool.starmap(calc_pi, repeat((N,), 4))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    for i in (1_000, 100_000, 10_000_000):
        res = timeit(lambda: submit(ctx, i), number=5)
        print(i, res)
Key Points
- If we want the most efficient parallelism on a single machine, we need to work around the GIL.
- If your code disables the GIL, threading will be more efficient than multiprocessing.
- If your code keeps the GIL, some of your code is still in Python and you are wasting precious compute time!