Threads And Processes

Last updated on 2025-03-31

Estimated time: 90 minutes

Overview

Questions

  • What is the Global Interpreter Lock (GIL)?
  • How do I use multiple threads in Python?

Objectives

  • Understand the GIL.
  • Understand the difference between the threading and multiprocessing libraries in Python.

Threading

Another way to parallelize code is to use the threading module, which is built into Python. We will use it to estimate \(\pi\) once again in this section.

An example of using threading to speed up your code is:

PYTHON

from threading import Thread

PYTHON

%%time
n = 10**7

# Each thread runs calc_pi independently on n samples
t1 = Thread(target=calc_pi, args=(n,))
t2 = Thread(target=calc_pi, args=(n,))

t1.start()
t2.start()

# Wait for both threads to finish
t1.join()
t2.join()

Discussion: where’s the speed-up?

While mileage may vary, parallelizing calc_pi, calc_pi_numpy and calc_pi_numba in this way will not give the theoretical speed-up. calc_pi_numba should give some speed-up, but nowhere near the ideal scaling for the number of cores. This is because Python allows only a single thread to access the interpreter at any given time, a restriction known as the Global Interpreter Lock.

A few words about the Global Interpreter Lock

The Global Interpreter Lock (GIL) is an infamous feature of the Python interpreter. It guarantees the internal consistency of the interpreter across threads, making programming in Python safer, but it also prevents us from using multiple cores from a single Python instance. This becomes an obvious problem when we want to perform parallel computations. Roughly speaking, there are two classes of solutions to circumvent/lift the GIL:

  • Run multiple Python instances using multiprocessing.
  • Keep important code outside Python using OS operations, C++ extensions, Cython, Numba.

The downside of running multiple Python instances is that we need to share the program state between different processes. To this end, you need to serialize objects. Serialization entails converting a Python object into a stream of bytes that can then be sent to the other process or, for example, stored on disk. This is typically done using pickle, json, or similar, and creates a large overhead. The alternative is to bring parts of our code outside Python. Numpy has many routines that release the GIL for most of their runtime. Trying out and profiling your application is the only way to know for sure.
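As a brief illustration of serialization, here is a minimal sketch of pickle round-tripping a small object; the dictionary is just an example payload:

PYTHON

import pickle

data = {"result": 3.1415, "iterations": 10**7}  # example payload
payload = pickle.dumps(data)      # serialize: object -> bytes
restored = pickle.loads(payload)  # deserialize: bytes -> object
assert restored == data

Every object crossing a process boundary pays this conversion cost in both directions, which is why large or frequent transfers dominate the runtime of small tasks.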

There are several options for making your own routines not subject to the GIL: fortunately, numba makes this very easy.

We can unlock the GIL in Numba code by setting nogil=True inside the numba.jit decorator:

PYTHON

import random

import numba

@numba.jit(nopython=True, nogil=True)
def calc_pi_nogil(N):
    M = 0
    for i in range(N):
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        if x**2 + y**2 < 1:
            M += 1
    return 4 * M / N

The nopython argument forces Numba to compile the code without referencing any Python objects, while the nogil argument disables the GIL during the execution of the function.

Use nopython=True or @numba.njit

It is generally good practice to use nopython=True with @numba.jit to make sure the entire function runs without referencing Python objects, because referencing Python objects will dramatically slow down most Numba code. The decorator @numba.njit has nopython=True set by default.

Now we can run the benchmark again, using calc_pi_nogil instead of calc_pi.
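For instance, mirroring the earlier threading benchmark (and assuming the Thread import from before):

PYTHON

%%time
n = 10**7
t1 = Thread(target=calc_pi_nogil, args=(n,))
t2 = Thread(target=calc_pi_nogil, args=(n,))

t1.start()
t2.start()

t1.join()
t2.join()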

Exercise: try threading on a Numpy function

Many Numpy functions release the GIL. Try to sort two randomly generated arrays using numpy.sort in parallel.

PYTHON

import numpy as np

n = 10**7
rnd1 = np.random.random(n)
rnd2 = np.random.random(n)

%timeit -n 10 -r 10 np.sort(rnd1)

PYTHON

%%timeit -n 10 -r 10
t1 = Thread(target=np.sort, args=(rnd1, ))
t2 = Thread(target=np.sort, args=(rnd2, ))

t1.start()
t2.start()

t1.join()
t2.join()

Multiprocessing

Python also enables parallelization with multiple processes via the multiprocessing module. It implements an API that is superficially similar to threading:

PYTHON

import random
from multiprocessing import Process

# function in plain Python
def calc_pi(N):
    M = 0
    for i in range(N):
        # Simulate impact coordinates
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)

        # True if impact happens inside the circle
        if x**2 + y**2 < 1.0:
            M += 1
    return (4 * M / N, N)  # result, iterations

if __name__ == '__main__':

    n = 10**7
    p1 = Process(target=calc_pi, args=(n,))
    p2 = Process(target=calc_pi, args=(n,))

    p1.start()
    p2.start()

    p1.join()
    p2.join()

However, under the hood, processes are very different from threads. A new process is created by generating a fresh “copy” of the Python interpreter that includes all the resources associated with the parent. There are three different ways of doing this (spawn, fork, and forkserver), whose availability depends on the platform. We will use spawn as it is available on all platforms. You can read more about the others in the Python documentation. Since creating a process is resource-intensive, multiprocessing is beneficial only under limited circumstances: namely, when the resource utilization (or runtime) of a function is measurably larger than the overhead of creating a new process.

Protect process creation with an if-block

A module should be safely importable. Any code that creates processes, pools, or managers should be protected with:

PYTHON

if __name__ == "__main__":
    ...

The non-intrusive and safe way of starting a new process is to acquire a context and work within it. This ensures that your application does not interfere with any other code that also uses multiprocessing:

PYTHON

import multiprocessing as mp

def calc_pi(N):
    ...

if __name__ == '__main__':
    # mp.set_start_method("spawn")  # if not using a context
    ctx = mp.get_context("spawn")
    ...

Passing objects and sharing state

We can pass objects between processes by using Queues and Pipes. Multiprocessing queues behave similarly to regular queues:

  • FIFO: first in, first out.
  • queue_instance.put(<obj>) to add.
  • queue_instance.get() to retrieve.
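Pipes, in contrast, provide a two-ended channel between exactly two processes. A minimal sketch, where send_message is a hypothetical helper written for this illustration:

PYTHON

import multiprocessing as mp

def send_message(conn):
    conn.send("hello from the child process")
    conn.close()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    parent_conn, child_conn = ctx.Pipe()  # two connected endpoints
    p = ctx.Process(target=send_message, args=(child_conn,))
    p.start()
    print(parent_conn.recv())  # blocks until the child sends
    p.join()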

Exercise: reimplement calc_pi to use a queue to return the result

PYTHON

import multiprocessing as mp
import random


def calc_pi(N, que):
    M = 0
    for i in range(N):
        # Simulate impact coordinates
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)

        # True if impact happens inside the circle
        if x**2 + y**2 < 1.0:
            M += 1
    que.put((4 * M / N, N))  # result, iterations


if __name__ == "__main__":

    ctx = mp.get_context("spawn")
    que = ctx.Queue()

    n = 10**7
    p1 = ctx.Process(target=calc_pi, args=(n, que))
    p2 = ctx.Process(target=calc_pi, args=(n, que))

    p1.start()
    p2.start()

    for i in range(2):
        print(que.get())

    p1.join()
    p2.join()

Sharing state

It is also possible to share state between processes. The simplest way is to use shared memory via Value or Array. You can access the underlying value using the .value property. Note that you should explicitly acquire a lock before performing an operation that is not atomic, i.e., one that cannot be done in a single step, such as incrementing with the += operator:

PYTHON

with var.get_lock():
    var.value += 1
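Putting this together, here is a minimal sketch in which two processes increment a shared counter; the helper count_hits is hypothetical:

PYTHON

import multiprocessing as mp

def count_hits(var, n):
    for i in range(n):
        with var.get_lock():  # += is not atomic, so take the lock
            var.value += 1

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    var = ctx.Value('i', 0)  # shared integer, initial value 0
    p1 = ctx.Process(target=count_hits, args=(var, 10**5))
    p2 = ctx.Process(target=count_hits, args=(var, 10**5))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    print(var.value)  # 200000 with the lock; without it, likely less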

Since Python 3.8, you can also create a Numpy array backed by a shared memory buffer (multiprocessing.shared_memory.SharedMemory), which can then be accessed from separate processes by name (including from separate interactive shells!).
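A minimal single-process sketch of this pattern follows; in practice, the second attachment would happen in another interpreter, using the name printed by shm.name:

PYTHON

import numpy as np
from multiprocessing import shared_memory

# allocate a shared buffer big enough for 10 float64 values
shm = shared_memory.SharedMemory(create=True, size=10 * 8)
arr = np.ndarray((10,), dtype=np.float64, buffer=shm.buf)
arr[:] = 0.0

# a second process could attach to the same buffer by name
other = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray((10,), dtype=np.float64, buffer=other.buf)
view[0] = 3.14
print(arr[0])  # 3.14: both arrays see the same memory

del arr, view  # release the array views before closing the buffers
other.close()
shm.close()
shm.unlink()   # free the shared block once no one needs it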

Process pool

The Pool API provides a pool of worker processes that can execute tasks. Methods of the Pool object offer various convenient ways to implement data parallelism in your program. The most convenient way to create a pool object is with a context manager, either using the top-level function multiprocessing.Pool, or by calling the .Pool() method on the context. With the Pool object, tasks can be submitted by calling methods like .apply(), .map(), .starmap(), or their .*_async() versions.
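For example, a minimal sketch using a context manager and .map, where square is just a toy task:

PYTHON

import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    with ctx.Pool() as pool:  # defaults to one worker per CPU core
        print(pool.map(square, range(10)))  # [0, 1, 4, ..., 81]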

Exercise: adapt the original exercise to submit tasks to a pool

  • Use the original calc_pi function (without the queue).
  • Submit batches of different sample size (different values of N).
  • As mentioned earlier, creating a new process entails overhead. Try a wide range of sample sizes and check whether the runtime scales in keeping with that claim.

PYTHON

from itertools import repeat
import multiprocessing as mp
import random
from timeit import timeit


def calc_pi(N):
    M = 0
    for i in range(N):
        # Simulate impact coordinates
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)

        # True if impact happens inside the circle
        if x**2 + y**2 < 1.0:
            M += 1
    return (4 * M / N, N)  # result, iterations


def submit(ctx, N):
    with ctx.Pool() as pool:
        # Run calc_pi(N) as four tasks distributed over the pool
        pool.starmap(calc_pi, repeat((N,), 4))


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    for i in (1_000, 100_000, 10_000_000):
        res = timeit(lambda: submit(ctx, i), number=5)
        print(i, res)

Key Points

  • If we want the most efficient parallelism on a single machine, we need to work around the GIL.
  • If your code disables the GIL, threading will be more efficient than multiprocessing.
  • If your code keeps the GIL, some of your code is still in Python and you are wasting precious compute time!