Calling external C and C++ libraries from Python
Overview
Teaching: 60 min
Exercises: 30 minQuestions
What are some of my options in calling C and C++ libraries from Python code?
How does this work together with Numpy arrays?
How do I use this in multiple threads while lifting the GIL?
Objectives
Compile and link simple C programs into shared libraries.
Call these library from Python and time its executions.
Compare the performance with Numba decorated Python code.
Bypass the GIL when calling these libraries from multiple threads simultaneously.
Calling C and C++ libraries
Simple example using either pybind11 or ctypes
External C and C++ libraries can be called from Python code using a number of options, using e.g. Cython, CFFI, pybind11 and ctypes. We will discuss the last two, because they require the least amount of boilerplate, for simple cases - for more complex examples that may not be the case. Consider this simple C program, test.c, which adds up consecutive numbers:
#include <pybind11/pybind11.h>
namespace py = pybind11;
long long sum_range(long long high)
{
long long i;
long long s = 0LL;
for (i = 0LL; i < high; i++)
s += (long long)i;
return s;
}
PYBIND11_MODULE(test_pybind, m) {
m.doc() = "Export the sum_range function as sum_range";
m.def("sum_range", &sum_range, "Adds upp consecutive integer numbers from 0 up to and including high-1");
}
You can easily compile and link it into a shared object (.so) file. First you need pybind11. You can install it in a number of ways, like pip, but I prefer creating virtual environments using pipenv.
pip install pipenv
pipenv install pybind11
pipenv shell
c++ -O3 -Wall -shared -std=c++11 -fPIC `python3 -m pybind11 --includes` test.c -o test_pybind.so
which generates a test_pybind.so
shared object which you can call from a iPython shell, like this:
%import test_pybind
%sum_range=test_pybind.sum_range
%high=1000000000
%brute_force_sum=sum_range(high)
Now you might want to check the output, by comparing with the well-known formula for the sum of consecutive integers.
%sum_from_formula=high*(high-1)//2
%sum_from_formula
%difference=sum_from_formula-brute_force_sum
%difference
Give this script a suitable name, like call_C_libraries.py
.
The same thing can be done using ctypes instead of pybind11, but requires slightly more boilerplate
on the Python side of the code and slightly less on the C side. test.c will be just the algorithm:
long long sum_range(long long high)
{
long long i;
long long s = 0LL;
for (i = 0LL; i < high; i++)
s += (long long)i;
return s;
}
Compile and link using
gcc -O3 -g -fPIC -c -o test.o test.c
ld -shared -o libtest.so test.o
which generates a libtest.so file.
You will need some extra boilerplate:
%import ctypes
%testlib = ctypes.cdll.LoadLibrary("./libtest.so")
%sum_range = testlib.sum_range
%sum_range.argtypes = [ctypes.c_longlong]
%sum_range.restype = ctypes.c_longlong
%high=1000000000
%brute_force_sum=sum_range(high)
Again, you can compare with the formula for the sum of consecutive integers.
%sum_from_formula=high*(high-1)/2
%sum_from_formula
%difference=sum_from_formula-brute_force_sum
%difference
Performance
Now we can time our compiled sum_range
C library, e.g. from the iPython interface:
%timeit sum_range(10**7)
2.69 ms ± 6.01 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
If you compare with the Numba timing from chapter 3, you will see that the C library for sum_range
is faster than
the numpy computation but significantly slower than the numba.jit
decorated function.
Challenge: Check if the Numba version of this conditional
sum range
function outperforms its C counterpart:Insprired by a blog by Christopher Swenson.
long long conditional_sum_range(long long to) { long long i; long long s = 0LL; for (i = 0LL; i < to; i++) if (i % 3 == 0) s += i; return s; }
Solution
Just insert a line
if i%3==0:
in the code forsum_range_numba
and rename it toconditional_sum_range_numba
.@numba.jit def conditional_sum_range_numba(a: int): x = 0 for i in range(a): if i%3==0: x += i return x
Let’s check how fast it runs.
%timeit conditional_sum_range_numba(10**7)
8.11 ms ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Compare this with the run time for the C code for conditional_sum_range. Compile and link in the usual way, assuming the file name is still
test.c
:gcc -O3 -g -fPIC -c -o test.o test.c ld -shared -o libtest.so test.o
Again, we can time our compiled
conditional_sum_range
C library, e.g. from the iPython interface:import ctypes testlib = ctypes.cdll.LoadLibrary("./libtest.so") conditional_sum_range = testlib.conditional_sum_range conditional_sum_range.argtypes = [ctypes.c_longlong] conditional_sum_range.restype = ctypes.c_longlong %timeit conditional_sum_range(10**7)
7.62 ms ± 49.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This shows that for this slightly more complicated example the C code is somewhat faster than the Numba decorated Python code.
Passing Numpy arrays to C libraries.
Now let us consider a more complex example. Instead of computing the sum of numbers up to a certain upper limit, let us
compute that for an array of upper limits. This will return an array of sums. How difficult is it to modify our C and Python code
to get this done? Well, you just need to replace &sum_range
by py::vectorize(sum_range)
:
PYBIND11_MODULE(test_pybind, m) {
m.doc() = "pybind11 example plugin"; // optional module docstring
m.def("sum_range", py::vectorize(sum_range), "Adds upp consecutive integer numbers from 0 up to and including high-1");
}
Now let’s see what happens if we pass test_pybind.so
an array instead of an integer.
%import test_pybind
%sum_range=test_pybind.sum_range
%ys=range(10)
%sum_range(ys)
gives
array([ 0, 0, 1, 3, 6, 10, 15, 21, 28, 36])
It does not crash! Instead, it returns an array which you can check to be correct by subtracting the previous sum from each sum (except the first):
%out=sum_range(ys)
%out[1:]-out[:-1]
which gives
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
the elements of ys
- except the last - as you would expect.
Call the C library from multiple threads simultaneously.
We can quickly show you how the C library compiled using pybind11 can be run multithreaded. try the following from an iPython shell:
%high=int(1e9)
%timeit(sum_range(high))
gives
274 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now try a straightforward parallellisation of 20 calls of sum_range
, over two threads, so 10 calls per thread.
This should take about 10 * 274ms = 2.74s
if parallellisation were running without overhead. Let’s try:
%import threading as T
%import time
%def timer():
% start_time = time.time()
% for x in range(10):
% t1 = T.Thread(target=sum_range, args=(high,))
% t2 = T.Thread(target=sum_range, args=(high,))
% t1.start()
% t2.start()
% t1.join()
% t2.join()
% end_time = time.time()
% print(f"Time elapsed = {end_time-start_time:.2f}s")
%timer()
This gives
Time elapsed = 5.59s
i.e. more than twice the time we would expect. What actually happened is that sum_range
was run sequentially instead of parallelly.
We need to add a single declaration to test.c: py::call_guard<py::gil_scoped_release>()
:
PYBIND11_MODULE(test_pybind, m) {
m.doc() = "pybind11 example plugin"; // optional module docstring
m.def("sum_range", py::vectorize(sum_range), "Adds upp consecutive integer numbers from 0 up to and including high-1");
}
like this:
PYBIND11_MODULE(test_pybind, m) {
m.doc() = "pybind11 example plugin"; // optional module docstring
m.def("sum_range", &sum_range, "A function which adds upp numbers from 0 up to and including high-1", py::call_guard<py::gil_scoped_release>());
}
Now compile again:
c++ -O3 -Wall -shared -std=c++11 -fPIC `python3 -m pybind11 --includes` test.c -o test_pybind.so
Reimport the rebuilt shared object - this can only be done by quitting and relaunching the iPython interpreter - and time again.
%import test_pybind
%import time
%import threading as T
%
%sum_range=test_pybind.sum_range
%high=int(1e9)
%
%def timer():
% start_time = time.time()
% for x in range(10):
% t1 = T.Thread(target=sum_range, args=(high,))
% t2 = T.Thread(target=sum_range, args=(high,))
% t1.start()
% t2.start()
% t1.join()
% t2.join()
% end_time = time.time()
% print(f"Time elapsed = {end_time-start_time:.2f}s")
%timer()
This gives:
Time elapsed = 2.81s
as you would expect for two sum_range
modules running in parallel.
Key Points
Multiple options are available in calling external C and C++ libraries and that the best choice can depend on the complexity of your problem.
Obviously, there is an extra compile and link step, but you will get a much faster execution compared to pure Python.
Also, the GIL will be circumvented in calling these libaries.
Numba might also offer you the speedup you want with even less effort.