Understanding Python (NumPy/Pandas)

Last updated on 2024-03-28 | Edit this page

Estimated time: 30 minutes

Overview

Questions

Why are Python loops slow?
Why is NumPy often faster than raw Python?
How can processing rows of a Pandas data table be made faster?

Objectives

Able to identify when Python code can be rewritten to perform execution in the back-end.
Able to utilise NumPy’s vectorisation when operating on arrays of data.
Able to efficiently process rows when working with data tables.

Python is an interpreted programming language. When you execute your .py file, the (default) CPython back-end compiles your Python source code to an intermediate bytecode. This bytecode is then interpreted in software at runtime generating instructions for the processor as necessary. This interpretation stage, and other features of the language, harm the performance of Python (whilst improving it’s usability).

In comparison, many languages such as C/C++ compile directly to machine code. This allows the compiler to perform low-level optimisations that better exploit hardware nuance to achieve fast performance. This however comes at the cost of compiled software not being cross-platform.

Whilst Python will rarely be as fast as compiled languages like C/C++, it is possible to take advantage of the CPython back-end and packages such as NumPy and Pandas that have been written in compiled languages to expose this performance.

A simple example of this would be to perform a linear search of a list (in the previous episode we did say this is not recommended). The below example creates a list of 2500 integers in the inclusive-exclusive range [0, 5000). It then searches for all of the even numbers in that range. searchlistPython() is implemented manually, iterating ls checking each individual item in Python code. searchListC() in contrast uses the in operator to perform each search, which allows CPython to implement the inner loop in it’s C back-end.

PYTHON

import random

N = 2500  # Number of elements in list
M = 2  # N*M == Range over which the elements span

def generateInputs():
    random.seed(12)  # Ensure every list is the same
    return [random.randint(0, int(N*M)) for i in range(N)]
    
def searchListPython():
    ls = generateInputs()
    ct = 0
    for i in range(0, int(N*M), M):
        for j in range(0, len(ls)):
            if ls[j] == i:
                ct += 1
                break

def searchListC():
    ls = generateInputs()
    ct = 0
    for i in range(0, int(N*M), M):
        if i in ls:
            ct += 1

repeats = 1000
gen_time = timeit(generateInputs, number=repeats)
print(f"searchListPython: {timeit(searchListPython, number=repeats)-gen_time:.2f}ms")
print(f"searchListC: {timeit(searchListC, number=repeats)-gen_time:.2f}ms")

This results in the manual Python implementation being 5x slower, doing the exact same operation!

OUTPUT

searchListPython: 152.15ms
searchListC: 28.43ms

An easy approach to follow is that if two blocks of code do the same operation, the one that contains less Python is probably faster. This won’t apply if you’re using 3rd party packages written purely in Python though.

Python bytecode

You can use dis to view the bytecode generated by Python, the amount of byte-code more strongly correlates with how much code is being executed by the Python interpreter. However, this still does not account for whether functions called are implemented using Python or C.

The pure Python search compiles to 82 lines of byte-code.

PYTHON

import dis

def searchListPython():
    ls = generateInputs()
    ct = 0
    for i in range(0, int(N*M), M):
        for j in range(0, len(ls)):
            if ls[j] == i:
                ct += 1
                break

dis.dis(searchListPython)

OUTPUT

 11           0 LOAD_GLOBAL              0 (generateInputs)
              2 CALL_FUNCTION            0
              4 STORE_FAST               0 (ls)

 12           6 LOAD_CONST               1 (0)
              8 STORE_FAST               1 (ct)

 13          10 LOAD_GLOBAL              1 (range)
             12 LOAD_CONST               1 (0)
             14 LOAD_GLOBAL              2 (int)
             16 LOAD_GLOBAL              3 (N)
             18 LOAD_GLOBAL              4 (M)
             20 BINARY_MULTIPLY
             22 CALL_FUNCTION            1
             24 LOAD_GLOBAL              4 (M)
             26 CALL_FUNCTION            3
             28 GET_ITER
        >>   30 FOR_ITER                24 (to 80)
             32 STORE_FAST               2 (i)

 14          34 LOAD_GLOBAL              1 (range)
             36 LOAD_CONST               1 (0)
             38 LOAD_GLOBAL              5 (len)
             40 LOAD_FAST                0 (ls)
             42 CALL_FUNCTION            1
             44 CALL_FUNCTION            2
             46 GET_ITER
        >>   48 FOR_ITER                14 (to 78)
             50 STORE_FAST               3 (j)

 15          52 LOAD_FAST                0 (ls)
             54 LOAD_FAST                3 (j)
             56 BINARY_SUBSCR
             58 LOAD_FAST                2 (i)
             60 COMPARE_OP               2 (==)
             62 POP_JUMP_IF_FALSE       38 (to 76)

 16          64 LOAD_FAST                1 (ct)
             66 LOAD_CONST               2 (1)
             68 INPLACE_ADD
             70 STORE_FAST               1 (ct)

 17          72 POP_TOP
             74 JUMP_FORWARD             1 (to 78)

 15     >>   76 JUMP_ABSOLUTE           24 (to 48)
        >>   78 JUMP_ABSOLUTE           15 (to 30)

 13     >>   80 LOAD_CONST               0 (None)
             82 RETURN_VALUE

Whereas the in variant only compiles to 54.

PYTHON

import dis

def searchListC():
    ls = generateInputs()
    ct = 0
    for i in range(0, int(N*M), M):
        if i in ls:
            ct += 1

dis.dis(searchListC)

OUTPUT

  4           0 LOAD_GLOBAL              0 (generateInputs)
              2 CALL_FUNCTION            0
              4 STORE_FAST               0 (ls)

  5           6 LOAD_CONST               1 (0)
              8 STORE_FAST               1 (ct)

  6          10 LOAD_GLOBAL              1 (range)
             12 LOAD_CONST               1 (0)
             14 LOAD_GLOBAL              2 (int)
             16 LOAD_GLOBAL              3 (N)
             18 LOAD_GLOBAL              4 (M)
             20 BINARY_MULTIPLY
             22 CALL_FUNCTION            1
             24 LOAD_GLOBAL              4 (M)
             26 CALL_FUNCTION            3
             28 GET_ITER
        >>   30 FOR_ITER                10 (to 52)
             32 STORE_FAST               2 (i)

  7          34 LOAD_FAST                2 (i)
             36 LOAD_FAST                0 (ls)
             38 CONTAINS_OP              0
             40 POP_JUMP_IF_FALSE       25 (to 50)

  8          42 LOAD_FAST                1 (ct)
             44 LOAD_CONST               2 (1)
             46 INPLACE_ADD
             48 STORE_FAST               1 (ct)
        >>   50 JUMP_ABSOLUTE           15 (to 30)

  6     >>   52 LOAD_CONST               0 (None)
             54 RETURN_VALUE

Scope

When Python executes your code, it has to find the variables and functions that you’re using.

This adds an additional cost to accessing variables and calling functions in Python, which isn’t typically seen in compiled languages.

In particular, it will first check whether the variable or functions has been declared within the current function (local scope), if it can’t find it there it will check whether it has been declared in the file (global scope) after which it may even check whether it’s from an imported package.

These are not implicitly cached, therefore repeated accesses to variables and functions, will repeat these checks.

The implication, is that as local scope variables and functions are checked first, they will be faster to use.

If you’re only accessing a variable once or twice that’s nothing to worry about, this is a relatively small cost. But if a variable or functions is being accessed regularly, such as within a loop, the impact may become visible.

The below example provides a small demonstration of this in practice.

PY

from timeit import timeit

N = 1000000
repeats = 1000

def test_list_global():
    t = 0
    for i in range(N):
        if t > N:
            break
        t += i
        
def test_list_local():
    t = 0
    N_local = N
    for i in range(N_local):
        if t > N_local:
            break
        t += i
        
print(f"Global Scope: {timeit(test_list_global, number=repeats):.5f}ms")
print(f"Local Scope: {timeit(test_list_local, number=repeats):.5f}ms")

This is only a trivial example, whereby N has been copied to the local scope N_local, but local scope is about 20% faster than global scope!

OUTPUT

Global Scope: 0.06416ms
Local Scope: 0.05391ms

Consider copying highly accessed variables into local scope, you can always copy them back to global scope before you return from a function.

Copying functions to local scope works much the same as variables, e.g.

PY

import numpy as np

def my_function():
    uniform_local = np.random.uniform
    
    for i in range(10000):
        t = uniform_local()

Built-in Functions Operators

In order to take advantage of offloading computation to the CPython back-end it’s necessary to be aware of what functionality is present. Those available without importing packages are considered built-in functions.

Built-in functions are typically implemented in the CPython back-end, so their performance benefits from bypassing the Python interpreter.

In particular, those which are passed an iterable (e.g. lists) are likely to provide the greatest benefits to performance. The Python documentation provides equivalent Python code for many of these cases

all(): boolean and of all items
any(): boolean or of all items
max(): Return the maximum item
min(): Return the minimum item
sum(): Return the sum of all items

Callout

The built-in functions filter() and map() can be used for processing iterables However list-comprehension is likely to be more performant.

Using NumPy (Effectively)

NumPy is a commonly used package for scientific computing, which provides a wide variety of methods.

It adds restriction via it’s own basic numeric types, and static arrays to enable even greater performance than that of core Python. However if these restrictions are ignored, the performance can become significantly worse.

Arrays

NumPy’s arrays (not to be confused with the core Python array package) are static arrays. Unlike core Python’s lists, they do not dynamically resize. Therefore if you wish to append to a NumPy array, you must call resize() first. If you treat this like append() for a Python list, resizing for each individual append you will be performing significantly more copies and memory allocations than a Python list.

The below example sees lists and arrays constructed from range(100000).

PYTHON

from timeit import timeit
import numpy

N = 100000  # Number of elements in list/array

def list_append():
    ls = []
    for i in range(N):
        ls.append(i)

def array_resize():
    ar = numpy.zeros(1)
    for i in range(1, N):
        ar.resize(i+1)
        ar[i] = i
        
repeats = 1000
print(f"list_append: {timeit(list_append, number=repeats):.2f}ms")
print(f"array_resize: {timeit(array_resize, number=repeats):.2f}ms")

Resizing a NumPy array is 5.2x slower than a list, probably 10x slower than list comprehension.

OUTPUT

list_append: 3.50ms
array_resize: 18.04ms

Another difference, is that NumPy arrays typically require all data to be the same type (and a NumPy type). This enables more efficient access to elements, as they all exist contiguously in memory. In contrast, elements within Python lists can be of any type so the list always stores a pointer to where the element actually exists in memory, rather than the actual element. This has the side effect that if you are converting back and forth between Python lists and NumPy arrays, there is an additional overhead as it’s not as simple as copying a single block of memory.

Callout

If you construct a NumPy array from a list containing a complex object, it will store your data as Python types and you won’t be able to take advantage of NumPy’s optimisations.

SH

>python
>>> import numpy as np
>>> a = np.array([0.5, 5])
>>> type(a[0])
<class 'numpy.float64'>
>>> type(a[1])
<class 'numpy.float64'>
>>> b = np.array([0.5, 5,{"foo":5}])
>>> type(b[0])
<class 'float'>
>>> type(b[1])
<class 'int'>
>>> type(b[2])
<class 'dict'>

The below example demonstrates the overhead of mixing Python lists and NumPy functions.

SH

# Python list, numpy.random.choice()
>python -m timeit -s "import numpy; ls = list(range(10000))" "numpy.random.choice(ls)"
1000 loops, best of 5: 267 usec per loop

# NumPy array, numpy.random.choice()
>python -m timeit -s "import numpy; ar = numpy.arange(10000)" "numpy.random.choice(ar)"
50000 loops, best of 5: 4.06 usec per loop

Passing a Python list to numpy.random.choice() is 65.6x slower than passing a NumPy array. This is the additional overhead of converting the list to an array. If this function were called multiple times, it would make sense to transform the list to an array before calling the function so that overhead is only paid once.

Callout

SH

# Python list, Manually select 1 item
>python -m timeit -s "import numpy; ls = list(range(10000))" "ls[numpy.random.randint(len(ls))]"
200000 loops, best of 5: 1.19 usec per loop

# NumPy array, Manually select 1 item
>python -m timeit -s "import numpy; ar = numpy.arange(10000)" "ar[numpy.random.randint(len(ar))]"
200000 loops, best of 5: 1.22 usec per loop

Regardless, for this simple application of numpy.random.choice(), merely using numpy.random.randint(len()) is around 4x faster regardless whether a Python list or NumPy array is used.

With numpy.random.choice() being such a general function (it has many possible parameters), there is significant internal branching. If you don’t require this advanced functionality and are calling a function regularly, it can be worthwhile considering using a more limited function.

There is however a trade-off, using numpy.random.choice() can be clearer to someone reading your code, and is more difficult to use incorrectly.

Vectorisation

The manner by which NumPy stores data in arrays enables it’s functions to utilise vectorisation, whereby the processor executes one instruction across multiple variables simultaneously, for every mathematical operation between arrays.

Earlier in this episode it was demonstrated that using core Python methods over a list, will outperform a loop performing the same calculation faster. The below example takes this a step further by demonstrating the calculation of dot product.

PYTHON

from timeit import timeit

N = 1000000  # Number of elements in list

gen_list = f"ls = list(range({N}))"
gen_array = f"import numpy;ar = numpy.arange({N}, dtype=numpy.int64)"

py_sum_ls = "sum([i*i for i in ls])"
py_sum_ar = "sum(ar*ar)"
np_sum_ar = "numpy.sum(ar*ar)"
np_dot_ar = "numpy.dot(ar, ar)"

repeats = 1000
print(f"python_sum_list: {timeit(py_sum_ls, setup=gen_list, number=repeats):.2f}ms")
print(f"python_sum_array: {timeit(py_sum_ar, setup=gen_array, number=repeats):.2f}ms")
print(f"numpy_sum_array: {timeit(np_sum_ar, setup=gen_array, number=repeats):.2f}ms")
print(f"numpy_dot_array: {timeit(np_dot_ar, setup=gen_array, number=repeats):.2f}ms")

python_sum_list uses list comprehension to perform the multiplication, followed by the Python core sum(). This comes out at 46.93ms
python_sum_array instead directly multiplies the two arrays, taking advantage of NumPy’s vectorisation. But uses the core Python sum(), this comes in slightly faster at 33.26ms.
numpy_sum_array again takes advantage of NumPy’s vectorisation for the multiplication, and additionally uses NumPy’s sum() implementation. These two rounds of vectorisation provide a much faster 1.44ms completion.
numpy_dot_array instead uses NumPy’s dot() to calculate the dot product in a single operation. This comes out the fastest at 0.29ms, 162x faster than python_sum_list.

OUTPUT

python_sum_list: 46.93ms
python_sum_array: 33.26ms
numpy_sum_array: 1.44ms
numpy_dot_array: 0.29ms

Parallel NumPy

NumPy can sometimes take advantage of auto parallelisation, particularly on HPC systems.

A small number of functions are backed by BLAS and LAPACK, enabling even greater speedup.

The supported functions mostly correspond to linear algebra operations.

The auto-parallelisation of these functions is hardware dependant, so you won’t always automatically get the additional benefit of parallelisation. However, HPC systems should be primed to take advantage, so try increasing the number of cores you request when submitting your jobs and see if it improves the performance.

This might be why numpy_dot_array is that much faster than numpy_sum_array in the previous example!

`vectorize()`

Python’s map() was introduced earlier, for applying a function to all elements within a list. NumPy provides vectorize() an equivalent for operating over it’s arrays.

This doesn’t actually make use of processor-level vectorisation, from the documentation:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.

The below example demonstrates how the performance of vectorize() is only marginally faster than map().

PYTHON

N = 100000  # Number of elements in list/array

def genArray():
    return numpy.arange(N)

def plus_one(x):
    return x + 1
    
def python_map():
    ar = genArray()
    return list(map(plus_one, ar))

def numpy_vectorize():
    ar = genArray()
    return numpy.vectorize(plus_one)(ar)

repeats = 1000
gentime = timeit(genArray, number=repeats)
print(f"python_map: {timeit(python_map, number=repeats)-gentime:.2f}ms")
print(f"numpy_vectorize: {timeit(numpy_vectorize, number=repeats)-gentime:.2f}ms")

OUTPUT

python_map: 7.94ms
numpy_vectorize: 7.80ms

Using Pandas (Effectively)

Pandas is the most common Python package used for scientific computing when working with tabular data akin to spreadsheets (DataFrames).

Similar to NumPy, Pandas enables greater performance than pure Python implementations when used correctly, however incorrect usage can actively harm performance.

Operating on Rows

Pandas’ methods by default operate on columns. Each column or series can be thought of as a NumPy array, highly suitable for vectorisation.

Following the theme of this episode, iterating over the rows of a data frame using a for loop is not advised. The pythonic iteration will be slower than other approaches.

Pandas allows it’s own methods to be applied to rows in many cases by passing axis=1, where available these functions should be preferred over manual loops. Where you can’t find a suitable method, apply() can be used, which is similar to map()/vectorize(), to apply your own function to rows.

PYTHON

from timeit import timeit
import pandas
import numpy

N = 100000  # Number of rows in DataFrame

def genDataFrame():
    numpy.random.seed(12)  # Ensure each dataframe is identical
    return pandas.DataFrame(
    {
        "f_vertical": numpy.random.random(size=N),
        "f_horizontal": numpy.random.random(size=N),
        # todo some spurious columns
    })

def pythagoras(row):
    return (row["f_vertical"]**2 + row["f_horizontal"]**2)**0.5
    
def for_range():
    rtn = []
    df = genDataFrame()
    for row_idx in range(df.shape[0]):
        row = df.iloc[row_idx]
        rtn.append(pythagoras(row))
    return pandas.Series(rtn)

def for_iterrows():
    rtn = []
    df = genDataFrame()
    for row_idx, row in df.iterrows():
        rtn.append(pythagoras(row))
    return pandas.Series(rtn)
    
def pandas_apply():
    df = genDataFrame()
    return df.apply(pythagoras, axis=1)

repeats = 100
gentime = timeit(genDataFrame, number=repeats)
print(f"for_range: {timeit(for_range, number=repeats)*10-gentime:.2f}ms")
print(f"for_iterrows: {timeit(for_iterrows, number=repeats)*10-gentime:.2f}ms")
print(f"pandas_apply: {timeit(pandas_apply, number=repeats)*10-gentime:.2f}ms")

apply() is 3x faster than the two for approaches, as it avoids the Python for loop.

OUTPUT

for_range: 1582.47ms
for_iterrows: 1677.14ms
pandas_apply: 390.49ms

However, rows don’t exist in memory as arrays (columns do!), so apply() does not take advantage of NumPys vectorisation. You may be able to go a step further and avoid explicitly operating on rows entirely by passing only the required columns to NumPy.

PYTHON

def vectorize():
    df = genDataFrame()
    return pandas.Series(numpy.sqrt(numpy.square(df["f_vertical"]) + numpy.square(df["f_horizontal"])))
    
print(f"vectorize: {timeit(vectorize, number=repeats)-gentime:.2f}ms")

264x faster than apply(), 1000x faster than for iterrows()!

vectorize: 1.48ms

It won’t always be possible to take full advantage of vectorisation, for example you may have conditional logic.

An alternate approach is converting your dataframe to a Python dictionary using to_dict(orient='index'). This creates a nested dictionary, where each row of the outer dictionary is an internal dictionary. This can then be processed via list-comprehension:

PYTHON

def to_dict():
    df = genDataFrame()
    df_as_dict = df.to_dict(orient='index')
    return pandas.Series([(r['f_vertical']**2 + r['f_horizontal']**2)**0.5 for r in df_as_dict.values()])

print(f"to_dict: {timeit(to_dict, number=repeats)*10-gentime:.2f}ms")

Whilst still nearly 100x slower than pure vectorisation, it’s twice as fast as apply().

SH

to_dict: 131.15ms

This is because indexing into Pandas’ Series (rows) is significantly slower than a Python dictionary. There is a slight overhead to creating the dictionary (40ms in this example), however the stark difference in access speed is more than enough to overcome that cost for any large dataframe.

PYTHON

from timeit import timeit
import pandas as pandas

N = 100000  # Number of rows in DataFrame

def genInput():
    s = pandas.Series({'a' : 1, 'b' : 2})
    d = {'a' : 1, 'b' : 2}
    return s, d

def series():
    s, _ = genInput()
    for i in range(N):
        y = s['a'] * s['b']

def dictionary():
    _, d = genInput()
    for i in range(N):
        y = d['a'] * d['b']

repeats = 1000
print(f"series: {timeit(series, number=repeats):.2f}ms")
print(f"dictionary: {timeit(dictionary, number=repeats):.2f}ms")

65x slower!

OUTPUT

series: 237.25ms
dictionary: 3.63ms

Filter Early

If you can filter your rows before processing, rather than after, you may significantly reduce the amount of processing and memory used.

Key Points

Python is an interpreted language, this adds an additional overhead at runtime to the execution of Python code. Many core Python and NumPy functions are implemented in faster C/C++, free from this overhead.
NumPy can take advantage of vectorisation to process arrays, which can greatly improve performance.
Pandas’ data tables store columns as arrays, therefore operations applied to columns can take advantage of NumPys vectorisation.