Content from Introduction to Profiling


Last updated on 2024-03-28 | Edit this page

Estimated time: 25 minutes

Overview

Questions

  • Why should you profile your code?
  • How should you choose which type of profiler to use?
  • Which test case should be profiled?

Objectives

  • explain the benefits of profiling code and different types of profiler
  • identify the appropriate Python profiler for a given scenario
  • explain how to select an appropriate test case for profiling and why

Introduction


Performance profiling is the process of analysing and measuring the performance of a program or script, to understand where time is being spent during execution.

Profiling is useful when you have written any code that will be running for a substantial period of time. As your code grows in complexity, it becomes increasingly difficult to estimate where time is being spent during execution. Profiling allows you to narrow down where the time is being spent, to identify whether this is of concern or not.

Profiling is a relatively quick process which can either provide you the peace of mind that your code is efficient, or highlight the performance bottleneck. There is limited benefit to optimising components that may only contribute a tiny proportion of the overall runtime. Identifying bottlenecks allows optimisation to be precise and efficient, potentially leading to significant speedups enabling faster research. In extreme cases, addressing bottlenecks has enabled programs to run hundreds or thousands of times faster!

Increasingly, particularly with relation to HPC, attention is being paid to the energy usage of software. Profiling your software will provide you the confidence that your software is an efficient use of resources.

When to Profile


Profiling is most relevant to working code, when you have reached a stage that the code works and are considering deploying it.

Any code that will run for more than a few minutes over it’s lifetime, that isn’t a quick one-shot script can benefit from profiling.

Profiling should be a relatively quick and inexpensive process. If there are no significant bottlenecks in your code you can quickly be confident that your code is reasonably optimised. If you do identify a concerning bottleneck, further work to optimise your code and reduce the bottleneck could see significant improvements to the performance of your code and hence productivity.

All Programmers Can Benefit

Even professional programmers make oversights that can lead to poor performance, and can be identified through profiling.

For example Grand Theft Auto Online, which has allegedly earned over $7bn since it’s 2013 release, was notorious for it’s slow loading times. 8 years after it’s release a ‘hacker’ had enough, they reverse engineered and profiled the code to enable a 70% speedup!

How much revenue did that unnecessary bottleneck cost, through user churn?

How much time and energy was wasted, by unnecessarily slow loading screens?

The bottlenecked implementation was naively parsing a 10MB JSON file to create a list of unique items.

Repeatedly:

  • Checking the length of (C) strings, e.g. iterating till the terminating character is found, resolved by caching the results.
  • Performing a linear search of a list to check for duplicates before inserting, resolved by using an appropriate data structure (dictionary).
    • Allegedly duplicates were never even present in the JSON.

Why wasn’t this caught by one of the hundreds of developers with access to the source code?

Was more money saved by not investigating performance than committing time to profiling and fixing the issue?

Types of Profiler


There are multiple approaches to profiling, most programming languages have one or more tools available covering these approaches. Whilst these tools differ, their core functionality can be grouped into several categories.

Manual Profiling

Similar to using print() for debugging, manually timing sections of code can provide a rudimentary form of profiling.

PYTHON

import time

t_a = time.monotonic()
# A: Do something
t_b = time.monotonic()
# B: Do something else
t_c = time.monotonic()
# C: Do another thing
t_d = time.monotonic()

mainTimer_stop = time.monotonic()
print(f"A: {t_b - t_a} seconds")
print(f"B: {t_c - t_b} seconds")
print(f"C: {t_d - t_c} seconds")

Above is only one example of how you could manually profile your Python code, there are many similar techniques.

Whilst this can be appropriate for profiling narrow sections of code, it becomes increasingly impractical as a project grows in size and complexity. Furthermore, it’s also unproductive to be routinely adding and removing these small changes if they interfere with the required outputs of a project.

Benchmarking

You may have previously used timeit for timing Python code.

This package returns the total runtime of an isolated block of code, without providing a more granular timing breakdown. Therefore, it is better described as a tool for benchmarking.

Function-Level Profiling

Software is typically comprised of a hierarchy of function calls, both functions written by the developer and those used from the language’s standard library and third party packages.

Function-level profiling analyses where time is being spent with respect to functions. Typically function-level profiling will calculate the number of times each function is called and the total time spent executing each function, inclusive and exclusive of child function calls.

This allows functions that occupy a disproportionate amount of the total runtime to be quickly identified and investigated.

In this course we will cover the usage of the function-level profiler cProfile and how it’s output can be visualised with snakeviz.

Line-Level Profiling

Function-level profiling may not always be granular enough, perhaps your software is a single long script, or function-level profiling highlighted a particularly complex function.

Line-level profiling provides greater granularity, analysing where time is being spent with respect to individual lines of code.

This will identify individual lines of code that occupy an disproportionate amount of the total runtime.

In this course we will cover the usage of the line-level profiler line_profiler.

Deterministic vs Sampling Profilers

Line-level profiling can be particularly expensive, a program can execute hundreds of thousands of lines of code per second. Therefore, collecting information about each line of code can be costly.

line_profiler is deterministic, meaning that it tracks every line of code executed. To avoid it being too costly, the profiling is restricted to methods targeted with the decorator @profile.

In contrast scalene is a more advanced Python profiler capable of line-level profiling. It uses a sampling based approach, whereby the profiler halts and samples the line of code currently executing thousands of times per second. This reduces the cost of profiling, whilst still maintaining representative metrics for the most expensive components.

Timeline Profiling

Timeline profiling takes a different approach to visualising where time is being spent during execution.

Typically a subset of function-level profiling, the execution of the profiled software is instead presented as a timeline highlighting the order of function execution in addition to the time spent in each individual function call.

By highlighting individual functions calls, patterns relating to how performance scales over time can be identified. These would be hidden with the aforementioned aggregate approaches.

viztracer is an example of a timeline profiler for Python, however we won’t be demonstrating timeline profiling on this course.

A viztracer timeline of the execution of the Pred-Prey exercise from later in the course. There is a shallow repeating pattern on the left side which corresponds to model steps, the right side instead has a range of 'icicles' which correspond to the deep call hierarchies of matplotlib generating a graph.
An example timeline visualisation provided by viztracer/vizviewer.

Hardware Metric Profiling

Processor manufacturers typically release advanced profilers specific to their hardware with access to internal hardware metrics. These profilers can provide analysis of performance relative to theoretical hardware maximums (e.g. memory bandwidth or operations per second) and detail the utilisation of specific hardware features and operations.

Using these advanced profilers requires a thorough understanding of the relevant processor architecture and may lead to hardware specific optimisations.

Examples of these profilers include; Intel’s VTune, AMD’s uProf, and NVIDIA’s Nsight Compute.

Profiling of this nature is outside the scope of this course.

Selecting an Appropriate Test Case


The act of profiling your code, collecting additional timing metrics during execution, will cause your program to execute slower. The slowdown is dependent on many variables related to both your code and the granularity of metrics being collected.

Similarly, the longer your code runs, the more code that is being executed, the more data that will be collected. A profile that runs for hours could produce gigabytes of output data!

Therefore, it is important to select an appropriate test-case that is both representative of a typical workload and small enough that it can be quickly iterated. Ideally, it should take no more than a few minutes to run the profiled test-case from start to finish, however you may have circumstances where something that short is not possible.

For example, you may have a model which normally simulates a year in hourly timesteps. It would be appropriate to begin by profiling the simulation of a single day. If the model scales over time, such as due to population growth, it may be pertinent to profile a single day later into a simulation if the model can be resumed or configured. A larger population is likely to amplify any bottlenecks that scale with the population, making them easier to identify.

Exercise (5 minutes)

Think about a project where you’ve been working with Python. Do you know where the time during execution is being spent?

Write a short plan of the approach you would take to investigate and confirm where the majority of time is being spent during it’s execution.

  • What tools and techniques would be required?
  • Is there a clear priority to these approaches?
  • Which test-case/s would be appropriate?

Key Points

  • Profiling is a relatively quick process to analyse where time is being spent and bottlenecks during a program’s execution.
  • Code should be profiled when ready for deployment if it will be running for more than a few minutes during it’s lifetime.
  • There are several types of profiler each with slightly different purposes.
    • function-level: cProfile (visualised with snakeviz)
    • line-level: line_profiler
    • timeline: viztracer
    • hardware-metric
  • A representative test-case should be profiled, that is large enough to amplify any bottlenecks whilst executing to completion quickly.

Content from Function Level Profiling


Last updated on 2024-03-28 | Edit this page

Estimated time: 40 minutes

Overview

Questions

  • When is function level profiling appropriate?
  • How can cProfile and snakeviz be used to profile a Python program?
  • How are the outputs from function level profiling interpreted?

Objectives

  • execute a Python program via cProfile to collect profiling information about a Python program’s execution
  • use snakeviz to visualise profiling information output by cProfile
  • interpret snakeviz views, to identify the functions where time is being spent during a program’s execution

Introduction


Software is typically comprised of a hierarchy of function calls, both functions written by the developer and those used from the language’s standard library and third party packages.

Function-level profiling analyses where time is being spent with respect to functions. Typically function-level profiling will calculate the number of times each function is called and the total time spent executing each function, inclusive and exclusive of child function calls.

This allows functions that occupy a disproportionate amount of the total runtime to be quickly identified and investigated.

In this episode we will cover the usage of the function-level profiler cProfile, how it’s output can be visualised with snakeviz and how the output can be interpreted.

What is a Call Stack?

The call stack keeps track of the active hierarchy of function calls and their associated variables.

As a stack it is a last-in first-out (LIFO) data structure.

A greyscale diagram showing a (call)stack, containing 5 stack frame. Two additional stack frames are shown outside the stack, one is marked as entering the call stack with an arrow labelled push and the other is marked as exiting the call stack labelled pop.
A diagram of a call stack

When a function is called, a frame to track it’s variables and metadata is pushed to the call stack. When that same function finishes and returns, it is popped from the stack and variables local to the function are dropped.

If you’ve ever seen a stack overflow error, this refers to the call stack becoming too large. These are typically caused by recursive algorithms, whereby a function calls itself, that don’t exit early enough.

Within Python the current call-stack can be printed using the core traceback package, traceback.print_stack() will print the current call stack.

The below example:

PYTHON

import traceback

def a():
    b1()
    b2()
def b1():
    pass
def b2():
    c()
def c():
    traceback.print_stack()

a()

Here we can see that the printing of the stack trace is called in c(), which is called by b2(), which is called by a(), which is called from global scope.

Hence, this prints the following call stack:

OUTPUT

  File "C:\call_stack.py", line 13, in <module>
    a()
  File "C:\call_stack.py", line 5, in a
    b2()
  File "C:\call_stack.py", line 9, in b2
    c()
  File "C:\call_stack.py", line 11, in c
    traceback.print_stack()

The first line states the file and line number where a() was called from (the last line of code in the file shown). The second line states that it was the function a() that was called, this could include it’s arguments. The third line then repeats this pattern, stating the line number where b2() was called inside a(). This continues until the call to traceback.print_stack() is reached.

You may see stack traces like this when an unhandled exception is thrown by your code.

In this instance the base of the stack has been printed first, other visualisations of call stacks may use the reverse ordering.

cProfile


cProfile is a function-level profiler provided as part of the Python standard library.

It can be called directly within your Python code as an imported package, however it’s easier to use it’s script interface:

SH

python -m cProfile -o <output file> <script name> <arguments>

For example if you normally run your program as:

SH

python my_script.py input.csv

You would call cProfile to produce profiling output out.prof with:

SH

python -m cProfile -o out.prof my_script.py input.csv

No additional changes to your code are required, it’s really that simple!

If you instead, don’t specify output to file (e.g. remove -o out.prof from the command), cProfile will produce output to console similar to that shown below:

OUTPUT

         28 function calls in 4.754 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.754    4.754 worked_example.py:1(<module>)
        1    0.000    0.000    1.001    1.001 worked_example.py:13(b_2)
        3    0.000    0.000    1.513    0.504 worked_example.py:16(c_1)
        3    0.000    0.000    1.238    0.413 worked_example.py:19(c_2)
        3    0.000    0.000    0.334    0.111 worked_example.py:23(d_1)
        1    0.000    0.000    4.754    4.754 worked_example.py:3(a_1)
        3    0.000    0.000    2.751    0.917 worked_example.py:9(b_1)
        1    0.000    0.000    4.754    4.754 {built-in method builtins.exec}
       11    4.753    0.432    4.753    0.432 {built-in method time.sleep}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

The columns have the following definitions:

Column Definition
ncalls The number of times the given function was called.
tottime The total time spent in the given function, excluding child function calls.
percall The average tottime per function call (tottime/ncalls).
cumtime The total time spent in the given function, including child function calls.
percall The average cumtime per function call (cumtime/ncalls).
filename:lineno(function) The location of the given function’s definition and it’s name.

This output can often exceed the terminal’s buffer length for large programs and can be unwieldy to parse, so the package snakeviz is often utilised to provide an interactive visualisation of the data when exported to file.

snakeviz


It can help to run these examples by running snakeviz live. For the worked example you may wish to also show the code (e.g. in split screen).

Demonstrate features such as moving up/down the call-stack by clicking the boxes and changing the depth and cutoff via the dropdown.

Download pre-generated profile reports:

snakeviz is a web browser based graphical viewer for cProfile output files. It is not part of the Python standard library, and therefore must be installed via pip.

SH

pip install snakeviz

Once installed, you can visualise a cProfile output file such as out.prof via:

SH

python -m snakeviz out.prof

This should open your web browser displaying a page similar to that below.

A web page, with a central diagram representing a call-stack, with the root at the top and the horizontal axis representing the duration of each call. Below this diagram is the top of a table detailing the statistics of individual methods.
An example of the default ‘icicle’ visualisation provided by snakeviz.

The icicle diagram displayed by snakeviz represents an aggregate of the call stack during the execution of the profiled code. The box which fills the top row represents the the root call, filling the row shows that it occupied 100% of the runtime. The second row holds the child methods called from the root, with their widths relative to the proportion of runtime they occupied. This continues with each subsequent row, however where a method only occupies 50% of the runtime, it’s children can only occupy a maximum of that runtime hence the appearance of “icicles” as each row gets narrower when the overhead of methods with no further children is accounted for.

By clicking a box within the diagram, it will “zoom” making the selected box the root allowing more detail to be explored. The diagram is limited to 10 rows by default (“Depth”) and methods with a relatively small proportion of the runtime are hidden (“Cutoff”).

As you hover each box, information to the left of the diagram updates specifying the location of the method and for how long it ran.

snakeviz Inside Notebooks

If you’re more familiar with writing Python inside Jupyter notebooks you can still use snakeviz directly from inside notebooks using the notebooks “magic” prefix (%) and it will automatically call cProfile for you.

First snakeviz must be installed and it’s extension loaded.

PY

!pip install snakeviz
%load_ext snakeviz

Following this, you can either call %snakeviz to profile a function defined earlier in the notebook.

PY

%snakeviz my_function()

Or, you can create a %%snakeviz cell, to profile the python executed within it.

PY

%%snakeviz

def my_function():
    print("Hello World!")

my_function()

In both cases, the full snakeviz profile visualisation will appear as an output within the notebook!

You may wish to right click the top of the output, and select “Disable Scrolling for Outputs” to expand it’s box if it begins too small.

Worked Example


Demonstrate this!

To more clearly demonstrate how an execution hierarchy maps to the icicle diagram, the below toy example Python script has been implemented.

PYTHON

import time

def a_1():
    for i in range(3):
        b_1()
    time.sleep(1)
    b_2()
    
def b_1():
    c_1()
    c_2()

def b_2():
    time.sleep(1)
    
def c_1():
    time.sleep(0.5)

def c_2():
    time.sleep(0.3)
    d_1()

def d_1():
    time.sleep(0.1)

# Entry Point
a_1()

All of the methods except for b_1() call time.sleep(), this is used to provide synthetic bottlenecks to create an interesting profile.

  • a_1() calls b_1() x3 and b_2() x1
  • b_1() calls c_1() x1 and c_2() x1
  • c_2() calls d_1()

Follow Along

Download the Python source for the example or cProfile output file and follow along with the worked example on your own machine.

SH

python -m cProfile -o out.prof example.py
python -m snakeviz out.prof
The snakeviz icicle visualisation for the worked example Python code.
An icicle visualisation provided by snakeviz for the above Python code.

The third row represents a_1(), the only method called from global scope, therefore the first two rows represent Python’s internal code for launching our script and can be ignored (by clicking on the third row).

The row following a_1() is split into three boxes representing b_1(), time.sleep() and b_2(). Note that b_1() is called three times, but only has one box within the icicle diagram. The boxes are ordered left-to-right according to cumulative time, which happens to be the order they were first called.

If the box for time.sleep() is hovered it will change colour along with several other boxes that represent the other locations that time.sleep() was called from. Note that each of these boxes display the same duration, the timing statistics collected by cProfile (and visualised by snakeviz) are aggregate, so there is no information about individual method calls for methods which were called multiple times. This does however mean that if you check the properties to the left of the diagram whilst hovering time.sleep() you will see a cumulative time of 99% reported, the overhead of the method calls and for loop is insignificant in contrast to the time spent sleeping!

Below are the properties shown, the time may differ if you generated the profile yourself.

  • Name: <built-in method time.sleep>
  • Cumulative Time: 4.71 s (99.99 %)
  • File: ~
  • Line: 0
  • Directory:

As time.sleep() is a core Python method it is displayed as “built-in method” and doesn’t have a file, line or directory.

If you hover any boxes representing the methods from the above code, you will see file and line properties completed. The directory property remains empty as the profiled code was in the root of the working directory. A profile of a large project with many files across multiple directories will see this filled.

Find the box representing c_2() on the icicle diagram, it’s children are unlabelled because they are not wide enough (but they can still be hovered). Clicking c_2() zooms in the diagram, showing the children to be time.sleep() and d_1().

To zoom back out you can either click the top row, which will zoom out one layer, or click “Reset Zoom” on the left-hand side.

In this simple example the execution is fairly evenly balanced between all of the user-defined methods, so there is not a clear hot-spot to investigate.

Below the icicle diagram, there is a table similar to the default output from cProfile. However, in this case you can sort the columns by clicking their headers and filter the rows shown by entering a filename in the search box. This allows built-in methods to be hidden, which can make it easier to highlight optimisation priorities.

Notebooks

If you followed along inside a notebook it might look like this:

A Jupyter notebook showing the worked example profiled with snakeviz.
The worked example inside a notebook.

Because notebooks operate by creating temporary Python files, the filename (shown 1378276351.py above) and line numbers displayed are not too useful. However, the function names match those defined in the code and follow the temporary file name in parentheses, e.g. 1378276351.py:3(a_1), 1378276351.py:9(b_1) refer to the functions a_1() and b_1() respectively.

Sunburst

snakeviz provides an alternate “Sunburst” visualisation, accessed via the “Style” drop-down on the left-hand side.

This provides the same information as “Icicle”, however the rows are instead circular with the root method call found at the center.

The sunburst visualisation displays less text on the boxes, so it can be harder to interpret. However, it increases the visibility of boxes further from the root call.

A sunburst visualisation for the worked example Python code.
An sunburst visualisation provided by snakeviz for the worked example’s Python code.

Exercises


The following exercises allow you to review your understanding of what has been covered in this episode.

Arguments 1-9 passed to travellingsales.py should execute relatively fast (less than a minute)

This will be slower via the profiler, and is likely to vary on different hardware.

Larger values should be avoided.

Download the set of profiles for arguments 1-10, these can be opened by passing the directory to snakeviz.

SH

python -m snakeviz .

Exercise 1: Travelling Salesperson

Download and profile this Python program, try to locate the function call(s) where the majority of execution time is being spent.

The travelling salesperson problem aims to optimise the route for a scenario where a salesperson is requires to travel between N locations. They wish to travel to each location exactly once, in any order, whilst minimising the total distance travelled.

The provided implementation uses a naive brute-force approach.

The program can be executed via python travellingsales.py <cities>. The value of cities should be a positive integer, this algorithm has poor scaling so larger numbers take significantly longer to run.

  • If a hotspot isn’t visible with the argument 1, try increasing the value.
  • If you think you identified the hotspot with your first profile, try investigating how the value of cities affects the hotspot within the profile.

The hotspot only becomes visible when an argument of 5 or greater is passed.

You should see that distance() (from travellingsales.py:11) becomes the largest box (similarly it’s parent in the call-stack total_distance()) showing that it scales poorly with the number of cities. With 5 cities, distance() has a cumulative time of ~35% the runtime, this increases to ~60% with 9 cities.

Other boxes within the diagram correspond to the initialisation of imports, or initialisation of cities. These have constant or linear scaling, so their cost barely increases with the number of cities.

This highlights the need to profile a realistic test-case expensive enough that initialisation costs are not the most expensive component.

The default configuration of the Predator Prey model takes around 10 seconds to run, it may be slower on other hardware.

Download the pre-generated cProfile output, this can be opened with snakeviz to save waiting for the profiler.

SH

python -m snakeviz predprey_out.prof

Exercise 2: Predator Prey

Download and profile the Python predator prey model, try to locate the function call(s) where the majority of execution time is being spent

This exercise uses the packages numpy and matplotlib, they can be installed via pip install numpy matplotlib.

The predator prey model is a simple agent-based model of population dynamics. Predators and prey co-exist in a common environment and compete over finite resources.

The three agents; predators, prey and grass exist in a two dimensional grid. Predators eat prey, prey eat grass. The size of each population changes over time. Depending on the parameters of the model, the populations may oscillate, grow or collapse due to the availability of their food source.

The program can be executed via python predprey.py.

It takes no arguments, but contains various environment properties which can be modified to change the model’s behaviour. When the model finishes it outputs a graph of the three populations predprey_out.png.

It should be clear from the profile that the method Grass::eaten() (from predprey.py:278) occupies the majority of the runtime.

From the table below the Icicle diagram, we can see that it was called 1,250,000 times.

The top 9 rows of the table shown by snakeviz when profiling predprey.py. The top row shows that predprey.py:278(eaten) was called 1,250,000 times, taking a total time of 8 seconds. The table is ordered in descending total time, with the next row taking a mere 0.74 seconds.
The top of the table shown by snakeviz.

If the table is ordered by ncalls, it can be identified as the joint 4th most called method and 2nd most called method from predprey.py.

If you checked predprey_out.png (shown below), you should notice that there are significantly more Grass agents than Predators or Prey.

A line graph plotting population over time through 250 steps of the pred prey model. Grass/20, shown in green, has a brief dip in the first 30 steps, but recovers holding steady at approximately 240 (4800 agents). Prey, shown in blue, starts at 200, quickly drops to around 185, before levelling off for steps and then slowly declining to a final value of 50. The data for predators, shown in red, has significantly more noise. There are 50 predators to begin, this rises briefly before falling to around 10, from here it noisily grows to around 70 by step 250 with several larger declines during the growth.
predprey_out.png as produced by the default configuration of predprey.py.

Similarly, the Grass::eaten() has a percall time is inline with other agent functions such as Prey::flock() (from predprey.py:67).

Maybe we could investigate this further with line profiling!

You may have noticed many iciles on the right hand of the diagram, these primarily correspond to the import of matplotlib which is relatively expensive!

Key Points

  • A python program can be function level profiled with cProfile via python -m cProfile -o <output file> <script name> <arguments>.
  • The output file from cProfile can be visualised with snakeviz via python -m snakeviz <output file>.
  • Function level profiling output displays the nested call hierarchy, listing both the cumulative and total minus sub functions time.

Content from Break


Last updated on 2024-03-28 | Edit this page

Estimated time: 0 minutes

Take a break. If you can, move around and look at something away from your screen to give your eyes a rest and a chance to absorb the content covered so far.

Content from Line Level Profiling


Last updated on 2024-06-20 | Edit this page

Estimated time: 50 minutes

Overview

Questions

  • When is line level profiling appropriate?
  • What adjustments are required to Python code to profile with line_profiler?
  • How can kernprof be used to profile a Python program?

Objectives

  • decorate Python code to prepare it for profiling with line_profiler
  • execute a Python program via kernprof to collect profiling information about a Python program’s execution
  • interpret output from line_profiler, to identify the lines where time is being spent during a program’s execution

Introduction


Whilst profiling, you may find that function-level profiling highlights expensive methods where you can’t easily determine the cause of the cost due to their complexity.

Line level profiling allows you to target specific methods to collect more granular metrics, which can help narrow the source of expensive computation further. Typically line-level profiling will calculate the number of times each line is called and the total time spent executing each line. However, with the increased granularity come increased collection costs, which is why it’s targeted to specific methods.

This allows lines that occupy a disproportionate amount of the total runtime to be quickly identified and investigated.

In this episode we will cover the usage of the line-level profiler line_profiler, how your code should be modified to target the profiling and how the output can be interpreted.

line_profiler


line_profiler is a line-level profiler which provides both text output and visualisation.

It is not part of the Python standard library, and therefore must be installed via pip.

SH

pip install line_profiler[all]

If you are unable to install line_profiler via pip on MacOS. Instead it can be installed via conda.

SH

conda install line_profiler

It may first be necessary to enable conda-forge.

SH

conda config --add channels conda-forge

To use line_profiler decorate methods to be profiled with @profile which is imported from line_profiler.

For example, the below code:

PYTHON

def is_prime(number):
    if number < 2:
        return False
    for i in range(2, int(number**0.5) + 1):
        if number % i == 0:
            return False
    return True
    
print(is_prime(1087))

Would be updated to:

PYTHON

from line_profiler import profile

@profile
def is_prime(number):
    if number < 2:
        return False
    for i in range(2, int(number**0.5) + 1):
        if number % i == 0:
            return False
    return True
    
print(is_prime(1087))

This tells line_profiler to collect metrics for the lines within the method is_prime(). You can still execute your code as normal, and these changes will have no effect.

Similar to the earlier tools, line_profiler can then be triggered via kernprof.

SH

python -m kernprof -lvr my_script.py

This will output a table per profiled method to console:

OUTPUT

Wrote profile results to my_script.py.lprof
Timer unit: 1e-06 s

Total time: 1.65e-05 s
File: my_script.py
Function: is_prime at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           @profile
     4                                           def is_prime(number):
     5         1          0.4      0.4      2.4      if number < 2:
     6                                                   return False
     7        32          8.4      0.3     50.9      for i in range(2, int(number**0.5) + 1):
     8        31          7.4      0.2     44.8          if number % i == 0:
     9                                                       return False
    10         1          0.3      0.3      1.8      return True

The columns have the following definitions:

Column Definition
Line # The line number of the relevant line within the file (specified above the table).
Hits The total number of times the line was executed.
Time The total time spent executing that line, including child function calls.
Per Hit The average time per call, including child function calls (Time/Hits).
% Time The time spent executing the line, including child function calls, relative to the other lines of the function.
Line Contents A copy of the line from the file.

As line_profiler must be attached to specific methods and cannot attach to a full Python file or project, if your Python file has significant code in the global scope it will be necessary to move it into a new method which can then instead be called from global scope.

The profile is also output to file, in this case my_script.py.lprof. This file is not human-readable, but can be printed to console by passing it to line_profiler, which will then display the same table as above.

SH

python -m line_profiler -rm my_script.py.lprof

Worked Example


Follow Along

Download the Python source for the example and follow along with the worked example on your own machine.

To more clearly demonstrate how to use line_profiler, the below implementation of “FizzBuzz” will be line profiled.

PYTHON

n = 100
for i in range(1, n + 1):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)

As there are no methods, firstly it should be updated to move the code to be profiled into a method:

PYTHON

def fizzbuzz(n):
    for i in range(1, n + 1):
        if i % 3 == 0 and i % 5 == 0:
            print("FizzBuzz")
        elif i % 3 == 0:
            print("Fizz")
        elif i % 5 == 0:
            print("Buzz")
        else:
            print(i)

fizzbuzz(100)

Next the method can be decorated with @profile which must be imported via line_profiler:

PYTHON

from line_profiler import profile

@profile
def fizzbuzz(n):
    for i in range(1, n + 1):
        if i % 3 == 0 and i % 5 == 0:
            print("FizzBuzz")
        elif i % 3 == 0:
            print("Fizz")
        elif i % 5 == 0:
            print("Buzz")
        else:
            print(i)

fizzbuzz(100)

Now that the code has been decorated, it can be profiled!

SH

python -m kernprof -lvr fizzbuzz.py

This will output a table per profiled method to console:

If you run this locally it should be highlighted due to -r passed to kernprof.

OUTPUT

Wrote profile results to fizzbuzz.py.lprof
Timer unit: 1e-06 s

Total time: 0.0021535 s
File: fizzbuzz.py
Function: fizzbuzz at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           @profile
     4                                           def fizzbuzz(n):
     5       101         32.5      0.3      1.5      for i in range(1, n + 1):
     6       100         26.9      0.3      1.2          if i % 3 == 0 and i % 5 == 0:
     7         6        125.8     21.0      5.8              print("FizzBuzz")
     8        94         16.7      0.2      0.8          elif i % 3 == 0:
     9        27        541.3     20.0     25.1              print("Fizz")
    10        67         12.4      0.2      0.6          elif i % 5 == 0:
    11        14        285.1     20.4     13.2              print("Buzz")
    12                                                   else:
    13        53       1112.8     21.0     51.7              print(i)

For this basic example, we can calculate that “FizzBuzz” would be printed 6 times out of 100, and the profile shows that line 7 (print("FizzBuzz")) occupied 5.8% of the runtime. This is slightly lower than 6% due to the control flow code (printing to console is expensive relative to the control flow and conditional statements). Similarly, “Fizz” is printed 27 times and occupies 25.1%, likewise “Buzz” is printed 14 times and occupies 13.2%. Each print statement has a similar “Per Hit” time of 20-21 micro seconds.

Therefore it can be seen in this example, how the time spent executing each line matches expectations.

Rich Output

The -r argument passed to kernprof (or line_profiler) enables rich output, if you run the profile locally it should look similar to this. This requires the optional package rich, it will have been installed if [all] was specified when installing line_profiler with pip.

A screenshot of the `line_profiler` output from the previous code block, where the code within the line contents column has basic highlighting.
Rich (highlighted) console output provided by line_profiler for the above FizzBuzz profile code.

line_profiler Inside Notebooks

If you’re more familiar with writing Python inside Jupyter notebooks you can, as with snakeviz, use line_profiler directly from inside notebooks. However it is still necessary for the code you wish to profile to be placed within a function.

First line_profiler must be installed and it’s extension loaded.

PY

!pip install line_profiler
%load_ext line_profiler

Following this, you call line_profiler with %lprun.

PY

%lprun -f profiled_function_name entry_function_call()

The functions to be line profiled are specified with -f <function name>, this is repeated for each individual function that you would otherwise apply the @profile decorator to.

This is followed by calling the function which runs the full code to be profiled.

For the above fizzbuzz example it would be:

PY

%lprun -f fizzbuzz fizzbuzz(100)

This will then create an output cell with any output from the profiled code, followed by the standard output from line_profiler. It is not currently possible to get the rich/coloured output from line_profiler within notebooks.

A screenshot of the line_profiler output from the previous code block inside a Jupyter notebook.
Output provided by line_profiler inside a Juypter notebook for the above FizzBuzz profile code.

Exercises


The following exercises allow you to review your understanding of what has been covered in this episode.

Exercise 1: BubbleSort

Download and profile the Python bubblesort implementation, line-level profile the code to investigate where time is being spent.

Bubblesort is a basic sorting algorithm, it is not considered to be efficient so in practice other sorting algorithms are typically used.

The array to be sorted is iterated, with a pair-wise sort being applied to each element and it’s neighbour. This can cause elements to rise (or sink) multiple positions in a single pass, hence the name bubblesort. This iteration continues until the array is fully iterated with no elements being swapped.

The program can be executed via python bubblesort.py <elements>. The value of elements should be a positive integer as it represents the number of elements to be sorted.

  • Remember that the code needs to be moved into a method decorated with @profile
  • This must be imported via from line_profiler import profile
  • 100 elements should be suitable for a quick profile

If you chose to profile the whole code, it may look like this:

PYTHON

import sys
import random
from line_profiler import profile        # Import profile decorator

@profile                                 # Decorate the function to be profiled
def main():                              # Create a simple function with the code to be profiled
    # Argument parsing
    if len(sys.argv) != 2:
        print("Script expects 1 positive integer argument, %u found."%(len(sys.argv) - 1))
        sys.exit()
    n = int(sys.argv[1])
    # Init
    random.seed(12)
    arr = [random.random() for i in range(n)]
    print("Sorting %d elements"%(n))
    # Sort
    for i in range(n - 1):
        swapped = False
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        # If no two elements were swapped in the inner loop, the array is sorted
        if not swapped:
            break
    # Validate
    is_sorted = True
    for i in range(n - 1):
        if arr[i] > arr[i+1]:
            is_sorted = False
    print("Sorting: %s"%("Passed" if is_sorted else "Failed"))
    
main()                                  # Call the created function

The sort can be profiled with 100 elements, this is quick and should be representative.

SH

python -m kernprof -lvr bubblesort.py 100

This produces output:

OUTPUT

Wrote profile results to bubblesort.py.lprof
Timer unit: 1e-06 s

Total time: 0.002973 s
File: bubblesort.py
Function: main at line 5

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     5                                           @profile
     6                                           def main():
     7                                               # Argument parsing
     8         1          0.7      0.7      0.0      if len(sys.argv) != 2:
     9                                                   print("Script expects 1 positive integer argument, %u found."%…
    10                                                   sys.exit()
    11         1          1.6      1.6      0.1      n = int(sys.argv[1])
    12                                               # Init
    13         1          8.8      8.8      0.3      random.seed(12)
    14         1         16.6     16.6      0.6      arr = [random.random() for i in range(n)]
    15         1         38.2     38.2      1.3      print("Sorting %d elements"%(n))
    16                                               # Sort
    17        95         14.5      0.2      0.5      for i in range(n - 1):
    18        95         13.1      0.1      0.4          swapped = False
    19      5035        723.1      0.1     24.3          for j in range(0, n - i - 1):
    20      4940       1045.9      0.2     35.2              if arr[j] > arr[j + 1]:
    21      2452        686.9      0.3     23.1                  arr[j], arr[j + 1] = arr[j + 1], arr[j]
    22      2452        353.0      0.1     11.9                  swapped = True
    23                                                   # If no two elements were swapped in the inner loop, the array…
    24        95         15.2      0.2      0.5          if not swapped:
    25         1          0.2      0.2      0.0              break
    26                                               # Validate
    27         1          0.5      0.5      0.0      is_sorted = True
    28       100         12.9      0.1      0.4      for i in range(n - 1):
    29        99         20.3      0.2      0.7          if arr[i] > arr[i+1]:
    30                                                       is_sorted = False
    31         1         21.5     21.5      0.7      print("Sorting: %s"%("Passed" if is_sorted else "Failed"))

From this we can identify that the print statements were the most expensive individual calls (“Per Hit”), however both were only called once. Most execution time was spent at the inner loop (lines 19-22).

As this is a reference implementation of a classic sorting algorithm we are unlikely to be able to improve it further.

Download the pre-generated line_profiler output, this can be opened be to save waiting for the profiler.

SH

python -m line_profiler -rm predprey.py.lprof

Exercise 2: Predator Prey

During the function-level profiling episode, the Python predator prey model was function-level profiled. This highlighted that Grass::eaten() (from predprey.py:278) occupies the majority of the runtime.

Line-profile this method, using the output from the profile consider how it might be optimised.

  • Remember that the function needs to be decorated with @profile
  • This must be imported via from line_profiler import profile
  • Line-level profiling Grass::eaten(), the most called function will slow it down significantly. You may wish to reduce the number of steps predprey.py:305.

First the function must be decorated

PYTHON

# line ~1
from line_profiler import profile

PYTHON

# line ~278
    @profile
    def eaten(self, prey_list):

line_profiler can then be executed via python -m kernprof -lvr predprey.py.

This will take much longer to run due to line_profiler, you may wish to reduce the number of steps. In this instance it may change the profiling output slightly, as the number of Prey and their member variables evaluated by this method both change as the model progresses, but the overall pattern is likely to remain similar.

PYTHON

# line ~420
model = Model(50) # 50 steps (originally defaulted to 250)

Alternatively, you can kill the profiling process (e.g. ctrl + c) after a minute and the currently collected partial profiling information will be output.

This will produce output similar to that below.

OUTPUT

Wrote profile results to predprey.py.lprof
Timer unit: 1e-06 s

Total time: 101.573 s
File: predprey.py
Function: eaten at line 278

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   278                                               @profile
   279                                               def eaten(self, prey_list):
   280   1250000     227663.1      0.2      0.2          if self.available:
   281   1201630     165896.4      0.1      0.2              prey_index = -1
   282   1201630     166219.0      0.1      0.2              closest_prey = GRASS_EAT_DISTANCE
   283
   284                                                       # Iterate prey_location messages to find the closest prey
   285 198235791   29227902.1      0.1     28.8              for i in range(len(prey_list)):
   286 197034161   30158318.8      0.2     29.7                  prey = prey_list[i]
   287 197034161   38781451.1      0.2     38.2                  if prey.life < PREY_HUNGER_THRESH:
   288                                                               # Check if they are within interaction radius
   289   2969470     579923.4      0.2      0.6                      dx = self.x - prey.x
   290   2969470     552092.2      0.2      0.5                      dy = self.y - prey.y
   291   2969470     938669.8      0.3      0.9                      distance = math.sqrt(dx*dx + dy*dy)
   292
   293   2969470     552853.8      0.2      0.5                      if distance < closest_prey:
   294      2532        469.3      0.2      0.0                          prey_index = i
   295      2532        430.1      0.2      0.0                          closest_prey = distance
   296
   297   1201630     217534.5      0.2      0.2              if prey_index >= 0:
   298                                                           # Add grass eaten message
   299      2497       2181.8      0.9      0.0                  prey_list[prey_index].life += GAIN_FROM_FOOD_PREY
   300
   301                                                           # Update grass agent variables
   302      2497        793.9      0.3      0.0                  self.dead_cycles = 0
   303      2497        631.0      0.3      0.0                  self.available = 0

From the profiling output it can be seen that lines 285-287 occupy over 90% of the method’s runtime!

PYTHON

            for i in range(len(prey_list)):
                prey = prey_list[i]
                if prey.life < PREY_HUNGER_THRESH:

Given that the following line 289 only has a relative 0.6% time, it can be understood that the vast majority of times the condition prey.life < PREY_HUNGER_THRESH is evaluated it does not proceed.

Remembering that this method is executed once per each of the 5000 Grass agents each step of the model, it could make sense to pre-filter prey_list once each timestep before it is passed to Grass::eaten(). This would greatly reduce the number of Prey iterated, reducing the cost of the method.

Key Points

  • Specific methods can be line-level profiled if decorated with @profile that is imported from line_profiler.
  • kernprof executes line_profiler via python -m kernprof -lvr <script name> <arguments>.
  • Code in global scope must wrapped in a method if it is to be profiled with line_profiler.
  • The output from line_profiler lists the absolute and relative time spent per line for each targeted function.

Content from Profiling Conclusion


Last updated on 2024-03-28 | Edit this page

Estimated time: 5 minutes

Overview

Questions

  • What has been learnt about profiling?

Objectives

  • Review what has been learnt about profiling

This concludes the profiling portion of the course.

cProfile, snakeviz and line_profiler have been introduced, these are some of the most accessible Python profiling tools.

With these transferable skills, if necessary, you should be able to follow documentation to use more advanced Python profiling tools such as scalene.

Key Points

What profiling is:

  • The collection and analysis of metrics relating to the performance of a program during execution .

Why programmers can benefit from profiling:

  • Narrows down the costly areas of code, allowing optimisation to be prioritised or decided to be unnecessary.

When to Profile:

  • Profiling should be performed on functional code, either when concerned about performance or prior to release/deployment.

What to Profile:

  • The collection of profiling metrics will often slow the execution of code, therefore the test-case should be narrow whilst remaining representative of a realistic run.

How to function-level profile:

  • Execute cProfile via python -m cProfile -o <output file> <script name> <arguments>
  • Execute snakeviz via python -m snakeviz <output file>

How to line-level profile:

  • Import profile from line_profiling
  • Decorate targeted methods with @profile
  • Execute line_profiler via python -m kernprof -lvr <script name> <arguments>

Content from Introduction to Optimisation


Last updated on 2024-03-28 | Edit this page

Estimated time: 10 minutes

Overview

Questions

  • Why could optimisation of code be harmful?

Objectives

  • Able to explain the cost benefit analysis of performing code optimisation

Introduction


Now that you’re able to find the most expensive components of your code with profiling, it becomes time to learn how to identify whether that expense is reasonable.

In order to optimise code for performance, it is necessary to have an understanding of what a computer is doing to execute it.

Even a high-level understanding of how you code executes, such as how Python and the most common data-structures and algorithms are implemented, can help you to identify suboptimal approaches when programming. If you have learned to write code informally out of necessity, to get something to work, it’s not uncommon to have collected some bad habits along the way.

The remaining content is often abstract knowledge, that is transferable to the vast majority of programming languages. This is because the hardware architecture, data-structures and algorithms used are common to many languages and they hold some of the greatest influence over performance bottlenecks.

Premature Optimisation


Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. - Donald Knuth

This classic quote among computer scientists states; when considering optimisation it is important to focus on the potential impact, both to the performance and maintainability of the code.

Profiling is a valuable tool in this cause. Should effort be expended to optimise a component which occupies 1% of the runtime? Or would that time be better spent focusing on the most expensive components?

Advanced optimisations, mostly outside the scope of this course, can increase the cost of maintenance by obfuscating what code is doing. Even if you are a solo-developer working on private code, your future self should be able to easily comprehend your implementation.

Therefore, the balance between the impact to both performance and maintainability should be considered when optimising code.

This is not to say, don’t consider performance when first writing code. The selection of appropriate algorithms and data-structures covered in this course form good practice, simply don’t fret over a need to micro-optimise every small component of the code that you write.

Ensuring Reproducible Results


When optimising your code, you are making speculative changes. It’s easy to make mistakes, many of which can be subtle. Therefore, it’s important to have a strategy in place to check that the outputs remain correct.

Testing is hopefully already a seamless part of your research software development process. Test can be used to clarify how your software should perform, ensuring that new features work as intended and protecting against unintended changes to old functionality.

There are a plethora of methods for testing code.

pytest Overview


Most Python developers use the testing package pytest, it’s a great place to get started if you’re new to testing code.

Here’s a quick example of how a test can be used to check your function’s output against an expected value.

Tests should be created within a project’s testing directory, by creating files named with the form test_*.py or *_test.py.

pytest looks for these files, when running the test suite.

Within the created test file, any functions named in the form test* are considered tests that will be executed by pytest.

The assert keyword is used, to test whether a condition evaluates to True.

PYTHON

# file: test_demonstration.py

# A simple function to be tested, this could instead be an imported package
def squared(x):
    return x**2

# A simple test case
def test_example():
    assert squared(5) == 24

When py.test is called inside a working directory, it will then recursively find and execute all the available tests.

SH

>py.test
================================================= test session starts =================================================
platform win32 -- Python 3.10.12, pytest-7.3.1, pluggy-1.3.0
rootdir: C:\demo
plugins: anyio-4.0.0, cov-4.1.0, xdoctest-1.1.2
collected 1 item

test_demonstration.py F                                                                                          [100%]

====================================================== FAILURES =======================================================
____________________________________________________ test_example _____________________________________________________

    def test_example():
>       assert squared(5) == 24
E       assert 25 == 24
E        +  where 25 = squared(5)

test_demonstration.py:9: AssertionError
=============================================== short test summary info ===============================================
FAILED test_demonstration.py::test_example - assert 25 == 24
================================================== 1 failed in 0.07s ==================================================

Whilst not designed for benchmarking, it does provide the total time the test suite took to execute. In some cases this may help identify whether the optimisations have had a significant impact on performance.

This is only the simplest introduction to using pytest, it has advanced features common to other testing frameworks such as fixtures, mocking and test skipping. pytest’s documentation covers all this and more. You may already have a different testing workflow in-place for validating the correctness of the outputs from your code.

  • Fixtures: A test fixture is a common class which multiple tests can inherit from. This class will typically include methods that perform common initialisation and teardown actions around the behaviour to be tested. This reduces repeated code.
  • Mocking: If you wish to test a feature which would relies on a live or temperamental service, such as making API calls to a website. You can mock that API, so that when the test runs synthetic responses are produced rather than the real API being used.
  • Test skipping: You may have configurations of your software that cause certain tests to be unsupported. Skipping allows conditions to be added to tests, to decide whether they should be executed or skipped.

Key Points

  • The knowledge necessary to perform high-level optimisations of code is largely transferable between programming languages.
  • When considering optimisation it is important to focus on the potential impact, both to the performance and maintainability of the code.
  • Many high-level optimisations should be considered good-practice.

Content from Data Structures & Algorithms


Last updated on 2024-03-28 | Edit this page

Estimated time: 35 minutes

Overview

Questions

  • What’s the most efficient way to construct a list?
  • When should tuples be used?
  • When are sets appropriate?
  • What is the best way to search a list?

Objectives

  • Able to summarise how lists and tuples work behind the scenes.
  • Able to identify appropriate use-cases for tuples.
  • Able to utilise dictionaries and sets effectively
  • Able to use bisect_left() to perform a binary search of a list or array

The important information for students to learn within this episode are the patterns demonstrated via the benchmarks.

This episode introduces many complex topics, these are used to ground the performant patterns in understanding to aid memorisation.

It should not be a concern to students if they find the data-structure/algorithm internals challenging, if they are still able to recognise the demonstrated patterns.

This episode is challenging!

Within this episode you will be introduced to how certain data-structures and algorithms work.

This is used to explain why one approach is likely to execute faster than another.

It matters that you are able to recognise the faster/slower approaches, not that you can describe or reimplement these data-structures and algorithms yourself.

Lists


Lists are a fundamental data structure within Python.

It is implemented as a form of dynamic array found within many programming languages by different names (C++: std::vector, Java: ArrayList, R: vector, Julia: Vector).

They allow direct and sequential element access, with the convenience to append items.

This is achieved by internally storing items in a static array. This array however can be longer than the list, so the current length of the list is stored alongside the array. When an item is appended, the list checks whether it has enough spare space to add the item to the end. If it doesn’t, it will re-allocate a larger array, copy across the elements, and deallocate the old array. The item to be appended is then copied to the end and the counter which tracks the list’s length is increemnted.

The amount the internal array grows by is dependent on the particular list implementation’s growth factor. CPython for example uses newsize + (newsize >> 3) + 6, which works out to an over allocation of roughly ~12.5%.

A line graph displaying the relationship between the number of calls to append() and the number of internal resizes of a CPython list. It has a logarithmic relationship, at 1 million appends there have been 84 internal resizes.
The relationship between the number of appends to an empty list, and the number of internal resizes in CPython.

This has two implications:

  • If you are creating large static lists, they will use upto 12.5% excess memory.
  • If you are growing a list with append(), there will be large amounts of redundant allocations and copies as the list grows.

List Comprehension

If creating a list via append() is undesirable, the natural alternative is to use list-comprehension.

List comprehension can be twice as fast at building lists than using append(). This is primarily because list-comprehension allows Python to offload much of the computation into faster C code. General python loops in contrast can be used for much more, so they remain in Python bytecode during computation which has additional overheads.

This can be demonstrated with the below benchmark:

PYTHON

from timeit import timeit

def list_append():
    li = []
    for i in range(100000):
        li.append(i)

def list_preallocate():
    li = [0]*100000
    for i in range(100000):
        li[i] = i

def list_comprehension():
    li = [i for i in range(100000)]

repeats = 1000
print(f"Append: {timeit(list_append, number=repeats):.2f}ms")
print(f"Preallocate: {timeit(list_preallocate, number=repeats):.2f}ms")
print(f"Comprehension: {timeit(list_comprehension, number=repeats):.2f}ms")

timeit is used to run each function 1000 times, providing the below averages:

OUTPUT

Append: 3.50ms
Preallocate: 2.48ms
Comprehension: 1.69ms

Results will vary between Python versions, hardware and list lengths. But in this example list comprehension was 2x faster, with pre-allocate fairing in the middle. Although this is milliseconds, this can soon add up if you are regularly creating lists.

Tuples


In contrast, Python’s tuples are immutable static arrays (similar to strings), their elements cannot be modified and they cannot be resized.

Their potential use-cases are greatly reduced due to these two limitations, they are only suitable for groups of immutable properties.

Tuples can still be joined with the + operator, similar to appending lists, however the result is always a newly allocated tuple (without a list’s over-allocation).

Python caches a large number of short (1-20 element) tuples. This greatly reduces the cost of creating and destroying them during execution at the cost of a slight memory overhead.

This can be easily demonstrated with Python’s timeit module in your console.

SH

>python -m timeit "li = [0,1,2,3,4,5]"
10000000 loops, best of 5: 26.4 nsec per loop

>python -m timeit "tu = (0,1,2,3,4,5)"
50000000 loops, best of 5: 7.99 nsec per loop

It takes 3x as long to allocate a short list than a tuple of equal length. This gap only grows with the length, as the tuple cost remains roughly static whereas the cost of allocating the list grows slightly.

Dictionaries


Dictionaries are another fundamental Python data-structure. They provide a key-value store, whereby unique keys with no intrinsic order map to attached values.

“no intrinsic order”

Since Python 3.6, the items within a dictionary will iterate in the order that they were inserted. This does not apply to sets.

OrderedDict still exists, and may be preferable if the order of items is important when performing whole-dictionary equality.

Hashing Data Structures

Python’s dictionaries are implemented as hashing data structures. Within a hashing data structure each inserted key is hashed to produce a (hopefully unique) integer key. The dictionary is pre-allocated to a default size, and the key is assigned the index within the dictionary equivalent to the hash modulo the length of the dictionary. If that index doesn’t already contain another key, the key (and any associated values) can be inserted. When the index isn’t free, a collision strategy is applied. CPython’s dictionary and set both use a form of open addressing whereby a hash is mutated and corresponding indices probed until a free one is located. When the hashing data structure exceeds a given load factor (e.g. 2/3 of indices have been assigned keys), the internal storage must grow. This process requires every item to be re-inserted which can be expensive, but reduces the average probes for a key to be found.

A diagram demonstrating how the keys (hashes) 37, 64, 14, 94, 67 are inserted into a hash table with 11 indices. This is followed by the insertion of 59, 80 and 39 which require linear probing to be inserted due to collisions.
An visual explanation of linear probing, CPython uses an advanced form of this.

To retrieve or check for the existence of a key within a hashing data structure, the key is hashed again and a process equivalent to insertion is repeated. However, now the key at each index is checked for equality with the one provided. If any empty index is found before an equivalent key, then the key must not be present in the ata structure.

Keys

Keys will typically be a core Python type such as a number or string. However multiple of these can be combined as a Tuple to form a compound key, or a custom class can be used if the methods __hash__() and __eq__() have been implemented.

You can implement __hash__() by utilising the ability for Python to hash tuples, avoiding the need to implement a bespoke hash function.

PYTHON

class MyKey:

    def __init__(self, _a, _b, _c):
        self.a = _a
        self.b = _b
        self.c = _c

    def __eq__(self, other):
        return (isinstance(other, type(self))
                and (self.a, self.b, self.c) == (other.a, other.b, other.c))

    def __hash__(self):
        return hash((self.a, self.b, self.c))

dict = {}
dict[MyKey("one", 2, 3.0)] = 12

The only limitation is that where two objects are equal they must have the same hash, hence all member variables which contribute to __eq__() should also contribute to __hash__() and vice versa (it’s fine to have irrelevant or redundant internal members contribute to neither).

Sets


Sets are dictionaries without the values (both are declared using {}), a collection of unique keys equivalent to the mathematical set. Modern CPython now uses a set implementation distinct from that of it’s dictionary, however they still behave much the same in terms of performance characteristics.

Sets are used for eliminating duplicates and checking for membership, and will normally outperform lists especially when the list cannot be maintained sorted.

Exercise: Unique Collection

There are four implementations in the below example code, each builds a collection of unique elements from 25,000 where 50% can be expected to be duplicates.

Estimate how the performance of each approach is likely to stack up.

If you reduce the value of repeats it will run faster, how does changing the number of items (N) or the ratio of duplicates int(N/2) affect performance?

PYTHON

import random
from timeit import timeit

def generateInputs(N = 25000):
    random.seed(12)  # Ensure every list is the same 
    return [random.randint(0,int(N/2)) for i in range(N)]
    
def uniqueSet():
    ls_in = generateInputs()
    set_out = set(ls_in)
    
def uniqueSetAdd():
    ls_in = generateInputs()
    set_out = set()
    for i in ls_in:
        set_out.add(i)
    
def uniqueList():
    ls_in = generateInputs()
    ls_out = []
    for i in ls_in:
        if not i in ls_out:
            ls_out.append(i)

def uniqueListSort():
    ls_in = generateInputs()
    ls_in.sort()
    ls_out = [ls_in[0]]
    for i in ls_in:
        if ls_out[-1] != i:
            ls_out.append(i)
            
repeats = 1000
gen_time = timeit(generateInputs, number=repeats)
print(f"uniqueSet: {timeit(uniqueSet, number=repeats)-gen_time:.2f}ms")
print(f"uniqueSetAdd: {timeit(uniqueSetAdd, number=repeats)-gen_time:.2f}ms")
print(f"uniqueList: {timeit(uniqueList, number=repeats)-gen_time:.2f}ms")
print(f"uniqueListSort: {timeit(uniqueListSort, number=repeats)-gen_time:.2f}ms")
  • uniqueSet() passes the input list to the constructor set().
  • uniqueSetAdd() creates an empty set, and then iterates the input list adding each item individually.
  • uniqueList() this naive approach, checks whether each item in the input list exists in the output list before appending.
  • uniqueListSort() sorts the input list, allowing only the last item of the output list to be checked before appending.

There is not a version using list comprehension, as it is not possible to refer to the list being constructed during list comprehension.

Constructing a set by passing in a single list is the clear winner.

Constructing a set with a loop and add() (equivalent to a list’s append()) comes in second. This is slower due to the pythonic loop, whereas adding a full list at once moves this to CPython’s back-end.

The naive list approach is 2200x times slower than the fastest approach, because of how many times the list is searched. This gap will only grow as the number of items increases.

Sorting the input list reduces the cost of searching the output list significantly, however it is still 8x slower than the fastest approach. In part because around half of it’s runtime is now spent sorting the list.

OUTPUT

uniqueSet: 0.30ms
uniqueSetAdd: 0.81ms
uniqueList: 660.71ms
uniqueListSort: 2.67ms

Searching


Independent of the performance to construct a unique set (as covered in the previous section), it’s worth identifying the performance to search the data-structure to retrieve an item or check whether it exists.

The performance of a hashing data structure is subject to the load factor and number of collisions. An item that hashes with no collision can be checked almost directly, whereas one with collisions will probe until it finds the correct item or an empty slot. In the worst possible case, whereby all insert items have collided this would mean checking every single item. In practice, hashing data-structures are designed to minimise the chances of this happening and most items should be found or identified as missing with a single access.

In contrast if searching a list or array, the default approach is to start at the first item and check all subsequent items until the correct item has been found. If the correct item is not present, this will require the entire list to be checked. Therefore the worst-case is similar to that of the hashing data-structure, however it is guaranteed in cases where the item is missing. Similarly, on-average we would expect an item to be found half way through the list, meaning that an average search will require checking half of the items.

If however the list or array is sorted, a binary search can be used. A binary search divides the list in half and checks which half the target item would be found in, this continues recursively until the search is exhausted whereby the item should be found or dismissed. This is significantly faster than performing a linear search of the list, checking a total of log N items every time.

The below code demonstrates these approaches and their performance.

PYTHON

import random
from timeit import timeit
from bisect import bisect_left

N = 25000  # Number of elements in list
M = 2  # N*M == Range over which the elements span

def generateInputs():
    random.seed(12)  # Ensure every list is the same
    st = set([random.randint(0, int(N*M)) for i in range(N)])
    ls = list(st)
    ls.sort()  # Sort required for binary
    return st, ls  # Return both set and list
    
def search_set():
    st, _ = generateInputs()
    j = 0
    for i in range(0, int(N*M), M):
        if i in st:
            j += 1
    
def linear_search_list():
    _, ls = generateInputs()
    j = 0
    for i in range(0, int(N*M), M):
        if i in ls:
            j += 1
    
def binary_search_list():
    _, ls = generateInputs()
    j = 0
    for i in range(0, int(N*M), M):
        k = bisect_left(ls, i)
        if k != len(ls) and ls[k] == i:
            j += 1

            
repeats = 1000
gen_time = timeit(generateInputs, number=repeats)
print(f"search_set: {timeit(search_set, number=repeats)-gen_time:.2f}ms")
print(f"linear_search_list: {timeit(linear_search_list, number=repeats)-gen_time:.2f}ms")
print(f"binary_search_list: {timeit(binary_search_list, number=repeats)-gen_time:.2f}ms")

Searching the set is fastest performing 25,000 searches in 0.04ms. This is followed by the binary search of the (sorted) list which is 145x slower, although the list has been filtered for duplicates. A list still containing duplicates would be longer, leading to a more expensive search. The linear search of the list is more than 56,600x slower than the fastest, it really shouldn’t be used!

OUTPUT

search_set: 0.04ms
linear_search_list: 2264.91ms
binary_search_list: 5.79ms

These results are subject to change based on the number of items and the proportion of searched items that exist within the list. However, the pattern is likely to remain the same. Linear searches should be avoided!

Key Points

  • List comprehension should be preferred when constructing lists.
  • Where appropriate, tuples should be preferred over Python lists.
  • Dictionaries and sets are appropriate for storing a collection of unique data with no intrinsic order for random access.
  • When used appropriately, dictionaries and sets are significantly faster than lists.
  • If searching a list or array is required, it should be sorted and searched using bisect_left() (binary search).

Content from Break


Last updated on 2024-03-28 | Edit this page

Estimated time: 0 minutes

Take a break. If you can, move around and look at something away from your screen to give your eyes a rest.

Content from Understanding Python (NumPy/Pandas)


Last updated on 2024-03-28 | Edit this page

Estimated time: 30 minutes

Overview

Questions

  • Why are Python loops slow?
  • Why is NumPy often faster than raw Python?
  • How can processing rows of a Pandas data table be made faster?

Objectives

  • Able to identify when Python code can be rewritten to perform execution in the back-end.
  • Able to utilise NumPy’s vectorisation when operating on arrays of data.
  • Able to efficiently process rows when working with data tables.

Python is an interpreted programming language. When you execute your .py file, the (default) CPython back-end compiles your Python source code to an intermediate bytecode. This bytecode is then interpreted in software at runtime generating instructions for the processor as necessary. This interpretation stage, and other features of the language, harm the performance of Python (whilst improving it’s usability).

In comparison, many languages such as C/C++ compile directly to machine code. This allows the compiler to perform low-level optimisations that better exploit hardware nuance to achieve fast performance. This however comes at the cost of compiled software not being cross-platform.

Whilst Python will rarely be as fast as compiled languages like C/C++, it is possible to take advantage of the CPython back-end and packages such as NumPy and Pandas that have been written in compiled languages to expose this performance.

A simple example of this would be to perform a linear search of a list (in the previous episode we did say this is not recommended). The below example creates a list of 2500 integers in the inclusive-exclusive range [0, 5000). It then searches for all of the even numbers in that range. searchlistPython() is implemented manually, iterating ls checking each individual item in Python code. searchListC() in contrast uses the in operator to perform each search, which allows CPython to implement the inner loop in it’s C back-end.

PYTHON

import random

N = 2500  # Number of elements in list
M = 2  # N*M == Range over which the elements span

def generateInputs():
    random.seed(12)  # Ensure every list is the same
    return [random.randint(0, int(N*M)) for i in range(N)]
    
def searchListPython():
    ls = generateInputs()
    ct = 0
    for i in range(0, int(N*M), M):
        for j in range(0, len(ls)):
            if ls[j] == i:
                ct += 1
                break

def searchListC():
    ls = generateInputs()
    ct = 0
    for i in range(0, int(N*M), M):
        if i in ls:
            ct += 1

repeats = 1000
gen_time = timeit(generateInputs, number=repeats)
print(f"searchListPython: {timeit(searchListPython, number=repeats)-gen_time:.2f}ms")
print(f"searchListC: {timeit(searchListC, number=repeats)-gen_time:.2f}ms")

This results in the manual Python implementation being 5x slower, doing the exact same operation!

OUTPUT

searchListPython: 152.15ms
searchListC: 28.43ms

An easy approach to follow is that if two blocks of code do the same operation, the one that contains less Python is probably faster. This won’t apply if you’re using 3rd party packages written purely in Python though.

Python bytecode

You can use dis to view the bytecode generated by Python, the amount of byte-code more strongly correlates with how much code is being executed by the Python interpreter. However, this still does not account for whether functions called are implemented using Python or C.

The pure Python search compiles to 82 lines of byte-code.

PYTHON

import dis

def searchListPython():
    ls = generateInputs()
    ct = 0
    for i in range(0, int(N*M), M):
        for j in range(0, len(ls)):
            if ls[j] == i:
                ct += 1
                break

dis.dis(searchListPython)

OUTPUT

 11           0 LOAD_GLOBAL              0 (generateInputs)
              2 CALL_FUNCTION            0
              4 STORE_FAST               0 (ls)

 12           6 LOAD_CONST               1 (0)
              8 STORE_FAST               1 (ct)

 13          10 LOAD_GLOBAL              1 (range)
             12 LOAD_CONST               1 (0)
             14 LOAD_GLOBAL              2 (int)
             16 LOAD_GLOBAL              3 (N)
             18 LOAD_GLOBAL              4 (M)
             20 BINARY_MULTIPLY
             22 CALL_FUNCTION            1
             24 LOAD_GLOBAL              4 (M)
             26 CALL_FUNCTION            3
             28 GET_ITER
        >>   30 FOR_ITER                24 (to 80)
             32 STORE_FAST               2 (i)

 14          34 LOAD_GLOBAL              1 (range)
             36 LOAD_CONST               1 (0)
             38 LOAD_GLOBAL              5 (len)
             40 LOAD_FAST                0 (ls)
             42 CALL_FUNCTION            1
             44 CALL_FUNCTION            2
             46 GET_ITER
        >>   48 FOR_ITER                14 (to 78)
             50 STORE_FAST               3 (j)

 15          52 LOAD_FAST                0 (ls)
             54 LOAD_FAST                3 (j)
             56 BINARY_SUBSCR
             58 LOAD_FAST                2 (i)
             60 COMPARE_OP               2 (==)
             62 POP_JUMP_IF_FALSE       38 (to 76)

 16          64 LOAD_FAST                1 (ct)
             66 LOAD_CONST               2 (1)
             68 INPLACE_ADD
             70 STORE_FAST               1 (ct)

 17          72 POP_TOP
             74 JUMP_FORWARD             1 (to 78)

 15     >>   76 JUMP_ABSOLUTE           24 (to 48)
        >>   78 JUMP_ABSOLUTE           15 (to 30)

 13     >>   80 LOAD_CONST               0 (None)
             82 RETURN_VALUE

Whereas the in variant only compiles to 54.

PYTHON

import dis

def searchListC():
    ls = generateInputs()
    ct = 0
    for i in range(0, int(N*M), M):
        if i in ls:
            ct += 1

dis.dis(searchListC)

OUTPUT

  4           0 LOAD_GLOBAL              0 (generateInputs)
              2 CALL_FUNCTION            0
              4 STORE_FAST               0 (ls)

  5           6 LOAD_CONST               1 (0)
              8 STORE_FAST               1 (ct)

  6          10 LOAD_GLOBAL              1 (range)
             12 LOAD_CONST               1 (0)
             14 LOAD_GLOBAL              2 (int)
             16 LOAD_GLOBAL              3 (N)
             18 LOAD_GLOBAL              4 (M)
             20 BINARY_MULTIPLY
             22 CALL_FUNCTION            1
             24 LOAD_GLOBAL              4 (M)
             26 CALL_FUNCTION            3
             28 GET_ITER
        >>   30 FOR_ITER                10 (to 52)
             32 STORE_FAST               2 (i)

  7          34 LOAD_FAST                2 (i)
             36 LOAD_FAST                0 (ls)
             38 CONTAINS_OP              0
             40 POP_JUMP_IF_FALSE       25 (to 50)

  8          42 LOAD_FAST                1 (ct)
             44 LOAD_CONST               2 (1)
             46 INPLACE_ADD
             48 STORE_FAST               1 (ct)
        >>   50 JUMP_ABSOLUTE           15 (to 30)

  6     >>   52 LOAD_CONST               0 (None)
             54 RETURN_VALUE

Scope


When Python executes your code, it has to find the variables and functions that you’re using.

This adds an additional cost to accessing variables and calling functions in Python, which isn’t typically seen in compiled languages.

In particular, it will first check whether the variable or functions has been declared within the current function (local scope), if it can’t find it there it will check whether it has been declared in the file (global scope) after which it may even check whether it’s from an imported package.

These are not implicitly cached, therefore repeated accesses to variables and functions, will repeat these checks.

The implication, is that as local scope variables and functions are checked first, they will be faster to use.

If you’re only accessing a variable once or twice that’s nothing to worry about, this is a relatively small cost. But if a variable or functions is being accessed regularly, such as within a loop, the impact may become visible.

The below example provides a small demonstration of this in practice.

PY

from timeit import timeit

N = 1000000
repeats = 1000

def test_list_global():
    t = 0
    for i in range(N):
        if t > N:
            break
        t += i
        
def test_list_local():
    t = 0
    N_local = N
    for i in range(N_local):
        if t > N_local:
            break
        t += i
        
print(f"Global Scope: {timeit(test_list_global, number=repeats):.5f}ms")
print(f"Local Scope: {timeit(test_list_local, number=repeats):.5f}ms")

This is only a trivial example, whereby N has been copied to the local scope N_local, but local scope is about 20% faster than global scope!

OUTPUT

Global Scope: 0.06416ms
Local Scope: 0.05391ms

Consider copying highly accessed variables into local scope, you can always copy them back to global scope before you return from a function.

Copying functions to local scope works much the same as variables, e.g.

PY

import numpy as np

def my_function():
    uniform_local = np.random.uniform
    
    for i in range(10000):
        t = uniform_local()

Built-in Functions Operators


In order to take advantage of offloading computation to the CPython back-end it’s necessary to be aware of what functionality is present. Those available without importing packages are considered built-in functions.

Built-in functions are typically implemented in the CPython back-end, so their performance benefits from bypassing the Python interpreter.

In particular, those which are passed an iterable (e.g. lists) are likely to provide the greatest benefits to performance. The Python documentation provides equivalent Python code for many of these cases

  • all(): boolean and of all items
  • any(): boolean or of all items
  • max(): Return the maximum item
  • min(): Return the minimum item
  • sum(): Return the sum of all items

Callout

The built-in functions filter() and map() can be used for processing iterables However list-comprehension is likely to be more performant.

Using NumPy (Effectively)


NumPy is a commonly used package for scientific computing, which provides a wide variety of methods.

It adds restriction via it’s own basic numeric types, and static arrays to enable even greater performance than that of core Python. However if these restrictions are ignored, the performance can become significantly worse.

Arrays

NumPy’s arrays (not to be confused with the core Python array package) are static arrays. Unlike core Python’s lists, they do not dynamically resize. Therefore if you wish to append to a NumPy array, you must call resize() first. If you treat this like append() for a Python list, resizing for each individual append you will be performing significantly more copies and memory allocations than a Python list.

The below example sees lists and arrays constructed from range(100000).

PYTHON

from timeit import timeit
import numpy

N = 100000  # Number of elements in list/array

def list_append():
    ls = []
    for i in range(N):
        ls.append(i)

def array_resize():
    ar = numpy.zeros(1)
    for i in range(1, N):
        ar.resize(i+1)
        ar[i] = i
        
repeats = 1000
print(f"list_append: {timeit(list_append, number=repeats):.2f}ms")
print(f"array_resize: {timeit(array_resize, number=repeats):.2f}ms")

Resizing a NumPy array is 5.2x slower than a list, probably 10x slower than list comprehension.

OUTPUT

list_append: 3.50ms
array_resize: 18.04ms

Another difference, is that NumPy arrays typically require all data to be the same type (and a NumPy type). This enables more efficient access to elements, as they all exist contiguously in memory. In contrast, elements within Python lists can be of any type so the list always stores a pointer to where the element actually exists in memory, rather than the actual element. This has the side effect that if you are converting back and forth between Python lists and NumPy arrays, there is an additional overhead as it’s not as simple as copying a single block of memory.

Callout

If you construct a NumPy array from a list containing a complex object, it will store your data as Python types and you won’t be able to take advantage of NumPy’s optimisations.

SH

>python
>>> import numpy as np
>>> a = np.array([0.5, 5])
>>> type(a[0])
<class 'numpy.float64'>
>>> type(a[1])
<class 'numpy.float64'>
>>> b = np.array([0.5, 5,{"foo":5}])
>>> type(b[0])
<class 'float'>
>>> type(b[1])
<class 'int'>
>>> type(b[2])
<class 'dict'>

The below example demonstrates the overhead of mixing Python lists and NumPy functions.

SH

# Python list, numpy.random.choice()
>python -m timeit -s "import numpy; ls = list(range(10000))" "numpy.random.choice(ls)"
1000 loops, best of 5: 267 usec per loop

# NumPy array, numpy.random.choice()
>python -m timeit -s "import numpy; ar = numpy.arange(10000)" "numpy.random.choice(ar)"
50000 loops, best of 5: 4.06 usec per loop

Passing a Python list to numpy.random.choice() is 65.6x slower than passing a NumPy array. This is the additional overhead of converting the list to an array. If this function were called multiple times, it would make sense to transform the list to an array before calling the function so that overhead is only paid once.

Callout

SH

# Python list, Manually select 1 item
>python -m timeit -s "import numpy; ls = list(range(10000))" "ls[numpy.random.randint(len(ls))]"
200000 loops, best of 5: 1.19 usec per loop

# NumPy array, Manually select 1 item
>python -m timeit -s "import numpy; ar = numpy.arange(10000)" "ar[numpy.random.randint(len(ar))]"
200000 loops, best of 5: 1.22 usec per loop

Regardless, for this simple application of numpy.random.choice(), merely using numpy.random.randint(len()) is around 4x faster regardless whether a Python list or NumPy array is used.

With numpy.random.choice() being such a general function (it has many possible parameters), there is significant internal branching. If you don’t require this advanced functionality and are calling a function regularly, it can be worthwhile considering using a more limited function.

There is however a trade-off, using numpy.random.choice() can be clearer to someone reading your code, and is more difficult to use incorrectly.

Vectorisation

The manner by which NumPy stores data in arrays enables it’s functions to utilise vectorisation, whereby the processor executes one instruction across multiple variables simultaneously, for every mathematical operation between arrays.

Earlier in this episode it was demonstrated that using core Python methods over a list, will outperform a loop performing the same calculation faster. The below example takes this a step further by demonstrating the calculation of dot product.

PYTHON

from timeit import timeit

N = 1000000  # Number of elements in list

gen_list = f"ls = list(range({N}))"
gen_array = f"import numpy;ar = numpy.arange({N}, dtype=numpy.int64)"

py_sum_ls = "sum([i*i for i in ls])"
py_sum_ar = "sum(ar*ar)"
np_sum_ar = "numpy.sum(ar*ar)"
np_dot_ar = "numpy.dot(ar, ar)"

repeats = 1000
print(f"python_sum_list: {timeit(py_sum_ls, setup=gen_list, number=repeats):.2f}ms")
print(f"python_sum_array: {timeit(py_sum_ar, setup=gen_array, number=repeats):.2f}ms")
print(f"numpy_sum_array: {timeit(np_sum_ar, setup=gen_array, number=repeats):.2f}ms")
print(f"numpy_dot_array: {timeit(np_dot_ar, setup=gen_array, number=repeats):.2f}ms")
  • python_sum_list uses list comprehension to perform the multiplication, followed by the Python core sum(). This comes out at 46.93ms
  • python_sum_array instead directly multiplies the two arrays, taking advantage of NumPy’s vectorisation. But uses the core Python sum(), this comes in slightly faster at 33.26ms.
  • numpy_sum_array again takes advantage of NumPy’s vectorisation for the multiplication, and additionally uses NumPy’s sum() implementation. These two rounds of vectorisation provide a much faster 1.44ms completion.
  • numpy_dot_array instead uses NumPy’s dot() to calculate the dot product in a single operation. This comes out the fastest at 0.29ms, 162x faster than python_sum_list.

OUTPUT

python_sum_list: 46.93ms
python_sum_array: 33.26ms
numpy_sum_array: 1.44ms
numpy_dot_array: 0.29ms

Parallel NumPy

NumPy can sometimes take advantage of auto parallelisation, particularly on HPC systems.

A small number of functions are backed by BLAS and LAPACK, enabling even greater speedup.

The supported functions mostly correspond to linear algebra operations.

The auto-parallelisation of these functions is hardware dependant, so you won’t always automatically get the additional benefit of parallelisation. However, HPC systems should be primed to take advantage, so try increasing the number of cores you request when submitting your jobs and see if it improves the performance.

This might be why numpy_dot_array is that much faster than numpy_sum_array in the previous example!

vectorize()

Python’s map() was introduced earlier, for applying a function to all elements within a list. NumPy provides vectorize() an equivalent for operating over it’s arrays.

This doesn’t actually make use of processor-level vectorisation, from the documentation:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.

The below example demonstrates how the performance of vectorize() is only marginally faster than map().

PYTHON

N = 100000  # Number of elements in list/array

def genArray():
    return numpy.arange(N)

def plus_one(x):
    return x + 1
    
def python_map():
    ar = genArray()
    return list(map(plus_one, ar))

def numpy_vectorize():
    ar = genArray()
    return numpy.vectorize(plus_one)(ar)

repeats = 1000
gentime = timeit(genArray, number=repeats)
print(f"python_map: {timeit(python_map, number=repeats)-gentime:.2f}ms")
print(f"numpy_vectorize: {timeit(numpy_vectorize, number=repeats)-gentime:.2f}ms")

OUTPUT

python_map: 7.94ms
numpy_vectorize: 7.80ms

Using Pandas (Effectively)


Pandas is the most common Python package used for scientific computing when working with tabular data akin to spreadsheets (DataFrames).

Similar to NumPy, Pandas enables greater performance than pure Python implementations when used correctly, however incorrect usage can actively harm performance.

Operating on Rows


Pandas’ methods by default operate on columns. Each column or series can be thought of as a NumPy array, highly suitable for vectorisation.

Following the theme of this episode, iterating over the rows of a data frame using a for loop is not advised. The pythonic iteration will be slower than other approaches.

Pandas allows it’s own methods to be applied to rows in many cases by passing axis=1, where available these functions should be preferred over manual loops. Where you can’t find a suitable method, apply() can be used, which is similar to map()/vectorize(), to apply your own function to rows.

PYTHON

from timeit import timeit
import pandas
import numpy

N = 100000  # Number of rows in DataFrame

def genDataFrame():
    numpy.random.seed(12)  # Ensure each dataframe is identical
    return pandas.DataFrame(
    {
        "f_vertical": numpy.random.random(size=N),
        "f_horizontal": numpy.random.random(size=N),
        # todo some spurious columns
    })

def pythagoras(row):
    return (row["f_vertical"]**2 + row["f_horizontal"]**2)**0.5
    
def for_range():
    rtn = []
    df = genDataFrame()
    for row_idx in range(df.shape[0]):
        row = df.iloc[row_idx]
        rtn.append(pythagoras(row))
    return pandas.Series(rtn)

def for_iterrows():
    rtn = []
    df = genDataFrame()
    for row_idx, row in df.iterrows():
        rtn.append(pythagoras(row))
    return pandas.Series(rtn)
    
def pandas_apply():
    df = genDataFrame()
    return df.apply(pythagoras, axis=1)

repeats = 100
gentime = timeit(genDataFrame, number=repeats)
print(f"for_range: {timeit(for_range, number=repeats)*10-gentime:.2f}ms")
print(f"for_iterrows: {timeit(for_iterrows, number=repeats)*10-gentime:.2f}ms")
print(f"pandas_apply: {timeit(pandas_apply, number=repeats)*10-gentime:.2f}ms")

apply() is 3x faster than the two for approaches, as it avoids the Python for loop.

OUTPUT

for_range: 1582.47ms
for_iterrows: 1677.14ms
pandas_apply: 390.49ms

However, rows don’t exist in memory as arrays (columns do!), so apply() does not take advantage of NumPys vectorisation. You may be able to go a step further and avoid explicitly operating on rows entirely by passing only the required columns to NumPy.

PYTHON

def vectorize():
    df = genDataFrame()
    return pandas.Series(numpy.sqrt(numpy.square(df["f_vertical"]) + numpy.square(df["f_horizontal"])))
    
print(f"vectorize: {timeit(vectorize, number=repeats)-gentime:.2f}ms")

264x faster than apply(), 1000x faster than for iterrows()!

vectorize: 1.48ms

It won’t always be possible to take full advantage of vectorisation, for example you may have conditional logic.

An alternate approach is converting your dataframe to a Python dictionary using to_dict(orient='index'). This creates a nested dictionary, where each row of the outer dictionary is an internal dictionary. This can then be processed via list-comprehension:

PYTHON

def to_dict():
    df = genDataFrame()
    df_as_dict = df.to_dict(orient='index')
    return pandas.Series([(r['f_vertical']**2 + r['f_horizontal']**2)**0.5 for r in df_as_dict.values()])

print(f"to_dict: {timeit(to_dict, number=repeats)*10-gentime:.2f}ms")

Whilst still nearly 100x slower than pure vectorisation, it’s twice as fast as apply().

SH

to_dict: 131.15ms

This is because indexing into Pandas’ Series (rows) is significantly slower than a Python dictionary. There is a slight overhead to creating the dictionary (40ms in this example), however the stark difference in access speed is more than enough to overcome that cost for any large dataframe.

PYTHON

from timeit import timeit
import pandas as pandas

N = 100000  # Number of rows in DataFrame

def genInput():
    s = pandas.Series({'a' : 1, 'b' : 2})
    d = {'a' : 1, 'b' : 2}
    return s, d

def series():
    s, _ = genInput()
    for i in range(N):
        y = s['a'] * s['b']

def dictionary():
    _, d = genInput()
    for i in range(N):
        y = d['a'] * d['b']

repeats = 1000
print(f"series: {timeit(series, number=repeats):.2f}ms")
print(f"dictionary: {timeit(dictionary, number=repeats):.2f}ms")

65x slower!

OUTPUT

series: 237.25ms
dictionary: 3.63ms

Filter Early


If you can filter your rows before processing, rather than after, you may significantly reduce the amount of processing and memory used.

Key Points

  • Python is an interpreted language, this adds an additional overhead at runtime to the execution of Python code. Many core Python and NumPy functions are implemented in faster C/C++, free from this overhead.
  • NumPy can take advantage of vectorisation to process arrays, which can greatly improve performance.
  • Pandas’ data tables store columns as arrays, therefore operations applied to columns can take advantage of NumPys vectorisation.

Content from Keep Python & Packages up to Date


Last updated on 2024-03-28 | Edit this page

Estimated time: 10 minutes

Overview

Questions

  • Why would a newer version of Python or a package be faster?
  • Are there any risks to updating Python and packages?
  • How can reproducibility be ensured through package upgrades?

Objectives

  • Able to explain why using the latest versions of Python and packages is beneficial.
  • Able to identify when updating is not possible due to incompatibilities.
  • Able to ensure code remains reproducible through package changes.

Introduction


It’s important to use the latest Python wherever feasible. In addition to new features and fixes, much work has been completed over the lifetime of Python 3 to improve the performance of the language.

Python 3.11 is between 10-60% faster than Python 3.10. On average, we measured a 1.25x speedup on the standard benchmark suite.

Future proposals, such as changes to the JIT and GIL will provide further improvements to performance.

Similarly, major packages with a performance focus such as NumPy and Pandas, should be kept up to date for the same reasons.

These improvements are often free, requiring minimal changes to any code (unlike the jump from Python 2 to Python 3).

Performance regressions within major packages should be considered rare, they often track performance alongside their test suites.

However, the more packages and language features your code touches, and the older the Python it currently uses, the greater chance of incompatibilities making it difficult to upgrade.

Similar to optimising, when updating it’s important to have tests in place to validate the correctness of your code before and after changes. An update to a single small dependent package could introduce a breaking change. This could cause your code to crash, or worse subtly change your results.

Updating Python & Packages


This isn’t as relevant if you’re starting from scratch. Simply make sure you’ve installed the latest Python before you start.

Key Points

  • Where feasible, the latest version of Python and packages should be used as they can include significant free improvements to the performance of your code.
  • There is a risk that updating Python or packages will not be possible to due to version incompatibilities or will require breaking changes to your code.
  • Changes to packages may impact results output by your code, ensure you have a method of validation ready prior to attempting upgrades.

Content from Understanding Memory


Last updated on 2024-03-28 | Edit this page

Estimated time: 30 minutes

Overview

Questions

  • How does a CPU look for a variable it requires?
  • What impact do cache lines have on memory accesses?
  • Why is it faster to read/write a single 100mb file, than 100 1mb files?

Objectives

  • Able to explain, at a high-level, how memory accesses occur during computation and how this impacts optimisation considerations.
  • Able to identify the relationship between different latencies relevant to software.

Accessing Variables


The storage and movement of data plays a large role in the performance of executing software.

Modern computer’s typically have a single processor (CPU), within this processor there are multiple processing cores each capable of executing different code in parallel.

Data held in memory by running software is exists in RAM, this memory is faster to access than hard drives (and solid-state drives). But the CPU has much smaller caches on-board, to make accessing the most recent variables even faster.

An annotated photo of inside a desktop computer's case. The CPU, RAM, power supply, graphics cards (GPUs) and harddrive are labelled.
An annotated photo of a computer’s hardware.

When reading a variable, to perform an operation with it, the CPU will first look in it’s registers. These exist per core, they are the location that computation is actually performed. Accessing them is incredibly fast, but there only exists enough storage for around 32 variables (typical number, e.g. 4 bytes). As the register file is so small, most variables won’t be found and the CPU’s caches will be searched. It will first check the current processing core’s L1 (Level 1) cache, this small cache (typically 64 KB per physical core) is the smallest and fastest to access cache on a CPU. If the variable is not found in the L1 cache, the L2 cache that is shared between multiple cores will be checked. This shared cache, is slower to access but larger than L1 (typically 1-3MB per core). This process then repeats for the L3 cache which may be shared among all cores of the CPU. This cache again has higher latency to access, but increased size (typically slightly larger than the total L2 cache size). If the variable has not been found in any of the CPU’s cache, the CPU will look to the computer’s RAM. This is an order of magnitude slower to access, with several orders of magnitude greater capacity (tens to hundreds of GB are now standard).

Correspondingly, the earlier the CPU finds the variable the faster it will be to access. However, to fully understand the cache’s it’s necessary to explain what happens once a variable has been found.

If a variable is not found in the caches, so must be fetched from RAM. The full 64 byte cache line containing the variable, will be copied first into the CPU’s L3, then L2 and then L1. Most variables are only 4 or 8 bytes, so many neighbouring variables are also pulled into the caches. Similarly, adding new data to a cache evicts old data. This means that reading 16 integers contiguously stored in memory, should be faster than 16 scattered integers

Therefore, to optimally access variables they should be stored contiguously in memory with related data and worked on whilst they remain in caches. If you add to a variable, perform large amount of unrelated processing, then add to the variable again it will likely have been evicted from caches and need to be reloaded from slower RAM again.

It’s not necessary to remember this full detail of how memory access work within a computer, but the context perhaps helps understand why memory locality is important.

An abstract representation of a CPU, RAM and Disk, showing their internal caches and the pathways data can pass.
An abstract diagram showing the path data takes from disk or RAM to be used for computation.

Callout

Python as a programming language, does not give you enough control to carefully pack your variables in this manner (every variable is an object, so it’s stored as a pointer that redirects to the actual data stored elsewhere).

However all is not lost, packages such as numpy and pandas implemented in C/C++ enable Python users to take advantage of efficient memory accesses (when they are used correctly).

Accessing Disk


When accessing data on disk (or network), a very similar process is performed to that between CPU and RAM when accessing variables.

When reading data from a file, it transferred from the disk, to the disk cache, to the RAM. The latency to access files on disk is another order of magnitude higher than accessing RAM.

As such, disk accesses similarly benefit from sequential accesses and reading larger blocks together rather than single variables. Python’s io package is already buffered, so automatically handles this for you in the background.

However before a file can be read, the file system on the disk must be polled to transform the file path to it’s address on disk to initiate the transfer (or throw an exception).

Following the common theme of this episode, the cost of accessing randomly scattered files can be significantly slower than accessing a single larger file of the same size. This is because for each file accessed, the file system must be polled to transform the file path to an address on disk. Traditional hard disk drives particularly suffer, as the read head must physically move to locate data.

Hence, it can be wise to avoid storing outputs in many individual files and to instead create a larger output file.

This is even visible outside of your own code. If you try to upload/download 1 GB to HPC. The transfer will be significantly faster, assuming good internet bandwidth, if that’s a single file rather than thousands.

The below example code runs a small benchmark, whereby 10MB is written to disk and read back whilst being timed. In one case this is as a single file, and the other, 1000 file segments.

PYTHON

import os, time

# Generate 10MB
data_len = 10000000
data = os.urandom(data_len)
file_ct = 1000
file_len = int(data_len/file_ct)
# Write one large file
start = time.perf_counter()
large_file = open("large.bin", "wb")
large_file.write(data)
large_file.close ()
large_write_s = time.perf_counter() - start
# Write multiple small files
start = time.perf_counter()
for i in range(file_ct):
    small_file = open(f"small_{i}.bin", "wb")
    small_file.write(data[file_len*i:file_len*(i+1)])
    small_file.close()
small_write_s = time.perf_counter() - start
# Read back the large file
start = time.perf_counter()
large_file = open("large.bin", "rb")
t = large_file.read(data_len)
large_file.close ()
large_read_s = time.perf_counter() - start
# Read back the small files
start = time.perf_counter()
for i in range(file_ct):
    small_file = open(f"small_{i}.bin", "rb")
    t = small_file.read(file_len)
    small_file.close()
small_read_s = time.perf_counter() - start
# Print Summary
print(f"{1:5d}x{data_len/1000000}MB Write: {large_write_s:.5f} seconds")
print(f"{file_ct:5d}x{file_len/1000}KB Write: {small_write_s:.5f} seconds")
print(f"{1:5d}x{data_len/1000000}MB Read: {large_read_s:.5f} seconds")
print(f"{file_ct:5d}x{file_len/1000}KB Read: {small_read_s:.5f} seconds")
print(f"{file_ct:5d}x{file_len/1000}KB Write was {small_write_s/large_write_s:.1f} slower than 1x{data_len/1000000}MB Write")
print(f"{file_ct:5d}x{file_len/1000}KB Read was {small_read_s/large_read_s:.1f} slower than 1x{data_len/1000000}MB Read")
# Cleanup
os.remove("large.bin")
for i in range(file_ct):
    os.remove(f"small_{i}.bin")

Running this locally, with an SSD I received the following timings.

SH

    1x10.0MB Write: 0.00198 seconds
 1000x10.0KB Write: 0.14886 seconds
    1x10.0MB Read: 0.00478 seconds
 1000x10.0KB Read: 2.50339 seconds
 1000x10.0KB Write was 75.1 slower than 1x10.0MB Write
 1000x10.0KB Read was 523.9 slower than 1x10.0MB Read

Repeated runs show some noise to the timing, however the slowdown is consistently the same order of magnitude slower when split across multiple files.

You might not even be reading 1000 different files. You could be reading the same file multiple times, rather than reading it once and retaining it in memory during execution. An even greater overhead would apply.

Latency Overview


Latency can have a big impact on the speed that a program executes, the below graph demonstrates this. Note the log scale!

A horizontal bar chart displaying the relative latencies for L1/L2/L3 cache, RAM, SSD, HDD and a packet being sent from London to California and back. These latencies range from 1 nanosecond to 140 milliseconds and are displayed with a log scale.
A graph demonstrating the wide variety of latencies a programmer may experience when accessing data.

The lower the latency typically the higher the effective bandwidth. L1 and L2 cache have 1TB/s, RAM 100GB/s, SSDs upto 32 GB/s, HDDs upto 150MB/s. Making large memory transactions even slower.

Memory Allocation is not Free


When a variable is created, memory must be located for it, potentially requested from the operating system. This gives it an overhead versus reusing existing allocations, or avoiding redundant temporary allocations entirely.

Within Python memory is not explicitly allocated and deallocated, instead it is automatically allocated and later “garbage collected”. The costs are still there, this just means that Python programmers have less control over where they occur.

The below implementation of the heat-equation, reallocates out_grid, a large 2 dimensional (500x500) list each time update() is called which progresses the model.

PYTHON

import time
grid_shape = (512, 512)

def update(grid, a_dt):
    x_max, y_max = grid_shape
    out_grid = [[0.0 for x in range(y_max)] * y_max for x in range(x_max)]
    for i in range(x_max):
        for j in range(y_max):
            out_xx = grid[(i-1)%x_max][j] - 2 * grid[i][j] + grid[(i+1)%x_max][j]
            out_yy = grid[i][(j-1)%y_max] - 2 * grid[i][j] + grid[i][(j+1)%y_max]
            out_grid[i][j] = grid[i][j] + (out_xx + out_yy) * a_dt 
    return out_grid
    
def heat_equation(steps):
    x_max, y_max = grid_shape
    grid = [[0.0] * y_max for x in range(x_max)]
    # Init central point to diffuse
    grid[int(x_max/2)][int(y_max/2)] = 1.0
    # Run steps
    for i in range(steps):
        grid = update(grid, 0.1)

heat_equation(100)

Line profiling demonstrates that function takes up over 55 seconds of the total runtime, with the cost of allocating the temporary out_grid list to be 39.3% of the total runtime of that function!

OUTPUT

Total time: 55.4675 s
File: heat_equation.py
Function: update at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           @profile
     4                                           def update(grid, a_dt):
     5       100        127.7      1.3      0.0      x_max, y_max = grid_shape
     6       100   21822304.9 218223.0     39.3      out_grid = [[0.0 for x in range(y_max)] * y_max for x in range(x_m…
     7     51300       7741.9      0.2      0.0      for i in range(x_max):
     8  26265600    3632718.1      0.1      6.5          for j in range(y_max):
     9  26214400   11207717.9      0.4     20.2              out_xx = grid[(i-1)%x_max][j] - 2 * grid[i][j] + grid[(i+1…
    10  26214400   11163116.5      0.4     20.1              out_yy = grid[i][(j-1)%y_max] - 2 * grid[i][j] + grid[i][(…
    11  26214400    7633720.1      0.3     13.8              out_grid[i][j] = grid[i][j] + (out_xx + out_yy) * a_dt
    12       100         27.8      0.3      0.0      return out_grid

If instead out_grid is double buffered, such that two buffers are allocated outside the function, which are swapped after each call to update().

PYTHON

import time
grid_shape = (512, 512)

def update(grid, a_dt, out_grid):
    x_max, y_max = grid_shape
    for i in range(x_max):
        for j in range(y_max):
            out_xx = grid[(i-1)%x_max][j] - 2 * grid[i][j] + grid[(i+1)%x_max][j]
            out_yy = grid[i][(j-1)%y_max] - 2 * grid[i][j] + grid[i][(j+1)%y_max]
            out_grid[i][j] = grid[i][j] + (out_xx + out_yy) * a_dt 
    
def heat_equation(steps):
    x_max, y_max = grid_shape
    grid = [[0.0 for x in range(y_max)] for x in range(x_max)]
    out_grid = [[0.0 for x in range(y_max)] for x in range(x_max)]  # Allocate a second buffer once
    # Init central point to diffuse
    grid[int(x_max/2)][int(y_max/2)] = 1.0
    # Run steps
    for i in range(steps):
        update(grid, 0.1, out_grid)  # Pass the output buffer
        grid, out_grid = out_grid, grid  # Swap buffers

heat_equation(100)

The total time reduces to 34 seconds, reducing the runtime by 39% inline with the removed allocation.

OUTPUT

Total time: 34.0597 s
File: heat_equation.py
Function: update at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           @profile
     4                                           def update(grid, a_dt, out_grid):
     5       100         43.5      0.4      0.0      x_max, y_max = grid_shape
     6     51300       7965.8      0.2      0.0      for i in range(x_max):
     7  26265600    3569519.4      0.1     10.5          for j in range(y_max):
     8  26214400   11291491.6      0.4     33.2              out_xx = grid[(i-1)%x_max][j] - 2 * grid[i][j] + grid[(i+1…
     9  26214400   11409533.7      0.4     33.5              out_yy = grid[i][(j-1)%y_max] - 2 * grid[i][j] + grid[i][(…
    10  26214400    7781156.4      0.3     22.8              out_grid[i][j] = grid[i][j] + (out_xx + out_yy) * a_dt

Key Points

  • Sequential accesses to memory (RAM or disk) will be faster than random or scattered accesses.
    • This is not always natively possible in Python without the use of packages such as NumPy and Pandas
  • One large file is preferable to many small files.
  • Memory allocation is not free, avoiding destroying and recreating objects can improve performance.

Content from Optimisation Conclusion


Last updated on 2024-09-26 | Edit this page

Estimated time: 5 minutes

Overview

Questions

  • What has been learnt about writing performant Python?

Objectives

  • Review what has been learnt about writing performant Python

This concludes the optimisation portion of the course.

An overview of how Python operates and the most important practices for achieving performant code have been introduced.

Hopefully with the information from this course you will be in a better position to investigate and optimise the performance of your own code.

This course’s website can be used as a reference manual when profiling your own code.

Key Points

  • Data Structures & Algorithms
    • List comprehension should be preferred when constructing lists.
    • Where appropriate, Tuples and Generator functions should be preferred over Python lists.
    • Dictionaries and sets are appropriate for storing a collection of unique data with no intrinsic order for random access.
    • When used appropriately, dictionaries and sets are significantly faster than lists.
    • If searching a list or array is required, it should be sorted and searched using bisect_left() (binary search).
  • Minimise Python Written
    • Python is an interpreted language, this adds an additional overhead at runtime to the execution of Python code. Many core Python and NumPy functions are implemented in faster C/C++, free from this overhead.
    • NumPy can take advantage of vectorisation to process arrays, which can greatly improve performance.
    • Pandas’ data tables store columns as arrays, therefore operations applied to columns can take advantage of NumPys vectorisation.
  • Newer is Often Faster
    • Where feasible, the latest version of Python and packages should be used as they can include significant free improvements to the performance of your code.
    • There is a risk that updating Python or packages will not be possible to due to version incompatibilities or will require breaking changes to your code.
    • Changes to packages may impact results output by your code, ensure you have a method of validation ready prior to attempting upgrades.
  • How the Computer Hardware Affects Performance
    • Sequential accesses to memory (RAM or disk) will be faster than random or scattered accesses.
      • This is not always natively possible in Python without the use of packages such as NumPy and Pandas
    • One large file is preferable to many small files.
    • Memory allocation is not free, avoiding destroying and recreating objects can improve performance.