Content from Reproducible Research


Last updated on 2025-05-26

Estimated time: 18 minutes

Overview

Questions

  • What does it mean to be “reproducible”?
  • How is “reproducibility” different from “reuse”?

Objectives

  • Understand the concepts of reproducibility and reuse
  • Be able to describe what is needed for a computational environment to be reproducible.

Introduction


Modern scientific analyses are complex software and logistical workflows that may span multiple software environments and require heterogeneous software and computing infrastructure. Researchers need to keep track of all of this to do their research and to ensure the validity of their work, which can be difficult. Scientific software makes this work possible, but software isn’t a static resource: it is continually developed, revised, and released, which can introduce large breaking changes or subtle computational differences in outputs and results. Having the software you’re using change without you intending it, whether from day to day, run to run, or machine to machine, is a problem for high quality research. It can cause software bugs, introduce errors in scientific results, and make findings unreproducible. None of these things are desirable!

Callout

When discussing “software” in this lesson we will primarily mean open source software that is openly developed. However, there are situations in which software might (for good reason) be:

  • Closed development with open source release artifacts
  • Closed development and closed source with public binary distributions
  • Closed development and closed source with proprietary licenses

Ask the participants to discuss the question in small groups of 2 to 4 people at their table for 3 minutes and then to share their group’s thoughts.

What are other challenges to reproducible research?

There are many! Here are some you might have thought of:

  • (Not having) Access to data
  • Required software packages being removed from mutable package indexes
  • Unreproducible builds of software that isn’t packaged and distributed on public package indexes
  • Analysis code not being under version control
  • Not having any environment definition configuration files

What did you come up with?

Computational reproducibility


“Reproducible” research can mean many things and is a multipronged problem. This lesson will focus primarily on computational reproducibility. Like all forms of reproducibility, there are multiple “levels” of reproducibility. For this lesson we will focus on “full” reproducibility, meaning that reproducible software environments will:

  • Be defined through high level user configuration files.
  • Have machine produced hash level lock files with a full definition of all software in the environment.
  • Specify target computer platforms for all environments solved.
  • Have the resolution and “solving” of a platform’s environments be machine agnostic.
  • Have the software packages defined in the environments exist on immutable public package indexes.

Hardware accelerated environments

Software that involves hardware acceleration on computing resources like GPUs requires additional information to be provided for full computational reproducibility. In addition to the computer platform, information about the hardware acceleration device, its supported drivers, and compatible hardware accelerated versions of the software in the environment (GPU enabled builds) is required. Traditionally this has been very difficult to do, but multiple recent technological advancements (made possible by social agreements and collaborations) in the scientific open source world now provide solutions to these problems.

What are possible challenges of reproducible hardware accelerated environments?

Here are some you might have thought of:

  • Installing hardware acceleration drivers and libraries on the machine with the GPU
  • Knowing what drivers are supported for the available GPUs
  • Providing instructions to install the same drivers and libraries on multiple computing platforms
  • Having the “deployment” machine’s resources and environment where the analysis is done match the “development” machine’s environment

What did you come up with?

Does computational reproducibility mean that the exact same numeric results should be achieved every time?

Not necessarily. Even though the computational software environment is identical there are things that can change between runs of software that could slightly change numerical results (e.g. random number generation seeds, file read order, machine entropy). This isn’t necessarily a problem, and in general one should be more concerned with getting answers that make sense within application uncertainties than matching down to machine precision.
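As a small illustration of this point, the following Python sketch (standard library only; the seed values and the three sample numbers are arbitrary choices made for this example, not part of the lesson) shows two of the effects listed above. A fixed random seed makes a pseudo-random sequence repeatable, and accumulating the same floating point values in a different order, as could happen if files are read in a different order, can change the last digits of the result while still agreeing within a sensible tolerance.

```python
import math
import random


def sample(seed, n=3):
    """Draw n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Same seed: the sequence is identical on every run.
same_seed_matches = sample(seed=42) == sample(seed=42)

# Different seeds: the sequences differ.
different_seed_matches = sample(seed=42) == sample(seed=43)

# Floating point addition is not associative, so the order in which
# values are accumulated can change the last bits of the result.
values = [0.1, 0.2, 0.3]
forward = (values[0] + values[1]) + values[2]
backward = (values[2] + values[1]) + values[0]

print("same seed reproduces the sequence:", same_seed_matches)
print("different seeds give the same sequence:", different_seed_matches)
print("bitwise identical sums:", forward == backward)
print("equal within tolerance:", math.isclose(forward, backward))
```

The two sums are not bitwise identical, but `math.isclose` treats them as equal, which mirrors the advice to compare results within application uncertainties rather than at machine precision.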

What are additional reasons you thought of?

Computational reproducibility vs. scientific reuse


Aiming for computational reproducibility is the first step to making scientific research more beneficial to us. For the purposes of a single analysis this should be the primary goal. However, just because a software environment is fully reproducible does not mean that the research is automatically reusable. Reuse means that the tools and components of a scientific workflow are composable and can interoperate to form new workflows. The steps of a workflow might exist in radically different computational environments and require different software, or different versions of the same software tools. Given these demands, reproducible computational software environments are a first step toward fully reusable scientific workflows.

This lesson will focus on computational reproducibility of hardware accelerated scientific workflows (e.g. machine learning). Scientifically reusable analysis workflows are a more extensive topic, but this lesson will link to references on the topic.

What are challenges to your own research practices to making them reproducible and reusable?

  • Technical expertise in reproducibility technologies
  • Time to learn new tools
  • Balancing reproducibility concerns with using tools the entire research team can understand

What did you come up with?

Key Points

  • Modern scientific research is complex and requires software environments.
  • Computational reproducibility helps to enable reproducible science, but is not sufficient by itself.
  • Reproducible computational software environments that use hardware acceleration require additional information.
  • New technologies make all of these processes easier.
  • Reproducible computational software environments are a first step toward fully reusable scientific workflows but are not sufficient by themselves.

Content from Introduction to Pixi


Last updated on 2025-06-02

Estimated time: 45 minutes

Overview

Questions

  • What is Pixi?
  • How does Pixi enable fully reproducible software environments?
  • What are Pixi’s semantics?

Objectives

  • Learn Pixi’s workflow design
  • Understand the relationship between a Pixi manifest and a lock file
  • Understand how to create a multi-platform and multi-environment Pixi workspace

Pixi


As described in the previous section on computational reproducibility, to have reproducible software environments we need tools that can take high level human writeable environment configuration files and produce machine readable hash level lock files that exactly specify every piece of software that exists in an environment.

Pixi is a cross-platform package and environment manager that can handle complex development workflows. Importantly, Pixi automatically and non-optionally produces or updates a lock file for the software environments defined by the user whenever any action mutates the environment. Pixi is written in Rust, and leverages the language’s speed and technologies to solve environments fast.

Pixi addresses the concept of computational reproducibility by focusing on a set of main features:

  1. Virtual environment management: Pixi can create environments that contain conda packages and Python packages and use or switch between environments easily.
  2. Package management: Pixi enables the user to install, update, and remove packages from these environments through the pixi command line.
  3. Task management: Pixi has a task runner system built-in, which allows for tasks with custom logic and dependencies on other tasks to be created.

combined with robust behaviors:

  1. Automatic lock files: Any changes to a Pixi workspace that can mutate the environments defined in it will automatically and non-optionally result in the Pixi lock file for the workspace being updated. This ensures that any and every state of a Pixi project is trivially computationally reproducible.
  2. Solving environments for other platforms: Pixi allows the user to solve environments for platforms other than the current user machine’s. This allows users to solve and share environments with any collaborator with confidence that all environments will work with no additional setup.
  3. Parity of conda and Python packages: Pixi allows for conda packages and Python packages to be used together seamlessly, and is unique in its ability to handle overlap in dependencies between them. Pixi will first solve all conda package requirements for the target environment, lock the environment, and then solve all the dependencies of the Python packages for the environment, determine if there are any overlaps with the existing conda environment, and then only install the missing Python dependencies. This allows for fully reproducible solves and for the two package ecosystems to complement each other rather than potentially cause conflicts.
  4. Efficient caching: Pixi uses an extremely efficient global caching scheme. This means that the first time a package is installed on a machine with Pixi is the slowest it will ever be to install it for any future project on the machine while the cache is still active.

Project-based workflows


Pixi uses a “project-based” workflow which scopes a workspace and the installed tooling for a project to the project’s directory tree.

Pros

  • Environments in the workspace are isolated to the project and cannot cause conflicts with any tools or projects outside of the project.
  • A high level declarative syntax allows for users to state only what they need, making even complex environments easy to understand and share.
  • Environments can be treated as transient and be fully deleted and then rebuilt within seconds without worrying about breaking other projects. This allows for much greater freedom of exploration and development without fear.

Cons

  • As each project has its own version of its packages installed, and does not share a copy with other projects, the total amount of disk space on a machine can be larger than other forms of development workflows. This can be mitigated for disk limited machines by cleaning environments not in use while keeping their lock files and cleaning the system cache periodically.
  • Each project needs to be set up by itself and does not reuse components of previous projects.

Pixi project files and the CLI API basics


Every Pixi project begins with creating a manifest file. A manifest file is a declarative configuration file that lists the high level requirements of a project. Pixi then takes those requirements and constraints and solves for the full dependency tree.

Let’s create our first Pixi project. First, to have a uniform directory tree experience, create a directory called pixi-lesson under your home directory on your machine and navigate to it.

BASH

mkdir -p ~/pixi-lesson
cd ~/pixi-lesson

Then use pixi init to create a new project directory and initialize a Pixi manifest with your machine’s configuration.

BASH

pixi init example

OUTPUT

Created /home/<username>/pixi-lesson/example/pixi.toml

Navigate to the example directory and check the directory structure

BASH

cd example
ls -a

OUTPUT

.  ..  .gitattributes  .gitignore  pixi.toml

We see that Pixi has set up Git configuration files for the project as well as a Pixi manifest pixi.toml file. Checking the default manifest file, we see three TOML tables: workspace, tasks, and dependencies.

BASH

cat pixi.toml

TOML

[workspace]
authors = ["Your Name <your email from your global Git config>"]
channels = ["conda-forge"]
name = "example"
# This will be whatever your machine's platform is
platforms = ["linux-64"]
version = "0.1.0"

[tasks]

[dependencies]

  • workspace: Defines metadata and properties for the entire project.
  • tasks: Defines tasks for the task runner system to execute from the command line and their dependencies.
  • dependencies: Defines the conda package dependencies from the channels in your workspace table.

Callout

For the rest of the lesson we’ll ignore the authors list in our discussions as it is optional and will be specific to you.

At the moment there are no dependencies defined in the manifest, so let’s add Python using the pixi add CLI API.

BASH

pixi add python

OUTPUT

✔ Added python >=3.13.3,<3.14

What happened? We saw that python got added, and we can see that the pixi.toml manifest now contains python as a dependency

BASH

cat pixi.toml

TOML

[workspace]
channels = ["conda-forge"]
name = "example"
# This will be whatever your machine's platform is
platforms = ["linux-64"]
version = "0.1.0"

[tasks]

[dependencies]
python = ">=3.13.3,<3.14"
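The constraint `>=3.13.3,<3.14` reads as "at least 3.13.3, but below 3.14". The Python sketch below spells that out with a deliberately simplified check. The helper names `parse_version` and `satisfies` are hypothetical, only the `>=` and `<` operators that appear in this lesson are handled, and only purely numeric versions are supported. The real conda version matching rules used during a solve are richer than this.

```python
def parse_version(text):
    """Turn '3.13.3' into (3, 13, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, spec):
    """Check a version against a spec such as '>=3.13.3,<3.14'.

    Simplified on purpose: every comma-separated clause must hold,
    and only the '>=' and '<' operators are supported.
    """
    candidate = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        if clause.startswith(">="):
            if not candidate >= parse_version(clause[2:]):
                return False
        elif clause.startswith("<") and not clause.startswith("<="):
            if not candidate < parse_version(clause[1:]):
                return False
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


SPEC = ">=3.13.3,<3.14"
for version in ["3.13.2", "3.13.3", "3.13.9", "3.14.0"]:
    print(version, satisfies(version, SPEC))
```

Under this reading a later 3.13 patch release would still satisfy the manifest, while any 3.14 release would not. The lock file, not the manifest, is what pins the exact version.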

Further, we also now see that a pixi.lock lock file has been created in the project directory as well as a .pixi/ directory.

BASH

ls -a

OUTPUT

.  ..  .gitattributes  .gitignore  .pixi  pixi.lock  pixi.toml

The .pixi/ directory contains the installed environments. We can see that at the moment there is just one environment named default

BASH

ls .pixi/envs/

OUTPUT

default

Inside the .pixi/envs/default/ directory are all the libraries, header files, and executables that are needed by the environment.

The pixi.lock lock file contains YAML that defines all requested conda package dependencies in the manifest, as well as their dependencies, at the exact versions that were solved for. For each package it records the full URL on the conda package index to download from, along with digest information for the exact package file, so that that version, and only that version, will be downloaded and installed in the future. We can even test that now by deleting the installed environment fully with pixi clean and then getting it back (bit for bit) in a few seconds with pixi install.

BASH

pixi clean

OUTPUT

  removed /home/<username>/pixi-lesson/example/.pixi/envs

BASH

pixi install

OUTPUT

✔ The default environment has been installed.
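The idea behind the digest information in the lock file can be sketched in a few lines of Python. This is an illustration of digest checking in general and not Pixi's implementation. SHA-256 is assumed as the digest algorithm, the byte string is a hypothetical stand-in for a downloaded package archive, and the helper names are made up for this example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def matches_lock(data: bytes, locked_digest: str) -> bool:
    """True only if `data` hashes to the digest recorded at solve time."""
    return sha256_hex(data) == locked_digest


# Hypothetical stand-in for the bytes of a downloaded package archive.
package_bytes = b"pretend this is a conda package archive"

# The digest as it would have been recorded when the environment was solved.
locked_digest = sha256_hex(package_bytes)

# A later download of exactly the same bytes passes the check ...
print("unchanged archive accepted:", matches_lock(package_bytes, locked_digest))

# ... while an archive that differs by even one byte does not.
tampered_bytes = package_bytes + b"!"
print("altered archive accepted:", matches_lock(tampered_bytes, locked_digest))
```

Because any change to the file changes its digest, recording digests is what lets a lock file specify "this exact file" rather than just "this version number".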

We can also see all the packages that were installed and are now available for us to use with pixi list

BASH

pixi list

OUTPUT

Package           Version    Build               Size       Kind   Source
_libgcc_mutex     0.1        conda_forge         2.5 KiB    conda  https://conda.anaconda.org/conda-forge/
_openmp_mutex     4.5        2_gnu               23.1 KiB   conda  https://conda.anaconda.org/conda-forge/
bzip2             1.0.8      h4bc722e_7          246.9 KiB  conda  https://conda.anaconda.org/conda-forge/
ca-certificates   2025.4.26  hbd8a1cb_0          148.7 KiB  conda  https://conda.anaconda.org/conda-forge/
ld_impl_linux-64  2.43       h712a8e2_4          655.5 KiB  conda  https://conda.anaconda.org/conda-forge/
libexpat          2.7.0      h5888daf_0          72.7 KiB   conda  https://conda.anaconda.org/conda-forge/
libffi            3.4.6      h2dba641_1          56.1 KiB   conda  https://conda.anaconda.org/conda-forge/
libgcc            15.1.0     h767d61c_2          809.7 KiB  conda  https://conda.anaconda.org/conda-forge/
libgcc-ng         15.1.0     h69a702a_2          33.8 KiB   conda  https://conda.anaconda.org/conda-forge/
libgomp           15.1.0     h767d61c_2          442 KiB    conda  https://conda.anaconda.org/conda-forge/
liblzma           5.8.1      hb9d3cd8_1          110.2 KiB  conda  https://conda.anaconda.org/conda-forge/
libmpdec          4.0.0      hb9d3cd8_0          89 KiB     conda  https://conda.anaconda.org/conda-forge/
libsqlite         3.50.0     hee588c1_0          897.5 KiB  conda  https://conda.anaconda.org/conda-forge/
libuuid           2.38.1     h0b41bf4_0          32.8 KiB   conda  https://conda.anaconda.org/conda-forge/
libzlib           1.3.1      hb9d3cd8_2          59.5 KiB   conda  https://conda.anaconda.org/conda-forge/
ncurses           6.5        h2d0b736_3          870.7 KiB  conda  https://conda.anaconda.org/conda-forge/
openssl           3.5.0      h7b32b05_1          3 MiB      conda  https://conda.anaconda.org/conda-forge/
python            3.13.3     hf636f53_101_cp313  31.7 MiB   conda  https://conda.anaconda.org/conda-forge/
python_abi        3.13       7_cp313             6.8 KiB    conda  https://conda.anaconda.org/conda-forge/
readline          8.2        h8c095d6_2          275.9 KiB  conda  https://conda.anaconda.org/conda-forge/
tk                8.6.13     noxft_hd72426e_102  3.1 MiB    conda  https://conda.anaconda.org/conda-forge/
tzdata            2025b      h78e105d_0          120.1 KiB  conda  https://conda.anaconda.org/conda-forge/

Extending the manifest

Let’s extend this manifest to add the Python library numpy and the Jupyter tools notebook and jupyterlab as dependencies and add a task called lab that will launch Jupyter Lab in the current working directory.

Let’s start at the command line and add the additional dependencies with pixi add

BASH

pixi add numpy notebook jupyterlab

Then let’s manually edit the pixi.toml with a text editor to add a task named lab that when called executes jupyter lab.

The resulting pixi.toml manifest is

TOML

[workspace]
channels = ["conda-forge"]
name = "example"
platforms = ["linux-64"]
version = "0.1.0"

[tasks.lab]
description = "Launch JupyterLab"
cmd = "jupyter lab"

[dependencies]
python = ">=3.13.3,<3.14"
numpy = ">=2.2.6,<3"
notebook = ">=7.4.3,<8"
jupyterlab = ">=4.4.3,<5"

With our new dependencies added to the project manifest and our lab task defined, let’s use all of them together by launching our task using pixi run

BASH

pixi run lab

and we see that Jupyter Lab launches!

Here we used pixi run to execute tasks in the workspace’s environments without ever explicitly activating the environment. This is a different behavior compared to tools like conda or Python virtual environments, where the assumption is that you have activated an environment before using it. With Pixi we can do the equivalent with pixi shell, which starts a subshell in the current working directory with the Pixi environment activated.

BASH

pixi shell

Notice how your shell prompt now has (example) (the workspace name) preceding it, signaling to you that you’re in the activated environment. You can now directly run commands that use the environment.

BASH

python

OUTPUT

Python 3.13.3 | packaged by conda-forge | (main, Apr 14 2025, 20:44:03) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

As we’re in a subshell, to exit the environment and move back to the shell that launched the subshell, just exit the shell.

BASH

exit

Multi platform projects

Extend your project to additionally support the linux-64 and osx-arm64 platforms.

Using the pixi workspace CLI API, one can add the platforms with

BASH

pixi workspace platform add linux-64 osx-arm64

This both adds the platforms to the workspace platforms list and solves for the platforms and updates the lock file!

One can also manually edit the pixi.toml with a text editor to add the desired platforms to the platforms list.

The resulting pixi.toml manifest is

TOML

[workspace]
channels = ["conda-forge"]
name = "example"
platforms = ["linux-64", "osx-arm64"]
version = "0.1.0"

[tasks.lab]
description = "Launch JupyterLab"
cmd = "jupyter lab"

[dependencies]
python = ">=3.13.3,<3.14"
numpy = ">=2.2.6,<3"
notebook = ">=7.4.3,<8"
jupyterlab = ">=4.4.3,<5"
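If you are unsure which platform name applies to a given machine, the following Python sketch maps the values reported by the standard library `platform` module to conda-style platform names such as the `linux-64` and `osx-arm64` entries used above. The lookup table and the `platform_name` helper are hypothetical and intentionally incomplete. They cover a handful of common combinations and are not how Pixi detects the platform.

```python
import platform

# (operating system, CPU architecture) -> conda-style platform name.
# Intentionally incomplete: only a few common combinations are listed.
PLATFORM_NAMES = {
    ("Linux", "x86_64"): "linux-64",
    ("Linux", "aarch64"): "linux-aarch64",
    ("Darwin", "x86_64"): "osx-64",
    ("Darwin", "arm64"): "osx-arm64",
    ("Windows", "AMD64"): "win-64",
}


def platform_name(system, machine):
    """Look up the platform name, or return None if it is not in the table."""
    return PLATFORM_NAMES.get((system, machine))


# Report the name for the machine this script is running on.
print(platform_name(platform.system(), platform.machine()))
```

A collaborator on an Apple silicon laptop would see `osx-arm64` here, which is why adding that platform to the workspace lets them install from the same lock file.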

So far the Pixi project has only had one environment defined in it. We can make the project multi-environment by first defining a new “feature”, which provides the fields needed to define part of an environment that extends the default environment. We can create a new feature named dev and then create an environment, also named dev, that uses the dev feature to extend the default environment:

TOML

[workspace]
channels = ["conda-forge"]
name = "example"
platforms = ["linux-64", "osx-arm64"]
version = "0.1.0"

[tasks.lab]
description = "Launch JupyterLab"
cmd = "jupyter lab"

[dependencies]
python = ">=3.13.3,<3.14"
numpy = ">=2.2.6,<3"
notebook = ">=7.4.3,<8"
jupyterlab = ">=4.4.3,<5"

[feature.dev.dependencies]

[environments]
dev = ["dev"]

Callout

The pixi workspace CLI can also be used to add existing features to environments

BASH

pixi workspace environment add --feature dev dev

We can now add pre-commit to the dev feature’s dependencies and have it be accessible in the dev environment.

BASH

pixi add --feature dev pre-commit

OUTPUT

✔ Added pre-commit >=4.2.0,<5
Added these only for feature: dev

TOML

[workspace]
channels = ["conda-forge"]
name = "example"
platforms = ["linux-64", "osx-arm64"]
version = "0.1.0"

[tasks.lab]
description = "Launch JupyterLab"
cmd = "jupyter lab"

[dependencies]
python = ">=3.13.3,<3.14"
numpy = ">=2.2.6,<3"
notebook = ">=7.4.3,<8"
jupyterlab = ">=4.4.3,<5"

[feature.dev.dependencies]
pre-commit = ">=4.2.0,<5"

[environments]
dev = ["dev"]

This now allows us to specify the environment we want tasks to run in with the --environment flag

BASH

pixi run --environment dev pre-commit --help

BASH

pixi shell --environment dev

Key Points

  • Pixi uses a project based workflow and a declarative project manifest file to define project operations.
  • Pixi automatically creates or updates a hash level lock file anytime the project manifest or dependencies are mutated.
  • Pixi allows for multi-platform and multi-environment projects to be defined in a single project manifest and be fully described in a single lock file.