Content from Reproducible Research
Last updated on 2025-05-26
Estimated time: 18 minutes
Overview
Questions
- What does it mean to be “reproducible”?
- How is “reproducibility” different from “reuse”?
Objectives
- Understand the concepts of reproducibility and reuse
- Be able to describe what is needed for a computational environment to be reproducible.
Introduction
Modern scientific analyses are complex software and logistical workflows that may span multiple software environments and require heterogeneous software and computing infrastructure. Scientific researchers need to keep track of all of this to be able to do their research and to ensure the validity of their work, which can be difficult. Scientific software enables all of this work to happen, but software isn’t a static resource: software is continually developed, revised, and released, which can introduce large breaking changes or subtle computational differences in outputs and results. Having the software you’re using change unintentionally from day to day, run to run, or across machines is problematic when trying to do high quality research; it can introduce bugs, cause errors in scientific results, and make findings unreproducible. None of these outcomes is desirable!
Callout
When discussing “software” in this lesson we will primarily mean open source software that is openly developed. However, there are situations in which software might (for good reason) be:
- Closed development with open source release artifacts
- Closed development and closed source with public binary distributions
- Closed development and closed source with proprietary licenses
Ask the participants to discuss the question in small groups of 2 to 4 people at their table for 3 minutes and then to share their group’s thoughts.
What are other challenges to reproducible research?
There are many! Here are some you might have thought of:
- (Not having) Access to data
- Required software packages being removed from mutable package indexes
- Unreproducible builds of software that isn’t packaged and distributed on public package indexes
- Analysis code not being under version control
- Not having any environment definition configuration files
What did you come up with?
Computational reproducibility
“Reproducible” research can mean many things and is a multipronged problem. This lesson will focus primarily on computational reproducibility. Like all forms of reproducibility, there are multiple “levels” of reproducibility. For this lesson we will focus on “full” reproducibility, meaning that reproducible software environments will:
- Be defined through high level user configuration files.
- Have machine produced hash level lock files with a full definition of all software in the environment.
- Specify target computer platforms for all environments solved.
- Have the resolution and “solving” of a platform’s environments be machine agnostic.
- Have the software packages defined in the environments exist on immutable public package indexes.
Hardware accelerated environments
Software that involves hardware acceleration on computing resources like GPUs requires additional information to be provided for full computational reproducibility. In addition to the computer platform, information about the hardware acceleration device, its supported drivers, and compatible hardware accelerated versions of the software in the environment (GPU enabled builds) is required. Traditionally this has been very difficult to do, but multiple recent technological advancements (made possible by social agreements and collaborations) in the scientific open source world now provide solutions to these problems.
What are possible challenges of reproducible hardware accelerated environments?
Here are some you might have thought of:
- Installing hardware acceleration drivers and libraries on the machine with the GPU
- Knowing what drivers are supported for the available GPUs
- Providing instructions to install the same drivers and libraries on multiple computing platforms
- Having the “deployment” machine’s resources and environment where the analysis is done match the “development” machine’s environment
What did you come up with?
Does computational reproducibility mean that the exact same numeric results should be achieved every time?
Not necessarily. Even though the computational software environment is identical, there are things that can change between runs of software that could slightly change numerical results (e.g. random number generation seeds, file read order, machine entropy). This isn’t necessarily a problem, and in general one should be more concerned with getting answers that make sense within application uncertainties than with matching results down to machine precision.
What are additional reasons you thought of?
Computational reproducibility vs. scientific reuse
Aiming for computational reproducibility is the first step to making scientific research more beneficial to us. For the purposes of a single analysis this should be the primary goal. However, just because a software environment is fully reproducible does not mean that the research is automatically reusable. Reuse requires the tools and components of the scientific workflow to be composable pieces that can interoperate to form a larger workflow. The steps of the workflow might exist in radically different computational environments and require different software, or different versions of the same software tools. Given these demands, reproducible computational software environments are a first step toward fully reusable scientific workflows.
This lesson will focus on computational reproducibility of hardware accelerated scientific workflows (e.g. machine learning). Scientifically reusable analysis workflows are a more extensive topic, but this lesson will link to references on the topic.
What are challenges to your own research practices to making them reproducible and reusable?
- Technical expertise in reproducibility technologies
- Time to learn new tools
- Balancing reproducibility concerns with using tools the entire research team can understand
What did you come up with?
Key Points
- Modern scientific research is complex and requires software environments.
- Computational reproducibility helps to enable reproducible science, but is not sufficient by itself.
- Reproducible computational software environments that use hardware acceleration require additional information.
- New technologies make all of these processes easier.
- Reproducible computational software environments are a first step toward fully reusable scientific workflows but are not sufficient by themselves.
Content from Introduction to Pixi
Last updated on 2025-06-02
Estimated time: 45 minutes
Overview
Questions
- What is Pixi?
- How does Pixi enable fully reproducible software environments?
- What are Pixi’s semantics?
Objectives
- Learn Pixi’s workflow design
- Understand the relationship between a Pixi manifest and a lock file
- Understand how to create a multi-platform and multi-environment Pixi workspace
Pixi
As described in the previous section on computational reproducibility, to have reproducible software environments we need tools that can take high level human writeable environment configuration files and produce machine readable hash level lock files that exactly specify every piece of software that exists in an environment.
Pixi is a cross-platform package and environment manager that can handle complex development workflows. Importantly, Pixi automatically and non-optionally will produce or update a lock file for the software environments defined by the user whenever any actions mutate the environment. Pixi is written in Rust, and leverages the language’s speed and technologies to solve environments fast.
Pixi addresses computational reproducibility by focusing on a set of main features:
- Virtual environment management: Pixi can create environments that contain conda packages and Python packages and use or switch between environments easily.
- Package management: Pixi enables the user to install, update, and remove packages from these environments through the pixi command line.
- Task management: Pixi has a task runner system built-in, which allows for tasks with custom logic and dependencies on other tasks to be created.
combined with robust behaviors:
- Automatic lock files: Any changes to a Pixi workspace that can mutate the environments defined in it will automatically and non-optionally result in the Pixi lock file for the workspace being updated. This ensures that any and every state of a Pixi project is trivially computationally reproducible.
- Solving environments for other platforms: Pixi allows the user to solve environments for platforms other than the current machine’s. This allows users to solve and share environments with any collaborator with confidence that all environments will work with no additional setup.
- Parity of conda and Python packages: Pixi allows conda packages and Python packages to be used together seamlessly, and is unique in its ability to handle overlap in dependencies between them. Pixi will first solve all conda package requirements for the target environment, lock the environment, then solve all the dependencies of the Python packages for the environment, determine if there are any overlaps with the existing conda environment, and then only install the missing Python dependencies. This allows for fully reproducible solves and for the two package ecosystems to complement each other rather than potentially cause conflicts.
- Efficient caching: Pixi uses an extremely efficient global caching scheme. This means that the first time a package is installed on a machine with Pixi is the slowest it will ever be to install it for any future project on the machine, while the cache is still active.
Project-based workflows
Pixi uses a “project-based” workflow which scopes a workspace and the installed tooling for a project to the project’s directory tree.
Pros
- Environments in the workspace are isolated to the project and cannot cause conflicts with any tools or projects outside of the project.
- A high level declarative syntax allows for users to state only what they need, making even complex environments easy to understand and share.
- Environments can be treated as transient and be fully deleted and then rebuilt within seconds without worrying about breaking other projects. This allows for much greater freedom of exploration and development without fear.
Cons
- As each project has its own version of its packages installed, and does not share a copy with other projects, the total amount of disk space used on a machine can be larger than with other forms of development workflows. This can be mitigated for disk-limited machines by cleaning environments not in use while keeping their lock files, and by cleaning the system cache periodically.
- Each project needs to be set up by itself and does not reuse components of previous projects.
Pixi project files and the CLI API basics
Every Pixi project begins with creating a manifest file. A manifest file is a declarative configuration file that lists the high level requirements of a project. Pixi then takes those requirements and constraints and solves for the full dependency tree.
Let’s create our first Pixi project. First, to have a uniform directory tree experience, create a directory called pixi-lesson under your home directory on your machine and navigate to it.
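On a Unix-like shell (an assumption; any equivalent way of creating and entering the directory works) this can be done with:
BASH
mkdir -p ~/pixi-lesson
cd ~/pixi-lesson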
Then use pixi init to create a new project directory and initialize a Pixi manifest with your machine’s configuration.
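The output below shows the project is named example, so the invocation would have been along the lines of:
BASH
pixi init example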
OUTPUT
Created /home/<username>/pixi-lesson/example/pixi.toml
Navigate to the example directory and check the directory structure.
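For example (listing hidden files as well, since Pixi creates Git dotfiles):
BASH
cd example
ls -a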
OUTPUT
. .. .gitattributes .gitignore pixi.toml
We see that Pixi has set up Git configuration files for the project as well as a Pixi manifest pixi.toml file. Checking the default manifest file, we see three TOML tables: workspace, tasks, and dependencies.
TOML
[workspace]
authors = ["Your Name <your email from your global Git config>"]
channels = ["conda-forge"]
name = "example"
# This will be whatever your machine's platform is
platforms = ["linux-64"]
version = "0.1.0"
[tasks]
[dependencies]
- workspace: Defines metadata and properties for the entire project.
- tasks: Defines tasks for the task runner system to execute from the command line and their dependencies.
- dependencies: Defines the conda package dependencies from the channels in your workspace table.
Callout
For the rest of the lesson we’ll ignore the authors list in our discussions as it is optional and will be specific to you.
At the moment there are no dependencies defined in the manifest, so let’s add Python using the pixi add CLI API.
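Based on the output that follows, the command here is simply:
BASH
pixi add python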
OUTPUT
✔ Added python >=3.13.3,<3.14
What happened? We saw that python got added, and we can see that the pixi.toml manifest now contains python as a dependency.
TOML
[workspace]
channels = ["conda-forge"]
name = "example"
# This will be whatever your machine's platform is
platforms = ["linux-64"]
version = "0.1.0"
[tasks]
[dependencies]
python = ">=3.13.3,<3.14"
Further, we also now see that a pixi.lock lock file has been created in the project directory as well as a .pixi/ directory.
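Listing the directory contents again (any listing that shows hidden files works):
BASH
ls -a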
OUTPUT
. .. .gitattributes .gitignore .pixi pixi.lock pixi.toml
The .pixi/ directory contains the installed environments. We can see that at the moment there is just one environment, named default.
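The environments live under .pixi/envs/, so listing that directory shows them:
BASH
ls .pixi/envs/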
OUTPUT
default
Inside the .pixi/envs/default/ directory are all the libraries, header files, and executables that are needed by the environment.
The pixi.lock lock file contains YAML that defines all requested conda package dependencies in the manifest, as well as their dependencies, at the exact versions that were solved for. It provides their full URLs on the conda package index to download from, as well as digest information for the exact package, to ensure that it is exactly specified and that version, and only that version, will be downloaded and installed in the future. We can even test that now by deleting the installed environment fully with pixi clean and then getting it back (bit for bit) in a few seconds with pixi install.
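First, remove the installed environments (this does not touch the manifest or the lock file):
BASH
pixi clean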
OUTPUT
removed /home/<username>/pixi-lesson/example/.pixi/envs
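Then recreate the environment exactly from the lock file:
BASH
pixi install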
OUTPUT
✔ The default environment has been installed.
We can also see all the packages that were installed and are now available for us to use with pixi list.
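Running it in the project directory:
BASH
pixi list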
OUTPUT
Package Version Build Size Kind Source
_libgcc_mutex 0.1 conda_forge 2.5 KiB conda https://conda.anaconda.org/conda-forge/
_openmp_mutex 4.5 2_gnu 23.1 KiB conda https://conda.anaconda.org/conda-forge/
bzip2 1.0.8 h4bc722e_7 246.9 KiB conda https://conda.anaconda.org/conda-forge/
ca-certificates 2025.4.26 hbd8a1cb_0 148.7 KiB conda https://conda.anaconda.org/conda-forge/
ld_impl_linux-64 2.43 h712a8e2_4 655.5 KiB conda https://conda.anaconda.org/conda-forge/
libexpat 2.7.0 h5888daf_0 72.7 KiB conda https://conda.anaconda.org/conda-forge/
libffi 3.4.6 h2dba641_1 56.1 KiB conda https://conda.anaconda.org/conda-forge/
libgcc 15.1.0 h767d61c_2 809.7 KiB conda https://conda.anaconda.org/conda-forge/
libgcc-ng 15.1.0 h69a702a_2 33.8 KiB conda https://conda.anaconda.org/conda-forge/
libgomp 15.1.0 h767d61c_2 442 KiB conda https://conda.anaconda.org/conda-forge/
liblzma 5.8.1 hb9d3cd8_1 110.2 KiB conda https://conda.anaconda.org/conda-forge/
libmpdec 4.0.0 hb9d3cd8_0 89 KiB conda https://conda.anaconda.org/conda-forge/
libsqlite 3.50.0 hee588c1_0 897.5 KiB conda https://conda.anaconda.org/conda-forge/
libuuid 2.38.1 h0b41bf4_0 32.8 KiB conda https://conda.anaconda.org/conda-forge/
libzlib 1.3.1 hb9d3cd8_2 59.5 KiB conda https://conda.anaconda.org/conda-forge/
ncurses 6.5 h2d0b736_3 870.7 KiB conda https://conda.anaconda.org/conda-forge/
openssl 3.5.0 h7b32b05_1 3 MiB conda https://conda.anaconda.org/conda-forge/
python 3.13.3 hf636f53_101_cp313 31.7 MiB conda https://conda.anaconda.org/conda-forge/
python_abi 3.13 7_cp313 6.8 KiB conda https://conda.anaconda.org/conda-forge/
readline 8.2 h8c095d6_2 275.9 KiB conda https://conda.anaconda.org/conda-forge/
tk 8.6.13 noxft_hd72426e_102 3.1 MiB conda https://conda.anaconda.org/conda-forge/
tzdata 2025b h78e105d_0 120.1 KiB conda https://conda.anaconda.org/conda-forge/
Extending the manifest
Let’s extend this manifest to add the Python library numpy and the Jupyter tools notebook and jupyterlab as dependencies, and add a task called lab that will launch Jupyter Lab in the current working directory.
Let’s start at the command line and add the additional dependencies with pixi add.
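The packages can be added in one command (version constraints are picked automatically, as before):
BASH
pixi add numpy notebook jupyterlab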
Then let’s manually edit the pixi.toml with a text editor to add a task named lab that, when called, executes jupyter lab.
The resulting pixi.toml manifest is:
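The exact version pins depend on when you run pixi add; with the versions solved in this lesson (matching the manifest shown later), it would look roughly like:
TOML
[workspace]
channels = ["conda-forge"]
name = "example"
# This will be whatever your machine's platform is
platforms = ["linux-64"]
version = "0.1.0"
# The description field for a task is optional but helpful
[tasks.lab]
description = "Launch JupyterLab"
cmd = "jupyter lab"
[dependencies]
python = ">=3.13.3,<3.14"
numpy = ">=2.2.6,<3"
notebook = ">=7.4.3,<8"
jupyterlab = ">=4.4.3,<5"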
With our new dependencies added to the project manifest and our lab task defined, let’s use all of them together by launching our task using pixi run.
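The task is invoked by name:
BASH
pixi run lab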
And we see that Jupyter Lab launches!
Here we used pixi run to execute tasks in the workspace’s environments without ever explicitly activating the environment. This is a different behavior compared to tools like conda or Python virtual environments, where the assumption is that you have activated an environment before using it. With Pixi we can do the equivalent with pixi shell, which starts a subshell in the current working directory with the Pixi environment activated.
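To activate the environment in a subshell:
BASH
pixi shell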
Notice how your shell prompt now has (example) (the workspace name) preceding it, signaling to you that you’re in the activated environment. You can now directly run commands that use the environment.
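For example, starting the Python interpreter installed in the environment:
BASH
python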
OUTPUT
Python 3.13.3 | packaged by conda-forge | (main, Apr 14 2025, 20:44:03) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
As we’re in a subshell, to exit the environment and move back to the shell that launched the subshell, just exit the shell.
BASH
exit
Multi-platform projects
Extend your project to additionally support the linux-64 and osx-arm64 platforms.
Using the pixi workspace CLI API, one can add the platforms directly from the command line.
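With recent Pixi versions the relevant subcommand should be pixi workspace platform add (older releases expose the same functionality under pixi project platform add):
BASH
pixi workspace platform add linux-64 osx-arm64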
This both adds the platforms to the workspace platforms list and solves for the new platforms, updating the lock file!
One can also manually edit the pixi.toml with a text editor to add the desired platforms to the platforms list.
The resulting pixi.toml manifest is:
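Aside from the expanded platforms list, this should match the manifest shown below before the dev feature is added:
TOML
[workspace]
channels = ["conda-forge"]
name = "example"
platforms = ["linux-64", "osx-arm64"]
version = "0.1.0"
[tasks.lab]
description = "Launch JupyterLab"
cmd = "jupyter lab"
[dependencies]
python = ">=3.13.3,<3.14"
numpy = ">=2.2.6,<3"
notebook = ">=7.4.3,<8"
jupyterlab = ">=4.4.3,<5"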
So far the Pixi project has only had one environment defined in it. We can make the project multi-environment by first defining a new “feature”, which provides all the fields necessary to define part of an environment to extend the default environment. We can create a new feature named dev and then create an environment also named dev which uses the dev feature to extend the default environment:
TOML
[workspace]
channels = ["conda-forge"]
name = "example"
platforms = ["linux-64", "osx-arm64"]
version = "0.1.0"
[tasks.lab]
description = "Launch JupyterLab"
cmd = "jupyter lab"
[dependencies]
python = ">=3.13.3,<3.14"
numpy = ">=2.2.6,<3"
notebook = ">=7.4.3,<8"
jupyterlab = ">=4.4.3,<5"
[feature.dev.dependencies]
[environments]
dev = ["dev"]
We can now add pre-commit to the dev feature’s dependencies and have it be accessible in the dev environment.
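This is done with the --feature flag to pixi add:
BASH
pixi add --feature dev pre-commit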
OUTPUT
✔ Added pre-commit >=4.2.0,<5
Added these only for feature: dev
TOML
[workspace]
channels = ["conda-forge"]
name = "example"
platforms = ["linux-64", "osx-arm64"]
version = "0.1.0"
[tasks.lab]
description = "Launch JupyterLab"
cmd = "jupyter lab"
[dependencies]
python = ">=3.13.3,<3.14"
numpy = ">=2.2.6,<3"
notebook = ">=7.4.3,<8"
jupyterlab = ">=4.4.3,<5"
[feature.dev.dependencies]
pre-commit = ">=4.2.0,<5"
[environments]
dev = ["dev"]
This now allows us to specify the environment we want tasks to run in with the --environment flag.
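For example, a quick check that pre-commit is available in the dev environment (but not in the default one) might look like:
BASH
pixi run --environment dev pre-commit --version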
Key Points
- Pixi uses a project-based workflow and a declarative project manifest file to define project operations.
- Pixi automatically creates or updates a hash level lock file anytime the project manifest or dependencies are mutated.
- Pixi allows for multi-platform and multi-environment projects to be defined in a single project manifest and be fully described in a single lock file.