Using Pixi environments on HTC Systems

Last updated on 2025-06-16

Estimated time: 90 minutes

Overview

Questions

  • How can you run workflows that use GPUs with Pixi CUDA environments?
  • What solutions exist for the resources you have?

Objectives

  • Learn how to submit containerized workflows to HTC systems.

High Throughput Computing (HTC)


One of the most common forms of production computing is high-throughput computing (HTC), where computational problems are distributed across multiple computing resources to parallelize computations and reduce total compute time. HTC resources are quite dynamic, but usually focus on smaller memory and disk requirements on each individual worker compute node. This is in contrast to high-performance computing (HPC), where there are comparatively fewer compute nodes but the capabilities and associated memory, disk, and bandwidth resources of each node are much higher.

Two of the most common workload management systems used for HTC are HTCondor and SLURM.
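
If you are not sure which system a given cluster runs, one quick check is to see which submission command is available on the login node (a rough heuristic, not a guarantee):

BASH

# HTCondor access points provide condor_submit, SLURM clusters provide sbatch
command -v condor_submit sbatch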

Setting up a problem


First let’s create a computing problem to apply these compute systems to.

Let’s start by creating a new project in our Git repository

BASH

pixi init ~/pixi-lesson/htcondor
cd ~/pixi-lesson/htcondor

OUTPUT

✔ Created /home/<username>/pixi-lesson/htcondor/pixi.toml

Training a PyTorch model on the MNIST dataset

Let’s write a very standard tutorial example of training a deep neural network on the MNIST dataset with PyTorch and then run it on GPUs in an HTCondor worker pool.

Mea culpa, more interesting examples exist

More exciting examples will be used in the future, but MNIST is one of the simplest examples with which to illustrate the point.

The neural network code

We’ll download Python code that uses a convolutional neural network written in PyTorch to learn to identify the handwritten digits of the MNIST dataset and place it under a src/ directory. This is a modified example from the PyTorch documentation (https://github.com/pytorch/examples/blob/main/mnist/main.py), which is licensed under the BSD 3-Clause license.

BASH

curl -sLO https://raw.githubusercontent.com/matthewfeickert/nvidia-gpu-ml-library-test/c7889222544928fb6f9fdeb1145767272b5cfec8/torch_MNIST.py
mkdir -p src
mv torch_MNIST.py src/

The Pixi environment

Now let’s think about what we need to use this code. Looking at the imports of src/torch_MNIST.py we can see that torch and torchvision are the only imported libraries that aren’t part of the Python standard library, so we will need to depend on PyTorch and torchvision. We also know that we’d like to use CUDA accelerated code, so we’ll also need CUDA libraries and versions of PyTorch that support CUDA.
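
If you want to double-check the imports yourself, you can list the script’s import statements (a quick grep; the exact lines depend on the version of the script you downloaded):

BASH

grep -E "^(import|from)" src/torch_MNIST.py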

Create the environment

Create a Pixi workspace that:

  • Has PyTorch and torchvision in it.
  • Has the ability to support CUDA v12.
  • Has an environment that has the CPU version of PyTorch and torchvision that can be installed on linux-64, osx-arm64, and win-64.
  • Has an environment that has the GPU version of PyTorch and torchvision.

This is just expanding the exercises from the CUDA conda packages episode.

Let’s first add all the platforms we want to work with to the workspace

BASH

pixi workspace platform add linux-64 osx-arm64 win-64

OUTPUT

✔ Added linux-64
✔ Added osx-arm64
✔ Added win-64

We know that in both environments we’ll want to use Python, so we can install it in the default environment and have it be used in both the cpu and gpu environments.

BASH

pixi add python

OUTPUT

✔ Added python >=3.13.5,<3.14

Let’s now add the CPU requirements to a feature named cpu

BASH

pixi add --feature cpu pytorch-cpu torchvision

OUTPUT

✔ Added pytorch-cpu
✔ Added torchvision
Added these only for feature: cpu

and then create an environment named cpu with that feature

BASH

pixi workspace environment add --feature cpu cpu

OUTPUT

✔ Added environment cpu

and record specific versions for its dependencies. Afterwards the pixi.toml looks like the following

BASH

pixi upgrade --feature cpu

TOML

[workspace]
channels = ["conda-forge"]
name = "htcondor"
platforms = ["linux-64", "osx-arm64", "win-64"]
version = "0.1.0"

[tasks]

[dependencies]
python = ">=3.13.5,<3.14"

[feature.cpu.dependencies]
pytorch-cpu = ">=2.7.0,<3"
torchvision = ">=0.22.0,<0.23"

[environments]
cpu = ["cpu"]

Now let’s add the GPU environment and dependencies. Let’s start with the CUDA system requirements

BASH

pixi workspace system-requirements add --feature gpu cuda 12

Override the __cuda virtual package

Remember that if you’re on a platform that doesn’t satisfy the system requirement, you’ll need to override the check in order to solve the environment.

BASH

export CONDA_OVERRIDE_CUDA=12

and create an environment from the feature

BASH

pixi workspace environment add --feature gpu gpu

OUTPUT

✔ Added environment gpu

and then add the GPU dependencies for the target platform of linux-64 (where we’ll run in production). Afterwards the full pixi.toml looks like the listing below.

BASH

pixi add --platform linux-64 --feature gpu pytorch-gpu torchvision

OUTPUT

✔ Added pytorch-gpu >=2.7.0,<3
✔ Added torchvision >=0.22.0,<0.23
Added these only for platform(s): linux-64
Added these only for feature: gpu

TOML

[workspace]
channels = ["conda-forge"]
name = "htcondor"
platforms = ["linux-64", "osx-arm64", "win-64"]
version = "0.1.0"

[tasks]

[dependencies]
python = ">=3.13.5,<3.14"

[feature.cpu.dependencies]
pytorch-cpu = ">=2.7.0,<3"
torchvision = ">=0.22.0,<0.23"

[feature.gpu.system-requirements]
cuda = "12"

[feature.gpu.target.linux-64.dependencies]
pytorch-gpu = ">=2.7.0,<3"
torchvision = ">=0.22.0,<0.23"

[environments]
cpu = ["cpu"]
gpu = ["gpu"]

To validate that things are working with the CPU code, let’s do a short training run for only 2 epochs in the cpu environment.

BASH

pixi run --environment cpu python src/torch_MNIST.py --epochs 2 --save-model --data-dir data

OUTPUT

100.0%
100.0%
100.0%
100.0%
Train Epoch: 1 [0/60000 (0%)]	Loss: 2.329474
Train Epoch: 1 [640/60000 (1%)]	Loss: 1.425185
Train Epoch: 1 [1280/60000 (2%)]	Loss: 0.826808
Train Epoch: 1 [1920/60000 (3%)]	Loss: 0.556883
Train Epoch: 1 [2560/60000 (4%)]	Loss: 0.483756
...
Train Epoch: 2 [57600/60000 (96%)]	Loss: 0.146226
Train Epoch: 2 [58240/60000 (97%)]	Loss: 0.016065
Train Epoch: 2 [58880/60000 (98%)]	Loss: 0.003342
Train Epoch: 2 [59520/60000 (99%)]	Loss: 0.001542

Test set: Average loss: 0.0351, Accuracy: 9874/10000 (99%)
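
Because we passed --save-model, the script also serializes the trained model to the working directory. The submit description file later in this episode expects it to be named mnist_cnn.pt, so we can check that it was written (assuming the script’s default output name):

BASH

ls -lh mnist_cnn.pt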

Running multiple ways

What’s another way we could have run this other than with pixi run?

You can enter a shell environment first

BASH

pixi shell --environment cpu
python src/torch_MNIST.py --epochs 2 --save-model --data-dir data

The Linux container

Let’s write a Dockerfile that installs the gpu environment into the container image when built.

Write the Dockerfile

Write a Dockerfile that will install the gpu environment and only the gpu environment into the container image.

DOCKERFILE

FROM ghcr.io/prefix-dev/pixi:noble AS build

WORKDIR /app
COPY . .
ENV CONDA_OVERRIDE_CUDA=12
RUN pixi install --locked --environment gpu
RUN echo "#!/bin/bash" > /app/entrypoint.sh && \
    pixi shell-hook --environment gpu -s bash >> /app/entrypoint.sh && \
    echo 'exec "$@"' >> /app/entrypoint.sh

FROM ghcr.io/prefix-dev/pixi:noble AS production

WORKDIR /app
COPY --from=build /app/.pixi/envs/gpu /app/.pixi/envs/gpu
COPY --from=build /app/pixi.toml /app/pixi.toml
COPY --from=build /app/pixi.lock /app/pixi.lock
# The ignore files are needed for 'pixi run' to work in the container
COPY --from=build /app/.pixi/.gitignore /app/.pixi/.gitignore
COPY --from=build /app/.pixi/.condapackageignore /app/.pixi/.condapackageignore
COPY --from=build --chmod=0755 /app/entrypoint.sh /app/entrypoint.sh

ENTRYPOINT [ "/app/entrypoint.sh" ]
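
If you have Docker available locally, you can sanity-check the image before wiring it into CI. This is just a sketch: the tag name pixi-lesson-gpu is arbitrary, and the --gpus all command only works on a machine with an NVIDIA GPU and the NVIDIA Container Toolkit installed.

BASH

# build the image from the directory containing the Dockerfile, pixi.toml, and pixi.lock
docker build -t pixi-lesson-gpu .
# the entrypoint activates the gpu environment and then runs the given command
docker run --rm pixi-lesson-gpu python --version
# on a machine with an NVIDIA GPU and the NVIDIA Container Toolkit
docker run --rm --gpus all pixi-lesson-gpu nvidia-smi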

Building and deploying the container image

Now let’s add a GitHub Actions pipeline to build this Dockerfile and deploy it to a Linux container registry.

Build and deploy Linux container image to registry

Add a GitHub Actions pipeline that will build the Dockerfile and deploy it to GitHub Container Registry (ghcr).

Create the GitHub Actions workflow directory tree

BASH

mkdir -p .github/workflows

and then write a YAML file at .github/workflows/ci.yaml that contains the following:

YAML

name: Build and publish Docker images

on:
  push:
    branches:
      - main
    tags:
      - 'v*'
    paths:
      - 'htcondor/pixi.toml'
      - 'htcondor/pixi.lock'
      - 'htcondor/Dockerfile'
      - 'htcondor/.dockerignore'
  pull_request:
    paths:
      - 'htcondor/pixi.toml'
      - 'htcondor/pixi.lock'
      - 'htcondor/Dockerfile'
      - 'htcondor/.dockerignore'
  release:
    types: [published]
  workflow_dispatch:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

permissions: {}

jobs:
  docker:
    name: Build and publish images
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Docker meta
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: |
            ghcr.io/${{ github.repository }}
          # generate Docker tags based on the following events/attributes
          tags: |
            type=raw,value=hello-pytorch-noble-cuda-12.9
            type=raw,value=latest
            type=sha

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to GitHub Container Registry
        if: github.event_name != 'pull_request'
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Test build
        id: docker_build_test
        uses: docker/build-push-action@v6
        with:
          context: htcondor
          file: htcondor/Dockerfile
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          pull: true

      - name: Deploy build
        id: docker_build_deploy
        uses: docker/build-push-action@v6
        with:
          context: htcondor
          file: htcondor/Dockerfile
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          pull: true
          push: ${{ github.event_name != 'pull_request' }}
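
With the workflow in place, committing and pushing these files triggers the build. This is a sketch that assumes the repository root is ~/pixi-lesson, the Dockerfile lives in the htcondor/ directory, and the default branch is main:

BASH

cd ~/pixi-lesson
git add .github/workflows/ci.yaml htcondor/Dockerfile htcondor/pixi.toml htcondor/pixi.lock
git commit -m "Add GPU container build for HTCondor jobs"
git push origin main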

To verify that the published image is visible to other machines, install the Linux container utility crane

BASH

pixi global install crane

OUTPUT

└── crane: 0.20.5 (installed)
    └─ exposes: crane

and then use crane ls to list all of the tags for this particular image in your container registry

BASH

crane ls ghcr.io/<your GitHub username>/pixi-lesson
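
To inspect a specific tag, crane also provides a digest subcommand that prints the image digest it resolves to (the latest tag here comes from the metadata-action configuration above):

BASH

crane digest ghcr.io/<your GitHub username>/pixi-lesson:latest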

HTCondor


This episode will be on a remote system

All the computation in the rest of this episode will take place on a remote system with an HTC workflow manager.

To provide a very high level overview of HTCondor in this episode we’ll focus on only a few of its many resources and capabilities.

  1. Writing HTCondor execution scripts to define what the HTCondor worker nodes will actually do.
  2. Writing HTCondor submit description files to send our jobs to the HTCondor worker pool.
  3. Submitting those jobs with condor_submit and monitoring them with condor_q.

Connection between execution scripts and submit description files

As HTCondor execution scripts are given as the executable field in HTCondor submit description files, they are tightly linked and cannot be written fully independently. Though they are presented as separate steps above, in practice you will write them together.

Write the HTCondor execution script

Let’s start by writing the execution script mnist_gpu_docker.sh, thinking about how it relates to our code.

  • We’ll be running in the gpu environment that we defined with Pixi and built into our Docker container image.
  • For security reasons the HTCondor worker nodes don’t have unrestricted access to the internet, so we’ll need to transfer our input data and source code to the worker rather than download them on demand.
  • We’ll need to activate the environment using the /app/entrypoint.sh script we built into the Docker container image.

BASH

#!/bin/bash

# detailed logging to stderr
set -x

echo -e "# Hello CHTC from Job ${1} running on $(hostname)\n"
echo -e "# GPUs assigned: ${CUDA_VISIBLE_DEVICES}\n"

echo -e "# Activate Pixi environment\n"
# The last line of the entrypoint.sh file is 'exec "$@"'. If this shell script
# receives arguments, exec will interpret them as arguments to it, which is not
# intended. To avoid this, strip the last line of entrypoint.sh and source that
# instead.
. <(sed '$d' /app/entrypoint.sh)

echo -e "# Check to see if the NVIDIA drivers can correctly detect the GPU:\n"
nvidia-smi

echo -e "\n# Check if PyTorch can detect the GPU:\n"
python ./src/torch_detect_GPU.py

echo -e "\n# Extract the training data:\n"
if [ -f "MNIST_data.tar.gz" ]; then
    tar -vxzf MNIST_data.tar.gz
else
    echo "The training data archive, MNIST_data.tar.gz, is not found."
    echo "Please transfer it to the worker node in the HTCondor jobs submission file."
    exit 1
fi

echo -e "\n# Check that the training code exists:\n"
ls -1ap ./src/

echo -e "\n# Train MNIST with PyTorch:\n"
time python ./src/torch_MNIST.py --data-dir ./data --epochs 14 --save-model
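
Note that the execution script also calls src/torch_detect_GPU.py, which we have not yet added to the repository. The exact contents are up to you; a minimal sketch that just asks PyTorch what devices it can see could be created like this (only the file name is fixed by the execution script above):

BASH

cat > src/torch_detect_GPU.py <<'EOF'
# Minimal check that PyTorch can see the GPU assigned to the job
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device count: {torch.cuda.device_count()}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
EOF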

Write the HTCondor submit description file

This is pretty standard boilerplate taken from the HTCondor documentation:

# mnist_gpu_docker.sub
# Submit file to access the GPU via docker

# Set the "universe" to 'container' to use Docker
universe = container
# the container images are cached, and so if a container image tag is
# overwritten it will not be pulled again
container_image = docker://ghcr.io/<github user name>/pixi-lesson:sha-<sha>

# set the log, error and output files
log = mnist_gpu_docker.log.txt
error = mnist_gpu_docker.err.txt
output = mnist_gpu_docker.out.txt

# set the executable to run
executable = mnist_gpu_docker.sh
arguments = $(Process)

# transfer training data files to the compute node
transfer_input_files = MNIST_data.tar.gz,src

# transfer the serialized trained model back
transfer_output_files = mnist_cnn.pt

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

# We require a machine with a modern version of the CUDA driver
Requirements = (Target.CUDADriverVersion >= 12.0)

# We must request 1 CPU in addition to 1 GPU
request_cpus = 1
request_gpus = 1

# select some memory and disk space
request_memory = 2GB
request_disk = 2GB

# Opt in to using CHTC GPU Lab resources
+WantGPULab = true
# Specify short job type to run more GPUs in parallel
# Can also request "medium" or "long"
+GPUJobLength = "short"

# Tell HTCondor to run 1 instance of our job:
queue 1

Write the submission script

To make things easier, we can write a small job submission script, submit.sh, that prepares the training data and then hands the submit description file to HTCondor with condor_submit.

BASH

#!/bin/bash

# Download the training data locally to transfer to the worker node
if [ ! -f "MNIST_data.tar.gz" ]; then
    # c.f. https://github.com/CHTC/templates-GPUs/blob/450081144c6ae0657123be2a9a357cb432d9d394/shared/pytorch/MNIST_data.tar.gz
    curl -sLO https://raw.githubusercontent.com/CHTC/templates-GPUs/450081144c6ae0657123be2a9a357cb432d9d394/shared/pytorch/MNIST_data.tar.gz
fi

# Ensure existing models are backed up
if [ -f "mnist_cnn.pt" ]; then
    mv mnist_cnn.pt mnist_cnn_"$(date '+%Y-%m-%d-%H-%M')".pt.bak
fi

condor_submit mnist_gpu_docker.sub

Submitting the job

Before we actually submit code to run, we can submit an interactive job from the HTCondor system’s login nodes to check that things work as expected. The script below, interact.sh, mirrors submit.sh but calls condor_submit with the -interactive flag.

BASH

#!/bin/bash

# Download the training data locally to transfer to the worker node
if [ ! -f "MNIST_data.tar.gz" ]; then
    # c.f. https://github.com/CHTC/templates-GPUs/blob/450081144c6ae0657123be2a9a357cb432d9d394/shared/pytorch/MNIST_data.tar.gz
    curl -sLO https://raw.githubusercontent.com/CHTC/templates-GPUs/450081144c6ae0657123be2a9a357cb432d9d394/shared/pytorch/MNIST_data.tar.gz
fi

# Ensure existing models are backed up
if [ -f "mnist_cnn.pt" ]; then
    mv mnist_cnn.pt mnist_cnn_"$(date '+%Y-%m-%d-%H-%M')".pt.bak
fi

condor_submit -interactive mnist_gpu_docker.sub

Submitting the job for the first time will take a while, as the worker needs to pull down the container image, so be patient. The container image is cached, so future submissions will start faster.

BASH

bash interact.sh

OUTPUT

Submitting job(s).
1 job(s) submitted to cluster 2127828.
Waiting for job to start...
...
Welcome to interactive3_1@vetsigian0001.chtc.wisc.edu!
Your condor job is running with pid(s) 2368233.
groups: cannot find name for group ID 24433
groups: cannot find name for group ID 40092
I have no name!@vetsigian0001:/var/lib/condor/execute/slot3/dir_2367762$

We can now activate our environment manually and look around

BASH

. /app/entrypoint.sh

OUTPUT

(htcondor:gpu) I have no name!@vetsigian0001:/var/lib/condor/execute/slot3/dir_2367762$

BASH

command -v python

OUTPUT

/app/.pixi/envs/gpu/bin/python

BASH

python --version

OUTPUT

Python 3.13.5

BASH

pixi list pytorch

OUTPUT

Environment: gpu
Package      Version  Build                           Size      Kind   Source
pytorch      2.7.0    cuda126_mkl_py313_he20fe19_300  27.8 MiB  conda  https://conda.anaconda.org/conda-forge/
pytorch-gpu  2.7.0    cuda126_mkl_ha999a5f_300        46.1 KiB  conda  https://conda.anaconda.org/conda-forge/

BASH

pixi list cuda

OUTPUT

Environment: gpu
Package               Version  Build       Size       Kind   Source
cuda-crt-tools        12.9.86  ha770c72_1  28.2 KiB   conda  https://conda.anaconda.org/conda-forge/
cuda-cudart           12.9.79  h5888daf_0  22.7 KiB   conda  https://conda.anaconda.org/conda-forge/
cuda-cudart_linux-64  12.9.79  h3f2d84a_0  192.6 KiB  conda  https://conda.anaconda.org/conda-forge/
cuda-cuobjdump        12.9.82  hbd13f7d_0  237.5 KiB  conda  https://conda.anaconda.org/conda-forge/
cuda-cupti            12.9.79  h9ab20c4_0  1.8 MiB    conda  https://conda.anaconda.org/conda-forge/
cuda-nvcc-tools       12.9.86  he02047a_1  26.2 MiB   conda  https://conda.anaconda.org/conda-forge/
cuda-nvdisasm         12.9.88  hbd13f7d_0  5.3 MiB    conda  https://conda.anaconda.org/conda-forge/
cuda-nvrtc            12.9.86  h5888daf_0  64.1 MiB   conda  https://conda.anaconda.org/conda-forge/
cuda-nvtx             12.9.79  h5888daf_0  28.6 KiB   conda  https://conda.anaconda.org/conda-forge/
cuda-nvvm-tools       12.9.86  he02047a_1  23.1 MiB   conda  https://conda.anaconda.org/conda-forge/
cuda-version          12.9     h4f385c5_3  21.1 KiB   conda  https://conda.anaconda.org/conda-forge/

BASH

nvidia-smi

OUTPUT

Mon Jun 16 00:07:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:B2:00.0 Off |                  N/A |
| 29%   26C    P8             23W /  250W |       3MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

We can interactively run our code as well

BASH

tar -vxzf MNIST_data.tar.gz
time python ./src/torch_MNIST.py --data-dir ./data --epochs 2 --save-model

To return to the login node we just exit the interactive session

BASH

exit

Now to submit our job normally, we run the submit.sh script

BASH

bash submit.sh

OUTPUT

Submitting job(s).
1 job(s) submitted to cluster 2127879.

and its submission and state can be monitored with condor_q.

BASH

condor_q

OUTPUT



-- Schedd: ap2001.chtc.wisc.edu : <128.105.68.112:9618?... @ 06/15/25 19:16:17
OWNER     BATCH_NAME     SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
mfeickert ID: 2127879   6/15 19:13      _      1      _      1 2127879.0

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
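
If you want one row per job rather than the batched summary, or to query jobs after they have left the queue, condor_q -nobatch and condor_history are useful (the cluster ID below is the one reported by your own submission):

BASH

# show each job on its own line instead of the batch summary
condor_q -nobatch
# after the job leaves the queue, query its final status from the history
condor_history 2127879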

When the job finishes we see that HTCondor has returned to us the following files:

  • mnist_gpu_docker.log.txt: the HTCondor log file for the job
  • mnist_gpu_docker.out.txt: the stdout of all actions executed in the job
  • mnist_gpu_docker.err.txt: the stderr of all actions executed in the job
  • mnist_cnn.pt: the serialized trained PyTorch model
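
Back on your own machine you can verify that the returned model file loads in the cpu environment. This assumes the training script saves a state_dict named mnist_cnn.pt, as the upstream PyTorch MNIST example does:

BASH

pixi run --environment cpu python -c "import torch; state = torch.load('mnist_cnn.pt', map_location='cpu'); print(sorted(state.keys()))"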

Key Points

  • You can use containerized Pixi environments on HTC systems to run the CUDA accelerated code that you defined.