Using Pixi environments on HTC Systems

Last updated on 2025-12-05 | Edit this page

Overview

Questions

How can you run workflows that use GPUs with Pixi CUDA environments using HTCondor?
What solutions exist for the resources you have?

Objectives

Learn how to submit containerized workflows to HTC systems.

High Throughput Computing (HTC)

One of the most common forms of production computing is high-throughput computing (HTC), where computational problems as distributed across multiple computing resources to parallelize computations and reduce total compute time. HTC resources are quite dynamic, but usually focus on smaller memory and disk requirements on each individual worker compute node. This is in contrast to high-performance computing (HPC) where there are comparatively fewer compute nodes but the capabilities and associated memory, disk, and bandwidth resources are much higher.

The most common HTC workflow management system is HTCondor.

Setting up a problem

First let’s create a computing problem to apply these compute systems to.

Let’s first create a new project in our Git repository

BASH

pixi init ~/pixi-cuda-lesson/htcondor
cd ~/pixi-cuda-lesson/htcondor

OUTPUT

✔ Created ~/<username>/pixi-cuda-lesson/htcondor/pixi.toml

Training a PyTorch model on the MNIST dataset

Let’s write a very standard tutorial example of training a deep neral network on the MNIST dataset with PyTorch and then run it on GPUs in an HTCondor worker pool.

Caution

Mea culpa, more interesting examples exist

More exciting examples will be used in the future, but MNIST is perhaps one of the most simple examples to illustrate a point.

The neural network code

We’ll download Python code that uses a convocational neural network written in PyTorch to learn to identify the handwritten number of the MNIST dataset and place it under a src/ directory. This is a modified example from the PyTorch documentation which is licensed under the BSD 3-Clause license.

BASH

mkdir -p src
curl -sL https://github.com/matthewfeickert/nvidia-gpu-ml-library-test/raw/c7889222544928fb6f9fdeb1145767272b5cfec8/torch_MNIST.py -o ./src/torch_MNIST.py
curl -sL https://github.com/matthewfeickert/nvidia-gpu-ml-library-test/raw/36c725360b1b1db648d6955c27bd3885b29a3273/torch_detect_GPU.py -o ./src/torch_detect_GPU.py

The Pixi environment

Now let’s think about what we need to use this code. Looking at the imports of src/torch_MNIST.py we can see that torch and torchvision are the only imported libraries that aren’t part of the Python standard library, so we will need to depend on PyTorch and torchvision. We also know that we’d like to use CUDA accelerated code, so that we’ll need CUDA libraries and versions of PyTorch that support CUDA.

Challenge

Create the environment

Create a Pixi workspace that:

Has PyTorch and torchvision in it.
Has the ability to support CUDA v12.
Has an environment that has the CPU version of PyTorch and torchvision that can be installed on linux-64, osx-arm64, and win-64.
Has an environment that has the GPU version of PyTorch and torchvision that can be installed on linux-64, and win-64.

Show me the solution

This is just expanding the exercises from the CUDA conda packages episode.

Let’s first add all the platforms we want to work with to the workspace

BASH

pixi workspace platform add linux-64 osx-arm64 win-64

OUTPUT

✔ Added linux-64
✔ Added osx-arm64
✔ Added win-64

Let’s add the CPU requirements to a feature named cpu

BASH

pixi add --feature cpu pytorch-cpu torchvision

OUTPUT

✔ Added pytorch-cpu
✔ Added torchvision
Added these only for feature: cpu

and then create an environment named cpu with that feature

BASH

pixi workspace environment add --feature cpu cpu

OUTPUT

✔ Added environment cpu

and insatiate it with particular versions

BASH

pixi upgrade --feature cpu

TOML

[workspace]
channels = ["conda-forge"]
name = "htcondor"
platforms = ["linux-64", "osx-arm64", "win-64"]
version = "0.1.0"

[tasks]

[dependencies]

[feature.cpu.dependencies]
pytorch-cpu = ">=2.7.1,<3"
torchvision = ">=0.24.0,<0.25"

[environments]
cpu = ["cpu"]

Now let’s add the GPU environment and dependencies. Let’s start with the CUDA system requirements

BASH

pixi workspace system-requirements add --feature gpu cuda 12

and create an environment from the feature

BASH

pixi workspace environment add --feature gpu gpu

OUTPUT

✔ Added environment gpu

and then add the GPU dependencies for the target platform of linux-64 (where we’ll run in production).

BASH

pixi add --platform linux-64 --platform win-64 --feature gpu pytorch-gpu torchvision

OUTPUT

✔ Added pytorch-gpu >=2.7.1,<3
✔ Added torchvision >=0.24.0,<0.25
Added these only for platform(s): linux-64, win-64
Added these only for feature: gpu

TOML

[workspace]
channels = ["conda-forge"]
name = "htcondor"
platforms = ["linux-64", "osx-arm64", "win-64"]
version = "0.1.0"

[tasks]

[dependencies]

[feature.cpu.dependencies]
pytorch-cpu = ">=2.7.1,<3"
torchvision = ">=0.24.0,<0.25"

[feature.gpu.system-requirements]
cuda = "12"

[feature.gpu.target.linux-64.dependencies]
pytorch-gpu = ">=2.7.1,<3"
torchvision = ">=0.24.0,<0.25"

[feature.gpu.target.win-64.dependencies]
pytorch-gpu = ">=2.7.1,<3"
torchvision = ">=0.24.0,<0.25"

[environments]
cpu = ["cpu"]
gpu = ["gpu"]

Caution

Override the `__cuda` virtual package

Remember that if you’re on a platform that doesn’t support the system-requirement you’ll need to override the checks to install the environment.

BASH

export CONDA_OVERRIDE_CUDA=12
# or
# CONDA_OVERRIDE_CUDA=12 pixi <command>

To validate that things are working with the CPU code, let’s do a short training run for only 2 epochs in the cpu environment.

BASH

pixi run --environment cpu python src/torch_MNIST.py --epochs 2 --save-model --data-dir data

OUTPUT

100.0%
100.0%
100.0%
100.0%
Train Epoch: 1 [0/60000 (0%)]	Loss: 2.329474
Train Epoch: 1 [640/60000 (1%)]	Loss: 1.425185
Train Epoch: 1 [1280/60000 (2%)]	Loss: 0.826808
Train Epoch: 1 [1920/60000 (3%)]	Loss: 0.556883
Train Epoch: 1 [2560/60000 (4%)]	Loss: 0.483756
...
Train Epoch: 2 [57600/60000 (96%)]	Loss: 0.146226
Train Epoch: 2 [58240/60000 (97%)]	Loss: 0.016065
Train Epoch: 2 [58880/60000 (98%)]	Loss: 0.003342
Train Epoch: 2 [59520/60000 (99%)]	Loss: 0.001542

Test set: Average loss: 0.0351, Accuracy: 9874/10000 (99%)

Challenge

Running multiple ways

What’s another way we could have run this other than with pixi run?

Show me the solution

You can enter a shell environment first

BASH

pixi shell --environment cpu
python src/torch_MNIST.py --epochs 2 --save-model --data-dir data

The Linux container

Apptainer

Let’s write an Apptainer definition file that installs the gpu environment, and nothing else, into the container image when built.

Bootstrap: docker
From: ghcr.io/prefix-dev/pixi:noble
Stage: build

# %arguments have to be defined at each stage
%arguments
    CUDA_VERSION=12
    ENVIRONMENT=gpu

%files
./pixi.toml /app/
./pixi.lock /app/
./.gitignore /app/

%post
#!/bin/bash
export CONDA_OVERRIDE_CUDA={{ CUDA_VERSION }}
cd /app/
pixi info
pixi install --locked --environment {{ ENVIRONMENT }}
echo "#!/bin/bash" > /app/entrypoint.sh && \
pixi shell-hook --environment {{ ENVIRONMENT }} -s bash >> /app/entrypoint.sh && \
echo 'exec "$@"' >> /app/entrypoint.sh


Bootstrap: docker
From: ghcr.io/prefix-dev/pixi:noble
Stage: final

%arguments
    ENVIRONMENT=gpu

%files from build
/app/.pixi/envs/{{ ENVIRONMENT }} /app/.pixi/envs/{{ ENVIRONMENT }}
/app/pixi.toml /app/pixi.toml
/app/pixi.lock /app/pixi.lock
/app/.gitignore /app/.gitignore
# The ignore files are needed for 'pixi run' to work in the container
/app/.pixi/.gitignore /app/.pixi/.gitignore
/app/.pixi/.condapackageignore /app/.pixi/.condapackageignore
/app/entrypoint.sh /app/entrypoint.sh

%post
#!/bin/bash
cd /app/
pixi info
chmod +x /app/entrypoint.sh

%runscript
#!/bin/bash
/app/entrypoint.sh "$@"

%startscript
#!/bin/bash
/app/entrypoint.sh "$@"

%test
#!/bin/bash -e
. /app/entrypoint.sh
pixi info
pixi list

Docker

Let’s write a Dockerfile that installs the gpu environment into the container image when built.

Challenge

Write the Dockerfile

Write a Dockerfile that will install the gpu environment and only the gpu environment into the container image.

Show me the solution

DOCKERFILE

ARG CUDA_VERSION="12"
ARG ENVIRONMENT="gpu"

FROM ghcr.io/prefix-dev/pixi:noble AS build

ARG CUDA_VERSION
ARG ENVIRONMENT

WORKDIR /app
COPY . .
ENV CONDA_OVERRIDE_CUDA=$CUDA_VERSION
RUN pixi install --locked --environment $ENVIRONMENT
RUN echo "#!/bin/bash" > /app/entrypoint.sh && \
    pixi shell-hook --environment $ENVIRONMENT -s bash >> /app/entrypoint.sh && \
    echo 'exec "$@"' >> /app/entrypoint.sh

FROM ghcr.io/prefix-dev/pixi:noble AS final

ARG ENVIRONMENT

WORKDIR /app
COPY --from=build /app/.pixi/envs/$ENVIRONMENT /app/.pixi/envs/$ENVIRONMENT
COPY --from=build /app/pixi.toml /app/pixi.toml
COPY --from=build /app/pixi.lock /app/pixi.lock
COPY --from=build /app/.pixi/.gitignore /app/.pixi/.gitignore
COPY --from=build /app/.pixi/.condapackageignore /app/.pixi/.condapackageignore
COPY --from=build --chmod=0755 /app/entrypoint.sh /app/entrypoint.sh

ENTRYPOINT [ "/app/entrypoint.sh" ]

Building and deploying the container image

Now let’s add a GitHub Actions pipeline to build the container image definition files (apptainer.def and Dockerfile) and deploy it to a Linux container registry.

Challenge

Build and deploy Linux container image to registry

Add a GitHub Actions pipeline that will build both the apptainer.def and Dockerfile files deploy them to GitHub Container Registry (ghcr). Have the image tags include the text mnist as this will be the identifying problem.

Show me the solution

Create the GitHub Actions workflow directory tree

BASH

mkdir -p .github/workflows

and then write a YAML file at .github/workflows/containers.yaml that contains the following:

YAML

name: Build and publish Linux container images

on:
  push:
    branches:
      - main
    tags:
      - 'v*'
    paths:
      - 'htcondor/pixi.toml'
      - 'htcondor/pixi.lock'
      - 'htcondor/apptainer.def'
      - 'htcondor/Dockerfile'
      - 'htcondor/.dockerignore'
  pull_request:
    paths:
      - 'htcondor/pixi.toml'
      - 'htcondor/pixi.lock'
      - 'htcondor/apptainer.def'
      - 'htcondor/Dockerfile'
      - 'htcondor/.dockerignore'
  release:
    types: [published]
  workflow_dispatch:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

permissions: {}

jobs:
  docker:
    name: Build and publish Docker images
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Docker meta
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: |
            ghcr.io/${{ github.repository }}
          # generate Docker tags based on the following events/attributes
          tags: |
            type=raw,value=mnist-gpu-noble-cuda-12.9
            type=raw,value=latest
            type=sha
            type=sha,prefix=mnist-gpu-noble-cuda-12.9-sha-

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to GitHub Container Registry
        if: github.event_name != 'pull_request'
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Test build
        id: docker_build_test
        uses: docker/build-push-action@v6
        with:
          context: htcondor
          file: htcondor/Dockerfile
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          pull: true

      - name: Deploy build
        id: docker_build_deploy
        uses: docker/build-push-action@v6
        with:
          context: htcondor
          file: htcondor/Dockerfile
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          pull: true
          push: ${{ github.event_name != 'pull_request' }}

  apptainer:
    name: Build and publish Apptainer images
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - name: Free disk space
        uses: AdityaGarg8/remove-unwanted-software@v5
        with:
          remove-android: 'true'
          remove-dotnet: 'true'
          remove-haskell: 'true'

      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get commit SHA
        id: meta
        run: |
            # Get the short commit SHA (first 7 characters)
            SHA=$(git rev-parse --short HEAD)
            echo "sha=sha-$SHA" >> $GITHUB_OUTPUT

      - name: Install Apptainer
        uses: eWaterCycle/setup-apptainer@v2

      - name: Build container from definition file
        working-directory: ./htcondor
        run: apptainer build gpu-noble-cuda-12.9.sif apptainer.def

      - name: Test container
        working-directory: ./htcondor
        run: apptainer test gpu-noble-cuda-12.9.sif

      - name: Login to GitHub Container Registry
        if: github.event_name != 'pull_request'
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Deploy built container
        if: github.event_name != 'pull_request'
        working-directory: ./htcondor
        run: apptainer push gpu-noble-cuda-12.9.sif oras://ghcr.io/${{ github.repository }}:apptainer-mnist-gpu-noble-cuda-12.9-${{ steps.meta.outputs.sha }}

Callout

Version Control

We now want to make sure that we can build container images with these defention files and workflows.

On a new branch in your repository, add and commit the files from this episode.

BASH

git add htcondor/pixi.*
git add htcondor/.git*
git add htcondor/apptainer.def
git add htcondor/Dockerfile.def
git add .github/workflows/containers.yaml

Then push your branch to your remote on GitHub

BASH

git push -u origin HEAD

and make a pull request to merge your changes into your remote’s default branch.

To verify that things are visible to other computers, install the Linux container utility crane

BASH

pixi global install crane

└── crane: 0.20.5 (installed)
    └─ exposes: crane

and then use crane ls to list all of the Docker container images in your container registry for the particular image

BASH

crane ls ghcr.io/<your GitHub username>/pixi-cuda-lesson

HTCondor

Callout

This episode will be on a remote system

All the computation in the rest of this episode will take place on a remote system with a HTC workflow manager.

To provide a very high level overview of HTCondor in this episode we’ll focus on only a few of its many resources and capabilities.

Writing HTCondor execution scripts to define what the HTCondor worker nodes will actually do.
Writing HTCondor submit description files to send our jobs to the HTCondor worker pool.
Submitting those jobs with condor_submit and monitoring them with condor_q.

Caution

Connection between execution scripts and submit description files

As HTCondor execution scripts are given as the executable field in HTCondor submit description files, they are tightly linked and can not be written fully independently. Though they are presented as separate steps above, you will in practice write these together.

Write the HTCondor execution script

Let’s first start to write the execution script mnist_gpu_apptainer.sh, as we can think about how that relates to our code.

We’ll be running in the gpu environment that we defined with Pixi and built into our Apptainer container image.
For security reasons the HTCondor worker nodes don’t have full connection to all of the internet. So we’ll need to transfer out input data and source code rather than download it on demand.
We’ll need to activate the environment using the /app/entrypoint.sh script we built into the Apptainer container image.

BASH

#!/bin/bash
# mnist_gpu_apptainer.sh

# detailed logging to stderr
set -x

echo -e "# Hello from Job ${1} running on $(hostname)\n"
echo -e "# GPUs assigned: ${CUDA_VISIBLE_DEVICES}\n"

echo -e "# Activate Pixi environment\n"
# The last line of the entrypoint.sh file is 'exec "$@"'. If this shell script
# receives arguments, exec will interpret them as arguments to it, which is not
# intended. To avoid this, strip the last line of entrypoint.sh and source that
# instead.
. <(sed '$d' /app/entrypoint.sh)

echo -e "# Check to see if the NVIDIA drivers can correctly detect the GPU:\n"
nvidia-smi

echo -e "\n# Check that the training code exists:\n"
ls -1ap ./src/

echo -e "\n# Check if PyTorch can detect the GPU:\n"
python ./src/torch_detect_GPU.py

echo -e "\n# Extract the training data:\n"
if [ -f "MNIST_data.tar.gz" ]; then
    tar -vxzf MNIST_data.tar.gz
else
    echo "The training data archive, MNIST_data.tar.gz, is not found."
    echo "Please transfer it to the worker node in the HTCondor jobs submission file."
    exit 1
fi

echo -e "\n# Train MNIST with PyTorch:\n"
time python ./src/torch_MNIST.py --data-dir ./data --epochs 14 --save-model

Write the HTCondor submit description file

This is pretty standard boiler plate taken from the HTCondor documentation

# mnist_gpu_apptainer.sub
# Submit file to access the GPU via apptainer

universe = container
container_image = oras://ghcr.io/<github user name>/pixi-cuda-lesson:apptainer-mnist-gpu-noble-cuda-12.9-sha-<sha>

# set the log, error and output files
log = mnist_gpu_apptainer_$(Cluster)_$(Process).log.txt
error = mnist_gpu_apptainer_$(Cluster)_$(Process).err.txt
output = mnist_gpu_apptainer_$(Cluster)_$(Process).out.txt

# set the executable to run
executable = mnist_gpu_apptainer.sh
arguments = $(Process)

+JobDurationCategory = "Medium"

# transfer training data files and runtime source files to the compute node
transfer_input_files = MNIST_data.tar.gz,src

# transfer the serialized trained model back
transfer_output_files = mnist_cnn.pt

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

# Require a machine with a modern version of the CUDA driver
requirements = (GPUs_DriverVersion >= 12.0)

# We must request 1 CPU in addition to 1 GPU
request_cpus = 1
request_gpus = 1

# select some memory and disk space
request_memory = 2GB
# Apptainer jobs take more disk than Docker jobs for some reason
request_disk = 7GB

# Optional: specify the GPU hardware architecture required
# Check against the CUDA GPU Compute Capability for your software
# e.g. python -c "import torch; print(torch.cuda.get_arch_list())"
# The listed 'sm_xy' values show the x.y gpu capability supported
gpus_minimum_capability = 5.0

# Optional: required GPU memory
gpus_minimum_memory = 2GB

# Tell HTCondor to run 1 instances of our job:
queue 1

Write the submission script

To make it easy for us, we can write a small job submission script submit.sh that will prepare the data for us and submit the submit description file to HTCondor for us with condor_submit.

BASH

#!/bin/bash
# submit.sh

# Download the training data locally to transfer to the worker node
if [ ! -f "MNIST_data.tar.gz" ]; then
    # c.f. https://github.com/CHTC/templates-GPUs/blob/450081144c6ae0657123be2a9a357cb432d9d394/shared/pytorch/MNIST_data.tar.gz
    curl -sLO https://raw.githubusercontent.com/CHTC/templates-GPUs/450081144c6ae0657123be2a9a357cb432d9d394/shared/pytorch/MNIST_data.tar.gz
fi

# Ensure existing models are backed up
if [ -f "mnist_cnn.pt" ]; then
    mv mnist_cnn.pt mnist_cnn_"$(date '+%Y-%m-%d-%H-%M')".pt.bak
fi

condor_submit mnist_gpu_apptainer.sub

Submitting the job

Before we actually submit code to run, we can submit an interactive job from the HTCondor system’s login nodes to check that things work as expected.

BASH

#!/bin/bash
# interact.sh

# Download the training data locally to transfer to the worker node
if [ ! -f "MNIST_data.tar.gz" ]; then
    # c.f. https://github.com/CHTC/templates-GPUs/blob/450081144c6ae0657123be2a9a357cb432d9d394/shared/pytorch/MNIST_data.tar.gz
    curl -sLO https://raw.githubusercontent.com/CHTC/templates-GPUs/450081144c6ae0657123be2a9a357cb432d9d394/shared/pytorch/MNIST_data.tar.gz
fi

# Ensure existing models are backed up
if [ -f "mnist_cnn.pt" ]; then
    mv mnist_cnn.pt mnist_cnn_"$(date '+%Y-%m-%d-%H-%M')".pt.bak
fi

condor_submit -interactive mnist_gpu_apptainer.sub

Submitting the job for the first time will take a bit as it needs to pull down the container image, so be patient. The container image will be cached in the future and so this will be faster.

BASH

bash interact.sh

OUTPUT

Submitting job(s).
1 job(s) submitted to cluster 2127828.
Waiting for job to start...
...
Welcome to interactive3_1@vetsigian0001.chtc.wisc.edu!
Your condor job is running with pid(s) 2368233.
groups: cannot find name for group ID 24433
groups: cannot find name for group ID 40092
I have no name!@vetsigian0001:/var/lib/condor/execute/slot3/dir_2367762$

We can now activate our environment manually and look around

BASH

. /app/entrypoint.sh

OUTPUT

(htcondor:gpu) I have no name!@vetsigian0001:/var/lib/condor/execute/slot3/dir_2367762$

BASH

command -v python

OUTPUT

/app/.pixi/envs/gpu/bin/python

BASH

python --version

OUTPUT

Python 3.13.9

BASH

pixi list torch

OUTPUT

Environment: gpu
Package                     Version  Build                           Size       Kind   Source
libtorch                    2.7.1    cuda129_mkl_h9562ed8_304        836.3 MiB  conda  https://conda.anaconda.org/conda-forge/
pytorch                     2.7.1    cuda129_mkl_py313_h1e53aa0_304  28.1 MiB   conda  https://conda.anaconda.org/conda-forge/
pytorch-gpu                 2.7.1    cuda129_mkl_h43a4b0b_304        46.9 KiB   conda  https://conda.anaconda.org/conda-forge/
torchvision                 0.24.0   cuda129_py313_h6be0d2c_0        3.1 MiB    conda  https://conda.anaconda.org/conda-forge/
torchvision-extra-decoders  0.0.2    py313hf1e760e_3                 62.9 KiB   conda  https://conda.anaconda.org/conda-forge/

BASH

pixi list cuda

OUTPUT

Environment: gpu
Package               Version  Build       Size       Kind   Source
cuda-crt-tools        12.9.86  ha770c72_2  28.5 KiB   conda  https://conda.anaconda.org/conda-forge/
cuda-cudart           12.9.79  h5888daf_0  22.7 KiB   conda  https://conda.anaconda.org/conda-forge/
cuda-cudart_linux-64  12.9.79  h3f2d84a_0  192.6 KiB  conda  https://conda.anaconda.org/conda-forge/
cuda-cuobjdump        12.9.82  hffce074_1  239.3 KiB  conda  https://conda.anaconda.org/conda-forge/
cuda-cupti            12.9.79  h676940d_1  1.8 MiB    conda  https://conda.anaconda.org/conda-forge/
cuda-nvcc-tools       12.9.86  he02047a_2  26.1 MiB   conda  https://conda.anaconda.org/conda-forge/
cuda-nvdisasm         12.9.88  hffce074_1  5.3 MiB    conda  https://conda.anaconda.org/conda-forge/
cuda-nvrtc            12.9.86  hecca717_1  64.1 MiB   conda  https://conda.anaconda.org/conda-forge/
cuda-nvtx             12.9.79  hecca717_1  28.7 KiB   conda  https://conda.anaconda.org/conda-forge/
cuda-nvvm-tools       12.9.86  h4bc722e_2  23.1 MiB   conda  https://conda.anaconda.org/conda-forge/
cuda-version          12.9     h4f385c5_3  21.1 KiB   conda  https://conda.anaconda.org/conda-forge/

BASH

nvidia-smi

OUTPUT

Thu Aug 14 06:36:21 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:81:00.0 Off |                    0 |
| N/A   25C    P0             35W /  250W |       4MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

We can interactively run our code as well

BASH

tar -vxzf MNIST_data.tar.gz
time python ./src/torch_MNIST.py --data-dir ./data --epochs 2 --save-model

To return to the login node we just exit the interactive session

BASH

exit

Now to submit our job normally, we run the submit.sh script

BASH

bash submit.sh

OUTPUT

Submitting job(s).
1 job(s) submitted to cluster 12651114.

and its submission and state can be monitored with condor_q.

BASH

condor_q

OUTPUT



-- Schedd: ap40.uw.osg-htc.org : <128.105.68.62:9618?... @ 08/14/25 08:20:12
OWNER            BATCH_NAME      SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
matthew.feickert ID: 12651114   8/14 08:18      _      1      _      1 12651114.0

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

When the job finishes we see that HTCondor has returned to us the following files:

mnist_gpu_apptainer_$(Cluster)_$(Process).log.txt: the HTCondor log file for the job
mnist_gpu_apptainer_$(Cluster)_$(Process).out.txt: the stdout of all actions executed in the job
mnist_gpu_apptainer_$(Cluster)_$(Process).err.txt: the stderr of all actions executed in the job
mnist_cnn.pt: the serialized trained PyTorch model

Key Points

You can use containerized Pixi environments with HTC systems to be able to run CUDA accelerated code that you defined.

Using Pixi environments on HTC Systems

Overview

Questions

Objectives

High Throughput Computing (HTC)

Setting up a problem

BASH

OUTPUT

Training a PyTorch model on the MNIST dataset

Mea culpa, more interesting examples exist

The neural network code

BASH

The Pixi environment

Create the environment

Show me the solution

BASH

OUTPUT

BASH

OUTPUT

BASH

OUTPUT

BASH

TOML

BASH

BASH

OUTPUT

BASH

OUTPUT

TOML

Override the __cuda virtual package

BASH

BASH

OUTPUT

Running multiple ways

Show me the solution

BASH

The Linux container

Apptainer

Docker

Write the Dockerfile

Show me the solution

DOCKERFILE

Building and deploying the container image

Build and deploy Linux container image to registry

Show me the solution

BASH

YAML

Version Control

BASH

BASH

BASH

BASH

HTCondor

This episode will be on a remote system

Connection between execution scripts and submit description files

Write the HTCondor execution script

BASH

Write the HTCondor submit description file

Write the submission script

BASH

Submitting the job

BASH

BASH

OUTPUT

BASH

OUTPUT

BASH

OUTPUT

BASH

OUTPUT

BASH

OUTPUT

BASH

OUTPUT

BASH

OUTPUT

BASH

BASH

BASH

OUTPUT

Override the `__cuda` virtual package