Zeta A100 SXM4 80GB - Singularity Precaution Guide#

Author: Snit Sanhlao, AI Assistant Claude AI

Node: zeta
GPU: 8× NVIDIA A100-SXM4-80GB (NVLink 3.0 + NVSwitch)
Singularity: CE 4.1.3
Driver stack: nvidia-container-toolkit (CDI)


Use --nvccli (Preferred) on Zeta#

singularity exec --nvccli [other flags] image.sif command

Both --nv and --nvccli work on zeta (tested 2026-03-26). However, --nvccli is preferred as it's the modern standard and more portable.

Note: Earlier versions of this guide warned against --nv due to hangs. The cluster stack has since been updated with compatibility layers.
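
As a quick sanity check, the sketch below runs nvidia-smi and a PyTorch device query inside a container with --nvccli. The image name pytorch.sif is only a placeholder; substitute your own SIF, and drop the python line if the image does not ship PyTorch.

# Placeholder image name: replace pytorch.sif with your actual SIF
singularity exec --nvccli pytorch.sif nvidia-smi -L
singularity exec --nvccli pytorch.sif python -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'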


Pre-flight Check Script#

Run this before submitting any GPU Singularity job to a node:

#!/bin/bash
# check-singularity-gpu.sh <nodename>
# Usage: bash check-singularity-gpu.sh zeta

NODE=${1:-$(hostname)}

echo "=== Singularity GPU Check: $NODE ==="

srun --nodelist="$NODE" --gres=gpu:1 --pty bash -c '
    echo "Node: $(hostname)"
    echo "--- GPU hardware ---"
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

    echo "--- Singularity flag ---"
    if [ -x /usr/bin/nvidia-container-cli ]; then
        echo "USE --nvccli  (CDI stack detected)"
    else
        echo "USE --nv      (legacy driver stack)"
    fi

    echo "--- nvidia-container-cli path ---"
    ls -la /usr/bin/nvidia-container-cli 2>/dev/null || echo "not found"
'
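
To sweep every GPU node in one pass, a simple loop over the node names from the reference table below works; adjust the list if your cluster's node names differ.

# Check all GPU nodes from the reference table (adjust names as needed)
for node in zeta tensorcore tau; do
    bash check-singularity-gpu.sh "$node"
done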

Why --nvccli is Preferred on Zeta#

--nv uses legacy library injection: it scans the host for NVIDIA .so files and bind-mounts them into the container. This approach was built for older driver setups where libraries live in well-known paths.

--nvccli delegates GPU setup to /usr/bin/nvidia-container-cli, the same binary Docker uses internally, and it handles CDI correctly.

--nv      → Legacy library injection     → works (with compatibility layer)
--nvccli  → nvidia-container-cli setup   → works (modern standard, preferred)

Update (2026-03-26): Both flags now work on zeta. The cluster stack was updated with compatibility layers. --nvccli remains preferred for portability across container runtimes (Docker, Podman, Kubernetes).
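
To see what the CDI stack reports on a node, nvidia-container-cli can print the driver and device inventory directly. This is a minimal sketch, assuming the toolkit sits at the default path shown above; run it inside an srun session on the GPU node.

# Run on the GPU node (e.g. inside an srun session on zeta)
/usr/bin/nvidia-container-cli info          # driver/CUDA versions and detected GPUs
/usr/bin/nvidia-container-cli list | head   # host libraries and binaries mapped into containers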


Portable Flag Detection (use in all sbatch scripts)#

Paste this block into any sbatch script that runs Singularity with GPU:

# Auto-detect Singularity GPU flag
if [ -x /usr/bin/nvidia-container-cli ]; then
    NV_FLAG="--nvccli"
else
    NV_FLAG="--nv"
fi
echo "[singularity] GPU flag: $NV_FLAG"

singularity exec $NV_FLAG "$SIF" command ...
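
For context, here is a minimal sbatch skeleton built around that detection block. The job name, time limit, and SIF path are placeholders to adapt; the nvidia-smi -L call at the end is just a smoke test.

#!/bin/bash
#SBATCH --job-name=gpu-container     # placeholder name
#SBATCH --nodelist=zeta
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00              # placeholder time limit

SIF=/path/to/your/image.sif          # placeholder SIF path

# Auto-detect Singularity GPU flag (same block as above)
if [ -x /usr/bin/nvidia-container-cli ]; then
    NV_FLAG="--nvccli"
else
    NV_FLAG="--nv"
fi
echo "[singularity] GPU flag: $NV_FLAG"

singularity exec $NV_FLAG "$SIF" nvidia-smi -L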

Node Reference Table (Tested 2026-03-26)#

Node         GPU              VRAM               Both Flags Work   Recommended
zeta         A100 SXM4 80GB   8× 80GB = 640GB    Yes               --nvccli
tensorcore   A100 SXM4 40GB   8× 40GB = 320GB    Yes               --nvccli
tau          H100 80GB        varies             Yes               --nvccli

All nodes now support both --nv and --nvccli. Use --nvccli for portability.
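
To verify this table against the live cluster, sinfo can list each node's GRES and state; the format string below is one reasonable choice.

# List nodes with their GPU (GRES) configuration and current state
sinfo -N -o "%20N %15G %12P %T"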


User Trip - Cancelling a Live GPU Job (Learned 2026-03-21)#

What Happened#

scancel was used to kill a running vLLM job (8-way NCCL across all GPUs). The NCCL workers did not shut down cleanly → GPU memory stayed allocated → the Slurm epilog detected leftover GPU usage → zeta was marked unavailable (UnavailableNodes).

Next job submission: PENDING (ReqNodeNotAvail, UnavailableNodes:zeta)
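
Before involving an admin, the drain reason can be read from the login node with standard Slurm queries; no elevated rights are needed for these.

# Why is my job pending?
squeue -u "$USER" -o "%.10i %.12T %.30R"       # REASON column shows ReqNodeNotAvail

# Why is zeta unavailable?
scontrol show node zeta | grep -iE 'state|reason'
sinfo -n zeta -o "%N %T %E"                    # %E prints the drain reason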

Fix (admin required)#

# 1. SSH to zeta and check for leftover processes
ps aux | grep -E 'python|vllm|nccl' | grep -v grep

# 2. Check if GPU memory is still held
nvidia-smi | grep -E 'MiB|Processes'

# 3a. If GPU is clear → resume node in Slurm (from login node)
scontrol update NodeName=zeta State=RESUME

# 3b. If processes are still stuck → kill first, then resume
kill -9 $(ps aux | grep python | grep -v grep | awk '{print $2}')
scontrol update NodeName=zeta State=RESUME
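
After resuming, confirm the node actually returned to service before resubmitting; a minimal check (step 4 assumes the same direct SSH access used in step 1):

# 4. Verify zeta is back to idle/mixed and GPU memory is free
sinfo -n zeta -o "%N %T"
ssh zeta nvidia-smi --query-gpu=memory.used --format=csv,noheader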

Prevention β€” Graceful Job Cancellation#

Never use bare scancel on a live NCCL/multi-GPU Singularity job. Use SIGINT first to allow clean GPU release:

scancel --signal=SIGINT <JOBID>   # Ctrl+C → NCCL releases GPU contexts
sleep 10
scancel <JOBID>                   # force kill if still running
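
The two-step pattern can be wrapped in a small shell helper so the sequence is not forgotten under pressure; a sketch (the function name and 30-second grace period are arbitrary choices):

# safe_scancel <JOBID>: SIGINT first, force-kill only if the job is still running
safe_scancel() {
    local jobid=$1
    scancel --signal=SIGINT "$jobid"
    sleep 30                                  # grace period for NCCL to release GPU contexts
    if squeue -j "$jobid" -h -o %T 2>/dev/null | grep -q RUNNING; then
        scancel "$jobid"                      # still running: force kill
    fi
}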

Symptoms of Wrong Flag#

Symptom                         Likely Cause
CUDA not available / False 0    No GPU flag passed at all
Container exits immediately     Wrong SIF path or missing image
unrecognized arguments          Using singularity run with vLLM args; use singularity exec instead

Note: The "--nv hangs" issue from earlier versions has been resolved via cluster updates.