# Zeta A100 SXM4 80GB – Singularity Precaution Guide
Author: Snit Sanhlao, AI Assistant Claude AI
- Node: zeta
- GPU: 8× NVIDIA A100-SXM4-80GB (NVLink 3.0 + NVSwitch)
- Singularity: CE 4.1.3
- Driver stack: nvidia-container-toolkit (CDI)
## Use --nvccli (Preferred) on Zeta
```bash
singularity exec --nvccli [other flags] image.sif command
```
Both `--nv` and `--nvccli` work on zeta (tested 2026-03-26). However, `--nvccli` is preferred because it is the modern standard and more portable.

Note: Earlier versions of this guide warned against `--nv` due to hangs. The cluster stack has since been updated with compatibility layers.
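For example, a filled-in invocation might look like this; the image name, bind path, and script are placeholders, not site defaults:

```bash
# Hypothetical example: run a training script from a SIF with GPU access.
# pytorch_24.01.sif, /scratch/$USER and train.py are placeholders.
singularity exec --nvccli --bind /scratch/$USER:/workspace \
    pytorch_24.01.sif python /workspace/train.py
```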
## Pre-flight Check Script
Run this before submitting any GPU Singularity job to a node:
```bash
#!/bin/bash
# check-singularity-gpu.sh <nodename>
# Usage: bash check-singularity-gpu.sh zeta
NODE=${1:-$(hostname)}
echo "=== Singularity GPU Check: $NODE ==="
srun --nodelist="$NODE" --gres=gpu:1 --pty bash -c '
echo "Node: $(hostname)"
echo "--- GPU hardware ---"
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
echo "--- Singularity flag ---"
if ls /usr/bin/nvidia-container-cli &>/dev/null; then
    echo "USE --nvccli (CDI stack detected)"
else
    echo "USE --nv (legacy driver stack)"
fi
echo "--- nvidia-container-cli path ---"
ls -la /usr/bin/nvidia-container-cli 2>/dev/null || echo "not found"
'
```
## Why --nvccli is Preferred on Zeta
`--nv` uses legacy library injection: it scans the host for NVIDIA `.so` files and bind-mounts them into the container. This was built for older driver setups where libraries live in well-known paths.

`--nvccli` delegates GPU setup to `/usr/bin/nvidia-container-cli`, the same binary Docker uses internally. It handles CDI correctly.

- `--nv` → legacy library injection → works (with compatibility layer)
- `--nvccli` → nvidia-container-cli setup → works (modern standard, preferred)
Update (2026-03-26): Both flags now work on zeta. The cluster stack was updated with compatibility layers. `--nvccli` remains preferred for portability across container runtimes (Docker, Podman, Kubernetes).
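A quick way to confirm both code paths yourself (the SIF name here is a placeholder, not a site-provided image):

```bash
# Both invocations should list the same eight GPUs; the image name is hypothetical.
singularity exec --nv     pytorch_24.01.sif nvidia-smi -L
singularity exec --nvccli pytorch_24.01.sif nvidia-smi -L
```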
## Portable Flag Detection (use in all sbatch scripts)
Paste this block into any sbatch script that runs Singularity with GPU:
```bash
# Auto-detect Singularity GPU flag
if ls /usr/bin/nvidia-container-cli &>/dev/null; then
    NV_FLAG="--nvccli"
else
    NV_FLAG="--nv"
fi
echo "[singularity] GPU flag: $NV_FLAG"
singularity exec $NV_FLAG "$SIF" command ...
```
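A minimal sbatch sketch showing the block in context; the job name, time limit, and SIF path are placeholder values, not site policy:

```bash
#!/bin/bash
#SBATCH --job-name=sif-gpu-test     # placeholder job name
#SBATCH --nodelist=zeta
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00             # placeholder time limit

SIF=/path/to/image.sif              # placeholder path to your SIF

# Auto-detect Singularity GPU flag (same block as above)
if ls /usr/bin/nvidia-container-cli &>/dev/null; then
    NV_FLAG="--nvccli"
else
    NV_FLAG="--nv"
fi
echo "[singularity] GPU flag: $NV_FLAG"

singularity exec $NV_FLAG "$SIF" nvidia-smi -L
```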
## Node Reference Table (Tested 2026-03-26)
| Node | GPU | VRAM | Both Flags Work | Recommended |
|---|---|---|---|---|
| zeta | A100 SXM4 80GB | 8× 80GB = 640GB | Yes | `--nvccli` |
| tensorcore | A100 SXM4 40GB | 8× 40GB = 320GB | Yes | `--nvccli` |
| tau | H100 80GB | varies | Yes | `--nvccli` |

All nodes now support both `--nv` and `--nvccli`. Use `--nvccli` for portability.
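To confirm the VRAM figures in the table on whichever node you land on:

```bash
# Prints one line per GPU with its total memory (matches the VRAM column above).
nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
```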
## Why --nvccli is Recommended
| Reason | Explanation |
|---|---|
| Modern standard | Uses `nvidia-container-cli`, the same binary Docker uses internally |
| Portable | Same command works across different container runtimes and clusters |
| CDI support | Handles the Container Device Interface (the new spec) correctly |
| Future-proof | Legacy library injection targets older driver layouts; `nvidia-container-cli` is the actively maintained path |
| Cleaner isolation | Delegates GPU setup to a dedicated binary instead of Singularity's library injection |
Bottom line: Both work on all nodes. Use `--nvccli` for scripts you want to reuse elsewhere.
## User Trip – Cancelling a Live GPU Job (Learned 2026-03-21)
### What Happened
`scancel` was used to kill a running vLLM job (8-way NCCL across all GPUs). The NCCL workers did not clean up cleanly → GPU memory stayed allocated → the Slurm epilog detected leftover GPU usage → zeta was marked unavailable.

Next job submission: `PENDING (ReqNodeNotAvail, UnavailableNodes:zeta)`
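To confirm the diagnosis from the login node, standard Slurm queries are enough (no site-specific flags here):

```bash
# Show your jobs with their pending reason (expect ReqNodeNotAvail here).
squeue -u "$USER" -o "%.10i %.12T %r"

# Show the node's state and any drain/unavailable reason recorded by Slurm.
scontrol show node zeta | grep -iE 'State|Reason'
```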
### Fix (admin required)
```bash
# 1. SSH to zeta and check for leftover processes
ps aux | grep -E 'python|vllm|nccl' | grep -v grep

# 2. Check if GPU memory is still held
nvidia-smi | grep -E 'MiB|Processes'

# 3a. If the GPU is clear, resume the node in Slurm (from the login node)
scontrol update nodename=zeta state=resume

# 3b. If processes are still stuck, kill them first, then resume
kill -9 $(ps aux | grep python | grep -v grep | awk '{print $2}')
scontrol update nodename=zeta state=resume
```
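After resuming, a quick check that zeta is schedulable again (run from the login node):

```bash
# %n = node hostname, %T = node state; expect "idle" or "mixed" once resumed.
sinfo -n zeta -o "%n %T"
```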
### Prevention – Graceful Job Cancellation
Never use bare `scancel` on a live NCCL/multi-GPU Singularity job.
Use SIGINT first to allow clean GPU release:
```bash
scancel --signal=SIGINT <JOBID>   # equivalent to Ctrl+C: NCCL releases GPU contexts
sleep 10
scancel <JOBID>                   # force kill if still running
```
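A small helper that wraps this pattern; the function name and the 15-second grace period are assumptions, not a site standard:

```bash
# Sketch: send SIGINT, wait briefly, then force-cancel only if the job is still alive.
graceful_scancel() {
    local jobid="$1"
    scancel --signal=SIGINT "$jobid"          # let NCCL workers release GPU contexts
    sleep 15                                  # grace period (tune for your workload)
    if squeue -h -j "$jobid" -t RUNNING,PENDING 2>/dev/null | grep -q .; then
        scancel "$jobid"                      # force kill if still running
    fi
}

# Usage: graceful_scancel 123456
```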
## Symptoms of Wrong Flag
| Symptom | Likely Cause |
|---|---|
| No GPUs visible inside the container | No GPU flag passed at all |
| Container exits immediately | Wrong SIF path or missing image |
| Job hangs at GPU/NCCL startup | Using `--nv` on the old stack (historical; now resolved) |

Note: The "`--nv` hangs" issue from earlier versions has been resolved via cluster updates.
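A quick triage sequence matching the table above (the SIF path is a placeholder):

```bash
SIF=/path/to/image.sif                           # placeholder

ls -lh "$SIF"                                    # rules out a wrong path or missing image
singularity exec --nvccli "$SIF" nvidia-smi -L   # GPUs should be listed; if not, check the flag
```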