# HPC User Guide: Avoiding /tmp Full Errors on Compute Nodes

**Author:** Snit Sanghlao and Claude AI (Anthropic)
**Cluster:** MAI AI-HPC (NVIDIA BCM)
**Date:** 2026-03-21
## Why This Matters

The compute nodes have a shared root partition (`/`) that includes `/tmp`. When jobs write large files to `/tmp`, the entire node’s OS disk fills up, crashing your job, draining the node, and blocking all other users.

**Golden rule:** Never write directly to `/tmp`. Always use `/workdir` or `$SLURM_TMPDIR`.
## 1. Use `$SLURM_TMPDIR` for Temporary Files

Slurm creates a per-job temporary directory that is automatically cleaned up when your job ends.

```bash
#!/bin/bash
#SBATCH --job-name=my_training
#SBATCH --gres=gpu:1

# Copy data to fast local storage
cp /home/$USER/dataset.tar.gz "$SLURM_TMPDIR/"
cd "$SLURM_TMPDIR"
tar -xf dataset.tar.gz

# Run your job from local storage (faster I/O)
python train.py --data_dir "$SLURM_TMPDIR/dataset"

# Copy results back before job ends
cp "$SLURM_TMPDIR/model_best.pt" /home/$USER/results/
```
If `$SLURM_TMPDIR` is not set on your cluster, use `/workdir` with a manual cleanup trap:

```bash
#!/bin/bash
#SBATCH --job-name=my_training
#SBATCH --gres=gpu:1

scratch="/workdir/${USER}_${SLURM_JOB_ID}"
mkdir -p "$scratch"
trap 'rm -rf "$scratch"' EXIT  # auto-delete even if the job crashes
cd "$scratch"

# ... your work here ...
```

Note the single quotes around the `trap` command: they defer expansion of `$scratch` until the trap actually fires, which is the safer idiom.
## 2. Singularity / Apptainer

Singularity and Apptainer write large temporary files during container build and pull operations. By default these go to `/tmp` and can easily fill the root disk.

### 2.1 Redirect Cache and Temp Directories

Add these to your `~/.bashrc`:

```bash
# Singularity / Apptainer cache and temp
export SINGULARITY_CACHEDIR=/workdir/$USER/singularity_cache
export SINGULARITY_TMPDIR=/workdir/$USER/singularity_tmp
export APPTAINER_CACHEDIR=/workdir/$USER/apptainer_cache
export APPTAINER_TMPDIR=/workdir/$USER/apptainer_tmp
mkdir -p $SINGULARITY_CACHEDIR $SINGULARITY_TMPDIR \
         $APPTAINER_CACHEDIR $APPTAINER_TMPDIR
```
### 2.2 Building Containers

Always specify `--tmpdir` when building:

```bash
singularity build --tmpdir /workdir/$USER/singularity_tmp my_container.sif my_recipe.def
```
### 2.3 Running Containers with `--bind`

Bind `/workdir` into your container so jobs inside the container also write to the safe location:

```bash
singularity exec --nv \
    --bind /workdir/$USER:/scratch \
    my_container.sif python train.py --tmp_dir /scratch
```
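Inside a batch job you can bind the per-job scratch directory from Section 1 instead. A minimal sketch, assuming `$SLURM_TMPDIR` is set:

```bash
# Bind the per-job scratch so container writes vanish with the job
singularity exec --nv \
    --bind "$SLURM_TMPDIR:/scratch" \
    my_container.sif python train.py --tmp_dir /scratch
```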
### 2.4 Clean Up Old Caches Periodically

```bash
# Check cache size
du -sh /workdir/$USER/singularity_cache

# Clear cache (safe to run anytime)
singularity cache clean
# or
apptainer cache clean
```
## 3. Enroot / Pyxis (NVIDIA Containers)

If you use `srun --container-image=...` or Enroot directly, container layers are cached in `/tmp` by default.

Add to your `~/.bashrc`:

```bash
# Enroot / Pyxis cache
export ENROOT_CACHE_PATH=/workdir/$USER/enroot_cache
export ENROOT_DATA_PATH=/workdir/$USER/enroot_data
export ENROOT_RUNTIME_PATH=/workdir/$USER/enroot_runtime
mkdir -p $ENROOT_CACHE_PATH $ENROOT_DATA_PATH $ENROOT_RUNTIME_PATH
```
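With those variables in place, a Pyxis job is invoked exactly as before; the layers are simply unpacked under `/workdir`. A minimal sketch (the image tag is illustrative):

```bash
# Pull and run a container via Pyxis; layers land in $ENROOT_CACHE_PATH
srun --gres=gpu:1 \
     --container-image=nvcr.io#nvidia/pytorch:24.05-py3 \
     --container-mounts=/workdir/$USER:/scratch \
     python train.py --tmp_dir /scratch
```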
## 4. Python and pip

Python also writes to `/tmp` for package builds and temporary data.

### In Job Scripts

```bash
export TMPDIR=/workdir/$USER/tmp
export PIP_CACHE_DIR=/workdir/$USER/pip_cache
mkdir -p $TMPDIR $PIP_CACHE_DIR
```

### Installing Packages Inside Jobs

```bash
pip install --cache-dir /workdir/$USER/pip_cache my_package
```
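Python’s `tempfile` module honors `TMPDIR`, so you can verify the redirect took effect:

```bash
# Should print /workdir/<user>/tmp, not /tmp
python -c 'import tempfile; print(tempfile.gettempdir())'
```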
## 5. PyTorch / Deep Learning Frameworks

Several frameworks create temp files during training:

```bash
# PyTorch compiled extensions
export TORCH_EXTENSIONS_DIR=/workdir/$USER/torch_extensions

# Hugging Face model cache (recent transformers releases read HF_HOME;
# TRANSFORMERS_CACHE is kept for older installs)
export HF_HOME=/workdir/$USER/huggingface
export TRANSFORMERS_CACHE=/workdir/$USER/huggingface/transformers

# Triton (GPU kernel cache)
export TRITON_CACHE_DIR=/workdir/$USER/triton_cache

mkdir -p $TORCH_EXTENSIONS_DIR $HF_HOME $TRITON_CACHE_DIR
```
## 6. Recommended `~/.bashrc` Block

Copy this entire block into your `~/.bashrc` to cover all common cases:

```bash
# =============================================================
# HPC: Redirect all temp/cache away from /tmp to /workdir
# =============================================================
export TMPDIR=/workdir/$USER/tmp
export TEMP=$TMPDIR
export TMP=$TMPDIR

# Singularity / Apptainer
export SINGULARITY_CACHEDIR=/workdir/$USER/singularity_cache
export SINGULARITY_TMPDIR=/workdir/$USER/singularity_tmp
export APPTAINER_CACHEDIR=/workdir/$USER/apptainer_cache
export APPTAINER_TMPDIR=/workdir/$USER/apptainer_tmp

# Enroot / Pyxis
export ENROOT_CACHE_PATH=/workdir/$USER/enroot_cache
export ENROOT_DATA_PATH=/workdir/$USER/enroot_data
export ENROOT_RUNTIME_PATH=/workdir/$USER/enroot_runtime

# Python / pip
export PIP_CACHE_DIR=/workdir/$USER/pip_cache

# PyTorch / DL
export TORCH_EXTENSIONS_DIR=/workdir/$USER/torch_extensions
export HF_HOME=/workdir/$USER/huggingface
export TRANSFORMERS_CACHE=/workdir/$USER/huggingface/transformers
export TRITON_CACHE_DIR=/workdir/$USER/triton_cache

# Create all directories
mkdir -p $TMPDIR $SINGULARITY_CACHEDIR $SINGULARITY_TMPDIR \
         $APPTAINER_CACHEDIR $APPTAINER_TMPDIR \
         $ENROOT_CACHE_PATH $ENROOT_DATA_PATH $ENROOT_RUNTIME_PATH \
         $PIP_CACHE_DIR $TORCH_EXTENSIONS_DIR $HF_HOME $TRITON_CACHE_DIR \
         2>/dev/null
```
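If some machines you log into (e.g., login nodes) do not mount `/workdir`, you can wrap the block in a guard so those shells fall back to the defaults. A minimal sketch:

```bash
# Only redirect temp/cache on nodes where /workdir actually exists
if [ -d /workdir ]; then
    export TMPDIR=/workdir/$USER/tmp
    # ... rest of the block above ...
fi
```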
## 7. Quick Reference: What NOT To Do

| Bad Practice | Why It’s Dangerous | Do This Instead |
|---|---|---|
| Writing job output to `/tmp` | Fills root partition | Use `$SLURM_TMPDIR` or `/workdir` |
| `singularity build` without `--tmpdir` | Build temp goes to `/tmp` | Set `SINGULARITY_TMPDIR` or pass `--tmpdir` |
| `pip install` with default temp | pip temp goes to `/tmp` | Set `TMPDIR` and `PIP_CACHE_DIR` |
| Saving checkpoints to `/tmp` | Fills root, lost on job end | Save to `/workdir` or `$HOME` |
| Leaving old containers in cache | Wastes disk for everyone | Run `singularity cache clean` periodically |
## 8. Check Your Disk Usage

Before and after jobs, check that you’re not leaving junk behind:

```bash
# Check /tmp usage (should be small)
du -sh /tmp/$USER* 2>/dev/null

# Check your /workdir usage
du -sh /workdir/$USER/*

# Check overall node disk
df -h /
df -h /workdir
```
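If `/` is filling up and you want to see what is responsible, standard coreutils can rank the largest entries under `/tmp`. A quick sketch:

```bash
# Largest entries directly under /tmp, biggest last (-x stays on one filesystem)
du -xh --max-depth=1 /tmp 2>/dev/null | sort -h | tail -n 10
```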
If you see `/` above 90%, alert the admin immediately: the node may need manual cleanup before it crashes.
## Contact

If a node is drained or unresponsive due to a full disk, contact the HPC admin team. Do not attempt to manually clean other users’ files.