HPC User Guide: Avoiding /tmp Full Errors on Compute Nodes#

Author: Snit Sanghlao and Claude AI (Anthropic)

Cluster: MAI AI-HPC (NVIDIA BCM)
Date: 2026-03-21


Why This Matters#

The compute nodes have a shared root partition (/) that includes /tmp. When jobs write large files to /tmp, the entire node’s OS disk fills up — crashing your job, draining the node, and blocking all other users.
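You can verify this on any node: when df reports the same underlying device for / and /tmp, every byte written to /tmp comes out of the root disk.

```shell
# If both lines show the same device, /tmp shares the OS disk with /.
df -P / /tmp | awk 'NR>1 {print $1, $6}'
```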

Golden rule: Never write directly to /tmp. Always use /workdir or $SLURM_TMPDIR.


1. Use $SLURM_TMPDIR for Temporary Files#

Slurm creates a per-job temporary directory that is automatically cleaned up when your job ends.

#!/bin/bash
#SBATCH --job-name=my_training
#SBATCH --gres=gpu:1

# Copy data to fast local storage
cp /home/$USER/dataset.tar.gz "$SLURM_TMPDIR/"
cd "$SLURM_TMPDIR"
tar -xf dataset.tar.gz

# Run your job from local storage (faster I/O)
python train.py --data_dir "$SLURM_TMPDIR/dataset"

# Copy results back before job ends
cp "$SLURM_TMPDIR/model_best.pt" /home/$USER/results/

If $SLURM_TMPDIR is not set on your cluster, use /workdir with a manual cleanup trap:

#!/bin/bash
#SBATCH --job-name=my_training
#SBATCH --gres=gpu:1

scratch="/workdir/${USER}_${SLURM_JOB_ID}"
mkdir -p "$scratch"
trap 'rm -rf "$scratch"' EXIT    # auto-delete even if job crashes

cd "$scratch"
# ... your work here ...
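One caveat with the trap above: it deletes results along with everything else. A variant that saves outputs first looks like the sketch below; the mktemp stand-in and the commented copy destination are illustrative, on the cluster you would use /workdir/${USER}_${SLURM_JOB_ID} as before.

```shell
#!/bin/bash
# Sketch: save outputs before the scratch directory is deleted.
# On the cluster use scratch="/workdir/${USER}_${SLURM_JOB_ID}";
# mktemp -d is a stand-in so this sketch runs anywhere.
scratch=$(mktemp -d)

cleanup() {
    # Copy anything worth keeping, then remove the scratch dir,
    # e.g. cp -r "$scratch/results" "/home/$USER/" in a real job.
    rm -rf "$scratch"
}
trap cleanup EXIT

cd "$scratch"
touch model_best.pt    # stand-in for real training output
```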

2. Singularity / Apptainer#

Singularity and Apptainer write large temporary files during container build and pull operations. By default these go to /tmp and can easily fill the root disk.

2.1 Redirect Cache and Temp Directories#

Add these to your ~/.bashrc:

# Singularity / Apptainer cache and temp
export SINGULARITY_CACHEDIR=/workdir/$USER/singularity_cache
export SINGULARITY_TMPDIR=/workdir/$USER/singularity_tmp
export APPTAINER_CACHEDIR=/workdir/$USER/apptainer_cache
export APPTAINER_TMPDIR=/workdir/$USER/apptainer_tmp

mkdir -p $SINGULARITY_CACHEDIR $SINGULARITY_TMPDIR \
         $APPTAINER_CACHEDIR $APPTAINER_TMPDIR
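If you share one ~/.bashrc across machines, a guard keeps these exports from failing where /workdir does not exist. A sketch with the Apptainer pair (the Singularity variables work the same way):

```shell
# Guarded version: only redirect when /workdir is actually writable.
: "${USER:=$(id -un)}"    # USER can be unset in some batch shells
if mkdir -p "/workdir/$USER" 2>/dev/null; then
    export APPTAINER_CACHEDIR="/workdir/$USER/apptainer_cache"
    export APPTAINER_TMPDIR="/workdir/$USER/apptainer_tmp"
    mkdir -p "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR"
fi
```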

2.2 Building Containers#

Always specify --tmpdir when building:

singularity build --tmpdir /workdir/$USER/singularity_tmp my_container.sif my_recipe.def

2.3 Running Containers with --bind#

Bind /workdir into your container so jobs inside the container also write to the safe location:

singularity exec --nv \
  --bind /workdir/$USER:/scratch \
  my_container.sif python train.py --tmp_dir /scratch

2.4 Clean Up Old Caches Periodically#

# Check cache size
du -sh /workdir/$USER/singularity_cache

# Clear cache (safe to run anytime)
singularity cache clean
# or
apptainer cache clean
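Beyond cache clean, stale files can also be pruned by age. The path and the 30-day cutoff below are examples, not site policy:

```shell
# Remove cached files not accessed in the last 30 days (no-op if absent).
: "${USER:=$(id -un)}"
cache="/workdir/$USER/singularity_cache"
if [ -d "$cache" ]; then
    find "$cache" -type f -atime +30 -delete
fi
```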

3. Enroot / Pyxis (NVIDIA Containers)#

If you use srun --container-image=... or Enroot directly, it caches container layers in /tmp by default.

Add to your ~/.bashrc:

# Enroot / Pyxis cache
export ENROOT_CACHE_PATH=/workdir/$USER/enroot_cache
export ENROOT_DATA_PATH=/workdir/$USER/enroot_data
export ENROOT_RUNTIME_PATH=/workdir/$USER/enroot_runtime

mkdir -p $ENROOT_CACHE_PATH $ENROOT_DATA_PATH $ENROOT_RUNTIME_PATH
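With those variables in place, a Pyxis-style launch might look like this. The image tag and the /scratch mount are illustrative, and the command is guarded so the sketch is a no-op on machines without Slurm:

```shell
# Example Pyxis launch; the nvcr.io tag and mount path are examples only.
if command -v srun >/dev/null 2>&1; then
    srun --gres=gpu:1 \
         --container-image=nvcr.io#nvidia/pytorch:24.01-py3 \
         --container-mounts=/workdir/$USER:/scratch \
         python train.py --data_dir /scratch/dataset
else
    echo "srun not available on this machine"
fi
```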

4. Python and pip#

Python also writes to /tmp for package builds and temporary data.

In Job Scripts#

export TMPDIR=/workdir/$USER/tmp
export PIP_CACHE_DIR=/workdir/$USER/pip_cache
mkdir -p $TMPDIR $PIP_CACHE_DIR

Installing Packages Inside Jobs#

pip install --cache-dir /workdir/$USER/pip_cache my_package
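To confirm the redirect took effect, pip can report its active cache directory. The demo path below is an example; on the cluster it would be /workdir/$USER/pip_cache.

```shell
# pip honors PIP_CACHE_DIR; "pip cache dir" echoes the active location.
export PIP_CACHE_DIR="$HOME/pip_cache_demo"    # example path
mkdir -p "$PIP_CACHE_DIR"
python3 -m pip cache dir
```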

5. PyTorch / Deep Learning Frameworks#

Several frameworks create temp files during training:

# PyTorch compiled extensions
export TORCH_EXTENSIONS_DIR=/workdir/$USER/torch_extensions

# Hugging Face model cache (TRANSFORMERS_CACHE is deprecated in newer
# transformers releases; HF_HOME alone is enough there)
export HF_HOME=/workdir/$USER/huggingface
export TRANSFORMERS_CACHE=/workdir/$USER/huggingface/transformers

# Triton (GPU kernel cache)
export TRITON_CACHE_DIR=/workdir/$USER/triton_cache

mkdir -p $TORCH_EXTENSIONS_DIR $HF_HOME $TRITON_CACHE_DIR
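The TMPDIR redirect from section 4 is what makes much of this work: Python's tempfile module, which many frameworks use under the hood, resolves it at process start. A quick check, using an example demo path:

```shell
# tempfile.gettempdir() should report the redirected location.
export TMPDIR="$HOME/tmp_demo"    # example; /workdir/$USER/tmp on the cluster
mkdir -p "$TMPDIR"
python3 -c 'import tempfile; print(tempfile.gettempdir())'
```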


6. Quick Reference: What NOT To Do#

| Bad Practice | Why It’s Dangerous | Do This Instead |
| --- | --- | --- |
| cp dataset.tar.gz /tmp/ | Fills root partition | cp dataset.tar.gz $SLURM_TMPDIR/ |
| singularity build my.sif my.def | Build temp goes to /tmp | singularity build --tmpdir /workdir/$USER/singularity_tmp my.sif my.def |
| pip install big_package (no cache redirect) | pip temp goes to /tmp | Set TMPDIR and PIP_CACHE_DIR first |
| Saving checkpoints to /tmp/checkpoints | Fills root, lost on job end | Save to /home/$USER/ or /workdir/$USER/ |
| Leaving old containers in cache | Wastes disk for everyone | Run singularity cache clean periodically |


7. Check Your Disk Usage#

Before and after jobs, check that you’re not leaving junk behind:

# Check /tmp usage (should be small)
du -sh /tmp/$USER* 2>/dev/null

# Check your /workdir usage
du -sh /workdir/$USER/*

# Check overall node disk
df -h /
df -h /workdir

If you see / above 90%, alert the admin immediately — the node may need manual cleanup before it crashes.
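That check can be scripted; this sketch flags the root filesystem once usage crosses 90%:

```shell
# Print a warning when / is more than 90% full.
usage=$(df -P / | awk 'NR==2 {gsub(/%/,""); print $5}')
if [ "$usage" -gt 90 ]; then
    echo "WARNING: / is ${usage}% full"
fi
```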


Contact#

If a node is drained or unresponsive due to disk full, contact the HPC admin team. Do not attempt to manually clean other users’ files.