# HPC User Guide: Avoiding /tmp Full Errors on Compute Nodes

**Author:** Snit Sanghlao and Claude AI (Anthropic)
**Cluster:** MAI AI-HPC (NVIDIA BCM)
**Date:** 2026-03-21
## Why This Matters

The compute nodes have a shared root partition (`/`) that includes `/tmp`. When jobs write large files to `/tmp`, the entire node’s OS disk fills up, crashing your job, draining the node, and blocking all other users.

**Golden rule:** Never write directly to `/tmp`. Always use `/workdir` or `$SLURM_TMPDIR`.
## 1. Use `$SLURM_TMPDIR` for Temporary Files

Slurm creates a per-job temporary directory that is automatically cleaned up when your job ends.

```bash
#!/bin/bash
#SBATCH --job-name=my_training
#SBATCH --gres=gpu:1

# Copy data to fast local storage
cp /home/$USER/dataset.tar.gz "$SLURM_TMPDIR/"
cd "$SLURM_TMPDIR"
tar -xf dataset.tar.gz

# Run your job from local storage (faster I/O)
python train.py --data_dir "$SLURM_TMPDIR/dataset"

# Copy results back before job ends
cp "$SLURM_TMPDIR/model_best.pt" /home/$USER/results/
```
If `$SLURM_TMPDIR` is not set on your cluster, use `/workdir` with a manual cleanup trap:

```bash
#!/bin/bash
#SBATCH --job-name=my_training
#SBATCH --gres=gpu:1

scratch="/workdir/${USER}_${SLURM_JOB_ID}"
mkdir -p "$scratch"
trap 'rm -rf "$scratch"' EXIT  # auto-delete even if the job crashes
cd "$scratch"

# ... your work here ...
```

Note the single quotes around the `trap` command: they defer expansion of `$scratch` until the trap actually fires, which is the safer idiom.
## 2. Singularity / Apptainer

Singularity and Apptainer write large temporary files during container build and pull operations. By default these go to `/tmp` and can easily fill the root disk.

### 2.1 Redirect Cache and Temp Directories

Add these to your `~/.bashrc`:

```bash
# Singularity / Apptainer cache and temp
export SINGULARITY_CACHEDIR=/workdir/$USER/singularity_cache
export SINGULARITY_TMPDIR=/workdir/$USER/singularity_tmp
export APPTAINER_CACHEDIR=/workdir/$USER/apptainer_cache
export APPTAINER_TMPDIR=/workdir/$USER/apptainer_tmp
mkdir -p $SINGULARITY_CACHEDIR $SINGULARITY_TMPDIR \
         $APPTAINER_CACHEDIR $APPTAINER_TMPDIR
```
### 2.2 Building Containers

Always specify `--tmpdir` when building:

```bash
singularity build --tmpdir /workdir/$USER/singularity_tmp my_container.sif my_recipe.def
```
### 2.3 Running Containers with `--bind`

Bind `/workdir` into your container so jobs inside the container also write to the safe location:

```bash
singularity exec --nv \
    --bind /workdir/$USER:/scratch \
    my_container.sif python train.py --tmp_dir /scratch
```
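Inside a batch job you can bind the per-job scratch directory from Section 1 instead. A minimal sketch, assuming `$SLURM_TMPDIR` is set:

```bash
# Bind the per-job scratch so container writes vanish with the job
singularity exec --nv \
    --bind "$SLURM_TMPDIR:/scratch" \
    my_container.sif python train.py --tmp_dir /scratch
```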
### 2.4 Clean Up Old Caches Periodically

```bash
# Check cache size
du -sh /workdir/$USER/singularity_cache

# Clear cache (safe to run anytime)
singularity cache clean
# or
apptainer cache clean
```
## 3. Enroot / Pyxis (NVIDIA Containers)

If you use `srun --container-image=...` or Enroot directly, container layers are cached in `/tmp` by default.

Add to your `~/.bashrc`:

```bash
# Enroot / Pyxis cache
export ENROOT_CACHE_PATH=/workdir/$USER/enroot_cache
export ENROOT_DATA_PATH=/workdir/$USER/enroot_data
export ENROOT_RUNTIME_PATH=/workdir/$USER/enroot_runtime
mkdir -p $ENROOT_CACHE_PATH $ENROOT_DATA_PATH $ENROOT_RUNTIME_PATH
```
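With those variables in place, a Pyxis job is invoked exactly as before; the layers are simply unpacked under `/workdir`. A minimal sketch (the image tag is illustrative):

```bash
# Pull and run a container via Pyxis; layers land in $ENROOT_CACHE_PATH
srun --gres=gpu:1 \
     --container-image=nvcr.io#nvidia/pytorch:24.05-py3 \
     --container-mounts=/workdir/$USER:/scratch \
     python train.py --tmp_dir /scratch
```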
## 4. Python and pip

Python also writes to `/tmp` for package builds and temporary data.

### In Job Scripts

```bash
export TMPDIR=/workdir/$USER/tmp
export PIP_CACHE_DIR=/workdir/$USER/pip_cache
mkdir -p $TMPDIR $PIP_CACHE_DIR
```

### Installing Packages Inside Jobs

```bash
pip install --cache-dir /workdir/$USER/pip_cache my_package
```
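Python’s `tempfile` module honors `TMPDIR`, so you can verify the redirect took effect:

```bash
# Should print /workdir/<user>/tmp, not /tmp
python -c 'import tempfile; print(tempfile.gettempdir())'
```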
## 5. PyTorch / Deep Learning Frameworks

Several frameworks create temp files during training:

```bash
# PyTorch compiled extensions
export TORCH_EXTENSIONS_DIR=/workdir/$USER/torch_extensions

# Hugging Face model cache (recent transformers releases read HF_HOME;
# TRANSFORMERS_CACHE is kept for older installs)
export HF_HOME=/workdir/$USER/huggingface
export TRANSFORMERS_CACHE=/workdir/$USER/huggingface/transformers

# Triton (GPU kernel cache)
export TRITON_CACHE_DIR=/workdir/$USER/triton_cache

mkdir -p $TORCH_EXTENSIONS_DIR $HF_HOME $TRITON_CACHE_DIR
```
## 6. Recommended `~/.bashrc` Block

Copy this entire block into your `~/.bashrc` to cover all common cases:

```bash
# =============================================================
# HPC: Redirect all temp/cache away from /tmp to /workdir
# =============================================================
export TMPDIR=/workdir/$USER/tmp
export TEMP=$TMPDIR
export TMP=$TMPDIR

# Singularity / Apptainer
export SINGULARITY_CACHEDIR=/workdir/$USER/singularity_cache
export SINGULARITY_TMPDIR=/workdir/$USER/singularity_tmp
export APPTAINER_CACHEDIR=/workdir/$USER/apptainer_cache
export APPTAINER_TMPDIR=/workdir/$USER/apptainer_tmp

# Enroot / Pyxis
export ENROOT_CACHE_PATH=/workdir/$USER/enroot_cache
export ENROOT_DATA_PATH=/workdir/$USER/enroot_data
export ENROOT_RUNTIME_PATH=/workdir/$USER/enroot_runtime

# Python / pip
export PIP_CACHE_DIR=/workdir/$USER/pip_cache

# PyTorch / DL
export TORCH_EXTENSIONS_DIR=/workdir/$USER/torch_extensions
export HF_HOME=/workdir/$USER/huggingface
export TRANSFORMERS_CACHE=/workdir/$USER/huggingface/transformers
export TRITON_CACHE_DIR=/workdir/$USER/triton_cache

# Create all directories
mkdir -p $TMPDIR $SINGULARITY_CACHEDIR $SINGULARITY_TMPDIR \
         $APPTAINER_CACHEDIR $APPTAINER_TMPDIR \
         $ENROOT_CACHE_PATH $ENROOT_DATA_PATH $ENROOT_RUNTIME_PATH \
         $PIP_CACHE_DIR $TORCH_EXTENSIONS_DIR $HF_HOME $TRITON_CACHE_DIR \
         2>/dev/null
```
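If some machines you log into (e.g., login nodes) do not mount `/workdir`, you can wrap the block in a guard so those shells fall back to the defaults. A minimal sketch:

```bash
# Only redirect temp/cache on nodes where /workdir actually exists
if [ -d /workdir ]; then
    export TMPDIR=/workdir/$USER/tmp
    # ... rest of the block above ...
fi
```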
## 7. Quick Reference: What NOT To Do

| Bad Practice | Why It’s Dangerous | Do This Instead |
|---|---|---|
| Writing job output to `/tmp` | Fills root partition | Use `$SLURM_TMPDIR` or `/workdir` |
| `singularity build` without `--tmpdir` | Build temp goes to `/tmp` | Set `SINGULARITY_TMPDIR` or pass `--tmpdir` |
| `pip install` with default temp | pip temp goes to `/tmp` | Set `TMPDIR` and `PIP_CACHE_DIR` |
| Saving checkpoints to `/tmp` | Fills root, lost on job end | Save to `/workdir` or `$HOME` |
| Leaving old containers in cache | Wastes disk for everyone | Run `singularity cache clean` periodically |
## 8. Check Your Disk Usage

Before and after jobs, check that you’re not leaving junk behind:

```bash
# Check /tmp usage (should be small)
du -sh /tmp/$USER* 2>/dev/null

# Check your /workdir usage
du -sh /workdir/$USER/*

# Check overall node disk
df -h /
df -h /workdir
```
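If `/` is filling up and you want to see what is responsible, standard coreutils can rank the largest entries under `/tmp`. A quick sketch:

```bash
# Largest entries directly under /tmp, biggest last (-x stays on one filesystem)
du -xh --max-depth=1 /tmp 2>/dev/null | sort -h | tail -n 10
```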
If you see `/` above 90%, alert the admin immediately: the node may need manual cleanup before it crashes.
## Contact

If a node is drained or unresponsive due to a full disk, contact the HPC admin team. Do not attempt to manually clean other users’ files.