Watchdog & Checkpoint Survival Guide#
Authors: Snit Sanghlao · Owen
Core Principle#
Any job running longer than 1 hour must have crash-proof checkpoints. A single checkpoint file is a single point of failure: corruption, an interrupted write, or a node failure can destroy the only copy, and with it all progress.
Golden Rules#
1. Numbered Checkpoints Are Mandatory#
Never rely on a single latest file. At every save interval, write BOTH:
- ckpt_00100.dat: numbered, immutable, survives corruption
- latest.dat: copy for quick resume
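def save_checkpoint(state, step):
    # save() stands for your own writer; make it atomic as described in rule 2
    k = step // 1_000                  # Checkpoint index in units of 1,000 steps
    numbered = f"ckpt_{k:05d}.dat"
    save(state, numbered)              # Immutable, survives corruption
    save(state, "latest.dat")          # Quick resume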
2. Atomic Writes Prevent Corruption#
Never overwrite a checkpoint in-place. Write to a temp file, then atomically rename:
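import os
import tempfile

def atomic_save(state, path):
    # Create the temp file next to the target so the rename stays on one filesystem
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, 'wb') as f:
            serialize(state, f)        # serialize() is a placeholder for your own writer
            f.flush()
            os.fsync(f.fileno())       # Force the data to disk before renaming
        os.rename(tmp, path)           # Atomic on same filesystem
    except BaseException:
        os.unlink(tmp)                 # Cleanup on failure
        raise                          # Re-raise so a failed save is never silent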
Why: If a job crashes mid-write, the original file is untouched. os.rename() is atomic on Linux ext4/xfs.
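To see the guarantee in action, here is a minimal self-contained sketch. It assumes `pickle` as the serializer (the guide leaves `serialize()` unspecified), and the names `atomic_pickle_save` and `demo` are hypothetical. The second save fails partway through, and the previously saved file is still readable afterwards.

```python
import os
import pickle
import tempfile


def atomic_pickle_save(state, path):
    """Write `state` to `path` via a temp file plus rename (pickle is an assumed serializer)."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp, path)     # atomic when tmp and path share a filesystem
    except BaseException:
        os.unlink(tmp)            # remove the partial temp file
        raise


def demo(workdir):
    """Save once, then attempt a save that fails; return what is readable afterwards."""
    path = os.path.join(workdir, "latest.dat")
    atomic_pickle_save({"step": 1000}, path)
    try:
        # A lambda cannot be pickled, so this write fails before the rename.
        atomic_pickle_save({"step": 2000, "bad": lambda: None}, path)
    except Exception:
        pass
    with open(path, "rb") as f:
        return pickle.load(f), sorted(os.listdir(workdir))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(demo(d))
```

The sketch uses `os.replace()`, which behaves like `os.rename()` on Linux and also overwrites an existing destination on Windows.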
3. Smart Resume Logic#
On startup, scan for the best numbered checkpoint:
import glob, re

def find_best_checkpoint(pattern="ckpt_*.dat"):
    files = glob.glob(pattern)
    if not files:
        return None

    def extract_num(f):
        m = re.search(r'ckpt_(\d+)', f)
        return int(m.group(1)) if m else 0

    return max(files, key=extract_num)

# Resume
ckpt = find_best_checkpoint() or "latest.dat"
load(ckpt)
Numbered checkpoints take priority, and latest.dat is used only when no numbered checkpoint exists, so a corrupted latest.dat never blocks a resume.
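The scan above picks the highest-numbered file without checking that it loads. A possible extension, sketched here under the assumption that checkpoints are pickle files, tries candidates from newest to oldest and returns the first one that loads. The names `load_newest_valid` and `checkpoint_number` are hypothetical, and you can pass your own `loader`.

```python
import glob
import os
import pickle
import re


def checkpoint_number(path):
    """Return the integer in a ckpt_NNNNN.dat file name, or -1 if it does not match."""
    m = re.search(r"ckpt_(\d+)\.dat$", os.path.basename(path))
    return int(m.group(1)) if m else -1


def _pickle_load(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def load_newest_valid(directory=".", loader=None):
    """Try numbered checkpoints from newest to oldest, then latest.dat.

    Returns (path, state), or (None, None) if nothing loads.
    """
    loader = loader or _pickle_load  # assumption: checkpoints were written with pickle
    candidates = sorted(
        glob.glob(os.path.join(directory, "ckpt_*.dat")),
        key=checkpoint_number,
        reverse=True,
    )
    candidates.append(os.path.join(directory, "latest.dat"))
    for path in candidates:
        try:
            return path, loader(path)
        except Exception:
            continue  # missing, truncated, or corrupted: try the next candidate
    return None, None
```

Loading each candidate costs more time than a name-only scan, but it only runs once per resume.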
4. Checkpoint Interval vs. Risk#
| Job Length | Max Checkpoint Interval | Max Loss |
|---|---|---|
| 1 hour | 5 minutes | 5 min of work |
| 1 day | 30 minutes | 30 min of work |
| 1 week | 2 hours | 2 hours of work |
| 1 month | 6 hours | 6 hours of work |
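Rule: For jobs of a week or longer, maximum acceptable loss = roughly 1% of total job runtime. Shorter jobs can tolerate a larger fraction (5 minutes is about 8% of a 1-hour job) because the absolute loss is small. Never lose more than you can afford to redo.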
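As a quick check of the table against the 1% figure, the following sketch computes the worst-case loss for each row as a share of the whole job. The function name is hypothetical, and "1 month" is assumed to mean 30 days.

```python
from datetime import timedelta

# (job length, max checkpoint interval) pairs copied from the table;
# "1 month" is taken as 30 days (an assumption).
TABLE = [
    (timedelta(hours=1), timedelta(minutes=5)),
    (timedelta(days=1), timedelta(minutes=30)),
    (timedelta(weeks=1), timedelta(hours=2)),
    (timedelta(days=30), timedelta(hours=6)),
]


def worst_case_loss_percent(job_length, interval):
    """Work lost, as a percentage of the whole job, if a crash hits just before a save."""
    return 100.0 * (interval / job_length)


if __name__ == "__main__":
    for job_length, interval in TABLE:
        pct = worst_case_loss_percent(job_length, interval)
        print(f"{job_length} job, save every {interval}: {pct:.1f}% at risk")
```

The four rows come out at about 8.3%, 2.1%, 1.2%, and 0.8%, so the table tracks the 1% figure only for the week-long and month-long jobs.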
5. SLURM-Specific: Handle Job Preemption#
SLURM sends SIGTERM before preempting or expiring a job. Catch it:
import signal, sys

def handle_shutdown(signum, frame):
    save_checkpoint(state, current_step)
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_shutdown)
signal.signal(signal.SIGUSR1, handle_shutdown)
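In your SLURM script, use --signal=B:USR1@120 to request SIGUSR1 about 120 seconds before the time limit (SLURM may send it somewhat earlier). The B: prefix delivers the signal to the batch shell only, so the script must launch Python with exec, or trap and forward the signal, for the handler to fire. Whether the signal is also sent on preemption depends on the cluster's preemption settings: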
#SBATCH --signal=B:USR1@120
#SBATCH --time=7-00:00:00
#SBATCH --requeue
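The handler in this section saves from inside the signal handler itself, which can fire while a regular checkpoint is being written. A more conservative variant, sketched here as one option, only sets a flag and lets the main loop save at a step boundary. `ShutdownFlag`, `run`, and the `do_step` callback are hypothetical names, and the sketch assumes Linux, where SIGUSR1 exists.

```python
import signal


class ShutdownFlag:
    """Records that a shutdown signal arrived; the main loop decides when to act on it."""

    def __init__(self, signals=(signal.SIGTERM, signal.SIGUSR1)):
        self.requested = False
        for sig in signals:
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        self.requested = True  # keep the handler trivial: no I/O, no exit


def run(total_steps, do_step, save_checkpoint, flag, save_every=1_000):
    """Hypothetical main loop: saves on schedule and once more when a signal arrives.

    Returns the last completed step.
    """
    state = None
    for step in range(1, total_steps + 1):
        state = do_step(state, step)
        if step % save_every == 0:
            save_checkpoint(state, step)
        if flag.requested:
            save_checkpoint(state, step)  # final save at a step boundary
            return step
    return total_steps
```

With a 120-second warning, this works as long as one step plus one save fits comfortably inside that window.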
6. SLURM Checkpoint Directory on Persistent Storage#
Never store checkpoints on scratch/tmp filesystems. Use persistent storage:
# BAD: lost on node reboot
CHECKPOINT_DIR=/scratch/$USER/job123/ckpts/
# GOOD: survives node failure
CHECKPOINT_DIR=$HOME/checkpoints/job123/
# Or shared filesystem
CHECKPOINT_DIR=/shared/projects/myteam/checkpoints/job123/
7. Progress Metadata Is Unreliable#
JSON/log files reporting progress can be written by crashed processes with stale or incorrect data. Always verify against actual checkpoint files:
# Verify checkpoint integrity before resuming
python -c "import torch; torch.load('$CKPT'); print('OK')" 2>/dev/null || echo "CORRUPTED"
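The one-liner above assumes PyTorch checkpoints. A format-independent alternative, offered here as a sketch and not part of the original workflow, is to store a SHA-256 digest next to each checkpoint at save time and compare it before resuming. The `.sha256` sidecar suffix and the function names are hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_digest(path):
    """Record the checkpoint's digest in a sidecar file (hypothetical .sha256 suffix)."""
    with open(path + ".sha256", "w") as f:
        f.write(file_sha256(path))


def is_intact(path):
    """True only if the checkpoint and its sidecar exist and the digests match."""
    try:
        with open(path + ".sha256") as f:
            expected = f.read().strip()
        return file_sha256(path) == expected
    except OSError:
        return False
```

A matching digest shows the bytes are the ones that were written. It does not show the contents deserialize, so a load test is still worthwhile before a long resume.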
8. Limit Checkpoint Count to Save Disk Space#
Keep only the last N numbered checkpoints plus milestones:
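import glob, os, re

def extract_step(path):
    # File numbers are step // 1_000 (see save_checkpoint in rule 1), so scale back up
    return int(re.search(r"(\d+)\.dat$", path).group(1)) * 1_000

def prune_old_checkpoints(prefix="ckpt_", keep_last=5, milestone_every=500_000):
    files = sorted(glob.glob(f"{prefix}*.dat"))  # Zero-padded names sort in step order
    for f in files[:-keep_last]:
        step = extract_step(f)
        if step % milestone_every != 0:  # Keep milestones forever
            os.remove(f)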
SLURM Job Template (Crash-Proof)#
#!/bin/bash
#SBATCH --job-name=mylongjob
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --time=7-00:00:00
#SBATCH --signal=B:USR1@120
#SBATCH --requeue
#SBATCH --output=logs/%x_%j.out
export CHECKPOINT_DIR=$HOME/checkpoints/myjob
mkdir -p "$CHECKPOINT_DIR"
# Find best checkpoint for resume
BEST=$(ls "$CHECKPOINT_DIR"/ckpt_*.dat 2>/dev/null | sort | tail -1)
# exec replaces the batch shell with Python, so the B:USR1 signal reaches its handlers
if [ -n "$BEST" ]; then
    echo "Resuming from: $BEST"
    exec python myjob.py --resume "$BEST"
else
    echo "Starting fresh"
    exec python myjob.py
fi
Watchdog / Monitoring Script#
#!/bin/bash
# watchdog.sh: runs via cron, restarts job if dead
JOB_NAME="myjob.py"
PID_FILE="/var/run/myjob.pid"
CKPT_DIR="$HOME/checkpoints/myjob"
LOG="/var/log/myjob-watchdog.log"
# Find running process
PID=$(pgrep -f "$JOB_NAME" | head -1)
if [ -z "$PID" ]; then
    # Find best checkpoint
    BEST=$(ls "$CKPT_DIR"/ckpt_*.dat 2>/dev/null | sort | tail -1)
    if [ -n "$BEST" ]; then
        echo "$(date) Job died. Restarting from $BEST" >> "$LOG"
        nohup python "$JOB_NAME" --resume "$BEST" &
        echo $! > "$PID_FILE"
    else
        echo "$(date) Job died. No checkpoint found. Starting fresh." >> "$LOG"
        nohup python "$JOB_NAME" &
        echo $! > "$PID_FILE"
    fi
else
    echo "$(date) Job alive (PID $PID)" >> "$LOG"
fi
Cron: */15 * * * * /path/to/watchdog.sh
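The script above only checks that a process exists, so a job that is alive but hung would pass. One optional extension, sketched here in Python as an assumption and not part of the original watchdog, is to check how long ago the newest checkpoint was written. The function names and the threshold are hypothetical.

```python
import glob
import os
import time


def newest_checkpoint_age(ckpt_dir, pattern="ckpt_*.dat", now=None):
    """Seconds since the most recently modified checkpoint, or None if there is none."""
    paths = glob.glob(os.path.join(ckpt_dir, pattern))
    if not paths:
        return None
    now = time.time() if now is None else now
    return now - max(os.path.getmtime(p) for p in paths)


def looks_stalled(ckpt_dir, max_age_seconds):
    """True if checkpoints exist but none was written within max_age_seconds."""
    age = newest_checkpoint_age(ckpt_dir)
    return age is not None and age > max_age_seconds
```

A reasonable threshold is a small multiple of the checkpoint interval, so that one slow save does not trigger a restart.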
Implementation Checklist#
[ ] Numbered checkpoints at every save interval
[ ] Atomic writes (temp file + rename)
[ ] latest file as secondary for quick resume
[ ] Smart resume scans for best numbered checkpoint
[ ] Signal handlers (SIGTERM, SIGUSR1) save before exit
[ ] Checkpoint interval ≤1% of total job time
[ ] Milestone checkpoints kept forever (major boundaries)
[ ] Old checkpoints pruned to save disk space
[ ] Checkpoints stored on persistent storage (not scratch)
[ ] SLURM: --signal=B:USR1@120 for preemption warning
[ ] Watchdog monitors and restarts with best checkpoint
Anti-Patterns (Don't Do These)#
| Anti-Pattern | Consequence |
|---|---|
| Single checkpoint file | Corruption loses all progress |
| Overwriting checkpoint in-place | Mid-write crash corrupts the file |
| No signal handlers | Node shutdown kills unsaved work |
| Checkpoints on scratch/tmp | Lost on node reboot |
| Checkpoint interval too wide | Loses days of computation |
| Trusting progress JSON over checkpoints | Crashed process writes stale or incorrect progress |
| Keeping all checkpoints | Fills the disk, so later checkpoint writes fail |
| Watchdog hardcoded to a single checkpoint file | Restarts from corrupted state |
Recovery Procedure#
When a job crashes:
1. Don't panic: numbered checkpoints survive
2. Check which checkpoints exist:
   ls -lt ckpt_*.dat | head
3. Verify integrity of the latest:
   python -c "load('ckpt_XXXX.dat'); print('OK')"
4. If corrupted, go back one: try ckpt_(N-1).dat
5. Resume from the best valid checkpoint
6. Investigate root cause (OOM? hardware? bug?)
7. Fix before restarting
Cost of Getting It Wrong#
Real example: Lost ~100K training steps (~50 hours GPU time, ~$30 cloud cost) because a single checkpoint file was corrupted during a crash. The fix took 2 hours to implement. The prevention is 30 lines of code.
Invest 30 minutes setting up crash-proof checkpoints. Save days of recomputation.