Watchdog & Checkpoint Survival Guide

Authors: Snit Sanghlao · Owen

Core Principle

Any job running longer than 1 hour must have crash-proof checkpoints. A single checkpoint file is a single point of failure: corruption, interrupted writes, or node failures will lose everything since the last good save.

Golden Rules

1. Numbered Checkpoints Are Mandatory

Never rely on a single latest file. At every save interval, write BOTH:

  • ckpt_00100.dat: numbered, immutable, survives corruption

  • latest.dat: copy for quick resume

def save_checkpoint(state, step):
    k = step // 1_000
    numbered = f"ckpt_{k:05d}.dat"
    save(state, numbered)   # Immutable, survives corruption
    save(state, "latest.dat")  # Quick resume

2. Atomic Writes Prevent Corruption

Never overwrite a checkpoint in-place. Write to a temp file, then atomically rename:

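import os
import tempfile

def atomic_save(state, path):
    # Create the temp file next to the target so the rename stays on one filesystem
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, 'wb') as f:
            serialize(state, f)   # serialize() is a placeholder for your own writer
            f.flush()
            os.fsync(f.fileno())  # Force the data to disk before renaming
        os.replace(tmp, path)  # Atomic on same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # Cleanup on failure
        raise  # Re-raise so the caller knows the save failed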

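Why: If a job crashes mid-write, the original file is untouched. Renaming within a single filesystem is atomic on POSIX systems, including Linux ext4/xfs. The guarantee does not hold across filesystems, which is why the temp file is created in the target's own directory.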
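To make rules 1 and 2 concrete, here is a minimal runnable sketch that writes both the numbered file and latest.dat through an atomic write. It assumes pickle as the serializer and one checkpoint per 1,000 steps; the ckpt_dir and steps_per_ckpt parameters and the Unpicklable helper are hypothetical names introduced only for illustration.

```python
import os
import pickle
import tempfile


def atomic_save(state, path):
    """Write state so that path holds either the old file or the complete new one."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)   # assumption: pickle stands in for your serializer
            f.flush()
            os.fsync(f.fileno())    # push the bytes to disk before renaming
        os.replace(tmp, path)       # atomic when tmp and path share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)          # leave no partial temp file behind
        raise


def save_checkpoint(state, step, ckpt_dir, steps_per_ckpt=1000):
    """Write the numbered checkpoint first, then refresh latest.dat."""
    numbered = os.path.join(ckpt_dir, f"ckpt_{step // steps_per_ckpt:05d}.dat")
    atomic_save(state, numbered)
    atomic_save(state, os.path.join(ckpt_dir, "latest.dat"))
    return numbered


class Unpicklable:
    """Hypothetical object whose serialization always fails, simulating a crash mid-write."""

    def __reduce__(self):
        raise RuntimeError("simulated failure during serialization")
```

Calling atomic_save(Unpicklable(), path) on an existing checkpoint raises the simulated error, and the previous contents of path remain loadable because the failure happens before the rename.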

3. Smart Resume Logic

On startup, scan for the best numbered checkpoint:

import glob, re

def find_best_checkpoint(pattern="ckpt_*.dat"):
    files = glob.glob(pattern)
    if not files:
        return None
    def extract_num(f):
        m = re.search(r'ckpt_(\d+)', f)
        return int(m.group(1)) if m else 0
    return max(files, key=extract_num)

# Resume
ckpt = find_best_checkpoint() or "latest.dat"
load(ckpt)

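This prefers the highest-numbered checkpoint and uses latest.dat only when no numbered checkpoint exists, so a corrupted latest.dat never blocks a resume. Note that the snippet does not check that the chosen file actually loads; if it is corrupted, step back to the next-highest number.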
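The fallback can be made automatic by trying candidates from newest to oldest and keeping the first one that loads. The following is a sketch under stated assumptions, not the authors' implementation: it assumes checkpoints are pickle files, and load_best_valid and checkpoint_index are hypothetical helper names.

```python
import glob
import os
import pickle
import re


def checkpoint_index(path):
    """Return the number embedded in a name like ckpt_00100.dat, or -1 if absent."""
    m = re.search(r"ckpt_(\d+)\.dat$", os.path.basename(path))
    return int(m.group(1)) if m else -1


def load_best_valid(ckpt_dir):
    """Try numbered checkpoints from newest to oldest, then latest.dat.

    Returns (path, state) for the first file that loads, or (None, None).
    """
    numbered = sorted(
        glob.glob(os.path.join(ckpt_dir, "ckpt_*.dat")),
        key=checkpoint_index,
        reverse=True,
    )
    for path in numbered + [os.path.join(ckpt_dir, "latest.dat")]:
        try:
            with open(path, "rb") as f:
                return path, pickle.load(f)  # assumption: pickle format
        except Exception:
            continue  # missing, truncated, or corrupted: try the next candidate
    return None, None
```

Loading the file is the integrity check here, so a truncated newest checkpoint is skipped and the job resumes from the one before it.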

4. Checkpoint Interval vs. Risk

| Job Length | Max Checkpoint Interval | Max Loss |
|---|---|---|
| 1 hour | 5 minutes | 5 min of work |
| 1 day | 30 minutes | 30 min of work |
| 1 week | 2 hours | 2 hours of work |
| 1 month | 6 hours | 6 hours of work |

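Rule: Aim for a maximum loss of about 1% of total job runtime. The intervals in the table are close to 1% for week- and month-long jobs and a larger share (roughly 2% to 8%) for shorter ones, where the absolute loss is already small. Never lose more than you can afford to redo.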
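The 1% rule is simple arithmetic, sketched below. The function name and the max_loss_fraction parameter are hypothetical, and a month is assumed to be 30 days (720 hours).

```python
def max_checkpoint_interval_hours(job_hours, max_loss_fraction=0.01):
    """Longest checkpoint interval that keeps worst-case loss within the given fraction."""
    if job_hours <= 0:
        raise ValueError("job_hours must be positive")
    return job_hours * max_loss_fraction


# Assumed job lengths in hours; "1 month" is taken as 30 days.
for label, hours in [("1 hour", 1), ("1 day", 24), ("1 week", 168), ("1 month", 720)]:
    interval = max_checkpoint_interval_hours(hours)
    print(f"{label:8s} -> checkpoint at least every {interval * 60:.1f} min")
```

For a one-week job (168 hours) a strict 1% gives 1.68 hours between checkpoints, which the table rounds to 2 hours.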

5. SLURM-Specific: Handle Job Preemption

SLURM sends SIGTERM before preempting or expiring a job. Catch it:

import signal, sys

def handle_shutdown(signum, frame):
    save_checkpoint(state, current_step)
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_shutdown)
signal.signal(signal.SIGUSR1, handle_shutdown)

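In your SLURM script, use --signal=B:USR1@120 to have SLURM send SIGUSR1 roughly 120 seconds before the job reaches its time limit; whether it is also sent ahead of preemption depends on the cluster's configuration. The B: prefix delivers the signal to the batch shell only, so the batch script must either exec the Python process or trap the signal and forward it: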

#SBATCH --signal=B:USR1@120
#SBATCH --time=7-00:00:00
#SBATCH --requeue
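Saving from inside a signal handler can capture state in the middle of a step. One common alternative, sketched here as an assumption rather than the authors' method, is to have the handler only set a flag and let the main loop save once the current step has finished. The do_step and save_checkpoint callables and the run function are hypothetical placeholders.

```python
import signal

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only record the request; the loop saves at a safe point."""
    global stop_requested
    stop_requested = True


def install_handlers():
    """Register the handler for the signals SLURM is expected to send."""
    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGUSR1, request_stop)


def run(total_steps, do_step, save_checkpoint, start_step=0):
    """Run steps until done or a stop is requested; return the number of completed steps."""
    step = start_step
    while step < total_steps:
        do_step(step)
        step += 1
        if stop_requested:
            save_checkpoint(step)  # state is consistent: the step has finished
            break
    return step
```

Call install_handlers() once at startup, from the main thread. The 120-second warning then needs to cover one full step plus the time to write the checkpoint.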

6. SLURM Checkpoint Directory on Persistent Storage

Never store checkpoints on scratch/tmp filesystems. Use persistent storage:

# BAD: lost on node reboot
CHECKPOINT_DIR=/scratch/$USER/job123/ckpts/

# GOOD: survives node failure
CHECKPOINT_DIR=$HOME/checkpoints/job123/
# Or shared filesystem
CHECKPOINT_DIR=/shared/projects/myteam/checkpoints/job123/

7. Progress Metadata Is Unreliable

JSON/log files reporting progress can be written by crashed processes with stale or incorrect data. Always verify against actual checkpoint files:

# Verify checkpoint integrity before resuming
# (torch.load applies to PyTorch checkpoints; use your own loader otherwise)
python -c "import sys, torch; torch.load(sys.argv[1], map_location='cpu'); print('OK')" "$CKPT" 2>/dev/null || echo "CORRUPTED"
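The same principle can be applied in code by deriving progress from the checkpoint files and treating the metadata as a hint only. This is a minimal sketch: the progress.json layout with a "step" key is a hypothetical example, and one checkpoint per 1,000 steps is assumed, matching the numbering in rule 1.

```python
import glob
import json
import os
import re


def steps_on_disk(ckpt_dir, steps_per_ckpt=1000):
    """Highest step covered by a numbered checkpoint file (0 if none exist)."""
    best = 0
    for path in glob.glob(os.path.join(ckpt_dir, "ckpt_*.dat")):
        m = re.search(r"ckpt_(\d+)\.dat$", os.path.basename(path))
        if m:
            best = max(best, int(m.group(1)) * steps_per_ckpt)
    return best


def trusted_progress(progress_json, ckpt_dir):
    """Return (step_from_checkpoints, metadata_agrees)."""
    actual = steps_on_disk(ckpt_dir)
    try:
        with open(progress_json) as f:
            claimed = int(json.load(f)["step"])  # "step" key is a hypothetical layout
    except (OSError, ValueError, KeyError, TypeError):
        claimed = None  # unreadable or malformed metadata never overrides the files
    return actual, claimed == actual
```

This only checks which checkpoint files exist; it should be combined with a load test like the one above before resuming.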

8. Limit Checkpoint Count to Save Disk Space

Keep only the last N numbered checkpoints plus milestones:

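import glob, os, re

def extract_index(path):
    # Checkpoint number from the filename, e.g. ckpt_00100.dat -> 100
    m = re.search(r"(\d+)\.dat$", os.path.basename(path))
    return int(m.group(1)) if m else -1

def prune_old_checkpoints(prefix="ckpt_", keep_last=5, milestone_every=500):
    # milestone_every counts checkpoint numbers, not steps: with one
    # checkpoint per 1,000 steps, 500 means a milestone every 500,000 steps
    files = sorted(glob.glob(f"{prefix}*.dat"), key=extract_index)
    old = files[:-keep_last] if keep_last > 0 else files
    for f in old:
        if extract_index(f) % milestone_every != 0:  # Keep milestones forever
            os.remove(f)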

SLURM Job Template (Crash-Proof)

#!/bin/bash
#SBATCH --job-name=mylongjob
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --time=7-00:00:00
#SBATCH --signal=B:USR1@120
#SBATCH --requeue
#SBATCH --output=logs/%x_%j.out

export CHECKPOINT_DIR=$HOME/checkpoints/myjob
mkdir -p "$CHECKPOINT_DIR"

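# Find best checkpoint for resume
BEST=$(ls "$CHECKPOINT_DIR"/ckpt_*.dat 2>/dev/null | sort | tail -1)
# exec replaces the batch shell with Python, so the USR1 signal that
# --signal=B:USR1@120 sends to the batch shell reaches the Python handler
if [ -n "$BEST" ]; then
    echo "Resuming from: $BEST"
    exec python myjob.py --resume "$BEST"
else
    echo "Starting fresh"
    exec python myjob.py
fi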

Watchdog / Monitoring Script

#!/bin/bash
# watchdog.sh: runs via cron, restarts job if dead

JOB_NAME="myjob.py"
PID_FILE="/var/run/myjob.pid"
CKPT_DIR="$HOME/checkpoints/myjob"
LOG="/var/log/myjob-watchdog.log"

# Find running process
PID=$(pgrep -f "$JOB_NAME" | head -1)

if [ -z "$PID" ]; then
    # Find best checkpoint
    BEST=$(ls "$CKPT_DIR"/ckpt_*.dat 2>/dev/null | sort | tail -1)
    if [ -n "$BEST" ]; then
        echo "$(date) Job died. Restarting from $BEST" >> "$LOG"
        nohup python "$JOB_NAME" --resume "$BEST" >> "$LOG" 2>&1 &
        echo $! > "$PID_FILE"
    else
        echo "$(date) Job died. No checkpoint found. Starting fresh." >> "$LOG"
        nohup python "$JOB_NAME" >> "$LOG" 2>&1 &
        echo $! > "$PID_FILE"
    fi
else
    echo "$(date) Job alive (PID $PID)" >> "$LOG"
fi

Cron: */15 * * * * /path/to/watchdog.sh

Implementation Checklist

  • [ ] Numbered checkpoints at every save interval

  • [ ] Atomic writes (temp file + rename)

  • [ ] latest file as secondary for quick resume

  • [ ] Smart resume scans for best numbered checkpoint

  • [ ] Signal handlers (SIGTERM, SIGUSR1) save before exit

  • [ ] Checkpoint interval ≤1% of total job time

  • [ ] Milestone checkpoints kept forever (major boundaries)

  • [ ] Old checkpoints pruned to save disk space

  • [ ] Checkpoints stored on persistent storage (not scratch)

  • [ ] SLURM: --signal=B:USR1@120 for preemption warning

  • [ ] Watchdog monitors and restarts with best checkpoint

Anti-Patterns (Don't Do These)

| Anti-Pattern | Consequence |
|---|---|
| Single checkpoint file (latest.dat only) | Corruption loses all progress |
| Overwriting checkpoint in-place | Mid-write crash corrupts the file |
| No signal handlers | Node shutdown kills unsaved work |
| Checkpoints on scratch/tmp | Lost on node reboot |
| Checkpoint interval too wide | Loses days of computation |
| Trusting progress JSON over checkpoints | Crashed process writes fake progress |
| Keeping all checkpoints | Fills the disk, so later checkpoint writes fail |
| Watchdog hardcoded to latest file | Restarts from corrupted state |

Recovery Procedure

When a job crashes:

  1. Don't panic: numbered checkpoints survive

  2. Check which checkpoints exist: ls -lt ckpt_*.dat | head

  3. Verify integrity of the newest one: python -c "import torch; torch.load('ckpt_XXXX.dat'); print('OK')" (swap in your own loader if the checkpoints are not PyTorch files)

  4. If corrupted, go back one: try ckpt_(N-1).dat

  5. Resume from best valid checkpoint

  6. Investigate root cause (OOM? hardware? bug?)

  7. Fix before restarting

Cost of Getting It Wrong

Real example: Lost ~100K training steps (~50 hours GPU time, ~$30 cloud cost) because a single checkpoint file was corrupted during a crash. The fix took 2 hours to implement. The prevention is 30 lines of code.

Invest 30 minutes setting up crash-proof checkpoints. Save days of recomputation.