---
name: watchdog-checkpoint-survival
authors: [Snit Sanghlao, Owen]
description: Universal guide for preventing data loss in long-running jobs (SLURM, training, simulations)
created: 2026-05-15
---

# Watchdog & Checkpoint Survival Guide

**Authors:** Snit Sanghlao · Owen

## Core Principle

**Any job running longer than 1 hour must have crash-proof checkpoints.** A single checkpoint file is a single point of failure — corruption, interrupted writes, or node failures will lose everything since the last good save.

## Golden Rules

### 1. Numbered Checkpoints Are Mandatory

Never rely on a single `latest` file. At every save interval, write BOTH:
- `ckpt_00100.dat` — numbered, immutable, survives corruption
- `latest.dat` — copy for quick resume

```python
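def save_checkpoint(state, step):
    k = step // 1_000          # checkpoint index, assuming one save every 1,000 steps
    numbered = f"ckpt_{k:05d}.dat"
    save(state, numbered)      # save() stands for your serializer; numbered file is never overwritten
    save(state, "latest.dat")  # convenience copy for quick resume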
```

### 2. Atomic Writes Prevent Corruption

Never overwrite a checkpoint in-place. Write to a temp file, then atomically rename:

```python
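import os, tempfile

def atomic_save(state, path):
    # The temp file must live on the same filesystem as the target
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, 'wb') as f:
            serialize(state, f)       # placeholder for your serializer
            f.flush()
            os.fsync(f.fileno())      # push the bytes to disk before renaming
        os.replace(tmp, path)         # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)            # clean up the partial temp file
        raise                         # never swallow the failure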
```

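**Why:** If a job crashes mid-write, the original file is untouched. The rename is atomic on POSIX filesystems such as ext4 and xfs, provided the temp file and the target are on the same filesystem, which is why the temp file is created in the target's directory. Atomicity alone does not guarantee durability: without an `fsync` before the rename, a power loss can leave an empty or partial file under the final name.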
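As a runnable illustration of rules 1 and 2 together, here is a minimal sketch. It assumes `pickle` as the serializer, and the `ckpt_dir` and `steps_per_ckpt` parameters are hypothetical names introduced for the example; substitute whatever your framework uses.

```python
import os
import pickle
import tempfile

def atomic_pickle(obj, path):
    """Write obj to path via a temp file in the same directory, then rename."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())  # data reaches disk before the rename
        os.replace(tmp, path)     # readers see the old file or the new one, never a mix
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise

def save_checkpoint(state, step, ckpt_dir, steps_per_ckpt=1_000):
    """Write the numbered file first, then refresh latest.dat."""
    k = step // steps_per_ckpt
    numbered = os.path.join(ckpt_dir, f"ckpt_{k:05d}.dat")
    atomic_pickle(state, numbered)
    atomic_pickle(state, os.path.join(ckpt_dir, "latest.dat"))
    return numbered
```

Writing the numbered file first means a crash between the two writes leaves `latest.dat` one save behind, while the newest numbered file is already complete.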

### 3. Smart Resume Logic

On startup, scan for the best numbered checkpoint:

```python
import glob, re

def find_best_checkpoint(pattern="ckpt_*.dat"):
    files = glob.glob(pattern)
    if not files:
        return None
    def extract_num(f):
        m = re.search(r'ckpt_(\d+)', f)
        return int(m.group(1)) if m else 0
    return max(files, key=extract_num)

# Resume
ckpt = find_best_checkpoint() or "latest.dat"
load(ckpt)
```

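This prefers the highest-numbered checkpoint and uses `latest.dat` only when no numbered checkpoint exists. If the chosen file turns out to be corrupted, fall back to the next-highest numbered checkpoint.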
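The snippet above picks a file but does not handle a load failure. One way to make the fallback automatic is to try candidates from newest to oldest until one loads. This is a sketch under assumptions: checkpoints are pickled, and `load_best_checkpoint` and `checkpoint_index` are hypothetical helper names.

```python
import glob
import os
import pickle
import re

def checkpoint_index(path):
    """ckpt_00100.dat -> 100; None if the name does not match."""
    m = re.search(r"ckpt_(\d+)\.dat$", os.path.basename(path))
    return int(m.group(1)) if m else None

def load_best_checkpoint(ckpt_dir="."):
    """Return (path, state) for the newest checkpoint that loads, else (None, None)."""
    numbered = [p for p in glob.glob(os.path.join(ckpt_dir, "ckpt_*.dat"))
                if checkpoint_index(p) is not None]
    candidates = sorted(numbered, key=checkpoint_index, reverse=True)
    candidates.append(os.path.join(ckpt_dir, "latest.dat"))  # last resort
    for path in candidates:
        try:
            with open(path, "rb") as f:
                return path, pickle.load(f)
        except Exception as exc:  # missing, truncated, or unreadable file
            print(f"skipping {path}: {exc}")
    return None, None
```

A caller would start fresh when the function returns `(None, None)` and otherwise resume from the returned state.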

### 4. Checkpoint Interval vs. Risk

| Job Length | Max Checkpoint Interval | Max Loss |
|-----------|------------------------|----------|
| 1 hour | 5 minutes | 5 min of work |
| 1 day | 30 minutes | 30 min of work |
| 1 week | 2 hours | 2 hours of work |
| 1 month | 6 hours | 6 hours of work |

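**Rule:** For multi-day jobs, keep the maximum loss to roughly 1% of total runtime. Shorter jobs can accept a larger fraction (5 minutes is about 8% of a 1-hour job) because the absolute cost of redoing the work is small. Never lose more than you can afford to redo.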
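To see how the table's intervals relate to the 1% figure, the worst-case loss fraction can be computed directly. The sketch below copies the table's values and assumes a 30-day month; `loss_fraction` is a name introduced for the example.

```python
# (job length, checkpoint interval) in minutes, copied from the table above.
# "1 month" is taken as 30 days, which is an assumption.
rows = {
    "1 hour": (60, 5),
    "1 day": (24 * 60, 30),
    "1 week": (7 * 24 * 60, 120),
    "1 month": (30 * 24 * 60, 360),
}

def loss_fraction(job_minutes, interval_minutes):
    """Worst-case share of the run lost if a crash lands just before a save."""
    return interval_minutes / job_minutes

for name, (job, interval) in rows.items():
    print(f"{name:8s} {loss_fraction(job, interval):.1%}")
```

The fractions come out at about 8.3%, 2.1%, 1.2%, and 0.8%, so the table approaches 1% only for jobs of a week or longer.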

### 5. SLURM-Specific: Handle Job Preemption

SLURM sends `SIGTERM` when a job is preempted or reaches its time limit, followed by `SIGKILL` after a site-configured grace period. Catch it:

```python
import signal, sys

def handle_shutdown(signum, frame):
    # state and current_step are module-level globals kept current by the main loop
    save_checkpoint(state, current_step)
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_shutdown)
signal.signal(signal.SIGUSR1, handle_shutdown)
```

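In your SLURM script, `--signal=B:USR1@120` requests `SIGUSR1` about 120 seconds before the job's time limit (SLURM may deliver it up to 60 seconds earlier than specified). Whether the same warning is sent before preemption depends on site configuration. The `B:` prefix delivers the signal only to the batch shell, so start Python with `exec` so that it takes the shell's place, or trap the signal in the script and forward it. `--requeue` makes the job eligible to be requeued after preemption or node failure.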

```bash
#SBATCH --signal=B:USR1@120
#SBATCH --time=7-00:00:00
#SBATCH --requeue
```

### 6. SLURM Checkpoint Directory on Persistent Storage

**Never store checkpoints on scratch/tmp filesystems.** Use persistent storage:

```bash
# BAD: node-local scratch is lost on node reboot and is often purged by policy
CHECKPOINT_DIR=/scratch/$USER/job123/ckpts/

# GOOD: survives node failure
CHECKPOINT_DIR=$HOME/checkpoints/job123/
# Or a shared project filesystem (useful when $HOME has a small quota)
CHECKPOINT_DIR=/shared/projects/myteam/checkpoints/job123/
```

### 7. Progress Metadata Is Unreliable

JSON/log files reporting progress can be written by crashed processes with stale or incorrect data. Always verify against actual checkpoint files:

```bash
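# Verify checkpoint integrity before resuming (PyTorch example; substitute your own loader).
# map_location='cpu' lets the check run on a node without a GPU.
# Any load failure prints CORRUPTED; drop 2>/dev/null to see the actual error.
python -c "import torch; torch.load('$CKPT', map_location='cpu'); print('OK')" 2>/dev/null || echo "CORRUPTED"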
```
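Loading a file shows that it parses, not that its bytes match what was written. One possible complement, not part of the workflow above, is a checksum sidecar written next to each checkpoint. The sketch below assumes a `.sha256` suffix and the helper names `write_checksum` and `verify_checksum`, all introduced for illustration.

```python
import hashlib
import os

def file_sha256(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_checksum(path):
    """Store the checkpoint's digest in path + '.sha256'."""
    with open(path + ".sha256", "w") as f:
        f.write(file_sha256(path))

def verify_checksum(path):
    """True only if both files exist and the stored digest matches."""
    sidecar = path + ".sha256"
    if not (os.path.exists(path) and os.path.exists(sidecar)):
        return False
    with open(sidecar) as f:
        expected = f.read().strip()
    return file_sha256(path) == expected
```

Calling `write_checksum` right after a successful save and `verify_checksum` before resuming catches truncation without deserializing the whole checkpoint.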

### 8. Limit Checkpoint Count to Save Disk Space

Keep only the last N numbered checkpoints plus milestones:

```python
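import glob, os, re

def checkpoint_index(path):
    # ckpt_00100.dat -> 100; None if the name carries no index
    m = re.search(r"(\d+)\.dat$", path)
    return int(m.group(1)) if m else None

def prune_old_checkpoints(prefix="ckpt_", keep_last=5, milestone_every=500):
    files = [f for f in glob.glob(f"{prefix}*.dat") if checkpoint_index(f) is not None]
    files.sort(key=checkpoint_index)  # numeric order, not lexicographic
    for f in files[:max(0, len(files) - keep_last)]:
        # milestone_every is in checkpoint-index units (500 = step 500_000
        # at 1_000 steps per index); milestones are kept forever
        if checkpoint_index(f) % milestone_every != 0:
            os.remove(f)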

## SLURM Job Template (Crash-Proof)

```bash
#!/bin/bash
#SBATCH --job-name=mylongjob
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --time=7-00:00:00
#SBATCH --signal=B:USR1@120
#SBATCH --requeue
#SBATCH --output=logs/%x_%j.out

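export CHECKPOINT_DIR=$HOME/checkpoints/myjob
mkdir -p "$CHECKPOINT_DIR"
# Note: logs/ must already exist at submit time; SLURM does not create it.

# Find best checkpoint for resume (sort -V orders by number)
BEST=$(ls "$CHECKPOINT_DIR"/ckpt_*.dat 2>/dev/null | sort -V | tail -1)

# exec replaces the batch shell with Python, so the B:USR1 signal
# (delivered only to the batch shell) reaches the Python handler.
if [ -n "$BEST" ]; then
    echo "Resuming from: $BEST"
    exec python myjob.py --resume "$BEST"
else
    echo "Starting fresh"
    exec python myjob.py
fi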
```

## Watchdog / Monitoring Script

```bash
#!/bin/bash
# watchdog.sh: runs via cron, restarts the job if it is dead

JOB_NAME="myjob.py"                 # cron starts in $HOME; use an absolute path if the script lives elsewhere
PID_FILE="$HOME/myjob.pid"          # user-writable; /var/run and /var/log usually need root
CKPT_DIR="$HOME/checkpoints/myjob"
LOG="$HOME/myjob-watchdog.log"

# Find running process (pgrep -f matches any command line containing the name)
PID=$(pgrep -f "$JOB_NAME" | head -1)

if [ -z "$PID" ]; then
    # Find best checkpoint (sort -V orders by number)
    BEST=$(ls "$CKPT_DIR"/ckpt_*.dat 2>/dev/null | sort -V | tail -1)
    if [ -n "$BEST" ]; then
        echo "$(date) Job died. Restarting from $BEST" >> "$LOG"
        nohup python "$JOB_NAME" --resume "$BEST" >> "$LOG" 2>&1 &
    else
        echo "$(date) Job died. No checkpoint found. Starting fresh." >> "$LOG"
        nohup python "$JOB_NAME" >> "$LOG" 2>&1 &
    fi
    echo $! > "$PID_FILE"
else
    echo "$(date) Job alive (PID $PID)" >> "$LOG"
fi

Cron: `*/15 * * * * /path/to/watchdog.sh`
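The script above only checks that a process exists, so a job that is alive but hung passes the check. One way to catch that case is to compare the age of the newest checkpoint with the expected save interval. The sketch below assumes that approach; `newest_checkpoint_age`, `is_stalled`, and the `slack` factor are hypothetical names and defaults.

```python
import glob
import os
import time

def newest_checkpoint_age(ckpt_dir, now=None):
    """Seconds since the newest ckpt_*.dat was modified, or None if there are none."""
    files = glob.glob(os.path.join(ckpt_dir, "ckpt_*.dat"))
    if not files:
        return None
    now = time.time() if now is None else now
    return now - max(os.path.getmtime(p) for p in files)

def is_stalled(ckpt_dir, interval_seconds, slack=2.0, now=None):
    """True when no checkpoint has appeared within slack * interval_seconds."""
    age = newest_checkpoint_age(ckpt_dir, now)
    return age is not None and age > slack * interval_seconds
```

With a 30-minute save interval, `is_stalled(ckpt_dir, 1800)` flags a job whose newest checkpoint is more than an hour old. A directory with no checkpoints is reported as not stalled, since the job may simply not have reached its first save.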

## Implementation Checklist

- [ ] Numbered checkpoints at every save interval
- [ ] Atomic writes (temp file + rename)
- [ ] `latest` file as secondary for quick resume
- [ ] Smart resume scans for best numbered checkpoint
- [ ] Signal handlers (SIGTERM, SIGUSR1) save before exit
- [ ] Checkpoint interval follows the table in rule 4 (about 1% of runtime for multi-day jobs)
- [ ] Milestone checkpoints kept forever (major boundaries)
- [ ] Old checkpoints pruned to save disk space
- [ ] Checkpoints stored on persistent storage (not scratch)
- [ ] SLURM: `--signal=B:USR1@120` for preemption warning
- [ ] Watchdog monitors and restarts with best checkpoint

## Anti-Patterns (Don't Do These)

| Anti-Pattern | Consequence |
|-------------|-------------|
| Single checkpoint file (`latest.dat` only) | Corruption loses all progress |
| Overwriting checkpoint in-place | Mid-write crash corrupts the file |
| No signal handlers | Node shutdown kills unsaved work |
| Checkpoints on scratch/tmp | Lost on node reboot |
| Checkpoint interval too wide | Loses days of computation |
| Trusting progress JSON over checkpoints | Resume point is based on stale or wrong progress data |
| Keeping all checkpoints | Fills the disk, so later saves fail |
| Watchdog hardcoded to `latest` file | Restarts from corrupted state |

## Recovery Procedure

When a job crashes:

1. **Don't panic** — numbered checkpoints survive
2. Check which checkpoints exist: `ls -lt ckpt_*.dat | head`
3. Verify integrity of the newest one with your own loader, for example the `torch.load` check from rule 7 for PyTorch files
4. If corrupted, go back one: try `ckpt_(N-1).dat`
5. Investigate root cause (OOM? hardware? bug?)
6. Fix it before restarting
7. Resume from the best valid checkpoint

## Cost of Getting It Wrong

Real example: Lost ~100K training steps (~50 hours GPU time, ~$30 cloud cost) because a single checkpoint file was corrupted during a crash. The fix took 2 hours to implement. The prevention is 30 lines of code.

**Invest 30 minutes setting up crash-proof checkpoints. Save days of recomputation.**