# Common Warnings — HPC Cluster (Mahidol DGX)

## WARNING: Do NOT run tmux on compute nodes via SLURM

### What happened (incident 2026-03-08)

A user started an interactive `srun bash` session on `tensorcore`, then launched `tmux` inside that session. The SLURM job hit its 30-minute time limit, and SLURM attempted to terminate it:

1. SLURM sent `SIGTERM` to the job's process group
2. `tmux` daemonizes itself — it does **not** die on `SIGTERM`
3. SLURM escalated to `SIGKILL` — child processes were in uninterruptible sleep (network/NFS wait)
4. `SIGKILL` also failed → SLURM logged **"Kill task failed"**
5. Node `tensorcore` was automatically set to `DRAINING` state
6. All subsequent jobs were blocked from running on that node

**SLURM log evidence:**

```
Job 338652    bash    snit.san    TIMEOUT    06:44:31 → 07:14:55    tensorcore
Job 338652.0          CANCELLED              07:16:37
Node tensorcore: Reason=Kill task failed [root@2026-03-08T07:16:35]
```

---

### Root Cause

`tmux` is designed to survive terminal disconnects by running as a background server process. When launched inside a SLURM job, this behaviour conflicts with SLURM's job cleanup mechanism — the tmux server and its children cannot be cleanly terminated when the job's time limit is reached.

---

### Correct Pattern

```
✅ CORRECT
Login node (bcm-ai-h02)
└── tmux new -s mysession          # tmux lives on the login node
    └── sbatch myjob.sbatch        # submit to SLURM from inside tmux
        └── compute node           # SLURM job runs and exits cleanly

❌ WRONG
Login node
└── srun --pty bash                # interactive job on a compute node
    └── tmux new -s mysession      # tmux inside a SLURM job ← DANGER
        └── python long_job.py     # process won't die at the time limit
```

### Rules

1. **tmux belongs on the login node only.** Use it to keep your terminal session alive while waiting for jobs.
2. **Never run tmux inside `srun` or `sbatch` jobs.** SLURM cannot kill tmux cleanly.
3. **Set realistic time limits** on interactive `srun` sessions. When the limit is reached without tmux, SLURM cleans up correctly.
4. **For long-running CPU tasks** (downloads, API calls) that don't need a GPU: run them directly on the login node inside tmux — no `srun` needed.

Worked examples of the batch and interactive patterns are sketched at the end of this page.

---

### Admin Recovery

When a node is stuck in `DRAINING` due to this issue:

```bash
# 1. Identify orphan jobs still running on the node
sacct -N tensorcore --starttime=TODAY --format=JobID,JobName,User,State

# 2. Cancel any stuck jobs (use the job IDs found in step 1)
scancel <jobid>

# 3. Verify no processes remain, then resume the node
scontrol update NodeName=tensorcore State=resume

# 4. Confirm the node is back
sinfo
```

---

### Related

- See `slurm-job-submit.md` for correct job submission patterns
- See `tmu.md` in project `.think/` for tmux usage in the context of ML workflows
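
---

### Example: Batch submission from tmux on the login node

To make the correct pattern concrete, here is a minimal sketch of a batch submission driven from tmux on the login node. The script name `myjob.sbatch` comes from the diagram above; the `#SBATCH` options, time limit, and GPU request shown are illustrative assumptions — adjust them to your partition and workload.

```bash
#!/bin/bash
# myjob.sbatch — minimal batch script (illustrative values, adjust as needed)
#SBATCH --job-name=long_job
#SBATCH --time=04:00:00        # realistic wall-clock limit (assumption)
#SBATCH --gres=gpu:1           # one GPU; drop this line for CPU-only work (assumption)
#SBATCH --output=%x_%j.out     # stdout/stderr to <jobname>_<jobid>.out

python long_job.py
```

Submit it from a tmux session on the login node, never from inside a compute-node shell:

```bash
# On the login node (bcm-ai-h02)
tmux new -s mysession     # keep the terminal alive on the login node
sbatch myjob.sbatch       # SLURM runs the job on a compute node and cleans up on exit
squeue -u $USER           # check status; detach with Ctrl-b d and log out safely
```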
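
---

### Example: Time-limited interactive session (no tmux)

When interactive access to a compute node is genuinely needed, request it with an explicit time limit and do not start tmux inside the resulting shell; SLURM can then terminate the session cleanly when the limit is reached. The time and GPU values below are illustrative assumptions.

```bash
# Interactive shell on a compute node with an explicit 30-minute limit (illustrative values)
srun --time=00:30:00 --gres=gpu:1 --pty bash
```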