Common Warnings – HPC Cluster (Mahidol DGX)#

WARNING: Do NOT run tmux on compute nodes via SLURM#

What happened (incident 2026-03-08)#

A user started an interactive srun bash session on the tensorcore node and then launched tmux inside that session. When the job hit its 30-minute time limit, SLURM attempted to terminate it:

  1. SLURM sent SIGTERM to the job process group

  2. tmux daemonizes itself and does not die on SIGTERM

  3. SLURM escalated to SIGKILL, but child processes were in uninterruptible sleep (network/NFS wait)

  4. SIGKILL also failed → SLURM logged “Kill task failed”

  5. Node tensorcore was automatically set to DRAINING state

  6. All subsequent jobs were blocked from running on that node

SLURM log evidence:

Job 338652  bash  snit.san  TIMEOUT   06:44:31 → 07:14:55  tensorcore
Job 338652.0     CANCELLED  07:16:37
Node tensorcore: Reason=Kill task failed [root@2026-03-08T07:16:35]
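
The point at which slurmd gives up and drains the node is governed by site-level SLURM configuration. As a hedged illustration (the timeout value and the hook script path below are placeholders, not this cluster's actual settings), the relevant slurm.conf parameters look like this:

UnkillableStepTimeout=60                                 # seconds slurmd waits after SIGKILL before declaring the step unkillable
UnkillableStepProgram=/usr/local/sbin/notify_admins.sh   # optional hook run when an unkillable step is detected; path is hypothetical

Once that timeout expires with processes still alive, slurmd reports “Kill task failed” and the node is drained, which is exactly what the log above shows.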

Root Cause#

tmux is designed to survive terminal disconnects by running as a background server process. When launched inside a SLURM job, this behaviour conflicts with SLURM's job cleanup mechanism: the tmux server and its children cannot be cleanly terminated when the job time limit is reached.
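
If you suspect one of your own sessions has ended up in this state, the symptoms are visible in the process table: the tmux server is no longer a child of your job shell, and stuck children show state “D” (uninterruptible sleep), which not even SIGKILL can interrupt. A minimal check, with the grep pattern purely illustrative:

# Show parentage and state for tmux and its children on the affected node.
# STAT beginning with "D" = uninterruptible sleep (unkillable until the I/O completes).
ps -eo pid,ppid,stat,user,etime,cmd | grep -E 'tmux|long_job' | grep -v grep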


Correct Pattern#

✅ CORRECT
Login node (bcm-ai-h02)
  └── tmux new -s mysession          # tmux lives on login node
        └── sbatch myjob.sbatch      # submit to SLURM from tmux
              └── compute node       # SLURM job runs and exits cleanly

❌ WRONG
Login node
  └── srun --pty bash                # interactive job on compute node
        └── tmux new -s mysession    # tmux inside SLURM job ← DANGER
              └── python long_job.py # process won't die at time limit

Rules#

  1. tmux belongs on the login node only. Use it to keep your terminal session alive while waiting for jobs.

  2. Never run tmux inside srun or sbatch jobs. SLURM cannot kill tmux cleanly.

  3. Set realistic time limits on interactive srun sessions. As long as tmux is not running inside the job, SLURM cleans up correctly when the limit is reached.

  4. For long-running CPU tasks (downloads, API calls) that don't need a GPU, run directly on the login node inside tmux; no srun is needed (see the sketch after this list).
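
The two rules about time limits and login-node tmux translate into commands along these lines; the time limit, GPU request, and download URL are all illustrative.

# Rule 3: interactive GPU session with an explicit, realistic time limit
srun --pty --time=01:00:00 --gres=gpu:1 bash

# Rule 4: long-running CPU-only task, kept alive by tmux on the login node
tmux new -s download
wget https://example.com/large_dataset.tar.gz    # URL is illustrative
# detach with Ctrl-b d, reattach later with: tmux attach -t download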


Admin Recovery#

When a node is stuck in DRAINING due to this issue:

# 1. Identify orphan jobs still running on the node
sacct -N tensorcore --starttime=TODAY --format=JobID,JobName,User,State

# 2. Cancel any stuck jobs
scancel <JOBID>

# 3. Verify no processes remain, then resume the node
scontrol update NodeName=tensorcore State=resume

# 4. Confirm node is back
sinfo
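
Before resuming, it can also help to confirm the recorded drain reason and to check for leftover tmux processes on the node. The ssh step assumes admins can reach compute nodes directly; adapt as needed.

# Check why SLURM drained the node
sinfo -R --nodes=tensorcore
scontrol show node tensorcore | grep -i reason

# Look for leftover tmux processes (processes whose STAT column starts with "D"
# are still stuck in uninterruptible sleep and will block a clean resume)
ssh tensorcore 'ps -eo pid,stat,user,cmd | grep tmux | grep -v grep'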