# Common Warnings – HPC Cluster (Mahidol DGX)
## WARNING: Do NOT run tmux on compute nodes via SLURM
### What happened (incident 2026-03-08)
A user started an interactive `srun` bash session on `tensorcore`, then launched `tmux` inside that session. The SLURM job hit its 30-minute time limit, and SLURM attempted to terminate the job:
1. SLURM sent `SIGTERM` to the job process group.
2. `tmux` daemonizes itself; it does not die on `SIGTERM`.
3. SLURM escalated to `SIGKILL`, but child processes were in uninterruptible sleep (network/NFS wait).
4. `SIGKILL` also failed, and SLURM logged "Kill task failed".
5. Node `tensorcore` was automatically set to the `DRAINING` state.
6. All subsequent jobs were blocked from running on that node.
SLURM log evidence:
```text
Job 338652    bash   snit.san   TIMEOUT    06:44:31 - 07:14:55   tensorcore
Job 338652.0         CANCELLED  07:16:37
Node tensorcore: Reason=Kill task failed  [root@2026-03-08T07:16:35]
```
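The same evidence can be pulled later from SLURM accounting and node state; a sketch using the job ID and node name from the incident above:

```bash
# Job history: shows the TIMEOUT on the batch job and the CANCELLED job step.
sacct -j 338652 --format=JobID,JobName,User,State,Start,End,NodeList

# Node state: a failed cleanup appears as Reason=Kill task failed with a DRAIN state.
scontrol show node tensorcore | grep -E 'State=|Reason='
```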
### Root Cause
`tmux` is designed to survive terminal disconnects by running as a background server process. When launched inside a SLURM job, this behaviour conflicts with SLURM's job cleanup mechanism: the tmux server and its children cannot be cleanly terminated when the job time limit is reached.
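If you suspect a tmux server is lingering on a compute node, one quick check (a sketch; it assumes you can open a shell on that node) is to look at the tmux server's parent process:

```bash
# List tmux processes with parent PID and session ID. A daemonized tmux server
# re-parents to PID 1 and detaches from the shell that started it, which is why
# the job-level SIGTERM described above does not bring it down.
ps -eo pid,ppid,sess,user,cmd | grep '[t]mux'
```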
### Correct Pattern
✅ CORRECT

```text
Login node (bcm-ai-h02)
└── tmux new -s mysession        # tmux lives on the login node
    └── sbatch myjob.sbatch      # submit to SLURM from inside tmux
        └── compute node         # SLURM job runs and exits cleanly
```

❌ WRONG

```text
Login node
└── srun --pty bash              # interactive job on a compute node
    └── tmux new -s mysession    # DANGER: tmux inside a SLURM job
        └── python long_job.py   # process won't die at the time limit
```
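As a concrete sketch of the correct pattern (the session name, script name, time limit, and resource request below are illustrative, not site defaults):

```bash
# On the login node (bcm-ai-h02), start or attach a tmux session first:
#   tmux new -s mysession
# Then, inside that session, write a batch script and submit it with sbatch.

cat > myjob.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=long_job
#SBATCH --time=04:00:00            # realistic wall-clock limit
#SBATCH --gres=gpu:1               # illustrative; request only what the job needs
#SBATCH --output=long_job_%j.log

python long_job.py
EOF

sbatch myjob.sbatch                # returns immediately; the job runs on a compute node
squeue -u "$USER"                  # watch the queue; detach with Ctrl-b d, reattach with: tmux attach -t mysession
```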
### Rules
- `tmux` belongs on the login node only. Use it to keep your terminal session alive while waiting for jobs.
- Never run `tmux` inside `srun` or `sbatch` jobs. SLURM cannot kill tmux cleanly.
- Set realistic time limits on interactive `srun` sessions (see the example after this list). When the limit is reached without tmux, SLURM cleans up correctly.
- For long-running CPU tasks (downloads, API calls) that don't need a GPU: run directly on the login node inside tmux; no `srun` needed.
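For interactive work, requesting an explicit, realistic time limit looks like this (the CPU count is illustrative):

```bash
# Interactive shell on a compute node with a 30-minute limit. Without a tmux
# server running inside it, SLURM can terminate this session cleanly at the limit.
srun --time=00:30:00 --cpus-per-task=4 --pty bash
```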
### Admin Recovery
When a node is stuck in the `DRAINING` state due to this issue:
```bash
# 1. Identify orphan jobs still running on the node
sacct -N tensorcore --starttime=TODAY --format=JobID,JobName,User,State

# 2. Cancel any stuck jobs
scancel <JOBID>

# 3. Verify no processes remain, then resume the node
scontrol update NodeName=tensorcore State=resume

# 4. Confirm node is back
sinfo
```
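The "verify no processes remain" part of step 3 can be done, for example, by checking for leftover job processes on the node before resuming it (a sketch; it assumes admins can open a shell on compute nodes):

```bash
# Leftover job processes (e.g. orphaned tmux servers or python workers) show up
# here; the bracket trick keeps grep from matching its own process.
ssh tensorcore "ps -eo pid,user,stat,etime,cmd | grep -E '[t]mux|[p]ython'"
```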