HPC-AI User Guide: NCCL DDP Multi-Node Training#

Author: Snit Sanghlao and Qwen
Institution: Mahidol AI Center
Date: 2026-05-19
Cluster: MAI HPC (Slurm) — tau, zeta, omega

Table of Contents#

Cluster Overview
Preparation: Smoke Tests
NCCL Environment Variables Reference
DDP Training: General Guidelines
Common Pitfalls
Benchmark Results
Quick Reference: Submit a Job

Executive Summary#

This guide covers Distributed Data Parallel (DDP) training across multiple GPUs and nodes using NVIDIA NCCL on the Mahidol AI Center HPC cluster.

Why NCCL + DDP + InfiniBand?#

Factor	Single GPU	DDP (Ethernet)	DDP (InfiniBand)
Link bandwidth	—	~10 Gb/s	~200 Gb/s (20×)
Training time (CIFAR-10, 2 GPUs)	Baseline	37.7s	18.1s
Speedup	1×	—	~2.1× over Ethernet
Scalability	N/A	Limited by bandwidth	Scales to many nodes

NCCL (NVIDIA Collective Communications Library) is the standard communication backend for multi-GPU PyTorch training. It provides optimized all_reduce, all_gather, and broadcast operations.
DDP (Distributed Data Parallel) splits training data across GPUs, each holding a full model copy, and synchronizes gradients via NCCL during every backward pass, overlapping communication with computation.
InfiniBand provides low-latency, high-bandwidth interconnect between nodes — critical for multi-node DDP where gradient synchronization becomes the bottleneck.

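Bottom line: Without InfiniBand, multi-node DDP is bottlenecked by Ethernet bandwidth. With it, gradient synchronization is far less of a bottleneck: in the 2-node benchmark in Section 6, training ran about 2.1× faster than over Ethernet.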
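A back-of-the-envelope estimate shows why link speed matters. The sketch below assumes a ring all-reduce, where each rank sends roughly 2(N-1)/N times the gradient size per synchronization, and it ignores latency, protocol overhead, and compute/communication overlap. The 25-million-parameter fp32 model is a hypothetical example, not the benchmarked model; the 10 and 200 Gb/s link speeds come from the table above.

```python
def allreduce_seconds(num_params: int, world_size: int, link_gbps: float,
                      bytes_per_param: int = 4) -> float:
    """Idealized time for one ring all-reduce of the full gradient."""
    grad_bytes = num_params * bytes_per_param
    # In a ring all-reduce each rank sends 2 * (N - 1) / N of the buffer.
    bytes_on_wire = 2 * (world_size - 1) / world_size * grad_bytes
    link_bytes_per_s = link_gbps * 1e9 / 8
    return bytes_on_wire / link_bytes_per_s


params = 25_000_000  # hypothetical model size (assumption)
for name, gbps in [("Ethernet", 10.0), ("InfiniBand", 200.0)]:
    ms = allreduce_seconds(params, world_size=2, link_gbps=gbps) * 1e3
    print(f"{name:>10}: {ms:.1f} ms per gradient sync")
```

Under these assumptions the estimate is about 80 ms per synchronization on 10 Gb/s Ethernet versus about 4 ms on 200 Gb/s InfiniBand. Real numbers will differ, since NCCL may choose other algorithms and DDP overlaps communication with the backward pass.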

1. Cluster Overview#

Hardware#

Node	CPU	Cores (HT)	Memory	GPUs	IB Ports (Active)
tau	x86_64	112 × 2 = 224	~2 TB	8× NVIDIA H100 80GB	`mlx5_3`, `mlx5_4`, `mlx5_6`, `mlx5_11` (200 Gb/s)
zeta	x86_64	128 × 2 = 256	~2 TB	8× NVIDIA A100 80GB	`mlx5_0`–`mlx5_3`, `mlx5_6`–`mlx5_9` (200 Gb/s)
omega	—	—	—	—	DOWN (maintenance)

Slurm Configuration#

Parameter	Value
Cluster	`slurm`
Partition	`defq` (default)
Total nodes	3 (2 active)
Total GPUs	16
Scheduler	`select/cons_tres`
MPI default	`pmix`

2. Preparation: Smoke Tests#

Before running production training, always verify connectivity and NCCL health.

2.1 Basic 2-Node Connectivity#

#!/bin/bash
# smoke2node.sh
#SBATCH --job-name=smoke2node
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --time=00:02:00

srun -N 2 bash -c 'echo "Rank=$SLURM_PROCID  Host=$(hostname)"'

Expected output: Two lines, one per rank (Rank=0 and Rank=1), each showing a different hostname.

2.2 NCCL Smoke Test (Ethernet)#

#!/bin/bash
# nccl_smoke_2node.sh
#SBATCH --job-name=nccl_smoke2n
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00

module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment
eval "$(conda shell.bash hook)"
conda activate hpc_lab

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=29500
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=bond

srun -N 2 python3 -c "
import os, torch, torch.distributed as dist
rank       = int(os.environ['SLURM_PROCID'])
world_size = int(os.environ['SLURM_NTASKS'])
local_rank = int(os.environ.get('SLURM_LOCALID', 0))
master     = os.environ['MASTER_ADDR']
port       = os.environ['MASTER_PORT']
torch.cuda.set_device(local_rank)
device = torch.device(f'cuda:{local_rank}')
dist.init_process_group('nccl', rank=rank, world_size=world_size,
                        init_method=f'tcp://{master}:{port}')
x = torch.ones(1, device=device) * rank
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f'Rank {rank}: all_reduce = {x.item()}')
dist.destroy_process_group()
"

Expected output: Both ranks print all_reduce = 1.0 (0 + 1).
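The expected value generalizes to larger jobs. Because each rank fills its tensor with its own rank ID, a SUM all_reduce should return 0 + 1 + ... + (N-1) on every rank. A minimal sketch for computing the value to look for (the function name is illustrative only):

```python
def expected_allreduce_sum(world_size: int) -> float:
    """Sum of rank IDs 0..world_size-1, the value every rank should print."""
    return float(sum(range(world_size)))


print(expected_allreduce_sum(2))   # the 2-task smoke test above
print(expected_allreduce_sum(16))  # a hypothetical 16-task run
```

If any rank prints a different value, or some ranks never print at all, the collective did not complete across all processes.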

2.3 NCCL Smoke Test (InfiniBand)#

#!/bin/bash
# nccl_smoke_2node_ib.sh
#SBATCH --job-name=nccl_ib2n
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00

module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment
eval "$(conda shell.bash hook)"
conda activate hpc_lab

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=29500
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=bond
export NCCL_IB_HCA=mlx5_3,mlx5_6
export NCCL_IB_GID_INDEX=3
export NCCL_DEBUG=INFO

srun -N 2 python3 -c "
import os, torch, torch.distributed as dist
rank       = int(os.environ['SLURM_PROCID'])
world_size = int(os.environ['SLURM_NTASKS'])
local_rank = int(os.environ.get('SLURM_LOCALID', 0))
master     = os.environ['MASTER_ADDR']
port       = os.environ['MASTER_PORT']
torch.cuda.set_device(local_rank)
device = torch.device(f'cuda:{local_rank}')
dist.init_process_group('nccl', rank=rank, world_size=world_size,
                        init_method=f'tcp://{master}:{port}')
x = torch.ones(1, device=device) * rank
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f'Rank {rank}: all_reduce = {x.item()}')
dist.destroy_process_group()
"

Expected output: NCCL log shows Using network IB, both ranks print all_reduce = 1.0.

2.4 Troubleshooting Smoke Tests#

Symptom	Cause	Fix
`Socket timed out`	Node still draining	Wait for node to reach `IDLE` state
`No route to host`	NCCL picked down IB port	Set `NCCL_IB_HCA` to active ports only
`QOSMaxMemoryPerUser`	Memory QoS limit exceeded	Reduce `--mem` or wait for lower usage
Job stuck in `CG` (Completing)	Orphaned srun step	Contact the admin, who may need to restart slurmctld
`NCCL error: unhandled cuda error`	GPU not accessible	Verify `--gres=gpu:N` matches the number of GPUs the job uses on each node

3. NCCL Environment Variables Reference#

Variable	Purpose	Recommended Value
`NCCL_IB_DISABLE`	Enable/disable IB transport	`0` (enable) for IB, `1` for Ethernet-only
`NCCL_SOCKET_IFNAME`	Bootstrap interface	`bond` — always use Ethernet for bootstrap
`NCCL_IB_HCA`	Explicit IB HCAs to use	`mlx5_3,mlx5_6` (active ports on both tau and zeta)
`NCCL_IB_GID_INDEX`	GID index for RoCE/IB	`3`
`NCCL_DEBUG`	Debug verbosity	`INFO` for troubleshooting, unset for production
`NCCL_NET_PLUGIN`	Custom network plugin	Leave default (internal)
`NCCL_P2P_DISABLE`	Disable P2P between GPUs	`0` (enable P2P for same-node GPUs)
`NCCL_SHM_DISABLE`	Disable shared memory	`0` (enable for same-node communication)

Key Principle: Bootstrap over Ethernet, Transport over IB#

Always set NCCL_SOCKET_IFNAME=bond so process discovery uses Ethernet, then let NCCL use IB for actual data transfer. This prevents NCCL from selecting a down IB port during bootstrap.
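The same principle can be applied from Python when exporting variables in the job script is inconvenient. The following is a minimal sketch, not part of the original scripts: the helper names are hypothetical, the values are the ones recommended in this guide, and the variables must be set before `dist.init_process_group` is called.

```python
import os


def nccl_env(use_ib: bool) -> dict:
    """Return this guide's NCCL settings for an IB or Ethernet-only run."""
    env = {
        "NCCL_SOCKET_IFNAME": "bond",  # bootstrap always over Ethernet
        "NCCL_IB_DISABLE": "0" if use_ib else "1",
    }
    if use_ib:
        env["NCCL_IB_HCA"] = "mlx5_3,mlx5_6"  # active on both tau and zeta
        env["NCCL_IB_GID_INDEX"] = "3"
    return env


def apply_nccl_env(use_ib: bool) -> None:
    """Export the settings, keeping anything the job script already set."""
    for key, value in nccl_env(use_ib).items():
        os.environ.setdefault(key, value)
```

Using `setdefault` means values exported in the Slurm script take precedence, so the helper only fills in what is missing.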

4. DDP Training: General Guidelines#

4.1 Single-Node, Multi-GPU#

Use torchrun for single-node DDP:

#!/bin/bash
# ddp_job_1node.sh
#SBATCH --job-name=cifar10_ddp_1n
#SBATCH --partition=defq
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=00:30:00

module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment
eval "$(conda shell.bash hook)"
conda activate hpc_lab

torchrun --nproc_per_node=2 --nnodes=1 --rdzv_endpoint=localhost:29500 train_ddp.py

Python code (train_ddp.py):

import os
import torch
import torch.distributed as dist

dist.init_process_group("nccl")  # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
rank       = dist.get_rank()
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)  # bind this process to its own GPU

4.2 Multi-Node DDP#

Use srun with SLURM environment variables:

#!/bin/bash
# ddp_job_2node.sh
#SBATCH --job-name=cifar10_ddp_2n
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=8G
#SBATCH --time=00:30:00

module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment
eval "$(conda shell.bash hook)"
conda activate hpc_lab

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=29500
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=bond
export NCCL_IB_HCA=mlx5_3,mlx5_6
export NCCL_IB_GID_INDEX=3

srun -N 2 python3 train_ddp_2node.py

Python code (train_ddp_2node.py):

import os
import torch
import torch.distributed as dist

rank       = int(os.environ['SLURM_PROCID'])
world_size = int(os.environ['SLURM_NTASKS'])
local_rank = int(os.environ.get('SLURM_LOCALID', 0))
master     = os.environ['MASTER_ADDR']
port       = os.environ['MASTER_PORT']

torch.cuda.set_device(local_rank)  # bind this process to its own GPU before NCCL init
dist.init_process_group('nccl', rank=rank, world_size=world_size,
                        init_method=f'tcp://{master}:{port}')
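The two launch styles expose the same information under different variable names: torchrun sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`, while srun sets `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_LOCALID`. One way to let a single training script work under both is a small helper like the sketch below. `DistEnv` and `read_dist_env` are hypothetical names, and the helper assumes `MASTER_ADDR` and `MASTER_PORT` are present in either case.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int
    world_size: int
    local_rank: int
    init_method: str


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Derive process-group settings from torchrun or srun variables."""
    if "RANK" in env:  # launched by torchrun
        rank = int(env["RANK"])
        world_size = int(env["WORLD_SIZE"])
        local_rank = int(env["LOCAL_RANK"])
    else:  # launched by srun
        rank = int(env["SLURM_PROCID"])
        world_size = int(env["SLURM_NTASKS"])
        local_rank = int(env.get("SLURM_LOCALID", 0))
    init_method = f"tcp://{env['MASTER_ADDR']}:{env['MASTER_PORT']}"
    return DistEnv(rank, world_size, local_rank, init_method)


# Example with a fake srun environment (values are illustrative).
fake = {"SLURM_PROCID": "1", "SLURM_NTASKS": "2", "SLURM_LOCALID": "0",
        "MASTER_ADDR": "tau", "MASTER_PORT": "29500"}
print(read_dist_env(fake))
```

The returned fields map directly onto the `rank`, `world_size`, and `init_method` arguments of `dist.init_process_group`.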

4.3 Best Practices#

Always run smoke tests first — verify NCCL works before launching expensive training
Use DistributedSampler — ensures each GPU gets unique data per epoch
Call train_sampler.set_epoch(epoch) — shuffles data differently each epoch
Save checkpoints on rank 0 only — avoids file write conflicts
Use pin_memory=True in DataLoader — enables async CPU→GPU transfer
Set num_workers appropriately — 4 per GPU is a good starting point
Use model.module.state_dict() — unwrap DDP wrapper before saving
Call dist.destroy_process_group() — ensures clean shutdown
Use NCCL_DEBUG=INFO only for debugging — adds overhead in production
Monitor with nvidia-smi — verify all GPUs are utilized
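Several of the practices above fit together in a few lines of PyTorch. The sketch below is an illustration under stated assumptions, not the guide's actual training script: the function names, batch size, and checkpoint path are placeholders, and the process group is assumed to be initialized elsewhere.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def make_loader(dataset, rank, world_size, batch_size=128, num_workers=4):
    """Shard the dataset so each rank sees a distinct subset every epoch."""
    sampler = DistributedSampler(dataset, num_replicas=world_size,
                                 rank=rank, shuffle=True)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                        num_workers=num_workers, pin_memory=True)
    return loader, sampler


def save_checkpoint(model, path, rank):
    """Write weights from rank 0 only, unwrapping the DDP wrapper if present."""
    if rank != 0:
        return
    inner = model.module if isinstance(model, DDP) else model
    torch.save(inner.state_dict(), path)


def train_one_epoch(model, loader, sampler, optimizer, loss_fn, device, epoch):
    sampler.set_epoch(epoch)  # changes the shuffle order each epoch
    for inputs, targets in loader:
        # non_blocking=True lets pinned-memory copies overlap with compute.
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()  # DDP syncs gradients here
        optimizer.step()
```

A typical loop would call `train_one_epoch` for each epoch, then `save_checkpoint(model, "ckpt.pt", rank)`, and finish with `dist.destroy_process_group()`.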

4.4 Performance Checklist#

Check	Command
All GPUs active	`nvidia-smi` on each node
IB link up	`ibstat` — look for `State: Active`
NCCL using IB	`NCCL_DEBUG=INFO` log shows `Using network IB`
No Ethernet fallback	Log should not show `Using network Socket`
GPU utilization	`nvidia-smi dmon` — should show >80% utilization
Memory balanced	All GPUs using similar VRAM
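The two log checks in the table can be automated. The sketch below scans the text of a job output file captured with `NCCL_DEBUG=INFO` for the two marker strings named above. It relies only on those substrings and makes no other assumption about the log format; the function name is illustrative.

```python
def nccl_transport(log_text: str) -> str:
    """Classify a NCCL_DEBUG=INFO log as 'IB', 'Socket', or 'unknown'."""
    if "Using network Socket" in log_text:
        return "Socket"  # Ethernet fallback, reported even if IB also appears
    if "Using network IB" in log_text:
        return "IB"
    return "unknown"


print(nccl_transport("NCCL INFO Using network IB"))
print(nccl_transport("NCCL INFO Using network Socket"))
```

In practice the argument would be the contents of the job's output file. A result of `unknown` usually means `NCCL_DEBUG=INFO` was not set for that run.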

5. Common Pitfalls#

5.1 NCCL Picks Down IB Port#

Symptom: No route to host during NCCL init.

Cause: NCCL auto-detects all IB HCAs, including down ones.

Fix: Explicitly restrict NCCL to active ports:

export NCCL_IB_HCA=mlx5_3,mlx5_6

Verify active ports first with ibstat | grep -A5 "Port 1" and confirm State: Active.

5.2 Node Still Draining#

Symptom: Socket timed out on send/recv operation.

Cause: Node was resumed from drain but slurmd isn’t ready.

Fix: Wait 1–2 minutes, then resubmit. Check node state with:

sinfo -o "%N %t"
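To avoid resubmitting too early, the output of that command can be checked in a script. The sketch below parses text in the `%N %t` layout and reports whether every requested node is `idle`. The sample text is illustrative rather than captured from the cluster, and the parser assumes plain comma-separated node names (it does not expand bracketed ranges such as `node[01-04]`).

```python
def nodes_ready(sinfo_output: str, wanted: list) -> bool:
    """Return True if every node in `wanted` is listed with state 'idle'."""
    states = {}
    for line in sinfo_output.strip().splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[0] == "NODELIST":
            continue  # skip the header and malformed lines
        names, state = parts
        for name in names.split(","):
            states[name] = state
    return all(states.get(node) == "idle" for node in wanted)


sample = """NODELIST STATE
tau,zeta idle
omega down
"""  # illustrative sample, not real cluster output
print(nodes_ready(sample, ["tau", "zeta"]))
print(nodes_ready(sample, ["tau", "omega"]))
```

Nodes that are partly allocated to other jobs report a different state, so a stricter or looser check may be appropriate depending on how busy the cluster is.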

5.3 QoS Memory Limit#

Symptom: Job pending with QOSMaxMemoryPerUser.

Cause: Your total memory across all jobs exceeds the QoS limit.

Fix: Cancel other jobs or reduce --mem request. Check current usage with:

squeue -u $USER -o "%A %j %m %t"

5.4 Stuck COMPLETING State#

Symptom: Job shows CG indefinitely.

Cause: Orphaned srun step that failed to launch.

Fix: Contact the cluster admin to force-clear the job. In the meantime you can cancel with scancel <jobid> — if that hangs, the admin needs to run scontrol requeue <jobid> or restart slurmctld.

6. Benchmark Results#

Configuration	Nodes	Transport	Time (5 epochs)	Test Accuracy
2-node, 2 GPU	tau + zeta	Ethernet (bond0)	37.7s	71.0%
2-node, 2 GPU	tau + zeta	InfiniBand	18.1s	71.9%

Speedup: ~2.1× with InfiniBand over Ethernet for 2-node DDP.
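The speedup figure follows directly from the two timings in the table. A short check, using only those numbers (the function name is illustrative):

```python
def speedup(baseline_s: float, new_s: float) -> float:
    """How many times faster the new run is than the baseline."""
    return baseline_s / new_s


ethernet_s, infiniband_s = 37.7, 18.1  # 5-epoch times from the table
ratio = speedup(ethernet_s, infiniband_s)
saved = 1 - infiniband_s / ethernet_s
print(f"speedup: {ratio:.2f}x, wall-clock time saved: {saved:.0%}")
```

This gives roughly 2.08×, which the guide rounds to 2.1×, or about 52% less wall-clock time for the same five epochs.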

7. Quick Reference: Submit a Job#

# 1. Smoke test: basic 2-node connectivity
sbatch smoke2node.sh

# 2. Smoke test: NCCL over Ethernet
sbatch nccl_smoke_2node.sh

# 3. Smoke test: NCCL over InfiniBand
sbatch nccl_smoke_2node_ib.sh

# 4. DDP training (single node, 2 GPUs)
sbatch ddp_job_1node.sh

# 5. DDP training (2 nodes, InfiniBand)
sbatch ddp_job_2node.sh

# Monitor jobs
squeue -u $USER -o "%A %N %j %t %R"

# Check output (the scripts above set no --output, so Slurm writes slurm-<jobid>.out)
cat slurm-*.out

Mahidol AI Center — HPC-AI User Guide v1.0