HPC-AI User Guide: NCCL DDP Multi-Node Training#
Author: Snit Sanghlao and Qwen
Institution: Mahidol AI Center
Date: 2026-05-19
Cluster: MAI HPC (Slurm), nodes tau, zeta, omega
Table of Contents#
Cluster Overview
Preparation: Smoke Tests
NCCL Environment Variables Reference
DDP Training: General Guidelines
Common Pitfalls
Benchmark Results
Quick Reference: Submit a Job
Executive Summary#
This guide covers Distributed Data Parallel (DDP) training across multiple GPUs and nodes using NVIDIA NCCL on the Mahidol AI Center HPC cluster.
Why NCCL + DDP + InfiniBand?#
| Factor | Single GPU | DDP (Ethernet) | DDP (InfiniBand) |
|---|---|---|---|
| Throughput | - | ~10 Gb/s | ~200 Gb/s (20×) |
| Training time (CIFAR-10, 2 GPUs) | Baseline | 37.7 s | 18.1 s |
| Speedup | 1× | - | ~2.1× over Ethernet |
| Scalability | N/A | Limited by bandwidth | Scales to many nodes |
NCCL (NVIDIA Collective Communications Library) is the standard communication backend for multi-GPU PyTorch training. It provides optimized all_reduce, all_gather, and broadcast operations.
DDP (Distributed Data Parallel) splits training data across GPUs, each holding a full model copy, and synchronizes gradients via NCCL after every backward pass.
InfiniBand provides a low-latency, high-bandwidth interconnect between nodes, which is critical for multi-node DDP where gradient synchronization becomes the bottleneck.
Bottom line: Without InfiniBand, multi-node DDP is bottlenecked by Ethernet bandwidth. With it, the 2-node CIFAR-10 benchmark in Section 6 ran about 2.1× faster.
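To get a feel for why link bandwidth matters, the sketch below estimates a bandwidth-only lower bound for one gradient all-reduce. It assumes a ring all-reduce (each rank moves 2(N-1)/N of the gradient payload), ignores latency and the overlap of communication with the backward pass, and uses a hypothetical model of 25 million fp32 parameters; the 10 Gb/s and 200 Gb/s figures are the ones from the table above.

```python
def ring_allreduce_seconds(num_params, world_size, link_gbit_per_s, bytes_per_param=4):
    """Bandwidth-only lower bound for one ring all-reduce of the gradients."""
    if world_size < 2:
        return 0.0  # nothing to synchronize on a single rank
    payload_bytes = num_params * bytes_per_param
    # In a ring all-reduce each rank sends 2*(N-1)/N of the payload.
    traffic_bytes = 2 * (world_size - 1) / world_size * payload_bytes
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8
    return traffic_bytes / link_bytes_per_s


# Hypothetical model: 25 million fp32 parameters, 2 ranks (one per node).
ethernet = ring_allreduce_seconds(25_000_000, 2, 10)
infiniband = ring_allreduce_seconds(25_000_000, 2, 200)
print(f"Ethernet: {ethernet * 1e3:.0f} ms, InfiniBand: {infiniband * 1e3:.0f} ms per step")
# prints: Ethernet: 80 ms, InfiniBand: 4 ms per step
```

Real step times also include compute and data loading, so the end-to-end speedup is smaller than the raw bandwidth ratio.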
1. Cluster Overview#
Hardware#
| Node | CPU | Cores (HT) | Memory | GPUs | IB Ports (Active) |
|---|---|---|---|---|---|
| tau | x86_64 | 112 × 2 = 224 | ~2 TB | 8× NVIDIA H100 80GB | |
| zeta | x86_64 | 128 × 2 = 256 | ~2 TB | 8× NVIDIA A100 80GB | |
| omega | - | - | - | - | DOWN (maintenance) |
Slurm Configuration#
| Parameter | Value |
|---|---|
| Cluster | |
| Partition | |
| Total nodes | 3 (2 active) |
| Total GPUs | 16 |
| Scheduler | |
| MPI default | |
2. Preparation: Smoke Tests#
Before running production training, always verify connectivity and NCCL health.
2.1 Basic 2-Node Connectivity#
#!/bin/bash
# smoke2node.sh
#SBATCH --job-name=smoke2node
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --time=00:02:00
srun -N 2 bash -c 'echo "Rank=$SLURM_PROCID Host=$(hostname)"'
Expected output: Both ranks print their hostname.
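If you want to check the job output automatically rather than by eye, a small parser is enough. This is a minimal sketch that assumes the output lines have exactly the `Rank=<n> Host=<name>` shape printed by the script above; `hosts_by_rank` and `smoke_ok` are illustrative helper names, not part of any cluster tooling.

```python
import re


def hosts_by_rank(output):
    """Map each rank to the hostname it reported."""
    pairs = re.findall(r"Rank=(\d+) Host=(\S+)", output)
    return {int(rank): host for rank, host in pairs}


def smoke_ok(output, expected_nodes=2):
    """True when ranks 0..N-1 all reported and each ran on a different host."""
    ranks = hosts_by_rank(output)
    all_ranks_present = sorted(ranks) == list(range(expected_nodes))
    distinct_hosts = len(set(ranks.values())) == expected_nodes
    return all_ranks_present and distinct_hosts
```

For example, `smoke_ok(open("slurm-<jobid>.out").read())` would return False if one rank never printed or if both tasks landed on the same node.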
2.2 NCCL Smoke Test (Ethernet)#
#!/bin/bash
# nccl_smoke_2node.sh
#SBATCH --job-name=nccl_smoke2n
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00
module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment
eval "$(conda shell.bash hook)"
conda activate hpc_lab
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=29500
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=bond
srun -N 2 python3 -c "
import os, torch, torch.distributed as dist
rank = int(os.environ['SLURM_PROCID'])
world_size = int(os.environ['SLURM_NTASKS'])
local_rank = int(os.environ.get('SLURM_LOCALID', 0))
master = os.environ['MASTER_ADDR']
port = os.environ['MASTER_PORT']
torch.cuda.set_device(local_rank)
device = torch.device(f'cuda:{local_rank}')
dist.init_process_group('nccl', rank=rank, world_size=world_size,
init_method=f'tcp://{master}:{port}')
x = torch.ones(1, device=device) * rank
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f'Rank {rank}: all_reduce = {x.item()}')
dist.destroy_process_group()
"
Expected output: Both ranks print all_reduce = 1.0 (0 + 1).
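The expected value generalizes to larger jobs: every rank contributes its own rank number, so a SUM all-reduce returns 0 + 1 + ... + (N-1) on every rank. A short sketch (the helper name is illustrative) for computing the value to look for when you scale the smoke test up:

```python
def expected_allreduce_sum(world_size):
    """Value every rank should print when each rank contributes its rank id."""
    return float(sum(range(world_size)))


print(expected_allreduce_sum(2))   # 1.0, the 2-node case above
print(expected_allreduce_sum(16))  # 120.0, all 16 GPUs of the two active nodes
```

Any rank printing a different number means the collective did not include every process.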
2.3 NCCL Smoke Test (InfiniBand)#
#!/bin/bash
# nccl_smoke_2node_ib.sh
#SBATCH --job-name=nccl_ib2n
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00
module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment
eval "$(conda shell.bash hook)"
conda activate hpc_lab
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=29500
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=bond
export NCCL_IB_HCA=mlx5_3,mlx5_6
export NCCL_IB_GID_INDEX=3
export NCCL_DEBUG=INFO
srun -N 2 python3 -c "
import os, torch, torch.distributed as dist
rank = int(os.environ['SLURM_PROCID'])
world_size = int(os.environ['SLURM_NTASKS'])
local_rank = int(os.environ.get('SLURM_LOCALID', 0))
master = os.environ['MASTER_ADDR']
port = os.environ['MASTER_PORT']
torch.cuda.set_device(local_rank)
device = torch.device(f'cuda:{local_rank}')
dist.init_process_group('nccl', rank=rank, world_size=world_size,
init_method=f'tcp://{master}:{port}')
x = torch.ones(1, device=device) * rank
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f'Rank {rank}: all_reduce = {x.item()}')
dist.destroy_process_group()
"
Expected output: NCCL log shows Using network IB, both ranks print all_reduce = 1.0.
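Scanning a long `NCCL_DEBUG=INFO` log by hand is error-prone, so the check can be scripted. The sketch below assumes the log contains lines of the form `... Using network <name>`, as in the expected output above; the helper names are illustrative, and the host/PID prefix in the sample line is made up.

```python
import re


def nccl_networks(log_text):
    """Collect every transport name NCCL reported in the log."""
    return set(re.findall(r"Using network (\w+)", log_text))


def uses_infiniband_only(log_text):
    """True when all ranks reported IB and none reported another transport."""
    return nccl_networks(log_text) == {"IB"}


# Illustrative line only; the prefix before "NCCL INFO" is hypothetical.
sample = "tau:1234:1234 [0] NCCL INFO Using network IB\n"
print(uses_infiniband_only(sample))  # True
```

If the set contains anything besides `IB`, at least one rank is not using the InfiniBand transport.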
2.4 Troubleshooting Smoke Tests#
| Symptom | Cause | Fix |
|---|---|---|
| | Node still draining | Wait for node to reach |
| | NCCL picked down IB port | Set |
| | Memory QoS limit exceeded | Reduce |
| Job stuck in | Orphaned srun step | Contact admin or restart slurmctld |
| | GPU not accessible | Verify |
3. NCCL Environment Variables Reference#
| Variable | Purpose | Recommended Value |
|---|---|---|
| | Enable/disable IB transport | |
| | Bootstrap interface | |
| | Explicit IB HCAs to use | |
| | GID index for RoCE/IB | |
| | Debug verbosity | |
| | Custom network plugin | Leave default (internal) |
| | Disable P2P between GPUs | |
| | Disable shared memory | |
Key Principle: Bootstrap over Ethernet, Transport over IB#
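Always set NCCL_SOCKET_IFNAME=bond so that process discovery (bootstrap) runs over Ethernet, then let NCCL use IB for the actual data transfer. NCCL treats the value as an interface-name prefix, so bond matches bond0. This prevents NCCL from selecting a down IB port during bootstrap.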
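The same settings can be applied from Python when a script is launched outside the job templates. This is a minimal sketch: the values are the ones used in the smoke-test scripts in Section 2, the helper names are illustrative, and the variables must be set before `dist.init_process_group` is called for NCCL to pick them up.

```python
import os


def nccl_env(use_infiniband=True):
    """NCCL settings from this guide's job scripts, as a plain dict."""
    env = {"NCCL_SOCKET_IFNAME": "bond"}  # bootstrap over Ethernet
    if use_infiniband:
        env["NCCL_IB_DISABLE"] = "0"
        env["NCCL_IB_HCA"] = "mlx5_3,mlx5_6"  # HCAs used in the guide's scripts
        env["NCCL_IB_GID_INDEX"] = "3"
    else:
        env["NCCL_IB_DISABLE"] = "1"  # force the Ethernet transport
    return env


def apply_nccl_env(use_infiniband=True, environ=os.environ):
    """Fill in missing variables; values exported by the job script win."""
    for key, value in nccl_env(use_infiniband).items():
        environ.setdefault(key, value)
```

Using `setdefault` keeps anything already exported in the sbatch script, so the Python side only fills gaps.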
4. DDP Training: General Guidelines#
4.1 Single-Node, Multi-GPU#
Use torchrun for single-node DDP:
#!/bin/bash
# ddp_job_1node.sh
#SBATCH --job-name=cifar10_ddp_1n
#SBATCH --partition=defq
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=00:30:00
module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment
eval "$(conda shell.bash hook)"
conda activate hpc_lab
torchrun --nproc_per_node=2 --nnodes=1 --rdzv_endpoint=localhost:29500 train_ddp.py
Python code (train_ddp.py):
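import os
import torch
import torch.distributed as dist

# torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
dist.init_process_group("nccl")
rank = dist.get_rank()
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)  # bind this process to its own GPU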
4.2 Multi-Node DDP#
Use srun with SLURM environment variables:
#!/bin/bash
# ddp_job_2node.sh
#SBATCH --job-name=cifar10_ddp_2n
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=8G
#SBATCH --time=00:30:00
module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment
eval "$(conda shell.bash hook)"
conda activate hpc_lab
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=29500
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=bond
export NCCL_IB_HCA=mlx5_3,mlx5_6
export NCCL_IB_GID_INDEX=3
srun -N 2 python3 train_ddp_2node.py
Python code (train_ddp_2node.py):
import os
import torch
import torch.distributed as dist

# srun starts one task per node; map the Slurm task IDs to DDP ranks
rank = int(os.environ['SLURM_PROCID'])
world_size = int(os.environ['SLURM_NTASKS'])
local_rank = int(os.environ.get('SLURM_LOCALID', 0))
master = os.environ['MASTER_ADDR']
port = os.environ['MASTER_PORT']
torch.cuda.set_device(local_rank)  # bind this process to its GPU before NCCL init
dist.init_process_group('nccl', rank=rank, world_size=world_size,
                        init_method=f'tcp://{master}:{port}')
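Sections 4.1 and 4.2 read rank information from different environment variables depending on the launcher. If you would rather keep a single training script for both, one option is a small resolver like the sketch below; `resolve_ranks` is a hypothetical helper, and the single-process fallback is an assumption added for local debugging.

```python
import os


def resolve_ranks(environ=os.environ):
    """Return (rank, local_rank, world_size) for torchrun, srun, or a plain run."""
    if "RANK" in environ and "WORLD_SIZE" in environ:  # launched by torchrun
        return (int(environ["RANK"]),
                int(environ.get("LOCAL_RANK", 0)),
                int(environ["WORLD_SIZE"]))
    if "SLURM_PROCID" in environ:  # launched by srun
        return (int(environ["SLURM_PROCID"]),
                int(environ.get("SLURM_LOCALID", 0)),
                int(environ["SLURM_NTASKS"]))
    return 0, 0, 1  # assumption: plain `python train.py` means a single process


rank, local_rank, world_size = resolve_ranks()
```

The torchrun variables are checked first because a torchrun launched inside an sbatch job sees both sets, and the Slurm ones then describe the launcher task rather than the worker.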
4.3 Best Practices#
Always run smoke tests first: verify NCCL works before launching expensive training.
Use DistributedSampler: ensures each GPU gets unique data per epoch.
Call train_sampler.set_epoch(epoch): shuffles data differently each epoch.
Save checkpoints on rank 0 only: avoids file write conflicts.
Use pin_memory=True in DataLoader: enables async CPU-to-GPU transfer when tensors are moved with non_blocking=True.
Set num_workers appropriately: 4 per GPU is a good starting point.
Use model.module.state_dict(): unwraps the DDP wrapper before saving.
Call dist.destroy_process_group(): ensures clean shutdown.
Use NCCL_DEBUG=INFO only for debugging: adds overhead in production.
Monitor with nvidia-smi: verify all GPUs are utilized.
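To see why the sampler and the per-epoch call matter, the pure-Python sketch below imitates the sharding idea: every rank builds the same epoch-seeded permutation and then takes every N-th index. It is an illustration of the technique only, not PyTorch's implementation, and `shard_indices` is a hypothetical name.

```python
import random


def shard_indices(dataset_len, rank, world_size, epoch, seed=0):
    """Indices one rank would process in one epoch (illustrative only)."""
    indices = list(range(dataset_len))
    # Same seed on every rank gives the same permutation everywhere;
    # adding the epoch changes the order from one epoch to the next.
    random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating the first indices so every rank gets equally many samples.
    per_rank = -(-dataset_len // world_size)  # ceiling division
    total = per_rank * world_size
    indices += indices[: total - dataset_len]
    return indices[rank:total:world_size]


shard0 = shard_indices(10, rank=0, world_size=2, epoch=0)
shard1 = shard_indices(10, rank=1, world_size=2, epoch=0)
assert set(shard0).isdisjoint(shard1)  # no sample is seen twice in an epoch
```

Without the epoch term in the seed, every epoch would replay the same order on each rank, which is the situation that forgetting `set_epoch` produces.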
4.4 Performance Checklist#
| Check | Command |
|---|---|
| All GPUs active | |
| IB link up | |
| NCCL using IB | |
| No Ethernet fallback | Log should not show |
| GPU utilization | |
| Memory balanced | All GPUs using similar VRAM |
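The utilization and memory-balance checks can be scripted against `nvidia-smi`'s CSV query mode. The sketch below parses the output of `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits`; the helper names and the 20% tolerance are assumptions chosen for illustration, not cluster policy.

```python
import csv
import io


def parse_gpu_stats(csv_text):
    """Parse 'index, utilization %, memory MiB' rows into dicts."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [{"index": int(r[0]), "util": int(r[1]), "mem_mib": int(r[2])}
            for r in rows if r]


def idle_gpus(stats, min_util=5):
    """Indices of GPUs whose utilization is below min_util percent."""
    return [s["index"] for s in stats if s["util"] < min_util]


def memory_balanced(stats, tolerance=0.2):
    """True when the spread in used memory is within tolerance of the maximum."""
    used = [s["mem_mib"] for s in stats]
    if not used or max(used) == 0:
        return False
    return (max(used) - min(used)) / max(used) <= tolerance
```

On a compute node the text could come from `subprocess.run([...], capture_output=True, text=True).stdout`; a single sample is only a snapshot, so poll a few times during training.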
5. Common Pitfalls#
5.1 NCCL Picks Down IB Port#
Symptom: No route to host during NCCL init.
Cause: NCCL auto-detects all IB HCAs, including down ones.
Fix: Explicitly restrict NCCL to active ports:
export NCCL_IB_HCA=mlx5_3,mlx5_6
Verify active ports first with ibstat | grep -A5 "Port 1" and confirm State: Active.
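Building the `NCCL_IB_HCA` value by hand invites typos, so it can be derived from `ibstat` output instead. The sketch below assumes the usual `ibstat` layout, where each adapter starts with a `CA '<name>'` line and each port reports a `State: ...` line; verify this against your node's actual output before relying on it. `active_hcas` is an illustrative helper name.

```python
import re


def active_hcas(ibstat_text):
    """Names of adapters that have at least one port in State: Active."""
    active, current = [], None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        match = re.match(r"CA '([^']+)'", line)
        if match:
            current = match.group(1)  # start of a new adapter section
        elif line == "State: Active" and current and current not in active:
            active.append(current)
    return active


def nccl_ib_hca_value(ibstat_text):
    """Comma-separated list suitable for NCCL_IB_HCA."""
    return ",".join(active_hcas(ibstat_text))
```

The exact match on `State: Active` deliberately skips the `Physical state:` line, which describes the link layer rather than the port state.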
5.2 Node Still Draining#
Symptom: Socket timed out on send/recv operation.
Cause: Node was resumed from drain but slurmd isn't ready.
Fix: Wait 1 to 2 minutes, then resubmit. Check node state with:
sinfo -o "%N %t"
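A submission wrapper can read that output and hold off until the nodes are usable. The sketch below assumes the two-column output of the command above, with a `NODELIST STATE` header, and treats the compact states `idle`, `mix`, and `alloc` as ready; that choice of states and the helper name are assumptions for illustration.

```python
def nodes_not_ready(sinfo_text, ready_states=("idle", "mix", "alloc")):
    """Map node (or node list) to its state for every entry that is not ready."""
    not_ready = {}
    for line in sinfo_text.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[0] == "NODELIST":
            continue  # skip the header and anything unexpected
        node, state = parts
        if state not in ready_states:
            not_ready[node] = state  # e.g. drng, drain, down*
    return not_ready
```

sinfo groups nodes that share a state, so a key may be a comma-separated list such as `tau,zeta` rather than a single node name.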
5.3 QoS Memory Limit#
Symptom: Job pending with QOSMaxMemoryPerUser.
Cause: Your total memory across all jobs exceeds the QoS limit.
Fix: Cancel other jobs or reduce --mem request. Check current usage with:
squeue -u $USER -o "%A %j %m %t"
5.4 Stuck COMPLETING State#
Symptom: Job shows CG indefinitely.
Cause: Orphaned srun step that failed to launch.
Fix: Contact the cluster admin to force-clear the job. In the meantime you can cancel with scancel <jobid>; if that hangs, the admin needs to run scontrol requeue <jobid> or restart slurmctld.
6. Benchmark Results#
| Configuration | GPUs | Transport | Time (5 epochs) | Test Accuracy |
|---|---|---|---|---|
| 2-node, 2 GPU | tau + zeta | Ethernet (bond0) | 37.7 s | 71.0% |
| 2-node, 2 GPU | tau + zeta | InfiniBand | 18.1 s | 71.9% |
Speedup: ~2.1× with InfiniBand over Ethernet for 2-node DDP.
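The speedup figure follows directly from the two wall-clock times in the table. A short worked calculation, using only those measured values:

```python
ethernet_s = 37.7    # 5 epochs over Ethernet (bond0)
infiniband_s = 18.1  # 5 epochs over InfiniBand

speedup = ethernet_s / infiniband_s
time_saved = 1 - infiniband_s / ethernet_s

print(f"speedup {speedup:.2f}x, wall time reduced by {time_saved:.0%}")
# prints: speedup 2.08x, wall time reduced by 52%
```

The gain is well below the 20× ratio in link bandwidth, which is expected because compute and data loading take the same time on either transport.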
7. Quick Reference: Submit a Job#
# 1. Smoke test: basic 2-node connectivity
sbatch smoke2node.sh
# 2. Smoke test: NCCL over Ethernet
sbatch nccl_smoke_2node.sh
# 3. Smoke test: NCCL over InfiniBand
sbatch nccl_smoke_2node_ib.sh
# 4. DDP training (single node, 2 GPUs)
sbatch ddp_job_1node.sh
# 5. DDP training (2 nodes, InfiniBand)
sbatch ddp_job_2node.sh
# Monitor jobs
squeue -u $USER -o "%A %N %j %t %R"
# Check output (sbatch writes slurm-<jobid>.out unless the script sets --output)
cat slurm-*.out
Mahidol AI Center - HPC-AI User Guide v1.0