HPC-AI User Guide: NCCL DDP Multi-Node Training#

Author: Snit Sanghlao and Qwen
Institution: Mahidol AI Center
Date: 2026-05-19
Cluster: MAI HPC (Slurm): tau, zeta, omega


Table of Contents#

  1. Cluster Overview

  2. Preparation: Smoke Tests

  3. NCCL Environment Variables Reference

  4. DDP Training: General Guidelines

  5. Common Pitfalls

  6. Benchmark Results

  7. Quick Reference: Submit a Job


Executive Summary#

This guide covers Distributed Data Parallel (DDP) training across multiple GPUs and nodes using NVIDIA NCCL on the Mahidol AI Center HPC cluster.

Why NCCL + DDP + InfiniBand?#

| Factor | Single GPU | DDP (Ethernet) | DDP (InfiniBand) |
|---|---|---|---|
| Throughput | - | ~10 Gb/s | ~200 Gb/s (20×) |
| Training time (CIFAR-10, 2 GPUs) | Baseline | 37.7 s | 18.1 s |
| Speedup | 1× | - | ~2.1× over Ethernet |
| Scalability | N/A | Limited by bandwidth | Scales to many nodes |

  • NCCL (NVIDIA Collective Communications Library) is the standard communication backend for multi-GPU PyTorch training. It provides optimized all_reduce, all_gather, and broadcast operations.

  • DDP (Distributed Data Parallel) splits training data across GPUs, each holding a full model copy, and synchronizes gradients via NCCL after every backward pass.

  • InfiniBand provides a low-latency, high-bandwidth interconnect between nodes, which is critical for multi-node DDP where gradient synchronization becomes the bottleneck.

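Bottom line: Without InfiniBand, multi-node DDP is bottlenecked by Ethernet bandwidth. With it, gradient synchronization is much cheaper: the 2-node CIFAR-10 benchmark in Section 6 ran about 2.1× faster over InfiniBand than over Ethernet.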
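To make the gradient synchronization step concrete, here is a toy simulation in plain Python, with no PyTorch or NCCL involved. The one-weight model, the data points, and the helper names `local_gradient` and `all_reduce_mean` are invented for illustration; real DDP performs the same averaging on GPU tensors through NCCL.

```python
# Toy simulation of DDP gradient averaging (illustrative only).
# Model: y_hat = w * x, loss = mean((w * x - y) ** 2).

def local_gradient(w, shard):
    """dLoss/dw for mean squared error over one rank's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for all_reduce(SUM) followed by division by world size."""
    total = sum(values)
    return [total / len(values)] * len(values)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
world_size = 2
w = 0.5

# Each rank holds a disjoint, equally sized shard of the data.
shards = [data[rank::world_size] for rank in range(world_size)]
local_grads = [local_gradient(w, shard) for shard in shards]
synced = all_reduce_mean(local_grads)

full_batch_grad = local_gradient(w, data)
print("per-rank gradients before sync:", local_grads)
print("gradient on every rank after sync:", synced[0])
print("full-batch gradient:", full_batch_grad)
```

With equally sized shards, the averaged gradient matches the full-batch gradient, which is why every model replica stays identical after each optimizer step.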


1. Cluster Overview#

Hardware#

| Node | CPU | Cores (HT) | Memory | GPUs | IB Ports (Active) |
|---|---|---|---|---|---|
| tau | x86_64 | 112 × 2 = 224 | ~2 TB | 8× NVIDIA H100 80GB | mlx5_3, mlx5_4, mlx5_6, mlx5_11 (200 Gb/s) |
| zeta | x86_64 | 128 × 2 = 256 | ~2 TB | 8× NVIDIA A100 80GB | mlx5_0–mlx5_3, mlx5_6–mlx5_9 (200 Gb/s) |
| omega | - | - | - | - | DOWN (maintenance) |
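Because a single NCCL_IB_HCA value is exported to every node in a job, it is safest to list only HCAs that are active on all participating nodes. The sketch below derives that list from the hardware table; the dictionary layout and the `hca_index` helper are illustrative, and the port lists should be re-checked with ibstat before relying on them.

```python
# Active IB HCAs per node, copied from the hardware table.
active_hcas = {
    "tau": {"mlx5_3", "mlx5_4", "mlx5_6", "mlx5_11"},
    "zeta": {"mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3",
             "mlx5_6", "mlx5_7", "mlx5_8", "mlx5_9"},
}

def hca_index(name):
    """Sort key: numeric suffix of an HCA name such as 'mlx5_11'."""
    return int(name.rsplit("_", 1)[1])

# Keep only the HCAs that are active on every node.
common = sorted(set.intersection(*active_hcas.values()), key=hca_index)
print("NCCL_IB_HCA=" + ",".join(common))  # NCCL_IB_HCA=mlx5_3,mlx5_6
```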

Slurm Configuration#

| Parameter | Value |
|---|---|
| Cluster | slurm |
| Partition | defq (default) |
| Total nodes | 3 (2 active) |
| Total GPUs | 16 |
| Resource selection plugin | select/cons_tres |
| MPI default | pmix |


2. Preparation: Smoke Tests#

Before running production training, always verify connectivity and NCCL health.

2.1 Basic 2-Node Connectivity#

#!/bin/bash
# smoke2node.sh
#SBATCH --job-name=smoke2node
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --time=00:02:00

srun -N 2 bash -c 'echo "Rank=$SLURM_PROCID  Host=$(hostname)"'

Expected output: Both ranks print their hostname.

2.2 NCCL Smoke Test (Ethernet)#

#!/bin/bash
# nccl_smoke_2node.sh
#SBATCH --job-name=nccl_smoke2n
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00

module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment
eval "$(conda shell.bash hook)"
conda activate hpc_lab

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=29500
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=bond

srun -N 2 python3 -c "
import os, torch, torch.distributed as dist
rank       = int(os.environ['SLURM_PROCID'])
world_size = int(os.environ['SLURM_NTASKS'])
local_rank = int(os.environ.get('SLURM_LOCALID', 0))
master     = os.environ['MASTER_ADDR']
port       = os.environ['MASTER_PORT']
torch.cuda.set_device(local_rank)
device = torch.device(f'cuda:{local_rank}')
dist.init_process_group('nccl', rank=rank, world_size=world_size,
                        init_method=f'tcp://{master}:{port}')
x = torch.ones(1, device=device) * rank
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f'Rank {rank}: all_reduce = {x.item()}')
dist.destroy_process_group()
"

Expected output: Both ranks print all_reduce = 1.0 (0 + 1).
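The expected value generalizes to any world size: each rank contributes its own rank id, so the sum is 0 + 1 + ... + (N - 1) = N(N - 1)/2. The sketch below computes that value and checks a captured job output against it; `check_smoke_output` is a hypothetical helper, and it assumes the exact print format used in the smoke test above.

```python
import re

def expected_all_reduce_sum(world_size):
    """Sum of rank ids 0..world_size-1, which every rank should print."""
    return float(world_size * (world_size - 1) // 2)

def check_smoke_output(text, world_size):
    """Return True if every rank reported the expected sum."""
    pattern = re.compile(r"Rank (\d+): all_reduce = ([-+0-9.eE]+)")
    results = {int(rank): float(value) for rank, value in pattern.findall(text)}
    expected = expected_all_reduce_sum(world_size)
    all_ranks_present = sorted(results) == list(range(world_size))
    return all_ranks_present and all(v == expected for v in results.values())

# Output order is not deterministic, so rank 1 may print before rank 0.
sample = "Rank 1: all_reduce = 1.0\nRank 0: all_reduce = 1.0\n"
print(check_smoke_output(sample, world_size=2))  # True
```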

2.3 NCCL Smoke Test (InfiniBand)#

#!/bin/bash
# nccl_smoke_2node_ib.sh
#SBATCH --job-name=nccl_ib2n
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00

module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment
eval "$(conda shell.bash hook)"
conda activate hpc_lab

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=29500
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=bond
export NCCL_IB_HCA=mlx5_3,mlx5_6
export NCCL_IB_GID_INDEX=3
export NCCL_DEBUG=INFO

srun -N 2 python3 -c "
import os, torch, torch.distributed as dist
rank       = int(os.environ['SLURM_PROCID'])
world_size = int(os.environ['SLURM_NTASKS'])
local_rank = int(os.environ.get('SLURM_LOCALID', 0))
master     = os.environ['MASTER_ADDR']
port       = os.environ['MASTER_PORT']
torch.cuda.set_device(local_rank)
device = torch.device(f'cuda:{local_rank}')
dist.init_process_group('nccl', rank=rank, world_size=world_size,
                        init_method=f'tcp://{master}:{port}')
x = torch.ones(1, device=device) * rank
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f'Rank {rank}: all_reduce = {x.item()}')
dist.destroy_process_group()
"

Expected output: NCCL log shows Using network IB, both ranks print all_reduce = 1.0.
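Scanning a long NCCL_DEBUG=INFO log by eye is error-prone, so a small script can report which transport NCCL selected. This is a sketch that only looks for the "Using network" marker mentioned above; the sample log lines (hostnames, PIDs, IP address) are made up for illustration and real logs contain many more lines.

```python
def detect_nccl_transports(log_text):
    """Return the set of transports named in NCCL 'Using network' lines."""
    marker = "Using network "
    transports = set()
    for line in log_text.splitlines():
        if marker in line:
            rest = line.split(marker, 1)[1].split()
            if rest:
                transports.add(rest[0])
    return transports

# Illustrative log excerpt, not captured from the cluster.
sample_log = """\
tau:4242:4242 [0] NCCL INFO Bootstrap : Using bond0:10.0.0.1<0>
tau:4242:4242 [0] NCCL INFO Using network IB
zeta:5151:5151 [0] NCCL INFO Using network IB
"""

found = detect_nccl_transports(sample_log)
print("transports:", sorted(found))
print("InfiniBand only:", found == {"IB"})
```

If the result contains "Socket", at least one rank fell back to Ethernet.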

2.4 Troubleshooting Smoke Tests#

| Symptom | Cause | Fix |
|---|---|---|
| Socket timed out | Node still draining | Wait for node to reach IDLE state |
| No route to host | NCCL picked down IB port | Set NCCL_IB_HCA to active ports only |
| QOSMaxMemoryPerUser | Memory QoS limit exceeded | Reduce --mem or wait for lower usage |
| Job stuck in CG (Completing) | Orphaned srun step | Contact admin or restart slurmctld |
| NCCL error: unhandled cuda error | GPU not accessible | Verify that --gres=gpu:N matches the number of GPUs the job uses per node |


3. NCCL Environment Variables Reference#

| Variable | Purpose | Recommended Value |
|---|---|---|
| NCCL_IB_DISABLE | Enable/disable IB transport | 0 (enable) for IB, 1 for Ethernet-only |
| NCCL_SOCKET_IFNAME | Bootstrap interface | bond (always use Ethernet for bootstrap) |
| NCCL_IB_HCA | Explicit IB HCAs to use | mlx5_3,mlx5_6 (active ports on both tau and zeta) |
| NCCL_IB_GID_INDEX | GID index for RoCE/IB | 3 |
| NCCL_DEBUG | Debug verbosity | INFO for troubleshooting, unset for production |
| NCCL_NET_PLUGIN | Custom network plugin | Leave default (internal) |
| NCCL_P2P_DISABLE | Disable P2P between GPUs | 0 (enable P2P for same-node GPUs) |
| NCCL_SHM_DISABLE | Disable shared memory | 0 (enable for same-node communication) |

Key Principle: Bootstrap over Ethernet, Transport over IB#

Always set NCCL_SOCKET_IFNAME=bond so process discovery uses Ethernet, then let NCCL use IB for actual data transfer. This prevents NCCL from selecting a down IB port during bootstrap.
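The two configurations used in this guide differ only in a few variables, which can be captured in one helper. The function name `nccl_env` is hypothetical; the values are the ones recommended in the table above and are specific to tau and zeta.

```python
def nccl_env(use_infiniband):
    """Build the NCCL environment used in this guide (values from the table above)."""
    env = {
        # Bootstrap always goes over the bonded Ethernet interface.
        "NCCL_SOCKET_IFNAME": "bond",
        "NCCL_IB_DISABLE": "0" if use_infiniband else "1",
    }
    if use_infiniband:
        # Restrict the data transport to HCAs that are active on both nodes.
        env["NCCL_IB_HCA"] = "mlx5_3,mlx5_6"
        env["NCCL_IB_GID_INDEX"] = "3"
    return env

for name, value in nccl_env(use_infiniband=True).items():
    print(f"export {name}={value}")
```

Printing export lines keeps the job script as the single place where the environment is set, which matches how the smoke tests above are written.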


4. DDP Training: General Guidelines#

4.1 Single-Node, Multi-GPU#

Use torchrun for single-node DDP:

#!/bin/bash
# ddp_job_1node.sh
#SBATCH --job-name=cifar10_ddp_1n
#SBATCH --partition=defq
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=00:30:00

module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment
eval "$(conda shell.bash hook)"
conda activate hpc_lab

torchrun --nproc_per_node=2 --nnodes=1 --rdzv_endpoint=localhost:29500 train_ddp.py

Python code (train_ddp.py):

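import os
import torch
import torch.distributed as dist

# torchrun exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)  # bind this process to its own GPU
dist.init_process_group("nccl")    # reads the torchrun variables from the environment
rank       = dist.get_rank()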

4.2 Multi-Node DDP#

Use srun with SLURM environment variables:

#!/bin/bash
# ddp_job_2node.sh
#SBATCH --job-name=cifar10_ddp_2n
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=8G
#SBATCH --time=00:30:00

module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment
eval "$(conda shell.bash hook)"
conda activate hpc_lab

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=29500
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=bond
export NCCL_IB_HCA=mlx5_3,mlx5_6
export NCCL_IB_GID_INDEX=3

srun -N 2 python3 train_ddp_2node.py

Python code (train_ddp_2node.py):

import os
import torch
import torch.distributed as dist

# srun does not set the torchrun variables, so read the Slurm ones instead
rank       = int(os.environ['SLURM_PROCID'])
world_size = int(os.environ['SLURM_NTASKS'])
local_rank = int(os.environ.get('SLURM_LOCALID', 0))
master     = os.environ['MASTER_ADDR']
port       = os.environ['MASTER_PORT']

torch.cuda.set_device(local_rank)  # bind this process to its own GPU
dist.init_process_group('nccl', rank=rank, world_size=world_size,
                        init_method=f'tcp://{master}:{port}')
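Sections 4.1 and 4.2 read rank information from different environment variables, so one training script can support both launchers by checking which set is present. The helper below is a sketch; `resolve_dist_env` is a hypothetical name, and the single-process fallback is an assumption added for local debugging.

```python
import os

def resolve_dist_env(environ=os.environ):
    """Return (rank, world_size, local_rank) from torchrun or Slurm variables."""
    if "RANK" in environ and "WORLD_SIZE" in environ:
        # Launched by torchrun (Section 4.1).
        return (int(environ["RANK"]), int(environ["WORLD_SIZE"]),
                int(environ.get("LOCAL_RANK", 0)))
    if "SLURM_PROCID" in environ and "SLURM_NTASKS" in environ:
        # Launched by srun (Section 4.2).
        return (int(environ["SLURM_PROCID"]), int(environ["SLURM_NTASKS"]),
                int(environ.get("SLURM_LOCALID", 0)))
    # Assumed fallback: plain single-process run.
    return (0, 1, 0)

print(resolve_dist_env({"RANK": "1", "WORLD_SIZE": "2", "LOCAL_RANK": "1"}))
print(resolve_dist_env({"SLURM_PROCID": "1", "SLURM_NTASKS": "2", "SLURM_LOCALID": "0"}))
print(resolve_dist_env({}))
```

The returned values can then be passed to dist.init_process_group and torch.cuda.set_device as in the snippets above.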

4.3 Best Practices#

  1. Always run smoke tests first: verify NCCL works before launching expensive training.

  2. Use DistributedSampler: it gives each GPU a distinct shard of the data every epoch.

  3. Call train_sampler.set_epoch(epoch) at the start of each epoch: without it, every epoch reuses the same shuffle order.

  4. Save checkpoints on rank 0 only: avoids file write conflicts.

  5. Use pin_memory=True in DataLoader: enables asynchronous CPU→GPU transfer when tensors are moved with non_blocking=True.

  6. Set num_workers appropriately: 4 per GPU is a good starting point.

  7. Use model.module.state_dict(): unwraps the DDP wrapper before saving.

  8. Call dist.destroy_process_group(): ensures clean shutdown.

  9. Use NCCL_DEBUG=INFO only for debugging: it adds logging overhead in production.

  10. Monitor with nvidia-smi: verify all GPUs are utilized.
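Items 2 and 3 are easier to follow with the sharding logic written out. The sketch below mimics, in plain Python, how DistributedSampler partitions a dataset: shuffle with a seed derived from the epoch, pad to a multiple of the world size, then take every world_size-th index starting at the rank. It is an approximation for illustration (the real sampler uses PyTorch's random generator), and `shard_indices` is a hypothetical name.

```python
import random

def shard_indices(dataset_len, world_size, rank, epoch, seed=0):
    """Mimic DistributedSampler: shuffle by seed + epoch, pad, stride by rank."""
    indices = list(range(dataset_len))
    random.Random(seed + epoch).shuffle(indices)
    # Pad so the length divides evenly by world_size (a few samples repeat).
    remainder = len(indices) % world_size
    if remainder:
        indices += indices[: world_size - remainder]
    return indices[rank::world_size]

world_size, n = 2, 10
for epoch in range(2):
    shards = [shard_indices(n, world_size, r, epoch) for r in range(world_size)]
    print(f"epoch {epoch}: rank0={shards[0]} rank1={shards[1]}")
```

Within one epoch the ranks see disjoint samples. Calling the function with the same epoch always yields the same order, which is the effect of forgetting set_epoch.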

4.4 Performance Checklist#

| Check | Command |
|---|---|
| All GPUs active | nvidia-smi on each node |
| IB link up | ibstat: look for State: Active |
| NCCL using IB | NCCL_DEBUG=INFO log shows Using network IB |
| No Ethernet fallback | Log should not show Using network Socket |
| GPU utilization | nvidia-smi dmon: should show >80% utilization |
| Memory balanced | All GPUs using similar VRAM |
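The utilization and memory checks can be scripted by parsing the CSV output of `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits`. The sketch below works on captured text so it runs without a GPU; the function name `check_gpus`, the sample values, and the 20% memory-spread threshold are assumptions for illustration, while the 80% utilization threshold comes from the checklist.

```python
import csv
import io

def check_gpus(csv_text, min_util=80.0, max_mem_spread=0.2):
    """Flag under-utilized GPUs and uneven memory use from nvidia-smi CSV text."""
    rows = [[cell.strip() for cell in row]
            for row in csv.reader(io.StringIO(csv_text)) if row]
    utils = [float(row[1]) for row in rows]
    mems = [float(row[2]) for row in rows]
    underused = [int(row[0]) for row, util in zip(rows, utils) if util < min_util]
    # Relative spread between the most and least loaded GPU (MiB values).
    spread = (max(mems) - min(mems)) / max(mems) if max(mems) > 0 else 0.0
    return {"underused_gpus": underused, "memory_balanced": spread <= max_mem_spread}

# Illustrative sample: columns are index, utilization %, memory used in MiB.
sample = "0, 93, 41250\n1, 35, 40980\n"
print(check_gpus(sample))
```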


5. Common Pitfalls#

5.1 NCCL Picks Down IB Port#

Symptom: No route to host during NCCL init.

Cause: NCCL auto-detects all IB HCAs, including down ones.

Fix: Explicitly restrict NCCL to active ports:

export NCCL_IB_HCA=mlx5_3,mlx5_6

Verify active ports first with ibstat | grep -A5 "Port 1" and confirm State: Active.
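Rather than reading the ibstat output by eye, the active HCAs can be extracted with a short parser. This sketch assumes the usual ibstat layout, in which each adapter starts with a `CA '<name>'` line followed by indented port details; the sample text is abbreviated and made up, and `parse_active_hcas` is a hypothetical name.

```python
import re

def parse_active_hcas(ibstat_text):
    """Return names of HCAs that have at least one port in State: Active."""
    active, current = [], None
    for line in ibstat_text.splitlines():
        ca = re.match(r"\s*CA '([^']+)'", line)
        if ca:
            current = ca.group(1)
            continue
        # Matches 'State: Active' but not 'Physical state: LinkUp'.
        state = re.match(r"\s*State:\s*(\w+)", line)
        if state and state.group(1) == "Active":
            if current is not None and current not in active:
                active.append(current)
    return active

# Abbreviated, illustrative ibstat output.
sample = """\
CA 'mlx5_3'
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 200
CA 'mlx5_5'
	Port 1:
		State: Down
		Physical state: Disabled
"""
print(",".join(parse_active_hcas(sample)))  # mlx5_3
```

Running this on every node and intersecting the results gives a candidate NCCL_IB_HCA value.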

5.2 Node Still Draining#

Symptom: Socket timed out on send/recv operation.

Cause: Node was resumed from drain but slurmd isn't ready.

Fix: Wait 1–2 minutes, then resubmit. Check node state with:

sinfo -o "%N %t"
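The check can be automated before resubmitting by parsing that sinfo output. The sketch below assumes the two-column format produced by the command above and treats only the compact states "idle" and "mix" as ready, which is an assumption; it also splits comma-separated node lists but does not expand bracketed ranges. `nodes_not_ready` is a hypothetical name and the sample output is illustrative.

```python
def nodes_not_ready(sinfo_text, ready_states=("idle", "mix")):
    """Map node name to state for every node that is not ready for a new job."""
    not_ready = {}
    for line in sinfo_text.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[0] == "NODELIST":
            continue  # skip the header and malformed lines
        nodelist, state = parts
        if state not in ready_states:
            for node in nodelist.split(","):
                not_ready[node] = state
    return not_ready

# Illustrative output of: sinfo -o "%N %t"
sample = "NODELIST STATE\nomega down*\ntau drng\nzeta idle\n"
print(nodes_not_ready(sample))
```

An empty result for the nodes you need means it is reasonable to resubmit.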

5.3 QoS Memory Limit#

Symptom: Job pending with QOSMaxMemoryPerUser.

Cause: Your total memory across all jobs exceeds the QoS limit.

Fix: Cancel other jobs or reduce --mem request. Check current usage with:

squeue -u $USER -o "%A %j %m %t"

5.4 Stuck COMPLETING State#

Symptom: Job shows CG indefinitely.

Cause: Orphaned srun step that failed to launch.

Fix: Contact the cluster admin to force-clear the job. In the meantime you can cancel with scancel <jobid>; if that hangs, the admin needs to run scontrol requeue <jobid> or restart slurmctld.


6. Benchmark Results#

| Configuration | Nodes | Transport | Time (5 epochs) | Test Accuracy |
|---|---|---|---|---|
| 2-node, 2 GPU | tau + zeta | Ethernet (bond0) | 37.7 s | 71.0% |
| 2-node, 2 GPU | tau + zeta | InfiniBand | 18.1 s | 71.9% |

Speedup: ~2.1× with InfiniBand over Ethernet for 2-node DDP.
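The speedup figure follows directly from the two wall-clock times in the table. The short calculation below reproduces it; the variable names are illustrative.

```python
# Wall-clock times for 5 epochs, taken from the benchmark table.
ethernet_s = 37.7
infiniband_s = 18.1

speedup = ethernet_s / infiniband_s
saved = ethernet_s - infiniband_s

print(f"speedup: {speedup:.2f}x")  # speedup: 2.08x
print(f"time saved: {saved:.1f} s ({saved / ethernet_s:.0%} of the Ethernet run)")
# time saved: 19.6 s (52% of the Ethernet run)
```

The measured gain (about 2.1×) is much smaller than the 20× difference in link bandwidth, presumably because the compute portion of each step is unaffected by the interconnect.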


7. Quick Reference: Submit a Job#

# 1. Smoke test: basic 2-node connectivity
sbatch smoke2node.sh

# 2. Smoke test: NCCL over Ethernet
sbatch nccl_smoke_2node.sh

# 3. Smoke test: NCCL over InfiniBand
sbatch nccl_smoke_2node_ib.sh

# 4. DDP training (single node, 2 GPUs)
sbatch ddp_job_1node.sh

# 5. DDP training (2 nodes, InfiniBand)
sbatch ddp_job_2node.sh

# Monitor jobs
squeue -u $USER -o "%A %N %j %t %R"

# Check output (the scripts above set no --output, so Slurm writes slurm-<jobid>.out)
cat slurm-<jobid>.out

Mahidol AI Center: HPC-AI User Guide v1.0