# HPC-AI User Guide: NCCL DDP Multi-Node Training **Author:** Snit Sanghlao and Qwen **Institution:** Mahidol AI Center **Date:** 2026-05-19 **Cluster:** MAI HPC (Slurm) — tau, zeta, omega --- ## Table of Contents 1. [Cluster Overview](#1-cluster-overview) 2. [Preparation: Smoke Tests](#2-preparation-smoke-tests) 3. [NCCL Environment Variables Reference](#3-nccl-environment-variables-reference) 4. [DDP Training: General Guidelines](#4-ddp-training-general-guidelines) 5. [Common Pitfalls](#5-common-pitfalls) 6. [Benchmark Results](#6-benchmark-results) 7. [Quick Reference: Submit a Job](#7-quick-reference-submit-a-job) --- ## Executive Summary This guide covers Distributed Data Parallel (DDP) training across multiple GPUs and nodes using NVIDIA NCCL on the Mahidol AI Center HPC cluster. ### Why NCCL + DDP + InfiniBand? | Factor | Single GPU | DDP (Ethernet) | DDP (InfiniBand) | |--------|-----------|----------------|------------------| | Throughput | — | ~10 Gb/s | ~200 Gb/s (20×) | | Training time (CIFAR-10, 2 GPUs) | Baseline | 37.7s | **18.1s** | | Speedup | 1× | — | **~2.1× over Ethernet** | | Scalability | N/A | Limited by bandwidth | Scales to many nodes | - **NCCL** (NVIDIA Collective Communications Library) is the standard communication backend for multi-GPU PyTorch training. It provides optimized `all_reduce`, `all_gather`, and `broadcast` operations. - **DDP** (Distributed Data Parallel) splits training data across GPUs, each holding a full model copy, and synchronizes gradients via NCCL after every backward pass. - **InfiniBand** provides low-latency, high-bandwidth interconnect between nodes — critical for multi-node DDP where gradient synchronization becomes the bottleneck. > **Bottom line:** Without InfiniBand, multi-node DDP is bottlenecked by Ethernet bandwidth. With it, you get near-linear scaling across nodes. --- ## 1. Cluster Overview ### Hardware | Node | CPU | Cores (HT) | Memory | GPUs | IB Ports (Active) | |------|-----|-----------|--------|------|-------------------| | **tau** | x86_64 | 112 × 2 = 224 | ~2 TB | 8× NVIDIA H100 80GB | `mlx5_3`, `mlx5_4`, `mlx5_6`, `mlx5_11` (200 Gb/s) | | **zeta** | x86_64 | 128 × 2 = 256 | ~2 TB | 8× NVIDIA A100 80GB | `mlx5_0`–`mlx5_3`, `mlx5_6`–`mlx5_9` (200 Gb/s) | | **omega** | — | — | — | — | DOWN (maintenance) | ### Slurm Configuration | Parameter | Value | |-----------|-------| | Cluster | `slurm` | | Partition | `defq` (default) | | Total nodes | 3 (2 active) | | Total GPUs | 16 | | Scheduler | `select/cons_tres` | | MPI default | `pmix` | --- ## 2. Preparation: Smoke Tests Before running production training, always verify connectivity and NCCL health. ### 2.1 Basic 2-Node Connectivity ```bash #!/bin/bash # smoke2node.sh #SBATCH --job-name=smoke2node #SBATCH --partition=defq #SBATCH --nodes=2 #SBATCH --ntasks-per-node=1 #SBATCH --cpus-per-task=2 #SBATCH --time=00:02:00 srun -N 2 bash -c 'echo "Rank=$SLURM_PROCID Host=$(hostname)"' ``` **Expected output:** Both ranks print their hostname. ### 2.2 NCCL Smoke Test (Ethernet) ```bash #!/bin/bash # nccl_smoke_2node.sh #SBATCH --job-name=nccl_smoke2n #SBATCH --partition=defq #SBATCH --nodes=2 #SBATCH --ntasks-per-node=1 #SBATCH --gres=gpu:1 #SBATCH --cpus-per-task=4 #SBATCH --time=00:05:00 module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment eval "$(conda shell.bash hook)" conda activate hpc_lab export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1) export MASTER_PORT=29500 export NCCL_IB_DISABLE=1 export NCCL_SOCKET_IFNAME=bond srun -N 2 python3 -c " import os, torch, torch.distributed as dist rank = int(os.environ['SLURM_PROCID']) world_size = int(os.environ['SLURM_NTASKS']) local_rank = int(os.environ.get('SLURM_LOCALID', 0)) master = os.environ['MASTER_ADDR'] port = os.environ['MASTER_PORT'] torch.cuda.set_device(local_rank) device = torch.device(f'cuda:{local_rank}') dist.init_process_group('nccl', rank=rank, world_size=world_size, init_method=f'tcp://{master}:{port}') x = torch.ones(1, device=device) * rank dist.all_reduce(x, op=dist.ReduceOp.SUM) print(f'Rank {rank}: all_reduce = {x.item()}') dist.destroy_process_group() " ``` **Expected output:** Both ranks print `all_reduce = 1.0` (0 + 1). ### 2.3 NCCL Smoke Test (InfiniBand) ```bash #!/bin/bash # nccl_smoke_2node_ib.sh #SBATCH --job-name=nccl_ib2n #SBATCH --partition=defq #SBATCH --nodes=2 #SBATCH --ntasks-per-node=1 #SBATCH --gres=gpu:1 #SBATCH --cpus-per-task=4 #SBATCH --time=00:05:00 module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment eval "$(conda shell.bash hook)" conda activate hpc_lab export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1) export MASTER_PORT=29500 export NCCL_IB_DISABLE=0 export NCCL_SOCKET_IFNAME=bond export NCCL_IB_HCA=mlx5_3,mlx5_6 export NCCL_IB_GID_INDEX=3 export NCCL_DEBUG=INFO srun -N 2 python3 -c " import os, torch, torch.distributed as dist rank = int(os.environ['SLURM_PROCID']) world_size = int(os.environ['SLURM_NTASKS']) local_rank = int(os.environ.get('SLURM_LOCALID', 0)) master = os.environ['MASTER_ADDR'] port = os.environ['MASTER_PORT'] torch.cuda.set_device(local_rank) device = torch.device(f'cuda:{local_rank}') dist.init_process_group('nccl', rank=rank, world_size=world_size, init_method=f'tcp://{master}:{port}') x = torch.ones(1, device=device) * rank dist.all_reduce(x, op=dist.ReduceOp.SUM) print(f'Rank {rank}: all_reduce = {x.item()}') dist.destroy_process_group() " ``` **Expected output:** NCCL log shows `Using network IB`, both ranks print `all_reduce = 1.0`. ### 2.4 Troubleshooting Smoke Tests | Symptom | Cause | Fix | |---------|-------|-----| | `Socket timed out` | Node still draining | Wait for node to reach `IDLE` state | | `No route to host` | NCCL picked down IB port | Set `NCCL_IB_HCA` to active ports only | | `QOSMaxMemoryPerUser` | Memory QoS limit exceeded | Reduce `--mem` or wait for lower usage | | Job stuck in `CG` (Completing) | Orphaned srun step | Contact admin or restart slurmctld | | `NCCL error: unhandled cuda error` | GPU not accessible | Verify `--gres=gpu:N` matches request | --- ## 3. NCCL Environment Variables Reference | Variable | Purpose | Recommended Value | |----------|---------|-------------------| | `NCCL_IB_DISABLE` | Enable/disable IB transport | `0` (enable) for IB, `1` for Ethernet-only | | `NCCL_SOCKET_IFNAME` | Bootstrap interface | `bond` — always use Ethernet for bootstrap | | `NCCL_IB_HCA` | Explicit IB HCAs to use | `mlx5_3,mlx5_6` (active ports on both tau and zeta) | | `NCCL_IB_GID_INDEX` | GID index for RoCE/IB | `3` | | `NCCL_DEBUG` | Debug verbosity | `INFO` for troubleshooting, unset for production | | `NCCL_NET_PLUGIN` | Custom network plugin | Leave default (internal) | | `NCCL_P2P_DISABLE` | Disable P2P between GPUs | `0` (enable P2P for same-node GPUs) | | `NCCL_SHM_DISABLE` | Disable shared memory | `0` (enable for same-node communication) | ### Key Principle: Bootstrap over Ethernet, Transport over IB > Always set `NCCL_SOCKET_IFNAME=bond` so process discovery uses Ethernet, then let NCCL use IB for actual data transfer. This prevents NCCL from selecting a down IB port during bootstrap. --- ## 4. DDP Training: General Guidelines ### 4.1 Single-Node, Multi-GPU Use `torchrun` for single-node DDP: ```bash #!/bin/bash # ddp_job_1node.sh #SBATCH --job-name=cifar10_ddp_1n #SBATCH --partition=defq #SBATCH --nodes=1 #SBATCH --ntasks-per-node=1 #SBATCH --gres=gpu:2 #SBATCH --cpus-per-task=8 #SBATCH --mem=16G #SBATCH --time=00:30:00 module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment eval "$(conda shell.bash hook)" conda activate hpc_lab torchrun --nproc_per_node=2 --nnodes=1 --rdzv_endpoint=localhost:29500 train_ddp.py ``` **Python code** (`train_ddp.py`): ```python import os import torch.distributed as dist dist.init_process_group("nccl") # torchrun sets all env vars automatically rank = dist.get_rank() local_rank = int(os.environ['LOCAL_RANK']) ``` ### 4.2 Multi-Node DDP Use `srun` with SLURM environment variables: ```bash #!/bin/bash # ddp_job_2node.sh #SBATCH --job-name=cifar10_ddp_2n #SBATCH --partition=defq #SBATCH --nodes=2 #SBATCH --ntasks-per-node=1 #SBATCH --gres=gpu:1 #SBATCH --cpus-per-task=8 #SBATCH --mem=8G #SBATCH --time=00:30:00 module load cuda12.2/toolkit nccl2-cuda12.2-gcc11/2.18.3 default-environment eval "$(conda shell.bash hook)" conda activate hpc_lab export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1) export MASTER_PORT=29500 export NCCL_IB_DISABLE=0 export NCCL_SOCKET_IFNAME=bond export NCCL_IB_HCA=mlx5_3,mlx5_6 export NCCL_IB_GID_INDEX=3 srun -N 2 python3 train_ddp_2node.py ``` **Python code** (`train_ddp_2node.py`): ```python import os import torch.distributed as dist rank = int(os.environ['SLURM_PROCID']) world_size = int(os.environ['SLURM_NTASKS']) local_rank = int(os.environ.get('SLURM_LOCALID', 0)) master = os.environ['MASTER_ADDR'] port = os.environ['MASTER_PORT'] dist.init_process_group('nccl', rank=rank, world_size=world_size, init_method=f'tcp://{master}:{port}') ``` ### 4.3 Best Practices 1. **Always run smoke tests first** — verify NCCL works before launching expensive training 2. **Use `DistributedSampler`** — ensures each GPU gets unique data per epoch 3. **Call `train_sampler.set_epoch(epoch)`** — shuffles data differently each epoch 4. **Save checkpoints on rank 0 only** — avoids file write conflicts 5. **Use `pin_memory=True`** in DataLoader — enables async CPU→GPU transfer 6. **Set `num_workers` appropriately** — 4 per GPU is a good starting point 7. **Use `model.module.state_dict()`** — unwrap DDP wrapper before saving 8. **Call `dist.destroy_process_group()`** — ensures clean shutdown 9. **Use `NCCL_DEBUG=INFO` only for debugging** — adds overhead in production 10. **Monitor with `nvidia-smi`** — verify all GPUs are utilized ### 4.4 Performance Checklist | Check | Command | |-------|---------| | All GPUs active | `nvidia-smi` on each node | | IB link up | `ibstat` — look for `State: Active` | | NCCL using IB | `NCCL_DEBUG=INFO` log shows `Using network IB` | | No Ethernet fallback | Log should **not** show `Using network Socket` | | GPU utilization | `nvidia-smi dmon` — should show >80% utilization | | Memory balanced | All GPUs using similar VRAM | --- ## 5. Common Pitfalls ### 5.1 NCCL Picks Down IB Port **Symptom:** `No route to host` during NCCL init. **Cause:** NCCL auto-detects all IB HCAs, including down ones. **Fix:** Explicitly restrict NCCL to active ports: ```bash export NCCL_IB_HCA=mlx5_3,mlx5_6 ``` Verify active ports first with `ibstat | grep -A5 "Port 1"` and confirm `State: Active`. ### 5.2 Node Still Draining **Symptom:** `Socket timed out on send/recv operation`. **Cause:** Node was resumed from `drain` but `slurmd` isn't ready. **Fix:** Wait 1–2 minutes, then resubmit. Check node state with: ```bash sinfo -o "%N %t" ``` ### 5.3 QoS Memory Limit **Symptom:** Job pending with `QOSMaxMemoryPerUser`. **Cause:** Your total memory across all jobs exceeds the QoS limit. **Fix:** Cancel other jobs or reduce `--mem` request. Check current usage with: ```bash squeue -u $USER -o "%A %j %m %t" ``` ### 5.4 Stuck COMPLETING State **Symptom:** Job shows `CG` indefinitely. **Cause:** Orphaned `srun` step that failed to launch. **Fix:** Contact the cluster admin to force-clear the job. In the meantime you can cancel with `scancel ` — if that hangs, the admin needs to run `scontrol requeue ` or restart `slurmctld`. --- ## 6. Benchmark Results | Configuration | GPUs | Transport | Time (5 epochs) | Test Accuracy | |---------------|------|-----------|-----------------|---------------| | 2-node, 2 GPU | tau + zeta | Ethernet (bond0) | 37.7s | 71.0% | | 2-node, 2 GPU | tau + zeta | **InfiniBand** | **18.1s** | **71.9%** | **Speedup:** ~2.1× with InfiniBand over Ethernet for 2-node DDP. --- ## 7. Quick Reference: Submit a Job ```bash # 1. Smoke test: basic 2-node connectivity sbatch smoke2node.sh # 2. Smoke test: NCCL over Ethernet sbatch nccl_smoke_2node.sh # 3. Smoke test: NCCL over InfiniBand sbatch nccl_smoke_2node_ib.sh # 4. DDP training (single node, 2 GPUs) sbatch ddp_job_1node.sh # 5. DDP training (2 nodes, InfiniBand) sbatch ddp_job_2node.sh # Monitor jobs squeue -u $USER -o "%A %N %j %t %R" # Check output cat ddp_2node-*.out ``` --- *Mahidol AI Center — HPC-AI User Guide v1.0*