1. Introduction: The Scale of Modern AI
Training GPT-4 reportedly required an estimated 25,000 NVIDIA A100 GPUs running for months. Meta's Llama 3.1 405B was trained on up to 16,384 H100 GPUs. These aren't just numbers: they represent some of the most sophisticated computing infrastructure ever built.
Understanding how these massive GPU clusters are architected is essential for anyone working on large-scale AI. The difference between an efficient and an inefficient cluster design can mean weeks of extra training time and millions of dollars in compute costs.
In this comprehensive guide, we'll explore:
- The hierarchical structure of GPU clusters (GPU → Node → Rack → Pod → Cluster)
- How GPUs communicate within and across nodes
- The massive bandwidth disparities at different hierarchy levels
- Technologies like NVLink, NVSwitch, InfiniBand, and GPUDirect
- Network topologies used in HPC (Fat-tree, Dragonfly, Hypercube)
The communication between GPUs often becomes the bottleneck in distributed training. Understanding cluster architecture helps you choose the right parallelization strategy (data parallel, tensor parallel, pipeline parallel) and optimize your training throughput.
2. Cluster Hierarchy: From GPU to Datacenter
Modern AI clusters are organized in a hierarchical structure, with each level having distinct characteristics and communication patterns. Let's examine each level from bottom to top.
Figure 1: Hierarchical structure of a GPU cluster, from individual GPUs to the full datacenter cluster.
Level 1: GPU (Graphics Processing Unit)
The fundamental compute unit. Modern AI GPUs like NVIDIA H100 have:
- 80GB HBM3 memory with 3.35 TB/s bandwidth
- ~989 TFLOPS of dense FP16/BF16 tensor compute
- Multiple NVLink connections for high-speed interconnect
Level 2: Node (Server)
A single physical server containing multiple GPUs. The standard configuration is 8 GPUs per node (e.g., NVIDIA DGX H100). Nodes include:
- CPUs for orchestration and preprocessing
- System memory (typically 2TB+ DDR5)
- NVLink/NVSwitch for intra-node GPU communication
- Network interfaces (InfiniBand NICs) for inter-node communication
Level 3: Rack
A physical cabinet containing multiple nodes, typically 4-8 servers per rack (32-64 GPUs). Racks have:
- Top-of-Rack (ToR) switches connecting nodes within the rack
- Shared power and cooling infrastructure
- Local storage for checkpoints and data caching
Level 4: Pod
A group of interconnected racks optimized for low-latency communication, typically 256-512 GPUs. Pods represent the "sweet spot" for many training jobs:
- Non-blocking fabric within the pod (full bisection bandwidth)
- Leaf-spine architecture for uniform latency
- Dedicated scheduler for pod-level job allocation
Level 5: Cluster / Datacenter
The full installation, potentially spanning thousands to tens of thousands of GPUs. Clusters connect multiple pods with:
- Core/spine switches connecting pods
- Potentially oversubscribed inter-pod bandwidth
- Global job scheduler and resource management
3. Intra-Node Communication
Within a single node, GPUs communicate directly through high-bandwidth interconnects. This is the fastest communication path in the entire cluster hierarchy.
Figure 2: DGX H100 architecture showing 8 GPUs connected via NVSwitch fabric. Each GPU has 900 GB/s all-to-all bandwidth to every other GPU.
NVLink: The GPU-to-GPU Superhighway
NVLink is NVIDIA's proprietary high-speed interconnect for direct GPU-to-GPU communication. Each generation has dramatically increased bandwidth:
| NVLink Generation | Per Link Bandwidth | Links per GPU | Total Bandwidth |
|---|---|---|---|
| NVLink 1.0 (P100) | 40 GB/s | 4 | 160 GB/s |
| NVLink 2.0 (V100) | 50 GB/s | 6 | 300 GB/s |
| NVLink 3.0 (A100) | 50 GB/s | 12 | 600 GB/s |
| NVLink 4.0 (H100) | 50 GB/s | 18 | 900 GB/s |
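The totals in the last column follow directly from per-link bandwidth times link count; a quick sketch to sanity-check the table (the dictionary and function names here are ours, purely illustrative):

```python
# Sanity-check of the NVLink table: total per-GPU bandwidth is
# per-link bandwidth multiplied by the number of links.
NVLINK_GENERATIONS = {
    "NVLink 1.0 (P100)": (40, 4),   # (GB/s per link, links per GPU)
    "NVLink 2.0 (V100)": (50, 6),
    "NVLink 3.0 (A100)": (50, 12),
    "NVLink 4.0 (H100)": (50, 18),
}

def total_nvlink_bandwidth(per_link_gbs: int, links: int) -> int:
    """Aggregate NVLink bandwidth per GPU in GB/s."""
    return per_link_gbs * links

for gen, (per_link, links) in NVLINK_GENERATIONS.items():
    print(f"{gen}: {total_nvlink_bandwidth(per_link, links)} GB/s")
```

Note that per-link bandwidth has been flat at 50 GB/s since V100; the generational gains come from adding links (and, in NVLink 4.0, faster signaling with fewer lanes per link).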
NVSwitch: Full-Bandwidth All-to-All
Without NVSwitch, GPUs would need direct NVLink connections to every other GPU—impractical for 8+ GPUs. NVSwitch provides a non-blocking switch fabric:
- All-to-all connectivity: Any GPU can communicate with any other at full bandwidth
- No hop penalty: Communication doesn't go through intermediate GPUs
- 4 NVSwitches per DGX H100: Provides redundancy and aggregate bandwidth
- Switch bandwidth: 3.6 TB/s bidirectional
- Ports: 64 NVLink 4.0 ports per switch
- Latency: ~100 ns hop latency
- SHARP support: in-network reduction for collectives
4. Inter-Node Communication
When training scales beyond a single node, communication must traverse the network fabric. This is where bandwidth drops dramatically and optimization becomes critical.
Figure 3: Inter-node communication via InfiniBand network. Note the dramatic bandwidth difference vs. NVLink.
InfiniBand: The Network Backbone
InfiniBand is the dominant networking technology for HPC and AI clusters. Unlike Ethernet, it's designed from the ground up for low latency and high bandwidth:
| InfiniBand Generation | Per-Lane Rate | 4× Link Rate | Typical Latency |
|---|---|---|---|
| FDR (2011) | 14 Gb/s | 56 Gb/s | ~1.3 μs |
| EDR (2014) | 25 Gb/s | 100 Gb/s | ~0.9 μs |
| HDR (2018) | 50 Gb/s | 200 Gb/s | ~0.6 μs |
| NDR (2022) | 100 Gb/s | 400 Gb/s | ~0.5 μs |
| XDR (2025) | 200 Gb/s | 800 Gb/s | ~0.4 μs |
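Latency matters for small messages and bandwidth for large ones. A minimal alpha-beta transfer-time model makes this concrete (a sketch using the table's NDR numbers, not a vendor tool):

```python
def transfer_time_us(size_bytes: float, bandwidth_gb_s: float, latency_us: float) -> float:
    """Alpha-beta model: end-to-end time = latency + size / bandwidth."""
    return latency_us + size_bytes / (bandwidth_gb_s * 1e9) * 1e6

# NDR InfiniBand: 400 Gb/s ~= 50 GB/s, ~0.5 us latency
small = transfer_time_us(1_000, 50, 0.5)          # 1 KB: latency-dominated (~0.52 us)
large = transfer_time_us(1_000_000_000, 50, 0.5)  # 1 GB: bandwidth-dominated (~20,000 us)
print(f"1 KB: {small:.2f} us, 1 GB: {large:.0f} us")
```

For the 1 KB message, 96% of the time is latency; for the 1 GB message, latency is noise. This is why collective algorithms switch strategies based on message size.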
GPUDirect Technologies
NVIDIA's GPUDirect family eliminates CPU involvement in GPU data transfers:
GPUDirect P2P
Direct memory access between GPUs on the same PCIe fabric, bypassing CPU memory entirely.
Bandwidth: PCIe-limited (~32 GB/s for PCIe 4.0 ×16)
GPUDirect RDMA
Network adapters directly access GPU memory via RDMA, bypassing CPU and system memory.
Latency: ~1-2 μs end-to-end
GPUDirect Storage
NVMe drives directly read/write GPU memory, enabling fast checkpoint loading.
Benefit: 10× faster than CPU staging
GPUDirect Async
CUDA kernels can trigger DMA operations directly, enabling compute-communication overlap.
Benefit: Better overlap efficiency
5. The Bandwidth Gap
Understanding the massive bandwidth disparity across hierarchy levels is crucial for choosing the right parallelization strategy. Let's visualize this:
Network bandwidth (50 GB/s) is 67× slower than HBM bandwidth (3,350 GB/s) and 18× slower than NVLink (900 GB/s). This is why:
- Tensor parallelism works best within a node (needs high bandwidth)
- Pipeline parallelism works well across nodes (lower bandwidth is OK)
- Data parallelism's gradient sync is often the bottleneck
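To see why gradient sync looms so large, here is a back-of-envelope model of ring AllReduce time (bandwidth term only; the function name and the 7B-model example are ours, not from NCCL):

```python
def ring_allreduce_seconds(num_bytes: float, bandwidth_bytes_s: float, n_gpus: int) -> float:
    """Bandwidth term of ring AllReduce: each rank sends and receives
    2 * (n - 1) / n * S bytes over its slowest link."""
    return 2 * (n_gpus - 1) / n_gpus * num_bytes / bandwidth_bytes_s

# FP16 gradients of a 7B-parameter model = 14 GB; 64 GPUs over NDR (50 GB/s per GPU)
t = ring_allreduce_seconds(14e9, 50e9, 64)
print(f"Gradient AllReduce: ~{t:.2f} s per step")
```

That is roughly half a second of pure communication per optimizer step, which is why overlapping gradient sync with the backward pass (gradient bucketing) is standard practice.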
Implications for Training Strategy
| Parallelism Type | Communication Volume | Optimal Placement | Why |
|---|---|---|---|
| Tensor Parallel | Very High (per layer) | Within node only | AllReduce every forward/backward |
| Pipeline Parallel | Low (activations only) | Across nodes OK | Point-to-point, can overlap |
| Data Parallel | Medium (gradients) | Across nodes OK | AllReduce once per step |
| Expert Parallel (MoE) | High (all-to-all) | Within pod preferred | All-to-all every layer |
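The "Communication Volume" column can be made concrete with rough per-step byte counts. These formulas are simplified rules of thumb (e.g., ~4 activation AllReduces per transformer layer for Megatron-style tensor parallelism), not exact accounting:

```python
def dp_bytes(params: int, dtype_bytes: int = 2) -> int:
    """Data parallel: one gradient AllReduce over all parameters per step."""
    return params * dtype_bytes

def pp_bytes(batch: int, seq: int, hidden: int, dtype_bytes: int = 2) -> int:
    """Pipeline parallel: one activation tensor, point-to-point, per stage boundary."""
    return batch * seq * hidden * dtype_bytes

def tp_bytes(batch: int, seq: int, hidden: int, layers: int, dtype_bytes: int = 2) -> int:
    """Tensor parallel: ~4 activation AllReduces per layer (2 forward, 2 backward)."""
    return 4 * layers * batch * seq * hidden * dtype_bytes

# 7B-class config: 32 layers, hidden 4096, batch 1, seq 4096, FP16
print(dp_bytes(7_000_000_000))      # 14 GB of gradients per step
print(pp_bytes(1, 4096, 4096))      # ~34 MB per stage boundary
print(tp_bytes(1, 4096, 4096, 32))  # ~4.3 GB of AllReduce traffic per step
```

Note that tensor-parallel traffic is not just voluminous but latency-critical: it blocks every layer's forward and backward pass, which is why it belongs on NVLink even when its total byte count is comparable to one gradient AllReduce.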
6. Key Technologies Deep Dive
NCCL: The Collective Communication Library
NCCL (NVIDIA Collective Communications Library) is the de facto standard for multi-GPU communication. It automatically selects algorithms based on topology and message size:
- Ring AllReduce: Bandwidth-optimal for large messages
- Tree AllReduce: Latency-optimal for small messages
- NVLink-aware: Uses direct paths when available
- Multi-node: Combines intra-node and inter-node optimally
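The ring-vs-tree trade-off falls out of a simple alpha-beta cost model (our sketch, not NCCL's actual tuning logic): rings pay O(n) latency hops but are bandwidth-optimal, while trees pay only O(log n) latency at roughly twice the bandwidth cost.

```python
import math

def ring_cost(size_bytes: float, n: int, alpha: float, beta: float) -> float:
    """Ring AllReduce: 2(n-1) latency hops, bandwidth-optimal transfer term."""
    return 2 * (n - 1) * alpha + 2 * (n - 1) / n * size_bytes * beta

def tree_cost(size_bytes: float, n: int, alpha: float, beta: float) -> float:
    """Tree AllReduce: ~log2(n) depth, ~2x the bandwidth term of a ring."""
    return 2 * math.log2(n) * alpha + 2 * size_bytes * beta

ALPHA = 5e-6     # ~5 us per hop (assumed, for illustration)
BETA = 1 / 50e9  # seconds per byte at 50 GB/s

print(tree_cost(1e3, 64, ALPHA, BETA) < ring_cost(1e3, 64, ALPHA, BETA))    # 1 KB: tree wins
print(ring_cost(1e10, 64, ALPHA, BETA) < tree_cost(1e10, 64, ALPHA, BETA))  # 10 GB: ring wins
```

With these assumed constants, the crossover lands somewhere in the megabyte range, which matches the intuition that trees serve latency-bound small collectives and rings serve bandwidth-bound large ones.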
SHARP: In-Network Computing
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) performs reduction operations inside the network switches rather than at the endpoints:
Figure 4: SHARP performs reductions inside the switch, halving round-trip latency for collective operations.
7. Network Topologies
The way nodes are interconnected dramatically affects performance. Different topologies trade off cost, latency, bandwidth, and fault tolerance.
Fat-Tree (Clos)
The most common topology for datacenters. Provides full bisection bandwidth and multiple paths.
Dragonfly
Hierarchical topology with all-to-all connections within groups and global links between groups.
Hypercube
N-dimensional cube where each node connects to N neighbors. Used in classic supercomputers.
Torus (3D/5D/6D)
Grid with wrap-around edges. Used in IBM Blue Gene (5D torus in Blue Gene/Q) and Fujitsu Fugaku's Tofu-D interconnect (6D mesh/torus); both machines were once ranked #1 in the world.
8. Real-World Cluster Examples
NVIDIA's SuperPOD Architecture
NVIDIA's DGX SuperPOD is a reference architecture for large-scale AI infrastructure:
| Component | DGX H100 SuperPOD | Specs |
|---|---|---|
| Nodes | 32 DGX H100 systems | 256 H100 GPUs total |
| GPU Memory | 80GB HBM3 × 256 | 20.5 TB aggregate |
| Compute | 989 TFLOPS × 256 | ~253 PFLOPS FP16 |
| Intra-node | NVLink 4.0 + NVSwitch | 900 GB/s per GPU |
| Inter-node | 8× NDR400 InfiniBand | 400 GB/s per node |
| Network | Quantum-2 IB fabric | Full bisection bandwidth |
| Storage | AI Enterprise Storage | ~1 PB capacity |
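The aggregate figures in the table follow from straightforward multiplication; a quick check (our arithmetic, not an NVIDIA sizing tool):

```python
# DGX H100 SuperPOD reference configuration
nodes, gpus_per_node = 32, 8
gpus = nodes * gpus_per_node     # 256 H100 GPUs total

hbm_tb = gpus * 80 / 1000        # 80 GB HBM3 each -> 20.48 TB aggregate
pflops_fp16 = gpus * 989 / 1000  # ~989 TFLOPS each -> ~253 PFLOPS aggregate

print(gpus, round(hbm_tb, 1), round(pflops_fp16))
```

Aggregate PFLOPS is, of course, a peak number; real training jobs typically sustain a fraction of it (model FLOPS utilization well below 100%).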
Meta's Research SuperCluster (RSC)
Meta's RSC was one of the largest AI training clusters when announced:
- Phase 1: 760 DGX A100 systems (6,080 A100 GPUs)
- Phase 2: 2,000+ DGX systems (16,000 A100 GPUs)
- Network: 200 Gb/s InfiniBand with SHARP
- Storage: 175 PB of Pure Storage FlashBlade
Microsoft Azure AI Infrastructure
Azure's AI supercomputer for OpenAI reportedly includes:
- ~25,000 A100 GPUs (for GPT-4 training)
- InfiniBand NDR with Microsoft's custom topology
- ND H100 v5 VMs with 8× H100 + 8× NDR400 NICs per VM
Cloud providers like AWS, Azure, and GCP offer pre-built GPU clusters. While convenient, be aware of placement group constraints—requesting GPUs across availability zones can significantly increase communication latency.
9. Summary & Design Principles
Designing efficient AI clusters requires understanding the interplay between compute, memory, and communication. Here are the key takeaways:
Key Design Principles
- Minimize cross-node communication: Place high-communication workloads (tensor parallel) within nodes, use data/pipeline parallel across nodes
- Match parallelism to topology: Tensor parallel within NVLink domain, pipeline parallel across slower links
- Overlap compute and communication: Use asynchronous operations, gradient bucketing, and pipeline scheduling
- Right-size your cluster: More GPUs isn't always better if communication becomes the bottleneck
- Consider hierarchical approaches: FSDP's hybrid sharding, hierarchical AllReduce, etc.
Bandwidth Hierarchy Reminder
| Level | Technology | Bandwidth | Relative Speed |
|---|---|---|---|
| GPU Memory | HBM3 | 3,350 GB/s | 1.0× (baseline) |
| Intra-node GPU | NVLink 4.0 | 900 GB/s | 0.27× |
| Inter-node | InfiniBand NDR | 50 GB/s | 0.015× |
| Ethernet | 100GbE | 12.5 GB/s | 0.004× |
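The "Relative Speed" column is just each bandwidth divided by the HBM3 baseline; recomputing it (a trivial check, dictionary names are ours):

```python
bandwidth_gb_s = {
    "HBM3": 3350.0,
    "NVLink 4.0": 900.0,
    "InfiniBand NDR": 50.0,
    "100GbE": 12.5,
}
baseline = bandwidth_gb_s["HBM3"]
relative = {name: bw / baseline for name, bw in bandwidth_gb_s.items()}
for name, r in relative.items():
    print(f"{name}: {r:.3f}x")
```

NVLink comes out at 0.269, which the table rounds to 0.27; the two-orders-of-magnitude drop from HBM to the network is the number to internalize.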
Looking Forward
The future of AI clusters includes exciting developments:
- NVLink Switch (NVL72): 72 GPUs in a single NVLink domain
- UCIe / CXL: New interconnects for memory pooling
- Photonic interconnects: Lower latency, higher bandwidth
- In-network compute: More SHARP-like acceleration
Understanding cluster architecture isn't just academic—it directly impacts your training efficiency and costs. A well-designed training job that respects the bandwidth hierarchy can be 2-3× faster than a naive approach. Invest time in understanding your cluster's topology!