1. Introduction: The Scale of Modern AI

Training GPT-4 required an estimated 25,000 NVIDIA A100 GPUs running for months. Meta's Llama 3 405B was trained on 16,384 H100 GPUs. These aren't just numbers; they represent some of the most sophisticated computing infrastructure ever built.

Understanding how these massive GPU clusters are architected is essential for anyone working on large-scale AI. The difference between an efficient and inefficient cluster design can mean weeks of training time and millions of dollars in compute costs.

In this guide, we'll explore the cluster hierarchy from single GPUs to full datacenters, intra-node and inter-node communication, the bandwidth gap and its implications for parallelism, key technologies like NCCL and SHARP, network topologies, and real-world cluster designs.

Why This Matters

The communication between GPUs often becomes the bottleneck in distributed training. Understanding cluster architecture helps you choose the right parallelization strategy (data parallel, tensor parallel, pipeline parallel) and optimize your training throughput.

2. Cluster Hierarchy: From GPU to Datacenter

Modern AI clusters are organized in a hierarchical structure, each level with distinct characteristics and communication patterns. Let's examine each level from bottom to top.

GPU Cluster Hierarchy
[Diagram: cluster (1000s of GPUs) → pods (256-512 GPUs) → racks (32-64 GPUs) → nodes (8 GPUs typical) → individual GPUs 0-7]

Figure 1: Hierarchical structure of a GPU cluster, from individual GPUs to the full datacenter cluster.

Level 1: GPU (Graphics Processing Unit)

The fundamental compute unit. Modern AI GPUs like the NVIDIA H100 have:

80 GB of HBM3 memory with ~3,350 GB/s of bandwidth
~989 TFLOPS of dense FP16 compute
18 NVLink 4.0 links providing 900 GB/s of aggregate GPU-to-GPU bandwidth

Level 2: Node (Server)

A single physical server containing multiple GPUs. The standard configuration is 8 GPUs per node (e.g., NVIDIA DGX H100). Nodes include:

Host CPUs (e.g., 2× Intel Xeon) for orchestration and I/O
System memory (e.g., 2 TB DDR5) for CPU-side data staging
Network adapters (e.g., 8× ConnectX-7 400Gb InfiniBand NICs) for inter-node traffic
Local NVMe storage (e.g., 30 TB) for checkpoints and dataset caching

Level 3: Rack

A physical cabinet containing multiple nodes, typically 4-8 servers per rack (32-64 GPUs). Racks have shared power distribution, cooling, and top-of-rack switches that aggregate each node's network links.

Level 4: Pod

A group of interconnected racks optimized for low-latency communication, typically 256-512 GPUs. Pods represent the "sweet spot" for many training jobs: within a pod the network is typically non-blocking, so jobs see full bisection bandwidth and uniform latency, and many production training runs are sized to fit inside a single pod.

Level 5: Cluster / Datacenter

The full installation, potentially spanning thousands to tens of thousands of GPUs. Clusters connect multiple pods with a core switching layer (often oversubscribed between pods), shared storage systems, and the scheduling infrastructure that places jobs onto the hardware.

3. Intra-Node Communication

Within a single node, GPUs communicate directly through high-bandwidth interconnects. This is the fastest communication path in the entire cluster hierarchy.

NVIDIA DGX H100 Internal Architecture
[Diagram: 8× H100 GPUs (80 GB HBM3 each, 640 GB total) connected through a fabric of 4 NVSwitch chips providing full all-to-all connectivity at 900 GB/s bidirectional per GPU via NVLink 4.0 (18 links per GPU). Supporting infrastructure: 2× Intel Xeon host CPUs for orchestration and I/O, 2 TB DDR5 system memory for CPU data staging, 8× ConnectX-7 400Gb InfiniBand NICs for inter-node networking, and 30 TB of local NVMe SSD storage for checkpoints and cache.]

Figure 2: DGX H100 architecture showing 8 GPUs connected via NVSwitch fabric. Each GPU has 900 GB/s all-to-all bandwidth to every other GPU.

NVLink: The GPU-to-GPU Superhighway

NVLink is NVIDIA's proprietary high-speed interconnect for direct GPU-to-GPU communication. Each generation has dramatically increased bandwidth:

NVLink Generation | Per-Link Bandwidth | Links per GPU | Total Bandwidth
NVLink 1.0 (P100) | 40 GB/s | 4 | 160 GB/s
NVLink 2.0 (V100) | 50 GB/s | 6 | 300 GB/s
NVLink 3.0 (A100) | 50 GB/s | 12 | 600 GB/s
NVLink 4.0 (H100) | 50 GB/s | 18 | 900 GB/s
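The "Total Bandwidth" column is simply the per-link rate multiplied by the link count; a quick sketch to verify the table's arithmetic:

```python
def nvlink_total_bandwidth(per_link_gb_s, links):
    """Total per-GPU NVLink bandwidth: per-link rate times link count."""
    return per_link_gb_s * links

# Values from the NVLink generation table above
generations = {
    "NVLink 1.0 (P100)": (40, 4),
    "NVLink 2.0 (V100)": (50, 6),
    "NVLink 3.0 (A100)": (50, 12),
    "NVLink 4.0 (H100)": (50, 18),
}
for name, (per_link, links) in generations.items():
    print(f"{name}: {nvlink_total_bandwidth(per_link, links)} GB/s")
```

Note that since NVLink 2.0 the per-link rate has been flat at 50 GB/s; each generation's gain comes from adding links per GPU.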

NVSwitch: Full-Bandwidth All-to-All

Without NVSwitch, GPUs would need direct NVLink connections to every other GPU, which is impractical for 8+ GPUs. NVSwitch instead provides a non-blocking switch fabric:

NVSwitch 3.0 Specs (H100)

Switch bandwidth: 3.2 TB/s bidirectional
Ports: 64 NVLink 4.0 ports per switch
Latency: ~100ns hop latency
SHARP support: In-network reduction for collectives

4. Inter-Node Communication

When training scales beyond a single node, communication must traverse the network fabric. This is where bandwidth drops dramatically and optimization becomes critical.

Inter-Node Communication via InfiniBand
[Diagram: two nodes, each with 900 GB/s of internal NVLink bandwidth, connected through 400 Gb/s NICs (50 GB/s each) to an InfiniBand switch. Inter-node bandwidth is ~18× lower than intra-node, making it the primary bottleneck in distributed training.]

Figure 3: Inter-node communication via InfiniBand network. Note the dramatic bandwidth difference vs. NVLink.

InfiniBand: The Network Backbone

InfiniBand is the dominant networking technology for HPC and AI clusters. Unlike Ethernet, it's designed from the ground up for low latency and high bandwidth:

InfiniBand Generation | Per-Lane Rate | 4× Port Rate | Typical Latency
FDR (2011) | 14 Gb/s | 56 Gb/s | ~1.3 μs
EDR (2014) | 25 Gb/s | 100 Gb/s | ~0.9 μs
HDR (2018) | 50 Gb/s | 200 Gb/s | ~0.6 μs
NDR (2022) | 100 Gb/s | 400 Gb/s | ~0.5 μs
XDR (2025) | 200 Gb/s | 800 Gb/s | ~0.4 μs
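InfiniBand port speeds are usually quoted for 4-lane (4×) links, and marketing rates are in gigabits; a small sketch of the conversion (encoding overhead ignored):

```python
def port_rate_gbit_s(lane_gbit_s, lanes=4):
    """Aggregate data rate of a multi-lane InfiniBand port, in Gb/s."""
    return lane_gbit_s * lanes

def gbit_to_gbyte(gbit_s):
    """Convert Gb/s to GB/s (8 bits per byte)."""
    return gbit_s / 8

# NDR: 100 Gb/s per lane x 4 lanes = 400 Gb/s = 50 GB/s
print(gbit_to_gbyte(port_rate_gbit_s(100)))  # → 50.0
```

This is why a "400G" NDR port shows up as 50 GB/s in the bandwidth comparisons later in this article.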

GPUDirect Technologies

NVIDIA's GPUDirect family eliminates CPU involvement in GPU data transfers:

GPUDirect P2P

Direct memory access between GPUs on the same PCIe fabric, bypassing CPU memory entirely.

Use case: Multi-GPU within a node
Bandwidth: PCIe-limited (~32 GB/s on PCIe 4.0 x16, ~64 GB/s on PCIe 5.0)

GPUDirect RDMA

Network adapters directly access GPU memory via RDMA, bypassing CPU and system memory.

Use case: Multi-node training
Latency: ~1-2 μs end-to-end

GPUDirect Storage

NVMe drives directly read/write GPU memory, enabling fast checkpoint loading.

Use case: Checkpointing, data loading
Benefit: 10× faster than CPU staging

GPUDirect Async

CUDA kernels can trigger DMA operations directly, enabling compute-communication overlap.

Use case: Pipeline parallelism
Benefit: Better overlap efficiency
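A common way to reason about these transfer paths is a simple latency-plus-bandwidth (alpha-beta) cost model. The sketch below uses illustrative numbers from this section (~1.5 μs GPUDirect RDMA end-to-end latency, 50 GB/s of NDR bandwidth); real transfers also pay protocol and congestion overheads:

```python
def transfer_time_us(size_bytes, latency_us=1.5, bandwidth_gb_s=50.0):
    """Alpha-beta model: fixed latency plus size divided by bandwidth.
    50 GB/s equals 50,000 bytes per microsecond."""
    return latency_us + size_bytes / (bandwidth_gb_s * 1000.0)

print(transfer_time_us(4_096))       # ~1.58 us: a 4 KB message is latency-bound
print(transfer_time_us(1_000_000))   # 21.5 us: 1 MB is already bandwidth-bound
print(transfer_time_us(64_000_000))  # ~1281.5 us: latency is negligible at 64 MB
```

The crossover point, where latency and bandwidth terms are equal, is around 75 KB with these numbers, which is why small-message collectives need latency-optimized algorithms.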

5. The Bandwidth Gap

Understanding the massive bandwidth disparity across hierarchy levels is crucial for choosing the right parallelization strategy. Let's visualize this:

GPU HBM memory: 3,350 GB/s
NVLink 4.0 (intra-node): 900 GB/s
PCIe 5.0 x16: 64 GB/s
InfiniBand NDR (400G): 50 GB/s
100GbE (typical Ethernet): 12.5 GB/s
The 67× Gap

Network bandwidth (50 GB/s) is 67× slower than HBM bandwidth (3,350 GB/s) and 18× slower than NVLink (900 GB/s). This is why:

Tensor parallelism works best within a node (needs high bandwidth)
Pipeline parallelism works well across nodes (lower bandwidth ok)
Data parallelism gradient sync is often the bottleneck
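The ratios above can be computed directly from the bandwidth figures in this section:

```python
# Bandwidths from the hierarchy above, in GB/s
bandwidth_gb_s = {
    "HBM3 (GPU memory)": 3350,
    "NVLink 4.0": 900,
    "PCIe 5.0 x16": 64,
    "InfiniBand NDR": 50,
    "100GbE": 12.5,
}

network = bandwidth_gb_s["InfiniBand NDR"]
for name, bw in bandwidth_gb_s.items():
    # How many times faster each level is than the inter-node network
    print(f"{name}: {bw} GB/s ({bw / network:.0f}x network)")
```

HBM comes out 67× faster than the network and NVLink 18× faster, which is exactly the gap the parallelization strategies below are designed around.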

Implications for Training Strategy

Parallelism Type | Communication Volume | Optimal Placement | Why
Tensor Parallel | Very high (per layer) | Within node only | AllReduce every forward/backward pass
Pipeline Parallel | Low (activations only) | Across nodes OK | Point-to-point, can overlap
Data Parallel | Medium (gradients) | Across nodes OK | AllReduce once per step
Expert Parallel (MoE) | High (all-to-all) | Within pod preferred | All-to-all every layer
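To make the table concrete, here is a rough, bandwidth-only estimate of the data-parallel gradient AllReduce cost, using the standard ring-AllReduce volume of 2(N-1)/N times the gradient size. The model size and bandwidths are illustrative, and latency and compute overlap are ignored:

```python
def ring_allreduce_seconds(param_count, n_gpus, bandwidth_gb_s, bytes_per_param=2):
    """Bandwidth-only estimate of ring AllReduce time. Each GPU sends and
    receives 2*(N-1)/N times the gradient buffer; latency terms are ignored."""
    volume_bytes = 2 * (n_gpus - 1) / n_gpus * param_count * bytes_per_param
    return volume_bytes / (bandwidth_gb_s * 1e9)

# 7B parameters of FP16 gradients, synced across 64 GPUs over 50 GB/s InfiniBand
print(f"{ring_allreduce_seconds(7e9, 64, 50):.3f} s per step")   # ~0.551 s
# The same sync among 8 GPUs inside a node over 900 GB/s NVLink
print(f"{ring_allreduce_seconds(7e9, 8, 900):.4f} s per step")   # ~0.0272 s
```

The ~20× difference between the two estimates shows why gradient synchronization over the network is so often the bottleneck, and why overlap with the backward pass matters.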

6. Key Technologies Deep Dive

NCCL: The Collective Communication Library

NCCL (NVIDIA Collective Communications Library) is the de facto standard for multi-GPU communication. It automatically selects algorithms based on the detected topology: ring algorithms maximize bandwidth for large messages, tree algorithms minimize latency for small ones, and NCCL routes traffic over NVLink, PCIe, or the network as appropriate.
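NCCL's behavior can also be steered through environment variables. The variable names below are real NCCL settings, but the values are purely illustrative; they must be set before NCCL initializes (e.g., before calling `torch.distributed.init_process_group` with the `nccl` backend):

```python
import os

# NCCL reads these at initialization, so set them before the first collective.
os.environ["NCCL_DEBUG"] = "INFO"     # log topology detection and algorithm choice
os.environ["NCCL_ALGO"] = "Ring"      # force ring (or "Tree") instead of auto-select
os.environ["NCCL_IB_DISABLE"] = "0"   # keep the InfiniBand transport enabled
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # illustrative: pin the bootstrap interface
```

In practice, leaving `NCCL_ALGO` unset and letting NCCL auto-select is usually best; forcing an algorithm is mainly useful for debugging performance anomalies with `NCCL_DEBUG=INFO` logs.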

SHARP: In-Network Computing

SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) performs reduction operations inside the network switches rather than at the endpoints:

Traditional AllReduce vs SHARP
[Diagram: in a traditional AllReduce, the switch only forwards data, so reductions require multiple network hops and higher latency; with SHARP, the switch reduces data in place in a single hop, giving roughly 2× faster AllReduce.]

Figure 4: SHARP performs reductions inside the switch, halving round-trip latency for collective operations.

7. Network Topologies

The way nodes are interconnected dramatically affects performance. Different topologies trade off cost, latency, bandwidth, and fault tolerance.

[Diagram: fat-tree with core, spine, and leaf switch tiers]

Fat-Tree (Clos)

The most common topology for datacenters. Provides full bisection bandwidth and multiple paths.

✓ Pros Non-blocking, fault tolerant, well understood
✗ Cons High switch count, expensive at scale

Dragonfly

Hierarchical topology with all-to-all connections within groups and global links between groups.

✓ Pros Low diameter, fewer switches, good for large scale
✗ Cons Complex routing, congestion on global links

Hypercube

N-dimensional cube where each node connects to N neighbors. Used in classic supercomputers.

✓ Pros Low diameter (log N), symmetric
✗ Cons Port count grows with scale, less practical today

Torus (3D/5D)

Grid with wrap-around edges. Used in IBM Blue Gene (3D/5D) and Fujitsu's Fugaku (6D Tofu interconnect), formerly the world's #1 supercomputer.

✓ Pros Simple, scalable, good for stencil patterns
✗ Cons Higher diameter, uneven latencies
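The diameter trade-offs behind these topologies are easy to check numerically. A small sketch, assuming a power-of-two hypercube and a torus with wrap-around links in every dimension:

```python
import math

def hypercube_diameter(n_nodes):
    """Diameter of a hypercube: log2(N) hops (N must be a power of two)."""
    return int(math.log2(n_nodes))

def torus_diameter(dims):
    """Diameter of a torus: half of each dimension, summed, thanks to
    the wrap-around links."""
    return sum(d // 2 for d in dims)

print(hypercube_diameter(1024))     # 10 hops for 1024 nodes
print(torus_diameter([16, 16, 4]))  # 18 hops for the same 1024 nodes in a 3D torus
```

The hypercube's low diameter comes at the cost of 10 ports per node at this scale, while the torus needs only 6, which is the port-count trade-off noted in the pros and cons above.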

8. Real-World Cluster Examples

NVIDIA's SuperPOD Architecture

NVIDIA's DGX SuperPOD is a reference architecture for large-scale AI infrastructure:

Component | DGX H100 SuperPOD Configuration | Aggregate Specs
Nodes | 32 DGX H100 systems | 256 H100 GPUs total
GPU Memory | 80GB HBM3 × 256 | 20.5 TB aggregate
Compute | 989 TFLOPS × 256 | ~253 PFLOPS FP16
Intra-node | NVLink 4.0 + NVSwitch | 900 GB/s per GPU
Inter-node | 8× NDR400 InfiniBand | 400 GB/s per node
Network | Quantum-2 IB fabric | Full bisection bandwidth
Storage | AI Enterprise Storage | ~1 PB capacity
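The aggregate numbers in the table follow directly from the per-unit specs; a quick sketch to verify the arithmetic:

```python
nodes = 32
gpus_per_node = 8
gpus = nodes * gpus_per_node          # 256 H100 GPUs

hbm_tb = gpus * 80 / 1000             # 80 GB HBM3 per GPU -> ~20.5 TB aggregate
fp16_pflops = gpus * 989 / 1000       # 989 TFLOPS per GPU -> ~253 PFLOPS FP16
node_net_gb_s = 8 * 400 / 8           # 8x 400 Gb/s NICs -> 400 GB/s per node

print(gpus, round(hbm_tb, 1), round(fp16_pflops), node_net_gb_s)
```

Note the symmetry in the network design: each node's 400 GB/s of InfiniBand roughly matches the 8 GPUs it serves at 50 GB/s apiece, so no single NIC becomes a local bottleneck.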

Meta's Research SuperCluster (RSC)

Meta's RSC was one of the largest AI training clusters when announced, built from NVIDIA DGX A100 systems and later scaled to roughly 16,000 A100 GPUs.

Microsoft Azure AI Infrastructure

Azure's AI supercomputer built for OpenAI reportedly includes tens of thousands of GPUs connected with InfiniBand.

Cloud vs On-Premise

Cloud providers like AWS, Azure, and GCP offer pre-built GPU clusters. While convenient, be aware of placement group constraints: requesting GPUs across availability zones can significantly increase communication latency.

9. Summary & Design Principles

Designing efficient AI clusters requires understanding the interplay between compute, memory, and communication. Here are the key takeaways:

Key Design Principles

Keep bandwidth-hungry parallelism (tensor parallel, MoE all-to-all) inside the NVLink domain
Place pipeline and data parallelism across nodes, where lower bandwidth is acceptable
Size jobs to fit within a pod when possible to stay on the non-blocking part of the fabric
Overlap communication with computation, and lean on topology-aware libraries like NCCL
Use in-network reduction (SHARP) to accelerate large-scale collectives

Bandwidth Hierarchy Reminder

Level | Technology | Bandwidth | Relative Speed
GPU memory | HBM3 | 3,350 GB/s | 1.0× (baseline)
Intra-node | NVLink 4.0 | 900 GB/s | 0.27×
Inter-node | InfiniBand NDR | 50 GB/s | 0.015×
Ethernet | 100GbE | 12.5 GB/s | 0.004×

Looking Forward

The future of AI clusters includes exciting developments: faster fabrics such as XDR InfiniBand (800 Gb/s ports), larger NVLink domains that stretch the high-bandwidth tier beyond a single server, and optical interconnects that promise to narrow the gap between intra-node and inter-node bandwidth.

Final Thoughts

Understanding cluster architecture isn't just academic: it directly impacts your training efficiency and costs. A well-designed training job that respects the bandwidth hierarchy can be 2-3× faster than a naive approach. Invest time in understanding your cluster's topology!