
Networking for AI: Building High-Performance Fabrics for GPU Clusters

An overview of the networking challenges and design considerations when building fabrics for large-scale AI/ML training clusters, from RDMA to rail-optimized topologies.

February 20, 2026 · 3 min read · ai, gpu-clusters, rdma, data-center

The Network is the Computer (Again)

As AI workloads scale to thousands of GPUs, the network fabric becomes the critical bottleneck. A training job is only as fast as its slowest communication link. In modern AI clusters, network design decisions directly impact model training time, cost efficiency, and operational reliability.

Key Design Considerations

1. RDMA and RoCEv2

Remote Direct Memory Access (RDMA) over Converged Ethernet v2 (RoCEv2) has become the standard for GPU-to-GPU communication in Ethernet-based AI clusters. Unlike traditional TCP/IP, RDMA bypasses the host CPU and kernel on the data path, enabling:

  • Ultra-low latency: Single-digit microsecond message delivery
  • High throughput: Line-rate 400G/800G transfers
  • CPU offload: Zero-copy data movement directly between GPU memories

The network must support Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) for lossless Ethernet, which brings its own operational challenges: PFC storms, head-of-line blocking, and the need for careful buffer management.
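
To make the buffer-management point concrete, here's a rough sketch of the classic PFC headroom estimate: once a switch decides to send a PAUSE frame, bytes keep arriving for roughly one cable round trip plus the frames already being serialized at each end. Real headroom formulas also include MAC/PHY and NIC response delays; the cable length, port speed, and MTU below are illustrative assumptions, not recommendations.

```python
# Rough PFC headroom estimate for one lossless priority on one port.
# Assumptions (illustrative only): 100 m run, 400 GbE port, 4096-byte
# RoCE MTU, ~5 ns/m propagation delay. Real formulas add MAC/PHY and
# NIC response delays on top of this.

LINK_SPEED_BPS = 400e9          # 400 GbE line rate, bits per second
CABLE_LENGTH_M = 100            # leaf-to-spine cable length (assumed)
PROPAGATION_NS_PER_M = 5        # ~5 ns per meter in fiber/copper
MTU_BYTES = 4096                # RoCE MTU (assumed)
PFC_FRAME_BYTES = 64            # size of the PAUSE frame itself

def pfc_headroom_bytes() -> int:
    """Bytes that can still arrive after we decide to send PAUSE."""
    # Round-trip propagation delay on the cable, in seconds.
    rtt_s = 2 * CABLE_LENGTH_M * PROPAGATION_NS_PER_M * 1e-9
    # Bytes in flight on the wire during that round trip.
    in_flight = LINK_SPEED_BPS / 8 * rtt_s
    # Worst case: one MTU mid-serialization at each end when PAUSE fires,
    # plus the PAUSE frame itself ahead of the data.
    return int(in_flight + 2 * MTU_BYTES + PFC_FRAME_BYTES)

if __name__ == "__main__":
    print(f"Per-port, per-priority headroom: ~{pfc_headroom_bytes() / 1024:.1f} KiB")
```

The exact number matters less than the scaling: headroom grows with cable length and line rate, which is why long inter-rack runs at 400G/800G put real pressure on switch buffers.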

2. Rail-Optimized Topologies

Modern AI clusters often use "rail-optimized" topologies, in which each GPU position in a server connects to its own dedicated leaf switch (a "rail"), so GPUs in the same position across servers share a rail and the fabric is designed around the collective communication patterns of distributed training:

  • AllReduce: The dominant communication pattern for data-parallel training (a rough cost sketch follows this list)
  • All-to-All: Common in model-parallel and expert-parallel (MoE) workloads
  • Ring and tree topologies: Mapped onto the physical network for efficient collectives
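
To see why the fabric has to be sized around these collectives, here's a back-of-the-envelope cost sketch using the standard bandwidth-only ring-AllReduce model, in which every rank sends and receives roughly 2*(N-1)/N of the gradient size. The gradient size, GPU count, and per-GPU bandwidth are assumptions chosen purely for illustration.

```python
# Back-of-the-envelope ring-AllReduce estimate for data-parallel training.
# All inputs are illustrative assumptions, not measurements.

def ring_allreduce_seconds(gradient_bytes: float,
                           num_ranks: int,
                           per_rank_bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only ring AllReduce time (ignores latency and overlap).

    Each rank sends and receives ~2 * (N - 1) / N of the payload, and the
    slowest link in the ring sets the completion time.
    """
    traffic_per_rank = 2 * (num_ranks - 1) / num_ranks * gradient_bytes
    return traffic_per_rank / per_rank_bandwidth_bytes_per_s

if __name__ == "__main__":
    grad_bytes = 14e9          # e.g. ~7B parameters in bf16 (assumed)
    ranks = 1024               # GPUs participating in the AllReduce (assumed)
    per_gpu_bw = 50e9          # ~400 Gb/s per GPU NIC, in bytes/s (assumed)
    t = ring_allreduce_seconds(grad_bytes, ranks, per_gpu_bw)
    print(f"One full-gradient AllReduce: ~{t * 1000:.0f} ms")
```

With these assumed numbers, a non-overlapped full-gradient AllReduce costs roughly half a second per step, which is exactly why communication/computation overlap, collective algorithm choice, and rail-aware placement matter.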

3. Scale-Out vs. Scale-Up

The industry is moving toward scale-up domains (NVLink, NVSwitch) within a server or rack, combined with scale-out Ethernet fabrics across racks. The boundary between scale-up and scale-out defines the network design:

Domain      | Interconnect     | Bandwidth          | Latency
Intra-node  | NVLink/NVSwitch  | 900 GB/s+          | ~1 μs
Intra-rack  | Ethernet (RoCE)  | 400-800G per port  | ~2-5 μs
Inter-rack  | Ethernet fabric  | Oversubscribed     | ~5-10 μs
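
The inter-rack row is where oversubscription bites, and it's easy to quantify. The sketch below compares the bandwidth a GPU sees inside its rack against what's left per GPU for cross-rack traffic on one leaf; the port counts and speeds are assumed values for illustration.

```python
# Effective cross-rack bandwidth per GPU under leaf oversubscription.
# Port counts, speeds, and GPU counts are illustrative assumptions.

def leaf_bandwidth_stats(gpus_per_rack: int,
                         downlink_gbps: float,
                         uplinks: int,
                         uplink_gbps: float) -> dict:
    """Compare intra-rack and cross-rack bandwidth per GPU for one leaf."""
    oversub = (gpus_per_rack * downlink_gbps) / (uplinks * uplink_gbps)
    return {
        "oversubscription": oversub,
        "intra_rack_per_gpu_gbps": downlink_gbps,
        "cross_rack_per_gpu_gbps": (uplinks * uplink_gbps) / gpus_per_rack,
    }

if __name__ == "__main__":
    # Assumed: 32 GPUs behind one leaf at 400G each, 16 x 400G uplinks (2:1).
    stats = leaf_bandwidth_stats(gpus_per_rack=32, downlink_gbps=400,
                                 uplinks=16, uplink_gbps=400)
    print(f"Oversubscription {stats['oversubscription']:.1f}:1, "
          f"cross-rack per GPU: {stats['cross_rack_per_gpu_gbps']:.0f} Gb/s "
          f"vs {stats['intra_rack_per_gpu_gbps']:.0f} Gb/s in-rack")
```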

Operational Challenges

Running AI cluster networks requires a different operational mindset:

"In traditional DC networking, a single link failure is a non-event. In an AI training cluster, it can halt a job consuming millions of dollars of GPU-hours."

Key operational concerns include:

  1. Fabric-wide consistency: Every path must deliver the same performance
  2. Rapid fault detection: Sub-second failure detection and recovery (see the watcher sketch after this list)
  3. Telemetry at scale: Real-time monitoring of thousands of flows
  4. Job-aware networking: Understanding which flows belong to which training job
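
To give a flavor of points 2 and 3, here's a minimal link-health watcher that polls cumulative error counters and flags ports that are still accumulating errors between samples. The counter source is abstracted behind a callable, and the port names and fake reader are purely illustrative; a production fabric would use streaming telemetry (gNMI or vendor APIs) and feed alerts to the job scheduler rather than printing them.

```python
# Minimal link-health watcher: flag ports whose error counters keep climbing.
# The counter source is injected as a callable so it can be wired to real
# telemetry (streaming gNMI, SNMP, switch APIs); the fake reader below is
# purely for illustration.

import random
import time
from typing import Callable, Dict

CounterReader = Callable[[], Dict[str, int]]

def watch_links(read_counters: CounterReader,
                interval_s: float = 1.0,
                samples: int = 5) -> None:
    """Poll cumulative error counters and report ports with new errors."""
    previous = read_counters()
    for _ in range(samples):
        time.sleep(interval_s)
        current = read_counters()
        for port, errors in current.items():
            delta = errors - previous.get(port, 0)
            if delta > 0:
                # In a real fabric this should notify the job scheduler so the
                # training job can be checkpointed or rerouted, not just logged.
                print(f"[ALERT] {port}: +{delta} errors in {interval_s}s")
        previous = current

if __name__ == "__main__":
    # Fake reader standing in for real telemetry: one flaky port (assumed names).
    state = {"leaf1:eth1/1": 0, "leaf1:eth1/2": 0}

    def fake_reader() -> Dict[str, int]:
        state["leaf1:eth1/2"] += random.choice([0, 0, 3])  # occasional errors
        return dict(state)

    watch_links(fake_reader, interval_s=0.2, samples=5)
```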

What's Next

In upcoming posts, I'll dive deeper into specific topics:

  • PFC tuning and lossless Ethernet best practices
  • Comparing InfiniBand and RoCEv2 for AI workloads
  • Network automation for AI cluster Day-2 operations
  • The role of DPUs in AI networking

Stay tuned, and feel free to reach out if you have questions or want to discuss AI networking challenges.