Research Highlights
Architecture and Hardware

Technical Perspective: Mirror, Mirror on the Wall, What Is the Best Topology of Them All?

"HammingMesh: A Network Topology for Large-Scale Deep Learning," by Torsten Hoefler et al., proposes a novel network topology that provides high bandwidth at low cost for deep learning training jobs.


Artificial intelligence (AI) is one of the most important emerging technologies of the 21st century, and designing suitable infrastructure for large-scale AI systems is critical. Major companies such as Microsoft, Google, Meta, and even Tesla are touting large-scale “AI supercomputers” as an essential tool for increasingly powerful AI systems, but because AI systems have a more specialized workload than traditional supercomputers, designing and implementing their architecture is a complex process. AI systems are specialized for AI and machine learning (ML) workloads, leveraging parallelism and hardware accelerators to process and analyze large datasets for tasks such as prediction, classification, and natural-language understanding. Traditional supercomputers, on the other hand, are general-purpose machines built for a broader range of scientific and computational tasks. The authors of the accompanying paper leverage the unique demands of specialized AI workloads to craft a network structure tailored for large-scale deep learning, a pivotal facet of AI.

The authors characterize AI workloads along three dimensions of parallelism: data parallelism, pipeline parallelism, and operator parallelism. Data parallelism replicates the model and splits each training batch across workers, which then synchronize their gradients. Pipeline parallelism partitions a deep neural network into stages of consecutive layers that process microbatches in sequence. Operator parallelism splits the individual mathematical operations that make up a layer, such as large matrix multiplications, across devices. The paper argues that, although the three dimensions differ, all of them can be implemented with nearest-neighbor communication, whereas today’s high-performance computing (HPC) networks often overprovision global bandwidth and underprovision local bandwidth for AI workloads.
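To make the first of these dimensions concrete, the following sketch (a toy NumPy example of my own, not code from the paper) simulates four data-parallel workers: each computes a gradient on its own shard of a batch, and the gradients are then averaged, the all-reduce step whose traffic pattern maps naturally onto ring-style nearest-neighbor communication.

```python
import numpy as np

def grad(w, X, y):
    # Gradient of the mean-squared error 0.5*||Xw - y||^2 / n w.r.t. w.
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 8)), rng.normal(size=64)
w = np.zeros(8)

# Each simulated worker holds a replica of w and a shard of the batch.
num_workers = 4
shards = zip(np.array_split(X, num_workers), np.array_split(y, num_workers))
local_grads = [grad(w, Xs, ys) for Xs, ys in shards]

# All-reduce: average the per-worker gradients, then take one SGD step.
w -= 0.1 * np.mean(local_grads, axis=0)
```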

This insight motivates the authors to revisit toroidal networks, which HPC traditionally used but later abandoned in favor of more flexible, low-diameter topologies built on switching technology. A torus connects nodes in a grid whose rows and columns wrap around into rings, and the concept extends to any number of dimensions; such networks offer efficient connectivity between neighboring processing nodes. Google’s early Tensor Processing Units (TPUs) also employed two- and three-dimensional torus networks for interconnection, facilitating efficient communication between TPUs in datacenters, which is crucial for ML workloads. While torus networks offer advantages such as low latency and determinism, they can suffer from limited global bandwidth, and scheduling and managing traffic on them can be inflexible, leading to performance bottlenecks. More flexible switched topologies, such as the Dragonfly network, have gained popularity in HPC to address these limitations: by routing data through switches, they provide better global bandwidth and greater routing flexibility, making them well suited to large-scale parallel processing.
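For illustration (this example is mine, not the paper’s), the connectivity of a 2D torus can be written down in a few lines, which also makes the bandwidth trade-off visible: every node reaches its four neighbors in one hop, but traffic to the far side of the machine must cross many links.

```python
# 2D-torus connectivity: node (x, y) in an nx-by-ny grid links to its
# four nearest neighbors, with wraparound edges closing each row and
# column into a ring.
def torus_neighbors(x, y, nx, ny):
    return [((x + 1) % nx, y), ((x - 1) % nx, y),
            (x, (y + 1) % ny), (x, (y - 1) % ny)]

# Each node has degree four and one-hop access to its neighbors, but a
# message across the machine needs many hops -- the scarce global
# bandwidth noted above.
print(torus_neighbors(0, 0, 4, 4))  # [(1, 0), (3, 0), (0, 1), (0, 3)]
```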

The paper proposes combining the best of both worlds, that is, the torus topologies’ cost-effectiveness and the switched topologies’ performance, into HammingMesh, a novel network topology that provides high bandwidth at low cost for deep learning training jobs. A similar approach was recently presented in Google’s TPUv4. In HammingMesh, the authors connect a set of 2D meshes (“boards”) with switches to form virtual torus topologies of varying sizes. Local connections can be implemented as thin conductive traces on printed circuit boards (PCBs) and are therefore very inexpensive. Only traces that leave a board connect to discrete switches, so relative to switching every node individually, the number of switched traces is halved with 2×2 boards and quartered with 4×4 boards, a significant cost savings. The lower cost per link makes high bandwidth affordable: installing many parallel connections can achieve multiple TB/s of bandwidth at a reasonable system cost. The paper presents detailed simulation results for various AI workloads, showing that the price and performance gains carry over to complex workloads.
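The halving/quartering arithmetic is easy to check. The back-of-the-envelope sketch below (my own accounting, assuming a 2D mesh with four links per accelerator) counts how many ports per node must attach to a discrete switch as the board grows.

```python
# An a-by-a board keeps 2*a*(a-1) mesh links on the PCB and exposes
# 4*a links at its edges, i.e. 4/a switched ports per accelerator.
def switched_ports_per_node(a):
    return 4 * a / (a * a)

for a in (1, 2, 4):
    print(f"{a}x{a} board: {switched_ports_per_node(a)} switched ports/node")
# 1x1: 4.0  |  2x2: 2.0 (halved)  |  4x4: 1.0 (quartered)
```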

The authors also demonstrate how a system deploying the HammingMesh topology can deal with failures and varying job allocations. Node and board failures are handled gracefully by swapping in “virtual boards,” and scheduling is flexible because the topology can permute each row and column of boards and still achieve full bandwidth. The paper schedules several real-world traces onto the topology and demonstrates consistently high utilization, even in the presence of failures.
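One way to picture that flexibility is the toy allocator below (a deliberately simplified sketch of my own, not the paper’s scheduler): because rows and columns of boards are interchangeable, a job needs only some choice of rows and columns whose intersections are all healthy, so a failed board blocks a single intersection rather than a contiguous region.

```python
from itertools import combinations

# Simplified allocation sketch: pick any rows_needed rows and
# cols_needed columns whose intersections are healthy boards;
# row/column permutability means any such choice forms a
# full-bandwidth virtual grid for the job.
def find_allocation(healthy, rows_needed, cols_needed, grid_rows, grid_cols):
    for rows in combinations(range(grid_rows), rows_needed):
        for cols in combinations(range(grid_cols), cols_needed):
            if all((r, c) in healthy for r in rows for c in cols):
                return rows, cols
    return None

# A 4x4 grid of boards with a failure at (1, 2) still fits a 2x2 job.
healthy = {(r, c) for r in range(4) for c in range(4)} - {(1, 2)}
print(find_allocation(healthy, 2, 2, 4, 4))  # ((0, 1), (0, 1))
```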

AI is a fast-moving field, with new algorithms published every week. As large-scale decoder architectures such as GPT-4 dominate large parts of the market, the Mixture of Experts (MoE) technique is emerging and points toward sparsity. Large-scale deep learning will shift more and more to sparse models, as sparsity can yield both better data science results and more efficient computing performance and cost. This architectural rethinking has already begun. Only a Magic Mirror can tell whether HammingMesh will remain the best topology for future workloads; still, it is a strong contender among network designs for AI systems.
