The Secret Behind Arista's AI Ethernet Success


By: R. Scott Raynovich


As we've been covering in multiple threads on Futuriom, these are exciting times in networking. AI has reinvigorated the race to build hyperscale clusters on both the inference and training sides, and networking is a key enabler of these services.

For hyperscale datacenters, standards-based Ethernet switching technology emerged as the key to connecting large clusters of “scale-out” servers using commercial off-the-shelf (COTS) hardware. It's now clear that the same thing is happening in specialized AI datacenters and inference networks.

AI brings new requirements for low latency and high bandwidth. Yet delivering the networking infrastructure to support AI will continue the evolution of the high-scale systems that support cloud: It will require improvements to existing systems, anchored in the economics of Ethernet and COTS components, so that customers can scale out infrastructure to support AI in its many forms.

Ethernet's Evolution at Meta

With Arista Networks, Futuriom recently did a deep dive on the evolution of Ethernet for AI networking technology. It's clear from our examination of the technology that it's ready for prime time, as demonstrated by recent mass-scale deployment of Arista Ethernet-based technology at Meta.

Only a year or so ago, there was some debate about whether standards-based Ethernet could be adopted for the special demands of AI. Those reservations appear to have been dispelled, as Arista has added significant features to its networking portfolio to meet the most demanding AI requirements, including those at Meta and other hyperscalers.

Ethernet has already evolved to adapt to these needs, with Arista and others adopting techniques such as packet spraying and congestion-control mechanisms including Priority Flow Control (PFC) and Explicit Congestion Notification (ECN).
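To make the ECN mechanism concrete, here is a minimal illustrative sketch (not Arista's implementation, and greatly simplified from real DCQCN-style schemes): a switch marks packets when its egress queue exceeds a threshold, and the sender reduces its rate when it sees marks, recovering gradually otherwise. The threshold and rate values are arbitrary assumptions for the example.

```python
# Toy model of ECN-based congestion control (illustrative only).
# A switch sets the Congestion Experienced bit when its queue is deep;
# the sender reacts AIMD-style: halve on a mark, creep back up otherwise.

ECN_THRESHOLD = 8  # queue depth (packets) above which the switch marks ECN

def switch_mark(queue_depth: int) -> bool:
    """Return True if the switch sets the ECN Congestion Experienced bit."""
    return queue_depth > ECN_THRESHOLD

class Sender:
    """Toy sender: multiplicative decrease on ECN feedback,
    additive increase (capped at line rate) otherwise."""
    def __init__(self, rate_gbps: float = 100.0):
        self.rate_gbps = rate_gbps

    def on_ack(self, ecn_marked: bool) -> None:
        if ecn_marked:
            self.rate_gbps /= 2                                 # back off hard
        else:
            self.rate_gbps = min(self.rate_gbps + 1.0, 100.0)   # recover slowly

sender = Sender()
for depth in [2, 4, 12, 12, 3]:   # simulated queue depths seen per RTT
    sender.on_ack(switch_mark(depth))
print(f"{sender.rate_gbps:.1f} Gbps")  # prints "26.0 Gbps"
```

The point of the sketch is the division of labor: the switch only signals congestion (it never drops for flow control), while the endpoint adapts its rate. PFC complements this by pausing a traffic class entirely when buffers are nearly exhausted, making the fabric lossless.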

The proof can be seen in deployment, as Arista and Meta recently announced that Ethernet for AI networking is being deployed at scale. In a corporate blog, key Meta engineers said that they are implementing Arista’s Ethernet platform with advanced congestion management and load balancing to deliver lossless and low-latency AI networking.

"We opted for RDMA Over Converged Ethernet version 2 (RoCEv2) as the inter-node communication transport for the majority of our AI capacity," wrote Adi Gangidi and James Hongyi Zeng, two production engineers for Meta, in a blog published in August.

Gangidi and Zeng wrote that the design and operation of large-scale RoCE networks for distributed AI training workloads have evolved to meet the increasing demands of computational density and scale.

“At this time, we have had over a year of experience with just PFC for flow control, without any other transport-level congestion control,” said the blog. “We have observed stable performance and lack of persistent congestion for training collectives.”

Ethernet Evolving at the UEC

Arista is part of a group of industry companies that have coalesced around the Ultra Ethernet Consortium (UEC). This is a community effort to deliver standards-based improvements to Ethernet, including Ultra Ethernet Transport, a modern transport protocol designed to deliver the performance that AI applications require with all the advantages of the Ethernet/IP ecosystem. Arista Networks is one of the founding members of the UEC, which was publicly announced in July 2023. Additional members, totaling more than 60 companies (and continually increasing), include suppliers and operators of many of the largest AI and HPC networks today, such as AMD, Cisco, Dell, HPE, Intel, NVIDIA, and Qualcomm.

As Arista CEO Jayshree Ullal wrote in a recent blog, it's important that the AI infrastructure remain standards-based so that it doesn't devolve into silos, using different orchestration systems for configuration, validation, monitoring, and upgrades. If standards-based Ethernet can be employed across the AI infrastructure—whether it's training or inference—this can drive favorable economics by delivering a single point of control and visibility. Ullal and Arista call this the "AI center."

This is an interesting new chapter in the cloud networking revolution. In our new Leadership Brief, we explore how Arista is driving key developments in AI infrastructure and how its many challenges can be solved with innovations applied to broadly adopted technologies such as Ethernet. Here is what to look for in the continuing evolution of AI networking infrastructure:

  • The emergence of a variety of AI networking needs, including high-scale networking for both training and inference.
  • The evolution of Ethernet technologies to deliver AI-scale networking.
  • The needs of enterprises to deploy a consistent AI networking infrastructure and operational model to lower the costs of AI services and applications.

Read on, as we detail our findings on how Ethernet will evolve to support the needs of high-performance AI networking.

Download Our Leadership Brief Now!