NVIDIA Ethernet Drives xAI's Supercomputer
In news that flew a bit under the radar, NVIDIA announced last month that xAI’s gigantic new supercomputer in Memphis, Tennessee—called Colossus—is using NVIDIA Spectrum-X Ethernet networking to connect 100,000 NVIDIA Hopper GPUs.
xAI, of course, is the company headed by Elon Musk (who is also CEO of Tesla and SpaceX). In its mission to “understand the universe,” xAI is training a series of large language models (LLMs), including Grok-0, Grok-1, and Grok-2.
The news of NVIDIA's role in Colossus is significant on several fronts. It underscores NVIDIA’s dominance in the supercomputing market. It also illustrates the strength of NVIDIA’s brand of Ethernet networking, which retains the interoperability of Ethernet while supporting the demands of AI.
Demands of the Largest Supercomputer
In the case of Colossus, claimed by xAI and NVIDIA to be the world’s largest supercomputer, AI’s demands are enormous. To build Colossus, xAI recruited some of the world’s top engineers. The cluster comprises 100,000 H100 GPUs, configured in HGX servers containing eight GPUs each. (That’s for now: The cluster will soon be expanded by another 100,000 GPUs.) The servers are housed in Supermicro liquid-cooled racks, with 64 GPUs per rack. There are a total of 1,500 GPU racks in the Colossus cluster.
NVIDIA and xAI point out that the Colossus facility and supercomputer were built in just 122 days, one of the fastest datacenter build-outs in history. Just 19 days elapsed from the installation of the first rack to the start of model training.
Advances in Spectrum-X
The networking of Colossus relies on NVIDIA's Spectrum-X Ethernet. Across all three tiers of the network fabric, the system has experienced zero application-latency degradation and no packet loss due to flow collisions, and it has maintained 95% data throughput using NVIDIA's Spectrum-X congestion control. NVIDIA contrasts this with standard Ethernet, which it says generates thousands of flow collisions under the same conditions and delivers only 60% data throughput.
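To put those utilization figures in perspective, here is a back-of-envelope sketch of what 95% versus 60% data throughput means per link. The 400 Gb/s line rate is an illustrative assumption (typical of current AI Ethernet fabrics), not a figure from the announcement:

```python
# Illustrative comparison of effective per-link throughput using the
# utilization figures cited above: 95% (Spectrum-X congestion control)
# vs. 60% (standard Ethernet under heavy AI collective traffic).
# The 400 Gb/s line rate is an assumption for illustration only.

LINE_RATE_GBPS = 400  # assumed per-link line rate

def effective_throughput(line_rate_gbps: float, utilization: float) -> float:
    """Effective data throughput of a link at a given utilization."""
    return line_rate_gbps * utilization

spectrum_x = effective_throughput(LINE_RATE_GBPS, 0.95)  # ~380 Gb/s
standard = effective_throughput(LINE_RATE_GBPS, 0.60)    # ~240 Gb/s

print(f"Spectrum-X:        {spectrum_x:.0f} Gb/s effective")
print(f"Standard Ethernet: {standard:.0f} Gb/s effective")
print(f"Advantage: {spectrum_x / standard - 1:.0%} more usable bandwidth")
```

At cluster scale the gap compounds: every link in a 100,000-GPU fabric carrying roughly 58% more usable bandwidth translates directly into shorter collective-communication phases during training.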
“The important thing is that the xAI 100K-GPU cluster is built on Spectrum-X. Traditional Ethernet was not designed and does not work for AI,” Kevin Deierling, senior vice president of networking at NVIDIA, told us in an interview last week. “Spectrum-X was designed and built for AI and addresses the limitations of traditional Ethernet and will scale to meet the insatiable demands of generative agentic AI.”
A Two-Pronged Approach
The big takeaway here is that NVIDIA is betting heavily on its own flavor of Ethernet (Spectrum-X) to complement its InfiniBand portfolio, which has historically dominated the "back-end" networks that connect AI accelerators. The investment in Spectrum-X shows that NVIDIA doesn't want to be known as an "InfiniBand only" shop, primarily because that could pose a risk for its own business as well as for its customers.
When it bought Mellanox in 2020, NVIDIA became the market's leading provider of InfiniBand solutions, and some customers worried about a "lock-in" effect for InfiniBand (regulators, for their part, may have worried about market concentration). By offering an Ethernet solution as well, NVIDIA can take advantage of both markets.
Bottom line: It's clear that the industry is moving toward optimizing and upgrading Ethernet to become a leading networking technology for AI (a shift we have covered in detail in our Networking Infrastructure for AI report, as well as in many articles on Futuriom's website). And NVIDIA’s Spectrum-X has emerged as a standards-based Ethernet platform using Remote Direct Memory Access (RDMA) technology optimized for demanding AI workloads.
Ethernet’s Market for AI
The race to make Ethernet AI-ready has set off a flurry of marketing and roadmaps among networking companies as they scramble to position themselves as next-gen AI networking providers. It's clear why: Networking is believed to account for 10% to 15% of total datacenter infrastructure spending, which by some estimates will soon approach $1 trillion for AI datacenters.
That's not a market NVIDIA is willing to pass up. And the markets have judged NVIDIA, along with Arista, to be a leader in AI infrastructure networking: The share prices of both companies have skyrocketed in recent years on increased spending by hyperscalers.
NVIDIA also has indicated that it's willing to play more broadly in the Ethernet market by recently joining the Ultra Ethernet Consortium (UEC), which is working to develop standardized, interoperable Ethernet solutions built for AI. The UEC includes all of the largest networking vendors, as well as many startups. In addition to Arista and NVIDIA, members include Arrcus, Broadcom, Ciena, Cisco, Dell, DriveNets, HPE, Huawei, Juniper, Marvell, and Nokia, to name a few of the most notable contributors (the group now has almost 100 members).
Futuriom Take: The xAI deployment suggests that NVIDIA's bet on developing its own flavor of Ethernet is paying off.