Alibaba Cloud reveals its datacenter design and homebrew network used for LLM training

Publish date: Thu, 27 Jun 2024, 04:51 PM

Exclusive: Alibaba Cloud has revealed the design of an Ethernet-based network it created specifically to carry traffic for training large language models - and has used in production for eight months.

The Chinese Cloud also revealed that its choice of Ethernet was informed by a desire to avoid vendor lock-in and leverage "the power of the entire Ethernet Alliance for faster evolution" - a decision that backs arguments made by a collection of vendors who are trying to attack Nvidia's networking business.

Alibaba's plans were revealed on the GitHub page of Ennan Zhai - an Alibaba Cloud senior staff engineer and research scientist focused on network research. Zhai posted a paper [PDF] to be presented at August's SIGCOMM conference - the annual get-together of the Association for Computing Machinery's special interest group on data communications.

Titled "Alibaba HPN: A Data Center Network for Large Language Model Training," the paper opens with the observation that traffic cloud computing traffic "… generates millions of small flows (eg lower than 10Gbit/sec)," while LLM training "produces a small number of periodic, bursty flows (eg 400Gbit/sec) on each host."

That distinction matters because under such traffic Equal-Cost Multi-Path (ECMP) routing - a commonly used method of sending packets to a single destination over multiple paths - becomes predisposed to hash polarization: a phenomenon that sees load balancing struggle and can significantly reduce usable bandwidth.

Alibaba Cloud's homebrew alternative, named "High Performance Network" (HPN), "avoids hash polarization by decreasing the occurrences of ECMP, but also greatly reduces the search space for path selection, thus allowing us to precisely select network paths capable of holding elephant flows."
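For readers who want to see the failure mode, here's a minimal Python sketch - our illustration, not Alibaba's code - of how ECMP's hash-based path choice polarizes when successive switch tiers apply the same hash function to the same flow identifiers. The flow names, tier count, and hash are stand-ins.

```python
# Minimal sketch (not Alibaba's code) of hash polarization in ECMP.
# Assumption for illustration: two switch tiers, four equal-cost paths
# each, flow ID strings standing in for the usual 5-tuple.
import hashlib

def ecmp_pick(flow_id: str, num_paths: int, salt: str = "") -> int:
    """Choose an outgoing path by hashing the flow identifier."""
    digest = hashlib.md5((salt + flow_id).encode()).hexdigest()
    return int(digest, 16) % num_paths

# LLM training generates a small number of elephant flows, per the paper.
flows = [f"host{i}->host{(i + 1) % 16}:400G" for i in range(16)]

tier1 = [ecmp_pick(f, 4) for f in flows]

# A second tier reusing the SAME hash function repeats every tier-1
# decision: flows that converged upstream stay converged, and three
# quarters of the downstream links never see traffic.
tier2_same_hash = [ecmp_pick(f, 4) for f in flows]
print(sum(a == b for a, b in zip(tier1, tier2_same_hash)), "of 16 repeat")  # 16

# Salting the hash per tier (one standard mitigation) restores spreading.
tier2_salted = [ecmp_pick(f, 4, salt="tier2") for f in flows]
print(sum(a == b for a, b in zip(tier1, tier2_salted)), "of 16 repeat")     # ~4
```

With only a handful of 400Gbit/sec elephant flows in play, one such collision can saturate a link while its siblings sit idle - hence HPN's preference for fewer ECMP decisions and deliberate path selection.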

HPN also addresses the fact that GPUs need to work in sync while training LLMs, which makes AI infrastructure sensitive to single points of failure - especially top-of-rack switches.

Alibaba's network design therefore uses a pair of switches - but not in the stacked configuration suggested by switch vendors.

Crammed full of cards

The paper explains that each host Alibaba Cloud uses for LLM training contains eight GPUs and nine network interface cards (NICs), each with a pair of 200Gb/sec ports. One of the NICs handles housekeeping traffic on a "frontend network."

The remaining eight NICs form the "backend network" that carries training traffic. Within a host, each GPU communicates directly with its peers over an intra-host network that runs at 400-900GB/sec (bidirectional). On the backend network, each NIC serves a single GPU - an arrangement Alibaba Cloud terms "rails" - which gives each accelerator "a dedicated 400Gb/sec of RDMA network throughput, resulting in a total bandwidth of 3.2Tb/sec."

"Such a design aims to maximize the utilization of the GPU's PCIe capabilities (PCIe Gen5×16), thus pushing the network send/receive capacity to the limit," the paper states.

Each port on the NICs connects to a different top-of-rack switch, to avoid single points of failure.
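The arithmetic behind those numbers is simple enough to check - a back-of-envelope Python sketch using the figures above (the variable names are ours):

```python
# Back-of-envelope check of the per-host figures quoted above.
GPUS_PER_HOST = 8
RAIL_NICS = GPUS_PER_HOST   # one dedicated NIC per GPU; a ninth NIC
                            # handles housekeeping on the frontend network
PORTS_PER_NIC = 2           # two 200Gb/sec ports per NIC
PORT_GBPS = 200

per_gpu_gbps = PORTS_PER_NIC * PORT_GBPS         # 400Gb/sec per GPU
backend_tbps = RAIL_NICS * per_gpu_gbps / 1000   # whole-host total

print(f"dedicated RDMA bandwidth per GPU: {per_gpu_gbps} Gb/sec")  # 400
print(f"total backend bandwidth per host: {backend_tbps} Tb/sec")  # 3.2
# Each NIC's two ports land on different top-of-rack switches, so a ToR
# failure halves a GPU's bandwidth rather than cutting it off entirely.
```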

The Chinese Cloud's remarks about its preference to use Ethernet will be music to the ears of AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, and Microsoft. All of those vendors recently signed up for the Ultra Accelerator Link consortium - an effort to challenge Nvidia's NVLink networking biz. Intel and AMD have said the consortium - and other advanced networking efforts like Ultra Ethernet - represent a better way to network AI workloads because open standards always win in the long run, as they enable easier innovation.

But while Alibaba Cloud's HPN design is based around Ethernet, it still uses Nvidia tech. The GPU champ's NVLink is used for the intra-host network (which has more bandwidth than the network between hosts), and Nvidia's "rail-optimized" design approach - in which each network interface card connects to a different set of top-of-rack switches - is also in place.

Single-chip switches rule at Alibaba

The paper also makes many mentions of a "51.2Tb/sec Ethernet single-chip switch (first released in early 2023)" in Alibaba Cloud's top-of-rack switches. Two devices meet that description: Broadcom's Tomahawk ASICs, which shipped in March 2023, and Cisco's G200, which arrived in June of the same year. The reference to "early 2023" suggests Alibaba Cloud went with Broadcom.

Whatever's inside Alibaba's switches, the paper reveals that the Chinese Cloud has a preference for switches powered by a single chip.

"There have been multi-chip chassis switches supporting higher bandwidth capacity," the paper states, before noting that "Alibaba Cloud's long-term experience in operating datacenter networks reveals that multi-chip chassis switches introduce more stability risks than single-chip switches."

The company's fleet of single-chip switches, it's revealed, outnumbers its multi-chip models by a factor of 32.6. And those multi-chip switches experience critical hardware failures at a rate 3.77 times higher than single-chip switches.
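Combining the two ratios shows why that preference makes sense - our arithmetic, not the paper's: the multi-chip minority punches well above its weight in failures.

```python
# What the two ratios imply together (our arithmetic, not the paper's).
single_count, multi_count = 32.6, 1.0   # relative fleet sizes (32.6:1)
single_rate, multi_rate = 1.0, 3.77     # relative per-switch failure rates

fleet_share = multi_count / (single_count + multi_count)
failure_share = (multi_count * multi_rate) / (
    single_count * single_rate + multi_count * multi_rate)

print(f"multi-chip share of fleet:    {fleet_share:.1%}")    # ~3.0%
print(f"multi-chip share of failures: {failure_share:.1%}")  # ~10.4%
```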

DIY heatsink needed

While Alibaba Cloud adores single-chip switches - and enjoys the fact that the 51.2Tbit/sec units it adopted double the throughput of its previous units while consuming only 45 percent more power - the new models don't run cooler than their predecessors.
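A quick sanity check on that efficiency claim - our arithmetic, assuming the previous generation delivered half the throughput at baseline power - puts the gain at roughly 38 percent more throughput per watt:

```python
# Perf-per-watt implied by "double the throughput, 45 percent more power".
# Our arithmetic; power is normalized because no absolute wattage is quoted.
old_tbps, old_power = 25.6, 1.00   # previous generation (half of 51.2Tb/sec)
new_tbps, new_power = 51.2, 1.45   # new chip at 45 percent more power

gain = (new_tbps / new_power) / (old_tbps / old_power)
print(f"throughput per watt vs previous generation: {gain:.2f}x")  # ~1.38x
```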

If the chips warm beyond 105°C, the switches can shut down. Alibaba Cloud could not find a switch vendor that offered cooling capable of keeping chips below 105°C.

It therefore created its own vapor chamber heat sink.

"By optimizing the wick structure and deploying more wicked pillars at the center of the chip, heat could be carried out more efficiently," the paper explains.

Datacenter design disclosed

All of the above is built into "pods" that house 15,000 GPUs apiece, each of which resides in a single datacenter building.

"All datacenter buildings in commission in Alibaba Cloud have an overall power constraint of 18MW, and an 18MW building can accommodate approximately 15K GPUs," the paper reveals, adding "In conjunction with HPN, each single building perfectly houses an entire Pod, making predominant links inside the same building."

All optical fiber runs within the building are shorter than 100 meters, which allows "use of lower-cost multi-mode optical transceivers (cutting 70 percent cost compared with single-mode optical transceivers)."

It's not all sweetness and light: the paper admits that "HPN introduces extra designs … making wiring much more complex."

"Especially at the nascent stage of constructing HPN, on-site staff make a lot of wiring mistakes." That means extra testing is needed.

The paper also notes that the forwarding capacity of a single Ethernet chip doubles every two years. Alibaba Cloud is therefore already "designing the next-generation network architecture equipping the higher capacity single-chip switch."

"In the land construction planning of our next-generation datacenters, the total power constraints for a single building have been adjusted to cover more GPUs. Thus, when the new datacenter is delivered, it can be directly equipped with 102.4Tbit/sec single-chip switches and the next-generation HPN."

The paper also notes that training LLMs with hundreds of billions of parameters "relies on a large-scale distributed training cluster, typically equipped with tens of thousands of GPUs."

Alibaba Cloud's own Qwen model comes in a variant with 110 billion parameters - which suggests it has an awful lot of pods using HPN, and many thousands of GPUs in production. And it will need many more, as its models and datacenters become larger and more numerous. ®


https://www.theregister.com/2024/06/27/alibaba_network_datacenter_designs_revealed/
