Future Tech

AMD reveals the MI325X, a 288GB AI accelerator built to battle Nvidia's H200

Tan KW
Publish date: Mon, 03 Jun 2024, 11:08 AM

Computex AMD's flagship AI accelerator will receive a high-bandwidth memory boost when the MI325X arrives later this year.

The reveal comes as AMD follows Nvidia's pattern and transitions to a yearly release cadence for its "Instinct" range of accelerators.

The Instinct MI325X, at least from what we can tell, is a lot like Nvidia's H200 in that it's an HBM3e-enhanced version of the GPU we detailed at length during AMD's Advancing AI event in December 2023. But the part is among the most sophisticated we've seen to date - composed of eight compute, four I/O, and eight memory chiplets stitched together using a combination of 2.5D and 3D packaging technologies.

From what we've seen, it doesn't appear that the CDNA 3 GPU tiles powering the forthcoming chip have changed meaningfully - at least not in terms of FLOPS. The chip still boasts 1.3 petaFLOPS of dense BF/FP16 performance or 2.6 petaFLOPS when dropping down to FP8. To be clear, the MI325X is still faster than the H200 at any given precision.

AMD's focus seems to be extending its memory advantage over Nvidia. At launch, the 192GB MI300X boasted more than twice the HBM3 of the H100 and a 51GB edge over the upcoming H200. The MI325X boosts the accelerator's capacity to 288GB - more than twice that of the H200 and 50 percent more than Nvidia's Blackwell chips revealed at GTC this spring.

The move to HBM3e also juices the MI325X's memory bandwidth to 6TB/sec. While that's a decent boost over the MI300X's 5.3TB/sec and 1.3x more than the H200, we would have expected the figure to be closer to 8TB/sec - like we saw on Nvidia's Blackwell GPUs.

Unfortunately, we'll have to wait until the MI325X arrives later this year to find out what's going on with its memory config.

A precision problem?

Both memory capacity and bandwidth have become major bottlenecks for AI inferencing. As we've discussed on numerous occasions, you need about 1GB of memory for every billion parameters when running at 8-bit precision. As such, you should be able to cram a 250-billion-parameter model onto a single MI325X - or closer to a two-trillion-parameter model onto an eight-GPU system - and still have room for key value caches.
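To put rough numbers on that rule of thumb, here's a minimal back-of-the-envelope sketch in Python. The capacities and the roughly-1GB-per-billion-parameters figure come from the article; the helper name and the example model sizes are our own illustrative assumptions.

def kv_cache_headroom_gb(model_params_b: float, hbm_gb: float, bytes_per_param: float) -> float:
    """GB left over for key-value caches after loading the weights.

    Assumes roughly 1GB per billion parameters at 8-bit precision (bytes_per_param=1),
    or double that at FP16/BF16 (bytes_per_param=2). Negative means it doesn't fit.
    """
    weights_gb = model_params_b * bytes_per_param
    return hbm_gb - weights_gb

# 250B-parameter model at FP8 on a single 288GB MI325X: ~38GB left for KV caches
print(kv_cache_headroom_gb(250, 288, 1))

# ~2T-parameter model at FP8 across eight MI325Xs (2,304GB total): ~304GB left
print(kv_cache_headroom_gb(2_000, 8 * 288, 1))

# The same 250B model at FP16 needs 500GB of weights - it no longer fits on one GPU
print(kv_cache_headroom_gb(250, 288, 2))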

Except, in a prebriefing ahead of Computex, AMD execs boasted that the company's MI325X systems could support 1-trillion-parameter models. So what gives? Well, AMD is still focusing on FP16, which requires twice as much memory per parameter as FP8.

Despite hardware support for FP8 being a major selling point of the MI300X when it launched, AMD has generally focused on half-precision performance in its benchmarks. And amid a scrap with Nvidia over the veracity of AMD's benchmarks late last year, we learned why. For a lot of its benchmarks, AMD is relying on vLLM - an inference library that hasn't had solid support for FP8 data types. This meant that, for inferencing, the MI300X was stuck with FP16.

And unless AMD has been able to overcome this limitation, a model that'll run at FP8 on an H200 will require twice the memory on the MI325X - eliminating any advantage its massive 288GB of capacity might have otherwise granted it. What's more, the H200 is going to boast higher floating-point performance at FP8 than an MI325X at FP16.
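To make the GPU-count math concrete, here's another small sketch under the same assumptions (1GB per billion parameters at FP8, twice that at FP16, with KV caches ignored for simplicity). The 141GB H200 capacity is Nvidia's published figure; the one-trillion-parameter model is just an arbitrary example.

import math

def gpus_needed(model_params_b: float, bytes_per_param: float, hbm_gb: float) -> int:
    """Minimum accelerators needed just to hold the weights (KV caches ignored)."""
    return math.ceil(model_params_b * bytes_per_param / hbm_gb)

# Hypothetical one-trillion-parameter model
print(gpus_needed(1_000, 1, 141))  # H200 at FP8: 8 GPUs
print(gpus_needed(1_000, 2, 288))  # MI325X stuck at FP16: 7 GPUs
print(gpus_needed(1_000, 1, 288))  # MI325X at FP8, if the software catches up: 4 GPUs

In other words, the capacity edge only really pays off if FP8 inferencing works in practice.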

Of course, this isn't an apples-to-apples comparison. But if your main concern is getting a model to run on as few GPUs as possible and you can not only drop to lower precision but double your floating point throughput, it's hard to see why you wouldn't.

The competitive landscape heats up

That said, there's still some merit to sticking with FP/BF16 data types for training and inferencing. As we saw with Gaudi3, Intel's Habana Labs actually prioritized 16-bit performance.

Announced earlier this spring, Gaudi3 boasts 192GB of HBM2e memory and a dual-die design capable of churning out 1.8 petaFLOPS of dense FP8 and FP16 alike. That gives it a 1.85x lead over the H100/H200 and a 1.4x advantage over the MI300X/325X at half precision.
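For what it's worth, those ratios roughly check out against the dense half-precision figures: about 1 petaFLOPS for the H100/H200 (our figure, taken from Nvidia's public spec sheet) and the 1.3 petaFLOPS cited above for the MI300X/325X.

gaudi3_fp16 = 1.8   # petaFLOPS, dense (Intel's figure)
h100_fp16 = 0.99    # petaFLOPS, dense (per Nvidia's spec sheet; the H200 uses the same silicon)
mi300x_fp16 = 1.3   # petaFLOPS, dense (cited earlier in this article)

print(f"vs H100/H200:   {gaudi3_fp16 / h100_fp16:.2f}x")    # ~1.8x, in line with the claimed 1.85x
print(f"vs MI300X/325X: {gaudi3_fp16 / mi300x_fp16:.2f}x")  # ~1.4x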

The one caveat is that Gaudi3 doesn't support sparsity, while Nvidia and AMD's chips do. However, there's a reason AMD and Intel have both focused on dense floating point performance: sparsity just isn't that common in practice.

That may not always be true, of course. A considerable amount of effort has gone into training sparse models - particularly by Nvidia and waferscale contender Cerebras. At least for inferencing, support for sparse floating point math may eventually play to AMD and Nvidia's advantage.

Pitted against Nvidia's H100 and upcoming H200, AMD's MI300X already leads in floating point performance and memory bandwidth - and its latest chip extends that lead.

But while AMD would prefer to draw comparisons to Nvidia's Hopper-gen parts, they aren't the ones it should be worried about. Of more concern are the Blackwell parts, which supposedly will start trickling onto the market later this year.

In its B200 config, the 1,000W Blackwell part promises up to 4.5 petaFLOPS of dense FP8 and 2.25 petaFLOPS of dense FP16 performance, 192GB of HBM3e memory, and 8TB/sec of bandwidth.

Fighting harder, faster

AMD isn't oblivious to the fact that Nvidia's Blackwell parts hold the advantage, and to better compete the House of Zen is moving to a yearly release cadence for new Instinct accelerators.

If that sounds at all familiar, it's because - at least according to documents provided to investors - Nvidia did the same thing last fall. AMD hasn't said much about its next-gen CDNA 4 compute architecture, but from what little we have seen it'll be much better aligned with Blackwell.

According to AMD, CDNA 4 will stick with the same 288GB HBM3e config as the MI325X, but move to a 3nm process node for the compute tiles and add support for FP4 and FP6 data types - the latter of which Nvidia is already adopting with Blackwell.

The new data types may help to alleviate some of AMD's challenges around FP8, as FP4 and FP6 don't appear to suffer from the same lack of standardization. You see, FP8 is kind of a mess, with AMD and Nvidia using wildly different implementations. With the new 4-bit and 6-bit floating point implementations this (hopefully) won't be as big a problem.
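To illustrate, the sketch below hand-decodes the same byte under the two E4M3 conventions in circulation: the OCP 'FN' variant with an exponent bias of 7, which Nvidia's Hopper uses, and the 'FNUZ' variant with a bias of 8, which CDNA 3 reportedly uses. The bias values follow the published format descriptions; the vendor pairing is our reading rather than something AMD has spelled out here.

def decode_e4m3(byte: int, bias: int) -> float:
    """Decode an 8-bit E4M3 value: 1 sign bit, 4 exponent bits, 3 mantissa bits.

    bias=7 matches the OCP E4M3 ('FN') format; bias=8 matches the FNUZ variant.
    NaN encodings and other special cases are ignored for brevity.
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0:  # subnormal range
        return sign * (mant / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - bias)

# The same bit pattern decodes to two different numbers depending on the convention
b = 0b0111_1000
print(decode_e4m3(b, bias=7))  # 256.0 under OCP E4M3
print(decode_e4m3(b, bias=8))  # 128.0 under the FNUZ variant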

Following CDNA 4's debut in 2025, AMD claims "CDNA next" - which we're going to call CDNA 5 for the sake of consistency - will deliver a "significant architectural upgrade."

What that will entail, AMD was reluctant to reveal. But if recent discussions by top executives are anything to go by, it could entail heterogeneous multi-die deployments or even photonic memory expansion. After all, AMD is one of the investors backing Celestial AI, which is developing that very tech.

https://www.theregister.com//2024/06/03/amd_reveals_refreshed_mi325x_with/
