Intel Gaudi's third and final hurrah is an AI accelerator built to best Nvidia's H100

On paper, Intel's Habana Gaudi3 AI accelerators don't look ready to take on Nvidia's H100: older process tech, slower HBM memory, and fewer FLOPS. But Gelsinger's gang insists its latest parts can not only go toe-to-toe with the H100 in inference, but best it in training.

Announced at Intel Vision on Tuesday, the accelerator arrives four years after the x86 giant abandoned its Nervana architecture in favor of Habana's Gaudi accelerators, and almost two years after its Gaudi2 chips arrived.

For its third gen, the Habana team opted for a multi-die architecture, joining AMD and Nvidia, which have already embraced advanced packaging to scale their chips beyond the reticle limit.

Compared to AMD's MI300X, or even Intel's Ponte Vecchio GPUs, Gaudi3 is relatively conservative in this respect, using a high-speed interconnect to make a pair of compute dies behave as one.

Combined, the two compute dies pack eight matrix math engines, 64 tensor cores, 96 MB of SRAM cache, 16 lanes of PCIe 5.0, and 24 200GbE links - more on those in a minute. Surrounding the dies we find eight stacks of older HBM2e memory, for a total of 128 GB of capacity and 3.7 TBps of bandwidth.
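
Split evenly across those eight stacks - our assumption, since Intel only quotes the totals - that works out to 16 GB and roughly 460 GBps per HBM2e stack, which is near the top of what that memory generation can manage. A quick back-of-the-envelope check:

    # Per-stack HBM figures, assuming Intel's quoted totals are split evenly
    # across all eight HBM2e stacks (Intel only discloses the totals).
    stacks = 8
    capacity_gb = 128
    bandwidth_tbps = 3.7

    print(capacity_gb / stacks)             # 16.0 GB per stack
    print(bandwidth_tbps * 1000 / stacks)   # ~462 GBps per stack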

But despite a striking similarity to Nvidia's Blackwell GPUs, Habana chief operating officer Eitan Medina emphasizes that this is not a GPU.

"GPUs have the legacy of being architected to do graphic rendering. Graphic rendering is about rendering pixels, so naturally, the choice was to implement a lot of small execution units because pixels are pixels," he explains. "You don't need gigantic matrix multiplication for graphic rendering."

By comparison, Gaudi3 was built using a smaller number of very large matrix math engines that are able to process AI workloads more effectively, Medina contends. "We had the benefit of creating Gaudi just for AI."

Ironically, Intel's next AI accelerator, due out next year, will be a GPU. Altogether, Intel says the latest part is capable of squeezing 1,835 teraFLOPS of FP8 - about twice that of its predecessor - from between 600 W and 900 W, depending on whether you're looking at the PCIe or OAM form factor.

It's here where the world of AI accelerator speeds and feeds gets a bit messy.

Paper performance

The disclosure of floating point performance is a notable departure for Intel, which has previously contended that the metric isn't a good indicator of AI performance. And on paper it's not hard to see why. Two years after Nvidia unveiled the H100 with nearly 4 petaFLOPS of FP8 grunt, Gaudi3 barely manages half that - or does it?

That often-referenced figure is perhaps misleading. The 1,835 teraFLOPS claimed by Intel is dense floating point performance, while Nvidia relies on sparsity to reach its 4 petaFLOPS claim. Taking this into account, Gaudi3 is only about 144 teraFLOPS slower than the H100, while offering more memory grunt.
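
For the curious, the arithmetic is easy to reproduce from Nvidia's published spec sheet (our figures here, not Intel's): the H100 SXM's headline 3,958 teraFLOPS of FP8 assumes 2:1 structured sparsity, so the dense number is half that.

    # Dense FP8 comparison, using Nvidia's published H100 SXM figure of
    # 3,958 teraFLOPS with 2:1 structured sparsity (not an Intel number).
    h100_fp8_sparse = 3958
    h100_fp8_dense = h100_fp8_sparse / 2        # ~1,979 teraFLOPS dense
    gaudi3_fp8_dense = 1835

    print(h100_fp8_dense - gaudi3_fp8_dense)    # ~144 teraFLOPS in the H100's favor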

Then there's AMD's MI300X, which remains the FP8 FLOPS king, at least until Nvidia's Blackwell parts start making their way into customers' hands. Nvidia says that'll start happening later this year, but we strongly suspect it'll be 2025 before we see them in any meaningful volume.

However, FP8 is only part of the story. At half precision, Gaudi3 really sings. In a bit of a party trick, Gaudi3 can put down the same 1,835 teraFLOPS at FP16 as it can at FP8, giving it a 1.85x lead over the H100 and a 1.4x advantage over the MI300X. What it won't do is support sparsity, which means that the H100 and MI300X may retain an advantage for workloads that can take advantage of it.
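
Those multipliers check out against the dense half-precision numbers Nvidia and AMD publish - roughly 989 teraFLOPS for the H100 SXM and 1,307 teraFLOPS for the MI300X (again, vendor spec-sheet figures rather than anything in Intel's deck):

    # Sanity check on Intel's claimed FP16 leads, using the vendors'
    # published dense FP16/BF16 figures (assumed here, not Intel's).
    gaudi3_fp16 = 1835
    h100_fp16 = 989.4
    mi300x_fp16 = 1307.4

    print(round(gaudi3_fp16 / h100_fp16, 2))      # ~1.85x
    print(round(gaudi3_fp16 / mi300x_fp16, 2))    # ~1.4x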

"Sparsity is something that's heavily researched but we're not depending on it," Medina says, adding that Intel has "no immediate plans" to enable sparsity on Gaudi3 for either training or inference.

Floating point performance is only one indicator of AI performance and not always a good one. Memory bandwidth also plays an outsized role in determining performance, especially for larger models.

Here Gaudi3's older memory - HBM2e vs HBM3 on the H100 and MI300X, and HBM3e on the H200 - puts it in an odd spot.

With eight stacks of HBM2e, Gaudi3 actually has more and faster memory than Nvidia's H100 with its five stacks of HBM3. Despite this, the chip still falls well behind the H200 and the MI300X, which deliver 4.8 TBps and 5.3 TBps of bandwidth respectively.

The decision to stick with HBM2e came down to risk management, Medina explained. "Our methodology was to use only IPs that were already proven in silicon before we tape out. At the time we taped out Gaudi3 there was simply no available physical layers that were validated to meet our standards."

However, this does suggest that Gaudi3 may have taken longer to bring to market than planned - something Intel has struggled with across the rest of its datacenter line-up until recently.

Scaling up

It's a slightly different story when it comes to interconnect bandwidth.

Depending on the number of parameters involved and the precision employed, whether FP8 or FP16/BF16, it's not unusual for a model to run across multiple accelerators. For instance, to inference a 175 billion parameter model at FP16, you'd need at least five 80 GB H100s just to fit the model into memory.
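
The memory math behind that example is simple enough: at FP16 each parameter occupies two bytes, so the weights of a 175 billion parameter model need about 350 GB before any key-value cache or other overhead is counted. A minimal sketch:

    import math

    # Minimum accelerator count just to hold a model's weights in HBM,
    # ignoring KV cache, activations, and framework overhead.
    def min_accelerators(params_billions, bytes_per_param, hbm_gb):
        weights_gb = params_billions * bytes_per_param   # 1e9 params x N bytes ~= N GB
        return math.ceil(weights_gb / hbm_gb)

    print(min_accelerators(175, 2, 80))    # 80 GB H100s  -> 5
    print(min_accelerators(175, 2, 128))   # 128 GB Gaudi3 -> 3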

To do this, Nvidia and AMD use specialized interconnects called NVLink and Infinity Fabric respectively, which provide in the neighborhood of 900 GBps of bandwidth to stitch together eight or more accelerators. By comparison, Gaudi is using regular old RDMA over Converged Ethernet (RoCE).

As we mentioned earlier, each accelerator features 24 200GbE interfaces good for 1.2 TBps of total bandwidth. Three of the 24 links are dedicated to off-node communications, leaving 1 TBps for chip-to-chip communications within the server.
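
Those figures appear to count both directions of each link: 24 ports at 200 Gbps works out to 600 GBps each way, or 1.2 TBps in aggregate, and carving three ports out for scale-out leaves roughly a terabyte per second inside the box. The arithmetic, for those keeping score:

    # Aggregate Ethernet bandwidth per Gaudi3, counting both directions
    # of every 200GbE link (our reading of Intel's 1.2 TBps figure).
    ports, gbps_per_port, offnode_ports = 24, 200, 3

    total_tbps = ports * gbps_per_port * 2 / 8 / 1000                     # Gbit -> GByte -> TByte
    in_node_tbps = (ports - offnode_ports) * gbps_per_port * 2 / 8 / 1000

    print(total_tbps)       # 1.2 TBps
    print(in_node_tbps)     # 1.05 TBps, i.e. roughly 1 TBps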

This has a couple of benefits. For one, Gaudi3 systems should, in theory, be much simpler as they require fewer components. In a typical Nvidia or AMD system you're going to have at least one NIC per GPU for the compute network.

Intel contends that by integrating Ethernet NICs into its accelerators, it's also much easier to scale up to support 512- and even 1,024-node clusters using a traditional spine-leaf architecture.

Intel's own illustration suggests that to support a 4,096 accelerator cluster, you'd need 96 leaf switches connected via 800GbE to 48 64-port spine switches.
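
Those counts are consistent with each accelerator pointing its three scale-out links at the leaf layer and every switch being a 64-port 800GbE box with half its ports facing down and half facing up - our assumptions for the sketch below, not Intel's stated bill of materials.

    import math

    # Rough leaf-spine sizing for a 4,096-accelerator Gaudi3 cluster,
    # assuming three 200GbE scale-out links per accelerator and 64-port
    # 800GbE switches split evenly between downlinks and uplinks.
    accelerators = 4096
    links_800g = accelerators * 3 / 4        # 3 x 200GbE per chip = 3,072 800GbE-equivalents

    leaves = math.ceil(links_800g / 32)      # 32 downlinks per leaf -> 96 leaves
    spines = math.ceil(leaves * 32 / 64)     # 32 uplinks per leaf   -> 48 spines

    print(leaves, spines)                    # 96 48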

 

https://www.theregister.com//2024/04/09/intel_gaudi_ai_accelerator/
