Nvidia turns up the AI heat with 1,200W Blackwell GPUs

Publish date: Tue, 19 Mar 2024, 07:27 AM

For all the sabre-rattling from AMD and Intel, Nvidia remains, without question, the dominant provider of AI infrastructure. With today's debut of the Blackwell GPU architecture during CEO Jensen Huang's GTC keynote, it aims to extend that lead - in both performance and power consumption.

Given Nvidia's meteoric rise in the wake of the generative AI boom, the stakes couldn't be higher. But at least on paper, Blackwell - the successor to Nvidia's venerable Hopper generation - doesn't disappoint. In terms of raw FLOPS, the GPU giant's top-specced Blackwell chips are roughly 5x faster.

Of course, the devil is in the details. Getting this performance will depend heavily on a number of factors. While Nvidia says its new chip will do 20 petaFLOPS, that's only when using its new 4-bit floating point data type and opting for liquid cooled servers. Looking at gen-on-gen FP8 performance, the chip is only about 2.5x faster than the H100.
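For the curious, the arithmetic behind those claims is easy enough to sketch in a few lines of Python. The roughly 4 petaFLOPS of sparse FP8 for the H100 is our own approximation from its spec sheet, and we're assuming Blackwell's FP8 rate is half its FP4 rate:

# Back-of-the-envelope speedup math using the figures quoted above.
# The ~4 petaFLOPS sparse FP8 figure for the H100 is our approximation,
# and we assume Blackwell's FP8 rate is half its FP4 rate.

H100_FP8_SPARSE_PFLOPS = 4        # assumed H100 sparse FP8 throughput
BLACKWELL_FP4_SPARSE_PFLOPS = 20  # headline FP4 figure, liquid cooled
BLACKWELL_FP8_SPARSE_PFLOPS = 10  # assumed: half the FP4 rate

headline_speedup = BLACKWELL_FP4_SPARSE_PFLOPS / H100_FP8_SPARSE_PFLOPS  # ~5x
like_for_like = BLACKWELL_FP8_SPARSE_PFLOPS / H100_FP8_SPARSE_PFLOPS     # ~2.5x

print(f"FP4 vs FP8 headline speedup: {headline_speedup:.1f}x")
print(f"FP8 vs FP8 gen-on-gen:       {like_for_like:.1f}x")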

At the time of writing, Blackwell encompasses three parts: the B100, B200, and Grace-Blackwell Superchip (GB200). Presumably there will be other Blackwell GPUs at some point, like the previously teased B40, which use a different die, or rather dies, but for now the three chips all share the same silicon.

And it's this silicon which is at least partially responsible for Blackwell's performance gains this generation. Each GPU is actually two reticle-limited compute dies, tied together via a 10 TBps NVLink high-bandwidth interface (NV-HBI) fabric, which allows them to function as a single accelerator. The two compute dies are flanked by a total of eight HBM3e memory stacks offering up to 192GB of capacity and 8 TBps of bandwidth. And unlike the H100 and H200, we're told the B100 and B200 have the same memory capacity and bandwidth.

While impressive, Nvidia is hardly the first to take the chiplet route. AMD's MI300-series accelerators, which we looked at in December, are objectively more complex and rely on both 2.5D and 3D packaging tech to stitch together as many as 13 chiplets into a single part. Then there are Intel's GPU Max parts, which use even more chiplets. We'll dig into how Nvidia's new Blackwell chips shake out against the competition in a bit, but first let's touch on the elephant in the room: power.

AI's power and thermal demands hit home

Even before Blackwell's debut, datacenter operators were already feeling the heat associated with supporting massive clusters of Nvidia's 700W H100.

With twice the silicon filling out Nvidia's latest GPU, it should come as no surprise that it runs a little hotter - or, at least, it can, given ideal operating conditions.

With the B100, B200, and GB200, the key differentiator comes down to power and performance rather than memory configuration. According to Nvidia, the silicon can operate anywhere between 700W and 1,200W depending on the SKU and the type of cooling used.

Within each of these power regimes, the silicon understandably performs differently. According to Nvidia, air-cooled HGX B100 systems are able to squeeze 14 petaFLOPS of FP4 out of each GPU while sticking to the same 700W power target as the H100. This means that if your datacenter can already handle Nvidia's DGX H100 systems, you shouldn't run into trouble adding a couple of B100 nodes to your cluster.

Where things get interesting is with the B200. In an air-cooled HGX or DGX configuration, each GPU can push 18 petaFLOPS of FP4 while sucking down a kilowatt. According to Nvidia, its DGX B200 chassis with eight B200 GPUs will consume roughly 14.3kW - something that's going to require roughly 60kW of rack power and thermal headroom to handle.
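At 14.3kW apiece, that 60kW figure works out to roughly four DGX B200 systems per rack. A quick sketch of the napkin math, with the rack budgets below being purely illustrative rather than vendor guidance:

# Rough rack-planning arithmetic based on the ~14.3kW-per-chassis figure
# Nvidia quotes for the DGX B200. The rack power budgets are illustrative
# assumptions, not vendor guidance.

DGX_B200_KW = 14.3

for rack_budget_kw in (20, 40, 60, 120):
    systems = int(rack_budget_kw // DGX_B200_KW)
    print(f"{rack_budget_kw:>4} kW rack -> {systems} DGX B200 system(s), "
          f"~{systems * DGX_B200_KW:.1f} kW drawn")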

For newer datacenters built with AI clusters in mind, this shouldn't be an issue, but for existing facilities it may not be so easy.

Speaking of AI datacenters, reaching Blackwell's full potential will require switching over to liquid cooling. In a liquid-cooled configuration, Nvidia says the chip can dissipate up to 1,200W while pumping out the full 20 petaFLOPS of FP4.

All of this is to say, while liquid cooling isn't a must this generation, if you want to get the most out of Nvidia's flagship silicon, you're going to need it.

Nvidia doubles up on GPU compute with second-gen Superchips

Nvidia's most powerful GPUs can be found in its GB200. Similar to Grace-Hopper, the new Grace-Blackwell Superchip meshes Nvidia's existing 72-core Grace CPU with its Blackwell GPUs using the NVLink-C2C interconnect.

But rather than a single H100 GPU, the GB200 packs a pair of Blackwell accelerators, boosting it to 40 petaFLOPS of FP4 performance and 384GB of HBM3e memory.

We asked Nvidia for clarification on the GB200's total power draw and were told "no further details provided at this time." However, considering the older GH200 was rated for 1,000W split between a 700W GPU and the Arm CPU, that suggests that, under peak load, the Grace-Blackwell part, with its twin 1,200W GPUs, is capable of sucking down somewhere in the neighborhood of 2,700W. So it's not surprising that Nvidia would skip straight to liquid cooling for this beast.
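That 2,700W figure is nothing more than napkin math on our part - it assumes both GPUs run at their 1,200W liquid-cooled limit and that Grace plus its memory account for roughly the 300W implied by the GH200's rating:

# Speculative estimate of GB200 peak power, since Nvidia hasn't published a
# TDP. Assumes each Blackwell GPU runs at its 1,200W liquid-cooled limit and
# that the Grace CPU plus memory account for roughly the ~300W implied by
# the GH200's rating (1,000W total minus a 700W GPU).

GPU_W = 1200                    # per Blackwell GPU, liquid cooled
GPUS_PER_GB200 = 2
GRACE_AND_MEM_W = 1000 - 700    # inferred from the GH200

estimate_w = GPUS_PER_GB200 * GPU_W + GRACE_AND_MEM_W
print(f"Estimated GB200 peak draw: ~{estimate_w} W")    # ~2,700 W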

Ditching the bulky heat spreaders for a couple of cold plates allowed Nvidia to cram two of these accelerators into a slim 1U chassis capable of pushing 80 petaFLOPS of FP4 or 40 petaFLOPS at FP8.

Compared to last gen, this dual-GB200 system is capable of churning out more FLOPS than Nvidia's 8U, 10.2kW DGX H100 - 40 petaFLOPS vs 32 petaFLOPS at FP8 - while consuming an eighth of the space.
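In terms of compute density, the figures quoted here shake out roughly like this:

# Compute-density comparison implied above: a 1U dual-GB200 node versus an
# 8U DGX H100, both using the FP8 figures quoted in this article.

systems = {
    "dual GB200 (1U)": {"pflops_fp8": 40, "rack_units": 1},
    "DGX H100 (8U)":   {"pflops_fp8": 32, "rack_units": 8},
}

for name, spec in systems.items():
    density = spec["pflops_fp8"] / spec["rack_units"]
    print(f"{name}: {spec['pflops_fp8']} petaFLOPS in {spec['rack_units']}U "
          f"-> {density:.0f} petaFLOPS per U")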

Nvidia's rack-scale systems get an NVLink boost

These dual-GB200 systems form the backbone of Nvidia's new NVL72 rack-scale AI systems, which are designed to support large-scale training and inference on models scaling to trillions of parameters.

Each rack comes equipped with 18 nodes for a total of 36 Grace CPUs and 72 Blackwell accelerators. These nodes are then interconnected via a bank of nine NVLink switches, enabling them to behave like a single GPU node with 13.5TB of HBM3e.

This is actually the same technology employed in Nvidia's past DGX systems to make eight GPUs behave as one. The difference being that, using dedicated NVLink appliances, Nvidia is able to support many more GPUs.

According to Nvidia, the approach allows a single NVL72 rack system to support models of up to 27 trillion parameters - presumably when using FP4. In training, Nvidia says the system is good for 720 petaFLOPS at what we assume is dense FP4 precision. For inferencing workloads, meanwhile, we're told the system will do 1.44 exaFLOPS at FP4. If that's not enough horsepower, eight NVL72 racks can be networked together to form Nvidia's DGX GB200 Superpod.
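Those rack-level claims line up with the per-GPU figures quoted earlier, as a quick sanity check shows - with the caveat that the 10 petaFLOPS dense FP4 rate is our assumption, not Nvidia's:

# Sanity-checking the NVL72's rack-level numbers against the per-GPU figures
# quoted earlier. Dense FP4 at 10 petaFLOPS per GPU is our assumption - half
# the 20 petaFLOPS sparse figure - not an Nvidia number.

NODES = 18
GPUS_PER_NODE = 4            # two GB200 Superchips per node, two GPUs each
HBM_PER_GPU_GB = 192
DENSE_FP4_PFLOPS = 10        # assumed, per GPU
SPARSE_FP4_PFLOPS = 20       # per GPU, liquid cooled

gpus = NODES * GPUS_PER_NODE
print(f"GPUs: {gpus}")                                               # 72
print(f"HBM3e: ~{gpus * HBM_PER_GPU_GB / 1000:.1f} TB")              # ~13.8 TB, a touch above the quoted 13.5TB
print(f"Training, dense FP4: {gpus * DENSE_FP4_PFLOPS} petaFLOPS")   # 720
print(f"Inference, sparse FP4: {gpus * SPARSE_FP4_PFLOPS / 1000:.2f} exaFLOPS")  # 1.44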

If any of this sounds familiar, that's because this isn't the first time we've seen this rack-scale architecture from Nvidia. Last fall, Nvidia showed off a nearly identical architecture called the NVL32, which it was deploying for AWS. That system used 16 dual-GH200 Superchip systems connected via NVLink switch appliances for a total of 128 petaFLOPS of sparse FP8 performance.

However, the NVL72 design doesn't just pack in more GPUs; both the NVLink and network switches used to stitch the whole thing together have gotten an upgrade too.

The system's NVLink switches now each feature a pair of Nvidia's fifth-gen 7.2 TBps NVLink switch ASICs, giving each GPU 1.8 TBps of all-to-all bidirectional bandwidth - twice that of last gen.

Nvidia has also doubled the per-port bandwidth of its networking gear to 800 Gbps this time around - though reaching these speeds will likely require using either Nvidia's ConnectX-8 or BlueField-3 SuperNICs in conjunction with its Quantum-3 or Spectrum-4 switches.

Nvidia's rack-scale architecture appears to be a hit with the major cloud providers too, with Amazon, Microsoft, Google, and Oracle signing up to deploy instances based on the design. We've also learned that AWS's Project Ceiba has been upgraded to use 20,000 accelerators.

How Blackwell stacks up so far

While Nvidia may dominate the AI infrastructure market, it's hardly the only name out there, with heavy hitters like Intel and AMD rolling out new Gaudi and Instinct accelerators, cloud providers pushing custom silicon, and AI startups like Cerebras and SambaNova all vying for a slice of the action.

And with demand for AI accelerators expected to far outstrip supply throughout 2024, winning share doesn't always mean having faster chips, just ones available to ship.

While we don't know much about Intel's upcoming Gaudi 3 chips just yet, we can make some comparisons to AMD's MI300X GPUs, launched back in December.

As we mentioned earlier, the MI300X is something of a silicon sandwich, which uses advanced packaging to vertically stack eight CDNA 3 compute dies onto four I/O dies that provide high-speed communications between the GPUs and 192GB of HBM3 memory.

In terms of performance, the MI300X promised a 30 percent advantage in FP8 floating-point calculations and a nearly 2.5x lead in HPC-centric double-precision workloads compared to Nvidia's H100.

Comparing the 750W MI300X against the 700W B100, Nvidia's chip is 2.67x faster in sparse performance. And while both chips now pack 192GB of high-bandwidth memory, the Blackwell part's memory is 2.7 TB/s faster.

Memory bandwidth has already proven to be a major indicator of AI performance, particularly when it comes to inferencing. Nvidia's H200 is essentially a bandwidth-boosted H100. Yet, despite pushing the same FLOPS as the H100, Nvidia claims it's twice as fast in models like Meta's Llama 2 70B.
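The intuition is that token generation is largely bandwidth bound: each new token requires streaming the model's weights out of memory, so per-GPU throughput is roughly bandwidth divided by model size. The simplified roofline sketch below uses illustrative numbers and ignores batching, KV caches, and the capacity advantage that also feeds Nvidia's 2x claim:

# Simplified roofline for bandwidth-bound LLM inference: tokens/sec is
# roughly memory bandwidth divided by the bytes of weights streamed per
# token. Model size, weight precision, and bandwidth figures are
# illustrative assumptions; real throughput depends on batching, KV-cache
# traffic, and whether the model even fits in a single GPU's memory.

MODEL_PARAMS = 70e9          # a Llama 2 70B-class model
BYTES_PER_PARAM = 2          # FP16/BF16 weights

model_bytes = MODEL_PARAMS * BYTES_PER_PARAM

for name, bw_tbps in (("H100, ~3.35 TB/s", 3.35),
                      ("H200, ~4.8 TB/s", 4.8),
                      ("Blackwell, ~8 TB/s", 8.0)):
    tokens_per_sec = bw_tbps * 1e12 / model_bytes
    print(f"{name}: ~{tokens_per_sec:.0f} tokens/s per GPU (idealized upper bound)")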

While Nvidia has a clear lead at lower precision, it may have come at the expense of double precision performance, an area where AMD has excelled in recent years, winning multiple high-profile supercomputer awards.

According to Nvidia, the Blackwell GPU is capable of delivering 45 teraFLOPS of FP64 tensor core performance. That's a bit of a regression from the 67 teraFLOPS of FP64 matrix performance delivered by the H100, and puts it at a disadvantage against AMD's MI300X, which manages 81.7 teraFLOPS of FP64 vector and 163 teraFLOPS of FP64 matrix performance.

There's also Cerebras, which recently showed off its third-gen waferscale AI accelerator. The monster 900,000-core processor is the size of a dinner plate and designed specifically for AI training.

Cerebras claims each of these chips can squeeze 125 petaFLOPS of highly sparse FP16 performance from 23kW of power. Compared to the H100, Cerebras says the chip is about 62x faster at half precision.

However, pit the WSE-3 against Nvidia's flagship Blackwell parts and that lead shrinks considerably. From what we understand, Nvidia's top-specced chip should deliver about 5 petaFLOPS of sparse FP16 performance. That cuts Cerebras's lead down to 25x. But as we pointed out at the time, all of this depends on your model being able to take advantage of sparsity.
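The ratios follow directly from the quoted figures - with the roughly 2 petaFLOPS of sparse FP16 for the H100 being our own estimate from its spec sheet:

# Reconstructing the sparse half-precision comparison from the figures
# quoted above. The H100's ~2 petaFLOPS of sparse FP16 is our estimate,
# not a number given in this article.

WSE3_SPARSE_FP16_PFLOPS = 125
H100_SPARSE_FP16_PFLOPS = 2        # assumed
BLACKWELL_SPARSE_FP16_PFLOPS = 5   # "about 5 petaFLOPS", per the above

print(f"WSE-3 vs H100:      ~{WSE3_SPARSE_FP16_PFLOPS / H100_SPARSE_FP16_PFLOPS:.0f}x")
print(f"WSE-3 vs Blackwell: ~{WSE3_SPARSE_FP16_PFLOPS / BLACKWELL_SPARSE_FP16_PFLOPS:.0f}x")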

Don't cancel your H200 orders just yet

Before you get too excited about Nvidia's Blackwell parts, it's going to be a while before you can get your hands on them.

Nvidia tells The Register the B100, B200, and GB200 will all ship in the second half of the year, but it's not clear exactly when or in what volume. It wouldn't surprise us if the B200 and GB200 didn't start ramping until sometime in early 2025.

The reason is simple. Nvidia hasn't shipped its HBM3e-packed H200 chips yet. Those parts are due out in the second quarter of this year. ®

 

https://www.theregister.com//2024/03/18/nvidia_turns_up_the_ai/
