Future Tech

What's going on with Eos, Nvidia's incredible shrinking supercomputer?

Tan KW
Publish date: Tue, 20 Feb 2024, 07:23 AM

Analysis Nvidia can't seem to make up its mind just how big its Eos supercomputer is.

In a blog post this month re-revealing the ninth most powerful supercomputer from last fall's TOP500 ranking, the GPU slinger said the machine was built from 576 DGX H100 systems totaling 4,608 GPUs. That's about what we expected from the system back when it was first announced.

While impressive in its own right, that's less than half the number of GPUs Nvidia claimed the system had back in November when it took to the net to talk up its performance in a variety of MLPerf AI training benchmarks.

Back then, the big iron boasted a complement of 10,752 H100 GPUs, which would have spanned 1,344 DGX systems. With nearly 4 petaFLOPS of sparse FP8 performance per GPU, that super would have been capable of 42.5 exaFLOPS of peak AI compute. Compare that to the 18.4 AI exaFLOPS Nvidia says the system is capable of outputting today, and Eos appears to have lost some muscle tone.
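For the curious, here's that back-of-the-envelope math as a quick sketch of our own, assuming Nvidia's spec-sheet figure of roughly 3.96 petaFLOPS of sparse FP8 per SXM H100:

# Rough sketch of the peak "AI compute" claims, assuming ~3.96 petaFLOPS
# of sparse FP8 per H100 SXM (Nvidia's spec-sheet number).
SPARSE_FP8_PFLOPS = 3.958

mlperf_gpus = 10_752  # November MLPerf config: 1,344 DGX H100s x 8 GPUs
blog_gpus   = 4_608   # config in this month's blog post: 576 DGX H100s x 8 GPUs

print(mlperf_gpus * SPARSE_FP8_PFLOPS / 1_000)  # ~42.6 exaFLOPS
print(blog_gpus * SPARSE_FP8_PFLOPS / 1_000)    # ~18.2 exaFLOPS, which Nvidia rounds to 18.4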

Oh, and if you're not familiar with the term AI exaFLOPS, it's a metric commonly used to describe floating point performance at lower precision than you'd typically see in double-precision HPC benchmarks, like LINPACK. In this case, Nvidia is arriving at these figures using sparse 8-bit floating point math, but another vendor, like Cerebras, might calculate AI FLOPS using FP16 or BF16.

So where did the other 6,144 GPUs, roughly 57 percent of the system, go? We put this question to Nvidia and were told "the supercomputer used for MLPerf LLM training with 10,752 H100 GPUs is a different system built with the same DGX SuperPOD architecture."

"The system ranked number nine on the 2023 TOP500 list is the 4,608 GPU Eos system featured in today's blog post and video," the spokesperson added.

Except that doesn't appear to be true either. Eos's TOP500 score of 121 petaFLOPS of FP64, out of an estimated peak of 188.65 petaFLOPS, is too low for a 4,608-GPU machine. The latter figure should sit somewhere between the 275 petaFLOPS originally claimed and the 308 petaFLOPS of FP64 that Nvidia's spec sheet says 4,608 H100s should actually net you.

So while Nvidia hasn't admitted exactly how many GPUs were used, based on these performance figures we can estimate the November TOP500 run was made using somewhere between 2,816 and 3,161 GPUs.
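That range is our own back-of-the-envelope estimate, not Nvidia's. The sketch below assumes the H100 SXM's 67 teraFLOPS FP64 tensor rate at one end, and the roughly 59.7 teraFLOPS per GPU implied by the originally claimed 275 petaFLOPS at the other:

# Working backwards from Eos's estimated TOP500 peak (Rpeak) of 188.65 petaFLOPS.
RPEAK_PFLOPS = 188.65

fp64_tensor_tflops  = 67.0             # H100 SXM FP64 tensor core spec
fp64_implied_tflops = 275_000 / 4_608  # ~59.7 TFLOPS/GPU implied by the original 275 PFLOPS claim

print(RPEAK_PFLOPS * 1_000 / fp64_tensor_tflops)   # ~2,816 GPUs
print(RPEAK_PFLOPS * 1_000 / fp64_implied_tflops)  # ~3,161 GPUs

# For scale, the full 10,752-GPU config would peak at roughly:
print(10_752 * fp64_tensor_tflops / 1_000)         # ~720 petaFLOPS of FP64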

Nvidia's decision to put forward a smaller version of the system on last fall's TOP500 ranking, when it had already demonstrated a much larger Eos cluster, strikes us as odd.

With more than ten thousand H100s on board, the larger Eos config would have boasted 720 petaFLOPS of peak double-precision performance. Granted, real-world performance would have been a fair bit lower.

We asked Nvidia for clarification on these discrepancies and were told that the timeline didn't permit a TOP500 run on the larger system. Why? They didn't say. "Our teams are racing towards GTC and are not able to provide more details on last year's TOP500 submission at this time," a spokesperson told The Register.

Having said that, Nvidia wouldn't be the only one that couldn't get a full run of its machine done in time. Argonne National Laboratory's Aurora supercomputer, the flagship for Intel's Xeon and GPU Max families, only managed a partial run too. This suggests we may catch a glimpse of an even more powerful Eos system on this spring's TOP500.

If we had to guess, Nvidia may have run into trouble with stability on the full cluster. The LINPACK benchmark is, as you might have guessed, quite the stress test for any system, let alone one assembled from hundreds of GPU nodes.

In any case, Eos's shapeshifting does highlight one of the conveniences associated with Nvidia's modular DGX SuperPOD architecture. It can be scaled out and broken into chunks depending on what it's needed for.

Each SuperPOD is made up of what Nvidia calls scalable units (SUs), each comprising 32 DGX H100 nodes of eight GPUs apiece, stitched together with the company's 400Gb/s Quantum-2 InfiniBand network. Additional SUs can be added to scale the system to support larger workloads. Officially, Nvidia supports up to four SUs per pod, but the company notes that larger configurations are possible, which is clearly the case with Eos.
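In raw GPU terms, those building blocks stack up as follows, per the counts above and Nvidia's DGX SuperPOD reference architecture:

# SuperPOD building blocks.
GPUS_PER_NODE = 8                              # DGX H100
NODES_PER_SU  = 32                             # one scalable unit
GPUS_PER_SU   = GPUS_PER_NODE * NODES_PER_SU   # 256 GPUs

print(4 * GPUS_PER_SU)        # 1,024 GPUs in an "official" four-SU pod
print(576 // NODES_PER_SU)    # 18 SUs in the 4,608-GPU Eos of the blog post
print(1_344 // NODES_PER_SU)  # 42 SUs in the 10,752-GPU MLPerf machine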

As for how big Eos really is, it appears the answer to that depends entirely on how big Nvidia wants it to be at any given moment. ®

Want more analysis? Getting your hands on an H100 GPU is probably the most difficult thing in the world right now - even for Nvidia itself.

 

https://www.theregister.com//2024/02/19/eos_nvidia_supercomputer/
