Nvidia rival Cerebras says it's revived Moore's Law with third-gen waferscale chips

Publish date: Wed, 13 Mar 2024, 11:47 PM

Cerebras revealed its latest dinner-plate sized AI chip on Wednesday, which it claims offers twice the performance per watt of its predecessor, alongside a collaboration with Qualcomm aimed at accelerating machine learning inferencing.

The chip, dubbed the WSE-3, is Cerebras' third-gen waferscale processor and measures in at a whopping 46,225mm2 (that's about 71.6 square inches in freedom units). The 4 trillion transistor part is fabbed on TSMC's 5nm process and is imprinted with 900,000 cores and 44GB of SRAM, good for 125 AI petaFLOPS of performance - which in this case refers to highly sparse FP16, more on that in a minute.

A single WSE-3 forms the basis of Cerebras' new CS-3 platform, which it claims offers 2x higher performance, while consuming the same 23kW as the older CS-2 platform. "So, this would be a true Moore's Law step," CEO Andrew Feldman boasted during a press briefing Tuesday. "We haven't seen that in a long time in our industry."
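
For the curious, the headline figures pencil out with some quick back-of-the-envelope math - everything below comes from Cerebras' quoted numbers, with the per-mm2 and per-kW values derived by us rather than claimed by the company:

```python
# Quick back-of-the-envelope check on the headline WSE-3/CS-3 numbers quoted
# above. Inputs are Cerebras' own figures; the per-mm2 and per-kW values are
# derived here, not quoted by the company.

die_area_mm2 = 46_225
transistors = 4e12
sparse_fp16_pflops = 125      # Cerebras' sparse FP16 figure
system_power_kw = 23          # CS-3 system power, same as the CS-2

print(die_area_mm2 / 645.16)                 # ~71.6 square inches
print(transistors / die_area_mm2 / 1e6)      # ~86.5 million transistors per mm2
print(sparse_fp16_pflops / system_power_kw)  # ~5.4 sparse petaFLOPS per kW
```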

Compared to Nvidia's H100, the WSE-3 is roughly 57x larger and boasts roughly 62x the sparse FP16 performance. But considering the CS-3's size and power consumption, it might be more accurate to compare it to a pair of 8U DGX systems [PDF] with a total of 16 H100s inside. In this comparison, the CS-3 is still about 4x faster, but that's only when looking at sparse FP16 performance.

The lead over the two DGX H100 systems shrinks to about 2x when you take into account that Nvidia's chips support FP8 - though that wouldn't exactly be an apples-to-apples comparison.
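
Here's roughly where those multiples come from. Note that the per-H100 numbers are our assumptions, pulled from Nvidia's public spec sheet (SXM form factor, with sparsity), not anything Cerebras provided:

```python
# Where the multiples come from. WSE-3 numbers are Cerebras' own; the per-H100
# figures below are assumptions taken from Nvidia's public spec sheet (SXM,
# with sparsity), not from Cerebras' briefing.

wse3_area_mm2 = 46_225
wse3_sparse_fp16_pflops = 125

h100_area_mm2 = 814                 # approximate H100 die size
h100_sparse_fp16_pflops = 1.979     # ~1,979 TFLOPS sparse FP16 per GPU
h100_sparse_fp8_pflops = 3.958      # ~3,958 TFLOPS sparse FP8 per GPU

print(wse3_area_mm2 / h100_area_mm2)                              # ~57x the die area
print(wse3_sparse_fp16_pflops / h100_sparse_fp16_pflops)          # ~63x a single H100
print(wse3_sparse_fp16_pflops / (16 * h100_sparse_fp16_pflops))   # ~4x two DGX H100s
print(wse3_sparse_fp16_pflops / (16 * h100_sparse_fp8_pflops))    # ~2x once FP8 is in play
```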

One major advantage Cerebras has is memory bandwidth. Thanks to the 44GB of onboard SRAM - yes, you read that correctly - Cerebras' latest accelerator boasts 21PBps of memory bandwidth, compared to the 3.9TBps the H100's HBM3 maxes out at.

That's not to say Cerebras' systems are faster in every scenario. The company's performance claims rely heavily on sparsity.

While Nvidia is able to achieve a doubling in floating point operations using sparsity, Cerebras claims to have achieved a roughly 8x improvement.
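
The intuition is simple enough: weights that are zero don't need multiplying at all, so hardware able to skip them does proportionally less work. A toy sketch (illustrative only, not a benchmark):

```python
import numpy as np

# Toy illustration of why sparsity multiplies effective throughput: zeroed-out
# weights need no multiply-accumulates, so hardware that can skip them (as
# Cerebras claims for unstructured sparsity) does proportionally less work.
# Illustrative numbers only - this is not a benchmark.

rng = np.random.default_rng(0)
weights = rng.standard_normal((1024, 1024))
keep = rng.random(weights.shape) < 0.125     # keep roughly 1 in 8 weights
sparse_weights = weights * keep

dense_macs = weights.size
useful_macs = np.count_nonzero(sparse_weights)
print(dense_macs / useful_macs)              # ~8x fewer MACs actually needed
```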

That means Cerebras' new CS-3 systems should be a little slower in dense FP16 workloads than a pair of DGX H100 servers consuming roughly the same amount of energy and space - somewhere around 15 petaFLOPS versus 15.8 petaFLOPS (16 H100s at 989 dense teraFLOPS apiece). We've asked Cerebras for clarification on the CS-3's dense floating-point performance; we'll let you know if we hear anything back.
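
Our working, for the record - the dense CS-3 number is an estimate obtained by dividing the sparse figure by Cerebras' claimed ~8x sparsity gain, not something the company has published:

```python
# The dense-FP16 estimate above, reconstructed. Dividing Cerebras' sparse
# figure by its claimed ~8x sparsity gain is our approximation, not a number
# the company has published; the H100 side uses Nvidia's 989 dense FP16
# TFLOPS per GPU.

cs3_sparse_pflops = 125
claimed_sparsity_gain = 8
h100_dense_fp16_tflops = 989

cs3_dense_estimate = cs3_sparse_pflops / claimed_sparsity_gain   # ~15.6 PFLOPS
dgx_pair_dense = 16 * h100_dense_fp16_tflops / 1000              # ~15.8 PFLOPS
print(cs3_dense_estimate, dgx_pair_dense)
```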

Considering the speed-up on offer, we have a hard time imagining anyone opting for Cerebras' infrastructure if they couldn't take advantage of sparsity - but even if you can't, the dense numbers are pretty dang close.

Cerebras is already working to put its new systems to work in the third stage of its Condor Galaxy AI supercluster. Announced last year, Condor Galaxy is being developed in collaboration with G42 and will eventually span nine sites around the globe.

The first two systems - CG-1 and CG-2 - were installed last year, each featuring 64 of Cerebras' CS-2 machines and delivering 4 AI exaFLOPS.

On Wednesday, Cerebras revealed that CG-3 was destined for Dallas, Texas, and would use the newer CS-3 platform, boosting the site's performance to 8 AI exaFLOPS. Assuming the remaining six sites also get 64 CS-3s apiece, the nine-site cluster would actually boast 64 AI exaFLOPS of collective compute rather than the 36 exaFLOPS of sparse FP16 initially promised.

However, it's worth noting that Cerebras' CS-3 isn't limited to clusters of 64. The company claims that it can now scale to up to 2,048 systems capable of pushing 256 AI exaFLOPS.
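
If you want to check our sums on Condor Galaxy, the arithmetic - assuming every CS-3 site gets the full 64 systems - looks like this:

```python
# Condor Galaxy arithmetic as described above: two existing CS-2 sites at
# 4 AI exaFLOPS apiece, seven CS-3 sites at 8 AI exaFLOPS apiece (assuming
# each gets a full 64 systems), plus the claimed 2,048-system upper bound.

existing_cs2_sites_ef = 2 * 4        # CG-1 and CG-2
future_cs3_sites_ef = 7 * 8          # CG-3 through CG-9
print(existing_cs2_sites_ef + future_cs3_sites_ef)   # 64 AI exaFLOPS across nine sites

cs3_sparse_pflops = 125
print(2_048 * cs3_sparse_pflops / 1_000)             # 256 AI exaFLOPS at the claimed max
```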

According to Feldman, such a system would be capable of training Meta's Llama 70B model in about a day.
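
As a very rough sanity check on that claim - using the common 6 x parameters x tokens rule of thumb, Llama 2 70B's roughly two trillion training tokens, and our own assumptions about dense throughput and utilization - the math isn't far off a day:

```python
# Very rough sanity check on the one-day Llama 70B claim, using the common
# ~6 * parameters * tokens approximation for training FLOPs and Llama 2 70B's
# ~2 trillion training tokens. The dense-throughput and utilization figures
# are our assumptions, not Cerebras numbers.

params = 70e9
tokens = 2e12
train_flops = 6 * params * tokens                # ~8.4e23 FLOPs

cluster_dense_flops = 2_048 * 125e15 / 8         # ~32 exaFLOPS dense (sparse / 8)
utilization = 0.4                                # assumed

hours = train_flops / (cluster_dense_flops * utilization) / 3_600
print(hours)                                     # ~18 hours - roughly a day
```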

Qualcomm, Cerebras collab on optimized inference

Alongside its next-gen accelerators, Cerebras also revealed it's working with Qualcomm to build optimized models for the Arm SoC giant's datacenter inference chips.

The two companies have been teasing the prospect of a collab going back to at least November. A release revealing Qualcomm's Cloud AI100 Ultra accelerator included a rather peculiar quote by Feldman praising the chip. 

If you missed its launch, the 140W single-slot accelerator boasts 64 AI cores and 128GB of LPDDR4x memory, good for 870 TOPS at Int8 precision and 548GB/s of memory bandwidth.

A few months later, a Cerebras blog post highlighted how Qualcomm was able to get a 10 billion parameter model running on a Snapdragon SoC.

The partnership, now official, will see the two companies optimize models for the AI 100 Ultra that take advantage of techniques like sparsity, speculative decoding, MX6, and network architecture search.

As we've already established, sparsity, when properly implemented, has the potential to more than double an accelerator's performance. Speculative decoding, Feldman explains, is a process of improving the efficiency of the model in deployment by using a small, lightweight model to generate the initial response, and then using a larger model to check the accuracy of that response.
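
In code terms, that loop looks something like the minimal sketch below - greedy acceptance for simplicity (real deployments use a probabilistic accept/reject rule), with draft_next_token and big_model_argmax as hypothetical stand-ins for actual model calls:

```python
# Minimal sketch of the speculative decoding loop Feldman describes, using
# greedy acceptance for simplicity (production schemes use a probabilistic
# accept/reject rule). draft_next_token and big_model_argmax are hypothetical
# stand-ins for real model calls.

def speculative_decode_step(prompt, draft_next_token, big_model_argmax, k=4):
    # 1. The small, cheap draft model proposes k tokens one at a time.
    draft, ctx = [], list(prompt)
    for _ in range(k):
        tok = draft_next_token(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. The big model checks all k drafted positions in a single (parallel)
    #    forward pass - this is why checking is cheaper than generating.
    checked = big_model_argmax(prompt, draft)

    # 3. Accept drafted tokens until the first disagreement, then fall back
    #    to the big model's token at that position.
    accepted = []
    for drafted, verified in zip(draft, checked):
        if drafted == verified:
            accepted.append(drafted)
        else:
            accepted.append(verified)
            break
    return accepted

# Toy usage with dummy "models":
out = speculative_decode_step(
    [1, 2, 3],
    draft_next_token=lambda ctx: ctx[-1] + 1,             # drafts 4, 5, 6, 7
    big_model_argmax=lambda prompt, draft: [4, 5, 9, 7],  # disagrees at position 3
)
print(out)   # [4, 5, 9] - two drafts accepted, then the big model's pick
```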

"It turns out, to generate text is more compute intensive than to check text," he said. "By using the big model to check it's faster and uses less compute."

The two companies are looking at MX6 to help reduce the memory footprint of models. MX6 is a form of quantization that can be used to shrink a model by compressing its weights to a lower precision. Meanwhile, network architecture search is a process of automating the design of neural networks for specific tasks in order to boost their performance.
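
To give a flavor of the block-scaling idea behind MX-style formats - with the caveat that the actual MX6 element encoding is more involved than the plain integer rounding used here - a hedged sketch:

```python
import numpy as np

# Illustrative sketch of the block-scaled idea behind MX-style formats: each
# small block of weights shares one power-of-two scale and the elements are
# stored at low precision, shrinking the memory footprint. Real MX6 uses a
# specific low-bit element encoding plus a shared block scale; the plain
# integer rounding below is a simplification, not the actual format.

def block_quantize(weights, block=32, elem_bits=6):
    w = weights.reshape(-1, block)
    qmax = 2 ** (elem_bits - 1) - 1
    # One shared power-of-two scale per block, sized so the block max fits.
    max_abs = np.max(np.abs(w), axis=1, keepdims=True)
    scale = 2.0 ** np.ceil(np.log2(max_abs / qmax + 1e-12))
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return (q * scale).reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
print(np.abs(w - block_quantize(w)).mean())   # small error at ~6 bits per weight
```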

Combined, Cerebras claims these techniques contribute to a 10x improvement in performance per dollar. ®

 

https://www.theregister.com//2024/03/13/cerebras_claims_to_have_revived/
