Future Tech

AI cloud startup TensorWave bets AMD can beat Nvidia

Tan KW
Publish date: Tue, 16 Apr 2024, 05:16 PM

Specialist cloud operators skilled at running hot and power-hungry GPUs and other AI infrastructure are emerging. While some of these players, such as CoreWeave, Lambda, and Voltage Park, have built their clusters using tens of thousands of Nvidia GPUs, others are turning to AMD instead.

An example of the latter is bit barn startup TensorWave, which earlier this month began racking up systems powered by AMD's Instinct MI300X. It plans to lease the chips at a fraction of the cost charged to access Nvidia accelerators.

TensorWave co-founder Jeff Tatarchuk believes AMD's latest accelerators have many fine qualities. For starters, you can actually buy them. TensorWave has secured a large allocation of the parts.

By the end of 2024, TensorWave aims to have 20,000 MI300X accelerators deployed across two facilities, and plans to bring additional liquid-cooled systems online next year.

AMD's latest AI silicon is also faster than Nvidia's much coveted H100. "Just in raw specs, the MI300x dominates H100," Tatarchuk said.

Launched at AMD's Advancing AI event in December, the MI300X is the chip design firm's most advanced accelerator to date. The 750W chip uses a combination of advanced packaging techniques to stitch together 12 chiplets - 20 if you count the HBM3 modules - into a single GPU that's claimed to be 32 percent faster than Nvidia's H100.

In addition to higher floating point performance, the chip also boasts a larger 192GB of HBM3 memory capable of delivering 5.3TB/s of bandwidth versus the 80GB and 3.35TB/s claimed by the H100.
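To put those memory figures side by side, here is a minimal sketch that uses only the vendor-claimed numbers quoted above. The dictionary and function names are illustrative, and the ratios describe the spec sheets rather than measured performance.

```python
# Vendor-claimed memory specs as quoted in the article (not benchmark results).
MI300X = {"hbm_gb": 192, "bandwidth_tbs": 5.3}
H100 = {"hbm_gb": 80, "bandwidth_tbs": 3.35}


def spec_ratio(a, b, key):
    """Return how many times larger a[key] is than b[key]."""
    return a[key] / b[key]


capacity_ratio = spec_ratio(MI300X, H100, "hbm_gb")
bandwidth_ratio = spec_ratio(MI300X, H100, "bandwidth_tbs")

print(f"capacity: {capacity_ratio:.2f}x, bandwidth: {bandwidth_ratio:.2f}x")
# prints: capacity: 2.40x, bandwidth: 1.58x
```

On paper, that works out to 2.4 times the capacity and roughly 1.6 times the bandwidth per accelerator.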

As we've seen from Nvidia's H200 - a version of the H100 boosted by the inclusion of HBM3e - memory bandwidth is a major contributor to AI performance, particularly in inferencing on large language models.

Much like Nvidia's HGX and Intel's OAM designs, standard configurations of AMD's latest GPU require eight accelerators per node.

That’s the configuration the folks at TensorWave are busy racking and stacking.

"We have hundreds going in now and thousands going in the months to come," Tatarchuk said.

Racking them up

In a photo posted to social media, the TensorWave crew showed what appeared to be three 8U Supermicro AS-8125GS-TNMR2 systems racked up. This led us to question whether TensorWave's racks were power or thermally limited. After all, it's not unusual for these systems to pull in excess of 10kW when fully loaded.

It turns out that the folks at TensorWave hadn't finished installing the machines and that the firm is targeting four nodes with a total capacity of around 40kW per rack. These systems will be cooled using rear door heat exchangers (RDHx). As we've discussed in the past, these are rack-sized radiators through which cool water flows. As hot air exits a conventional server, it passes through the radiator which cools it to acceptable levels.
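The rack figures above can be checked with some back-of-the-envelope arithmetic. The sketch below assumes roughly 10kW per fully loaded eight-GPU node, as the article suggests. The final line goes further and assumes every one of the planned 20,000 accelerators sits in a rack of this density, which TensorWave has not stated, so treat the rack count as a rough illustration only.

```python
import math

# Figures quoted in the article.
GPU_POWER_W = 750          # MI300X rated power
GPUS_PER_NODE = 8          # standard node configuration
NODE_POWER_KW = 10         # approximate draw of a fully loaded node
NODES_PER_RACK = 4         # TensorWave's stated target
PLANNED_GPUS = 20_000      # end-of-2024 deployment goal

# Power drawn by the accelerators alone in one node; the remainder of the
# ~10kW goes to CPUs, memory, networking, fans, and conversion losses.
gpu_power_per_node_kw = GPU_POWER_W * GPUS_PER_NODE / 1000

rack_power_kw = NODE_POWER_KW * NODES_PER_RACK
gpus_per_rack = GPUS_PER_NODE * NODES_PER_RACK

# ASSUMPTION: all planned GPUs are deployed at this same rack density.
racks_needed = math.ceil(PLANNED_GPUS / gpus_per_rack)

print(gpu_power_per_node_kw, rack_power_kw, gpus_per_rack, racks_needed)
# prints: 6.0 40 32 625
```

Under those assumptions, the accelerators account for about 6kW of each node's draw, and a 40kW rack holds 32 GPUs.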

This cooling tech has become a hot commodity among datacenter operators looking to support denser GPU clusters, which has led to some supply chain challenges, TensorWave COO Piotr Tomasik said.

"There's a lot of capacity issues, even in the ancillary equipment around data centers right now," he said, specifically referencing RDHx as a pain point. "We've been successful thus far and we were very bullish on our ability to deploy them."

Longer term, however, TensorWave has its sights set on direct-to-chip cooling, which can be hard to deploy in datacenters that weren't designed to house GPUs, Tomasik said. "We're excited to deploy direct to chip cooling in the second half of the year. We think that that's going to be a lot better and easier with density."

Performance anxiety

Another challenge is confidence in AMD's performance. According to Tatarchuk, while there's a lot of enthusiasm around AMD offering an alternative to Nvidia, customers are not certain they will enjoy the same performance. "There's also a lot of 'We're not 100 percent sure if it's going to be as great as what we're currently used to on Nvidia'," he said.

In the interest of getting systems up and running as quickly as possible, TensorWave will launch its MI300X nodes using RDMA over Converged Ethernet (RoCE). These bare metal systems will be available for fixed lease periods, apparently for as little as $1/hr/GPU.

Scaling up

Over time, the outfit aims to introduce a more cloud-like orchestration layer for provisioning resources. Implementing GigaIO's PCIe 5.0-based FabreX technology to stitch together up to 5,750 GPUs in a single domain with more than a petabyte of high bandwidth memory is also on the agenda.
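The "more than a petabyte" figure is consistent with simple multiplication, assuming each of the 5,750 GPUs is an MI300X with 192GB of HBM3. That pairing is an inference from the numbers in this article, not a confirmed configuration, and the variable names below are illustrative.

```python
# ASSUMPTION: every GPU in the domain is a 192GB MI300X.
GPUS_IN_DOMAIN = 5_750
HBM_PER_GPU_GB = 192

total_hbm_gb = GPUS_IN_DOMAIN * HBM_PER_GPU_GB
total_hbm_pb = total_hbm_gb / 1_000_000  # decimal units: 1PB = 1,000,000GB

print(f"{total_hbm_gb:,}GB is about {total_hbm_pb:.3f}PB")
# prints: 1,104,000GB is about 1.104PB
```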

These so-called TensorNODEs are based on the SuperNODE architecture GigaIO showed off last year, which used a pair of PCIe switch appliances to connect up to 32 AMD MI210 GPUs together. In theory, this should allow a single CPU head node to address far more than the eight accelerators typically seen in GPU nodes today.

This approach differs from Nvidia's preferred design, which uses NVLink to stitch together multiple superchips into one big GPU. While NVLink is considerably faster, topping out at 1.8TB/s of bandwidth in its latest iteration compared to just 128GB/s on PCIe 5.0, it only supports configurations up to 576 GPUs.

TensorWave will fund its bit barn build by using its GPUs as collateral for a large round of debt financing, an approach used by other datacenter operators. Just last week, Lambda revealed it'd secured a $500 million loan to fund the deployment of "tens of thousands" of Nvidia's fastest accelerators.

Meanwhile, CoreWeave, one of the largest providers of GPUs for rent, was able to secure a massive $2.3 billion loan to expand its datacenter footprint.

"You would, you should expect us to have the same sort of announcement here later this year," Tomasik said. ®

 

https://www.theregister.com//2024/04/16/amd_tensorwave_mi300x/
