Another GPU cloud emerges. This time, upstart Foundry

Tan KW
Publish date: Wed, 14 Aug 2024, 06:06 AM

Yet another GPU cloud has coalesced amid the ongoing AI boom. Today, AI infrastructure startup Foundry announced its Cloud Platform is now available for limited access.

Founded in 2022 by former DeepMind researcher Jared Quincy Davis, Foundry is a relative newcomer to the arena, joining the likes of Lambda and CoreWeave, both of which have attracted billions in the race to cash in on AI infrastructure demand.

The Palo Alto, California-based cloudy upstart aims to be more than just another rent-a-GPU cluster provider. A major emphasis of the Foundry Cloud Platform (FCP) is helping users cut through the complexity of training, fine-tuning, and rolling out models for inference.

Specifically, Foundry claims that if a customer reserves 1,000 GPUs for a set number of hours, whether that adds up to days or weeks, they'll actually get all of that compute. This isn't a trivial task, especially with the larger clusters used for things like training, where the mean time to failure can be quite low.
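
To see why, consider the back-of-the-envelope math: if node failures are independent, the expected time between failures across a cluster shrinks roughly in proportion to the node count. Here's a minimal sketch; the per-node MTBF and node size are illustrative assumptions, not Foundry's figures.

```python
# Back-of-the-envelope: expected time between failures across a cluster,
# assuming independent node failures (illustrative numbers, not Foundry's).
node_mtbf_hours = 26_000   # assumed per-node mean time between failures (~3 years)
gpus_per_node = 8          # typical eight-GPU H100 box

for gpus in (1_000, 16_000):
    nodes = gpus // gpus_per_node
    cluster_mtbf_hours = node_mtbf_hours / nodes
    print(f"{gpus:>6} GPUs ({nodes} nodes): a failure roughly every "
          f"{cluster_mtbf_hours:.0f} hours")
```

At that rate, a weeks-long training run on a big reservation is all but guaranteed to hit hardware failures somewhere along the way, which is where the hot spares come in.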

As you might expect, Foundry's strategy for getting around this problem is to maintain a pool of reserved nodes that effectively function as hot spares for when problems inevitably crop up.

However, Foundry notes that this reserved pool isn't intended to sit idle. Instead, these nodes will be rented out to customers as preemptible spot instances, which Foundry plans to offer at prices 12-20x lower than competing cloud providers'.
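
Foundry hasn't detailed how its scheduler actually works, but the dual-use idea is straightforward enough to sketch: spot jobs occupy spare nodes until a reserved node fails, at which point a spot job gets preempted and its node is promoted into the reservation. Everything below, names and all, is a simplified assumption rather than Foundry's implementation.

```python
# Simplified sketch of a dual-use spare pool (assumed logic, not Foundry's
# actual scheduler): spares serve spot workloads until a reserved node fails.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    spot_job: str | None = None   # preemptible workload on this node, if any

@dataclass
class SparePool:
    spares: list[Node] = field(default_factory=list)

    def run_spot(self, job: str) -> bool:
        """Place a spot job on any idle spare node."""
        for node in self.spares:
            if node.spot_job is None:
                node.spot_job = job
                return True
        return False  # pool fully occupied

    def replace_failed(self, failed: str) -> Node:
        """A reserved node failed: preempt a spot job if needed, hand over a spare."""
        node = next((n for n in self.spares if n.spot_job is None), self.spares[0])
        if node.spot_job is not None:
            print(f"preempting spot job {node.spot_job!r} on {node.name}")
            node.spot_job = None  # signal/checkpoint would happen here
        self.spares.remove(node)
        print(f"{node.name} promoted into reservation, replacing {failed}")
        return node

pool = SparePool([Node("spare-0"), Node("spare-1")])
pool.run_spot("fine-tune-job")
pool.replace_failed("train-node-17")
```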

The logic here appears to be that these instances can be used for smaller, more scalable workloads like inferencing or fine-tuning. Foundry doesn't appear to be reinventing the wheel to do it either, despite claiming this spring that mainstream cloud providers were built to serve and scale web apps, not AI.

These spot instances will be associated with hosted Kubernetes clusters, disk state saving, and auto-mounting persistent storage. This should allow a properly configured inference workload to continue running uninterrupted even if one or more of the underlying nodes is re-tasked to maintain SLAs on larger training deployments.
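
On the workload side, surviving that sort of preemption is a well-worn pattern: trap the termination signal, checkpoint to the persistent volume, and let the replacement pod resume where things left off. A minimal sketch, assuming an auto-mounted /mnt/persistent path and a toy training loop, both of which are illustrative:

```python
# Minimal preemption-tolerant loop: checkpoint to persistent storage on SIGTERM.
# The mount path and checkpoint format are illustrative assumptions.
import json, os, signal

CKPT = "/mnt/persistent/checkpoint.json"   # assumed auto-mounted persistent volume
stop_requested = False

def on_sigterm(signum, frame):
    global stop_requested
    stop_requested = True                  # finish the current step, then checkpoint

signal.signal(signal.SIGTERM, on_sigterm)

# Resume from the last checkpoint if a previous instance was preempted.
step = 0
if os.path.exists(CKPT):
    with open(CKPT) as f:
        step = json.load(f)["step"]

while step < 10_000 and not stop_requested:
    step += 1                              # stand-in for one unit of real work

with open(CKPT, "w") as f:
    json.dump({"step": step}, f)           # replacement pod picks up from here
```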

This actually isn't a bad way to do it, and it's probably why Nvidia has been so keen to push containerized models, which it calls NIMs, as a means of deployment.

In terms of the actual hardware backing FCP, Foundry purports to have a selection of Nvidia GPUs deployed, including clusters of H100s, A100s, A40s, and A5000s, which are housed in tier-3 and tier-4 datacenters. The idea here is that the most powerful GPU out there isn't always the most cost-effective. "For workloads with flexible SLAs that can run in the background or outside peak hours, we'll ensure you don't pay more than you should," Davis claimed in an earlier blog post.

However, it still isn't clear how large the clusters are that Foundry can support. This is likely why the company is being selective about which customers it grants access to at this point, at least until it can scale up its capacity. "As the availability and economics of the platform normalize across a larger set of customers, we will increase the rate of our rollout," Davis wrote.

This suggests that Foundry's compute resources are still relatively limited compared to the likes of CoreWeave or Lambda. But if the company's orchestration layer can do what it claims, Foundry may be able to get by with less by sweating its kit a little harder.

What we do know is the company appears to be distancing itself from the year-plus-long contracts normally associated with these kinds of rent-a-GPU outfits. Foundry claims that customers can reserve compute resources for as little as three hours at a time.

All aboard the hype train

Beyond conventional cloud service providers like Amazon Web Services, Google Cloud, and Microsoft Azure, Foundry faces competition from a growing list of GPU clouds looking to capitalize on the immense demand for GPU infrastructure while they still can.

The good news is that all the hype around AI infrastructure has made it much easier to secure the kind of financing necessary to fund large-scale GPU deployments. In addition to massive funding rounds, many GPU clouds have taken to using their accelerators as collateral. For instance, CoreWeave managed to talk its investors into putting up a $7.5 billion loan to support the deployment of more accelerators back in May.

And while many enterprises are still grappling with how to quantify AI's return on investment (ROI), things appear to be much clearer with regard to infrastructure providers. As our sibling site The Next Platform previously estimated, a cluster of 16,000 H100s would set you back roughly $1.5 billion and generate $5.27 billion in revenue over the course of four years.
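
Working backward from those figures gives a sense of the implied rental rate. The utilization assumption below is ours, not The Next Platform's:

```python
# Implied per-GPU-hour rate from The Next Platform's figures
# (the utilization assumption is ours, not theirs).
capex = 1.5e9               # estimated cluster cost, USD
revenue = 5.27e9            # projected four-year revenue, USD
gpus = 16_000
hours = 4 * 365 * 24        # four years of wall-clock hours

print(f"gross revenue is ~{revenue / capex:.1f}x the hardware cost")
for utilization in (1.0, 0.7):
    rate = revenue / (gpus * hours * utilization)
    print(f"at {utilization:.0%} utilization: ~${rate:.2f} per GPU-hour")
```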

Of course, that assumes the AI boom doesn't go bust before then. ®

https://www.theregister.com//2024/08/13/foundry_gpu_cloud/
