Future Tech

Benchmarks show even an old Nvidia RTX 3090 is enough to serve LLMs to thousands

Tan KW
Publish date: Sat, 24 Aug 2024, 06:09 AM

If you want to scale a large language model (LLM) to a few thousand users, you might think a beefy enterprise GPU is a hard requirement. However, at least according to Backprop, all you actually need is a four-year-old graphics card.

In a recent post, the Estonian GPU cloud startup demonstrated how a single Nvidia RTX 3090, which debuted in late 2020, could serve a modest LLM like Llama 3.1 8B at FP16 to upwards of 100 concurrent requests while maintaining acceptable throughput.

Since only a small fraction of users are likely to be making requests at any given moment, Backprop contends that a single 3090 could actually support thousands of end users. The startup has been renting GPU resources for the past three years and recently transitioned into a self-service cloud offering.

While powering a cloud using consumer hardware might seem like an odd choice, Backprop is hardly the first to do it. German infrastructure-as-a-service provider Hetzner has long offered bare metal servers based on AMD's Ryzen processor family.

As a GPU, the RTX 3090 isn't a bad card for running LLMs. It boasts 142 teraFLOPS of dense FP16 compute and 936GB/s of memory bandwidth, the latter being a key determinant of performance in LLM inference workloads.
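
As a back-of-the-envelope illustration of why bandwidth is the bottleneck (our rough sketch, not Backprop's methodology): at batch size one, every generated token requires streaming the full set of FP16 weights from VRAM, so the single-request ceiling is roughly bandwidth divided by model size, and batching exists to amortize that weight read across many users.

```python
# Rough bandwidth-bound estimate of decode throughput on an RTX 3090.
# All figures are approximations for illustration only.
mem_bandwidth_gb_s = 936               # RTX 3090 memory bandwidth
weight_bytes_gb = 8 * 2                # 8B params x 2 bytes (FP16) ~= 16 GB

# At batch size 1, each decoded token must stream all weights once,
# so the per-request ceiling is roughly:
single_stream_tok_s = mem_bandwidth_gb_s / weight_bytes_gb
print(f"~{single_stream_tok_s:.0f} tokens/s ceiling for one request")

# Batching lets many concurrent requests share that same weight read,
# which is why aggregate throughput climbs as more users pile on.
```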

"3090s are actually very capable cards. If you want to get the datacenter equivalent of a 3090 in terms of teraFLOPS power, then you would need to go for something that is significantly more expensive," Backprop co-founder Kristo Ojasaar told The Register.

Where the card does fall short of more premium workstation and enterprise cards from the Ampere generation is memory capacity. With 24GB of GDDR6X memory, you aren't going to be running models like Llama 3 70B or Mistral Large even if you quantize them to four- or eight-bit precision.

So, it's not surprising that Backprop opted for a smaller model like Llama 3.1 8B, since it fits nicely within the card's memory and leaves plenty of room for the key-value cache.
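
To put numbers on that headroom (a rough sketch using Llama 3.1 8B's published architecture of 32 layers, eight grouped key-value heads, and 128-dimension heads, and ignoring activations and runtime overhead):

```python
# Rough KV-cache sizing for Llama 3.1 8B at FP16.
layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # K and V
print(kv_bytes_per_token // 1024, "KiB per cached token")           # 128 KiB

free_gb = 24 - 16                      # 24GB card minus ~16GB of FP16 weights
cacheable_tokens = free_gb * 1024**3 // kv_bytes_per_token
print(f"~{cacheable_tokens:,} tokens of KV cache fit in what's left")  # ~65,536
```

That's tens of thousands of cached tokens to spread across concurrent conversations, which is what makes batched serving on a single consumer card workable.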

The testing was done with the popular vLLM framework, which is widely used to serve LLMs across multiple GPUs or nodes at scale. But before you get too excited, these results aren't without a few caveats.
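
To give a sense of what such a setup involves, a minimal vLLM script (the checkpoint name and settings below are our assumptions, not Backprop's exact configuration) looks something like this:

```python
# Minimal vLLM offline-inference sketch for a 24GB card; assumes the
# checkpoint has already been downloaded from Hugging Face.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed checkpoint
    dtype="float16",                 # FP16, as in Backprop's test
    gpu_memory_utilization=0.90,     # leave a little headroom on the 3090
)

params = SamplingParams(max_tokens=100, temperature=0.7)
for out in llm.generate(["Summarize our refund policy in one sentence."], params):
    print(out.outputs[0].text)
```

The same engine can also be exposed as an OpenAI-compatible HTTP server, which is the shape of deployment the concurrency numbers below assume.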

In a benchmark simulating 100 concurrent users, Backprop found the card was able to serve the model to each user at 12.88 tokens per second. That's faster than the average person can read, generally said to be about five words per second, though not exactly fast. Still, it clears the 10 tokens per second generally considered the minimum acceptable generation rate for AI chatbots and services.
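
Those per-user rates are essentially each request's generated tokens divided by its wall-clock time under load. A crude way to reproduce that style of measurement against any OpenAI-compatible endpoint (this is not Backprop's benchmark script; the URL, model name, and prompt are placeholders) might look like:

```python
# Crude concurrency probe against an OpenAI-compatible /v1/completions
# endpoint, such as the one vLLM's API server exposes. Placeholders only.
import asyncio
import time

import httpx

URL = "http://localhost:8000/v1/completions"
PAYLOAD = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Explain HTTP caching in two sentences.",
    "max_tokens": 100,
}

async def one_request(client: httpx.AsyncClient) -> float:
    start = time.perf_counter()
    resp = await client.post(URL, json=PAYLOAD, timeout=120)
    tokens = resp.json()["usage"]["completion_tokens"]
    return tokens / (time.perf_counter() - start)     # per-user tokens/s

async def main(concurrency: int = 100) -> None:
    async with httpx.AsyncClient() as client:
        rates = await asyncio.gather(*(one_request(client) for _ in range(concurrency)))
    print(f"mean per-user rate: {sum(rates) / len(rates):.2f} tokens/s")

asyncio.run(main())
```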

It's also worth noting that Backprop's testing was done using relatively short prompts and a maximum output of just 100 tokens. This means these results are more indicative of the kind of performance you might expect from a customer service chatbot than a summarization app.

However, in further testing with the -use_long_context flag in the vLLM benchmark suite set to true, and prompts ranging from 200-300 tokens in length, Ojasaar found that the 3090 could still achieve acceptable generation rates of about 11 tokens per second while serving 50 concurrent requests.

It's also worth noting that these figures were measured while running Llama 3.1 8B at FP16. Quantizing the model to eight or even four bits would theoretically double or quadruple throughput, allowing the card to serve a larger number of concurrent requests or serve the same number at a higher generation rate. But, as we discussed in our recent quantization guide, compressing models to lower precision can come at the cost of accuracy, which may or may not be acceptable for a given use case.
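
vLLM can load pre-quantized checkpoints directly, so experimenting with that trade-off is largely a matter of swapping the model argument (the AWQ repository named below is an assumption for illustration):

```python
# Loading a 4-bit AWQ export of Llama 3.1 8B instead of the FP16 weights.
# Checkpoint name is illustrative; any compatible AWQ export works.
from vllm import LLM

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",  # assumed repo
    quantization="awq",
)
```

Weights drop from roughly 16GB to around 5GB, freeing VRAM for a bigger KV cache as well as cutting the per-token memory traffic.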

If anything, Backprop's testing demonstrates the importance of performance analysis and right sizing workloads to a given task.

"I guess what the excellent marketing of bigger clouds is doing is saying that you really need some managed offering if you want to scale… or you really need to invest in this specific technology if you want to serve a bunch of users, but clear this shows that's not necessarily true," Ojasaar said.

For users who need to scale to larger models, higher throughput, or bigger batch sizes, Ojasaar told us Backprop is in the process of deploying A100 PCIe cards with 40GB of HBM2e.

While also an older card, he says the A100's support for multi-instance GPU (MIG), which can dice up a single accelerator into multiple smaller ones, presents an opportunity to lower costs further for enthusiasts and tinkerers.

If you're curious how your old gaming card might fare in a similar test, you can find Backprop's vLLM benchmark here. ®


https://www.theregister.com//2024/08/23/3090_ai_benchmark/
