Buying a PC for local AI? These are the specs that actually matter

Publish date: Sun, 25 Aug 2024, 10:18 PM
Ready to dive in and play with AI locally on your machine? Whether you want to see what all the fuss is about with all the new open models popping up, or you're interested in exploring how AI can be integrated into your apps or business before making a major commitment, this guide will help you get started.

With all the marketing surrounding AI PCs, NPUs, GPUs, TOPS, and FLOPS, it's easy to get lost trying to make sense of which factors are important and which are overblown. In this guide, we'll explore some common misconceptions and get to the bottom of what specs actually matter when deploying AI at home.

Setting expectations

The kind of hardware you need to run AI models locally depends a lot on what exactly you're trying to achieve. As much as you might like to train a custom model at home, realistically, that's not going to happen.

The types of generative AI workloads within reach of the average consumer usually fall into one of two categories: image generation and large language model (LLM)-based services like chatbots, summarization engines, or code assistants.

Even here, there are some practical limitations. If you want to run a 70 billion parameter model, you're going to need a rather beefy system, potentially with multiple high-end graphics cards. But a more modest eight-billion-parameter model is certainly something you can get running on a reasonably modern notebook or entry-level GPU.

If you're interested in more advanced use cases for AI, like fine-tuning, or developing applications that stitch together the capabilities of multiple models, you may even need to start looking at workstation- or server-class hardware.

The specs that actually matter

With our expectations set accordingly, let's dig into the specs that make the biggest difference when running generative AI models at home:

1. Memory / vRAM capacity

Without question, the most important spec to look for is memory capacity, or more specifically, GPU vRAM if you're looking at a system with dedicated graphics. Whatever kind of AI model you plan to run, it's going to take up a lot of space and needs to be loaded into memory for the best performance.

How much memory you need depends on how big the model you want to run is, at what precision it was trained, and whether or not it can be quantized. Most AI models today are trained at 16-bit precision, which means that for every one billion parameters you need roughly 2GB of memory.

Obviously, that's quite limiting, so it's common to see models quantized to 8- or even 4-bit integer formats to make them easier to run locally. At 4-bit precision, you only need half a gigabyte of memory for every billion parameters. You can learn more about quantization in our hands-on guide here.
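If you'd rather not do the arithmetic by hand, the rule of thumb boils down to a few lines of Python. The bytes-per-parameter figures below follow from the precisions discussed above; the 20 percent headroom allowance is our own rough assumption, not a hard rule:

    # Back-of-the-envelope model memory estimator
    BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

    def estimate_vram_gb(params_billion, precision="fp16", overhead=1.2):
        """Approximate footprint in GB: weights plus ~20 percent headroom."""
        return params_billion * BYTES_PER_PARAM[precision] * overhead

    for size in (8, 70):
        for prec in ("fp16", "int8", "int4"):
            print(f"{size}B @ {prec}: ~{estimate_vram_gb(size, prec):.1f} GB")

Run it and you'll see why a 70-billion-parameter model is a stretch for a single consumer card even at 4-bit, while an eight-billion-parameter model quantized to 4-bit fits comfortably within 8GB of vRAM.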

In general, the more memory you've got, the more flexibility you'll have when running larger models.

2. Memory bandwidth

The type and speed of your memory also matters a lot. You might think the number of TOPS a GPU, NPU, or CPU can churn out would determine how quickly an AI model runs, but that's not always the case.

With LLMs, how fast they spit out a response has more to do with how speedy your memory is than with floating point or integer performance. In fact, you can get a rough estimate of their peak throughput on any given system by dividing the total memory bandwidth by the size of the model in gigabytes.

For example, if you have a GPU with 960GB/s of memory bandwidth, your peak performance when running an eight-billion-parameter model at FP16 would be about 60 tokens per second. In reality, the key-value cache is going to add some additional overhead, but this at least illustrates how important fast memory is.
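Treating decoding as purely memory bound, that estimate is a one-liner. The sketch below ignores the key-value cache and other overheads mentioned above, so treat its output as an upper bound:

    def peak_tokens_per_second(bandwidth_gb_s, params_billion, bytes_per_param=2.0):
        """Upper-bound decode speed: each generated token streams the full
        set of weights through memory once, so bandwidth / model size."""
        return bandwidth_gb_s / (params_billion * bytes_per_param)

    # The example above: 960GB/s of bandwidth, 8B parameters at FP16 (2 bytes each)
    print(peak_tokens_per_second(960, 8))       # -> 60.0 tokens per second
    # The same model quantized to 4-bit (0.5 bytes per parameter)
    print(peak_tokens_per_second(960, 8, 0.5))  # -> 240.0 tokens per second

As the second call shows, quantization doesn't just shrink a model's footprint, it also speeds up generation, because fewer bytes have to be read for every token.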

This is why running LLMs on CPUs or integrated graphics is usually pretty slow. Most consumer platforms are limited to a 128-bit memory bus, so even if the system is loaded with plenty of fast 7,200 MT/s DDR5, the best you're going to manage is about 112GB/s.

Some systems, like Apple's higher-end M-series Macs, feature wider memory buses, up to 1,024 bits in the case of the M2 Ultra, allowing it to achieve 800GB/s of bandwidth.
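If you want to work out a platform's ceiling yourself, theoretical peak bandwidth is just the bus width in bytes multiplied by the transfer rate. A quick sketch, assuming LPDDR5-6400 for the M2 Ultra:

    def bandwidth_gb_s(bus_width_bits, transfer_rate_mt_s):
        """Theoretical peak: (bus width in bytes) x transfers per second."""
        return bus_width_bits / 8 * transfer_rate_mt_s / 1000

    # Typical desktop platform: 128-bit bus with DDR5-7200
    print(bandwidth_gb_s(128, 7200))    # ~115 GB/s theoretical; sustained is a bit lower
    # Apple M2 Ultra: 1,024-bit bus with LPDDR5-6400 (our assumption)
    print(bandwidth_gb_s(1024, 6400))   # ~819 GB/s theoretical, marketed as 800GB/s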

Because of this, dedicated GPUs are usually the go-to for LLMs. However, you do need to be careful, as newer graphics cards aren't necessarily better. Nvidia's RTX 4060 is a prime example: it not only features less memory than the older RTX 3060, but also a narrower 128-bit memory bus. If you're uncertain, resources like TechPowerUp's GPU database can be incredibly valuable when picking out a GPU for generative AI.

3. TOPS and FLOPS

A processor's integer and floating point performance, measured in TOPS and teraFLOPS, respectively, is still worth taking into consideration.

While LLM inferencing is largely memory bound, processing the prompt is still a compute-intensive task. The more TOPS or FLOPS your system can push, the shorter the time it will take to start generating a response. This isn't usually that obvious with shorter queries, but for longer prompts, like a summarization task, the delay can be quite noticeable if you're short on compute.
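To get a feel for the scale of that delay, the sketch below estimates time-to-first-token using the common approximation of roughly two FLOPs per parameter per prompt token. The 50 percent utilization figure and the 50-teraFLOPS card are our own assumptions for illustration, not measured numbers:

    def prefill_seconds(params_billion, prompt_tokens, tflops, utilization=0.5):
        """Very rough time-to-first-token: ~2 FLOPs per parameter per prompt
        token, divided by the compute you can realistically sustain."""
        flops_needed = 2 * params_billion * 1e9 * prompt_tokens
        return flops_needed / (tflops * 1e12 * utilization)

    # 8B model on a card with ~50 dense FP16 teraFLOPS (RTX 3060-class)
    print(f"{prefill_seconds(8, 100, 50):.2f} s")    # short chat prompt: well under a second
    print(f"{prefill_seconds(8, 8000, 50):.2f} s")   # long summarization prompt: several seconds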

Floating point and integer performance are even more important when looking at other kinds of AI, like image-gen models. Compared to LLMs, models like Stable Diffusion or Black Forest Labs' Flux.1 Dev tend to be more computationally intensive.

If you pit something like the RTX 3060, which boasts 51.2 teraFLOPS of dense FP16 performance, against an RTX 6000 Ada Generation card capable of 364.2 teraFLOPS in Stable Diffusion XL, the faster card is, unsurprisingly, going to generate images more quickly.

However, comparing TOPS and FLOPS from one vendor to the next can get a bit dicey, since the precision at which that performance is achieved isn't always obvious and not every chip supports the same data types. Nvidia's current crop of 40-series GPUs, for example, supports 8-bit floating point calculations, while AMD's RDNA 3-based accelerators do not.

That's not as big a deal as you might think, since not all models can take advantage of FP8 support, leading to situations where a bigger FLOPS number doesn't necessarily translate to higher performance.

Another thing to watch out for is whether the figures given are for sparse or dense mathematics. Nvidia, for example, likes to advertise sparse performance, while AMD and Intel have traditionally favored dense figures. And this isn't just a difference in philosophy: some workloads can take advantage of sparsity while others can't.
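One way to keep spec sheets honest is to normalize everything to dense throughput before comparing. For the 2:4 structured sparsity that Nvidia's tensor cores support, the sparse figure is simply double the dense one, so the conversion is trivial; the numbers below are made up purely for illustration:

    def dense_equivalent_tops(advertised_tops, is_sparse):
        """Normalize a marketing figure to dense throughput, assuming 2:4
        structured sparsity, where sparse throughput is double dense."""
        return advertised_tops / 2 if is_sparse else advertised_tops

    # Hypothetical spec-sheet numbers
    print(dense_equivalent_tops(242, is_sparse=True))   # -> 121.0 dense TOPS
    print(dense_equivalent_tops(120, is_sparse=False))  # -> 120 dense TOPS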

So, any time you see TOPS or FLOPS advertised, your first question should be, "at what precision?"

Specs don't tell the full story

TOPS, memory bandwidth and capacity all have an impact on what models you can run and how well they'll perform, but they don't necessarily tell the full story.

Just because an AI framework runs on an Nvidia GPU doesn't mean it'll work on Intel or AMD cards, and if it does, it may not run optimally. We can illustrate this by looking at the popular LLM runner LM Studio, which is based on the Llama.cpp project and will run on just about anything.

The downside to this broad compatibility is that not every GPU will perform its best out of the box. LM Studio runs natively on CUDA and Apple's Metal API, but uses the Vulkan compute backend for Intel and AMD GPUs. As a result, performance in LM Studio on Arc and Radeon cards isn't as good as it could be if it was running natively on something like SYCL or ROCm.

The good news is that support for both already exists in Llama.cpp, and there's an experimental build of LM Studio with ROCm support available if you happen to have a compatible 7000-series Radeon card.

The state of the local AI software ecosystem

At the time of writing, if you want the broadest software compatibility, Nvidia hardware is a pretty safe bet as its CUDA software libraries have been around for more than 15 years at this point.

With that said, AMD's ROCm and HIP stacks, while still not as mature as Nvidia's CUDA, are improving very quickly, with support for RDNA 3 (7000-series) graphics cards having been added late last year. In our testing, we had no issues running Stable Diffusion or LLMs in Ollama and Llama.cpp, at least on Linux.

It's a similar situation with Intel's Arc graphics. In our testing, software compatibility isn't quite as mature as it is for either AMD or Nvidia at this point. We've found that while many popular services can be made to work on these cards, it's not always straightforward to deploy them. We expect native support for PyTorch, said to be coming in an upcoming release, to resolve some of these challenges.

But, if you already have an Arc GPU or were considering picking one up, Intel recently released its AI Playground for Arc, which, while pretty limited in terms of features, does provide an easy way to deploy both diffusion models and LLMs in Windows.

Having said all that, the software ecosystem around local AI is evolving very quickly so if you're reading this even a few months after it's published, the situation is likely to have changed.

What about NPUs?

For all the noise Microsoft and the major chipmakers have made about neural processing units (NPUs), you don't actually need them to run AI models at home.

These dedicated accelerators are designed to churn out lots of TOPS while consuming a relatively small amount of power. However, beyond a handful of apps and Microsoft's Copilot+ features, they still aren't that well supported. One of the first places we are seeing them come into play is with image generation, so if you're mostly interested in running Stable Diffusion, it may be worth investigating.

If you're leaning toward a desktop system, your processor may not even come with an NPU. So, in short, an NPU isn't a bad thing to have, but you shouldn't sweat it if your system doesn't have one. For the moment, the majority of AI software is still optimized for GPU compute anyway.

Summing up

  • Large language models benefit most from lots of fast memory. The easiest way to achieve this is with a dedicated GPU. We find the sweet spot in terms of performance and cost tends to be among GPUs with between 12 and 16GB of vRAM capable of at least 360 GB/s of bandwidth. This offers a good amount of space for models, especially when quantized to 4-bit precision, while still being able to churn out tokens faster than you can read.
  • Diffusion models like Stable Diffusion XL or Flux.1 Dev require loads of memory - more than 24GB at FP16 in the latter's case - and benefit heavily from more compute. For these kinds of models, it's worth looking at higher-end GPUs; or if you're considering a notebook, one with an NPU that meets or exceeds Microsoft's 40-TOPS target.
  • Check software compatibility before pulling the trigger on a piece of hardware. A part with stellar specs won't do you much good if none of your models run on it.
  • This space is evolving quickly, and the models, along with the software and hardware that run them, are improving all the time. For example, since our how-to guide on deploying LLMs in Ollama was published, the app has extended support to a variety of new models and hardware, including AMD's 7000-series Radeon graphics cards. If something doesn't work well today, that may not always be the case.

What next?

If you're looking for ideas on how to get started with local AI, we have a growing catalog of hands-on guides to help.

The Register aims to bring you more AI content soon, so be sure to share your burning questions in the comments section. And if you haven't already, do check out our other local AI guides, linked above. ®

 

https://www.theregister.com//2024/08/25/ai_pc_buying_guide/
