Future Tech

AI PC vendors gotta have their TOPS – but is this just the GHz wars all over again?

Tan KW
Publish date: Tue, 11 Jun 2024, 10:23 PM

Comment For chipmakers, the AI PC has become a race to the TOPS - with Intel, AMD and Qualcomm each trying to one-up the others.

As we learned last week, AMD's next-gen Ryzen 300-series chips will boast 50 NPU TOPS, while Intel's Lunar Lake parts will deliver 48 NPU TOPS. Meanwhile, Qualcomm and Apple have previously announced their NPUs will do 45 and 38 TOPS, respectively.

This kind of marketing, historically, is quite effective - bigger numbers are easier for us customers to understand. But, as is the case with clock speeds and cores, it's never as simple as the marketers make it sound. This is certainly true when it comes to TOPS.

One of the biggest problems with TOPS - how many trillions of operations your chip can process per second - is that it's missing a critical piece of information: precision. What that means is 50 TOPS at 16-bit precision isn't the same as 50 TOPS at 8-bit or 4-bit precision.

Usually, when we talk about TOPS, it's assumed to mean at INT8, or 8-bit precision. However, with lower 6- and 4-bit data types becoming more common, it's no longer a given. Intel and AMD, to their credit, have got better about clarifying precision, but it remains a potential point of confusion for consumers trying to make informed decisions.

Even assuming the claimed performance is measured at the same precision, TOPS is just one of many factors that contribute to AI performance. Just because two chips are rated for similar performance in terms of TOPS or TFLOPS doesn't mean they can actually put all of it to use.

Take, for instance, Nvidia's A100 and L40S - which are rated to produce 624 and 733 dense INT8 TOPS, respectively. Clearly the L40S is going to perform slightly better running (inferencing) AI apps, right? Well, it isn't that simple. The L40S is technically faster, but its memory is much slower: at 864GB/sec vs the 40GB A100's 1.55TB/sec of bandwidth.

Memory bandwidth is just as important for AI PCs as it is for beefy datacenter chips, and it can have a much more noticeable impact on performance than you might think.

Looking at something like a large language model, inference performance can be broken up into two phases: first- and second-token latency.

For a chatbot, first-token latency is experienced as how long it has to think about your question before it can start responding. This step is generally compute limited - which means more TOPS is certainly better.

Second-token latency, meanwhile, is the time it takes each word of the chatbot's response to appear on screen. This step is heavily constrained by memory bandwidth.

This phase is going to be a lot more noticeable to end users - you're going to feel the difference between a chatbot that can generate five words a second and one that can produce 20 of them.

This is why Apple's M-series chips have proven to be such great little machines for running local LLMs. Their memory is copackaged alongside the SoC, allowing for lower latency and higher bandwidth. Even an older chip like the M1 Max is surprisingly capable of running LLMs, because it has 400GB/sec of memory bandwidth to work with.
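A rough way to see why bandwidth dominates this phase: if generating each token means streaming every model weight from memory once, then bandwidth divided by model size gives a ceiling on tokens per second. The sketch below assumes exactly that, ignoring compute time, caching effects and the KV cache. The function name, the 4GB model size and the 100GB/sec comparison point are illustrative assumptions; only the 400GB/sec figure comes from the text above.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Ceiling on decode rate, assuming all weights are read once per token."""
    return bandwidth_gb_s / model_size_gb


MODEL_SIZE_GB = 4.0  # assumed: a model whose weights occupy about 4GB

# 100GB/sec is a hypothetical comparison point; 400GB/sec is the M1 Max figure.
for bandwidth in (100.0, 400.0):
    ceiling = max_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{bandwidth:.0f}GB/sec -> at most {ceiling:.0f} tokens/sec")
```

Real-world throughput will land below these ceilings, but the ratio between the two systems is the part worth noticing.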

Now we're starting to see more chipmakers, like Intel, package memory alongside computing grunt. Intel's upcoming Lunar Lake processors will be available with up to 32GB of LPDDR5x memory running at 8500MT/sec and supporting four 16-bit channels.
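For a sense of scale, theoretical peak bandwidth is the transfer rate multiplied by the number of bytes moved per transfer. The sketch below applies that to the configuration exactly as quoted here, on the assumption that four 16-bit channels is the whole memory interface; a part with more channels or packages would scale up accordingly. The function name is illustrative.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float, channels: int,
                        bits_per_channel: int) -> float:
    """Theoretical peak bandwidth in GB/sec (10^9 bytes per second)."""
    bytes_per_transfer = channels * bits_per_channel / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Figures as quoted: 8500MT/sec across four 16-bit channels.
quoted = peak_bandwidth_gb_s(8500, channels=4, bits_per_channel=16)
print(f"Peak bandwidth for the quoted configuration: {quoted:.0f}GB/sec")
```

This is a best-case number; sustained bandwidth in practice is lower.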

This should substantially improve performance when running LLMs on device - but probably won't be popular among right-to-repair advocates.

We can help to reduce memory pressure by developing models that can run at lower precision - for example, by quantizing them down to 4-bit weights. This also has the benefit of reducing the amount of RAM required to fit the model in memory.

However, we're either going to need smaller, more nimble models, or a lot more memory to fit them. Somehow in 2024, we're still shipping PCs with 8GB of memory onboard - that's going to be rather tight if you want to run more than the smallest models on your PC. In general, 4-bit quantized models require roughly 512MB for every billion parameters - about 4GB of memory for a model like Llama3-8B.
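The arithmetic behind that rule of thumb is simply parameter count times bits per weight, divided by eight to get bytes. The sketch below covers weights only; it leaves out the KV cache, activations and any per-block quantization metadata, so actual usage will run somewhat higher. The function name is illustrative.

```python
def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return params_billions * bits_per_weight / 8


# An 8-billion-parameter model at three common precisions.
for bits in (16, 8, 4):
    print(f"8B parameters at {bits}-bit: ~{weights_gb(8, bits):.0f}GB")
```

At 16-bit, the same model needs about four times the memory it does at 4-bit, which is the difference between fitting on an 8GB machine and not.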

We can use smaller models, like Google's Gemma-2B, but more than likely we're going to have a handful of models running on our systems at any given time. So what you can do with your AI PC doesn't just depend on TOPS and memory bandwidth, but also on how much memory you have.

You could suspend models to disk when they've been inactive for more than a certain period of time, but doing so will incur a performance penalty when resuming - as the model is loaded back into memory - so you also need very fast SSDs.

And in an increasingly mobile computing world, power is a major factor - and one that's not always clearly addressed.

Take two chips capable of producing roughly 50 TOPS. If one consumes ten watts and the other requires five watts, you're going to notice the difference in battery consumption even though on paper they should perform similarly. Likewise, if a chip produces 25 TOPS, but only requires three watts, it's going to use less energy overall even if it takes twice as long as one producing 50 TOPS at ten watts.
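To put numbers on that last comparison: energy is power multiplied by time, and time is the size of the job divided by throughput. The sketch below assumes both chips sustain their rated TOPS at their rated power for the whole job, and ignores idle draw and the rest of the system. The workload size and function name are hypothetical.

```python
def energy_joules(workload_tera_ops: float, tops: float, watts: float) -> float:
    """Energy to finish a fixed job at a sustained rate and power draw."""
    seconds = workload_tera_ops / tops
    return watts * seconds


WORKLOAD = 100.0  # hypothetical job size, in trillions of operations

fast = energy_joules(WORKLOAD, tops=50.0, watts=10.0)  # 2 seconds at 10W
slow = energy_joules(WORKLOAD, tops=25.0, watts=3.0)   # 4 seconds at 3W
print(f"50 TOPS at 10W: {fast:.0f}J")
print(f"25 TOPS at 3W:  {slow:.0f}J")
```

Under these assumptions the slower chip finishes the same job on a little over half the energy, which matters more to battery life than the headline TOPS figure.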

In short, numerous factors are going to be just as, if not more, important than how many TOPS your chip can spew out.

That's not to say TOPS don't matter. They do. There's a reason that, with every generation, Nvidia, AMD, and Intel push their chips harder. More TOPS means you can tackle bigger problems, or solve the same problems faster.

But as with most systems, carefully balancing memory, compute, I/O and power consumption is critical to achieving the desired performance characteristics for your AI PC. Unfortunately, communicating any of this is a lot harder than pointing to a bigger TOPS number - so it seems we're doomed to repeat the GHz wars all over again. ®


