Future Tech

AMD's Victor Peng: AI thirst for power underscores the need for efficient silicon

Tan KW
Publish date: Fri, 30 Aug 2024, 06:21 AM

Speaking at Hot Chips this week, AMD president Victor Peng addressed one of the biggest challenges facing the semiconductor industry as it grapples with growing demand for ever larger AI models: power.

"If you look at it at the macro level for those huge deployments, we're talking about not even finding enough power sources and being concerned about the grids and the distribution," he observed.

AI's seemingly insatiable thirst for power has gained considerable attention over the past year - so much so that some operators have begun setting up shop next to nuclear power plants. And the problem isn't going to get easier.

"As it turns out … if you throw [in] more compute, you grow the size of the model, you get better performance, accuracy, levels of intelligence, however you want to think about this," he argued. He noted that these models have very quickly gone from requiring hundreds of megawatt hours to train to hundreds of gigawatt hours.

To address this challenge, Peng argues that the semiconductor industry needs to focus more attention on making the infrastructure not only more performant, but more efficient.

"Whatever power budget you think you're limited at, if you get higher performance, you could either train larger models and get to intelligence quicker, or you can serve it more cost effectively," he explained.

Unfortunately, it seems many of the knobs and levers chipmakers and designers have relied on to continue scaling compute are running out of steam. Improvements in process technology are becoming smaller, while the time between each subsequent generation is getting longer and more costly.

The reason folks in the industry claim that Moore's Law is alive and well, Peng opined, is that many of these challenges can be overcome by moving to chiplet architectures and advanced packaging. This is an area where AMD has been a leader - going back to the launch of its first-gen Epyc processors in 2017.

By going to 3D stacked silicon - like we've seen with AMD's X-series Epyc and Ryzen processors, as well as its MI300-series GPUs and APUs - Peng asserts it's possible to move 50x more bits per joule of energy than when going off package.

This becomes especially relevant when you start looking to scale compute up and out - something that's incredibly common in AI training and large model inferencing in datacenters today.

Compared to keeping everything on chip, scale-up systems - think networks of GPUs based on NVLink or Infinity Fabric - require 1,600-fold more power, according to AMD data. Scaling that compute across multiple nodes requires even more power still - in part because of the inefficiencies of these slower interconnects, but also because of the power required to run all of the switches, NICs, and optics that make them up.

Networking, Peng says, remains an opportunity when it comes to boosting datacenter efficiency. While compute accounts for the lion's share of power consumption, the network is responsible for sucking down roughly 20 percent of it.

Here, he suggested that scale-up network fabrics may help, pointing to the Infinity Fabric used to stitch together eight GPUs in AMD's MI300X-based systems. Rival Nvidia has already demonstrated systems that use NVLink to stitch together as many as 32 GPUs, with plans for denser 36 and 72 GPU configurations in the works.

However, AI's power problem isn't limited to the datacenter. It also extends to applications of AI in the client and embedded spaces - only instead of tens of kilowatts, you're talking about tens of watts or less. What's more, each of these segments has different requirements beyond power - like latency - that have to be taken into consideration.

In these regimes, Peng argues that careful application of heterogeneous compute offers a path forward. Following its 2022 acquisitions of Xilinx and Pensando, AMD's hardware lineup spans CPUs, GPUs, DPUs, FPGAs, and NPUs. More recently, we've started to see this technology integrated into mobile chips to make AI processing less energy intensive.

The latest example of this is AMD's Strix Point Ryzen 300 series processors, which feature an XDNA 2 NPU capable of 50 TOPS of INT8 and Block FP16 performance. Rivals Intel, Qualcomm, and Apple have also embraced NPUs for this reason.

Another area that Peng touched on that's relevant - regardless of whether you're deploying AI models in the datacenter or at the edge - is quantization. We've covered the topic in depth in our recent hands-on, but in a nutshell, quantization is a compression technique used to shrink model weights to lower precision in exchange for some loss of quality.
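To make the idea concrete, here's a minimal sketch of symmetric INT8 weight quantization - one common flavor of the technique. This is illustrative only, not AMD's or any framework's implementation; production quantizers typically use per-channel scales, calibration data, and finer-grained schemes.

```python
def quantize_int8(weights):
    """Map float weights onto int8 values in [-127, 127] using one scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]

# Each weight now fits in 1 byte instead of 4 (FP32) - a 4x size cut,
# in exchange for rounding error of at most half a quantization step.
weights = [0.82, -1.27, 0.05, 0.33]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
```

The same tradeoff drives the FP8 and FP4 data types discussed below: fewer bits per weight means more weights moved and processed per joule, at the cost of some precision.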

AMD has already embraced FP8 with the MI300X, and plans to join Nvidia in supporting 4-bit floating point data types next year with the launch of the MI350X. As Peng's keynote highlighted, this tradeoff in precision is often worth it in exchange for the higher performance per watt that can be achieved by using them.

Meanwhile, in the embedded space, Peng suggests it may be worth mapping models directly to the silicon to optimize for the dataflow. In one internal test, AMD's boffins were able to achieve a 4,500X reduction in energy per inference compared to standard INT8 compute.

Finally, Peng touched on the importance of software optimization and co-design and collaboration in order to unlock the full performance of the hardware. This is a subject Peng played a significant role in improving prior to his decision to retire at the end of the month. ®

 

https://www.theregister.com//2024/08/29/ai_thirst_for_power/
