Where CPUs play in GPU-accelerated AI systems

Partner Content

Since OpenAI first released ChatGPT into the world two years ago, generative AI has been a playground mostly for GPUs, primarily those from Nvidia, even though graphics chips from other vendors and AI-focused silicon have tried to make their way in.

Still, at least for the time being, GPUs, with their highly parallel processing capabilities, will continue to be the go-to chips for training large language models (LLMs) and for running some AI inferencing jobs.

However, in the rapidly evolving and increasingly complex world of AI workloads, GPUs come with their own challenges around both cost and power efficiency. Nvidia's GPUs are not cheap - an H100 Tensor Core GPU can cost $25,000 or more, and the new Blackwell GPUs are priced even higher. In addition, they demand significant amounts of power, which can limit the scaling of AI applications and feeds ongoing worries about the impact AI's massive energy demands will have in the near future.

According to a Goldman Sachs Research report, an OpenAI ChatGPT query needs almost 10 times as much electricity to process as a Google search, and the energy demand from AI jobs promises to grow. Goldman Sachs predicts that datacenter power demand will jump 160 percent by 2030, and that datacenters - which now consume about one to two percent of the world's power - will account for three to four percent by the end of the decade.

CPUs haven't been sidelined in the AI era; they've always played a role in AI inferencing and deliver greater flexibility than their more specialized GPU brethren. In addition, their cost and power efficiency play well with small language models (SLMs), which carry hundreds of millions to fewer than 10 billion parameters rather than the billions to trillions of parameters of power-hungry LLMs.

Enter the host CPU

More recently, another role is emerging for those traditional datacenter processors - that of host CPUs in GPU-accelerated AI systems. As mentioned, increasingly complex AI workloads demand more and more power, which can limit how far AI applications can scale before performance suffers and costs climb too high.

Host CPUs run an array of tasks that ensure high system utilization, which maximizes performance and efficiency. The tasks include preparing the data for training models, transmitting the data to the GPU for parallel processing jobs, managing check-pointing to system memory, and providing inherent flexibility to process mixed workloads running on the same accelerated infrastructure.
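
To make the host CPU's part in that pipeline concrete, here is a minimal sketch of a PyTorch-style input pipeline, where CPU worker processes prepare batches, stage them in pinned host memory, and checkpoint model state back to system memory and disk; the dataset, model, batch size, and worker count are illustrative assumptions rather than anything specific to a given platform.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in for real data; in practice the CPU side decodes,
# augments, and batches the raw training data.
dataset = TensorDataset(torch.randn(10_000, 1_024), torch.randint(0, 10, (10_000,)))

# CPU worker processes prepare batches in parallel and stage them in
# page-locked (pinned) host memory so host-to-device copies can overlap
# with GPU compute.
loader = DataLoader(dataset, batch_size=256, num_workers=8, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1_024, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step, (x, y) in enumerate(loader):
    x = x.to(device, non_blocking=True)  # async copy from pinned host memory
    y = y.to(device, non_blocking=True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodic checkpointing: the CPU pulls state back into system memory
    # and writes it out while the GPU moves on to the next batches.
    if step % 100 == 0:
        state = {k: v.cpu() for k, v in model.state_dict().items()}
        torch.save(state, "checkpoint.pt")
```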

The role calls for highly optimized CPUs with advanced capabilities in areas from core counts and memory to I/O, bandwidth, and power efficiency to help manage complex AI workloads while driving performance and reining in some of the costs.

At the Intel Vision event in April, Intel introduced the next generation of its datacenter stalwart Xeon processor line. The Intel Xeon 6 was designed with today's highly distributed and continuously evolving computing environment in mind, offering two microarchitectures rather than a single core design. In June, Intel introduced the Intel Xeon 6 with single-threaded E-cores (Efficient cores) for high-density, scale-out environments such as the edge, IoT devices, and cloud-native and hyperscale workloads. More recently, the chip giant came out with the Intel Xeon 6 with P-cores (Performance cores) for compute-intensive workloads - not only AI, but also HPC and relational databases.

Five most compelling capabilities

With new features and capabilities, the Intel Xeon 6 with P-cores is the best option for host CPUs in AI systems. We only have room here for the top five reasons:

Superior I/O performance: Speed is always crucial when running AI workloads. The Intel Xeon 6 with P-cores offers 20 percent more lanes - up to 192 PCIe 5.0 lanes - that drive high I/O bandwidth. The higher bandwidth translates into faster data transfer between the CPU and GPU, a critical capability for both AI training and inference. In keeping with its host CPU duties, the extra lanes mean the Intel Xeon 6 with P-cores can transmit data to the GPU more quickly and efficiently for processing AI jobs, driving high utilization and maximizing performance and efficiency while reducing bottlenecks.
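
For a rough sense of what those lanes amount to, here is a back-of-the-envelope calculation, assuming PCIe 5.0's 32 GT/s per lane and 128b/130b encoding and ignoring protocol overhead; real-world throughput will be somewhat lower.

```python
# Approximate raw PCIe 5.0 bandwidth per direction, before protocol overhead.
GT_PER_SEC = 32e9          # PCIe 5.0 signaling rate per lane
ENCODING = 128 / 130       # 128b/130b line encoding

def pcie5_gbps(lanes: int) -> float:
    """One-direction bandwidth in GB/s for a given lane count."""
    return GT_PER_SEC * ENCODING / 8 * lanes / 1e9

print(f"x16 link  : ~{pcie5_gbps(16):.0f} GB/s per direction")   # ~63 GB/s
print(f"192 lanes : ~{pcie5_gbps(192):.0f} GB/s per direction")  # ~756 GB/s
```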

More cores and better single-threaded performance: With the Intel Xeon 6 with P-cores, the vendor is delivering twice the number of cores per socket compared to the chip's 5th Gen predecessor. The additional cores and the high Max Turbo frequencies also help the chip feed data to the GPU more efficiently, which speeds up training for AI models and makes it more power- and cost-efficient.

The new chips hold up to 128 performance cores per CPU and deliver 5.5 times better AI inferencing performance than competing CPUs. The high Max Turbo frequencies drive improved single-threaded performance in the Intel Xeon 6 with P-cores for managing demanding AI applications at higher speeds, translating into reduced overall model training time.

High memory bandwidth and capacity: High memory bandwidth and capacity are key factors for real-time data processing in AI workloads, enabling the efficient transfer of data between the GPU and memory, reducing latency, and improving system performance.

The Intel Xeon 6 with P-cores includes support for MRDIMM (Multiplexed Rank DIMM), an advanced memory technology that improves memory bandwidth and response times for memory-bound and latency-sensitive AI workloads. MRDIMM lets servers handle large datasets more efficiently, delivering a more than 30 percent performance boost over DDR5-6400 for AI jobs. It also gives the latest CPUs 2.3 times higher memory bandwidth than 5th Gen Intel Xeons, ensuring that even the largest, most complex AI workloads are handled without a problem.

The high system memory capacity also ensures there is enough room for large AI models that can't fit entirely in GPU memory, guaranteeing flexibility and high performance. Intel is first to market with MRDIMM support, which comes with strong ecosystem backing from the likes of Micron, SK hynix, and Samsung. The latest Intel Xeon CPUs also come with an L3 cache as large as 504 MB, which delivers low latency by keeping frequently needed data close to the processor in a quick-access store, accelerating task processing.
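
To illustrate why ample host memory matters for models that exceed GPU memory, here is a deliberately naive sketch of layer-wise offload, where the model lives in host RAM and layers are streamed to the accelerator one at a time; the layer sizes and count are arbitrary placeholders, and production systems would typically use an offloading framework rather than this loop.

```python
import torch
import torch.nn as nn

# A stand-in for a model that does not fit in GPU memory but fits in host RAM.
layers = nn.ModuleList([nn.Linear(8_192, 8_192) for _ in range(16)])

def forward_offloaded(x: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Stream one layer at a time to the GPU, compute, then evict it."""
    x = x.to(device)
    for layer in layers:
        layer.to(device)      # copy this layer's weights from host memory to the GPU
        x = layer(x)
        layer.to("cpu")       # free GPU memory for the next layer
    return x

if torch.cuda.is_available():
    out = forward_offloaded(torch.randn(4, 8_192))
```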

The chip also supports Compute Express Link (CXL) 2.0, which ensures memory coherency between the CPU and attached devices, including GPUs. That coherency is important for enabling resource sharing - which drives performance - while also reducing the complexity of the software stack and lowering the system's overall costs, all of which support system performance, efficiency, and scalability. CXL 2.0 allows each device to connect to multiple host ports on an as-needed basis for greater memory utilization, provides enhanced CXL memory tiering for expanding capacity and bandwidth, and supports hot-plugging for adding or removing devices.

RAS support for large systems: The Intel Xeon 6 with P-cores comes with advanced RAS (reliability, availability, and serviceability) features that ensure servers are ready to be deployed, are compatible with the existing infrastructure in the datacenter, and don't unexpectedly go down, which can be highly disruptive and costly when running complex and expensive AI applications. Uptime and reliability are ensured through telemetry, platform monitoring, and manageability technologies, while downtime is reduced through the ability to update system firmware in real time.

Intel Resource Director Technology gives organizations visibility into and control over shared resources for workload consolidation and performance improvements. Intel's large ecosystem of hardware and software providers and solution integrators helps drive efficiency, flexibility, and lower total cost of ownership (TCO).

Enhanced AI performance and scaled power efficiency for mixed workloads: In the end, performance and energy efficiency matter, and Intel Xeons have consistently been better than the competition at running AI inferencing workloads. That isn't changing with the Intel Xeon 6 with P-cores, which deliver 5.5 times the inferencing performance of AMD's EPYC 9654 chips. At the same time, they are 1.9 times better in performance per watt than 5th Gen Intel Xeons.

Another feature found only in the Intel Xeon 6900 series with P-cores is Intel AMX (Advanced Matrix Extensions), a built-in accelerator that enables AI workloads to run on the CPU rather than being offloaded to the GPU, and which now supports FP16 models. It delivers integrated workload acceleration for both general-purpose AI and classical machine learning workloads.

Google found in testing that Intel AMX boosts the performance of deep learning training and inference on the CPU, adding that it's a good feature for such jobs as natural language processing, recommendation systems, and image recognition.
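
As a rough illustration of what running such work on the CPU looks like in practice, here is a minimal sketch using PyTorch's reduced-precision autocast on the CPU, which recent PyTorch builds can route through oneDNN onto AMX units on supported Xeons; the model and input shapes are placeholders, and bfloat16 is used here simply as a commonly supported reduced-precision type.

```python
import torch
import torch.nn as nn

# Illustrative model; any nn.Module is handled the same way.
model = nn.Sequential(
    nn.Linear(1_024, 4_096),
    nn.ReLU(),
    nn.Linear(4_096, 1_024),
).eval()

x = torch.randn(32, 1_024)

# Reduced-precision inference on the CPU; on AMX-capable Xeons the matrix
# multiplications can be dispatched to the AMX units via oneDNN.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.shape)  # torch.Size([32, 1024])
```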

GPUs and host CPUs: A wicked one-two punch

GPUs will continue to be the dominant silicon for powering accelerated AI systems and training AI models, but organizations shouldn't sleep on the critical roles that CPUs play in the emerging market. That importance will only increase as the role of host CPUs becomes better defined and more widely adopted. The Intel Xeon 6 with P-cores, with its broad range of features and capabilities, will lead the way in defining what a host CPU is in the ever-evolving AI computing world. Learn more about the features that make Intel Xeon 6 processors with P-cores the best host CPU option in AI-accelerated systems.

Contributed by Intel.


https://www.theregister.com//2024/11/20/where_cpus_play_in_gpuaccelerated/
