Huawei Cloud has released a huge trove of data describing the performance of its serverless services, in the hope that other hyperscalers will use it to improve their own operations.
The Chinese giant detailed its ops in a recent preprint paper [PDF] that reveals Huawei's YuanRong serverless platform has been deployed for over three years across nearly 20 datacenter regions, and processes 30 billion requests each day.
Each of the regions Huawei operates is divided into four clusters. "Clusters provide virtual and physical separations within a region, improving availability and fault tolerance," the paper states.
Next comes an explanation of how Huawei lets users select the resources allocated to functions: by choosing a "resource limit" that defines CPU-memory configurations, such as "300-128" for a rig that offers 300 millicores and 128 MB of memory. The company keeps "pods" of resources ready to run functions and meet escalating demand.
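The "300-128" notation can be illustrated with a short parser. This is a minimal sketch based on the paper's description of the scheme; the function and field names are illustrative assumptions, not Huawei's actual API.

```python
from typing import NamedTuple

class ResourceLimit(NamedTuple):
    """A CPU-memory configuration such as "300-128" (names are assumed)."""
    millicores: int  # CPU allocation in thousandths of a core
    memory_mb: int   # memory allocation in megabytes

def parse_resource_limit(limit: str) -> ResourceLimit:
    """Split a "<millicores>-<MB>" string into its two components."""
    cpu, mem = limit.split("-")
    return ResourceLimit(millicores=int(cpu), memory_mb=int(mem))

# "300-128" -> 300 millicores (0.3 of a core) and 128 MB of memory
print(parse_resource_limit("300-128"))
```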
An autoscaler determines if additional pods are required to address incoming requests and, when more power is needed, "pods are taken from the appropriate pool, the code of that function is loaded into it, and it is ready to process requests."
As the paper explains, if a container is not ready to run a function, the pod called into action must perform a "cold start" – the serverless equivalent of booting up into a state in which a function can run.
Pods keep running for a minute even if unused – which Huawei calls "keep-alive time" – after which they'll need to cold start again if required.
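The warm-pool, keep-alive, and cold-start behavior described above can be sketched as follows. This is a toy model under stated assumptions: the class names, the pool-selection policy, and the single shared pool are illustrative, not YuanRong's implementation; only the 60-second keep-alive figure comes from the article.

```python
import time
from collections import deque

KEEP_ALIVE_SECONDS = 60  # idle pods expire after a minute, per the paper

class Pod:
    def __init__(self) -> None:
        self.loaded_function: str | None = None  # code currently loaded, if any
        self.last_used: float = time.monotonic()

class WarmPool:
    """Pre-provisioned pods; a request cold-starts only when no warm pod
    already holds the function's code within its keep-alive window."""

    def __init__(self, size: int) -> None:
        self.pods = deque(Pod() for _ in range(size))

    def acquire(self, function_name: str) -> tuple[Pod, bool]:
        """Return a pod for the request and whether a cold start occurred."""
        now = time.monotonic()
        # Prefer a pod that still has this function loaded and is fresh.
        for pod in self.pods:
            if (pod.loaded_function == function_name
                    and now - pod.last_used < KEEP_ALIVE_SECONDS):
                pod.last_used = now
                return pod, False  # warm start: code already in place
        # Otherwise take a pod from the pool and load the function's
        # code into it -- the cold-start path the paper measures.
        pod = self.pods[0]
        self.pods.rotate(-1)
        pod.loaded_function = function_name
        pod.last_used = now
        return pod, True  # cold start
```

With this model, the first request for a function cold-starts, while a repeat request inside the keep-alive window reuses the warm pod.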
All cold starts add "significant latency, degrading application performance," write the paper's eight authors – all of whom are employees at Huawei's Systems Infrastructure Research (SIR) Lab in Edinburgh, Scotland.
Detecting, predicting, and ameliorating cold starts is the focus of the paper, which is based on analysis of data describing 85 billion requests from over 12 million pods, including over 11 million cold starts. The data was gathered over weeks of operation, including one week that featured a Chinese holiday so researchers could capture the impact of usage spikes. That data has been posted to GitHub and includes what Huawei describes as "detailed component times of cold starts from five regions, and examines the effect of function characteristics such as resource allocation, runtime language, and trigger type."
Cold starts are a known issue. But Huawei's authors assert that the data they've disclosed matters because previous literature mostly considered "high-level metrics from a single region with little discussion of components and the effect of factors such as runtime language, resource allocation, and trigger type on the number of cold starts and their component times."
Huawei Cloud therefore claims its data is the first release of its type.
The paper essentially concludes that cold starts happen for lots of reasons – among them variability between Huawei Cloud's own datacenters, the complexity of the function, or the languages and runtimes used.
It also concludes that users and operators of serverless platforms mostly feel that multi-region operations are inherently risky – but suggests the latency involved in running functions across multiple datacenters could be less impactful than the time required to wait for a cold start. The paper also suggests possible improvements to pod scheduling, and optimization of keep-alive time, to enhance serverless performance.
The data dump is just the second dataset Huawei's SIR Lab has posted to GitHub. The paper will be presented at the EuroSys 2025 conference in Amsterdam, which kicks off in March. ®
https://www.theregister.com//2024/10/24/huawei_serverless_cold_start_research/
Created by Tan KW | Nov 23, 2024