Huawei Cloud built a network monitor so sensitive it spotted the impact of a single faulty chip

Tan KW

Publish date: Wed, 07 Aug 2024, 03:51 PM

SIGCOMM 2024: Huawei Cloud has developed a network monitoring tool that, when used in production in three of its own regions, observed more of its infrastructure than existing tools and revealed issues that had previously evaded human efforts.

The tool is called RD-Probe and was detailed in a paper [PDF] presented on Tuesday at the SIGCOMM 2024 conference in Sydney.

The paper explains that network monitoring is vital but hard to achieve at hyperscale. The authors - some from Huawei and others from the School of Computer Science at Peking University - cite AWS research [PDF] that states the Amazonian cloud has 10⁸⁷ intra-region link-path combinations and 10¹⁷⁶ inter-region link-path combinations. The paper also reveals that Huawei Cloud's datacenter networks comprise over 100,000 switches and a million servers. The sheer number of devices and paths - in a virtualized environment that uses randomness for load balancing - makes it very hard to gather enough data about what's going on at Layer 2.

RD-Probe is Huawei Cloud's attempt to solve that problem. The tool's developers decided to monitor each physical Layer 2 port, as doing so means they can observe the runtime status of switch fabrics. Considering only Layer 3, the authors write, would mean some ports would not be monitored.

Monitoring physical ports also helps to achieve more coverage than is possible when observing virtual networks - which, by their very nature, abstract some of the resources used to run them. That's not desirable because without comprehensive coverage, network monitoring tools will have blind spots that mean issues are missed.

The paper notes that RD-Probe "seamlessly integrates with the existing monitoring architecture" and "only modifies the task generation and data processing modules."

The tool starts by randomly generating probes, then does so again deterministically. This two-phase scheme is again done in the name of achieving the required monitoring coverage.
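To make the two-phase idea concrete, here is a minimal Python sketch. It is not the paper's algorithm: the toy topology, the port names, and the `two_phase_probes` helper are all hypothetical, and it assumes each probe can be pinned to a known path, which the article does not confirm. It only illustrates how a random pass followed by a deterministic gap-filling pass can reach every port.

```python
import random


def ports_covered(paths):
    """Return the set of ports traversed by at least one probe path."""
    return {port for path in paths for port in path}


def two_phase_probes(candidate_paths, random_count, seed=0):
    """Pick probe paths: a random sample first, then deterministic gap-filling.

    Hypothetical helper for illustration; not taken from the RD-Probe paper.
    """
    rng = random.Random(seed)
    all_ports = ports_covered(candidate_paths)

    # Phase 1: draw a random sample of candidate paths.
    chosen = rng.sample(candidate_paths, min(random_count, len(candidate_paths)))
    covered = ports_covered(chosen)

    # Phase 2: walk the still-uncovered ports in a fixed order and add the
    # first candidate path that crosses each one.
    for port in sorted(all_ports - covered):
        if port in covered:
            continue  # already picked up by a path added earlier in this loop
        path = next(p for p in candidate_paths if port in p)
        chosen.append(path)
        covered.update(path)

    return chosen, len(covered) / len(all_ports)


# Toy two-tier fabric: every path is a tuple of hypothetical port identifiers.
CANDIDATE_PATHS = [
    ("leaf1:p1", "spine1:p1", "leaf2:p1"),
    ("leaf1:p2", "spine2:p1", "leaf2:p2"),
    ("leaf1:p1", "spine1:p2", "leaf3:p1"),
    ("leaf1:p2", "spine2:p2", "leaf3:p2"),
]

probes, coverage = two_phase_probes(CANDIDATE_PATHS, random_count=2)
print(f"{len(probes)} probes selected, port coverage {coverage:.1%}")
```

In this sketch the random phase alone may leave some ports untouched, depending on which paths the sample happens to pick. The deterministic phase is what guarantees that every port appearing in a candidate path ends up probed.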

A dedicated 16-node cluster - in which each server runs an unnamed eight-core 2.80GHz CPU with 64GB of memory - generates the probes. Data generated by probes is processed by a streaming 48-node cluster in which each machine employs a 16-core 2.80GHz CPU with 32GB memory.

Within a month of using RD-Probe, Huawei Cloud found "many previously unnoticed issues."

Thankfully most "only caused fail-slow symptoms or intermittent packet drops" and they were spotted before users perceived degraded service. This made Huawei happy, as the paper's authors rated the issues "hard to locate via manual inspection."

Faults detected by RD-Probe and missed by other tools included:

A faulty chip in the line processing unit of a core switch used in an object storage service, which caused incoming packets to be dropped and could not report the issue to the control plane;
Flawed load balancing that caused traffic to go only through the local port instead of stack cables;
Use of incorrect values for some BGP routes, which led traffic onto a slow path.

Huawei's researchers are pleased with RD-Probe as it improved the cloud's network monitoring coverage from 80.9 percent of resources to 99.5 percent, and "unearthed several previously unnoticed issues while tolerating numerous faults."

The concern plans to implement it in more cloud regions soon.

But the paper's authors also point out that RD-Probe does not consider North-South traffic, and can't filter out server-side failures. Fixing those issues remains on Huawei's to-do list. ®

https://www.theregister.com//2024/08/07/huawei_cloud_rd_probe/
