Future Tech

Tencent Cloud to revisit design after circular dependencies slowed emergency API fix

Tan KW
Publish date: Wed, 17 Apr 2024, 01:01 PM

Tencent Cloud has apologized for an outage that impacted customers last week - an unusual act by a Chinese cloud - and signalled it will review some aspects of its ops in the hope of avoiding future incidents of this nature.

A WeChat post from China's number three cloud - Alibaba Cloud leads the market, ahead of Huawei, with Baidu in fourth place - revealed that on April 8 it updated configuration data for one of its APIs.

The change was a dud, and the API became unavailable. Some Tencent Cloud platform-as-a-service offerings that rely on it therefore became unreliable, effectively cutting their users off from the Tencent Cloud.

The cloud provider was able to fix the mess in just 87 minutes, but has apologized to the 1,957 customers who reported failures.

The triage post explains that the failure was caused by an update to the API that didn't consider compatibility.

Changes to the API's interface protocol meant that apps targeting the old version produced nonsense data that spread across Tencent Cloud and meant the API became unstable.
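For illustration only, one common way to keep old clients working through an interface change is to tag each request with a protocol version and reject anything the server does not understand, rather than parsing it into nonsense. The sketch below is not Tencent Cloud's code: the field names (`version`, `name`, `resource`) and the `handle_request` function are hypothetical.

```python
# Hypothetical sketch of version-aware request handling.
# None of these names come from Tencent Cloud's post.

SUPPORTED_VERSIONS = {1, 2}


class IncompatibleVersion(Exception):
    """Raised when a request uses a protocol version the server cannot parse."""


def handle_request(payload: dict) -> dict:
    # Clients written before versioning existed send no version field,
    # so treat a missing field as version 1.
    version = payload.get("version", 1)
    if version not in SUPPORTED_VERSIONS:
        # Fail loudly instead of producing garbage downstream.
        raise IncompatibleVersion(f"unsupported protocol version: {version}")

    if version == 1:
        name = payload["name"]              # old, flat layout
    else:
        name = payload["resource"]["name"]  # new, nested layout
    return {"name": name}


if __name__ == "__main__":
    print(handle_request({"name": "db-1"}))
    print(handle_request({"version": 2, "resource": {"name": "db-1"}}))
```

The point of the design is that a request in the old format is either handled by the old code path or refused outright, so it never reaches the new parser.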

Tencent Cloud would usually roll back this sort of change. But to do so, it needed the very same API it had just broken.
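A loop of this kind, where the recovery tool depends on the thing it is meant to recover, can in principle be caught ahead of time by checking a dependency map for cycles. The sketch below assumes a hand-written map of "X needs Y" relationships. The component names `rollback_tool` and `config_api` are made up for the example and do not appear in Tencent Cloud's post.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if there is none."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNSEEN for node in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we have looped back.
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(deps):
        if state[node] == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    # Hypothetical map: the rollback tool calls the API,
    # and the API can only be restored with the rollback tool.
    deps = {
        "rollback_tool": ["config_api"],
        "config_api": ["rollback_tool"],
    }
    print(find_cycle(deps))
```

A check like this only helps if the dependency map is kept accurate, which is an organisational problem as much as a technical one.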

The cloudy concern admitted that it just didn't test this release properly - it ignored some of its own version change processes, didn't conduct proper sandbox tests, and now realizes its change management processes probably need some work.

The outfit has pledged to redesign bits of its cloud to detect abnormal changes and terminate them before they spread, conduct drills to improve its incident response, and offer alternative APIs should the interfaces fail.
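Tencent Cloud has not published how that detection would work. As a rough sketch of the general idea, a staged rollout applies a change to one slice of the fleet at a time and stops as soon as a health signal looks abnormal. Everything here is an assumption: the function names, the stage names, and the 1 percent error threshold are hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[str],
    apply: Callable[[str], None],
    error_rate: Callable[[str], float],
    revert: Callable[[str], None],
    max_error_rate: float = 0.01,  # assumed threshold, not from the source
) -> List[str]:
    """Apply a change stage by stage; undo everything if any stage looks unhealthy.

    Returns the list of stages left running the change (empty if reverted).
    """
    done: List[str] = []
    for stage in stages:
        apply(stage)
        done.append(stage)
        if error_rate(stage) > max_error_rate:
            # Abnormal change detected: stop before it spreads further.
            # The revert path should not depend on the system being changed.
            for finished in reversed(done):
                revert(finished)
            return []
    return done


if __name__ == "__main__":
    # Fake error rates standing in for real monitoring data.
    rates = {"canary": 0.0, "region-a": 0.5, "region-b": 0.0}
    applied: List[str] = []
    reverted: List[str] = []
    result = staged_rollout(
        ["canary", "region-a", "region-b"],
        applied.append,
        rates.__getitem__,
        reverted.append,
    )
    print(result, applied, reverted)
```

In this toy run the change never reaches `region-b`, because the second stage trips the threshold and both completed stages are reverted in reverse order.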

Tencent Cloud is not alone in breaking its own cloud with an update. The Register has reported on similar messes at Google, AWS, and Microsoft.

One important difference, however, is that Tencent Cloud is reputedly so intolerant of outages that it has been known to fire staff responsible for resilience after incidents.

Chinese media suggest that approach may have backfired - sources allege job cuts at Tencent Cloud contributed to this incident. ®

 

https://www.theregister.com//2024/04/17/tencent_cloud_api_error_outage/
