
GitHub fixes pull request delay that derailed developers

Tan KW
Publish date: Wed, 13 Mar 2024, 11:37 AM

GitHub is experiencing a second day of degraded performance, following a bad update that threw the code locker into chaos.

Microsoft's cloudy code collaboration service today advised users of degraded performance for Pull Requests.

Users who requested anonymity told The Register that delays of around ten minutes are the norm at the time of writing, meaning commits are made and pushed to branches but then aren't visible to all team members.

GitHub first acknowledged the issue at 23:39 UTC on March 12.

Around two hours later, it Xeeted news that it had found a mitigation and was "currently monitoring systems for recovery."

It didn't have to monitor for long: five minutes later the fix was in, and the incident ended.

Without explanation - for now.

GitHub has, however, explained the previous day's outage, which struck at 22:45 UTC on March 11 and persisted until 00:48 UTC the next day.

During that incident, Secret Scanning and 2FA using GitHub Mobile produced error rates up to 100 percent, before settling at around 30 percent for the last hour of the incident. Copilot error rates reached 17 percent, and API error rates reached one percent.

"This elevated error rate was due to a degradation of our centralized authentication service upon which many other services depend," according to GitHub's Status History page.

"The issue was caused by a deployment of network related configuration that was inadvertently applied to the incorrect environment," states GitHub's error report.

The error was spotted within four minutes and a rollback was initiated.

But the rollback failed in one datacenter, extending the time needed for recovery.

"At this point, many failed requests succeeded upon retrying," the status page adds.

Here's the rest of the service's mea culpa:

GitHub has pledged to work on "various measures to ensure safety of this kind of configuration change, faster detection of the problem via better monitoring of the related subsystems, and improvements to the robustness of our underlying configuration system including prevention and automatic cleanup of polluted records such that we can automatically recover from this kind of data issue in the future."

Good. ®

 

https://www.theregister.com/2024/03/13/github_outage_two_days_running/
