Future Tech

Failure to follow proper procedures caused US-wide AT&T outage, FCC says

Tan KW
Publish date: Tue, 23 Jul 2024, 11:00 PM

An AT&T cellular outage lasting more than 12 hours that prevented US customers from accessing services including 911 was caused by misconfigured hardware and a failure to follow standard procedures when deploying.

Or so says the Federal Communications Commission (FCC) of the incident on February 22, which affected AT&T wireless customers across all American states, plus Puerto Rico and the US Virgin Islands. The outage cut off access to voice and 5G data services nationwide, as The Register reported at the time. It took AT&T at least 12 hours to fully restore service, during which more than 25,000 attempted 911 calls could not get through.

Based on its investigations, the FCC's Public Safety and Homeland Security Bureau today referred the matter to its Enforcement Bureau for potential violations of FCC rules, which may result in a fine for America's second largest wireless carrier.

In its published findings [PDF] on the disruption, the FCC says it was caused by an AT&T Mobility employee adding a "misconfigured network element" to the production network, intended to expand capacity, during a routine nighttime maintenance window.

The process did not follow AT&T's established install procedures, which require peer review, and this led to the misconfiguration not being detected before the network element was introduced into the infrastructure.

As a result, an automated response was triggered that shut down all connections to prevent traffic from the misconfigured device propagating further into the network. This shutdown isolated all voice and 5G data processing elements from the wireless towers and switching tech, according to the FCC report.

The outcome was that the AT&T Mobility network disconnected all devices from voice services and 5G data, starting at 2:45 AM Central Standard Time, just three minutes after the misconfigured network element was added, causing a nationwide outage of its wireless service.

This not only affected consumers, but also any devices of the First Responder Network Authority (FirstNet) - in other words, the emergency services - that were registered to AT&T Mobility's network.

"When you sign up for wireless service, you expect it will be available when you need it - especially for emergencies," FCC chairwoman Jessica Rosenworcel said in a statement.

She added that the agency is taking this failure seriously and is working to provide accountability for the lapse in service and prevent similar outages in the future.

To address the interruption, AT&T's Network Operations performed a rollback that removed the misconfigured network element and then began the process to restore the network to normal operations, the FCC report states.

However, while most of AT&T's mobile subscribers were reconnected by early morning, the FCC notes that traffic congestion from so many mobile device registrations prevented some from getting back on the network, although these issues were mostly resolved by midday. FirstNet devices and infrastructure were given priority over commercial and residential users such that FirstNet service was restored by 5:00 AM.

The FCC also notes that AT&T notified FirstNet users of the outage starting at 5:53 AM, which was more than three hours after the outage began and approximately 53 minutes after the FirstNet infrastructure had been restored.

It wasn't until 7:05 AM that the company issued a public statement about the dodgy service, followed by additional updates throughout the morning, and a statement released at 2:10 PM indicated that wireless service had finally been restored to all affected customers.
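For readers who want to check the intervals quoted above, here is a small sketch that recomputes them from the reported clock times. The event names and the helper function are illustrative assumptions, not anything from the FCC report, and the 5:00 AM figure is an upper bound, since the report says FirstNet was restored "by" that time.

```python
from datetime import datetime

# Clock times as reported in the article, all on February 22, 2024.
# The dictionary keys are hypothetical labels chosen for this sketch.
events = {
    "outage_start": datetime(2024, 2, 22, 2, 45),
    "firstnet_restored": datetime(2024, 2, 22, 5, 0),   # "by 5:00 AM"
    "firstnet_notified": datetime(2024, 2, 22, 5, 53),
    "public_statement": datetime(2024, 2, 22, 7, 5),
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two named events."""
    delta = events[end] - events[start]
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    pairs = [
        ("outage_start", "firstnet_notified"),
        ("firstnet_restored", "firstnet_notified"),
        ("outage_start", "public_statement"),
    ]
    for start, end in pairs:
        total = minutes_between(start, end)
        hours, minutes = divmod(total, 60)
        print(f"{start} -> {end}: {hours}h {minutes:02d}m")
```

The first pair works out to just over three hours and the second to 53 minutes, matching the figures given for the FirstNet notification.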

The report by the FCC Public Safety and Homeland Security Bureau concludes that the outage was the result of multiple factors, all attributable to AT&T Mobility.

As well as the configuration error, these include a lack of adherence to internal procedures, a lack of peer review, a failure to adequately test after installation, inadequate laboratory testing, insufficient safeguards and controls covering approval of changes affecting the core network, a lack of controls to mitigate the effects of the outage once it began, and a number of system issues that prolonged the outage once the configuration error had been remedied.

While the direct cause of the outage was the employee who misconfigured a single network element, adequate peer review should have prevented the change from being approved, the FCC says.

The agency adds that post-installation testing should have ensured that network changes were implemented properly, but "to the extent that testing was performed when the misconfigured network element was placed into the production network on February 22, 2024, they were inadequate and failed to identify the incorrect behavior of the network element."

The report states that AT&T Mobility either lacked sufficient oversight and controls to ensure these test processes were followed, or, if these controls do exist, they are themselves insufficient.

A further criticism is that despite configuring its network to enter Protection Mode to prevent propagating errors to other parts of the network, AT&T failed to put in place adequate preparations for the congestion that would result as every device tried to re-register with the network upon restoration of service.

In a statement, AT&T told us: "We have implemented changes to prevent what happened in February from occurring again. We fell short of the standards that we hold ourselves to, and we regret that we failed to meet the expectations of our customers and the public safety community."

The FCC report confirms that AT&T has taken steps to avoid a repeat of the calamity, including scanning the network for any network elements lacking the controls that would have prevented it. The company has also now adopted procedures to ensure that maintenance work cannot take place without confirmation that required peer reviews have been completed.
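As a rough illustration of what such a control might look like in software, here is a minimal sketch of a gate that refuses to start maintenance until a change has been peer reviewed. It is purely hypothetical: the report does not describe how AT&T implemented its procedure, and every name here (`ChangeRequest`, `approve`, `start_maintenance`, `PeerReviewMissing`) is invented for this example.

```python
from dataclasses import dataclass, field


class PeerReviewMissing(Exception):
    """Raised when maintenance is attempted without the required reviews."""


@dataclass
class ChangeRequest:
    # Hypothetical record of a planned network change.
    change_id: str
    author: str
    description: str
    approvals: set = field(default_factory=set)


def approve(change: ChangeRequest, reviewer: str) -> None:
    """Record a review; authors may not approve their own change."""
    if reviewer == change.author:
        raise ValueError("author cannot peer review their own change")
    change.approvals.add(reviewer)


def start_maintenance(change: ChangeRequest, required_reviews: int = 1) -> str:
    """Allow work to begin only once enough distinct reviewers have signed off."""
    if len(change.approvals) < required_reviews:
        raise PeerReviewMissing(
            f"{change.change_id}: {len(change.approvals)} of "
            f"{required_reviews} required reviews completed"
        )
    return f"maintenance started for {change.change_id}"


if __name__ == "__main__":
    change = ChangeRequest("CHG-001", "alice", "add capacity element")
    try:
        start_maintenance(change)
    except PeerReviewMissing as err:
        print("blocked:", err)
    approve(change, "bob")
    print(start_maintenance(change))
```

The point of the design is that the check lives in the same code path that starts the work, so skipping the review is not possible by simply omitting a step.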

In its recommendations, the FCC report says the blackout highlights the need for carriers to adhere to best practices, implement adequate controls in their networks to mitigate risks, and be capable of responding quickly to restore service when an outage occurs.

The agency says it plans to release a Public Notice, based on its analysis of this and other recent outages, reminding service providers of the importance of implementing relevant industry-accepted best practices, including those recommended by its Communications Security, Reliability, and Interoperability Council (CSRIC). ®

 

https://www.theregister.com//2024/07/23/atandt_outage_fcc_report/
