Future Tech

Backblaze sees drive failure rates tick up, asks if AI can help

Tan KW
Publish date: Wed, 07 Aug 2024, 06:20 AM

Backblaze has issued the latest report detailing failure rates for the multitude of drives that power its storage and backup services, and is looking at recent trends in the figures as well as considering whether AI might lower those failure rates.

As a storage service provider, Backblaze monitors an entire fleet of drives of varying makes and models in its datacenters. Discounting boot devices, this amounted to 284,876 hard drives at the end of calendar Q2 2024.

However, the company discounted some drive models, including those that did not have at least 100 units in service with it and those that didn't accumulate 10,000 or more drive days during the quarter, leaving 284,386 drives divided into 29 different models for the analysis.
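As a rough illustration of that screening step, the sketch below keeps only models with at least 100 drives in service and 10,000 or more drive days in the quarter. The `ModelStats` record, the field names, and the sample figures are hypothetical and are not taken from the Drive Stats schema.

```python
from dataclasses import dataclass

# Thresholds as described in the article.
MIN_DRIVES = 100
MIN_DRIVE_DAYS = 10_000


@dataclass
class ModelStats:
    """Hypothetical per-model summary for one quarter."""
    model: str
    drive_count: int  # drives in service at the end of the quarter
    drive_days: int   # drive days accumulated during the quarter


def eligible_models(stats):
    """Return the models that meet both inclusion thresholds."""
    return [
        s for s in stats
        if s.drive_count >= MIN_DRIVES and s.drive_days >= MIN_DRIVE_DAYS
    ]


# Made-up sample data, for illustration only.
sample = [
    ModelStats("MODEL-A", drive_count=1_200, drive_days=108_000),
    ModelStats("MODEL-B", drive_count=60, drive_days=5_400),    # too few drives
    ModelStats("MODEL-C", drive_count=150, drive_days=9_000),   # too few drive days
]

kept = [s.model for s in eligible_models(sample)]
print(kept)
```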

With all the hype around AI of late, it was inevitable that the question would arise of whether it can be used to predict hard drive failures. In fact, predictive maintenance has long been held up as a use case for machine learning in IT and other areas of engineering.

For hard drives, this might involve Backblaze training an LLM using its Drive Stats data for a given drive type for the last year, then seeing if that model can use inferencing to provide a probability of failure for a specific device over time.

However, according to Backblaze's principal cloud storage evangelist and report author, Andy Klein, one aspect that is not clear is whether what the AI learns about one drive variant can be applied to a different one, as the failure profile for each one can differ radically from others. Klein refers to the snake chart (the last image in this article) to illustrate this; could an LLM trained with data from the 4TB Seagate drives (black line) predict drive failures for either of the 4TB HGST drives (purple and brown lines)?

Over the next few months, Backblaze aims to review research papers and studies that have looked at whether AI/ML can be used to make drive failure predictions to try to shed some light on the matter.

When it comes to its estate of more than 284,000 drives, Backblaze found the overall annualized failure rate (AFR) for Q2 was 1.71 percent, which is down from the 2.28 percent reported for the same period last year, but up from the 1.41 percent seen in Q1 this year.
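The article does not spell out how AFR is calculated. The sketch below assumes the definition Backblaze gives in its Drive Stats reports: failures divided by drive years (drive days over 365), expressed as a percentage. The function name and the sample numbers are hypothetical.

```python
def annualized_failure_rate(failures, drive_days):
    """AFR in percent: failures per drive year of operation, times 100.

    Assumes the drive-days-based definition; a drive year is 365 drive days.
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return failures / drive_years * 100


# Made-up example: 10 failures over 365,000 drive days (1,000 drive years).
example_afr = annualized_failure_rate(failures=10, drive_days=365_000)
print(f"{example_afr:.2f} percent")
```

Working in drive days rather than drive counts means a drive that was in service for only part of the period contributes only the days it actually ran.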

"While the quarter over quarter increase was a bit surprising, quarterly fluctuations in AFR are expected," noted Klein.

Backblaze reports that a 12TB HGST drive (HUH721212ALN604) gave it cause for concern by hitting an AFR of 7.17 percent for Q2. 

Klein says quarterly failure rates this high are uncharacteristic for this device, but the pattern now goes back about a year. As a result, the lifetime AFR has risen from 0.99 percent to 1.57 percent for this variant, and the company is keeping a close eye on developments.

Another notable finding is that two models of drive had zero failures during the quarter, both Seagate products (14TB ST14000NM000J and 16TB ST16000NM002J). These have a relatively small number of drives in service with Backblaze, however.

Backblaze reports that its oldest model of data drive still in production use is a 4TB Seagate (ST4000DM000), but that the data on these is scheduled to be migrated to newer (and presumably larger) drives over the next quarter or two.

However, the oldest individual data drive still in service is a 4TB HGST drive (HMS5C4040ALE640) that had nine years, 11 months and 23 days in operation at the end of Q2. The Backblaze Vault that drive is housed in is now in the process of being migrated.

According to Klein, Backblaze's aim when collecting all these statistics is to develop a failure profile for a given drive over time, which would help inform the company's replacement and migration strategies.

The following charts show changes in the lifetime AFR for drive models in operation that had accrued at least one million drive days of service as of the end of Q2 2024.

In the first chart, average age in months is plotted against the annualized failure rate for 14 different drive models that have an average age of 60 months or below. The second chart shows nine models, those for which the average age is over 60 months, with this split chosen because that length of time is the typical warranty period for enterprise class hard drives.

In the first chart, drives in quadrant I are regarded as performing well by Backblaze, with an AFR of less than 1.5 percent, while those in quadrant II have failure rates above 1.5 percent, but are still reasonable. Drives in quadrant IV are relatively new and just beginning to establish their failure profile. There are no drives in quadrant III, but if there were, it would not be a cause for alarm as some drive models can exhibit higher rates of failure early on.

In the second chart, the drives are spread out across all four quadrants, with quadrant I representing those performing well, as before, while quadrants II and III are "drives we need to worry about," according to Klein, and quadrant IV models look good so far.
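The quadrant scheme can be sketched as a simple classifier. The 1.5 percent AFR split comes from the description above. The age split within each chart is not stated in the article, so `age_split_months` is a hypothetical parameter. The mapping (I and II for the older half, III and IV for the younger half) is an inference from the text, not a confirmed layout.

```python
AFR_SPLIT_PERCENT = 1.5  # threshold quoted in the article


def classify_quadrant(afr_percent, avg_age_months, age_split_months):
    """Place a drive model in quadrant I-IV.

    Assumed layout (inferred, not confirmed by the source):
      I   older,   AFR below the split
      II  older,   AFR at or above the split
      III younger, AFR at or above the split
      IV  younger, AFR below the split
    An AFR of exactly 1.5 percent is treated as "high" here; the article
    does not say which side that boundary falls on.
    """
    older = avg_age_months >= age_split_months
    high_afr = afr_percent >= AFR_SPLIT_PERCENT
    if older:
        return "II" if high_afr else "I"
    return "III" if high_afr else "IV"


# Made-up values, using a hypothetical 30-month age split.
print(classify_quadrant(0.9, 50, age_split_months=30))
print(classify_quadrant(2.0, 12, age_split_months=30))
```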

However, in order to better illustrate the change in failure rates over time, Backblaze has come up with a new graphic. Behold the snake chart! This shows the lifetime failure rate of each of the nine models more than 60 months old over time, starting at 24 months to make the chart less messy.

The results show that the different types sort themselves out into either quadrant I or II once their average age passes 60 months, with five of the nine models in quadrant I as of Q2 2024.
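One way to produce the points behind a chart like this is to accumulate failures and drive days quarter by quarter and recompute the lifetime AFR at each step. The following is a minimal sketch under that assumption, using the drive-days-based AFR definition (failures per 365 drive days, as a percentage). The input layout, function name, and figures are hypothetical.

```python
def lifetime_afr_points(quarters, start_age_months=24):
    """Return (average age in months, lifetime AFR percent) points.

    `quarters` is an iterable of (avg_age_months, drive_days, failures)
    tuples in chronological order. Totals accumulate from the first
    record, but points are only emitted once the average age reaches
    `start_age_months`, mirroring the chart's 24-month starting point.
    """
    points = []
    total_days = 0
    total_failures = 0
    for avg_age_months, drive_days, failures in quarters:
        total_days += drive_days
        total_failures += failures
        if avg_age_months >= start_age_months and total_days > 0:
            afr = total_failures / (total_days / 365) * 100
            points.append((avg_age_months, afr))
    return points


# Made-up history for one drive model.
history = [
    (21, 365_000, 10),  # counted in the totals, but not plotted
    (24, 365_000, 10),
    (27, 365_000, 40),  # a bad quarter pulls the lifetime figure up
]

snake = lifetime_afr_points(history)
for age, afr in snake:
    print(f"{age} months: {afr:.2f} percent")
```

Because each point uses cumulative totals, a model with a steady failure rate produces a nearly straight line, while a model whose failures accelerate with age drifts toward higher AFR values.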

Those with nearly vertical lines (red, brown and purple) indicate that their failure rates have been consistent over time. However, the blue and grey lines represent drive models that have increased their failure rates as they have aged.

Despite this, Klein says the blue line (Seagate ST8000DM002) most represents a normal failure profile, as its failure rate for the first 60 months was consistently around 1 percent.

Of those drive models that ended up in quadrant II, three have similar failure profiles; they got to some point in their lifecycle and their curve began bending to the right as their failure rate accelerated. The black line represents a 4TB Seagate drive that is being "aggressively migrated" and replaced by other drives, according to Klein.

As ever, Backblaze makes available its full Drive Stats dataset for free, for anyone to download and analyze for themselves. The only provisos are that you cite Backblaze as the source if you use the data, and you cannot sell the data. ®

 

https://www.theregister.com//2024/08/06/backblaze_sees_drive_failure_rates/
