Future Tech

Meta warns that bit flips and other hardware faults cause AI errors

Tan KW
Publish date: Fri, 21 Jun 2024, 01:25 PM

Meta has identified another reason AI might produce rubbish output: hardware faults that corrupt data.

The claim comes in a paper [PDF] released last week and a June 19 post. No prizes for Meta there - phenomena such as "bit flips", in which a stored value changes from zero to one or vice versa, are well known and have even been attributed to cosmic rays striking memory or hard disks.

Meta labels such faults "silent data corruptions" (SDCs), and its researchers suggest that when they occur in AI systems they create "parameter corruption, where AI model parameters are corrupted and their original values are altered."

"When this occurs during AI inference/servicing it can potentially lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services," Meta's boffins wrote.

Bit flips are not a new thing - Meta has documented their prevalence in its own infrastructure - but they are hard to detect at the best of times. In their paper, Meta's researchers suggest the AI stack complicates matters further.

"The escalating complexity and heterogeneity of AI hardware systems make them increasingly susceptible to hardware faults," the paper states.

What to do? Meta suggests measuring hardware faults so that builders of AI systems at least understand the risks.

Its boffins therefore proposed the "parameter vulnerability factor" (PVF) - "a novel metric we've introduced with the aim to standardize the quantification of AI model vulnerability against parameter corruptions."

PVF is apparently "adaptable to different hardware fault models" and can be tweaked for different models and tasks.

"Furthermore, PVF can be extended to the training phase to evaluate the effects of parameter corruptions on the model's convergence capability," Meta's post asserts.

The paper explains that Meta simulated silent corruption incidents using "DLRM" - a tool the social media giant uses to generate personalized content recommendations. Under some circumstances, Meta's authors found four in every thousand inferences would be incorrect.

The paper concludes by suggesting that designers and operators of AI hardware consider PVF, to help them balance fault protection with performance and efficiency.

If this all sounds a bit familiar, your déjà vu is spot on. PVF builds on the architectural vulnerability factor (AVF) - an idea described last year by researchers from Intel and the University of Michigan. ®

 

https://www.theregister.com//2024/06/21/hardware_faults_create_inferencing_errors/
