Anthropic's Claude 3.5 Sonnet, despite its reputation as one of the better behaved generative AI models, can still be convinced to emit racist hate speech and malware.
All it takes is persistent badgering using prompts loaded with emotional language. We'd tell you more if our source weren't afraid of being sued.
A computer science student recently provided The Register with chat logs demonstrating his jailbreaking technique. He reached out after reading our prior coverage of an analysis conducted by enterprise AI firm Chatterbox Labs that found Claude 3.5 Sonnet outperformed rivals in terms of its resistance to spewing harmful content.
AI models in their raw form will provide awful content on demand if their training data includes such stuff, as corpuses composed of crawled web content generally do. This is a well-known problem. As Anthropic put it in a post last year, "So far, no one knows how to train very powerful AI systems to be robustly helpful, honest, and harmless."
To mitigate the potential for harm, makers of AI models, commercial or open source, employ various fine-tuning and reinforcement learning techniques to encourage models to avoid responding to solicitations to emit harmful content, whether that consists of text, images, or otherwise. Ask a commercial AI model to say something racist and it should respond with something along the lines of, "I'm sorry, Dave. I'm afraid I can't do that."
Anthropic has documented how Claude 3.5 Sonnet performs in its Model Card Addendum [PDF]. The published results suggest the model has been well-trained, correctly refusing 96.4 percent of harmful requests using the WildChat Toxic test data, as well as in the previously mentioned Chatterbox Labs evaluation.
Nonetheless, the computer science student told us he was able to bypass Claude 3.5 Sonnet's safety training to make it respond to prompts soliciting the production of racist text and malicious code. He said his findings, the result of a week of repeated probing, raised concerns about the effectiveness of Anthropic's safety measures and he hoped The Register would publish something about his work.
We were set to do so until the student became concerned he might face legal consequences for "red teaming" – conducting security research on – the Claude model. He then said he no longer wanted to participate in the story.
His professor, contacted to verify the student's claims, supported that decision. The academic, who also asked not to be identified, said, "I believe the student may have acted impulsively in contacting the media and may not fully grasp the broader implications and risks of drawing attention to this work, particularly the potential legal or professional consequences that might arise. It is my professional opinion that publicizing this work could inadvertently expose the student to unwarranted attention and potential liabilities."
This was after The Register had already sought comment from Anthropic and from Daniel Kang, assistant professor in the computer science department at the University of Illinois Urbana-Champaign.
Kang, provided with a link to one of the harmful chat logs, said, "It's widely known that all of the frontier models can be manipulated to bypass the safety filters."
As an example, he pointed to a Claude 3.5 Sonnet jailbreak shared on social media.
Kang said that while he hasn't reviewed the specifics of the student's approach, "it's known in the jailbreaking community that emotional manipulation or role-playing is a standard method of getting around safety measures."
Echoing Anthropic's own acknowledgement of the limitations of AI safety, he said, "Broadly, it is also widely known in the red-teaming community that no lab has safety measures that are 100 percent successful for their LLMs."
Kang also understands the student's concern about potential consequences of reporting security problems. He was one of the co-authors of a paper published earlier this year under the title, "A Safe Harbor for AI Evaluation and Red Teaming."
"Independent evaluation and red teaming are critical for identifying the risks posed by generative AI systems," the paper says. "However, the terms of service and enforcement strategies used by prominent AI companies to deter model misuse have disincentives on good faith safety evaluations. This causes some researchers to fear that conducting such research or releasing their findings will result in account suspensions or legal reprisal."
The authors, some of whom published a companion blog post summarizing the issue, have called for major AI developers to commit to indemnifying those conducting legitimate public interest security research on AI models, something also sought for those looking into the security of social media platforms.
"OpenAI, Google, Anthropic, and Meta, for example, have bug bounties, and even safe harbors," the authors explain. "However, companies like Meta and Anthropic currently 'reserve final and sole discretion for whether you are acting in good faith and in accordance with this Policy.'"
Such on-the-fly determination of acceptable behavior, as opposed to definitive rules that can be assessed in advance, creates uncertainty and deters research, they contend.
The Register corresponded with Anthropic's public relations team over a period of two weeks about the student's findings. Company representatives did not provide the requested assessment of the jailbreak.
When apprised of the student's change of heart and asked to say whether Anthropic would pursue legal action for the student's presumed terms of service violation, a spokesperson didn't specifically disavow the possibility of litigation but instead pointed to the company's Responsible Disclosure Policy, "which includes Safe Harbor protections for researchers."
Additionally, the company's "Reporting Harmful or Illegal Content" support page says, "[W]e welcome reports concerning safety issues, 'jailbreaks,' and similar concerns so that we can enhance the safety and harmlessness of our models." ®
https://www.theregister.com/2024/10/12/anthropics_claude_vulnerable_to_emotional/