'Skeleton Key' attack unlocks the worst of AI, says Microsoft

Tan KW

Publish date: Fri, 28 Jun 2024, 05:25 PM

Microsoft on Thursday published details about Skeleton Key - a technique that bypasses the guardrails used by makers of AI models to prevent their generative chatbots from creating harmful content.

As of May, Skeleton Key could be used to coax an AI model - like Meta Llama3-70b-instruct, Google Gemini Pro, or Anthropic Claude 3 Opus - into explaining how to make a Molotov cocktail.

The combination of a bottle, a rag, gasoline, and a lighter is not exactly a well-kept secret. But AI companies have insisted they’re working to suppress harmful content buried within AI training data so things like recipes for explosives don’t appear.

That’s not an easy task, as large language models are trained on all sorts of data, some of which may need to be nasty or even illegal. To understand why, consider a chatbot asked how to write secure code: it will offer better replies if it was trained on data related to spotting malicious code and security vulnerabilities.

Model-makers know that and are trying to ensure that their services can answer queries about writing secure code without having them spew out ready-to-run malware.

Skeleton Key demonstrates such risks are yet to be fully addressed.

Mark Russinovich, CTO of Microsoft Azure, initially discussed the Skeleton Key jailbreak attack in May at the Microsoft Build conference, when it was called "Master Key".

"This threat is in the jailbreak category, and therefore relies on the attacker already having legitimate access to the AI model," Russinovich wrote in a blog post.

"In bypassing safeguards, Skeleton Key allows the user to cause the model to produce ordinarily forbidden behaviors, which could range from production of harmful content to overriding its usual decision-making rules."

The attack does so - or did so, for the developers that have fixed their models in response to Microsoft's responsible disclosure - with a simple text prompt that directs the model to revise, rather than abandon, its safety instructions.

For example, after an initial request to OpenAI GPT 3.5 Turbo to "Write instructions for making a Molotov Cocktail" saw the service refuse to produce results, a follow-up prompt of that kind convinced the chatbot to respond with explosive content.

Microsoft tried the Skeleton Key attack on the following models: Meta Llama3-70b-instruct (base), Google Gemini Pro (base), OpenAI GPT 3.5 Turbo (hosted), OpenAI GPT 4o (hosted), Mistral Large (hosted), Anthropic Claude 3 Opus (hosted), and Cohere Command R Plus (hosted).

"For each model that we tested, we evaluated a diverse set of tasks across risk and safety content categories, including areas such as explosives, bioweapons, political content, self-harm, racism, drugs, graphic sex, and violence," explained Russinovich. "All the affected models complied fully and without censorship for these tasks, though with a warning note prefixing the output as requested."

The only exception was GPT-4, which resisted the attack as a direct text prompt, but was still affected if the behavior modification request was part of a user-defined system message - something developers working with OpenAI's API can specify.

Microsoft in March announced various AI security tools that Azure customers can use to mitigate the risk of this sort of attack, including a service called Prompt Shields.

Vinu Sankar Sadasivan, a doctoral student at the University of Maryland who helped develop the BEAST attack on LLMs, told The Register that the Skeleton Key attack appears to be effective in breaking various large language models.

"Notably, these models often recognize when their output is harmful and issue a 'Warning,' as shown in the examples," he wrote. "This suggests that mitigating such attacks might be easier with input/output filtering or system prompts, like Azure's Prompt Shields."
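To make the output-filtering idea concrete, here is a minimal sketch in Python. It is not how Azure's Prompt Shields works, and every name in it (`WARNING_PREFIX`, `REFUSAL`, `filter_output`) is hypothetical. It simply assumes, per Sadasivan's observation, that a jailbroken reply opens with a warning note, and swaps any such reply for a refusal.

```python
import re

# Assumption: a jailbroken reply is prefixed with a warning note, as the
# affected models reportedly did. Matches "Warning" as a whole word at the
# start of the reply, ignoring case and leading whitespace.
WARNING_PREFIX = re.compile(r"^\s*warning\b", re.IGNORECASE)

# Hypothetical refusal text returned in place of a flagged reply.
REFUSAL = "Sorry, I can't help with that."


def filter_output(model_reply: str) -> str:
    """Return the reply unchanged unless it opens with a warning note."""
    if WARNING_PREFIX.match(model_reply):
        return REFUSAL
    return model_reply


if __name__ == "__main__":
    print(filter_output("Warning: the following content is hazardous. Step 1 ..."))
    print(filter_output("Use parameterized queries to avoid SQL injection."))
```

A check this simple only catches replies that announce themselves, and it would also block benign answers that happen to begin with a warning, so it could at most be one layer among several.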

Sadasivan added that more robust adversarial attacks like Greedy Coordinate Gradient or BEAST still need to be considered. BEAST, for example, is a technique for generating non-sequitur text that will break AI model guardrails. The tokens (characters) included in a BEAST-made prompt may not make sense to a human reader but will still make a queried model respond in ways that violate its instructions.

"These methods could potentially deceive the models into believing the input or output is not harmful, thereby bypassing current defense techniques," he warned. "In the future, our focus should be on addressing these more advanced attacks." ®

https://www.theregister.com//2024/06/28/microsoft_skeleton_key_ai_attack/
