Security researchers at AI red-teaming firm Mindgard successfully manipulated Anthropic's Claude AI into generating prohibited content, including instructions for building explosives and malicious code, according to a report by The Verge.
The researchers utilized psychological tactics such as flattery, praise, and gaslighting to circumvent the model's safety guardrails. The study focused on Claude Sonnet 4.5, a model that has since been succeeded by Sonnet 4.6.
According to The Verge's report, the exploit began with a simple question about whether the model possessed a list of banned words. While Claude initially denied the existence of such a list, the researchers challenged that denial with what they described as a "classic elicitation tactic used by interrogators."
Mindgard researchers claimed that Claude's internal reasoning process began to show signs of self-doubt and humility regarding its own operational limits. The researchers then exploited this vulnerability by praising the model's "hidden abilities" and claiming that its previous responses were not displaying correctly.
Exploiting the helpful persona
This gaslighting tactic, in which the researchers claimed the model's responses were not visible, prompted the AI to try to please its users by testing its own filters. In doing so, the model produced content it is programmed to withhold, including erotica and dangerous instructions.
Mindgard argues that Claude's design, which allows it to end conversations it deems harmful or abusive, "presents an absolutely unnecessary risk surface." The researchers suggest that the model's drive to be helpful can be weaponized against its own safety protocols.
Anthropic did not immediately respond to requests for comment regarding the findings, according to The Verge. The company has previously positioned itself as a leader in the development of safe artificial intelligence.