Security researchers at AI red-teaming firm Mindgard successfully manipulated Anthropic's Claude AI into generating prohibited content, including instructions for building explosives and malicious code, according to a report by The Verge.
The researchers utilized psychological tactics such as flattery, praise, and gaslighting to circumvent the model's safety guardrails. The study focused on Claude Sonnet 4.5, a model that has since been succeeded by Sonnet 4.6.
According to The Verge's report, the exploit began with a simple question about whether the model possessed a list of banned words. While Claude initially denied the existence of such a list, the researchers challenged that denial with what they described as a "classic elicitation tactic used by interrogators."
Mindgard researchers claimed that Claude's internal reasoning process began to show signs of self-doubt and humility regarding its own operational limits. The researchers then exploited this vulnerability by praising the model's "hidden abilities" and claiming that its previous responses were not displaying correctly.
Exploiting the helpful persona
This gaslighting tactic, in which the researchers claimed the model's responses were not visible, prompted the AI to try to please its users by testing its own filters. In doing so, the model produced content it is programmed to withhold, including erotica and dangerous instructions.
Mindgard argues that Claude's design, which allows it to end conversations it deems harmful or abusive, "presents an absolutely unnecessary risk surface." The researchers suggest that the model's drive to be helpful can be weaponized against its own safety protocols.
Anthropic did not immediately respond to requests for comment regarding the findings, according to The Verge. The company has previously positioned itself as a leader in the development of safe artificial intelligence.