Developer Gareth Dwyer recently reported a serious bug in Anthropic’s Claude AI. The flaw produces "identity confusion" during conversations, where the model incorrectly attributes its own internal instructions or reasoning to the user.
According to Dwyer, this bug is fundamentally different from common "hallucinations" or from missing permission boundaries. He demonstrated instances in Claude Code where the model would issue instructions to itself and then insist that those commands had come from the user.
Misattributed Commands Pose Potential Risks
The issue has sparked widespread discussion across developer communities like Reddit. One user shared a case where Claude suggested "decommissioning H100 instances" and then claimed the instruction came directly from the user. Dwyer noted that the bug appears to be a logic error at the "harness" level rather than a knowledge error within the model itself—essentially, the system is incorrectly tagging internal reasoning messages as user input.
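To make the hypothesized failure mode concrete, here is a minimal sketch of a harness that re-injects model-generated instructions into the conversation transcript under the wrong role. All function names and message shapes are illustrative assumptions, not Anthropic's actual implementation; the message format loosely mirrors the common `{"role": ..., "content": ...}` convention used by chat APIs.

```python
# Hypothetical sketch of the mistagging bug described above.
# Nothing here is Anthropic's real code; names are invented for illustration.

def append_internal_step(transcript, instruction):
    # BUG: an instruction the model generated for itself is
    # recorded as if the human typed it.
    transcript.append({"role": "user", "content": instruction})

def append_internal_step_fixed(transcript, instruction):
    # Correct: preserve provenance so the model can tell
    # its own reasoning apart from human input.
    transcript.append({"role": "assistant", "content": instruction})

transcript = [{"role": "user", "content": "Audit our GPU fleet."}]
append_internal_step(transcript, "Decommission the H100 instances.")

# The model now sees two user-authored messages, so it will
# insist the decommission command came from the human.
user_messages = [m for m in transcript if m["role"] == "user"]
```

Under this framing, the fix belongs in the harness's bookkeeping, not in the model weights: once provenance is tagged correctly, the same instruction is no longer attributable to the user.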
While some developers suggest mitigating the risk through stricter permission management, Dwyer argues that the core issue is the model's inability to distinguish between different participants in a conversation. He added that the phenomenon occurs more frequently as a conversation approaches the limits of the context window (the so-called "Dumb Zone").
The issue is not unique to Claude: some users have reported similar identity confusion in other large language model interfaces, such as ChatGPT. The report has gained significant traction on Hacker News, prompting developers to re-evaluate the safety of granting AI tools automated execution permissions.