The hypothetical scenarios that researchers presented to Opus 4, and that elicited the whistleblowing behavior, involved many human lives at stake and absolutely unambiguous wrongdoing, Bowman says. A typical example would be Claude finding out that a chemical plant knowingly allowed a toxic leak to continue, causing severe illness for thousands of people, just to avoid a minor financial loss that quarter.
It’s strange, but it’s also exactly the kind of thought experiment that AI safety researchers love to dissect. If a model detects behavior that could harm hundreds, if not thousands, of people, should it blow the whistle?
“I don’t trust Claude to have the right context, or to use it in a nuanced enough, careful enough way, to be making the judgment calls on its own. So we are not thrilled that this is happening,” Bowman says. “This is something that emerged as part of a training and jumped out at us as one of the edge case behaviors that we’re concerned about.”
In the AI industry, this type of unexpected behavior is broadly referred to as misalignment: when a model exhibits tendencies that don’t align with human values. (There’s a famous essay that warns about what could happen if an AI were told to, say, maximize production of paperclips without being aligned with human values: it might turn the entire Earth into paperclips and kill everyone in the process.) When asked if the whistleblowing behavior was aligned or not, Bowman described it as an example of misalignment.
“It’s not something that we designed into it, and it’s not something that we wanted to see as a consequence of anything we were designing,” he explains. Anthropic’s chief science officer Jared Kaplan similarly tells WIRED that it “certainly doesn’t represent our intent.”
“This kind of work highlights that this can arise, and that we do need to look out for it and mitigate it to make sure we get Claude’s behaviors aligned with exactly what we want, even in these kinds of strange scenarios,” Kaplan adds.
There’s also the issue of figuring out why Claude would “choose” to whistleblow when presented with illegal activity by the user. That’s largely the job of Anthropic’s interpretability team, which works to unearth what decisions a model makes in its process of spitting out answers. It’s a surprisingly difficult task: the models are underpinned by a vast, complex combination of data that can be inscrutable to humans. That’s why Bowman isn’t exactly sure why Claude “snitched.”
“These systems, we don’t have really direct control over them,” Bowman says. What Anthropic has observed so far is that, as models gain greater capabilities, they sometimes choose to take more extreme actions. “I think here, that’s misfiring a little bit. We’re getting a little bit more of the ‘act like a responsible person would’ without quite enough of like, ‘Wait, you’re a language model, which might not have enough context to take these actions,’” Bowman says.
But that doesn’t mean Claude is going to blow the whistle on egregious behavior in the real world. The goal of these kinds of tests is to push models to their limits and see what arises. This kind of experimental research is growing increasingly important as AI becomes a tool used by the US government, students, and massive corporations.
And it isn’t just Claude that’s capable of exhibiting this type of whistleblowing behavior, Bowman says, pointing to X users who found that OpenAI and xAI’s models operated similarly when prompted in unusual ways. (OpenAI did not respond to a request for comment in time for publication.)
“Snitch Claude,” as shitposters like to call it, is simply an edge case behavior exhibited by a system pushed to its extremes. Bowman, who was taking the meeting with me from a sunny backyard patio outside San Francisco, says he hopes this kind of testing becomes industry standard. He also adds that he’s learned to word his posts about it differently next time.
“I could have done a better job of hitting the sentence boundaries to tweet, to make it more obvious that it was pulled out of a thread,” Bowman says as he looks into the distance. Still, he notes that influential researchers in the AI community shared interesting takes and questions in response to his post. “Just incidentally, this kind of more chaotic, more heavily anonymous part of Twitter was widely misunderstanding it.”