Anthropic launches $15K jailbreak bounty program for its unreleased next-gen AI

The program will be open to a limited number of participants initially but will expand at a later date.

Artificial intelligence firm Anthropic announced the launch of an expanded bug bounty program on Aug. 8, with rewards as high as $15,000 for participants who can “jailbreak” the company’s unreleased, “next generation” AI model. 

Anthropic’s flagship AI model, Claude 3, is a generative AI system similar to OpenAI’s ChatGPT and Google’s Gemini. As part of the company’s efforts to ensure that Claude and its other models are capable of operating safely, it conducts what’s called “red teaming.”

Red teaming

Red teaming is essentially the practice of deliberately trying to break something. In Claude’s case, the goal of red teaming is to identify all the ways it can be prompted, manipulated, or otherwise coaxed into producing undesirable outputs.

During red teaming efforts, engineers might rephrase or reframe a query in order to trick the AI into outputting information it’s been programmed to avoid.

For example, an AI system trained on data gathered from the internet is likely to contain personally identifiable information on numerous people. As part of its safety policy, Anthropic has put guardrails in place to prevent Claude and its other models from outputting that information.
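To illustrate the idea of a guardrail at its simplest, the sketch below shows a naive output filter that redacts PII-like strings before a response is returned. This is purely hypothetical; Anthropic’s actual safeguards are trained into the model and are far more sophisticated, and the patterns and function names here are the author’s own. It also shows why red teaming matters: simple filters like this are exactly what cleverly rephrased prompts can slip past.

```python
import re

# Hypothetical, illustrative guardrail: regex-based redaction of
# common PII-like patterns in model output. Not Anthropic's method.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # US SSN-style number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),      # email address
    re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),  # US phone number
]

def redact_pii(text: str) -> str:
    """Replace any substring matching a PII pattern with a placeholder."""
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

A red teamer’s job is to find inputs that make the system emit sensitive data in a form no such filter anticipates, for instance spelled out or encoded, which is why output-side pattern matching alone is never sufficient.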

As AI models become more robust and capable of imitating human communication, the task of trying to figure out every possible unwanted output becomes exponentially more challenging.

Bug bounty

Anthropic has implemented several novel safety interventions in its models, including its “Constitutional AI” paradigm, but it’s always nice to get fresh eyes on a long-standing issue.

According to a company blog post, its latest initiative will expand on existing bug bounty programs to focus on universal jailbreak attacks:

“These are exploits that could allow consistent bypassing of AI safety guardrails across a wide range of areas. By targeting universal jailbreaks, we aim to address some of the most significant vulnerabilities in critical, high-risk domains such as CBRN (chemical, biological, radiological, and nuclear) and cybersecurity.”

The company is only accepting a limited number of participants and encourages experienced AI researchers and those who “have demonstrated expertise in identifying jailbreaks in language models” to apply by Aug. 16.

Not everyone who applies will be selected, but the company plans to “expand this initiative more broadly in the future.”

Those who are selected will receive early access to an unreleased “next generation” AI model for red-teaming purposes.

Related: Tech firms pen letter to EU requesting more time to comply with AI Act
