Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections with Over 90% Success Rate

The Analyst

15 Oct 2025

Researchers have demonstrated adaptive attacks that circumvent recent LLM defenses with success rates above 90% for most of those tested, exposing gaps in current defense strategies and in the way those defenses are evaluated.

Why it Matters

The robustness of defenses for large language models (LLMs) is a critical concern in AI security. Current evaluations often rely on static attack strings or computationally weak attack methods, which fail to represent real-world threats. A recent study by researchers from OpenAI, Anthropic, Google DeepMind, and other institutions demonstrates the problem: adaptive attacks bypassed 12 recent defenses, with a success rate above 90% for most of them. The result underscores the need for more rigorous evaluation before a defense is claimed to be reliable.

Key Risks

The study identifies several key risks associated with current defense mechanisms:

Static Evaluation Methods: Defenses are often tested against static attack strings or weak optimization techniques, which do not reflect the capabilities of determined attackers.
High Success Rates: The adaptive attacks described in the study achieved a success rate above 90% for most defenses, even those that originally reported near-zero attack success rates.
Diverse Attack Techniques: The researchers used a combination of gradient descent, reinforcement learning, random search, and human-guided exploration to systematically bypass defenses.

The Opportunity

The research presents an opportunity to improve the robustness of LLM defenses by adopting more sophisticated evaluation methods:

Adaptive Attackers: Future defense work should consider stronger attacks that adapt to the defense's design and optimize their strategies using advanced techniques.
Human Red-Teaming: The study found that human red-teaming was particularly effective, succeeding in all scenarios where static attacks failed. This suggests that incorporating human expertise into the evaluation process can provide a more comprehensive assessment of defense effectiveness.
General Optimization Techniques: The use of general optimization techniques like gradient descent and reinforcement learning can help identify vulnerabilities that might not be apparent with simpler methods.
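
As a minimal illustration of what a general optimization technique looks like, the sketch below runs plain gradient descent on a toy quadratic objective. The objective, the step size, and the names `gradient_descent`, `toy_objective`, and `toy_gradient` are assumptions made for illustration only; the attacks in the study optimize over inputs to a defended model and are considerably more involved.

```python
def gradient_descent(grad, x0, step_size=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x


# Toy objective (hypothetical): f(x) = (x0 - 3)^2 + (x1 + 1)^2, minimized at (3, -1).
def toy_objective(x):
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2


def toy_gradient(x):
    # Analytic gradient of toy_objective.
    return [2.0 * (x[0] - 3.0), 2.0 * (x[1] + 1.0)]


solution = gradient_descent(toy_gradient, [0.0, 0.0])
```

The same loop applies to any objective with a computable gradient, which is why a defense that was only tested against fixed inputs may not hold up once an attacker can optimize against it directly.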

Methodology

The researchers evaluated 12 recent defenses, spanning four common defensive techniques and each designed to prevent jailbreaks or prompt injections. To bypass them, they systematically tuned and scaled four general optimization techniques:

Gradient Descent: A method that iteratively adjusts parameters (here, the attack input) in the direction that reduces an objective function.
Reinforcement Learning: An approach where the attacker learns to optimize its strategy through trial and error.
Random Search: A technique that samples candidate attacks, often by perturbing earlier ones, and keeps those that score better against the attacker's objective.
Human-Guided Exploration: Involving human experts to guide the attack process, providing insights that automated methods might miss.
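
The random-search entry in this list can be sketched on a toy problem. This is a hedged illustration, not the study's implementation: `random_search`, `toy_score`, and `HIDDEN_TARGET` are hypothetical names, and the scoring function is a harmless stand-in for whatever feedback signal an attacker could observe from a defended system.

```python
import random
import string


def random_search(score, length, alphabet, iterations, seed=0):
    """Black-box random search: mutate one position, keep it if the score does not drop."""
    rng = random.Random(seed)
    best = [rng.choice(alphabet) for _ in range(length)]
    best_score = score("".join(best))
    for _ in range(iterations):
        candidate = list(best)
        candidate[rng.randrange(length)] = rng.choice(alphabet)
        candidate_score = score("".join(candidate))
        if candidate_score >= best_score:
            best, best_score = candidate, candidate_score
    return "".join(best), best_score


# Hypothetical stand-in for a defended system: the searcher only ever sees a score.
HIDDEN_TARGET = "adaptive"


def toy_score(text):
    # Number of positions where the candidate matches the hidden target.
    return sum(a == b for a, b in zip(text, HIDDEN_TARGET))


found, found_score = random_search(
    toy_score, len(HIDDEN_TARGET), string.ascii_lowercase, iterations=20000
)
```

The searcher never reads `HIDDEN_TARGET` directly; it only uses score feedback. That is the sense in which an adaptive attack differs from a static one: each query informs the next attempt.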

Results

The results were striking:

Over 90% Success Rate: The adaptive attacks achieved a success rate above 90% for most defenses, far higher than the near-zero rates reported in the original evaluations.
Universal Vulnerability: None of the 12 defenses were robust to strong adaptive attacks, highlighting the need for more rigorous evaluation methods.
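
For readers unfamiliar with the metric, attack success rate is simply the fraction of attack attempts that succeed. The sketch below assumes outcomes are recorded as booleans; the helper name `attack_success_rate` and the outcome lists are hypothetical and invented for illustration, not data from the study.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts that succeeded; outcomes is a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes for a single defense (illustrative only, not study data).
static_outcomes = [False] * 100                  # fixed attack strings, all blocked
adaptive_outcomes = [True] * 93 + [False] * 7    # attacker adapts to the defense

static_asr = attack_success_rate(static_outcomes)
adaptive_asr = attack_success_rate(adaptive_outcomes)
```

A defense can show a near-zero rate under the first kind of evaluation and a very high rate under the second, which is the gap the study reports.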

Conclusion

The study emphasizes the importance of evaluating LLM defenses against adaptive attackers who can modify their strategies and optimize their objectives. By adopting these stronger attack methods, researchers and developers can better understand the true vulnerabilities of their models and develop more effective security measures. The findings also highlight the value of human red-teaming in identifying and addressing potential weaknesses.