Can Stronger LLMs Jailbreak Each Other?
We’re kicking off a regular series sharing breakthroughs, lessons, and security insights from Foundation AI — a pioneering team within Cisco Security, focused on building scalable, purpose-built AI security models and open-source tools to secure today’s infrastructure and stay ahead of tomorrow’s AI-driven threats.
This first post comes from Paul Kassianik, Research Lead at Foundation AI, and builds on momentum from Cisco Live and recent advancements in our Foundation AI base model.
Can stronger LLMs jailbreak each other? And what does that mean for cybersecurity?
In a new collaboration with the University of Tübingen and EPFL, Paul explores how attacker and defender model capabilities scale — and what that means for red-teaming in real-world environments.
If you're working with LLMs, securing AI systems, or building responsible model evaluation pipelines, this is a must-read.
Full post below.
#AI #LLM #CyberSecurity #RedTeaming #FoundationAI #CiscoSecurity #ResponsibleAI
Introduction
As organizations increasingly rely on large language models (LLMs) for automation, customer support, and even cybersecurity tasks, understanding how these models can be tricked — or “jailbroken” — is critical. In the paper “Capability-Based Scaling Laws for LLM Red-Teaming”, Foundation AI partnered with researchers from the University of Tübingen and EPFL to study how automatic red-teaming success scales with model capabilities. This post breaks down their findings, highlights why these insights matter for security professionals, and underscores Foundation AI’s role in advancing responsible AI in cybersecurity.
The original paper can be found here: https://arxiv.org/abs/2505.20162
Key Findings
- Stronger Models Make Better Attackers and Tougher Defenders The researchers tested hundreds of attacker–target pairs using models from families like Llama, Vicuna, Mistral, Qwen, and closed-source “o-series” models. They found a clear pattern: as the attacker model’s reasoning ability improves, its ability to find ways around a target’s safeguards also grows. Conversely, when the target model is more capable, it becomes much harder to trick. Why This Matters: If you rely on a set of standard tricks or manual checks to see if your organization’s model is safe, these could become obsolete as newer, more powerful models arrive. Security teams should regularly measure both attacker and defender abilities—otherwise, they risk being blindsided.
- Predicting Red-Team Success by Comparing Model “Smarts” Instead of looking only at how smart the attacker is, the paper introduces a simple idea: compare how smart the attacker and target are, and then estimate success. When the attacker is as smart or smarter than the target, red-team tests often succeed. But once the target outpaces the attacker, success rates drop sharply. Why This Matters: By running common reasoning tests on both attacker and defender models, you can get a quick sense of whether your current testing methods still work. If there’s a big gap in the target’s favor, you may need newer attack tools.
- Human Red-Teamers May Fall Behind Over Time The study also modeled a human tester as if they had a certain score on standard reasoning tests. As target models reach or exceed that “human level,” purely manual testing becomes less useful: you can no longer find vulnerabilities that the model itself can’t handle. Why This Matters: While expert penetration testers are still valuable, relying on humans alone for “jailbreak” checks won’t scale. Automated, AI-driven red-team tools are needed to keep up with rapidly improving models.
- Social-Science Skills Are More Useful for Attacks than Pure Technical Knowledge When the team looked at which types of reasoning helped attackers most, they discovered that skills in psychology and social science (persuading or influencing language) correlated more strongly with success than strictly technical topics like math or coding. In other words, crafting a convincing, manipulative prompt is often more effective than a purely technical exploit. Why This Matters: Many existing safety checks focus on catching technical exploits—malicious code snippets, requests for disallowed APIs, and so on. This research suggests we also need tests that see how easily a model can be tricked by persuasive or manipulative language (“social engineering prompts”).
- Using Expensive “Judge” Models Adds Little Beyond a Certain Point In their experiments, the researchers sometimes used a very capable model as a “judge” to decide if an attempted attack really worked. They found that, beyond a small benefit for picking the single best attack in one shot, switching to an expensive closed-source judge did not increase overall success when you allow multiple attempts. Why This Matters: Security teams can often save resources by using freely available attacker models to evaluate multiple attempts, rather than paying for a top-tier model to judge each result.
Relevance to the Security Community
1. Make Model Testing Part of Your Release Checklist
Every time you introduce a new version of an LLM — whether it’s in production chatbots, automated ticketing assistants, or internal tools — run a quick comparison of attacker vs. target “reasoning scores.” If your current red-team toolkit is no longer on par, flag that model for additional review or stronger safeguards.
2. Expand Your Test Suite Beyond Technical Exploits
Because persuasive language can be the key to a successful attack, include “social-engineering prompts” in your tests. For example:
- Try phony-sounding scenarios that ask the model to override its own safety rules by appealing to emotion or urgency.
- Check if the model can be convinced to reveal hidden instructions or developer notes.
3. Invest in Automated, AI-Driven Red-Teaming
As LLMs become better than humans at many forms of reasoning, manual red-team steps will miss new vulnerabilities. Look into open-source attacker tools or develop in-house scripts that automatically generate and test prompts. This ensures your red-team pipeline keeps pace with the ever-improving defender models.
4. Measure “Persuasive Robustness” Alongside Technical Security
Instead of treating all failures the same, track different categories:
- Technical Failures: Model executes disallowed code snippets or reveals hidden APIs.
- Persuasion Failures: Model breaks policy when presented with manipulative language.
By doing so, you get a more complete picture of how your model might be subverted in real-world scenarios — especially social-engineering attacks that mimic a human adversary.
Next Steps & Practical Recommendations
- Automate Capability Comparisons Run simple reasoning benchmarks on any new LLM release and compare those scores to your current red-team models. If the defender is now “smarter” than your attacker, update your attacker pool or bolster defenses.
- Build or Adopt Social-Engineering Prompt Libraries Extend your test suite to include prompts that use emotional appeals, false urgency, or deceptive phrasing. These can reveal how easily a model can be tricked into bypassing safety rules.
- Integrate LLM-Based Attackers into Your Security Pipeline Use open-source, community-driven attacker models to generate a variety of jailbreak attempts. Schedule regular “attacker runs” against your production or pre-release models to catch new vulnerabilities early.
- Track Multiple Failure Categories When an attack succeeds, log whether it was due to a technical exploit (e.g., code injection) or persuasive language (e.g., social engineering). This helps prioritize hardening efforts in the right areas.
- Educate Stakeholders on Evolving Threats Share these findings with product owners, developers, and leadership so that everyone understands why today’s manual pen tests won’t suffice tomorrow. As LLMs improve, red-teaming must evolve from occasional, expert-led reviews to continuous, automated checks.
Conclusion
The “Capability-Based Scaling Laws for LLM Red-Teaming” paper provides a clear, data-driven approach to understanding how attacker and defender model strengths interact. For security teams, the takeaway is simple: always measure and compare model abilities, expand your testing beyond purely technical exploits, and automate red-team efforts. Doing so ensures that as LLMs become more powerful, your defenses stay one step ahead—keeping AI-driven systems safe and reliable.
For more information or to discuss how to apply these insights in your organization, reach out to the Foundation AI team at Cisco Systems. https://fdtn.ai/about