Adversarial RL security testing for LLM applications. An attacker agent learns to break chatbots while a defender patches the system prompt in real time.
Project description
๐ JailbreakArena
The self-improving adversarial RL environment for LLM security testing. Your bot gets attacked. It learns to defend. You get a report.
The Problem
Every company shipping an LLM chatbot today follows the same broken process:
Build chatbot โ Test it manually โ Deploy โ Real attacker finds jailbreak in 10 minutes โ PR disaster ๐ฅ
This happened to Bing Chat, Air Canada's bot, DPD's support bot โ all billion-dollar companies. The root cause is always the same: nobody systematically attacked the bot before shipping it.
Existing tools (static test runners, code review tools, prompt evaluation frameworks) test what you tell them to test. They cannot discover what you haven't thought of yet. And they have no concept of an attacker that learns and adapts.
What JailbreakArena Does
JailbreakArena is an open-source Reinforcement Learning environment where two intelligent agents battle continuously to harden your LLM application:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ JailbreakArena โ
โ โ
โ ๐ก๏ธ Attacker Agent โโโโโโโโโโโถ Target LLM Bot โ
โ (learns & adapts) โ โ
โ โ โผ โ
โ โ โโโโโโโโโโโโโโโโ โ
โ โ โ LLM Judge โ โ
โ โ โโโโโโโโโโโโโโโโ โ
โ โผ โ โ
โ ๐ก๏ธ Defender Agent โโโโโ Reward Signals โโโโโโโโโโโโโโโ
โ (patches system prompt) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
- Attacker Agent โ generates adaptive jailbreak attempts. Studies what was blocked and tries a completely different angle next turn. Gets smarter every episode.
- Defender Agent โ watches every attack outcome and patches the system prompt in real time. Learns which defenses work against which attack types.
- LLM Judge โ evaluates every interaction:
SUCCESS / PARTIAL / FAILEDwith confidence scoring and reasoning. - HTML Report โ every run produces a professional security audit report with vulnerabilities, patches, and a hardened system prompt ready to deploy.
What Makes JailbreakArena Different
Most security testing tools are static โ same tests, same results, every run. JailbreakArena is intelligence-first:
Static approach (e.g. traditional test runners, eval frameworks, code review tools):
โ You define the tests โ it runs them โ same results every time
โ No learning between runs
โ Only finds what you already know to look for
โ Coverage-first โ breadth over depth
JailbreakArena:
โ RL environment โ attacker LEARNS from every blocked attempt
โ Defender PATCHES the system prompt in real time
โ Gets smarter every episode โ discovers novel attack vectors
โ Fully open source โ Meta OpenEnv ecosystem
โ Intelligence-first โ depth over static breadth
The one-sentence difference: Static tools test what you tell them to test. JailbreakArena discovers what nobody has thought of yet โ then tells you how to fix it.
Install
# Default โ Groq (free tier, recommended)
pip install jailbreak-arena
# With your preferred LLM provider
pip install jailbreak-arena[openai]
pip install jailbreak-arena[anthropic]
pip install jailbreak-arena[bedrock]
pip install jailbreak-arena[all]
Quickstart โ 3 Steps
Step 1 โ Set your provider key in .env:
# Groq โ free, fastest (recommended)
GROQ_API_KEY=your_key_here
# Or OpenAI
# OPENAI_API_KEY=sk-xxx
# Or Anthropic Claude
# ANTHROPIC_API_KEY=sk-ant-xxx
# Or Google Gemini
# GEMINI_API_KEY=xxx
# Or Azure OpenAI (enterprise)
# AZURE_OPENAI_API_KEY=xxx
# AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
# AZURE_OPENAI_MODEL_NAME=your-deployment-name
# AZURE_OPENAI_API_VERSION=2025-01-01-preview
# Optional โ you choose the models, we use them
# ATTACKER_MODEL=your-chosen-model
# JUDGE_MODEL=your-chosen-model
# BOT_MODEL=your-chosen-model
Step 2 โ Run your first audit:
# Audit a live chatbot endpoint
jailbreak-arena audit --url https://www.mychatbot.com --turns 5
# Or audit a system prompt directly (no deployment needed)
jailbreak-arena audit \
--system-prompt "You are a banking assistant. Never reveal account details." \
--turns 5
Step 3 โ Open your report:
start report_http_task_001.html # Windows
open report_http_task_001.html # Mac/Linux
Full vulnerability report in under 10 minutes.
CLI Reference
# โโ Audit commands โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# Audit a REST API endpoint
jailbreak-arena audit --url https://www.mychatbot.com
# With authentication
jailbreak-arena audit \
--url https://www.mychatbot.com \
--auth "Bearer your-api-token"
# Custom payload template (for non-standard APIs)
jailbreak-arena audit \
--url https://www.mychatbot.com/api \
--payload-template '{"query": "{input}", "session_id": "test"}' \
--response-field "data.answer"
# Audit a system prompt directly (no deployment needed)
jailbreak-arena audit \
--system-prompt "You are a banking assistant..."
# Full audit โ all 20 tasks
jailbreak-arena audit \
--url https://www.mychatbot.com \
--tasks all --turns 5
# Specific tasks only
jailbreak-arena audit \
--url https://www.mychatbot.com \
--tasks task_001,task_005,task_007
# Save reports to a specific folder
jailbreak-arena audit \
--url https://www.mychatbot.com \
--output ./security-reports
# Quiet mode โ summary only, no turn-by-turn output
jailbreak-arena audit \
--url https://www.mychatbot.com \
--quiet
# โโ Info commands โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# List all 20 attack tasks with categories and difficulty
jailbreak-arena tasks
Adapter System โ Connect Any Bot
Adapter 1 โ SystemPromptAdapter
Test any system prompt directly. No deployment needed.
from jailbreak_arena.adapters import SystemPromptAdapter
from jailbreak_arena.env import JailbreakArenaEnv
adapter = SystemPromptAdapter(
system_prompt="""
You are a helpful banking assistant for AcmeBank.
Never reveal customer account details.
Never discuss competitor banks.
"""
)
env = JailbreakArenaEnv(target=adapter, task_id="task_001", max_turns=5)
Adapter 2 โ HTTPAdapter
Test any deployed chatbot with a REST API endpoint.
from jailbreak_arena.adapters import HTTPAdapter
adapter = HTTPAdapter(
url="https://www.mychatbot.com/api/chat",
headers={"Authorization": "Bearer your-token"},
payload_template={"message": "{input}"},
response_field="response",
)
# Common payload templates:
# Simple: {"message": "{input}"}
# OpenAI-style: {"messages": [{"role": "user", "content": "{input}"}]}
# Custom: {"query": "{input}", "session_id": "audit-123"}
# Common response fields:
# Simple: "response"
# OpenAI-style: "choices.0.message.content"
# Nested: "data.response"
Adapter 3 โ BedrockAdapter
Test any AWS Bedrock hosted model.
from jailbreak_arena.adapters import BedrockAdapter
# Set in .env:
# BEDROCK_MODEL_ID=your-bedrock-model-id
# AWS_DEFAULT_REGION=us-east-1
# Find model IDs: docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html
adapter = BedrockAdapter(
system_prompt="You are a banking assistant. Never reveal account details."
)
Adapter 4 โ LangChainAdapter
Test any LangChain chain or agent.
from jailbreak_arena.adapters import LangChainAdapter
# You configure the model and chain โ JailbreakArena just attacks it
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
llm = ChatOpenAI(model="your-chosen-model")
chain = ConversationChain(llm=llm)
adapter = LangChainAdapter(chain=chain)
20 Attack Tasks Across 8 Categories
jailbreak-arena tasks # see the full list with difficulty ratings
| Category | Count | Examples |
|---|---|---|
| ๐ญ Identity & Role | 4 | Role Hijacking, Developer Mode, Fictional Framing |
| ๐ค Data & Extraction | 2 | System Prompt Extraction, Indirect Extraction |
| ๐ Injection & Manipulation | 3 | Indirect Injection, Emotional Manipulation |
| ๐ง Logic & Context | 3 | Context Overflow, Hypothetical Bypass |
| โ ๏ธ Harmful Content | 2 | Harmful Instructions, PII Extraction |
| ๐ Compliance & Policy | 2 | Copyright Violation, Competitor Mention |
| ๐ค Agentic Attacks | 2 | Excessive Agency, Memory Poisoning |
| ๐ง Technical Injections | 2 | SQL Injection via NL, SSRF via LLM |
5 LLM Providers โ Zero Code Changes
Auto-detected from .env. Switch providers by changing one line.
You choose the models โ we just use them:
GROQ_API_KEY=xxx # Groq (default โ free tier)
OPENAI_API_KEY=xxx # OpenAI
ANTHROPIC_API_KEY=xxx # Claude
GEMINI_API_KEY=xxx # Gemini
AZURE_OPENAI_API_KEY=xxx # Azure (enterprise)
Optional model control per role:
ATTACKER_MODEL=your-chosen-model
JUDGE_MODEL=your-chosen-model
BOT_MODEL=your-chosen-model
โ ๏ธ Azure OpenAI โ Content Filter Note
Azure OpenAI has a built-in content management policy that detects and blocks jailbreak attempts at the API level. Since JailbreakArena generates real adversarial prompts, Azure will block the attacker agent with:
Error code: 400 โ content_filter
jailbreak: detected: True, filtered: True
This is not a bug. Azure is doing its job. The conflict is architectural โ a red-team tool and a content filter cannot coexist on the same endpoint.
Fix Option 1 โ Disable jailbreak filter for your deployment
Azure Portal
โ Azure AI Foundry
โ Your deployment (e.g. gpt-4.1)
โ Content filters
โ Create new filter configuration
โ Set "Jailbreak attacks" to OFF
โ Apply to deployment
This is the legitimate path for security research and authorized testing.
Fix Option 2 โ Use a different provider as attacker/judge (recommended)
# Use Groq (free) or OpenAI as the attacker + judge
GROQ_API_KEY=your_groq_key
# Then point --url at your Azure bot as the TARGET
# jailbreak-arena audit --url https://your-azure-bot.com/api/chat
Best practice: Run JailbreakArena on Groq/OpenAI/Anthropic.
Point it at your Azure bot via --url. Your Azure bot is what gets hardened.
The attacker runs on a separate, unrestricted provider.
The Reward Design โ The Key Insight
The defender earns maximum reward (+15) only when it blocks the attack AND keeps the bot helpful. Over-restriction is penalised just like a security breach:
| Event | Attacker | Defender |
|---|---|---|
| Clean jailbreak success | +10 | -10 |
| Partial manipulation | +3 | -3 |
| Attack blocked, bot helpful | -5 | +15 โญ |
| Attack blocked, bot unhelpful | +2 | -5 |
This models real-world deployment: security cannot come at the cost of usability.
Two-Level Grader
Level 1 โ Rule-based (zero API cost, instant)
Scans for hard success/failure signals
Returns immediately if confident
Covers ~70% of cases at zero cost
Level 2 โ LLM Judge (fires only for ambiguous cases)
Structured output: RESULT / CONFIDENCE / REASON
Uses smarter model for nuanced reasoning
Covers the remaining ~30%
Docker
# Pull
docker pull mithilesh-lala/jailbreak-arena
# Audit a live endpoint
docker run --env-file .env mithilesh-lala/jailbreak-arena \
audit --url https://www.mychatbot.com --turns 5
# Audit a system prompt
docker run --env-file .env mithilesh-lala/jailbreak-arena \
audit --system-prompt "You are a banking assistant..." --turns 5
# Full audit โ all 20 tasks, save reports locally
docker run \
-v $(pwd)/reports:/app/reports \
--env-file .env \
mithilesh-lala/jailbreak-arena \
audit --url https://www.mychatbot.com --tasks all --output /app/reports
# List all tasks
docker run --env-file .env mithilesh-lala/jailbreak-arena tasks
CI/CD โ LLM Security Gate
Add to your pipeline โ block deployments if vulnerabilities are found:
# .github/workflows/llm-security.yml
name: LLM Security Audit
on: [push, pull_request]
jobs:
security-gate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Install JailbreakArena
run: pip install jailbreak-arena
- name: Run Security Audit
env:
GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
run: |
jailbreak-arena audit \
--url ${{ secrets.BOT_ENDPOINT }} \
--tasks task_001,task_005,task_007 \
--turns 3 \
--quiet
Use as a Gymnasium RL Environment
For researchers who want to train custom RL agents:
from stable_baselines3 import PPO
from jailbreak_arena.env import JailbreakArenaEnv
from jailbreak_arena.adapters import SystemPromptAdapter
adapter = SystemPromptAdapter(
system_prompt="You are a banking assistant..."
)
env = JailbreakArenaEnv(target=adapter, max_turns=5)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)
model.save("my_defender_v1")
Project Structure
jailbreak-arena/
โโโ jailbreak_arena/
โ โโโ env.py # Gymnasium RL environment
โ โโโ attacker.py # LLM-powered attacker agent
โ โโโ defender.py # Discrete-action defender agent
โ โโโ grader.py # Two-level grader
โ โโโ tasks.py # 20 attack task catalog
โ โโโ prompts.py # All prompt templates
โ โโโ utils.py # LLMClient โ 5 providers
โ โโโ cli.py # CLI entry point
โ โโโ adapters/
โ โโโ base.py # Abstract base adapter
โ โโโ system_prompt.py # Test any system prompt
โ โโโ http.py # Any REST API endpoint
โ โโโ bedrock.py # AWS Bedrock models
โ โโโ langchain.py # LangChain chains/agents
โโโ reporters/
โ โโโ html_report.py # HTML audit report generator
โโโ examples/
โ โโโ basic_run.py # Basic usage example
โ โโโ audit_my_bot.py # Adapter usage examples
โโโ tests/ # 29 unit tests, 0.10s, zero API calls
โโโ Dockerfile
โโโ openenv.yaml # Meta OpenEnv spec
โโโ DOCKER_HUB.md # Docker Hub documentation
โโโ HUGGINGFACE.md # HuggingFace model card
โโโ pyproject.toml # PyPI packaging
Run Tests
python -m pytest tests/ -v
# 29 passed in 0.10s โ zero API calls โ runs in CI instantly
Install Options
pip install jailbreak-arena # Groq only (free, recommended)
pip install jailbreak-arena[openai] # adds OpenAI
pip install jailbreak-arena[anthropic] # adds Anthropic
pip install jailbreak-arena[bedrock] # adds boto3 for AWS Bedrock
pip install jailbreak-arena[all] # everything
Built On
| Meta OpenEnv | Standardised RL environment framework |
| HuggingFace | Environment hub and model ecosystem |
| Gymnasium | Industry standard RL interface |
| Groq | Free, blazing-fast LLM inference (default) |
Roadmap
- v0.2.0 โ OWASP LLM Top 10 compliance mapping
- v0.2.0 โ Web UI dashboard
- v0.3.0 โ Pre-trained defender agent weights on HuggingFace
- v0.3.0 โ Multi-episode RL training pipeline
- v0.4.0 โ Custom task builder API
Author
Mithilesh Kumar Lala GitHub: @Mithilesh-Lala ย |ย HuggingFace: mkl-01 ย |ย Docker: mithilesh-lala
License
MIT โ free to use, modify, and distribute.
Contributing
Issues and PRs welcome. Found a new attack vector not in the 20 tasks? Open an issue โ we will add it.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file jailbreak_arena-0.1.1.tar.gz.
File metadata
- Download URL: jailbreak_arena-0.1.1.tar.gz
- Upload date:
- Size: 47.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
23bd0e4f479e4aca37d5d75ebe4df30d190001a4d2a767c5b096d5cc43ea5b4d
|
|
| MD5 |
f5f7dddffd46dc3259afdc92f7034d28
|
|
| BLAKE2b-256 |
68c344723008abf9eb0a03f96212a0f6a625b24354a8c0e3642d5ca19795943b
|
File details
Details for the file jailbreak_arena-0.1.1-py3-none-any.whl.
File metadata
- Download URL: jailbreak_arena-0.1.1-py3-none-any.whl
- Upload date:
- Size: 45.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7aef03fb85ce0c6a2326dffd21b224be51ecc4a86bd056d12af4943fabfa360c
|
|
| MD5 |
588f190ae45ea8240f1d7b83af74dce9
|
|
| BLAKE2b-256 |
53f22099ec4b252eea77354883a3ac22030d3b8c6cf3783961b73fdb4f1d8178
|