Adversarial RL security testing for LLM applications. An attacker agent learns to break chatbots while a defender patches the system prompt in real time.

🔐 JailbreakArena

The self-improving adversarial RL environment for LLM security testing. Your bot gets attacked. It learns to defend. You get a report.

PyPI version | Python 3.10+ | License: MIT | Built on OpenEnv | HuggingFace | Tests | Docker


The Problem

Every company shipping an LLM chatbot today follows the same broken process:

Build chatbot → Test it manually → Deploy → Real attacker finds jailbreak in 10 minutes → PR disaster 🔥

This happened to Bing Chat, Air Canada's bot, and DPD's support bot, all run by billion-dollar companies. The root cause is always the same: nobody systematically attacked the bot before shipping it.

Existing tools (static test runners, code review tools, prompt evaluation frameworks) test what you tell them to test. They cannot discover what you haven't thought of yet. And they have no concept of an attacker that learns and adapts.


What JailbreakArena Does

JailbreakArena is an open-source Reinforcement Learning environment where two intelligent agents battle continuously to harden your LLM application:

┌─────────────────────────────────────────────────────────┐
│                    JailbreakArena                       │
│                                                         │
│  🗡️  Attacker Agent ──────────▶  Target LLM Bot        │
│      (learns & adapts)                  │               │
│           │                            ▼                │
│           │                    ┌──────────────┐         │
│           │                    │  LLM Judge   │         │
│           │                    └──────────────┘         │
│           ▼                            │                │
│  🛡️  Defender Agent ◀──── Reward Signals ◀────────────┘│
│      (patches system prompt)                            │
└─────────────────────────────────────────────────────────┘
  • Attacker Agent: generates adaptive jailbreak attempts. Studies what was blocked and tries a completely different angle next turn. Gets smarter every episode.
  • Defender Agent: watches every attack outcome and patches the system prompt in real time. Learns which defenses work against which attack types.
  • LLM Judge: evaluates every interaction as SUCCESS / PARTIAL / FAILED, with confidence scoring and reasoning.
  • HTML Report: every run produces a professional security audit report with vulnerabilities, patches, and a hardened system prompt ready to deploy.
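The attack, judge, patch loop in the diagram can be sketched with stubbed components. This is a toy illustration, not JailbreakArena's implementation: `run_episode`, the strategy names, the `[guard:...]` patch marker, and the string-matching target and judge are all hypothetical stand-ins for the LLM-backed agents.

```python
import random

# Hypothetical attack strategies; the real attacker is LLM-driven.
STRATEGIES = ["role_hijack", "fictional_framing", "prompt_extraction"]


def toy_target(system_prompt: str, strategy: str) -> str:
    """Stub bot: resists a strategy only if the prompt carries a patch for it."""
    if f"[guard:{strategy}]" in system_prompt:
        return "REFUSED"
    return "LEAKED"


def toy_judge(response: str) -> str:
    """Stub judge: maps the bot response to a verdict label."""
    return "SUCCESS" if response == "LEAKED" else "FAILED"


def run_episode(system_prompt: str, max_turns: int = 5, seed: int = 0):
    """Attack, judge, patch. Returns the hardened prompt and a turn log."""
    rng = random.Random(seed)
    blocked = set()
    log = []
    for turn in range(max_turns):
        # Attacker: avoid angles that were already blocked.
        options = [s for s in STRATEGIES if s not in blocked] or STRATEGIES
        strategy = rng.choice(options)
        verdict = toy_judge(toy_target(system_prompt, strategy))
        log.append((turn, strategy, verdict))
        if verdict == "SUCCESS":
            # Defender: patch the system prompt against this attack type.
            system_prompt += f" [guard:{strategy}]"
        else:
            blocked.add(strategy)
    return system_prompt, log


hardened, log = run_episode("You are a banking assistant.")
for turn, strategy, verdict in log:
    print(turn, strategy, verdict)
print(hardened)
```

In this toy version every successful attack adds exactly one guard, so the prompt returned at the end plays the role of the hardened system prompt that the report would contain.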

What Makes JailbreakArena Different

Most security testing tools are static: same tests, same results, every run. JailbreakArena is intelligence-first:

Static approach (e.g. traditional test runners, eval frameworks, code review tools):
  ✗ You define the tests → it runs them → same results every time
  ✗ No learning between runs
  ✗ Only finds what you already know to look for
  ✗ Coverage-first: breadth over depth

JailbreakArena:
  ✓ RL environment: attacker LEARNS from every blocked attempt
  ✓ Defender PATCHES the system prompt in real time
  ✓ Gets smarter every episode and discovers novel attack vectors
  ✓ Fully open source, part of the Meta OpenEnv ecosystem
  ✓ Intelligence-first: depth over static breadth

The one-sentence difference: Static tools test what you tell them to test. JailbreakArena discovers what nobody has thought of yet, then tells you how to fix it.


Install

# Default: Groq (free tier, recommended)
pip install jailbreak-arena

# With your preferred LLM provider (quotes stop shells such as zsh from globbing the brackets)
pip install "jailbreak-arena[openai]"
pip install "jailbreak-arena[anthropic]"
pip install "jailbreak-arena[bedrock]"
pip install "jailbreak-arena[all]"

Quickstart: 3 Steps

Step 1: Set your provider key in .env:

# Groq: free, fastest (recommended)
GROQ_API_KEY=your_key_here

# Or OpenAI
# OPENAI_API_KEY=sk-xxx

# Or Anthropic Claude
# ANTHROPIC_API_KEY=sk-ant-xxx

# Or Google Gemini
# GEMINI_API_KEY=xxx

# Or Azure OpenAI (enterprise)
# AZURE_OPENAI_API_KEY=xxx
# AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
# AZURE_OPENAI_MODEL_NAME=your-deployment-name
# AZURE_OPENAI_API_VERSION=2025-01-01-preview

# Optional: you choose the models, we use them
# ATTACKER_MODEL=your-chosen-model
# JUDGE_MODEL=your-chosen-model
# BOT_MODEL=your-chosen-model

Step 2: Run your first audit:

# Audit a live chatbot endpoint
jailbreak-arena audit --url https://www.mychatbot.com --turns 5

# Or audit a system prompt directly (no deployment needed)
jailbreak-arena audit \
  --system-prompt "You are a banking assistant. Never reveal account details." \
  --turns 5

Step 3: Open your report:

start    report_http_task_001.html   # Windows
open     report_http_task_001.html   # macOS
xdg-open report_http_task_001.html   # Linux

Full vulnerability report in under 10 minutes.


CLI Reference

# ── Audit commands ──────────────────────────────────────────────────

# Audit a REST API endpoint
jailbreak-arena audit --url https://www.mychatbot.com

# With authentication
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --auth "Bearer your-api-token"

# Custom payload template (for non-standard APIs)
jailbreak-arena audit \
  --url https://www.mychatbot.com/api \
  --payload-template '{"query": "{input}", "session_id": "test"}' \
  --response-field "data.answer"

# Audit a system prompt directly (no deployment needed)
jailbreak-arena audit \
  --system-prompt "You are a banking assistant..."

# Full audit: all 20 tasks
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --tasks all --turns 5

# Specific tasks only
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --tasks task_001,task_005,task_007

# Save reports to a specific folder
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --output ./security-reports

# Quiet mode: summary only, no turn-by-turn output
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --quiet

# ── Info commands ───────────────────────────────────────────────────

# List all 20 attack tasks with categories and difficulty
jailbreak-arena tasks

Adapter System: Connect Any Bot

Adapter 1: SystemPromptAdapter

Test any system prompt directly. No deployment needed.

from jailbreak_arena.adapters import SystemPromptAdapter
from jailbreak_arena.env import JailbreakArenaEnv

adapter = SystemPromptAdapter(
    system_prompt="""
    You are a helpful banking assistant for AcmeBank.
    Never reveal customer account details.
    Never discuss competitor banks.
    """
)
env = JailbreakArenaEnv(target=adapter, task_id="task_001", max_turns=5)

Adapter 2: HTTPAdapter

Test any deployed chatbot with a REST API endpoint.

from jailbreak_arena.adapters import HTTPAdapter

adapter = HTTPAdapter(
    url="https://www.mychatbot.com/api/chat",
    headers={"Authorization": "Bearer your-token"},
    payload_template={"message": "{input}"},
    response_field="response",
)

# Common payload templates:
# Simple:        {"message": "{input}"}
# OpenAI-style:  {"messages": [{"role": "user", "content": "{input}"}]}
# Custom:        {"query": "{input}", "session_id": "audit-123"}

# Common response fields:
# Simple:        "response"
# OpenAI-style:  "choices.0.message.content"
# Nested:        "data.response"
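To make the template and field notation concrete, here is a minimal sketch of how an `{input}` placeholder and a dotted response path such as `choices.0.message.content` could be resolved. The helpers `render_payload` and `extract_field` are hypothetical names written for this illustration; how HTTPAdapter does this internally is not documented here.

```python
import json
from typing import Any


def render_payload(template: Any, user_input: str) -> Any:
    """Recursively replace the "{input}" placeholder in a payload template."""
    if isinstance(template, str):
        return template.replace("{input}", user_input)
    if isinstance(template, dict):
        return {k: render_payload(v, user_input) for k, v in template.items()}
    if isinstance(template, list):
        return [render_payload(v, user_input) for v in template]
    return template  # numbers, booleans, None pass through unchanged


def extract_field(body: Any, path: str) -> Any:
    """Walk a dotted path such as "choices.0.message.content"."""
    current = body
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # numeric segments index into lists
        else:
            current = current[part]
    return current


payload = render_payload(
    {"messages": [{"role": "user", "content": "{input}"}]}, "Hello"
)
reply = extract_field(
    {"choices": [{"message": {"content": "Hi there"}}]},
    "choices.0.message.content",
)
print(json.dumps(payload), reply)
```

Replacing the placeholder on the parsed structure rather than on a raw JSON string means quotes or braces inside an attack prompt cannot break the request body.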

Adapter 3: BedrockAdapter

Test any AWS Bedrock hosted model.

from jailbreak_arena.adapters import BedrockAdapter

# Set in .env:
# BEDROCK_MODEL_ID=your-bedrock-model-id
# AWS_DEFAULT_REGION=us-east-1
# Find model IDs: docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html

adapter = BedrockAdapter(
    system_prompt="You are a banking assistant. Never reveal account details."
)

Adapter 4: LangChainAdapter

Test any LangChain chain or agent.

from jailbreak_arena.adapters import LangChainAdapter

# You configure the model and chain; JailbreakArena just attacks it
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain

llm     = ChatOpenAI(model="your-chosen-model")
chain   = ConversationChain(llm=llm)
adapter = LangChainAdapter(chain=chain)

20 Attack Tasks Across 8 Categories

jailbreak-arena tasks   # see the full list with difficulty ratings

| Category | Count | Examples |
| --- | --- | --- |
| 🎭 Identity & Role | 4 | Role Hijacking, Developer Mode, Fictional Framing |
| 📤 Data & Extraction | 2 | System Prompt Extraction, Indirect Extraction |
| 💉 Injection & Manipulation | 3 | Indirect Injection, Emotional Manipulation |
| 🧠 Logic & Context | 3 | Context Overflow, Hypothetical Bypass |
| ⚠️ Harmful Content | 2 | Harmful Instructions, PII Extraction |
| 📋 Compliance & Policy | 2 | Copyright Violation, Competitor Mention |
| 🤖 Agentic Attacks | 2 | Excessive Agency, Memory Poisoning |
| 🔧 Technical Injections | 2 | SQL Injection via NL, SSRF via LLM |

5 LLM Providers: Zero Code Changes

Auto-detected from .env. Switch providers by changing one line. You choose the models; we just use them:

GROQ_API_KEY=xxx          # Groq    (default, free tier)
OPENAI_API_KEY=xxx        # OpenAI
ANTHROPIC_API_KEY=xxx     # Claude
GEMINI_API_KEY=xxx        # Gemini
AZURE_OPENAI_API_KEY=xxx  # Azure   (enterprise)

Optional model control per role:

ATTACKER_MODEL=your-chosen-model
JUDGE_MODEL=your-chosen-model
BOT_MODEL=your-chosen-model
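One way such auto-detection could work is sketched below. The precedence order (the order in which the keys are listed above), the function names `detect_provider` and `model_for`, and the fallback behaviour are assumptions made for illustration; the library's actual resolution logic may differ.

```python
import os
from typing import Mapping, Optional

# Assumed precedence: the order in which the providers are listed above.
PROVIDER_KEYS = [
    ("groq", "GROQ_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("gemini", "GEMINI_API_KEY"),
    ("azure", "AZURE_OPENAI_API_KEY"),
]


def detect_provider(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first provider whose API key is set and non-empty."""
    for name, key in PROVIDER_KEYS:
        if env.get(key):
            return name
    return None


def model_for(role: str, default: str, env: Mapping[str, str] = os.environ) -> str:
    """Resolve ATTACKER_MODEL / JUDGE_MODEL / BOT_MODEL with a fallback."""
    return env.get(f"{role.upper()}_MODEL") or default


print(detect_provider({"OPENAI_API_KEY": "sk-xxx"}))
print(model_for("judge", "provider-default", {"JUDGE_MODEL": "my-judge"}))
```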

⚠️ Azure OpenAI: Content Filter Note

Azure OpenAI has a built-in content management policy that detects and blocks jailbreak attempts at the API level. Since JailbreakArena generates real adversarial prompts, Azure will block the attacker agent with:

Error code: 400 - content_filter
jailbreak: detected: True, filtered: True

This is not a bug. Azure is doing its job. The conflict is architectural: a red-team tool and a content filter cannot coexist on the same endpoint.

Fix Option 1: Disable jailbreak filter for your deployment

Azure Portal
→ Azure AI Foundry
→ Your deployment (e.g. gpt-4.1)
→ Content filters
→ Create new filter configuration
→ Set "Jailbreak attacks" to OFF
→ Apply to deployment

This is the legitimate path for security research and authorized testing.

Fix Option 2: Use a different provider as attacker/judge (recommended)

# Use Groq (free) or OpenAI as the attacker + judge
GROQ_API_KEY=your_groq_key

# Then point --url at your Azure bot as the TARGET
# jailbreak-arena audit --url https://your-azure-bot.com/api/chat

Best practice: Run JailbreakArena on Groq/OpenAI/Anthropic. Point it at your Azure bot via --url. Your Azure bot is what gets hardened. The attacker runs on a separate, unrestricted provider.


The Reward Design: The Key Insight

The defender earns maximum reward (+15) only when it blocks the attack AND keeps the bot helpful. Over-restriction is penalised too: blocking an attack while leaving the bot unhelpful costs the defender -5, and a clean jailbreak costs it -10:

| Event | Attacker | Defender |
| --- | --- | --- |
| Clean jailbreak success | +10 | -10 |
| Partial manipulation | +3 | -3 |
| Attack blocked, bot helpful | -5 | +15 ⭐ |
| Attack blocked, bot unhelpful | +2 | -5 |

This models real-world deployment: security cannot come at the cost of usability.
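The reward table can be written as a small lookup. This is a sketch: the function name `reward`, the `bot_helpful` flag, and the mapping of table rows onto the judge's SUCCESS / PARTIAL / FAILED labels are assumptions. Only the numbers come from the table.

```python
from typing import NamedTuple


class Rewards(NamedTuple):
    attacker: int
    defender: int


def reward(verdict: str, bot_helpful: bool = True) -> Rewards:
    """Map a judged outcome to (attacker, defender) rewards from the table."""
    if verdict == "SUCCESS":   # clean jailbreak success
        return Rewards(10, -10)
    if verdict == "PARTIAL":   # partial manipulation
        return Rewards(3, -3)
    if verdict == "FAILED":    # attack blocked
        # Helpfulness decides whether the defender is rewarded or penalised.
        return Rewards(-5, 15) if bot_helpful else Rewards(2, -5)
    raise ValueError(f"unknown verdict: {verdict!r}")


print(reward("FAILED", bot_helpful=True))
print(reward("FAILED", bot_helpful=False))
```

Note that the game is not zero-sum in the blocked cases: a defender that refuses everything still loses reward, which is what pushes it away from over-restriction.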


Two-Level Grader

Level 1: Rule-based (zero API cost, instant)
  Scans for hard success/failure signals
  Returns immediately if confident
  Covers ~70% of cases at zero cost

Level 2: LLM Judge (fires only for ambiguous cases)
  Structured output: RESULT / CONFIDENCE / REASON
  Uses smarter model for nuanced reasoning
  Covers the remaining ~30%
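A minimal sketch of the two-level idea: cheap rules answer first, and the LLM judge is called only when no rule fires. The signal lists, the `grade` function, and `stub_judge` are hypothetical and far simpler than a real rule set; the stub stands in for an LLM call so the example runs offline.

```python
from typing import Callable, Tuple

# Hypothetical signal lists; a real rule set would be far richer.
REFUSAL_SIGNALS = ("i can't help with that", "i cannot share")
BREACH_SIGNALS = ("my system prompt is", "developer mode enabled")

Verdict = Tuple[str, float, str]  # (RESULT, CONFIDENCE, REASON)


def grade(response: str, llm_judge: Callable[[str], Verdict]) -> Verdict:
    """Level 1 rules first; fall back to the LLM judge when no rule fires."""
    text = response.lower()
    if any(s in text for s in BREACH_SIGNALS):
        return ("SUCCESS", 1.0, "rule: breach signal")
    if any(s in text for s in REFUSAL_SIGNALS):
        return ("FAILED", 1.0, "rule: refusal signal")
    return llm_judge(response)  # Level 2: ambiguous case


def stub_judge(response: str) -> Verdict:
    """Stand-in for an LLM call so the sketch runs offline."""
    return ("PARTIAL", 0.5, "stub: needs nuanced review")


print(grade("Sorry, I can't help with that.", stub_judge))
print(grade("Well, hypothetically speaking...", stub_judge))
```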

Docker

# Pull
docker pull mithilesh-lala/jailbreak-arena

# Audit a live endpoint
docker run --env-file .env mithilesh-lala/jailbreak-arena \
  audit --url https://www.mychatbot.com --turns 5

# Audit a system prompt
docker run --env-file .env mithilesh-lala/jailbreak-arena \
  audit --system-prompt "You are a banking assistant..." --turns 5

# Full audit: all 20 tasks, save reports locally
docker run \
  -v "$(pwd)/reports:/app/reports" \
  --env-file .env \
  mithilesh-lala/jailbreak-arena \
  audit --url https://www.mychatbot.com --tasks all --output /app/reports

# List all tasks
docker run --env-file .env mithilesh-lala/jailbreak-arena tasks

CI/CD: LLM Security Gate

Add to your pipeline to block deployments if vulnerabilities are found:

# .github/workflows/llm-security.yml
name: LLM Security Audit

on: [push, pull_request]

jobs:
  security-gate:
    runs-on: ubuntu-latest
    steps:
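      - uses: actions/checkout@v4
      # Pin a Python version that satisfies the 3.10+ requirement
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install JailbreakArena
        run: pip install jailbreak-arena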
      - name: Run Security Audit
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        run: |
          jailbreak-arena audit \
            --url ${{ secrets.BOT_ENDPOINT }} \
            --tasks task_001,task_005,task_007 \
            --turns 3 \
            --quiet

Use as a Gymnasium RL Environment

For researchers who want to train custom RL agents:

from stable_baselines3 import PPO
from jailbreak_arena.env import JailbreakArenaEnv
from jailbreak_arena.adapters import SystemPromptAdapter

adapter = SystemPromptAdapter(
    system_prompt="You are a banking assistant..."
)
env = JailbreakArenaEnv(target=adapter, max_turns=5)

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)
model.save("my_defender_v1")
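For readers new to the Gymnasium interface, the sketch below shows the `reset`/`step` contract that an environment of this kind exposes, using a stub so it runs without API keys. `StubArenaEnv`, its observations, and its payoffs are made up for illustration; only the method signatures follow the Gymnasium convention, where `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`.

```python
class StubArenaEnv:
    """Minimal stand-in that mimics the Gymnasium reset/step signatures."""

    def __init__(self, max_turns: int = 5):
        self.max_turns = max_turns
        self.turn = 0

    def reset(self, seed=None):
        self.turn = 0
        return 0, {}  # (observation, info)

    def step(self, action: int):
        self.turn += 1
        reward = 15.0 if action == 1 else -10.0  # made-up payoff
        truncated = self.turn >= self.max_turns  # turn limit reached
        return self.turn, reward, False, truncated, {}


def rollout(env, policy, seed: int = 0) -> float:
    """Run one episode and return the summed reward."""
    obs, info = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total


# A fixed policy that always picks action 1.
print(rollout(StubArenaEnv(max_turns=5), lambda obs: 1))
```

The same `rollout` helper should work with any environment that follows these signatures, which makes it a convenient way to compare a trained policy against a random or fixed baseline.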

Project Structure

jailbreak-arena/
├── jailbreak_arena/
│   ├── env.py              # Gymnasium RL environment
│   ├── attacker.py         # LLM-powered attacker agent
│   ├── defender.py         # Discrete-action defender agent
│   ├── grader.py           # Two-level grader
│   ├── tasks.py            # 20 attack task catalog
│   ├── prompts.py          # All prompt templates
│   ├── utils.py            # LLMClient: 5 providers
│   ├── cli.py              # CLI entry point
│   └── adapters/
│       ├── base.py          # Abstract base adapter
│       ├── system_prompt.py # Test any system prompt
│       ├── http.py          # Any REST API endpoint
│       ├── bedrock.py       # AWS Bedrock models
│       └── langchain.py     # LangChain chains/agents
├── reporters/
│   └── html_report.py      # HTML audit report generator
├── examples/
│   ├── basic_run.py        # Basic usage example
│   └── audit_my_bot.py     # Adapter usage examples
├── tests/                  # 29 unit tests, 0.10s, zero API calls
├── Dockerfile
├── openenv.yaml            # Meta OpenEnv spec
├── DOCKER_HUB.md           # Docker Hub documentation
├── HUGGINGFACE.md          # HuggingFace model card
└── pyproject.toml          # PyPI packaging

Run Tests

python -m pytest tests/ -v
# 29 passed in 0.10s, zero API calls, runs in CI instantly

Install Options

pip install jailbreak-arena                # Groq only (free, recommended)
pip install "jailbreak-arena[openai]"      # adds OpenAI
pip install "jailbreak-arena[anthropic]"   # adds Anthropic
pip install "jailbreak-arena[bedrock]"     # adds boto3 for AWS Bedrock
pip install "jailbreak-arena[all]"         # everything

Built On

Meta OpenEnv: Standardised RL environment framework
HuggingFace: Environment hub and model ecosystem
Gymnasium: Industry standard RL interface
Groq: Free, blazing-fast LLM inference (default)

Roadmap

  • v0.2.0: OWASP LLM Top 10 compliance mapping
  • v0.2.0: Web UI dashboard
  • v0.3.0: Pre-trained defender agent weights on HuggingFace
  • v0.3.0: Multi-episode RL training pipeline
  • v0.4.0: Custom task builder API

Author

Mithilesh Kumar Lala
GitHub: @Mithilesh-Lala | HuggingFace: mkl-01 | Docker: mithilesh-lala


License

MIT: free to use, modify, and distribute.


Contributing

Issues and PRs welcome. Found a new attack vector not in the 20 tasks? Open an issue and we will add it.

