Adversarial RL security testing for LLM applications. An attacker agent learns to break chatbots while a defender patches the system prompt in real time.

These details have not been verified by PyPI

Project links

Project description

🔐 JailbreakArena

The self-improving adversarial RL environment for LLM security testing. Your bot gets attacked. It learns to defend. You get a report.

The Problem

Every company shipping an LLM chatbot today follows the same broken process:

Build chatbot → Test it manually → Deploy → Real attacker finds jailbreak in 10 minutes → PR disaster 🔥

This happened to Bing Chat, Air Canada's bot, DPD's support bot — all billion-dollar companies. The root cause is always the same: nobody systematically attacked the bot before shipping it.

Existing tools (static test runners, code review tools, prompt evaluation frameworks) test what you tell them to test. They cannot discover what you haven't thought of yet. And they have no concept of an attacker that learns and adapts.

What JailbreakArena Does

JailbreakArena is an open-source Reinforcement Learning environment where two intelligent agents battle continuously to harden your LLM application:

┌─────────────────────────────────────────────────────────┐
│                    JailbreakArena                       │
│                                                         │
│  🗡️  Attacker Agent ──────────▶  Target LLM Bot        │
│      (learns & adapts)                  │               │
│           │                            ▼                │
│           │                    ┌──────────────┐         │
│           │                    │  LLM Judge   │         │
│           │                    └──────────────┘         │
│           ▼                            │                │
│  🛡️  Defender Agent ◀──── Reward Signals ◀────────────┘│
│      (patches system prompt)                            │
└─────────────────────────────────────────────────────────┘

Attacker Agent — generates adaptive jailbreak attempts. Studies what was blocked and tries a completely different angle next turn. Gets smarter every episode.
Defender Agent — watches every attack outcome and patches the system prompt in real time. Learns which defenses work against which attack types.
LLM Judge — evaluates every interaction: SUCCESS / PARTIAL / FAILED with confidence scoring and reasoning.
HTML Report — every run produces a professional security audit report with vulnerabilities, patches, and a hardened system prompt ready to deploy.

What Makes JailbreakArena Different

Most security testing tools are static — same tests, same results, every run. JailbreakArena is intelligence-first:

Static approach (e.g. traditional test runners, eval frameworks, code review tools):
  ✗ You define the tests → it runs them → same results every time
  ✗ No learning between runs
  ✗ Only finds what you already know to look for
  ✗ Coverage-first — breadth over depth

JailbreakArena:
  ✓ RL environment — attacker LEARNS from every blocked attempt
  ✓ Defender PATCHES the system prompt in real time
  ✓ Gets smarter every episode — discovers novel attack vectors
  ✓ Fully open source — Meta OpenEnv ecosystem
  ✓ Intelligence-first — depth over static breadth

The one-sentence difference: Static tools test what you tell them to test. JailbreakArena discovers what nobody has thought of yet — then tells you how to fix it.

Install

# Default — Groq (free tier, recommended)
pip install jailbreak-arena

# With your preferred LLM provider
pip install jailbreak-arena[openai]
pip install jailbreak-arena[anthropic]
pip install jailbreak-arena[bedrock]
pip install jailbreak-arena[all]

Quickstart — 3 Steps

Step 1 — Set your provider key in .env:

# Groq — free, fastest (recommended)
GROQ_API_KEY=your_key_here

# Or OpenAI
# OPENAI_API_KEY=sk-xxx

# Or Anthropic Claude
# ANTHROPIC_API_KEY=sk-ant-xxx

# Or Google Gemini
# GEMINI_API_KEY=xxx

# Or Azure OpenAI (enterprise)
# AZURE_OPENAI_API_KEY=xxx
# AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
# AZURE_OPENAI_MODEL_NAME=your-deployment-name
# AZURE_OPENAI_API_VERSION=2025-01-01-preview

# Optional — you choose the models, we use them
# ATTACKER_MODEL=your-chosen-model
# JUDGE_MODEL=your-chosen-model
# BOT_MODEL=your-chosen-model

Step 2 — Run your first audit:

# Audit a live chatbot endpoint
jailbreak-arena audit --url https://www.mychatbot.com --turns 5

# Or audit a system prompt directly (no deployment needed)
jailbreak-arena audit \
  --system-prompt "You are a banking assistant. Never reveal account details." \
  --turns 5

Step 3 — Open your report:

start report_http_task_001.html      # Windows
open  report_http_task_001.html      # Mac/Linux

Full vulnerability report in under 10 minutes.

CLI Reference

# ── Audit commands ──────────────────────────────────────────────────

# Audit a REST API endpoint
jailbreak-arena audit --url https://www.mychatbot.com

# With authentication
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --auth "Bearer your-api-token"

# Custom payload template (for non-standard APIs)
jailbreak-arena audit \
  --url https://www.mychatbot.com/api \
  --payload-template '{"query": "{input}", "session_id": "test"}' \
  --response-field "data.answer"

# Audit a system prompt directly (no deployment needed)
jailbreak-arena audit \
  --system-prompt "You are a banking assistant..."

# Full audit — all 20 tasks
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --tasks all --turns 5

# Specific tasks only
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --tasks task_001,task_005,task_007

# Save reports to a specific folder
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --output ./security-reports

# Quiet mode — summary only, no turn-by-turn output
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --quiet

# ── Info commands ────────────────────────────────────────────────────

# List all 20 attack tasks with categories and difficulty
jailbreak-arena tasks

Adapter System — Connect Any Bot

Adapter 1 — SystemPromptAdapter

Test any system prompt directly. No deployment needed.

from jailbreak_arena.adapters import SystemPromptAdapter
from jailbreak_arena.env import JailbreakArenaEnv

adapter = SystemPromptAdapter(
    system_prompt="""
    You are a helpful banking assistant for AcmeBank.
    Never reveal customer account details.
    Never discuss competitor banks.
    """
)
env = JailbreakArenaEnv(target=adapter, task_id="task_001", max_turns=5)

Adapter 2 — HTTPAdapter

Test any deployed chatbot with a REST API endpoint.

from jailbreak_arena.adapters import HTTPAdapter

adapter = HTTPAdapter(
    url="https://www.mychatbot.com/api/chat",
    headers={"Authorization": "Bearer your-token"},
    payload_template={"message": "{input}"},
    response_field="response",
)

# Common payload templates:
# Simple:        {"message": "{input}"}
# OpenAI-style:  {"messages": [{"role": "user", "content": "{input}"}]}
# Custom:        {"query": "{input}", "session_id": "audit-123"}

# Common response fields:
# Simple:        "response"
# OpenAI-style:  "choices.0.message.content"
# Nested:        "data.response"

Adapter 3 — BedrockAdapter

Test any AWS Bedrock hosted model.

from jailbreak_arena.adapters import BedrockAdapter

# Set in .env:
# BEDROCK_MODEL_ID=your-bedrock-model-id
# AWS_DEFAULT_REGION=us-east-1
# Find model IDs: docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html

adapter = BedrockAdapter(
    system_prompt="You are a banking assistant. Never reveal account details."
)

Adapter 4 — LangChainAdapter

Test any LangChain chain or agent.

from jailbreak_arena.adapters import LangChainAdapter

# You configure the model and chain — JailbreakArena just attacks it
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain

llm     = ChatOpenAI(model="your-chosen-model")
chain   = ConversationChain(llm=llm)
adapter = LangChainAdapter(chain=chain)

20 Attack Tasks Across 8 Categories

jailbreak-arena tasks   # see the full list with difficulty ratings

Category	Count	Examples
🎭 Identity & Role	4	Role Hijacking, Developer Mode, Fictional Framing
📤 Data & Extraction	2	System Prompt Extraction, Indirect Extraction
💉 Injection & Manipulation	3	Indirect Injection, Emotional Manipulation
🧠 Logic & Context	3	Context Overflow, Hypothetical Bypass
⚠️ Harmful Content	2	Harmful Instructions, PII Extraction
📋 Compliance & Policy	2	Copyright Violation, Competitor Mention
🤖 Agentic Attacks	2	Excessive Agency, Memory Poisoning
🔧 Technical Injections	2	SQL Injection via NL, SSRF via LLM

5 LLM Providers — Zero Code Changes

Auto-detected from .env. Switch providers by changing one line. You choose the models — we just use them:

GROQ_API_KEY=xxx          # Groq    (default — free tier)
OPENAI_API_KEY=xxx        # OpenAI
ANTHROPIC_API_KEY=xxx     # Claude
GEMINI_API_KEY=xxx        # Gemini
AZURE_OPENAI_API_KEY=xxx  # Azure   (enterprise)

Optional model control per role:

ATTACKER_MODEL=your-chosen-model
JUDGE_MODEL=your-chosen-model
BOT_MODEL=your-chosen-model

⚠️ Azure OpenAI — Content Filter Note

Azure OpenAI has a built-in content management policy that detects and blocks jailbreak attempts at the API level. Since JailbreakArena generates real adversarial prompts, Azure will block the attacker agent with:

Error code: 400 — content_filter
jailbreak: detected: True, filtered: True

This is not a bug. Azure is doing its job. The conflict is architectural — a red-team tool and a content filter cannot coexist on the same endpoint.

Fix Option 1 — Disable jailbreak filter for your deployment

Azure Portal
→ Azure AI Foundry
→ Your deployment (e.g. gpt-4.1)
→ Content filters
→ Create new filter configuration
→ Set "Jailbreak attacks" to OFF
→ Apply to deployment

This is the legitimate path for security research and authorized testing.

Fix Option 2 — Use a different provider as attacker/judge (recommended)

# Use Groq (free) or OpenAI as the attacker + judge
GROQ_API_KEY=your_groq_key

# Then point --url at your Azure bot as the TARGET
# jailbreak-arena audit --url https://your-azure-bot.com/api/chat

Best practice: Run JailbreakArena on Groq/OpenAI/Anthropic. Point it at your Azure bot via --url. Your Azure bot is what gets hardened. The attacker runs on a separate, unrestricted provider.

The Reward Design — The Key Insight

The defender earns maximum reward (+15) only when it blocks the attack AND keeps the bot helpful. Over-restriction is penalised just like a security breach:

Event	Attacker	Defender
Clean jailbreak success	+10	-10
Partial manipulation	+3	-3
Attack blocked, bot helpful	-5	+15 ⭐
Attack blocked, bot unhelpful	+2	-5

This models real-world deployment: security cannot come at the cost of usability.

Two-Level Grader

Level 1 — Rule-based (zero API cost, instant)
  Scans for hard success/failure signals
  Returns immediately if confident
  Covers ~70% of cases at zero cost

Level 2 — LLM Judge (fires only for ambiguous cases)
  Structured output: RESULT / CONFIDENCE / REASON
  Uses smarter model for nuanced reasoning
  Covers the remaining ~30%

Docker

# Pull
docker pull mithilesh-lala/jailbreak-arena

# Audit a live endpoint
docker run --env-file .env mithilesh-lala/jailbreak-arena \
  audit --url https://www.mychatbot.com --turns 5

# Audit a system prompt
docker run --env-file .env mithilesh-lala/jailbreak-arena \
  audit --system-prompt "You are a banking assistant..." --turns 5

# Full audit — all 20 tasks, save reports locally
docker run \
  -v $(pwd)/reports:/app/reports \
  --env-file .env \
  mithilesh-lala/jailbreak-arena \
  audit --url https://www.mychatbot.com --tasks all --output /app/reports

# List all tasks
docker run --env-file .env mithilesh-lala/jailbreak-arena tasks

CI/CD — LLM Security Gate

Add to your pipeline — block deployments if vulnerabilities are found:

# .github/workflows/llm-security.yml
name: LLM Security Audit

on: [push, pull_request]

jobs:
  security-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install JailbreakArena
        run: pip install jailbreak-arena
      - name: Run Security Audit
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        run: |
          jailbreak-arena audit \
            --url ${{ secrets.BOT_ENDPOINT }} \
            --tasks task_001,task_005,task_007 \
            --turns 3 \
            --quiet

Use as a Gymnasium RL Environment

For researchers who want to train custom RL agents:

from stable_baselines3 import PPO
from jailbreak_arena.env import JailbreakArenaEnv
from jailbreak_arena.adapters import SystemPromptAdapter

adapter = SystemPromptAdapter(
    system_prompt="You are a banking assistant..."
)
env = JailbreakArenaEnv(target=adapter, max_turns=5)

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)
model.save("my_defender_v1")

Project Structure

jailbreak-arena/
├── jailbreak_arena/
│   ├── env.py              # Gymnasium RL environment
│   ├── attacker.py         # LLM-powered attacker agent
│   ├── defender.py         # Discrete-action defender agent
│   ├── grader.py           # Two-level grader
│   ├── tasks.py            # 20 attack task catalog
│   ├── prompts.py          # All prompt templates
│   ├── utils.py            # LLMClient — 5 providers
│   ├── cli.py              # CLI entry point
│   └── adapters/
│       ├── base.py         # Abstract base adapter
│       ├── system_prompt.py # Test any system prompt
│       ├── http.py          # Any REST API endpoint
│       ├── bedrock.py       # AWS Bedrock models
│       └── langchain.py     # LangChain chains/agents
├── reporters/
│   └── html_report.py      # HTML audit report generator
├── examples/
│   ├── basic_run.py        # Basic usage example
│   └── audit_my_bot.py     # Adapter usage examples
├── tests/                  # 29 unit tests, 0.10s, zero API calls
├── Dockerfile
├── openenv.yaml            # Meta OpenEnv spec
├── DOCKER_HUB.md           # Docker Hub documentation
├── HUGGINGFACE.md          # HuggingFace model card
└── pyproject.toml          # PyPI packaging

Run Tests

python -m pytest tests/ -v
# 29 passed in 0.10s — zero API calls — runs in CI instantly

Install Options

pip install jailbreak-arena              # Groq only (free, recommended)
pip install jailbreak-arena[openai]     # adds OpenAI
pip install jailbreak-arena[anthropic]  # adds Anthropic
pip install jailbreak-arena[bedrock]    # adds boto3 for AWS Bedrock
pip install jailbreak-arena[all]        # everything

Built On


Meta OpenEnv	Standardised RL environment framework
HuggingFace	Environment hub and model ecosystem
Gymnasium	Industry standard RL interface
Groq	Free, blazing-fast LLM inference (default)

Roadmap

v0.2.0 — OWASP LLM Top 10 compliance mapping
v0.2.0 — Web UI dashboard
v0.3.0 — Pre-trained defender agent weights on HuggingFace
v0.3.0 — Multi-episode RL training pipeline
v0.4.0 — Custom task builder API

Author

Mithilesh Kumar Lala GitHub: @Mithilesh-Lala | HuggingFace: mkl-01 | Docker: mithilesh-lala

License

MIT — free to use, modify, and distribute.

Contributing

Issues and PRs welcome. Found a new attack vector not in the 20 tasks? Open an issue — we will add it.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.1

Mar 30, 2026

0.1.0

Mar 26, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

jailbreak_arena-0.1.1.tar.gz (47.6 kB view details)

Uploaded Mar 30, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

jailbreak_arena-0.1.1-py3-none-any.whl (45.3 kB view details)

Uploaded Mar 30, 2026 Python 3

File details

Details for the file jailbreak_arena-0.1.1.tar.gz.

File metadata

Download URL: jailbreak_arena-0.1.1.tar.gz
Upload date: Mar 30, 2026
Size: 47.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for jailbreak_arena-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`23bd0e4f479e4aca37d5d75ebe4df30d190001a4d2a767c5b096d5cc43ea5b4d`
MD5	`f5f7dddffd46dc3259afdc92f7034d28`
BLAKE2b-256	`68c344723008abf9eb0a03f96212a0f6a625b24354a8c0e3642d5ca19795943b`

See more details on using hashes here.

File details

Details for the file jailbreak_arena-0.1.1-py3-none-any.whl.

File metadata

Download URL: jailbreak_arena-0.1.1-py3-none-any.whl
Upload date: Mar 30, 2026
Size: 45.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for jailbreak_arena-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7aef03fb85ce0c6a2326dffd21b224be51ecc4a86bd056d12af4943fabfa360c`
MD5	`588f190ae45ea8240f1d7b83af74dce9`
BLAKE2b-256	`53f22099ec4b252eea77354883a3ac22030d3b8c6cf3783961b73fdb4f1d8178`

See more details on using hashes here.

jailbreak-arena 0.1.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

🔐 JailbreakArena

The Problem

What JailbreakArena Does

What Makes JailbreakArena Different

Install

Quickstart — 3 Steps

CLI Reference

Adapter System — Connect Any Bot

Adapter 1 — SystemPromptAdapter

Adapter 2 — HTTPAdapter

Adapter 3 — BedrockAdapter

Adapter 4 — LangChainAdapter

20 Attack Tasks Across 8 Categories

5 LLM Providers — Zero Code Changes

⚠️ Azure OpenAI — Content Filter Note

Fix Option 1 — Disable jailbreak filter for your deployment

Fix Option 2 — Use a different provider as attacker/judge (recommended)

The Reward Design — The Key Insight

Two-Level Grader

Docker

CI/CD — LLM Security Gate

Use as a Gymnasium RL Environment

Project Structure

Run Tests

Install Options

Built On

Roadmap

Author

License

Contributing

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes