RL-based adversarial red-team agent for LLM systems
Project description
phantom
RL-based adversarial red-team agent for LLM systems.
Phantom uses reinforcement learning to discover novel attack strategies against any LLM application. It doesn't just test known attacks -- it learns new ones by interacting with the target system, then maps every finding to the MITRE ATLAS framework.
At a Glance
- Autonomous red-team loop for LLM applications and agent systems
- Multi-turn probing and adaptive prompt mutation
- PPO-style policy training against target-specific defenses
- MITRE ATLAS mapping for findings and coverage reporting
- JSON, HTML, and SARIF outputs for CI and security review workflows
The Problem
LLM security testing is stuck in 2024. Current tools fire static payload lists at your application and call it a day. Real attackers don't work that way -- they probe, observe, adapt, and evolve novel strategies on the fly.
The Solution
Phantom is an autonomous red-team agent with a policy network that learns which attack strategies work against your specific target. It runs multi-turn social engineering chains, mutates prompts to evade filters, and produces compliance-ready reports mapped to MITRE ATLAS.
Showcase
HTML finding report generated directly from the local demo pipeline.
Quick Start
Install
pip install phantom-redteam
Python API
import asyncio
from phantom import RedTeam, Target, ATLASReport
async def main():
target = Target(
endpoint="https://api.example.com/chat",
auth={"Authorization": "Bearer ..."},
)
red_team = RedTeam(
target=target,
attack_model="gpt-4",
categories=["prompt_injection", "goal_hijacking", "data_exfiltration"],
max_interactions=500,
multi_turn=True,
)
results = await red_team.run()
report = ATLASReport(results)
report.to_html("security_assessment.html")
report.to_sarif("results.sarif")
report.to_json("results.json")
print(f"Vulnerabilities found: {len(results.findings)}")
print(f"Critical: {results.count_by_severity('CRITICAL')}")
print(f"Novel attacks discovered: {results.novel_attack_count}")
asyncio.run(main())
This is designed for API-backed targets, but the same reporting pipeline works for local wrappers and CI-driven security checks.
CLI
# Basic scan
phantom scan --target https://api.example.com/chat --output json
# Full assessment with all report formats
phantom scan \
--target https://api.example.com/chat \
--auth "Bearer sk-..." \
--categories prompt_injection,goal_hijacking \
--max-interactions 500 \
--output all
# Generate reports from previous scan results
phantom report --input phantom-results.json --output html
Architecture
graph TD
A[RL Policy Network] -->|selects action| B[Attack Generator]
B -->|generates prompt| C[Mutation Engine]
C -->|mutated prompt| D[Probe Orchestrator]
D -->|sends probe| E[Target LLM]
E -->|response| F[Reward Classifier]
F -->|reward signal| A
F -->|outcome| G[ATLAS Mapper]
G -->|findings| H[Report Generator]
H --> I[JSON / HTML / SARIF]
style A fill:#2d1b69,stroke:#7c4dff,color:#fff
style B fill:#1a237e,stroke:#448aff,color:#fff
style C fill:#1a237e,stroke:#448aff,color:#fff
style D fill:#004d40,stroke:#1de9b6,color:#fff
style E fill:#b71c1c,stroke:#ff5252,color:#fff
style F fill:#004d40,stroke:#1de9b6,color:#fff
style G fill:#e65100,stroke:#ff9100,color:#fff
style H fill:#e65100,stroke:#ff9100,color:#fff
style I fill:#1b5e20,stroke:#69f0ae,color:#fff
Core Loop
-
The RL policy network observes the current state (response patterns, filter signatures, conversation context) and selects an action: which mutation operator to apply, which strategy to use, and how aggressively to escalate.
-
The attack generator produces a prompt using seed libraries, LLM-based enhancement, and the selected mutation operator (synonym replacement, base64 encoding, role-play framing, instruction nesting, and more).
-
The probe orchestrator sends the prompt to the target and collects the response.
-
The reward classifier analyzes the response using pattern matching to determine the outcome: full bypass (+1.0), partial bypass (+0.5), information leak (+0.1), or clean refusal (0.0).
-
The reward signal feeds back into the policy network via PPO training, so the agent learns which approaches work against this specific target.
-
Successful probes are mapped to MITRE ATLAS techniques and compiled into structured reports.
Attack Categories
| Category | ATLAS Technique | Description |
|---|---|---|
| Prompt Injection | AML.T0051 | Direct, indirect, and multi-turn prompt injections |
| Goal Hijacking | AML.T0054 | Redirect agent behavior, extract system prompts, manipulate tool calls |
| Data Exfiltration | AML.T0024 | Extract training data, PII, credentials, or system configuration |
| Denial of Service | AML.T0029 | Trigger infinite loops, exhaust token budgets, cause harmful output |
Mutation Operators
Phantom includes eight mutation operators that transform attack prompts to evade detection:
- Synonym Replacement -- Swaps keywords with synonyms to bypass keyword filters
- Base64 Encoding -- Encodes payloads in base64 with decode instructions
- Role-Play Framing -- Wraps attacks in fictional/educational scenarios
- Language Switching -- Adds cross-language instructions to confuse filters
- Token Splitting -- Splits sensitive words across token boundaries
- Instruction Nesting -- Buries payloads inside layers of meta-instructions
- Context Overflow -- Pads prompts to push system instructions out of context
- Encoding Chain -- Applies multiple encoding layers for deep obfuscation
Report Formats
- JSON -- Machine-readable results for CI/CD pipeline integration
- HTML -- Styled report with severity badges and remediation guidance for stakeholders
- SARIF -- GitHub Security tab compatible format for code scanning integration
CI/CD Integration
# .github/workflows/security.yml
name: LLM Security Scan
on: [push]
jobs:
phantom-scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install phantom-redteam
- run: phantom scan --target ${{ secrets.API_ENDPOINT }} --output sarif
- uses: github/codeql-action/upload-sarif@v3
with:
sarif_file: phantom-results.sarif
API Reference
Target
target = Target(
endpoint="https://api.example.com/chat", # Required: target URL
auth={"Authorization": "Bearer ..."}, # Optional: auth headers
system_prompt_known=False, # Does attacker know the system prompt?
system_prompt=None, # Known system prompt text
response_path="choices.0.message.content", # JSON path to response text
timeout_seconds=30.0, # HTTP timeout
max_retries=3, # Retry count on failure
)
RedTeam
red_team = RedTeam(
target=target,
attack_model="gpt-4", # Model for generating attacks
categories=["prompt_injection"], # Attack categories to test
max_interactions=500, # Max probes to send
multi_turn=True, # Enable multi-turn strategies
max_turns_per_conversation=10, # Max turns per conversation
learning_rate=3e-4, # RL policy learning rate
)
results = await red_team.run()
ATLASReport
report = ATLASReport(results)
report.to_json("results.json")
report.to_html("report.html")
report.to_sarif("results.sarif")
# Query findings
critical = report.count_by_severity("CRITICAL")
all_findings = report.findings
RedTeamResults
results.total_probes # Total probes sent
results.total_bypasses # Number of successful bypasses
results.bypass_rate # Bypass rate (0.0 to 1.0)
results.findings # List of ATLAS-mapped findings
results.novel_attack_count # Novel attacks discovered by RL
results.count_by_severity("CRITICAL") # Count by severity
results.atlas_coverage_pct # Percentage of ATLAS techniques represented in the run
Typical Workflow
- define the target interface and auth headers
- choose attack categories and interaction budget
- run a scan or training-backed assessment
- export JSON for tooling and HTML/SARIF for reviewers
- map the findings back into remediation work
Project Structure
phantom/
src/phantom/
__init__.py # Public API exports
redteam.py # Main orchestrator
target.py # Target interface
models.py # Pydantic data models
exceptions.py # Custom exceptions
cli.py # CLI interface (Typer)
attacks/
generator.py # LLM-based attack generation
mutations.py # Mutation operators
strategies.py # Attack strategies (direct, indirect, multi-turn)
learner/
policy.py # RL policy network (PyTorch)
reward.py # Response classification
trainer.py # PPO training loop
atlas/
mapper.py # ATLAS technique mapping
taxonomy.py # Technique definitions
report.py # Report generation (JSON, HTML, SARIF)
tests/ # pytest test suite
examples/ # Usage examples
atlas_data/
techniques.json # MITRE ATLAS technique database
Demo
Run the offline walkthrough with:
uv run python examples/demo.py
For live agent and chatbot scans, see the larger examples in examples/.
Development
# Clone and install in development mode
git clone https://github.com/sushaan-k/phantom.git
cd phantom
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run linter
ruff check src/ tests/
# Run type checker
mypy src/phantom/
Research References
- MITRE ATLAS Framework
- OWASP Top 10 for LLM Applications (2025)
- PISmith: Automated Prompt Injection via Reinforcement Learning (arXiv:2603.13026)
- Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection (2023)
Contributing
Contributions are welcome. Please open an issue to discuss proposed changes before submitting a PR.
- Fork the repository
- Create a feature branch (
git checkout -b feature/your-feature) - Write tests for your changes
- Ensure all tests pass (
pytest tests/ -v) - Ensure code passes linting (
ruff check src/ tests/) - Submit a pull request
License
MIT License. See LICENSE for details.
Built by Sushaan Kandukoori.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file infiltr-0.1.0.tar.gz.
File metadata
- Download URL: infiltr-0.1.0.tar.gz
- Upload date:
- Size: 253.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7486712add43e32da429a9be4a6fa84c763972c7d5bba981f8957a6da636100d
|
|
| MD5 |
35d8545297fd32516044da5e51e63e5a
|
|
| BLAKE2b-256 |
86354818fd0aa8b39686b69dedbb34d04bc381ac321b83cc519ef2f1845a34f5
|
File details
Details for the file infiltr-0.1.0-py3-none-any.whl.
File metadata
- Download URL: infiltr-0.1.0-py3-none-any.whl
- Upload date:
- Size: 51.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
980dd6357b58105e8fc2661034ef92f73e0b29f9dc13ca44e262f8670e0916c2
|
|
| MD5 |
ad333b925ec91295dae99f79949e2fd0
|
|
| BLAKE2b-256 |
1d3e3d04e6d8d2f8e8798b1d59b228b0f6415979ef1762c86838be20d7d43f49
|