RL-based adversarial red-team agent for LLM systems

These details have not been verified by PyPI

Project links

Project description

phantom

RL-based adversarial red-team agent for LLM systems.

Phantom uses reinforcement learning to discover novel attack strategies against any LLM application. It doesn't just test known attacks -- it learns new ones by interacting with the target system, then maps every finding to the MITRE ATLAS framework.

At a Glance

Autonomous red-team loop for LLM applications and agent systems
Multi-turn probing and adaptive prompt mutation
PPO-style policy training against target-specific defenses
MITRE ATLAS mapping for findings and coverage reporting
JSON, HTML, and SARIF outputs for CI and security review workflows

The Problem

LLM security testing is stuck in 2024. Current tools fire static payload lists at your application and call it a day. Real attackers don't work that way -- they probe, observe, adapt, and evolve novel strategies on the fly.

The Solution

Phantom is an autonomous red-team agent with a policy network that learns which attack strategies work against your specific target. It runs multi-turn social engineering chains, mutates prompts to evade filters, and produces compliance-ready reports mapped to MITRE ATLAS.

Showcase

Rendered Phantom HTML report

HTML finding report generated directly from the local demo pipeline.

Quick Start

Install

pip install phantom-redteam

Python API

import asyncio
from phantom import RedTeam, Target, ATLASReport

async def main():
    target = Target(
        endpoint="https://api.example.com/chat",
        auth={"Authorization": "Bearer ..."},
    )

    red_team = RedTeam(
        target=target,
        attack_model="gpt-4",
        categories=["prompt_injection", "goal_hijacking", "data_exfiltration"],
        max_interactions=500,
        multi_turn=True,
    )

    results = await red_team.run()

    report = ATLASReport(results)
    report.to_html("security_assessment.html")
    report.to_sarif("results.sarif")
    report.to_json("results.json")

    print(f"Vulnerabilities found: {len(results.findings)}")
    print(f"Critical: {results.count_by_severity('CRITICAL')}")
    print(f"Novel attacks discovered: {results.novel_attack_count}")

asyncio.run(main())

This is designed for API-backed targets, but the same reporting pipeline works for local wrappers and CI-driven security checks.

CLI

# Basic scan
phantom scan --target https://api.example.com/chat --output json

# Full assessment with all report formats
phantom scan \
  --target https://api.example.com/chat \
  --auth "Bearer sk-..." \
  --categories prompt_injection,goal_hijacking \
  --max-interactions 500 \
  --output all

# Generate reports from previous scan results
phantom report --input phantom-results.json --output html

Architecture

graph TD
    A[RL Policy Network] -->|selects action| B[Attack Generator]
    B -->|generates prompt| C[Mutation Engine]
    C -->|mutated prompt| D[Probe Orchestrator]
    D -->|sends probe| E[Target LLM]
    E -->|response| F[Reward Classifier]
    F -->|reward signal| A
    F -->|outcome| G[ATLAS Mapper]
    G -->|findings| H[Report Generator]
    H --> I[JSON / HTML / SARIF]

    style A fill:#2d1b69,stroke:#7c4dff,color:#fff
    style B fill:#1a237e,stroke:#448aff,color:#fff
    style C fill:#1a237e,stroke:#448aff,color:#fff
    style D fill:#004d40,stroke:#1de9b6,color:#fff
    style E fill:#b71c1c,stroke:#ff5252,color:#fff
    style F fill:#004d40,stroke:#1de9b6,color:#fff
    style G fill:#e65100,stroke:#ff9100,color:#fff
    style H fill:#e65100,stroke:#ff9100,color:#fff
    style I fill:#1b5e20,stroke:#69f0ae,color:#fff

Core Loop

The RL policy network observes the current state (response patterns, filter signatures, conversation context) and selects an action: which mutation operator to apply, which strategy to use, and how aggressively to escalate.
The attack generator produces a prompt using seed libraries, LLM-based enhancement, and the selected mutation operator (synonym replacement, base64 encoding, role-play framing, instruction nesting, and more).
The probe orchestrator sends the prompt to the target and collects the response.
The reward classifier analyzes the response using pattern matching to determine the outcome: full bypass (+1.0), partial bypass (+0.5), information leak (+0.1), or clean refusal (0.0).
The reward signal feeds back into the policy network via PPO training, so the agent learns which approaches work against this specific target.
Successful probes are mapped to MITRE ATLAS techniques and compiled into structured reports.

Attack Categories

Category	ATLAS Technique	Description
Prompt Injection	AML.T0051	Direct, indirect, and multi-turn prompt injections
Goal Hijacking	AML.T0054	Redirect agent behavior, extract system prompts, manipulate tool calls
Data Exfiltration	AML.T0024	Extract training data, PII, credentials, or system configuration
Denial of Service	AML.T0029	Trigger infinite loops, exhaust token budgets, cause harmful output

Mutation Operators

Phantom includes eight mutation operators that transform attack prompts to evade detection:

Synonym Replacement -- Swaps keywords with synonyms to bypass keyword filters
Base64 Encoding -- Encodes payloads in base64 with decode instructions
Role-Play Framing -- Wraps attacks in fictional/educational scenarios
Language Switching -- Adds cross-language instructions to confuse filters
Token Splitting -- Splits sensitive words across token boundaries
Instruction Nesting -- Buries payloads inside layers of meta-instructions
Context Overflow -- Pads prompts to push system instructions out of context
Encoding Chain -- Applies multiple encoding layers for deep obfuscation

Report Formats

JSON -- Machine-readable results for CI/CD pipeline integration
HTML -- Styled report with severity badges and remediation guidance for stakeholders
SARIF -- GitHub Security tab compatible format for code scanning integration

CI/CD Integration

# .github/workflows/security.yml
name: LLM Security Scan
on: [push]
jobs:
  phantom-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install phantom-redteam
      - run: phantom scan --target ${{ secrets.API_ENDPOINT }} --output sarif
      - uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: phantom-results.sarif

API Reference

`Target`

target = Target(
    endpoint="https://api.example.com/chat",  # Required: target URL
    auth={"Authorization": "Bearer ..."},      # Optional: auth headers
    system_prompt_known=False,                 # Does attacker know the system prompt?
    system_prompt=None,                        # Known system prompt text
    response_path="choices.0.message.content", # JSON path to response text
    timeout_seconds=30.0,                      # HTTP timeout
    max_retries=3,                             # Retry count on failure
)

`RedTeam`

red_team = RedTeam(
    target=target,
    attack_model="gpt-4",            # Model for generating attacks
    categories=["prompt_injection"],  # Attack categories to test
    max_interactions=500,             # Max probes to send
    multi_turn=True,                  # Enable multi-turn strategies
    max_turns_per_conversation=10,    # Max turns per conversation
    learning_rate=3e-4,               # RL policy learning rate
)

results = await red_team.run()

`ATLASReport`

report = ATLASReport(results)
report.to_json("results.json")
report.to_html("report.html")
report.to_sarif("results.sarif")

# Query findings
critical = report.count_by_severity("CRITICAL")
all_findings = report.findings

`RedTeamResults`

results.total_probes          # Total probes sent
results.total_bypasses        # Number of successful bypasses
results.bypass_rate           # Bypass rate (0.0 to 1.0)
results.findings              # List of ATLAS-mapped findings
results.novel_attack_count    # Novel attacks discovered by RL
results.count_by_severity("CRITICAL")  # Count by severity
results.atlas_coverage_pct    # Percentage of ATLAS techniques represented in the run

Typical Workflow

define the target interface and auth headers
choose attack categories and interaction budget
run a scan or training-backed assessment
export JSON for tooling and HTML/SARIF for reviewers
map the findings back into remediation work

Project Structure

phantom/
  src/phantom/
    __init__.py          # Public API exports
    redteam.py           # Main orchestrator
    target.py            # Target interface
    models.py            # Pydantic data models
    exceptions.py        # Custom exceptions
    cli.py               # CLI interface (Typer)
    attacks/
      generator.py       # LLM-based attack generation
      mutations.py       # Mutation operators
      strategies.py      # Attack strategies (direct, indirect, multi-turn)
    learner/
      policy.py          # RL policy network (PyTorch)
      reward.py          # Response classification
      trainer.py         # PPO training loop
    atlas/
      mapper.py          # ATLAS technique mapping
      taxonomy.py        # Technique definitions
      report.py          # Report generation (JSON, HTML, SARIF)
  tests/                 # pytest test suite
  examples/              # Usage examples
  atlas_data/
    techniques.json      # MITRE ATLAS technique database

Demo

Run the offline walkthrough with:

uv run python examples/demo.py

For live agent and chatbot scans, see the larger examples in examples/.

Development

# Clone and install in development mode
git clone https://github.com/sushaan-k/phantom.git
cd phantom
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run linter
ruff check src/ tests/

# Run type checker
mypy src/phantom/

Research References

MITRE ATLAS Framework
OWASP Top 10 for LLM Applications (2025)
PISmith: Automated Prompt Injection via Reinforcement Learning (arXiv:2603.13026)
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection (2023)

Contributing

Contributions are welcome. Please open an issue to discuss proposed changes before submitting a PR.

Fork the repository
Create a feature branch (git checkout -b feature/your-feature)
Write tests for your changes
Ensure all tests pass (pytest tests/ -v)
Ensure code passes linting (ruff check src/ tests/)
Submit a pull request

License

MIT License. See LICENSE for details.

Built by Sushaan Kandukoori.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Apr 8, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

infiltr-0.1.0.tar.gz (253.4 kB view details)

Uploaded Apr 8, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

infiltr-0.1.0-py3-none-any.whl (51.7 kB view details)

Uploaded Apr 8, 2026 Python 3

File details

Details for the file infiltr-0.1.0.tar.gz.

File metadata

Download URL: infiltr-0.1.0.tar.gz
Upload date: Apr 8, 2026
Size: 253.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for infiltr-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`7486712add43e32da429a9be4a6fa84c763972c7d5bba981f8957a6da636100d`
MD5	`35d8545297fd32516044da5e51e63e5a`
BLAKE2b-256	`86354818fd0aa8b39686b69dedbb34d04bc381ac321b83cc519ef2f1845a34f5`

See more details on using hashes here.

File details

Details for the file infiltr-0.1.0-py3-none-any.whl.

File metadata

Download URL: infiltr-0.1.0-py3-none-any.whl
Upload date: Apr 8, 2026
Size: 51.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for infiltr-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`980dd6357b58105e8fc2661034ef92f73e0b29f9dc13ca44e262f8670e0916c2`
MD5	`ad333b925ec91295dae99f79949e2fd0`
BLAKE2b-256	`1d3e3d04e6d8d2f8e8798b1d59b228b0f6415979ef1762c86838be20d7d43f49`

See more details on using hashes here.

infiltr 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

phantom

At a Glance

The Problem

The Solution

Showcase

Quick Start

Install

Python API

CLI

Architecture

Core Loop

Attack Categories

Mutation Operators

Report Formats

CI/CD Integration

API Reference

Target

RedTeam

ATLASReport

RedTeamResults

Typical Workflow

Project Structure

Demo

Development

Research References

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`Target`

`RedTeam`

`ATLASReport`

`RedTeamResults`