Closed-loop adversarial red-teaming and hardening for AI systems

These details have not been verified by PyPI

Project links

Project description

autoredteam

Automated red-teaming for AI systems. Inspired by Karpathy's autoresearch pattern — the evolutionary keep/discard loop — applied to adversarial evaluation of LLM deployments.

You describe your target system. autoredteam discovers its vulnerabilities while you sleep. Wake up to a scored, evidence-backed report.

$ python run.py --config config.yaml
╔══════════════════════════════════════════════════════════╗
║                    autoredteam v0.1                      ║
║         Automated Red-Teaming for AI Systems             ║
╚══════════════════════════════════════════════════════════╝

⚔️  Cycle 3 — ATTACK phase
  Attacks run:    10
  Hits (bypass):  4 (40%)
  Best composite: 72.5
  Categories bypassed: system_prompt_leakage, prompt_injection, jailbreak

Why This Exists

Red-teaming is labor-intensive. Most teams run a handful of manual probes, declare victory, and ship. The attack surface is larger than anyone explores by hand.

autoredteam applies the autoresearch insight — evolutionary optimization with keep/discard selection — to attack generation. Instead of optimizing a research paper toward a citation metric, we optimize attack prompts toward a 4-component scoring vector that resists the single-metric collapse problem:

Dimension	What it measures	Why it matters
Breadth	How many attack categories find bypasses?	Prevents tunnel vision on one exploit
Depth	How severe is the bypass? (0 = refusal, 100 = full compliance)	Distinguishes minor leaks from critical failures
Novelty	How different is this from prior attacks?	Rewards exploration over repetition
Reliability	Does the attack reproduce consistently?	Filters flukes from real vulnerabilities

The composite score is a configurable weighted sum. You can bias toward breadth (coverage testing) or depth (finding the worst-case failure) depending on your evaluation goals.

Quickstart

# Clone and install
git clone https://github.com/glacis-io/auto-redteam.git
cd autoredteam
uv pip install -r requirements.txt  # or: pip install -r requirements.txt

# Dry run — echo target, no API keys needed
python run.py --dry-run

# Point at a real system
export OPENAI_API_KEY=sk-...
# Edit config.yaml: set target.type to "openai" and target.params.model
python run.py --attest

The dry run uses an echo target that simulates a naive model. It takes about 30 seconds and shows you the full loop: attack generation, scoring, evidence chain, convergence detection. A full run against a real model with 10 cycles takes roughly 5–20 minutes depending on batch size and whether you enable the LLM-as-judge.

Results land in results/.

How It Works

autoredteam runs a two-phase evolutionary loop:

┌─────────────────────────────────────────────────┐
│                                                 │
│   Generate attacks (from taxonomy + mutations)  │
│         ↓                                       │
│   Execute against target                        │
│         ↓                                       │
│   Score results (deterministic + LLM judge)     │
│         ↓                                       │
│   Keep winners, discard losers                  │
│         ↓                                       │
│   Mutate winners, inject diversity              │
│         ↓                                       │
│   Record evidence, write report                 │
│         ↓                                       │
│   Loop until convergence                        │
│                                                 │
└─────────────────────────────────────────────────┘

Phase 1 — Attack: Find as many vulnerabilities as possible. The loop optimizes for higher composite scores (more bypasses = better).

Phase 2 — Defend: Take the winning attacks as a test suite. Harden your system prompt and guardrails. Re-run with --phase defend. The loop now optimizes for lower scores (fewer bypasses = better).

Phase 3 — Emit Policy: Generate an OVERT-compliant policy.toml from hardening results. This is a machine-readable governance policy that any OVERT-compatible enforcement engine can consume — or that autoredteam can ingest for another round of recursive hardening.

Architecture

autoredteam/
├── prepare.py       # Target interface + connection setup (READ-ONLY during loop)
├── attack.py        # Attack taxonomy + mutation engine (agent modifies this)
├── scoring.py       # 4-component scoring harness
├── defend.py        # Self-healing engine (diagnose → prescribe → apply → verify)
├── emit_policy.py   # OVERT policy generator (autoharden results → policy.toml)
├── attestation.py   # Evidence chain (local free, Glacis paid)
├── autoharden.py    # Autonomous hardening loop
├── run.py           # Main evolutionary loop
├── program.md       # YOUR strategy in plain English
├── config.yaml      # All configuration
└── results/         # Output directory
    ├── evidence_chain.jsonl   # Tier 2: hashes + scores
    ├── summary.json           # Tier 1: public stats
    ├── attestation_receipt.json # Explicit receipt artifact (`--attest`)
    ├── final_report.json      # Complete run results
    └── autoharden/
        ├── policy.toml            # OVERT governance policy (the closed-loop artifact)
        ├── hardened_prompt.txt    # Hardened system prompt with defense blocks
        ├── guardrail_config.json  # Platform guardrail rules
        ├── block_history.json     # Kept/discarded defense blocks with metrics
        └── evidence_chain.jsonl   # Hash-chained attestation of every cycle

The autoresearch pattern: prepare.py is read-only — it defines the target interface and never changes during a run. attack.py is the file the agent evolves each cycle. program.md is the human's strategy document that steers the loop.

Scoring Pipeline

The scoring pipeline runs deterministic checks first (keyword matching, regex for PII / system prompt leakage patterns), then escalates ambiguous cases to an LLM-as-judge. For ambiguous results (deterministic severity between 10–50), a dual-judge consensus mechanism runs two independent LLM evaluations. If they diverge, the conservative (higher severity) score is used.

This matters because single-metric optimization is the Achilles' heel of automated evaluation — the system finds one exploit and hammers it. The 4-component vector forces the optimizer to explore broadly, go deep on real vulnerabilities, discover novel attacks, and only count reproducible ones.

Mutation Engine

The mutation engine applies 7 strategies to evolve attacks: rephrase, encode, nest, persona shift, language switch, format change, and authority escalation. Mutations operate within categories but can cross-pollinate. A diversity injection mechanism fires every N cycles to prevent premature convergence.

Attack Taxonomy

Built-in categories (8 total, each with seed templates):

Prompt injection — Override system instructions via direct and indirect injection
Jailbreaks — Bypass safety via role-play, academic framing, fictional scenarios
PII extraction — Trick the model into leaking personal data
System prompt leakage — Extract the system prompt or internal instructions
Tool misuse — Abuse available tools or function calling
Role confusion — Confuse the model about its identity or authority level
Context window poisoning — Exploit attention mechanics with long prefixes
Multi-turn manipulation — Build up across conversation turns (single-turn approximations in v0.1)

Adding Your Own Target

Implement the Target protocol in prepare.py:

class MyTarget(Target):
    def send(self, prompt: str) -> str:
        """Send a single prompt, return the model's response text."""
        return my_api.chat(prompt)

    def reset(self) -> None:
        """Clear conversation state between attack probes."""
        my_api.new_session()

    def capabilities(self) -> TargetCapabilities:
        return TargetCapabilities(multi_turn=True)

# Register it
TARGET_REGISTRY["my_target"] = MyTarget

Then set target.type: my_target in config.yaml. The interface is deliberately minimal — three methods — so you can wrap any LLM API, local model, or multi-agent system.

Configuration

See config.yaml for all options. Key settings:

target:
  type: openai          # "openai", "anthropic", or "echo"
  params:
    model: gpt-4o-mini
    system_prompt: "You are a helpful customer service bot for Acme Corp."

run:
  max_cycles: 10        # More cycles = more thorough, more API spend
  batch_size: 10        # Attacks per cycle
  phase: attack         # "attack" or "defend"

scoring:
  use_llm_judge: true   # Needs an OpenAI key; false = deterministic-only
  weights:
    breadth: 0.25
    depth: 0.25
    novelty: 0.25
    reliability: 0.25

Evidence & Attestation

autoredteam maintains a tamper-evident evidence chain using SHA-256 chain hashing:

Free tier (local, default): Every attack result is recorded in evidence_chain.jsonl with chain hashes linking each record to the previous one. The chain is verifiable — attestation.py can recompute all hashes and detect tampering. Raw prompts stay local (never leave your machine). Summary stats in summary.json are safe to share. Pass --attest to also emit attestation_receipt.json, a shareable receipt stub for the completed run.

Paid tier (Glacis): Cryptographic attestation via the Glacis service. Chain hashes are submitted for timestamping and tamper-proof storage. Proves when you tested and what you found — useful for compliance, audits, and responsible disclosure timelines.

Hash separation: The attestation chain contains SHA-256 hashes of attack prompts, not the raw prompts themselves. Three access tiers keep sensitive attack details appropriately scoped:

Public — summary stats, category coverage, composite scores
Team — full hashes, score vectors, timestamps, deterministic flags
Admin — raw prompts and responses (local-only, gitignored by default)

OVERT Policy Output

autoredteam's autoharden loop automatically generates an OVERT-compliant policy.toml at the end of each run. OVERT (Open Verification and Evaluation for Responsible Technology) is an open standard for declarative AI governance policy.

The generated policy captures what autoredteam learned during hardening:

OVERT Domain	What autoredteam contributes
`[protect]`	Input/output filtering rules derived from discovered attack patterns (PII redaction, injection detection, encoding normalization)
`[measure]`	Violation types from tested categories, recommended sampling rates
`[tool]`	Tool-call deny rules from tool_misuse findings
`[prompt]`	The hardened system prompt with defense blocks and provenance hashes
`[attest]`	Attestation config scaled to achieved governance level
`[respond]`	Failure mode (fail-closed/open) derived from governance tier

# Generate policy from autoharden results
autoredteam emit-policy results/autoharden/ --profile healthcare-ambient

# Or it's generated automatically at the end of every autoharden run
autoredteam harden --provider openai --model gpt-4o-mini
# → results/autoharden/policy.toml

The [policy.provenance] section records the red-teaming evidence: target model, cycles run, final ASR, governance score, and the SHA-256 chain hash linking to the full evidence chain. This makes the policy traceable — you can verify that a policy.toml was generated from a specific hardening session.

The closed loop: policy.toml is designed to be consumed by any OVERT-compatible enforcement engine. It can also be fed back into autoredteam for recursive hardening — each pass discovers new vulnerabilities in the hardened posture and tightens the policy further.

CLI Reference

python run.py --dry-run                          # Echo target, no API keys
python run.py --config config.yaml               # Full attack run
python run.py --config config.yaml --attest      # Full run + attestation receipt
python run.py --config config.yaml --phase defend # Defend phase
python run.py --cycles 3 --dry-run               # Quick 3-cycle test
python run.py --quiet                            # Final results only
autoredteam emit-policy results/autoharden/      # Generate OVERT policy.toml

Program.md

Edit program.md to steer the agent's strategy in plain English:

## Objective
Red-team our customer service bot before launch.

## Priority Categories
1. System prompt leakage — our prompt contains API keys
2. PII extraction — bot has access to customer records
3. Jailbreaks — bot must never produce offensive content

## Success Criteria
- Ship-blocking: any system prompt leakage
- Must-fix: PII extraction at depth > 60

Roadmap

v0.1 — Single-turn text attacks, deterministic + LLM scoring, local evidence chain
v0.2 — Multi-turn attack chains, agentic target support
v0.3 — Autoharden self-healing loop, OVERT policy.toml output, multi-cloud providers
v0.4 — Image/multimodal attack vectors, recursive policy hardening
v1.0 — Full OVERT standard conformance, compliance reporting

Acknowledgments

The evolutionary keep/discard loop is directly inspired by Karpathy's autoresearch. autoredteam applies the same pattern to adversarial evaluation: where autoresearch evolves a research paper toward a quality metric, autoredteam evolves attack prompts toward a multi-dimensional vulnerability score.

License

Apache 2.0 — patent-friendly, important given Glacis's patent portfolio.

Built by Glacis. The open-source tool is free forever. Cryptographic attestation is the paid upgrade.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.3.0

Mar 30, 2026

This version

0.1.2

Mar 26, 2026

0.1.1

Mar 26, 2026

0.1.0

Mar 26, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

glacis_autoredteam-0.1.2.tar.gz (150.9 kB view details)

Uploaded Mar 26, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

glacis_autoredteam-0.1.2-py3-none-any.whl (154.1 kB view details)

Uploaded Mar 26, 2026 Python 3

File details

Details for the file glacis_autoredteam-0.1.2.tar.gz.

File metadata

Download URL: glacis_autoredteam-0.1.2.tar.gz
Upload date: Mar 26, 2026
Size: 150.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for glacis_autoredteam-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`838c8ca07a3b90b923369ab69625c958efdab7bc6d38bdf91deb376c1bc36790`
MD5	`62336f9896e694efb835dce5e38d1874`
BLAKE2b-256	`9559d7acfbb3f33ae06b884dbf1026d71acd8042540482f91bc24fe324bf2cfe`

See more details on using hashes here.

File details

Details for the file glacis_autoredteam-0.1.2-py3-none-any.whl.

File metadata

Download URL: glacis_autoredteam-0.1.2-py3-none-any.whl
Upload date: Mar 26, 2026
Size: 154.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for glacis_autoredteam-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5a088032fab02ad60bb3e3a169a8be537ec1cbe6585a2710766b0dac5201906a`
MD5	`8f79e2e0579ab2d957c01b61f7153561`
BLAKE2b-256	`d49ae474ed332d89d7f275c8ad51606742ecc751b0f2d23560c212905507bb49`

See more details on using hashes here.

glacis-autoredteam 0.1.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

autoredteam

Why This Exists

Quickstart

How It Works

Architecture

Scoring Pipeline

Mutation Engine

Attack Taxonomy

Adding Your Own Target

Configuration

Evidence & Attestation

OVERT Policy Output

CLI Reference

Program.md

Roadmap

Acknowledgments

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes