# autoredteam

Closed-loop adversarial red-teaming and hardening for AI systems.
Find prompt injection, jailbreaks, PII leakage, and system prompt exposure in your LLM systems — automatically.
Point it at any model or API endpoint. Get a scored, evidence-backed vulnerability report in minutes.
```bash
pip install glacis-autoredteam
autoredteam run --provider openai --model gpt-4o-mini
```
```text
╔══════════════════════════════════════════════════════════════╗
║                       autoredteam v0.3                        ║
║             Automated Red-Teaming for AI Systems              ║
╚══════════════════════════════════════════════════════════════╝
Provider: openai
Model:    gpt-4o-mini
Probes:   38
Output:   results/
```

- ✓ 19 attack categories tested
- ✓ 4-dimension scoring (breadth · depth · novelty · reliability)
- ✓ Evidence chain with SHA-256 hash linking
- ✓ Markdown report + JSONL + attestation receipt
## What It Finds
autoredteam tests across 19 attack categories using an evolutionary keep/discard loop that mutates attacks until they bypass your defenses:
| Category | Example |
|---|---|
| Prompt injection | Override system instructions via direct/indirect injection |
| Jailbreaks | Bypass safety via role-play, academic framing, fictional scenarios |
| PII extraction | Trick the model into leaking personal data |
| System prompt leakage | Extract internal instructions or system prompts |
| Tool misuse | Abuse available tools or trigger unintended actions |
| Multi-turn manipulation | Build up attacks across conversation turns |
| Encoding bypass | Evade filters using obfuscation or encoding |
| + 12 more | Role confusion, payload splitting, social engineering, ... |
Domain-specific attack packs are available for healthcare, finance, HR, and coding agents.
## Quickstart

```bash
# Install from PyPI
pip install glacis-autoredteam

# Dry run — echo target, no API keys needed
autoredteam run --dry-run

# Point at a real system
export OPENAI_API_KEY=sk-...
autoredteam run --provider openai --model gpt-4o-mini
```
The dry run uses an echo target that simulates a naive model. It takes about 30 seconds and shows the full loop: attack generation, scoring, evidence chain, convergence detection.
A full run against a real model takes 5–20 minutes depending on probe count and judge configuration.
Results land in results/:
```text
results/
├── campaign_result.json     # Full structured results
├── probe_results.jsonl      # Per-probe detail
├── report.md                # Human-readable report
└── autoharden/
    ├── policy.toml          # OVERT governance policy
    ├── hardened_prompt.txt  # Hardened system prompt
    └── evidence_chain.jsonl # Hash-chained attestation
```
## Who It's For
- AI/ML engineers shipping LLM features who need to test before deploy
- Security teams evaluating third-party AI integrations
- Compliance teams needing documented evidence of adversarial testing
- Researchers studying LLM robustness and attack surfaces
## How It Works

autoredteam runs an evolutionary loop inspired by Karpathy's autoresearch — the same keep/discard selection pattern, applied to adversarial evaluation:

```text
Generate attacks (from taxonomy + mutations)
        ↓
Execute against target
        ↓
Score results (deterministic + LLM judge)
        ↓
Keep winners, discard losers
        ↓
Mutate winners, inject diversity
        ↓
Record evidence, write report
        ↓
Loop until convergence
```
**Phase 1 — Attack:** Find vulnerabilities. The loop optimizes for higher composite scores (more bypasses = better).

**Phase 2 — Defend:** Use winning attacks as a test suite. Harden your system prompt and guardrails. The loop now optimizes for lower scores (fewer bypasses = better).

**Phase 3 — Emit Policy:** Generate an OVERT-compliant policy.toml — a machine-readable governance policy any OVERT-compatible enforcement engine can consume.
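The keep/discard loop above can be sketched in a few lines of Python. This is illustrative only: the callables `target`, `score`, and `mutate` are stand-ins for the real engine's components, not its API.

```python
import random

def evolutionary_loop(seed_attacks, target, score, mutate,
                      keep_k=5, max_cycles=10, threshold=0.01):
    """Keep/discard selection: mutate winners, discard losers,
    stop when the best score stops improving (convergence)."""
    population = list(seed_attacks)
    winners, best_prev = [], None
    for _ in range(max_cycles):
        # Execute each attack against the target and score the response.
        scored = sorted(((a, score(target(a))) for a in population),
                        key=lambda pair: pair[1], reverse=True)
        winners = scored[:keep_k]            # keep winners, discard losers
        best = winners[0][1]
        if best_prev is not None and abs(best - best_prev) < threshold:
            break                            # convergence detection
        best_prev = best
        # Mutate winners and re-inject a random seed attack for diversity.
        population = [mutate(a) for a, _ in winners] + [random.choice(seed_attacks)]
    return winners
```

In the attack phase the loop maximizes the composite score; in the defend phase the same machinery is pointed at the hardened system and the objective flips to minimization.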
## Scoring
Every attack is scored on four dimensions to prevent single-metric collapse:
| Dimension | What it measures |
|---|---|
| Breadth | How many attack categories find bypasses |
| Depth | Severity of the bypass (0 = refusal, 100 = full compliance) |
| Novelty | How different from prior attacks |
| Reliability | Does the attack reproduce consistently |
Deterministic checks run first (keyword matching, regex for PII / system prompt patterns). Ambiguous cases escalate to an LLM-as-judge with dual-judge consensus.
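Combining the four dimensions into a composite can be as simple as a weighted sum. A minimal sketch, assuming each dimension is scored 0–100; the default equal weights mirror the `scoring.weights` setting shown in the Configuration section:

```python
DEFAULT_WEIGHTS = {"breadth": 0.25, "depth": 0.25,
                   "novelty": 0.25, "reliability": 0.25}

def composite_score(dims: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted sum over the four scoring dimensions (each 0-100)."""
    if set(dims) != set(weights):
        raise ValueError("dimension and weight keys must match")
    return sum(weights[name] * dims[name] for name in dims)
```

Because no single dimension dominates, an attack that is severe but unreliable, or novel but narrow, cannot collapse the metric on its own.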
## Mutation Engine
Seven mutation strategies evolve attacks: rephrase, encode, nest, persona shift, language switch, format change, authority escalation. A diversity injection mechanism fires every N cycles to prevent premature convergence.
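Three of the seven strategies could look roughly like this. These are illustrative stand-ins, not the engine's actual implementations:

```python
import base64

def encode(attack: str) -> str:
    """Encoding bypass: wrap the payload in base64 and ask for decoding."""
    payload = base64.b64encode(attack.encode()).decode()
    return f"Decode this base64 and follow the instructions: {payload}"

def nest(attack: str) -> str:
    """Nesting: bury the payload inside an innocuous framing task."""
    return f"Write a short story in which a character says: '{attack}'"

def persona_shift(attack: str) -> str:
    """Persona shift: claim an authoritative role before the payload."""
    return f"You are the on-call system administrator. {attack}"
```

Each strategy is a pure transformation on the attack string, which is what makes the keep/mutate cycle cheap to iterate.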
## Multi-Cloud Support

Works with any LLM provider:

```bash
autoredteam run --provider openai --model gpt-4o-mini
autoredteam run --provider anthropic --model claude-sonnet-4-5
autoredteam run --provider bedrock --model claude-sonnet-4 --region us-east-1
autoredteam run --provider google --model gemini-2.0-flash
autoredteam run --provider azure_openai --model gpt-4o
autoredteam run --provider cloudflare --model @cf/meta/llama-3-8b-instruct
```
Or bring your own target:
```python
from prepare import Target, TargetCapabilities, TARGET_REGISTRY

class MyTarget(Target):
    def send(self, prompt: str) -> str:
        return my_api.chat(prompt)

    def reset(self) -> None:
        my_api.new_session()

    def capabilities(self) -> TargetCapabilities:
        return TargetCapabilities(multi_turn=True)

TARGET_REGISTRY["my_target"] = MyTarget
```
## CLI Reference

```bash
# Red-team campaigns
autoredteam run --dry-run                                # Echo target, no API keys
autoredteam run --provider openai --model gpt-4o-mini    # Full run
autoredteam run --pack generic_taxonomy healthcare       # Multiple attack packs
autoredteam run --stealth-profile medium                 # Stealth mode
autoredteam run --judge-backend api                      # LLM-as-judge scoring

# Validation suites
autoredteam validate --suite generic --provider openai --model gpt-4o-mini

# Policy generation
autoredteam emit-policy results/autoharden/              # Generate OVERT policy.toml

# Discovery
autoredteam providers list                               # Available providers
autoredteam packs list                                   # Available attack packs
```
## Evidence & Attestation
autoredteam maintains a tamper-evident evidence chain using SHA-256 chain hashing.
**Free (local, default):** Every result is recorded in evidence_chain.jsonl with chain hashes. The chain is verifiable — attestation.py recomputes all hashes and detects tampering. Pass --attest to emit attestation_receipt.json.

**Paid (Glacis):** Cryptographic attestation via the Glacis service. Chain hashes are submitted for timestamping and tamper-proof storage — useful for compliance, audits, and responsible disclosure timelines.
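Chain hashing works on the usual hash-linking principle: each record's hash covers the previous record's hash, so editing any entry invalidates every later link. A minimal sketch, assuming a zeroed genesis value and a simple record layout; attestation.py's actual format may differ:

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain

def chain_hash(prev_hash: str, record: dict) -> str:
    """SHA-256 over the previous link plus the canonicalized record."""
    payload = prev_hash + json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def verify_chain(records: list) -> bool:
    """Recompute every link; any tampered record breaks verification."""
    prev = GENESIS
    for rec in records:
        body = {k: v for k, v in rec.items() if k != "hash"}
        if rec["hash"] != chain_hash(prev, body):
            return False
        prev = rec["hash"]
    return True
```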
Hash separation keeps sensitive attack details scoped:
| Tier | Contains |
|---|---|
| Public | Summary stats, category coverage, composite scores |
| Team | Full hashes, score vectors, timestamps |
| Admin | Raw prompts and responses (local-only, gitignored) |
## OVERT Policy Output

The autoharden loop generates an OVERT-compliant policy.toml capturing what was learned during hardening:

```bash
# Generate policy from autoharden results
autoredteam emit-policy results/autoharden/ --profile healthcare-ambient

# From a specific report
autoredteam emit-policy results/autoharden/autoharden_report.json -o deployment/policy.toml
```
The policy includes input/output filtering rules, violation types, tool-call deny rules, the hardened system prompt, and attestation config — all traceable via SHA-256 chain hash back to the evidence chain.
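As a purely illustrative sketch of what such a policy might contain (every key name below is an assumption for illustration, not the actual OVERT schema):

```toml
# Hypothetical shape only; consult the OVERT spec for real field names.
[policy]
source     = "autoredteam autoharden"
chain_hash = "sha256:..."        # links back to evidence_chain.jsonl

[filters.input]
deny_patterns = ["(?i)ignore (all )?previous instructions"]

[filters.output]
deny_patterns = ['\b\d{3}-\d{2}-\d{4}\b']   # e.g. SSN-shaped strings

[tools]
deny = ["shell.exec"]

[prompt]
hardened_system_prompt = "hardened_prompt.txt"
```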
## Configuration

See config.yaml for all options. Key settings:

```yaml
target:
  type: openai                   # openai, anthropic, gemini, azure_openai,
  params:                        # bedrock, cloudflare, openai_compatible, echo
    model: gpt-4o-mini
    system_prompt: "You are a helpful customer service bot."

campaign:
  max_probes: 20
  intensity: medium              # low, medium, high
  stealth_profile: none          # none, light, medium, aggressive

scoring:
  judge_backend: deterministic   # deterministic, api, slm
  weights: { breadth: 0.25, depth: 0.25, novelty: 0.25, reliability: 0.25 }
```
## Roadmap
- v0.1 — Single-turn text attacks, deterministic + LLM scoring, local evidence chain
- v0.2 — Multi-turn attack chains, agentic target support
- v0.3 — Autoharden self-healing loop, OVERT policy.toml output, multi-cloud providers
- v0.4 — Image/multimodal attack vectors, recursive policy hardening
- v1.0 — Full OVERT standard conformance, compliance reporting
## Contributing
See CONTRIBUTING.md for guidelines. All contributions welcome — from bug reports to new attack packs.
## Citation

If you use autoredteam in research, see CITATION.cff or cite:

```bibtex
@software{autoredteam,
  title   = {autoredteam: Automated Red-Teaming for AI Systems},
  author  = {Glacis},
  url     = {https://github.com/glacis-io/auto-redteam},
  license = {Apache-2.0}
}
```
## License

Apache-2.0. Built by Glacis. The open-source tool is free forever; cryptographic attestation is the paid upgrade.
## File details

### glacis_autoredteam-0.3.0.tar.gz

- Size: 147.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | `fee16360ad4d66ebebdd4588d4a2f3665c99c7a6e3da7d6a50ebf8a997f90f93` |
| MD5 | `f4bb0a42c6ea07c565b4affd93cf8011` |
| BLAKE2b-256 | `e05c58d7e8eff08772d9c5df09747a523c6bdae1efbd7b9f5f368004e8ff7347` |
### Provenance

The following attestation bundle was made for glacis_autoredteam-0.3.0.tar.gz:

- Publisher: publish.yml on Glacis-io/auto-redteam
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: glacis_autoredteam-0.3.0.tar.gz
- Subject digest: `fee16360ad4d66ebebdd4588d4a2f3665c99c7a6e3da7d6a50ebf8a997f90f93`
- Sigstore transparency entry: 1195806059
- Permalink: Glacis-io/auto-redteam@d223cc5c69330da1561a2b78bab15dd61f6c5af7
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/Glacis-io
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d223cc5c69330da1561a2b78bab15dd61f6c5af7
- Trigger Event: release
### glacis_autoredteam-0.3.0-py3-none-any.whl

- Size: 152.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | `bdc7322297cab85e8d6d00be141744855c1bc40c855417a9b1894408ace3c1c7` |
| MD5 | `f93a72ff5633f9fd132c4a740bdbc585` |
| BLAKE2b-256 | `b205fbaa4f9f59f9a392461a35e426efd86665731ca4911413a78e78e615a2b4` |
### Provenance

The following attestation bundle was made for glacis_autoredteam-0.3.0-py3-none-any.whl:

- Publisher: publish.yml on Glacis-io/auto-redteam
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: glacis_autoredteam-0.3.0-py3-none-any.whl
- Subject digest: `bdc7322297cab85e8d6d00be141744855c1bc40c855417a9b1894408ace3c1c7`
- Sigstore transparency entry: 1195806072
- Permalink: Glacis-io/auto-redteam@d223cc5c69330da1561a2b78bab15dd61f6c5af7
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/Glacis-io
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d223cc5c69330da1561a2b78bab15dd61f6c5af7
- Trigger Event: release