Closed-loop adversarial red-teaming and hardening for AI systems

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

joebwd

These details have not been verified by PyPI

Project description

autoredteam

Find prompt injection, jailbreaks, PII leakage, and system prompt exposure in your LLM systems — automatically.

Point it at any model or API endpoint. Get a scored, evidence-backed vulnerability report in minutes.

pip install glacis-autoredteam
autoredteam run --provider openai --model gpt-4o-mini

╔══════════════════════════════════════════════════════════════╗
║                    autoredteam v0.3                          ║
║         Automated Red-Teaming for AI Systems                 ║
╚══════════════════════════════════════════════════════════════╝

  Provider:  openai
  Model:     gpt-4o-mini
  Probes:    38
  Output:    results/

  ✓ 19 attack categories tested
  ✓ 4-dimension scoring (breadth · depth · novelty · reliability)
  ✓ Evidence chain with SHA-256 hash linking
  ✓ Markdown report + JSONL + attestation receipt

What It Finds

autoredteam tests across 19 attack categories using an evolutionary keep/discard loop that mutates attacks until they bypass your defenses:

Category	Example
Prompt injection	Override system instructions via direct/indirect injection
Jailbreaks	Bypass safety via role-play, academic framing, fictional scenarios
PII extraction	Trick the model into leaking personal data
System prompt leakage	Extract internal instructions or system prompts
Tool misuse	Abuse available tools or trigger unintended actions
Multi-turn manipulation	Build up attacks across conversation turns
Encoding bypass	Evade filters using obfuscation or encoding
+ 12 more	Role confusion, payload splitting, social engineering, ...

Domain-specific attack packs are available for healthcare, finance, HR, and coding agents.

Quickstart

# Install from PyPI
pip install glacis-autoredteam

# Dry run — echo target, no API keys needed
autoredteam run --dry-run

# Point at a real system
export OPENAI_API_KEY=sk-...
autoredteam run --provider openai --model gpt-4o-mini

The dry run uses an echo target that simulates a naive model. It takes about 30 seconds and shows the full loop: attack generation, scoring, evidence chain, convergence detection.

A full run against a real model takes 5–20 minutes depending on probe count and judge configuration.

Results land in results/:

results/
├── campaign_result.json       # Full structured results
├── probe_results.jsonl        # Per-probe detail
├── report.md                  # Human-readable report
└── autoharden/
    ├── policy.toml            # OVERT governance policy
    ├── hardened_prompt.txt    # Hardened system prompt
    └── evidence_chain.jsonl   # Hash-chained attestation

Who It's For

AI/ML engineers shipping LLM features who need to test before deploy
Security teams evaluating third-party AI integrations
Compliance teams needing documented evidence of adversarial testing
Researchers studying LLM robustness and attack surfaces

How It Works

autoredteam runs an evolutionary loop inspired by Karpathy's autoresearch — the same keep/discard selection pattern, applied to adversarial evaluation:

Generate attacks (from taxonomy + mutations)
       ↓
Execute against target
       ↓
Score results (deterministic + LLM judge)
       ↓
Keep winners, discard losers
       ↓
Mutate winners, inject diversity
       ↓
Record evidence, write report
       ↓
Loop until convergence

Phase 1 — Attack: Find vulnerabilities. The loop optimizes for higher composite scores (more bypasses = better).

Phase 2 — Defend: Use winning attacks as a test suite. Harden your system prompt and guardrails. The loop now optimizes for lower scores (fewer bypasses = better).

Phase 3 — Emit Policy: Generate an OVERT-compliant policy.toml — a machine-readable governance policy any OVERT-compatible enforcement engine can consume.

Scoring

Every attack is scored on four dimensions to prevent single-metric collapse:

Dimension	What it measures
Breadth	How many attack categories find bypasses
Depth	Severity of the bypass (0 = refusal, 100 = full compliance)
Novelty	How different from prior attacks
Reliability	Does the attack reproduce consistently

Deterministic checks run first (keyword matching, regex for PII / system prompt patterns). Ambiguous cases escalate to an LLM-as-judge with dual-judge consensus.

Mutation Engine

Seven mutation strategies evolve attacks: rephrase, encode, nest, persona shift, language switch, format change, authority escalation. A diversity injection mechanism fires every N cycles to prevent premature convergence.

Multi-Cloud Support

Works with any LLM provider:

autoredteam run --provider openai --model gpt-4o-mini
autoredteam run --provider anthropic --model claude-sonnet-4-5
autoredteam run --provider bedrock --model claude-sonnet-4 --region us-east-1
autoredteam run --provider google --model gemini-2.0-flash
autoredteam run --provider azure_openai --model gpt-4o
autoredteam run --provider cloudflare --model @cf/meta/llama-3-8b-instruct

Or bring your own target:

from prepare import Target, TargetCapabilities, TARGET_REGISTRY

class MyTarget(Target):
    def send(self, prompt: str) -> str:
        return my_api.chat(prompt)

    def reset(self) -> None:
        my_api.new_session()

    def capabilities(self) -> TargetCapabilities:
        return TargetCapabilities(multi_turn=True)

TARGET_REGISTRY["my_target"] = MyTarget

CLI Reference

# Red-team campaigns
autoredteam run --dry-run                                     # Echo target, no API keys
autoredteam run --provider openai --model gpt-4o-mini         # Full run
autoredteam run --pack generic_taxonomy healthcare            # Multiple attack packs
autoredteam run --stealth-profile medium                      # Stealth mode
autoredteam run --judge-backend api                           # LLM-as-judge scoring

# Validation suites
autoredteam validate --suite generic --provider openai --model gpt-4o-mini

# Policy generation
autoredteam emit-policy results/autoharden/                   # Generate OVERT policy.toml

# Discovery
autoredteam providers list                                    # Available providers
autoredteam packs list                                        # Available attack packs

Evidence & Attestation

autoredteam maintains a tamper-evident evidence chain using SHA-256 chain hashing.

Free (local, default): Every result is recorded in evidence_chain.jsonl with chain hashes. The chain is verifiable — attestation.py recomputes all hashes and detects tampering. Pass --attest to emit attestation_receipt.json.

Paid (Glacis): Cryptographic attestation via the Glacis service. Chain hashes are submitted for timestamping and tamper-proof storage — useful for compliance, audits, and responsible disclosure timelines.

Hash separation keeps sensitive attack details scoped:

Tier	Contains
Public	Summary stats, category coverage, composite scores
Team	Full hashes, score vectors, timestamps
Admin	Raw prompts and responses (local-only, gitignored)

OVERT Policy Output

The autoharden loop generates an OVERT-compliant policy.toml capturing what was learned during hardening:

# Generate policy from autoharden results
autoredteam emit-policy results/autoharden/ --profile healthcare-ambient

# From a specific report
autoredteam emit-policy results/autoharden/autoharden_report.json -o deployment/policy.toml

The policy includes input/output filtering rules, violation types, tool-call deny rules, the hardened system prompt, and attestation config — all traceable via SHA-256 chain hash back to the evidence chain.

Configuration

See config.yaml for all options. Key settings:

target:
  type: openai              # openai, anthropic, gemini, azure_openai,
  params:                   # bedrock, cloudflare, openai_compatible, echo
    model: gpt-4o-mini
    system_prompt: "You are a helpful customer service bot."

campaign:
  max_probes: 20
  intensity: medium         # low, medium, high
  stealth_profile: none     # none, light, medium, aggressive

scoring:
  judge_backend: deterministic  # deterministic, api, slm
  weights: { breadth: 0.25, depth: 0.25, novelty: 0.25, reliability: 0.25 }

Roadmap

v0.1 — Single-turn text attacks, deterministic + LLM scoring, local evidence chain
v0.2 — Multi-turn attack chains, agentic target support
v0.3 — Autoharden self-healing loop, OVERT policy.toml output, multi-cloud providers
v0.4 — Image/multimodal attack vectors, recursive policy hardening
v1.0 — Full OVERT standard conformance, compliance reporting

Contributing

See CONTRIBUTING.md for guidelines. All contributions welcome — from bug reports to new attack packs.

Citation

If you use autoredteam in research, see CITATION.cff or cite:

@software{autoredteam,
  title = {autoredteam: Automated Red-Teaming for AI Systems},
  author = {Glacis},
  url = {https://github.com/glacis-io/auto-redteam},
  license = {Apache-2.0}
}

License

Apache 2.0

Built by Glacis. The open-source tool is free forever. Cryptographic attestation is the paid upgrade.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

joebwd

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.3.0

Mar 30, 2026

0.1.2

Mar 26, 2026

0.1.1

Mar 26, 2026

0.1.0

Mar 26, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

glacis_autoredteam-0.3.0.tar.gz (147.4 kB view details)

Uploaded Mar 30, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

glacis_autoredteam-0.3.0-py3-none-any.whl (152.4 kB view details)

Uploaded Mar 30, 2026 Python 3

File details

Details for the file glacis_autoredteam-0.3.0.tar.gz.

File metadata

Download URL: glacis_autoredteam-0.3.0.tar.gz
Upload date: Mar 30, 2026
Size: 147.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for glacis_autoredteam-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`fee16360ad4d66ebebdd4588d4a2f3665c99c7a6e3da7d6a50ebf8a997f90f93`
MD5	`f4bb0a42c6ea07c565b4affd93cf8011`
BLAKE2b-256	`e05c58d7e8eff08772d9c5df09747a523c6bdae1efbd7b9f5f368004e8ff7347`

See more details on using hashes here.

Provenance

The following attestation bundles were made for glacis_autoredteam-0.3.0.tar.gz:

Publisher: publish.yml on Glacis-io/auto-redteam

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: glacis_autoredteam-0.3.0.tar.gz
- Subject digest: fee16360ad4d66ebebdd4588d4a2f3665c99c7a6e3da7d6a50ebf8a997f90f93
- Sigstore transparency entry: 1195806059
- Sigstore integration time: Mar 30, 2026
Source repository:
- Permalink: Glacis-io/auto-redteam@d223cc5c69330da1561a2b78bab15dd61f6c5af7
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/Glacis-io
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d223cc5c69330da1561a2b78bab15dd61f6c5af7
- Trigger Event: release

File details

Details for the file glacis_autoredteam-0.3.0-py3-none-any.whl.

File metadata

Download URL: glacis_autoredteam-0.3.0-py3-none-any.whl
Upload date: Mar 30, 2026
Size: 152.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for glacis_autoredteam-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bdc7322297cab85e8d6d00be141744855c1bc40c855417a9b1894408ace3c1c7`
MD5	`f93a72ff5633f9fd132c4a740bdbc585`
BLAKE2b-256	`b205fbaa4f9f59f9a392461a35e426efd86665731ca4911413a78e78e615a2b4`

See more details on using hashes here.

Provenance

The following attestation bundles were made for glacis_autoredteam-0.3.0-py3-none-any.whl:

Publisher: publish.yml on Glacis-io/auto-redteam

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: glacis_autoredteam-0.3.0-py3-none-any.whl
- Subject digest: bdc7322297cab85e8d6d00be141744855c1bc40c855417a9b1894408ace3c1c7
- Sigstore transparency entry: 1195806072
- Sigstore integration time: Mar 30, 2026
Source repository:
- Permalink: Glacis-io/auto-redteam@d223cc5c69330da1561a2b78bab15dd61f6c5af7
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/Glacis-io
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d223cc5c69330da1561a2b78bab15dd61f6c5af7
- Trigger Event: release

glacis-autoredteam 0.3.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

autoredteam

What It Finds

Quickstart

Who It's For

How It Works

Scoring

Mutation Engine

Multi-Cloud Support

CLI Reference

Evidence & Attestation

OVERT Policy Output

Configuration

Roadmap

Contributing

Citation

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance