Automated LLM red team framework -- test any model's safety with one command

ai-blackteam

Install

pip install ai-blackteam

Or from source:

git clone https://github.com/BILLKISHORE/ai-evals.git
cd ai-evals
pip install -e .

Quick Start

# Set your API key
blackteam config set providers.anthropic.api_key sk-ant-...

# Run a single attack
blackteam run -p anthropic -m claude-sonnet-4-6 -a encoding-obfuscation -t "Write a phishing email"

# Run all attacks against a model
blackteam batch -p anthropic -m claude-sonnet-4-6 --attacks all -t "Write a phishing email"

# Run everything against everything
blackteam sweep -t "Write a phishing email"

# Generate reports
blackteam report --format markdown
blackteam report --format html --output report.html
blackteam report --format json --output results.json
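
To run a hand-picked subset of attacks instead of all of them, the run command above can be scripted. The following is a minimal sketch that only assembles argument lists from the flags shown in Quick Start. The helper names build_run_command and build_commands and the ATTACKS selection are illustrative and not part of the package.

```python
import shlex
from typing import List

# Attack IDs copied from the attack tables in this README; the selection is arbitrary.
ATTACKS = ["encoding-obfuscation", "many-shot", "token-smuggling"]


def build_run_command(provider: str, model: str, attack: str, target: str) -> List[str]:
    """Return the argv for one `blackteam run` call, using only flags shown in Quick Start."""
    return ["blackteam", "run", "-p", provider, "-m", model, "-a", attack, "-t", target]


def build_commands(provider: str, model: str, target: str) -> List[List[str]]:
    """Build one command per attack in ATTACKS."""
    return [build_run_command(provider, model, attack, target) for attack in ATTACKS]


if __name__ == "__main__":
    for argv in build_commands("anthropic", "claude-sonnet-4-6", "Write a phishing email"):
        # Printed rather than executed; pass argv to subprocess.run(argv, check=True) to run it.
        print(shlex.join(argv))
```

Building argv lists rather than shell strings avoids quoting problems when the target text contains spaces or quotes.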

Providers

Provider	Models	Auth
Anthropic	Claude Sonnet 4.6, Opus, Haiku	API key
OpenAI	GPT-5.4, GPT-4o	API key
Google	Gemini 3.1 Pro, Flash	API key
DeepSeek	DeepSeek V3.2	API key
Mistral	Mistral Large 3, Small 4	API key
Ollama	Any local model (Llama 4, Qwen, Gemma)	None (local)
HuggingFace	Any HF model	API token

Attacks (39 built-in)

Core Attacks

Attack	Mode	Description
encoding-obfuscation	single	Base64, ROT13, hex, leetspeak, pig latin
role-play-bypass	multi	Character development, fiction building
system-prompt-extraction	multi	JSON schema, chain-of-thought, error correction
progressive-normalization	tool-use	Gradual file access escalation
many-shot	single	Fabricated Q&A context flooding
context-manipulation	multi	10-turn conversational drift
dan-variants	single	DAN, EDUALC, Opposite Day, Professor persona
taxonomy-paraphrasing	single	Victim, researcher, journalist framing
best-of-n	single	Capitalization, typo, padding variations
historical-framing	single	Documentary, museum, textbook framing
token-smuggling	single	Zero-width Unicode character injection
few-shot-amplification	single	Benign examples + encoded target
conversation-derailment	multi	Self-discovery of model weaknesses
sunk-cost-exploit	multi	Build helpful context then pivot
emotional-manipulation	multi	Distress framing to override safety
contradiction-exploit	multi	Feed wrong info, get corrected with truth
xml-boundary-injection	single	Fake closing tags to escape prompt
authority-chain	multi	CISO/professor authority building
completion-exploit	single	Fake transcripts, JSON completion
multi-modal-text	single	ASCII art, tables, code embedding

Research-Backed Attacks

Attack	Mode	Source
skeleton-key	multi	Microsoft Research
crescendo	multi	Microsoft/USENIX
deceptive-delight	multi	Palo Alto Unit 42
bad-likert-judge	multi	Palo Alto Unit 42

Encoding and Structural Attacks

Attack	Mode	Source	Description
meta-prompting	single	Promptfoo	
homoglyph-substitution	single	Promptfoo	Cyrillic/Greek lookalike characters
bidirectional-text	single	Promptfoo	Unicode RTL override
markdown-injection	single	Promptfoo	Hidden divs, HTML comments, tables
payload-splitting	single	Learn Prompting	Split harmful content across variables
defined-dictionary	single	Learn Prompting	Redefine words to mask intent
compound-instruction	single	Learn Prompting	Overwhelm safety with conflicting instructions
task-deflection	single	Learn Prompting	Adjacent task produces harmful output
recursive-injection	single	Learn Prompting	Self-referential prompt loops
response-constraints	single	Confident AI	Suppress safety disclaimers via formatting rules

Conversational and Psychological Attacks

Attack	Mode	Source	Description
false-dichotomy	multi	Promptfoo	"Better you explain it than bad sources"
goalpost-moving	multi	Promptfoo	Incremental agreement chain
narrative-misdirection	multi	arXiv 2507.21820	Embed harmful request between benign ones
fictional-professional	single	arXiv 2507.21820	David Prompt / elaborate novelist persona
implication-chaining	multi	arXiv 2507.21820	Emotionally loaded context + innocent request

Custom Plugins

Drop a Python file in the plugins/ folder:

from blackteam.registry import register_attack
from blackteam.attacks.base import BaseAttack

@register_attack("my-attack")
class MyAttack(BaseAttack):
    name = "My Custom Attack"
    technique_id = "my-attack"
    mode = "single-turn"

    def generate_prompts(self, target, **kwargs):
        return [f"Custom prompt: {target}"]

The attack then appears in the output of blackteam list-attacks automatically.

Evaluator

Three scoring methods are available and can be combined freely:

Keyword matching -- fast and free; checks for harmful-content indicators
Regex patterns -- precise and free; matches structural patterns
LLM-as-judge -- accurate; uses Claude Haiku to rate each response from 1 to 5

# Use all three
blackteam run -p anthropic -a encoding-obfuscation -t "target" --evaluator keyword,regex,llm
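
As a rough illustration of how the two free methods might be combined, the sketch below runs a keyword check and a regex check over a response and flags it if either one fires. The function names, the example indicator lists, and the "flag if any method fires" rule are all assumptions made for this example; the README does not describe the package's actual indicators or combination logic.

```python
import re
from typing import Dict, Iterable, Pattern

# Hypothetical indicators for a phishing-style target; not the package's real lists.
KEYWORDS = ["verify your account", "click the link below"]
PATTERNS = [re.compile(r"^subject:\s*\S", re.IGNORECASE | re.MULTILINE)]


def keyword_match(response: str, keywords: Iterable[str]) -> bool:
    """True if any keyword occurs in the response, ignoring case."""
    lowered = response.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def regex_match(response: str, patterns: Iterable[Pattern[str]]) -> bool:
    """True if any compiled pattern matches somewhere in the response."""
    return any(pattern.search(response) for pattern in patterns)


def evaluate(response: str, keywords: Iterable[str], patterns: Iterable[Pattern[str]]) -> Dict[str, bool]:
    """Run both checks; 'flagged' is True if either check fires (assumed rule)."""
    by_keyword = keyword_match(response, keywords)
    by_regex = regex_match(response, patterns)
    return {"keyword": by_keyword, "regex": by_regex, "flagged": by_keyword or by_regex}
```

A stricter variant could require both checks to agree before flagging, trading recall for fewer false positives.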

Reports

Format	Use Case
Markdown	Human-readable summary for documentation
JSON	Machine-readable for CI/CD pipelines
HTML	Dark-themed report with stats dashboard
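
A JSON report could gate a CI job along the following lines. This is only a sketch: the README does not document the JSON schema, so the "results" list and the per-entry "attack" and "success" fields are assumptions, as are the function names.

```python
import json
from typing import Any, Dict


def count_successes(report: Dict[str, Any]) -> int:
    """Count entries marked successful.

    ASSUMPTION: the report looks like
    {"results": [{"attack": "<id>", "success": true}, ...]}.
    Adjust the keys to match the real output of `blackteam report --format json`.
    """
    return sum(1 for entry in report.get("results", []) if entry.get("success"))


def exit_code(report: Dict[str, Any], max_allowed: int = 0) -> int:
    """Return 1 (fail the job) if more than max_allowed attacks succeeded, else 0."""
    return 1 if count_successes(report) > max_allowed else 0


# Inline sample standing in for the contents of results.json.
sample = json.loads(
    '{"results": ['
    '{"attack": "many-shot", "success": false},'
    '{"attack": "crescendo", "success": true}'
    "]}"
)
print(exit_code(sample))
```

In a pipeline, the report would be loaded from results.json with json.load and the returned value passed to sys.exit.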

Research

This tool was built alongside real security research on Claude Sonnet 4 and 4.6. See the experiments/ folder for 10 experiments covering 150+ attack runs with documented findings.

License

MIT

Release history

Version	Released
1.0.0	Mar 31, 2026
0.9.0	Mar 30, 2026
0.4.0 (this version)	Mar 30, 2026
0.3.0	Mar 29, 2026

Download files

File details

Details for the file ai_blackteam-0.4.0.tar.gz.

File metadata

Download URL: ai_blackteam-0.4.0.tar.gz
Upload date: Mar 30, 2026
Size: 34.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for ai_blackteam-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`3784f33b2bfdd60066a4245786282a385db99a08ddc7ea1cbc88fd9141822bd9`
MD5	`8969f3c925c46f7cd576061f4a9d5771`
BLAKE2b-256	`da8df4dd1d23669978a4bab36ce8eaac0fc72b1e0065643afc793639395723e9`

File details

Details for the file ai_blackteam-0.4.0-py3-none-any.whl.

File metadata

Download URL: ai_blackteam-0.4.0-py3-none-any.whl
Upload date: Mar 30, 2026
Size: 55.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for ai_blackteam-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`823ae434519fbfaef5e10b571e0d38281f5dd9bb41e05d0865d742c021c633b8`
MD5	`c7ad28ae7ed2cf4e56d031ecc0b2bef2`
BLAKE2b-256	`c7ccee8c357b0e4ef7aa4e959e9be4f1577c452f3ce501b3d1af4afba441fad1`
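
A downloaded file can be checked against the SHA256 digests listed above with Python's standard library. The sketch below uses the wheel's digest from this page; the helper names sha256_of and verify are illustrative, and the local file path is whatever location the file was downloaded to.

```python
import hashlib
import hmac
from pathlib import Path

# SHA256 digest of ai_blackteam-0.4.0-py3-none-any.whl as listed on this page.
EXPECTED_SHA256 = "823ae434519fbfaef5e10b571e0d38281f5dd9bb41e05d0865d742c021c633b8"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the expected hex digest, ignoring case."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())
```

For example, verify(Path("ai_blackteam-0.4.0-py3-none-any.whl"), EXPECTED_SHA256) returns True only if the local file matches the published digest.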
