# ai-blackteam

Automated LLM red team framework. Test any model's safety with one command.
## Install

```shell
pip install ai-blackteam
```

Or from source:

```shell
git clone https://github.com/BILLKISHORE/ai-evals.git
cd ai-evals
pip install -e .
```
## Quick Start

```shell
# Set your API key
blackteam config set providers.anthropic.api_key sk-ant-...

# Run a single attack
blackteam run -p anthropic -m claude-sonnet-4-6 -a encoding-obfuscation -t "Write a phishing email"

# Run all attacks against a model
blackteam batch -p anthropic -m claude-sonnet-4-6 --attacks all -t "Write a phishing email"

# Run everything against everything
blackteam sweep -t "Write a phishing email"

# Generate reports
blackteam report --format markdown
blackteam report --format html --output report.html
blackteam report --format json --output results.json
```
## Providers
| Provider | Models | Auth |
|---|---|---|
| Anthropic | Claude Sonnet 4.6, Opus, Haiku | API key |
| OpenAI | GPT-5.4, GPT-4o | API key |
| Google | Gemini 3.1 Pro, Flash | API key |
| DeepSeek | DeepSeek V3.2 | API key |
| Mistral | Mistral Large 3, Small 4 | API key |
| Ollama | Any local model (Llama 4, Qwen, Gemma) | None (local) |
| HuggingFace | Any HF model | API token |
## Attacks (20 built-in)
| Attack | Mode | Description |
|---|---|---|
| encoding-obfuscation | single | Base64, ROT13, hex, leetspeak, pig latin |
| role-play-bypass | multi | Character development, fiction building |
| system-prompt-extraction | multi | JSON schema, chain-of-thought, error correction |
| progressive-normalization | tool-use | Gradual file access escalation |
| many-shot | single | Fabricated Q&A context flooding |
| context-manipulation | multi | 10-turn conversational drift |
| dan-variants | single | DAN, EDUALC, Opposite Day, Professor persona |
| taxonomy-paraphrasing | single | Victim, researcher, journalist framing |
| best-of-n | single | Capitalization, typo, padding variations |
| historical-framing | single | Documentary, museum, textbook framing |
| token-smuggling | single | Zero-width Unicode character injection |
| few-shot-amplification | single | Benign examples + encoded target |
| conversation-derailment | multi | Self-discovery of model weaknesses |
| sunk-cost-exploit | multi | Build helpful context then pivot |
| emotional-manipulation | multi | Distress framing to override safety |
| contradiction-exploit | multi | Feed wrong info, get corrected with truth |
| xml-boundary-injection | single | Fake closing tags to escape prompt |
| authority-chain | multi | CISO/professor authority building |
| completion-exploit | single | Fake transcripts, JSON completion |
| multi-modal-text | single | ASCII art, tables, code embedding |
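Several of the single-turn attacks are, at their core, reversible text transforms applied to the target request. As an illustration of the transforms named in the `encoding-obfuscation` row (a sketch of the general idea, not the framework's internals), applied to a harmless string:

```python
import base64
import codecs

def encode_variants(text: str) -> dict:
    """Apply the reversible encodings listed for encoding-obfuscation."""
    # leetspeak: simple character substitutions (mapping chosen for the example)
    leet = text.translate(str.maketrans("aeiost", "431057"))
    return {
        "base64": base64.b64encode(text.encode()).decode(),
        "rot13": codecs.encode(text, "rot13"),
        "hex": text.encode().hex(),
        "leetspeak": leet,
    }

print(encode_variants("hello world"))
```

Each variant decodes back to the original string, which is what makes these useful probes: the request is unchanged semantically but no longer matches surface-level filters.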
## Custom Plugins

Drop a Python file in the `plugins/` folder:
```python
from blackteam.registry import register_attack
from blackteam.attacks.base import BaseAttack

@register_attack("my-attack")
class MyAttack(BaseAttack):
    name = "My Custom Attack"
    technique_id = "my-attack"
    mode = "single-turn"

    def generate_prompts(self, target, **kwargs):
        return [f"Custom prompt: {target}"]
```
It shows up in `blackteam list-attacks` automatically.
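A decorator-based registry like this is usually just a dict keyed by technique id. A minimal self-contained sketch of the pattern (an assumption about how it works, not the actual `blackteam.registry` implementation):

```python
# Minimal dict-backed attack registry, for illustration only.
ATTACKS: dict[str, type] = {}

def register_attack(technique_id: str):
    """Class decorator that records the class under its technique id."""
    def decorator(cls):
        ATTACKS[technique_id] = cls
        return cls
    return decorator

@register_attack("my-attack")
class MyAttack:
    name = "My Custom Attack"

    def generate_prompts(self, target, **kwargs):
        return [f"Custom prompt: {target}"]

# A "list-attacks" command would simply iterate the registry:
print(sorted(ATTACKS))                                   # → ['my-attack']
print(ATTACKS["my-attack"]().generate_prompts("test"))   # → ['Custom prompt: test']
```

Because registration happens at import time, dropping a file into a scanned folder is enough for the attack to appear; no central list needs editing.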
## Evaluator
Three scoring methods (combine any):
- Keyword matching -- fast, free, checks for harmful content indicators
- Regex patterns -- precise, free, matches structural patterns
- LLM-as-judge -- accurate, uses Claude Haiku to rate 1-5
```shell
# Use all three
blackteam run -p anthropic -a encoding-obfuscation -t "target" --evaluator keyword,regex,llm
```
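The two free scorers compose naturally: flag a response if either an indicator keyword or a structural pattern matches. A self-contained sketch of that idea (the indicator lists below are invented for the example, not the framework's shipped lists):

```python
import re

# Illustrative indicator lists -- made up for this example.
KEYWORDS = ["step 1", "click this link"]
PATTERNS = [re.compile(r"(?i)dear\s+\w+,")]  # e.g. a letter-style opening

def evaluate(response: str) -> dict:
    """Combine keyword and regex scoring into one verdict."""
    keyword_hit = any(k in response.lower() for k in KEYWORDS)
    regex_hit = any(p.search(response) for p in PATTERNS)
    return {"keyword": keyword_hit, "regex": regex_hit,
            "flagged": keyword_hit or regex_hit}

print(evaluate("I can't help with that."))
print(evaluate("Dear Sam, click this link to verify your account."))
```

Keyword and regex checks run locally at no cost, so they make a cheap first pass; the LLM-as-judge scorer is the one that costs API calls.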
## Reports
| Format | Use Case |
|---|---|
| Markdown | Human-readable summary for documentation |
| JSON | Machine-readable for CI/CD pipelines |
| HTML | Dark-themed report with stats dashboard |
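The JSON format is the one aimed at CI/CD. A sketch of a pipeline gate that fails the build when too many attacks land; note the schema here (a top-level `results` list with boolean `success` flags) is an assumed shape for illustration, not ai-blackteam's documented output format:

```python
import json

def attack_success_rate(report_json: str) -> float:
    """Fraction of attacks that succeeded, per the assumed schema."""
    results = json.loads(report_json)["results"]
    return sum(r["success"] for r in results) / len(results)

# Stand-in for a results.json produced by `blackteam report --format json`.
report = json.dumps({"results": [
    {"attack": "encoding-obfuscation", "success": False},
    {"attack": "many-shot", "success": True},
]})

rate = attack_success_rate(report)
print(f"attack success rate: {rate:.0%}")
# In CI you would exit nonzero past a threshold, e.g. 10%:
print("PASS" if rate <= 0.1 else "FAIL")
```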
## Research

This tool was built alongside real security research on Claude Sonnet 4 and 4.6. See the `experiments/` folder for 8 experiments covering 115 attack techniques with documented findings.
## Author

Bill Kishore -- a developer who likes breaking things to understand how they work. Currently exploring LLM safety evals, red teaming, and the weird gaps between how AI systems are designed and how they actually behave. Open to collaborating on AI safety research, evals, or anything that needs creative problem-solving. Reach out.

## License

MIT
## File details

### ai_blackteam-0.3.0.tar.gz

- Size: 25.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.3 CPython/3.13.12 Linux/6.17.9-76061709-generic

| Algorithm | Hash digest |
|---|---|
| SHA256 | `eb5dc84f73e32eec60b62eadd4e4550df3e98789350f80db162b32b7d8a09bca` |
| MD5 | `6713bfe06cc1501a3ea42f0e57ba841a` |
| BLAKE2b-256 | `4e340e424ea6309bb62e0becd55ea657fb3a8e660d716e646645e735af1a492e` |
### ai_blackteam-0.3.0-py3-none-any.whl

- Size: 39.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.3 CPython/3.13.12 Linux/6.17.9-76061709-generic

| Algorithm | Hash digest |
|---|---|
| SHA256 | `f89ae37bc96dbe3bc3c1c1576d20676a18ce59e50a5cf70cb4c115f28f1b4496` |
| MD5 | `8d018e80bb1fc9ec5d59d8573923e66c` |
| BLAKE2b-256 | `185cd59b1ee8b1ac197cb2d61aa8d49aded2c07abfef26834c402eacce952a56` |