Multi-modal AI safety evaluation framework for red-teaming generative models across text, image, video, and audio
MMSAFE-Bench
Multi-Modal AI Safety Evaluation Framework for red-teaming and benchmarking generative AI models across text, image, video, and audio from a single CLI.
Prompt Datasets (JSONL) → Attack Strategies → Model Providers → Safety Judges → Metrics → Reports
Why MMSAFE-Bench?
Existing safety benchmarks are fragmented: MM-SafetyBench covers image-text, USB covers text only, and Video-SafetyBench handles video in isolation. None works as a production CLI tool that supports both proprietary and open-source models.
MMSAFE-Bench unifies safety evaluation across all four generative modalities with:
- 20 hazard categories — MLCommons AILuminate S1-S12 + 8 multi-modal extensions (deepfakes, voice impersonation, cross-modal bypass, etc.)
- 9 attack strategies — jailbreaks, encoding tricks, role-play, multi-turn escalation, adversarial suffixes, cross-modal injection, low-resource translation
- 8 model providers — OpenAI, Anthropic, Google, Replicate, ElevenLabs, local vLLM, local Ollama, deterministic stub
- 6 safety judges — keyword, LLM-as-judge, toxicity, NSFW classifier, composite ensemble, human evaluation export
- Edge simulation — test safety degradation on constrained hardware (DGX Spark, Jetson, Raspberry Pi, V100)
- Interactive reports — HTML dashboards with Plotly charts, Markdown tables, JSON exports, model leaderboards
Quick Start
# Install with all dependencies
uv sync --extra dev --extra viz --extra providers
# Browse the safety taxonomy
mmsafe taxonomy
# Validate a dataset
mmsafe validate --dataset datasets/text/mlcommons_hazards.jsonl
# Dry-run an evaluation
mmsafe run --config mmsafe/config/defaults/text_eval.yaml --dry-run
# List available providers and attack strategies
mmsafe providers
mmsafe attacks
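For readers who want a feel for what dataset validation involves, here is a minimal sketch of a JSONL check in Python. It is not the `mmsafe validate` implementation. The required field names (`prompt`, `category`) are assumptions made for illustration, and the real schema may differ.

```python
import json

# Hypothetical required fields; the actual MMSAFE-Bench schema may differ.
REQUIRED_FIELDS = ("prompt", "category")


def validate_jsonl(lines):
    """Return a list of (line_number, message) problems; empty means valid."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # tolerate blank lines
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "record is not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append((lineno, f"missing field '{field}'"))
    return problems


sample = [
    '{"prompt": "hello", "category": "S1"}',
    '{"prompt": "no category here"}',
    "not json",
]
for lineno, message in validate_jsonl(sample):
    print(f"line {lineno}: {message}")
```

Reporting every problem with its line number, rather than stopping at the first error, makes it easier to fix a large dataset in one pass.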
Architecture
mmsafe/
├── config/ # Pydantic config models + YAML defaults
├── taxonomy/ # MLCommons S1-S12 + custom X1-X8 hazard categories
├── datasets/ # JSONL prompt dataset loader + validation
├── attacks/ # 9 red-teaming strategies (passthrough → composite)
├── providers/ # 8 model provider adapters (cloud + local)
├── judges/ # 6 safety evaluation judges
├── pipeline/ # Async evaluation runner + checkpointing
├── metrics/ # ASR, RR, FRR, NSFW rate + bootstrap CI
├── reporting/ # HTML/JSON/Markdown reports + Plotly charts + leaderboard
├── edge/ # Edge deployment simulation (5 device profiles)
└── cli.py # Click CLI with 8 commands
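The way these packages fit together (attack, then provider, then judge, driven by an async runner) can be sketched roughly as follows. This is an illustrative outline only: every function name here is hypothetical and none of it is taken from the MMSAFE-Bench source.

```python
import asyncio


def passthrough_attack(prompt):
    """Hypothetical attack strategy that leaves the prompt unchanged."""
    return prompt


async def stub_provider(prompt):
    """Hypothetical deterministic stand-in for a model call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"I cannot help with that: {prompt}"


def keyword_judge(response, refusal_markers=("i cannot", "i can't")):
    """Hypothetical keyword judge: flags responses that look like refusals."""
    text = response.lower()
    return {"refused": any(marker in text for marker in refusal_markers)}


async def evaluate(prompts, attack, provider, judge, concurrency=4):
    """Run attack -> provider -> judge for each prompt with bounded concurrency."""
    semaphore = asyncio.Semaphore(concurrency)

    async def evaluate_one(prompt):
        async with semaphore:
            response = await provider(attack(prompt))
        return {"prompt": prompt, "response": response, **judge(response)}

    # gather preserves the order of the input prompts
    return await asyncio.gather(*(evaluate_one(p) for p in prompts))


results = asyncio.run(
    evaluate(["a", "b"], passthrough_attack, stub_provider, keyword_judge)
)
for row in results:
    print(row["prompt"], row["refused"])
```

Keeping each stage a plain callable means a different attack, provider, or judge can be swapped in without touching the runner.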
CLI Commands
| Command | Description |
|---|---|
| mmsafe run | Run a safety evaluation from YAML config |
| mmsafe report | Generate HTML/JSON/Markdown report from results |
| mmsafe compare | Compare multiple evaluation runs side-by-side |
| mmsafe leaderboard | Generate model safety leaderboard |
| mmsafe providers | List available model providers |
| mmsafe attacks | List attack strategies |
| mmsafe taxonomy | Display the safety hazard taxonomy |
| mmsafe validate | Validate a JSONL prompt dataset |
GPU-Aware Execution Profiles
The runner supports execution.profile with safe fallback behavior:
- auto: detect available GPUs and fall back to small_gpu when A100 is unavailable
- small_gpu: conservative concurrency for smaller GPUs
- a100: A100-optimized concurrency
mmsafe run --config mmsafe/config/defaults/full_eval.yaml --execution-profile auto
mmsafe run --config mmsafe/config/defaults/full_eval.yaml --no-auto-tune
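The fallback rule for the auto profile can be illustrated with a small Python sketch. It assumes the caller already has a list of detected GPU names. How MMSAFE-Bench actually detects GPUs is not described here, and `resolve_profile` is a hypothetical helper, not part of the package.

```python
VALID_PROFILES = {"auto", "small_gpu", "a100"}


def resolve_profile(requested, gpu_names):
    """Map a requested execution profile to a concrete one.

    `gpu_names` is a list of device name strings supplied by the caller;
    detection itself is outside the scope of this sketch.
    """
    if requested not in VALID_PROFILES:
        raise ValueError(f"unknown execution profile: {requested!r}")
    if requested != "auto":
        return requested  # explicit profiles are used as given
    has_a100 = any("A100" in name.upper() for name in gpu_names)
    return "a100" if has_a100 else "small_gpu"


print(resolve_profile("auto", ["NVIDIA A100-SXM4-80GB"]))
print(resolve_profile("auto", ["Tesla V100-PCIE-16GB"]))
print(resolve_profile("auto", []))
```

In this sketch a machine with no GPUs at all also resolves to small_gpu, which matches the conservative default described above.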
Configuration
Evaluations are configured via YAML files. See mmsafe/config/defaults/ for examples:
- text_eval.yaml: Text-only evaluation with GPT-4o
- image_eval.yaml: Image generation safety (DALL-E, FLUX)
- video_eval.yaml: Video generation (Kling, Veo)
- audio_eval.yaml: Audio/speech safety (ElevenLabs, Bark)
- full_eval.yaml: All modalities combined
- edge_simulation.yaml: Edge deployment constraints
Metrics
| Metric | Description |
|---|---|
| ASR | Attack Success Rate — fraction of attacks producing unsafe content |
| RR | Refusal Rate — fraction of prompts refused by the model |
| FRR | False Refusal Rate — fraction of benign prompts incorrectly refused |
| NSFW Rate | Fraction of image/video outputs flagged as NSFW |
All metrics are reported with 95% bootstrap confidence intervals and broken down by category, attack, modality, and model.
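As a rough illustration of how these rates and their intervals can be computed, the sketch below uses a percentile bootstrap over per-sample outcomes. The record layout (`benign`, `refused`, `unsafe`) and the function names are assumptions for this example. The package's own metric code may use a different bootstrap variant or data model.

```python
import random


def rate(flags):
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def bootstrap_ci(flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a rate."""
    if not flags:
        return (0.0, 0.0)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(flags)
    stats = sorted(
        rate([flags[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return (lower, upper)


def safety_metrics(records):
    """Compute ASR, RR, and FRR from hypothetical per-sample records."""
    attacks = [r for r in records if not r["benign"]]
    benign = [r for r in records if r["benign"]]
    return {
        "ASR": rate([r["unsafe"] for r in attacks]),   # unsafe / attack prompts
        "RR": rate([r["refused"] for r in records]),   # refused / all prompts
        "FRR": rate([r["refused"] for r in benign]),   # refused / benign prompts
    }


records = (
    [{"benign": False, "refused": False, "unsafe": True}]
    + [{"benign": False, "refused": True, "unsafe": False}] * 2
    + [{"benign": False, "refused": False, "unsafe": False}]
    + [{"benign": True, "refused": True, "unsafe": False}]
    + [{"benign": True, "refused": False, "unsafe": False}] * 3
)
metrics = safety_metrics(records)
asr_ci = bootstrap_ci([r["unsafe"] for r in records if not r["benign"]])
print(metrics, asr_ci)
```

With only a handful of samples the interval is very wide, which is why per-category breakdowns need enough prompts in each category to be meaningful.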
Local Provider Setup
# vLLM backend
export VLLM_BASE_URL="http://localhost:8000"
# Ollama backend
export OLLAMA_BASE_URL="http://localhost:11434"
Cloud providers are optional. By default, providers that cannot be initialized are skipped. Set execution.strict_provider_init to true to treat an unavailable provider as an error instead.
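The skip-versus-strict behavior can be sketched as follows. This example assumes a provider counts as available when its environment variable is set. The provider identifiers (`vllm`, `ollama`, `stub`) and the `select_providers` helper are hypothetical, and the real availability check may be different.

```python
# Environment variables named in this README; provider keys are hypothetical.
REQUIRED_ENV = {
    "vllm": "VLLM_BASE_URL",
    "ollama": "OLLAMA_BASE_URL",
}


def select_providers(requested, env, strict=False):
    """Return the usable providers, skipping or rejecting unavailable ones."""
    available, missing = [], []
    for name in requested:
        variable = REQUIRED_ENV.get(name)
        # Providers with no required variable (e.g. a stub) are always usable.
        if variable is None or env.get(variable):
            available.append(name)
        else:
            missing.append(name)
    if strict and missing:
        raise RuntimeError(f"providers unavailable: {', '.join(missing)}")
    return available


env = {"VLLM_BASE_URL": "http://localhost:8000"}
print(select_providers(["vllm", "ollama", "stub"], env))
```

Passing the environment in as a plain mapping, instead of reading `os.environ` inside the function, keeps the logic easy to test.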
A100 Automation
For production-style orchestration (LowResource priority, MMSAFE auto handoff, Telegram alerts, systemd services), see docs/A100_AUTOMATION_RUNBOOK.md.
Development
make install # Install with dev deps
make test # Run tests (80% coverage gate)
make lint # Ruff + mypy
make fmt # Auto-format
make eval-smoke # Smoke test with stub provider
make clean # Remove build artifacts
Docker
docker build -t mmsafe .
docker run --rm mmsafe --help
docker run --rm mmsafe taxonomy
License
Apache-2.0