MMSAFE-Bench

OpenSSF Best Practices

CI Python 3.12 License: Apache 2.0 Code style: ruff

Multi-Modal AI Safety Evaluation Framework for red-teaming and benchmarking generative AI models across text, image, video, and audio from a single CLI.

Prompt Datasets (JSONL) → Attack Strategies → Model Providers → Safety Judges → Metrics → Reports
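
The stage order above can be sketched in a few lines of Python. This is only an illustration of the data flow, not the project's actual runner: `Record`, `run_pipeline`, `unsafe_rate`, and the callable type aliases are hypothetical names introduced here, and the stand-in provider and judge are deterministic so the sketch runs without any API key.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# All names below are hypothetical; they illustrate the stage order only.
Attack = Callable[[str], str]    # rewrites a prompt
Provider = Callable[[str], str]  # returns a model response
Judge = Callable[[str], bool]    # True when the response is judged unsafe


@dataclass
class Record:
    prompt: str
    attacked_prompt: str
    response: str
    unsafe: bool


def run_pipeline(prompts: Iterable[str], attack: Attack,
                 provider: Provider, judge: Judge) -> list[Record]:
    """Push every prompt through attack -> provider -> judge."""
    records = []
    for prompt in prompts:
        attacked = attack(prompt)
        response = provider(attacked)
        records.append(Record(prompt, attacked, response, judge(response)))
    return records


def unsafe_rate(records: list[Record]) -> float:
    """Fraction of records judged unsafe (0.0 for an empty run)."""
    return sum(r.unsafe for r in records) / len(records) if records else 0.0


if __name__ == "__main__":
    # Deterministic stand-ins for the attack, provider, and judge stages.
    passthrough = lambda p: p
    stub_provider = lambda p: "I can't help with that."
    refusal_judge = lambda r: "can't help" not in r.lower()
    results = run_pipeline(["example prompt"], passthrough,
                           stub_provider, refusal_judge)
    print(unsafe_rate(results))
```

Keeping each stage a plain callable makes it easy to swap one attack, provider, or judge for another without touching the loop; the real framework may structure this differently.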

Why MMSAFE-Bench?

Existing safety benchmarks are fragmented: MM-SafetyBench covers image-text, USB covers text only, and Video-SafetyBench handles video in isolation. None of them works as a production CLI tool that supports both proprietary and open-source models.

MMSAFE-Bench unifies safety evaluation across all four generative modalities with:

  • 20 hazard categories — MLCommons AILuminate S1-S12 + 8 multi-modal extensions (deepfakes, voice impersonation, cross-modal bypass, etc.)
  • 9 attack strategies — passthrough baseline, jailbreaks, encoding tricks, role-play, multi-turn escalation, adversarial suffixes, cross-modal injection, low-resource translation, composite
  • 8 model providers — OpenAI, Anthropic, Google, Replicate, ElevenLabs, local vLLM, local Ollama, deterministic stub
  • 6 safety judges — keyword, LLM-as-judge, toxicity, NSFW classifier, composite ensemble, human evaluation export
  • Edge simulation — test safety degradation on constrained hardware (DGX Spark, Jetson, Raspberry Pi, V100)
  • Interactive reports — HTML dashboards with Plotly charts, Markdown tables, JSON exports, model leaderboards
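
To make the "composite ensemble" judge idea concrete, here is a minimal sketch that combines several judges by vote counting. It is an assumption about how such an ensemble could work, not the project's implementation: `keyword_judge`, `composite_judge`, and the `min_votes` parameter are hypothetical names.

```python
from typing import Callable

Judge = Callable[[str], bool]  # hypothetical convention: True means "unsafe"


def keyword_judge(blocked_terms: list[str]) -> Judge:
    """Flag a response that contains any blocked term (case-insensitive)."""
    lowered = [term.lower() for term in blocked_terms]
    return lambda response: any(term in response.lower() for term in lowered)


def composite_judge(judges: list[Judge], min_votes: int = 1) -> Judge:
    """Flag a response when at least min_votes member judges flag it."""
    if not 1 <= min_votes <= len(judges):
        raise ValueError("min_votes must be between 1 and len(judges)")
    return lambda response: sum(j(response) for j in judges) >= min_votes


if __name__ == "__main__":
    ensemble = composite_judge(
        [keyword_judge(["forbidden"]), keyword_judge(["blocked", "forbidden"])],
        min_votes=2,
    )
    print(ensemble("This text mentions a forbidden term."))
```

With `min_votes=1` the ensemble behaves like a logical OR (high recall); raising it trades recall for fewer false positives.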

Quick Start

# Install with all dependencies
uv sync --extra dev --extra viz --extra providers

# Browse the safety taxonomy
mmsafe taxonomy

# Validate a dataset
mmsafe validate --dataset datasets/text/mlcommons_hazards.jsonl

# Dry-run an evaluation
mmsafe run --config mmsafe/config/defaults/text_eval.yaml --dry-run

# List available providers and attack strategies
mmsafe providers
mmsafe attacks
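
For readers unfamiliar with JSONL, the kind of check that dataset validation implies can be sketched as follows. The field names in `REQUIRED_FIELDS` are hypothetical (the actual MMSAFE-Bench schema is not documented in this section), and `validate_jsonl_lines` is an illustrative helper, not the code behind `mmsafe validate`.

```python
import json
from typing import Iterable

# Hypothetical schema: the real dataset fields are not documented here.
REQUIRED_FIELDS = ("id", "prompt", "category")


def validate_jsonl_lines(lines: Iterable[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            problems.append(f"line {number}: missing {', '.join(missing)}")
    return problems


if __name__ == "__main__":
    sample = ['{"id": "1", "prompt": "hello", "category": "S1"}', '{"id": "2"}']
    for problem in validate_jsonl_lines(sample):
        print(problem)
```

Collecting all problems instead of stopping at the first one gives a full report for a large dataset in a single pass.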

Architecture

mmsafe/
├── config/          # Pydantic config models + YAML defaults
├── taxonomy/        # MLCommons S1-S12 + custom X1-X8 hazard categories
├── datasets/        # JSONL prompt dataset loader + validation
├── attacks/         # 9 red-teaming strategies (passthrough → composite)
├── providers/       # 8 model provider adapters (cloud + local)
├── judges/          # 6 safety evaluation judges
├── pipeline/        # Async evaluation runner + checkpointing
├── metrics/         # ASR, RR, FRR, NSFW rate + bootstrap CI
├── reporting/       # HTML/JSON/Markdown reports + Plotly charts + leaderboard
├── edge/            # Edge deployment simulation (5 device profiles)
└── cli.py           # Click CLI with 8 commands

CLI Commands

| Command | Description |
|---|---|
| mmsafe run | Run a safety evaluation from YAML config |
| mmsafe report | Generate HTML/JSON/Markdown report from results |
| mmsafe compare | Compare multiple evaluation runs side-by-side |
| mmsafe leaderboard | Generate model safety leaderboard |
| mmsafe providers | List available model providers |
| mmsafe attacks | List attack strategies |
| mmsafe taxonomy | Display the safety hazard taxonomy |
| mmsafe validate | Validate a JSONL prompt dataset |

GPU-Aware Execution Profiles

The runner supports execution.profile with safe fallback behavior:

  • auto: detect available GPUs and fall back to small_gpu when A100 is unavailable
  • small_gpu: conservative concurrency for smaller GPUs
  • a100: A100-optimized concurrency

mmsafe run --config mmsafe/config/defaults/full_eval.yaml --execution-profile auto
mmsafe run --config mmsafe/config/defaults/full_eval.yaml --no-auto-tune
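
The fallback rule for the auto profile can be sketched as a small pure function. This is an assumption about the behavior described in the list above, not the runner's actual code: `resolve_profile` is a hypothetical helper, GPU detection is replaced by a plain list of device names, and explicit profile choices are assumed to be returned unchanged.

```python
KNOWN_PROFILES = {"auto", "small_gpu", "a100"}


def resolve_profile(requested: str, gpu_names: list[str]) -> str:
    """Resolve an execution profile name.

    Hypothetical helper: 'auto' picks 'a100' only when an A100 is detected
    and otherwise falls back to 'small_gpu'. Explicit choices are assumed
    to be returned unchanged.
    """
    if requested not in KNOWN_PROFILES:
        raise ValueError(f"unknown execution profile: {requested!r}")
    if requested != "auto":
        return requested
    has_a100 = any("a100" in name.lower() for name in gpu_names)
    return "a100" if has_a100 else "small_gpu"


if __name__ == "__main__":
    print(resolve_profile("auto", ["NVIDIA A100-SXM4-80GB"]))
    print(resolve_profile("auto", ["Tesla V100-PCIE-16GB"]))
```

Passing the detected device names in as an argument keeps the decision logic testable on machines without a GPU.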

Configuration

Evaluations are configured via YAML files. See mmsafe/config/defaults/ for examples:

  • text_eval.yaml — Text-only evaluation with GPT-4o
  • image_eval.yaml — Image generation safety (DALL-E, FLUX)
  • video_eval.yaml — Video generation (Kling, Veo)
  • audio_eval.yaml — Audio/speech safety (ElevenLabs, Bark)
  • full_eval.yaml — All modalities combined
  • edge_simulation.yaml — Edge deployment constraints

Metrics

| Metric | Description |
|---|---|
| ASR | Attack Success Rate — fraction of attacks producing unsafe content |
| RR | Refusal Rate — fraction of prompts refused by the model |
| FRR | False Refusal Rate — fraction of benign prompts incorrectly refused |
| NSFW Rate | Fraction of image/video outputs flagged as NSFW |

All metrics are reported with 95% bootstrap confidence intervals and are broken down by category, attack, modality, and model.
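
As a rough illustration of how a rate and its bootstrap interval can be computed, the sketch below uses a percentile bootstrap over per-prompt boolean flags. It is written under assumptions: the function names, the resample count, and the percentile method are choices made here, and the project's own estimator may differ.

```python
import random


def rate(flags: list[bool]) -> float:
    """Fraction of True flags (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0


def false_refusal_rate(refused: list[bool], benign: list[bool]) -> float:
    """Refusal rate restricted to prompts labelled benign."""
    return rate([r for r, b in zip(refused, benign) if b])


def bootstrap_ci(flags: list[bool], n_resamples: int = 2000,
                 confidence: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for a rate (seeded for repeatability)."""
    if not flags:
        return (0.0, 0.0)
    rng = random.Random(seed)
    n = len(flags)
    # Resample with replacement and recompute the rate each time.
    estimates = sorted(
        sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = estimates[int(alpha * (n_resamples - 1))]
    upper = estimates[int((1.0 - alpha) * (n_resamples - 1))]
    return (lower, upper)


if __name__ == "__main__":
    # True marks an attack whose output was judged unsafe (illustrative data).
    unsafe = [True, False, False, True, False, False, False, False]
    low, high = bootstrap_ci(unsafe)
    print(f"ASR = {rate(unsafe):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

ASR and RR are both plain rates over different flags (unsafe output versus refusal), while FRR is the refusal rate computed only over benign prompts. Per-category or per-model breakdowns amount to applying the same functions to each subgroup.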

Local Provider Setup

# vLLM backend
export VLLM_BASE_URL="http://localhost:8000"

# Ollama backend
export OLLAMA_BASE_URL="http://localhost:11434"

Cloud providers are optional. By default, unavailable providers are skipped; set execution.strict_provider_init: true to fail instead when a provider cannot be initialized.
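
The skip-unless-strict behavior can be sketched as follows. Everything here is hypothetical and only mirrors the description above: `ProviderUnavailable`, `local_base_url`, and `init_providers` are illustrative names, and the `strict` argument stands in for `execution.strict_provider_init`.

```python
import os


class ProviderUnavailable(Exception):
    """Raised by a factory when a provider cannot be initialized."""


def local_base_url(env_var: str, env=None) -> str:
    """Read a base URL such as VLLM_BASE_URL; fail when it is unset."""
    env = os.environ if env is None else env
    url = env.get(env_var, "").strip()
    if not url:
        raise ProviderUnavailable(f"{env_var} is not set")
    return url.rstrip("/")


def init_providers(factories: dict, strict: bool = False):
    """Build providers, skipping unavailable ones unless strict is True."""
    ready, skipped = {}, {}
    for name, factory in factories.items():
        try:
            ready[name] = factory()
        except ProviderUnavailable as exc:
            if strict:
                raise  # strict mode: surface the first failure
            skipped[name] = str(exc)
    return ready, skipped


if __name__ == "__main__":
    factories = {
        "vllm": lambda: local_base_url("VLLM_BASE_URL"),
        "ollama": lambda: local_base_url("OLLAMA_BASE_URL"),
    }
    ready, skipped = init_providers(factories)
    print("ready:", sorted(ready), "skipped:", sorted(skipped))
```

Returning the skipped providers together with the reason makes it possible to log why a provider was left out of a run.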

A100 Automation

For production-style orchestration (LowResource priority, MMSAFE auto handoff, Telegram alerts, systemd services), see:

  • docs/A100_AUTOMATION_RUNBOOK.md

Development

make install       # Install with dev deps
make test          # Run tests (80% coverage gate)
make lint          # Ruff + mypy
make fmt           # Auto-format
make eval-smoke    # Smoke test with stub provider
make clean         # Remove build artifacts

Docker

docker build -t mmsafe .
docker run --rm mmsafe --help
docker run --rm mmsafe taxonomy

License

Apache-2.0
