Safety regression comparison for AI systems.

SafetyDiff 🛡️

The Git-Diff for LLM Safety Posture

SafetyDiff is an open-source CI/CD and analytics engine for Large Language Models. It addresses the "Black Box Versioning" problem: when you upgrade a model from version 1 to version 2 (or switch from a Qwen model to an OpenAI model), is the new model actually safer, or does it just have different vulnerabilities?

Instead of relying on a single benchmark score, SafetyDiff reads evaluation databases and produces a direct, side-by-side diff of how two models respond to exactly the same adversarial attacks.

Why SafetyDiff?

Current AI security benchmarks output static numbers (e.g., "Model A scored 82%"). SafetyDiff treats LLM safety like software engineering:

  • Regression Tracking: See exactly which vulnerabilities were fixed and which new ones were introduced.
  • Cross-Model Transferability: Take an attack that broke Llama-3 and diff it against Qwen2.5 to find weaknesses the two models share.
  • Granular Taxonomy: Break safety down by Intent (e.g., role_hijack, data_exfiltration, tool_abuse).
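
The regression-tracking idea above can be illustrated with a small sketch. This is not SafetyDiff's own code: the function name `diff_outcomes`, the `SafetyDiffResult` type, and the attack IDs are all hypothetical, and each model's results are assumed to be a mapping from attack ID to whether the attack succeeded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyDiffResult:
    # Hypothetical result type, used only for this illustration.
    fixed: frozenset      # attacks that broke the baseline but not the candidate
    regressed: frozenset  # attacks that break the candidate but not the baseline
    shared: frozenset     # attacks that break both models


def diff_outcomes(baseline: dict, candidate: dict) -> SafetyDiffResult:
    """Compare per-attack outcomes; a value of True means the attack succeeded."""
    # Only attacks that were run against both models can be compared.
    common = baseline.keys() & candidate.keys()
    base_broken = {a for a in common if baseline[a]}
    cand_broken = {a for a in common if candidate[a]}
    return SafetyDiffResult(
        fixed=frozenset(base_broken - cand_broken),
        regressed=frozenset(cand_broken - base_broken),
        shared=frozenset(base_broken & cand_broken),
    )


# Made-up outcomes for two versions of a model.
v1 = {"atk-001": True, "atk-002": False, "atk-003": True}
v2 = {"atk-001": False, "atk-002": True, "atk-003": True}

result = diff_outcomes(v1, v2)
print("fixed:", sorted(result.fixed))
print("regressed:", sorted(result.regressed))
print("shared:", sorted(result.shared))
```

Two models with the same aggregate score can still differ attack by attack. In this toy data both versions fail two of three attacks, yet one vulnerability was fixed and another was introduced.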

Installation

git clone https://github.com/m4vic/SafetyDiff.git
cd SafetyDiff
pip install -r requirements.txt

Quick Start (Demo)

SafetyDiff ships with demo_safety_history.db, a demo database containing thousands of pre-computed red-team evaluations across qwen2.5-coder:3b, qwen3.5:4b, and gpt-4o-mini. You can run comparisons out of the box without generating your own data!

Compare two models:

python safetydiff.py --compare gpt-4o-mini qwen2.5-coder:3b

Filter by a specific vulnerability category:

python safetydiff.py --compare gpt-4o-mini qwen2.5-coder:3b --intent role_hijack
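
To make the idea of a per-intent comparison concrete, here is a minimal sketch of how attack success rates could be grouped by intent and compared between two models. It does not describe what the `--intent` flag does internally; the record layout (`model`, `intent`, `attack_succeeded`) and the function `success_rate_by_intent` are assumptions made for illustration, and the data is invented.

```python
from collections import defaultdict


def success_rate_by_intent(records, model):
    """Return {intent: fraction of attacks that succeeded} for one model."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        if rec["model"] != model:
            continue
        totals[rec["intent"]] += 1
        hits[rec["intent"]] += int(rec["attack_succeeded"])
    return {intent: hits[intent] / totals[intent] for intent in totals}


# Invented evaluation records; the field names are hypothetical.
records = [
    {"model": "gpt-4o-mini", "intent": "role_hijack", "attack_succeeded": False},
    {"model": "gpt-4o-mini", "intent": "role_hijack", "attack_succeeded": True},
    {"model": "gpt-4o-mini", "intent": "tool_abuse", "attack_succeeded": False},
    {"model": "qwen2.5-coder:3b", "intent": "role_hijack", "attack_succeeded": True},
    {"model": "qwen2.5-coder:3b", "intent": "role_hijack", "attack_succeeded": True},
    {"model": "qwen2.5-coder:3b", "intent": "tool_abuse", "attack_succeeded": False},
]

rates_a = success_rate_by_intent(records, "gpt-4o-mini")
rates_b = success_rate_by_intent(records, "qwen2.5-coder:3b")

# Positive delta: the second model is more vulnerable for that intent.
delta = {intent: rates_b[intent] - rates_a[intent] for intent in rates_a}
for intent, change in sorted(delta.items()):
    print(f"{intent}: {rates_a[intent]:.0%} -> {rates_b[intent]:.0%} ({change:+.0%})")
```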

Architecture & Data Generation

SafetyDiff is an analytics engine: it does not generate attacks itself, and is designed to consume SQLite databases produced by automated red-teaming pipelines. The bundled demo database was generated with ASRT (Automated Safety Regression Testing), a proprietary zero-human adversarial generation engine that uses TF-IDF routers and Mixture-of-Experts (MoE) LLM-as-a-Judge evaluations.
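
The README does not document the database schema, so the following is only a sketch of how an analytics layer might query such a SQLite file. The table `evaluations` and its columns (`attack_id`, `model`, `intent`, `attack_succeeded`) are hypothetical and may not match demo_safety_history.db. The example builds a small in-memory database and uses a self-join to find attacks whose outcome differs between two models.

```python
import sqlite3

# In-memory stand-in for an evaluation database; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE evaluations (
        attack_id        TEXT NOT NULL,
        model            TEXT NOT NULL,
        intent           TEXT NOT NULL,
        attack_succeeded INTEGER NOT NULL  -- 1 = the model was compromised
    )
    """
)
rows = [
    ("atk-001", "gpt-4o-mini", "role_hijack", 0),
    ("atk-001", "qwen2.5-coder:3b", "role_hijack", 1),
    ("atk-002", "gpt-4o-mini", "tool_abuse", 1),
    ("atk-002", "qwen2.5-coder:3b", "tool_abuse", 1),
    ("atk-003", "gpt-4o-mini", "data_exfiltration", 1),
    ("atk-003", "qwen2.5-coder:3b", "data_exfiltration", 0),
]
conn.executemany("INSERT INTO evaluations VALUES (?, ?, ?, ?)", rows)

# Pair each attack's result on model A with its result on model B and
# keep only the attacks where the two outcomes disagree.
query = """
    SELECT a.attack_id, a.intent, a.attack_succeeded, b.attack_succeeded
    FROM evaluations AS a
    JOIN evaluations AS b ON a.attack_id = b.attack_id
    WHERE a.model = ? AND b.model = ?
      AND a.attack_succeeded != b.attack_succeeded
    ORDER BY a.attack_id
"""
diverging = conn.execute(query, ("gpt-4o-mini", "qwen2.5-coder:3b")).fetchall()

for attack_id, intent, on_a, on_b in diverging:
    print(f"{attack_id} [{intent}]: model A broken={bool(on_a)}, model B broken={bool(on_b)}")

conn.close()
```

Keeping the comparison in SQL means only the diverging rows are loaded into Python, which is useful when a database holds thousands of evaluations.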

Roadmap

  • v1.0 (Current): Direct Prompt Injection & Chat Vulnerability Diffing.
  • v2.0 (In Development): Agentic Trajectory Evaluation & Indirect Prompt Injections (IPI).

Author: Sanskar Jajoo (@m4vic)
