Safety regression comparison for AI systems.
SafetyDiff 🛡️
The Git-Diff for LLM Safety Posture
SafetyDiff is an open-source CI/CD and analytics engine for Large Language Models. It addresses the "Black Box Versioning" problem: when you upgrade a model from version 1 to version 2 (or switch providers, say from a Qwen model to an OpenAI model), is the new model actually safer, or does it just have different vulnerabilities?
Instead of relying on single benchmark scores, SafetyDiff reads evaluation databases and provides a direct, side-by-side mathematical diff of how two models respond to the exact same adversarial attacks.
Why SafetyDiff?
Current AI security benchmarks output static numbers (e.g., "Model A scored 82%"). SafetyDiff treats LLM safety like software engineering:
- Regression Tracking: See exactly which vulnerabilities were fixed, and which new vulnerabilities were introduced.
- Cross-Model Transferability: Take an attack that broke Llama-3 and diff it against Qwen2.5 to see which weaknesses the two models share.
- Granular Taxonomy: Breaks down safety by Intent (e.g., role_hijack, data_exfiltration, tool_abuse).
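The regression-tracking idea above can be sketched in a few lines. This is an illustration only, not SafetyDiff's implementation: the `diff_outcomes` helper, the `SafetyDelta` type, and the attack IDs are all hypothetical, and each model's results are assumed to be available as a mapping from attack ID to whether the attack succeeded.

```python
from typing import Dict, List, NamedTuple


class SafetyDelta(NamedTuple):
    fixed: List[str]        # broke the baseline, blocked by the candidate
    introduced: List[str]   # blocked by the baseline, breaks the candidate
    shared: List[str]       # breaks both models


def diff_outcomes(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> SafetyDelta:
    """Compare per-attack outcomes; True means the attack succeeded."""
    # Only attacks that were run against both models can be compared.
    common = baseline.keys() & candidate.keys()
    fixed = sorted(a for a in common if baseline[a] and not candidate[a])
    introduced = sorted(a for a in common if not baseline[a] and candidate[a])
    shared = sorted(a for a in common if baseline[a] and candidate[a])
    return SafetyDelta(fixed, introduced, shared)


# Hypothetical outcomes, invented for illustration only.
v1 = {"atk-001": True, "atk-002": False, "atk-003": True}
v2 = {"atk-001": False, "atk-002": True, "atk-003": True}

delta = diff_outcomes(v1, v2)
print(delta)
```

Keeping "fixed" and "introduced" as separate lists is the point of a diff: two models can have the same aggregate score while failing on different attacks.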
Installation
git clone https://github.com/m4vic/SafetyDiff.git
cd SafetyDiff
pip install -r requirements.txt
Quick Start (Demo)
SafetyDiff ships with a demo_safety_history.db containing thousands of pre-computed red-team evaluations across qwen2.5-coder:3b, qwen3.5:4b, and gpt-4o-mini. You can run comparisons out of the box without generating your own data!
Compare two models:
python safetydiff.py --compare gpt-4o-mini qwen2.5-coder:3b
Filter by a specific vulnerability category:
python safetydiff.py --compare gpt-4o-mini qwen2.5-coder:3b --intent role_hijack
Architecture & Data Generation
SafetyDiff is an analytics engine: it does not generate attacks itself. Instead, it consumes SQLite databases produced by automated red-teaming pipelines. The demo database was generated using ASRT (Automated Safety Regression Testing), a proprietary zero-human adversarial generation engine that uses TF-IDF routers and Mixture-of-Experts (MoE) LLM-as-a-Judge evaluations.
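To make the "consume a SQLite database" step concrete, here is a minimal sketch using Python's standard `sqlite3` module. The table name `evaluations`, its columns, the model names, and the `success_rate_by_intent` helper are assumptions made up for this example; the actual schema of demo_safety_history.db is not documented here and may differ.

```python
import sqlite3

# Hypothetical schema for illustration; NOT SafetyDiff's actual format.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evaluations ("
    "attack_id TEXT, model TEXT, intent TEXT, attack_succeeded INTEGER)"
)
# Invented rows: each attack is evaluated against both models.
rows = [
    ("atk-001", "model-a", "role_hijack", 1),
    ("atk-001", "model-b", "role_hijack", 0),
    ("atk-002", "model-a", "role_hijack", 0),
    ("atk-002", "model-b", "role_hijack", 0),
    ("atk-003", "model-a", "tool_abuse", 1),
    ("atk-003", "model-b", "tool_abuse", 1),
]
conn.executemany("INSERT INTO evaluations VALUES (?, ?, ?, ?)", rows)


def success_rate_by_intent(conn, model, intent=None):
    """Return {intent: attack success rate} for one model, optionally filtered."""
    query = "SELECT intent, AVG(attack_succeeded) FROM evaluations WHERE model = ?"
    params = [model]
    if intent is not None:
        query += " AND intent = ?"
        params.append(intent)
    query += " GROUP BY intent ORDER BY intent"
    return dict(conn.execute(query, params).fetchall())


rates_a = success_rate_by_intent(conn, "model-a")
rates_b = success_rate_by_intent(conn, "model-b")

# Negative values mean model-b was broken less often than model-a.
rate_diff = {i: rates_b[i] - rates_a[i] for i in rates_a.keys() & rates_b.keys()}
print(rate_diff)
```

Passing an `intent` argument narrows the query to one category, which is one plausible way a filter like `--intent role_hijack` could be applied at the database level.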
Roadmap
- v1.0 (Current): Direct Prompt Injection & Chat Vulnerability Diffing.
- v2.0 (In Development): Agentic Trajectory Evaluation & Indirect Prompt Injections (IPI).
Author: Sanskar Jajoo (@m4vic)
File details
Details for the file safetydiff-1.0.0.tar.gz.
File metadata
- Download URL: safetydiff-1.0.0.tar.gz
- Upload date:
- Size: 11.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f136ab58526c64c01e58b9ca7272d018f841790912f1b38ce75b16890a8afcc5 |
| MD5 | 98b23ee600c672870bb3d217df591379 |
| BLAKE2b-256 | 59aaefa6662fab80fd8c66eb7db8e8dc2fd32db6f02871bbfc94079f2bd31f0f |
File details
Details for the file safetydiff-1.0.0-py3-none-any.whl.
File metadata
- Download URL: safetydiff-1.0.0-py3-none-any.whl
- Upload date:
- Size: 12.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bc804ecdd2d3b52ddbcc4b0bad41356538047a04c664ff4fbcb62915424624b8 |
| MD5 | b514200aa4a663407e47142561421ea2 |
| BLAKE2b-256 | 6502c80e94b8e72f874603d1576b0fb7b16942edf7101034ec5cacbfb0b2fe04 |