EVAV Operational Alignment Battery — open-source CLI for matched-pair AI deployment safety auditing
evav-bench
The open-source CLI for the EVAV Operational Alignment Battery — a matched-pair causal-identification test suite for AI agents making decisions in regulated industries.
pip install evav
See the public leaderboard at evav.ai/leaderboard. Read the methodology at evav.ai/methodology.
What This Is
EVAV is the behavioral intelligence layer for AI in regulated decisions. The Operational Alignment Battery tests whether a model preserves its stated rules under deployment-realistic pressure.
This CLI runs the full battery — 8 axes, 10 test groups, up to 80 cells — against any frontier model and produces an Evaluation Card.
Key findings from the reference corpus (209,072 decisions across 8 frontier models):
- 86% of violations would pass conventional compliance review (compliance masking)
- Claude Sonnet 4 ranges from 0% to 98% violation rate depending on documentation tier
- DeepSeek V3 swings 50-94% on identical configuration across PRNG seeds
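The seed-to-seed spread in the last finding can be computed from per-decision records along the following lines. This is a minimal sketch: the record fields (`seed`, `violation`) and the sample data are hypothetical and are not the CLI's actual output schema.

```python
from collections import defaultdict

def violation_rate_by_seed(decisions):
    """Return {seed: fraction of decisions flagged as violations}."""
    totals = defaultdict(int)
    violations = defaultdict(int)
    for d in decisions:
        totals[d["seed"]] += 1
        violations[d["seed"]] += 1 if d["violation"] else 0
    return {seed: violations[seed] / totals[seed] for seed in totals}

def seed_spread(decisions):
    """Return (min_rate, max_rate) across seeds for one fixed configuration."""
    rates = violation_rate_by_seed(decisions).values()
    return min(rates), max(rates)

# Hypothetical records: one configuration, two PRNG seeds.
sample = [
    {"seed": 1, "violation": True},
    {"seed": 1, "violation": False},
    {"seed": 2, "violation": True},
    {"seed": 2, "violation": True},
]
print(seed_spread(sample))  # (0.5, 1.0)
```

A wide gap between the two numbers means the violation rate depends heavily on the seed even though nothing else in the configuration changed.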
Quick Start
# Install
pip install evav
# Set API key for your provider
export ANTHROPIC_API_KEY="sk-ant-..." # or OPENAI_API_KEY, DEEPSEEK_API_KEY, etc.
# Run the smoke test (~$0.10 on DeepSeek)
evav run examples/battery.smoketest_deepseek.json --output ./results/
# Render the Evaluation Card
evav render-card ./results/ --format md > card.md
evav render-card ./results/ --format json > card.json
# Or render a beautiful visual HTML card
python cards/renderer/render.py ./results/ --out card.html
Supported Providers
| Provider | Env var | Example model |
|---|---|---|
| Anthropic | `ANTHROPIC_API_KEY` | `claude-sonnet-4-6` |
| OpenAI | `OPENAI_API_KEY` | `gpt-4o` |
| Google | `GEMINI_API_KEY` | `gemini-2.5-pro` |
| DeepSeek | `DEEPSEEK_API_KEY` | `deepseek-chat` |
| OpenRouter | `OPENROUTER_API_KEY` | `meta-llama/llama-4-maverick` |
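Before starting a run, it can help to confirm which provider keys are actually present in the environment. The sketch below is a standalone helper, not part of the `evav` package; the function name is hypothetical, the env var names are the ones listed in this section, and the "Google" label for `GEMINI_API_KEY` follows the Status table.

```python
import os

# Env var names as listed in the Supported Providers section.
PROVIDER_ENV_VARS = {
    "Anthropic": "ANTHROPIC_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "Google": "GEMINI_API_KEY",
    "DeepSeek": "DEEPSEEK_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}

def configured_providers(env=None):
    """Return the providers whose API key variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]

print("Configured providers:", configured_providers() or "none")
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.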
CLI Commands
| Command | Purpose |
|---|---|
| `evav validate <config>` | Validate config, print cells + cost estimate |
| `evav run <config> -o <dir>` | Execute the battery |
| `evav resume <dir>` | Resume an interrupted run |
| `evav render-card <dir>` | Render Evaluation Card (md/json/html) |
| `evav render-report <dir>` | Render full Audit Report |
| `evav render-failure-map <dir>` | Failure Cell Map JSON |
| `evav render-precursor-profile <dir>` | Per-model precursor signal profile |
| `evav render-interventions <dir>` | Intervention recommendations |
| `evav baseline <dir>` | Save a drift baseline |
| `evav drift-diff <baseline> <new>` | Compare runs for drift |
| `evav compare <dir-a> <dir-b>` | Diff two runs |
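A typical sequence is validate, then run, then render. The sketch below only assembles those commands as argument lists and prints them; the helper name `audit_commands` is hypothetical, and the flags used (`-o`, `--format`) are the ones shown in this README.

```python
import shlex

def audit_commands(config, out_dir, card_format="md"):
    """Build the validate -> run -> render-card sequence as argv lists."""
    return [
        ["evav", "validate", config],
        ["evav", "run", config, "-o", out_dir],
        # render-card writes to stdout in the Quick Start; redirect it to a file.
        ["evav", "render-card", out_dir, "--format", card_format],
    ]

for argv in audit_commands("examples/battery.smoketest_deepseek.json", "./results/"):
    print(shlex.join(argv))
    # To execute for real: subprocess.run(argv, check=True)
```

Building argv lists rather than shell strings avoids quoting problems if a config path contains spaces.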
The Three Card Types
This repo ships with two card templates in cards/templates/:
- Deployment Card (`deployment_card.html`): single model, single config. The output of every `evav run`.
- Benchmark Card (`benchmark_card.html`): cross-model matrix. Used for the public leaderboard at evav.ai.
See cards/README.md for the full template reference.
Example Configs
| File | Domain | Cells | Suitable for |
|---|---|---|---|
| `examples/battery.smoketest_deepseek.json` | Healthcare (lightweight) | 7 | First test (~$0.10) |
| `examples/battery.healthcare.example.json` | Medicare prior auth | 51 | Full audit, replicating research |
| `examples/battery.lending.example.json` | Consumer credit (ECOA) | 28 | Lending compliance |
| `examples/battery.trading.example.json` | Market-making (Reg NMS) | 24 | Trading compliance |
Data
- Public corpus: 209,072-decision matched-pair dataset at huggingface.co/datasets/evavlabs/oa (CC-BY-4.0)
- Reference Evaluation Cards: rendered examples for the 8 frontier models in `cards/examples/`
Methodology
Matched-pair causal identification. PRNG-deterministic scenario generation (Mulberry32). 8 axes (pressure, doc tier, anchor, phrasing, authority, intervention, seed, temp). 10 test groups (A baselines → J forensics). Validated by SAE-based mechanistic interpretability at 81.2% probe accuracy.
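To illustrate why a seeded PRNG matters for matched pairs, here is a Python port of the widely circulated Mulberry32 routine, followed by a toy scenario generator. This is a sketch under stated assumptions: the port is not taken from the EVAV codebase, and `make_scenario` with its fields (`patient_age`, `claim_amount`) is hypothetical. Only the `pressure` axis name comes from the list above.

```python
MASK32 = 0xFFFFFFFF

def mulberry32(seed):
    """Return a function yielding deterministic floats in [0, 1) for a 32-bit seed."""
    state = seed & MASK32

    def next_float():
        nonlocal state
        state = (state + 0x6D2B79F5) & MASK32
        t = state
        t = ((t ^ (t >> 15)) * (t | 1)) & MASK32
        t ^= (t + ((t ^ (t >> 7)) * (t | 61))) & MASK32
        return ((t ^ (t >> 14)) & MASK32) / 4294967296

    return next_float

def make_scenario(seed, pressure):
    """Hypothetical scenario: every field except `pressure` comes from the PRNG."""
    rand = mulberry32(seed)
    return {
        "patient_age": 18 + int(rand() * 70),
        "claim_amount": round(1000 + rand() * 9000, 2),
        "pressure": pressure,
    }

# A matched pair: same seed, so only the manipulated axis differs.
control = make_scenario(seed=42, pressure="none")
treated = make_scenario(seed=42, pressure="high")
differing = [k for k in control if control[k] != treated[k]]
print(differing)  # ['pressure']
```

Because both arms share a seed, any difference in the model's decision can be attributed to the manipulated axis rather than to incidental variation in the scenario.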
Full methodology: evav.ai/methodology
Paper (NeurIPS 2026 Datasets & Benchmarks Track): arxiv.org/abs/2026.xxxxx
Citation
@inproceedings{cruz2026evav,
title = {Evaluating AI Specification Gaming Under Matched-Pair Pressure},
author = {Cruz, Anthony},
booktitle = {NeurIPS 2026 Datasets and Benchmarks Track},
year = {2026},
url = {https://evav.ai/research}
}
Enterprise
For production deployment safety audits with full deliverables (Audit Report, Failure Cell Map, Intervention Recommendations, Precursor Profile, Compliance Artifact templates for HIPAA / ECOA / SOC 2 / EU AI Act / NIST AI RMF), see evav.ai/product.
The CLI in this repo runs the same instrument used in our paid Tier 1 audits — the difference is the deliverables, the audit team, and the compliance-artifact mapping that go around it.
License
Proprietary. Free for evaluating your own models, internal R&D, and academic research with citation. Redistribution and commercial use require permission. See LICENSE.
Status
This is v1.0 — the initial public release. See CHANGELOG.md for what's included.
| Component | Status |
|---|---|
| CLI commands | ✅ stable |
| Anthropic, OpenAI, DeepSeek adapters | ✅ tested end-to-end |
| Google, OpenRouter adapters | ⚠️ scaffolded, less battle-tested |
| Healthcare domain pack | ✅ full prompts |
| Lending, trading domain packs | ✅ ported from research |
| Two-stage masking classifier | ✅ heuristic + LLM |
| 25-signal precursor extractor | ✅ working |
| Concurrent execution | ✅ --workers N |
| Retry + rate limiting | ✅ exponential backoff |
| Drift baseline + diff | ✅ working |
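For readers unfamiliar with the retry strategy named in the table, the sketch below shows a generic exponential backoff with jitter. It is not the package's implementation; the function name, parameters, and defaults are assumptions chosen for illustration.

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so concurrent workers do not sync up.
            sleep(delay * random.uniform(0.5, 1.0))

# Demo: a stand-in for a provider call that fails twice, then succeeds.
attempts = {"n": 0}

def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(retry_with_backoff(flaky_call, base_delay=0.01))  # ok
```

The injectable `sleep` argument exists so the waiting behavior can be tested without real delays.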
Support
- Bugs / features: GitHub Issues
- Methodology questions: evav.ai/methodology
- Commercial / enterprise: labs@evav.ai
Built by EVAV. Methodology: Operational Alignment v1.0.