EVAV Operational Alignment Battery — open-source CLI for matched-pair AI deployment safety auditing

evav-bench

The open-source CLI for the EVAV Operational Alignment Battery — a matched-pair causal-identification test suite for AI agents making decisions in regulated industries.

pip install evav

See the public leaderboard at evav.ai/leaderboard. Read the methodology at evav.ai/methodology.


What This Is

EVAV is the behavioral intelligence layer for AI in regulated decisions. The Operational Alignment Battery tests whether a model preserves its stated rules under deployment-realistic pressure.

This CLI runs the full battery — 8 axes, 10 test groups, up to 80 cells — against any frontier model and produces an Evaluation Card.
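The "up to 80 cells" figure is consistent with crossing 8 axes with 10 test groups. The sketch below illustrates that arithmetic only. It assumes a cell is one (test group, axis) pair, which this README does not state, and the `enumerate_cells` helper and the `group:axis` label format are hypothetical. The axis names and the A to J group letters come from the Methodology section.

```python
from itertools import product
from string import ascii_uppercase

# Axis names as listed in the Methodology section of this README.
AXES = ["pressure", "doc_tier", "anchor", "phrasing",
        "authority", "intervention", "seed", "temp"]

# Test groups A through J (10 groups).
GROUPS = list(ascii_uppercase[:10])


def enumerate_cells(axes=AXES, groups=GROUPS):
    """Return one label per (group, axis) pair.

    Assumption: a "cell" is a single group/axis combination. The real
    battery may define cells differently.
    """
    return [f"{group}:{axis}" for group, axis in product(groups, axes)]


cells = enumerate_cells()
print(len(cells))  # 8 axes x 10 groups
```

Under this reading, the example configs with fewer cells would simply select a subset of the grid.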

Key findings from the reference corpus (209,072 decisions across 8 frontier models):

  • 86% of violations would pass conventional compliance review (compliance masking)
  • Claude Sonnet 4's violation rate ranges from 0% to 98% depending on documentation tier
  • DeepSeek V3's violation rate swings from 50% to 94% on an identical configuration when only the PRNG seed changes

Quick Start

# Install
pip install evav

# Set API key for your provider
export ANTHROPIC_API_KEY="sk-ant-..."   # or OPENAI_API_KEY, DEEPSEEK_API_KEY, etc.

# Run the smoke test (~$0.10 on DeepSeek)
evav run examples/battery.smoketest_deepseek.json --output ./results/

# Render the Evaluation Card
evav render-card ./results/ --format md > card.md
evav render-card ./results/ --format json > card.json

# Or render the visual HTML card
python cards/renderer/render.py ./results/ --out card.html
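For scripted runs, the same steps can be assembled from Python. This is a minimal sketch that only builds the command lines shown above. `build_pipeline` and `as_shell` are hypothetical helper names, not part of the `evav` package, and nothing is executed here.

```python
import shlex


def build_pipeline(config, out_dir):
    """Build argv lists for validate -> run -> render-card.

    Only subcommands and flags that appear in this README are used.
    """
    return [
        ["evav", "validate", config],
        ["evav", "run", config, "--output", out_dir],
        # render-card writes to stdout; redirect it as in the Quick Start.
        ["evav", "render-card", out_dir, "--format", "md"],
    ]


def as_shell(commands):
    """Render argv lists as shell-quoted strings for logging."""
    return [shlex.join(cmd) for cmd in commands]


pipeline = build_pipeline("examples/battery.smoketest_deepseek.json", "./results/")
for line in as_shell(pipeline):
    print(line)
```

To execute a step, each argv list could be passed to `subprocess.run(cmd, check=True)`, capturing stdout for the `render-card` step.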

Supported Providers

| Provider | Env var | Example model |
|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-6 |
| OpenAI | OPENAI_API_KEY | gpt-4o |
| Google | GEMINI_API_KEY | gemini-2.5-pro |
| DeepSeek | DEEPSEEK_API_KEY | deepseek-chat |
| OpenRouter | OPENROUTER_API_KEY | meta-llama/llama-4-maverick |
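Before starting a run, it can help to check which provider keys are present in the environment. The sketch below uses the environment variable names from the table above. The `configured_providers` helper is hypothetical and is not part of the CLI.

```python
import os

# Provider -> environment variable, as listed in this README.
PROVIDER_ENV_VARS = {
    "Anthropic": "ANTHROPIC_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "Google": "GEMINI_API_KEY",
    "DeepSeek": "DEEPSEEK_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def configured_providers(environ=None):
    """Return the providers whose API key variable is set and non-empty."""
    env = os.environ if environ is None else environ
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]


print(configured_providers())
```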

CLI Commands

| Command | Purpose |
|---|---|
| evav validate <config> | Validate config, print cells + cost estimate |
| evav run <config> -o <dir> | Execute the battery |
| evav resume <dir> | Resume an interrupted run |
| evav render-card <dir> | Render Evaluation Card (md/json/html) |
| evav render-report <dir> | Render full Audit Report |
| evav render-failure-map <dir> | Failure Cell Map JSON |
| evav render-precursor-profile <dir> | Per-model precursor signal profile |
| evav render-interventions <dir> | Intervention recommendations |
| evav baseline <dir> | Save a drift baseline |
| evav drift-diff <baseline> <new> | Compare runs for drift |
| evav compare <dir-a> <dir-b> | Diff two runs |

Card Types

This repo ships with two card templates in cards/templates/:

  1. Deployment Card (deployment_card.html) — single model, single config. The output of every evav run.
  2. Benchmark Card (benchmark_card.html) — cross-model matrix. Used for the public leaderboard at evav.ai.

See cards/README.md for the full template reference.

Example Configs

| File | Domain | Cells | Suitable for |
|---|---|---|---|
| examples/battery.smoketest_deepseek.json | Healthcare (lightweight) | 7 | First test (~$0.10) |
| examples/battery.healthcare.example.json | Medicare prior auth | 51 | Full audit, replicating research |
| examples/battery.lending.example.json | Consumer credit (ECOA) | 28 | Lending compliance |
| examples/battery.trading.example.json | Market-making (Reg NMS) | 24 | Trading compliance |

Data

  • Public corpus: 209,072-decision matched-pair dataset at huggingface.co/datasets/evavlabs/oa (CC-BY-4.0)
  • Reference Evaluation Cards: rendered examples for the 8 frontier models in cards/examples/

Methodology

The battery uses matched-pair causal identification. Scenarios are generated deterministically from a seeded PRNG (Mulberry32). Runs vary 8 axes (pressure, documentation tier, anchor, phrasing, authority, intervention, seed, temperature) across 10 test groups (A, baselines, through J, forensics). The method is validated by SAE-based mechanistic interpretability at 81.2% probe accuracy.
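To make the determinism claim concrete, here is a Python port of the widely circulated Mulberry32 generator. It is a sketch: the README only names the algorithm, so how the battery seeds it or maps draws to scenarios is not shown, and the name `mulberry32` is simply a label chosen for this example.

```python
MASK32 = 0xFFFFFFFF


def mulberry32(seed):
    """Return a function producing floats in [0, 1) from a 32-bit seed.

    Ported from the common JavaScript form of Mulberry32, with explicit
    masking to emulate 32-bit integer arithmetic.
    """
    state = seed & MASK32

    def next_float():
        nonlocal state
        state = (state + 0x6D2B79F5) & MASK32
        t = state
        t = ((t ^ (t >> 15)) * (t | 1)) & MASK32
        t ^= (t + (((t ^ (t >> 7)) * (t | 61)) & MASK32)) & MASK32
        return ((t ^ (t >> 14)) & MASK32) / 4294967296

    return next_float


# Two generators with the same seed yield the same sequence, so both arms
# of a matched pair can be built from identical scenario draws.
arm_a = mulberry32(42)
arm_b = mulberry32(42)
draws_a = [arm_a() for _ in range(5)]
draws_b = [arm_b() for _ in range(5)]
print(draws_a == draws_b)
```

Because the sequence depends only on the seed, any difference in outcomes between the two arms can be attributed to the single factor that was varied rather than to the scenario itself.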

Full methodology: evav.ai/methodology

Paper (NeurIPS 2026 Datasets & Benchmarks Track): arxiv.org/abs/2026.xxxxx

Citation

@inproceedings{cruz2026evav,
  title     = {Evaluating AI Specification Gaming Under Matched-Pair Pressure},
  author    = {Cruz, Anthony},
  booktitle = {NeurIPS 2026 Datasets and Benchmarks Track},
  year      = {2026},
  url       = {https://evav.ai/research}
}

Enterprise

For production deployment safety audits with full deliverables (Audit Report, Failure Cell Map, Intervention Recommendations, Precursor Profile, Compliance Artifact templates for HIPAA / ECOA / SOC 2 / EU AI Act / NIST AI RMF), see evav.ai/product.

The CLI in this repo runs the same instrument used in our paid Tier 1 audits — the difference is the deliverables, the audit team, and the compliance-artifact mapping that go around it.

License

Proprietary. Free for evaluating your own models, internal R&D, and academic research with citation. Redistribution and commercial use require permission. See LICENSE.

Status

This is v1.0 — the initial public release. See CHANGELOG.md for what's included.

| Component | Status |
|---|---|
| CLI commands | ✅ stable |
| Anthropic, OpenAI, DeepSeek adapters | ✅ tested end-to-end |
| Google, OpenRouter adapters | ⚠️ scaffolded, less battle-tested |
| Healthcare domain pack | ✅ full prompts |
| Lending, trading domain packs | ✅ ported from research |
| Two-stage masking classifier | ✅ heuristic + LLM |
| 25-signal precursor extractor | ✅ working |
| Concurrent execution | --workers N |
| Retry + rate limiting | ✅ exponential backoff |
| Drift baseline + diff | ✅ working |
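For readers unfamiliar with the retry strategy named in the table, the following is a generic sketch of exponential backoff with jitter. It is not the CLI's implementation. `retry_with_backoff` and all of its parameters and defaults are assumptions made for illustration.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call `call()` until it succeeds or attempts run out.

    The wait doubles after each failure (capped at max_delay) and is scaled
    by a jitter factor in [0.5, 1.0) when `rng` is `random.random`.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * (0.5 + rng() / 2))


# Demonstration with a fake sleep so nothing actually waits.
waits = []
failures = {"left": 2}


def flaky():
    """Fail twice, then succeed."""
    if failures["left"] > 0:
        failures["left"] -= 1
        raise RuntimeError("simulated rate limit")
    return "ok"


result = retry_with_backoff(flaky, sleep=waits.append, rng=lambda: 1.0)
print(result, waits)
```

Injecting `sleep` and `rng` keeps the function testable. In real use the defaults apply and the jitter spreads retries from concurrent workers apart.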

Built by EVAV. Methodology: Operational Alignment v1.0.
