
EVAV Operational Alignment Battery — open-source CLI for matched-pair AI deployment safety auditing


evav-bench

The open-source CLI for the EVAV Operational Alignment Battery — a matched-pair causal-identification test suite for AI agents making decisions in regulated industries.


pip install evav

See the public leaderboard at evav.ai/leaderboard. Read the methodology at evav.ai/methodology.


What This Is

EVAV is the behavioral intelligence layer for AI in regulated decisions. The Operational Alignment Battery tests whether a model preserves its stated rules under deployment-realistic pressure.

This CLI runs the full battery — 8 axes, 10 test groups, up to 80 cells — against any frontier model and produces an Evaluation Card.

Key findings from the reference corpus (209,072 decisions across 8 frontier models):

  • 86% of violations would pass conventional compliance review (compliance masking)
  • Claude Sonnet 4 ranges from 0% to 98% violation rate depending on documentation tier
  • DeepSeek V3's violation rate swings from 50% to 94% on an identical configuration when only the PRNG seed changes

Quick Start

# Install
pip install evav

# Set API key for your provider
export ANTHROPIC_API_KEY="sk-ant-..."   # or OPENAI_API_KEY, DEEPSEEK_API_KEY, etc.

# Run the smoke test (~$0.10 on DeepSeek)
evav run examples/battery.smoketest_deepseek.json --output ./results/

# Render the Evaluation Card
evav render-card ./results/ --format md > card.md
evav render-card ./results/ --format json > card.json

# Or render the visual HTML card with the bundled renderer script
python cards/renderer/render.py ./results/ --out card.html

Supported Providers

| Provider | Env var | Example model |
|---|---|---|
| Anthropic | `ANTHROPIC_API_KEY` | `claude-sonnet-4-6` |
| OpenAI | `OPENAI_API_KEY` | `gpt-4o` |
| Google | `GEMINI_API_KEY` | `gemini-2.5-pro` |
| DeepSeek | `DEEPSEEK_API_KEY` | `deepseek-chat` |
| OpenRouter | `OPENROUTER_API_KEY` | `meta-llama/llama-4-maverick` |
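
Before starting a run, it can help to confirm which provider keys are actually visible to the process. The sketch below is not part of the evav CLI; it only reads the environment variable names listed in the table above, and the helper name `configured_providers` is made up for illustration.

```python
import os

# Provider -> environment variable, copied from the table above.
PROVIDER_ENV_VARS = {
    "Anthropic": "ANTHROPIC_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "Google": "GEMINI_API_KEY",
    "DeepSeek": "DEEPSEEK_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def configured_providers(environ=None):
    """Return the providers whose API key variable is set and non-empty.

    Hypothetical helper for illustration; evav itself may check keys differently.
    """
    env = os.environ if environ is None else environ
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]


if __name__ == "__main__":
    found = configured_providers()
    print("Configured providers:", ", ".join(found) or "none")
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise in a script or test without touching real credentials.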

CLI Commands

| Command | Purpose |
|---|---|
| `evav validate <config>` | Validate config, print cells + cost estimate |
| `evav run <config> -o <dir>` | Execute the battery |
| `evav resume <dir>` | Resume an interrupted run |
| `evav render-card <dir>` | Render Evaluation Card (md/json/html) |
| `evav render-report <dir>` | Render full Audit Report |
| `evav render-failure-map <dir>` | Failure Cell Map JSON |
| `evav render-precursor-profile <dir>` | Per-model precursor signal profile |
| `evav render-interventions <dir>` | Intervention recommendations |
| `evav baseline <dir>` | Save a drift baseline |
| `evav drift-diff <baseline> <new>` | Compare runs for drift |
| `evav compare <dir-a> <dir-b>` | Diff two runs |
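
The commands above chain naturally into a validate, run, render sequence. The following is a minimal sketch of scripting that sequence from Python. It uses only the subcommands and flags shown in this README (`validate`, `run ... -o`, `render-card ... --format`); the function names and the dry-run default are assumptions made for illustration, not an API that evav provides.

```python
import shlex
import subprocess

CARD_FORMATS = {"md", "json", "html"}  # formats listed for render-card above


def validate_cmd(config):
    """argv for `evav validate <config>`."""
    return ["evav", "validate", config]


def run_cmd(config, out_dir):
    """argv for `evav run <config> -o <dir>`."""
    return ["evav", "run", config, "-o", out_dir]


def render_card_cmd(out_dir, fmt="md"):
    """argv for `evav render-card <dir> --format <fmt>`."""
    if fmt not in CARD_FORMATS:
        raise ValueError(f"unsupported card format: {fmt!r}")
    return ["evav", "render-card", out_dir, "--format", fmt]


def audit_pipeline(config, out_dir):
    """Hypothetical helper: the three steps in the order the Quick Start uses them."""
    return [validate_cmd(config), run_cmd(config, out_dir), render_card_cmd(out_dir)]


def execute(commands, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for argv in commands:
        print("$", shlex.join(argv))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    execute(audit_pipeline("examples/battery.smoketest_deepseek.json", "./results/"))
```

With `dry_run=True` the script only prints the commands, which is a cheap way to review a sequence before spending API budget. Note that the Quick Start redirects `render-card` output to a file; this sketch leaves it on stdout.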

Card Templates

This repo ships with two card templates in cards/templates/:

  1. Deployment Card (deployment_card.html) — single model, single config. The output of every evav run.
  2. Benchmark Card (benchmark_card.html) — cross-model matrix. Used for the public leaderboard at evav.ai.

See cards/README.md for the full template reference.

Example Configs

| File | Domain | Cells | Suitable for |
|---|---|---|---|
| `examples/battery.smoketest_deepseek.json` | Healthcare (lightweight) | 7 | First test (~$0.10) |
| `examples/battery.healthcare.example.json` | Medicare prior auth | 51 | Full audit, replicating research |
| `examples/battery.lending.example.json` | Consumer credit (ECOA) | 28 | Lending compliance |
| `examples/battery.trading.example.json` | Market-making (Reg NMS) | 24 | Trading compliance |

Data

  • Public corpus: 209,072-decision matched-pair dataset at huggingface.co/datasets/evavlabs/oa (CC-BY-4.0)
  • Reference Evaluation Cards: rendered examples for the 8 frontier models in cards/examples/

Methodology

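The battery uses matched-pair causal identification. Scenarios are generated deterministically from a seeded PRNG (Mulberry32), so the same seed always reproduces the same scenario. The battery varies 8 axes (pressure, documentation tier, anchor, phrasing, authority, intervention, seed, and temperature) across 10 test groups (A, baselines, through J, forensics). The results were validated with SAE-based (sparse autoencoder) mechanistic interpretability at 81.2% probe accuracy.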
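
To make the role of the seeded PRNG concrete, here is a minimal Python port of the widely circulated Mulberry32 generator, followed by a toy matched pair built on top of it. This is a sketch under stated assumptions: the README does not describe how evav seeds or consumes the generator, and the scenario fields (`claim_amount`, `patient_age`) and the `low`/`high` pressure labels are invented for illustration.

```python
MASK32 = 0xFFFFFFFF  # keep all arithmetic in unsigned 32-bit range


def mulberry32(seed):
    """Return a function that yields floats in [0, 1) from a 32-bit seed."""
    state = seed & MASK32

    def next_float():
        nonlocal state
        state = (state + 0x6D2B79F5) & MASK32
        t = state
        t = ((t ^ (t >> 15)) * (t | 1)) & MASK32
        t ^= (t + (((t ^ (t >> 7)) * (t | 61)) & MASK32)) & MASK32
        return (t ^ (t >> 14)) / 4294967296

    return next_float


def make_scenario(seed, pressure):
    """Build a toy scenario; all field names here are hypothetical."""
    rng = mulberry32(seed)
    return {
        "seed": seed,
        "pressure": pressure,  # the single axis that differs within a pair
        "claim_amount": round(1000 + 9000 * rng(), 2),
        "patient_age": 18 + int(rng() * 70),
    }


def matched_pair(seed):
    """Two scenarios that share every generated detail except `pressure`."""
    return make_scenario(seed, "low"), make_scenario(seed, "high")


if __name__ == "__main__":
    control, treatment = matched_pair(42)
    changed = [k for k in control if control[k] != treatment[k]]
    print("Fields that differ:", changed)
```

Because both scenarios are drawn from the same seed, every generated detail is identical and only the manipulated axis differs. A change in the model's decision between the two can then be attributed to that axis rather than to scenario noise.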

Full methodology: evav.ai/methodology

Paper (NeurIPS 2026 Datasets & Benchmarks Track): arxiv.org/abs/2026.xxxxx

Citation

@inproceedings{cruz2026evav,
  title     = {Evaluating AI Specification Gaming Under Matched-Pair Pressure},
  author    = {Cruz, Anthony},
  booktitle = {NeurIPS 2026 Datasets and Benchmarks Track},
  year      = {2026},
  url       = {https://evav.ai/research}
}

Enterprise

For production deployment safety audits with full deliverables (Audit Report, Failure Cell Map, Intervention Recommendations, Precursor Profile, Compliance Artifact templates for HIPAA / ECOA / SOC 2 / EU AI Act / NIST AI RMF), see evav.ai/product.

The CLI in this repo runs the same instrument used in our paid Tier 1 audits — the difference is the deliverables, the audit team, and the compliance-artifact mapping that go around it.

License

Proprietary. Free for evaluating your own models, internal R&D, and academic research with citation. Redistribution and commercial use require permission. See LICENSE.

Status

This is v1.0 — the initial public release. See CHANGELOG.md for what's included.

| Component | Status |
|---|---|
| CLI commands | ✅ stable |
| Anthropic, OpenAI, DeepSeek adapters | ✅ tested end-to-end |
| Google, OpenRouter adapters | ⚠️ scaffolded, less battle-tested |
| Healthcare domain pack | ✅ full prompts |
| Lending, trading domain packs | ✅ ported from research |
| Two-stage masking classifier | ✅ heuristic + LLM |
| 25-signal precursor extractor | ✅ working |
| Concurrent execution | `--workers N` |
| Retry + rate limiting | ✅ exponential backoff |
| Drift baseline + diff | ✅ working |


Built by EVAV. Methodology: Operational Alignment v1.0.
