EVAV: AI agent integrity platform. Runtime guardrails, observability, behavioral signals, and an audit chain for production AI agents. Wraps best-in-class OSS (NeMo Guardrails, Guardrails AI, LLM Guard, Langfuse) with a peer-reviewed signal layer.
oa-bench — OA Evaluation Battery CLI
Domain-agnostic test runner for the OA Evaluation Battery. Consumes battery.config.json (output from the sales onboarding worksheet) and produces Evaluation Cards, Audit Reports, and supporting deliverables.
What's In This Folder
cli/
├── README.md                  # This file
├── pyproject.toml             # Install with `pip install -e .`
├── oa_bench/
│   ├── __init__.py
│   ├── __main__.py            # python -m oa_bench
│   ├── cli.py                 # Click commands
│   ├── battery.py             # Battery config + cell enumeration
│   ├── runner.py              # Cell execution
│   ├── card.py                # Evaluation Card renderer (Jinja2)
│   ├── report.py              # Audit Report renderer (Jinja2)
│   ├── scoring/
│   │   ├── __init__.py
│   │   ├── matched_pair.py    # Differential-treatment scorer
│   │   ├── masking.py         # Compliance-masking classifier
│   │   └── precursor.py       # 25-signal extractor
│   ├── models/
│   │   ├── __init__.py
│   │   ├── _base.py           # Abstract ModelAdapter
│   │   ├── anthropic.py
│   │   ├── openai.py
│   │   ├── google.py
│   │   └── openrouter.py
│   └── domains/
│       ├── __init__.py
│       ├── _base.py           # Abstract DomainPack
│       ├── healthcare.py      # Reference healthcare pack
│       ├── lending.py         # Reference lending pack
│       └── trading.py         # Reference trading pack
├── examples/
│   ├── battery.healthcare.example.json
│   ├── battery.lending.example.json
│   └── battery.trading.example.json
└── tests/
    └── test_smoke.py
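The tree lists `models/_base.py` as the abstract ModelAdapter, and new providers are meant to be added by subclassing it. The real interface is not shown in this README, so the following is only a sketch of what such a base class could look like. The `complete` method, its parameters, and `EchoAdapter` are hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod


class ModelAdapter(ABC):
    """Hypothetical stand-in for the base class in models/_base.py."""

    def __init__(self, name: str, temperature: float = 0.0, max_tokens: int = 2048):
        self.name = name
        self.temperature = temperature
        self.max_tokens = max_tokens

    @abstractmethod
    def complete(self, system_prompt: str, user_prompt: str) -> str:
        """Return the model's text response for one prompt (assumed signature)."""


class EchoAdapter(ModelAdapter):
    """Offline adapter that echoes the prompt; handy for dry runs and tests."""

    def complete(self, system_prompt: str, user_prompt: str) -> str:
        return f"[{self.name}] {user_prompt}"


adapter = EchoAdapter("demo-model", temperature=0.2)
print(adapter.complete("You are a reviewer.", "Approve or deny?"))
```

A provider-specific adapter would follow the same shape, replacing the body of `complete` with a call to that provider's SDK.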
Install
cd C:/Users/cruzw/projects/evav/products/cli
pip install -e .
For Supabase mode (production):
pip install -e ".[supabase]"
Quick Start
# 1. Set API key for the model you want to test
$env:ANTHROPIC_API_KEY = "sk-ant-..."
# 2. Run a battery (local mode, no Supabase)
oa-bench run --config examples/battery.healthcare.example.json --output ./results/healthcare-claude-sonnet-4/
# 3. Render outputs
oa-bench render-card ./results/healthcare-claude-sonnet-4/ --format md > card.md
oa-bench render-report ./results/healthcare-claude-sonnet-4/ > report.md
oa-bench render-card ./results/healthcare-claude-sonnet-4/ --format json > card.json
Commands
| Command | Purpose |
|---|---|
| `oa-bench validate <config>` | Validate a battery.config.json against the schema; print resolved cell list |
| `oa-bench run <config> --output <dir>` | Execute the battery; write per-cell results to `<dir>/` |
| `oa-bench resume <dir>` | Resume an interrupted run (uses content-addressed cell results to skip completed cells) |
| `oa-bench render-card <dir> --format md\|json\|pdf` | Render the public Evaluation Card |
| `oa-bench render-report <dir>` | Render the full Audit Report (markdown) |
| `oa-bench render-failure-map <dir>` | Render the Failure Cell Map (JSON) |
| `oa-bench render-precursor-profile <dir>` | Render the Precursor Profile (JSON) |
| `oa-bench render-interventions <dir>` | Render Intervention Recommendations (markdown) |
| `oa-bench compare <dir-a> <dir-b>` | Diff two battery runs (model comparison, drift detection) |
| `oa-bench supabase-upload <dir>` | Push results to Supabase Engine for Tier 2/3 ingestion |
Battery Config Schema
A battery config is the customer-specific instantiation of the abstract battery. The sales worksheet (product/sales_worksheet/WORKSHEET.md) produces this file. Example:
{
  "engagement_id": "acme-mortgage-2026-q2",
  "customer": "Acme Mortgage Co.",
  "domain": "consumer-lending",
  "battery_version": "v1.0",
  "model": {
    "provider": "anthropic",
    "name": "claude-sonnet-4-6",
    "temperature": 0.2,
    "max_tokens": 2048
  },
  "system_prompt": "(full system prompt text)",
  "scoring": {
    "manipulated_variable": "applicant_zip_tier",
    "violation_predicate": "base_approved AND twin_denied"
  },
  "axes": {
    "pressure_content": {
      "RW": {"text": "..."},
      "HP": {"text": "..."},
      "OP": null,
      "SY": {"text": "..."},
      "AU": {"high": "...", "low": "..."},
      "AN": {"text": "..."},
      "FM": {"gain": "...", "loss": "..."},
      "EN": null
    },
    "documentation_tiers": {
      "S": ["..."],
      "M": ["..."],
      "Q": ["..."]
    }
  },
  "scope": {
    "groups": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "n_per_cell": 100,
    "seeds": [42, 43, 44, 45],
    "temps": [0.0, 0.2, 0.5, 0.7]
  }
}
See examples/ for filled examples in each reference domain.
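The `violation_predicate` in the example reads like a boolean expression over outcome flags for a base case and its matched twin. How `scoring/matched_pair.py` actually parses it is not documented here, so the sketch below assumes a small grammar of names combined with `AND`, `OR`, `NOT`, and parentheses. The function names `evaluate_predicate` and `violation_rate`, and the shape of the flag dictionaries, are hypothetical.

```python
import re

# Names, keywords, and parentheses; any other character is ignored in this sketch.
_TOKEN = re.compile(r"\(|\)|[A-Za-z_][A-Za-z0-9_]*")


def evaluate_predicate(expr: str, flags: dict) -> bool:
    """Evaluate e.g. 'base_approved AND twin_denied' against boolean flags."""
    tokens = _TOKEN.findall(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of predicate")
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            take()
            rhs = parse_and()  # parse first so tokens are always consumed
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "NOT":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if tok not in flags:
            raise KeyError(f"unknown flag in predicate: {tok}")
        return bool(flags[tok])

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens in predicate")
    return result


def violation_rate(pairs: list, expr: str) -> float:
    """Fraction of matched pairs for which the predicate holds."""
    if not pairs:
        return 0.0
    return sum(evaluate_predicate(expr, p) for p in pairs) / len(pairs)


pairs = [
    {"base_approved": True, "twin_denied": True},
    {"base_approved": True, "twin_denied": False},
]
print(violation_rate(pairs, "base_approved AND twin_denied"))
```

Parsing the expression explicitly, rather than passing it to `eval`, keeps a customer-supplied config string from executing arbitrary code.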
Architecture
                       ┌──────────────────┐
battery.config.json ──▶│ battery.py       │  enumerates cells
                       └────────┬─────────┘
                                │
                                ▼
                       ┌──────────────────┐
                       │ runner.py        │  per-cell execution
                       └────────┬─────────┘
                                │
               ┌────────────────┼────────────────┐
               ▼                ▼                ▼
          ┌──────────┐     ┌──────────┐     ┌──────────┐
          │ models/  │     │ domains/ │     │ scoring/ │
          │ adapter  │     │ pack     │     │ matched- │
          │          │     │          │     │ pair     │
          └────┬─────┘     └────┬─────┘     └────┬─────┘
               │                │                │
               └────────────────┼────────────────┘
                                ▼
                     per-cell .json results
                                │
                                ▼
                     ┌──────────┴───────────┐
                     ▼          ▼           ▼
                  card.py   report.py    others
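The first stage of the diagram, cell enumeration, can be illustrated with a cross product over the config axes. This is a sketch under one assumption: that a cell is a (group, pressure condition) pair, which gives 10 × 8 = 80 cells for the example config. The actual rule in `battery.py` may differ, and seeds, temperatures, and documentation tiers may multiply the number of runs further. The function name `enumerate_cells` and the `has_content` field are hypothetical; the meaning of `null` pressure entries is not documented, so the sketch only records whether content was supplied.

```python
from itertools import product


def enumerate_cells(config: dict) -> list:
    """Cross every scope group with every pressure condition (assumed rule)."""
    groups = config["scope"]["groups"]
    pressures = config["axes"]["pressure_content"]
    return [
        {
            "group": group,
            "pressure": code,
            "has_content": pressures[code] is not None,
        }
        for group, code in product(groups, pressures)
    ]


example = {
    "scope": {"groups": list("ABCDEFGHIJ")},
    "axes": {
        "pressure_content": {
            "RW": {"text": "..."},
            "HP": {"text": "..."},
            "OP": None,
            "SY": {"text": "..."},
            "AU": {"high": "...", "low": "..."},
            "AN": {"text": "..."},
            "FM": {"gain": "...", "loss": "..."},
            "EN": None,
        }
    },
}

cells = enumerate_cells(example)
print(len(cells))  # 80
print(cells[0])
```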
Modes
Local mode (default)
The CLI calls model APIs directly, with no Supabase dependency, and writes results to the local <output>/ directory. Good for:
- Running the public benchmark
- Customer audits where the customer's API access is sufficient
- Development and CI
Supabase mode (--supabase)
The CLI uploads the battery config to Supabase, triggers the existing EVAV Engine, polls for completion, and downloads the aggregated results. Required for:
- Tier 2 monitor integration (monitor reads from Supabase tables)
- Tier 3 records (immutable audit trail uses Supabase as source of truth)
- Multi-tenant access control
Use:
$env:SUPABASE_URL = "..."
$env:SUPABASE_KEY = "..."
oa-bench run --config ... --output ./results/ --supabase
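The "polls for completion" step in Supabase mode could be structured as a bounded polling loop. The Status table marks this mode as a stub, so the following is only a sketch: `fetch_status` is a hypothetical callable standing in for whatever query the Engine exposes, and the status strings `"completed"` and `"failed"` are assumptions.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], str],
    interval_s: float = 5.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call fetch_status until it reports a terminal state or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in ("completed", "failed"):  # assumed terminal states
            return status
        if clock() >= deadline:
            raise TimeoutError(f"run still '{status}' after {timeout_s} seconds")
        sleep(interval_s)


# Simulated engine that finishes on the third check.
_states = iter(["queued", "running", "completed"])
print(poll_until_done(lambda: next(_states), interval_s=0.0))
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting.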
Status
| Component | Status | Notes |
|---|---|---|
| CLI command surface | ✅ scaffolded | All commands stub out correctly; validate, render-card, render-report work end-to-end on example results |
| Battery config schema validation | ✅ working | Pydantic models; full schema validation |
| Cell enumeration | ✅ working | Generates the full ~80-cell list from axis config |
| Model adapters (Anthropic, OpenAI, Google, OpenRouter) | ⚠️ Anthropic + OpenAI working; Google + OpenRouter stubbed | Pluggable interface in models/_base.py; add provider by subclassing |
| Domain packs (healthcare, lending, trading) | ⚠️ Healthcare working with real prompts ported from EVAV_Engine; lending + trading have schema + placeholders | Pluggable via domains/_base.py |
| Matched-pair scorer | ⚠️ Generic predicate evaluation works; domain-specific edge cases need per-domain config | |
| Compliance-masking classifier | ❌ Stub returns 0% | Needs port from existing classifier in EVAV_Knowledge/compliance_fabrication_coding.jsonl analysis |
| Precursor signal extractor | ❌ Stub returns no signals | Needs port from precursor analysis in EVAV_Precursors/ |
| Card renderer (Jinja2) | ✅ working | Templates in templates/; outputs match EVALUATION_CARD_TEMPLATE.md |
| Report renderer | ✅ working | Uses product/templates/audit_report.template.md |
| Supabase upload mode | ❌ Stub | Hooks into existing engine at EVAV_Engine/engine/ |
| Concurrent execution | ⚠️ Sequential by default; --workers N flag added but not yet implemented | Add asyncio concurrency in runner.py |
| Resume capability | ✅ working | Per-cell result files are content-addressed; resume skips completed cells |
| Cost estimator | ✅ working | validate --estimate-cost predicts total API spend before run |
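The resume behaviour noted in the table relies on content-addressed result files. A minimal sketch of that idea follows, assuming the address is a SHA-256 hash of the canonical JSON of a cell's specification and that each result is stored as `<hash>.json`. The actual hash inputs and file layout used by oa-bench are not documented here, and `cell_key` and `pending_cells` are hypothetical names.

```python
import hashlib
import json
from pathlib import Path


def cell_key(cell: dict) -> str:
    """Stable hash of a cell spec; key order does not affect the result."""
    canonical = json.dumps(cell, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def pending_cells(cells: list, results_dir: str) -> list:
    """Return the cells that have no result file yet in results_dir."""
    root = Path(results_dir)
    return [c for c in cells if not (root / f"{cell_key(c)}.json").exists()]


cells = [
    {"group": "A", "pressure": "RW", "seed": 42},
    {"group": "A", "pressure": "HP", "seed": 42},
]
print(cell_key(cells[0])[:12])
print(len(pending_cells(cells, "./results/does-not-exist")))
```

Because the file name is derived from the cell's content, changing any parameter of a cell yields a new key, so stale results are never mistaken for completed work.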
This scaffolding is production-shaped but not production-complete. Engineering takes this as the starting point and fills in:
- Real masking classifier (port from existing analysis pipeline)
- Real precursor extractor (port from EVAV_Precursors/)
- Google + OpenRouter adapters (follow the Anthropic pattern in models/anthropic.py)
- Lending + trading domain packs (follow healthcare pattern in domains/healthcare.py)
- Concurrent cell execution (asyncio + semaphore)
- Supabase mode hookup
Estimated engineering effort to complete: ~3 weeks for one engineer.
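For the concurrent cell execution item, the "asyncio + semaphore" approach could look roughly like this. It is a sketch, not the planned implementation: `run_cells` and the `run_cell` coroutine are hypothetical, and `workers` mirrors the `--workers N` flag only by name.

```python
import asyncio


async def run_cells(cells, run_cell, workers: int = 4):
    """Run run_cell over all cells with at most `workers` in flight at once."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(cell):
        async with semaphore:  # blocks while `workers` cells are already running
            return await run_cell(cell)

    # gather returns results in the same order as the input cells
    return await asyncio.gather(*(guarded(cell) for cell in cells))


async def fake_run_cell(cell):
    """Stand-in for a real model call."""
    await asyncio.sleep(0.01)
    return {"cell": cell, "ok": True}


results = asyncio.run(run_cells(["A-RW", "A-HP", "B-RW"], fake_run_cell, workers=2))
print([r["cell"] for r in results])
```

The semaphore bounds concurrent API calls, which helps stay within provider rate limits while still overlapping network waits.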
Versioning
| CLI version | Battery version | Schema version |
|---|---|---|
| 1.0.0 | v1.0 | v1.0 |
Help
oa-bench --help
oa-bench <command> --help