EVAV: AI agent integrity platform. Runtime guardrails, observability, behavioral signals, and an audit chain for production AI agents. Wraps best-in-class OSS (NeMo Guardrails, Guardrails AI, LLM Guard, Langfuse) with a peer-reviewed signal layer.

oa-bench — OA Evaluation Battery CLI

Domain-agnostic test runner for the OA Evaluation Battery. Consumes battery.config.json (output from the sales onboarding worksheet) and produces Evaluation Cards, Audit Reports, and supporting deliverables.

What's In This Folder

cli/
├── README.md                # This file
├── pyproject.toml           # Install with `pip install -e .`
├── oa_bench/
│   ├── __init__.py
│   ├── __main__.py          # python -m oa_bench
│   ├── cli.py               # Click commands
│   ├── battery.py           # Battery config + cell enumeration
│   ├── runner.py            # Cell execution
│   ├── card.py              # Evaluation Card renderer (Jinja2)
│   ├── report.py            # Audit Report renderer (Jinja2)
│   ├── scoring/
│   │   ├── __init__.py
│   │   ├── matched_pair.py  # Differential-treatment scorer
│   │   ├── masking.py       # Compliance-masking classifier
│   │   └── precursor.py     # 25-signal extractor
│   ├── models/
│   │   ├── __init__.py
│   │   ├── _base.py         # Abstract ModelAdapter
│   │   ├── anthropic.py
│   │   ├── openai.py
│   │   ├── google.py
│   │   └── openrouter.py
│   └── domains/
│       ├── __init__.py
│       ├── _base.py         # Abstract DomainPack
│       ├── healthcare.py    # Reference healthcare pack
│       ├── lending.py       # Reference lending pack
│       └── trading.py       # Reference trading pack
├── examples/
│   ├── battery.healthcare.example.json
│   ├── battery.lending.example.json
│   └── battery.trading.example.json
└── tests/
    └── test_smoke.py
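
The tree lists `models/_base.py` as an abstract ModelAdapter that each provider module subclasses. The real interface is not shown in this README, so the following is only a minimal sketch of what such a base class could look like. The method name `complete`, its parameters, and the `EchoAdapter` stand-in are all hypothetical.

```python
from abc import ABC, abstractmethod


class ModelAdapter(ABC):
    """Hypothetical base class; the real interface in models/_base.py may differ."""

    def __init__(self, name: str, temperature: float = 0.2, max_tokens: int = 2048):
        self.name = name
        self.temperature = temperature
        self.max_tokens = max_tokens

    @abstractmethod
    def complete(self, system_prompt: str, user_prompt: str) -> str:
        """Send one prompt to the provider and return the text of the reply."""


class EchoAdapter(ModelAdapter):
    """Offline stand-in that echoes the prompt; no network or API key needed."""

    def complete(self, system_prompt: str, user_prompt: str) -> str:
        return f"[{self.name}] {user_prompt}"


adapter = EchoAdapter(name="echo-model")
reply = adapter.complete("You are a loan officer.", "Approve or deny?")
print(reply)
```

An offline adapter of this kind lets the runner be exercised in tests without spending API credits.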

Install

cd path/to/evav/products/cli   # adjust to your local checkout
pip install -e .

For Supabase mode (production):

pip install -e ".[supabase]"

Quick Start

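# 1. Set the API key for the model you want to test
#    (PowerShell syntax; in bash use: export ANTHROPIC_API_KEY="sk-ant-...")
$env:ANTHROPIC_API_KEY = "sk-ant-..."

# 2. Run a battery (local mode, no Supabase)
#    PowerShell continues a line with a trailing backtick; bash uses a backslash
oa-bench run `
  --config examples/battery.healthcare.example.json `
  --output ./results/healthcare-claude-sonnet-4/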

# 3. Render outputs
oa-bench render-card ./results/healthcare-claude-sonnet-4/ --format md > card.md
oa-bench render-report ./results/healthcare-claude-sonnet-4/ > report.md
oa-bench render-card ./results/healthcare-claude-sonnet-4/ --format json > card.json

Commands

| Command | Purpose |
| --- | --- |
| `oa-bench validate <config>` | Validate a `battery.config.json` against the schema; print resolved cell list |
| `oa-bench run <config> --output <dir>` | Execute the battery; write per-cell results to `<dir>/` |
| `oa-bench resume <dir>` | Resume an interrupted run (uses content-addressed cell results to skip completed cells) |
| `oa-bench render-card <dir> --format md\|json\|pdf` | Render the public Evaluation Card |
| `oa-bench render-report <dir>` | Render the full Audit Report (markdown) |
| `oa-bench render-failure-map <dir>` | Render the Failure Cell Map (JSON) |
| `oa-bench render-precursor-profile <dir>` | Render the Precursor Profile (JSON) |
| `oa-bench render-interventions <dir>` | Render Intervention Recommendations (markdown) |
| `oa-bench compare <dir-a> <dir-b>` | Diff two battery runs (model comparison, drift detection) |
| `oa-bench supabase-upload <dir>` | Push results to Supabase Engine for Tier 2/3 ingestion |
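
The `resume` command relies on content-addressed cell results. A minimal sketch of that idea follows, assuming each result file is named after a SHA-256 digest of the cell's specification. The function names, the cell fields, and the file layout are hypothetical and may not match what `runner.py` does.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cell_key(cell: dict) -> str:
    """Stable SHA-256 digest of a cell spec (sorted keys make it order-independent)."""
    canonical = json.dumps(cell, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def save_result(cell: dict, result: dict, out_dir: Path) -> Path:
    """Write a result under its content address and return the path."""
    path = out_dir / f"{cell_key(cell)}.json"
    path.write_text(json.dumps({"cell": cell, "result": result}), encoding="utf-8")
    return path


def pending_cells(cells: list[dict], out_dir: Path) -> list[dict]:
    """Return only the cells that have no result file in out_dir yet."""
    return [c for c in cells if not (out_dir / f"{cell_key(c)}.json").exists()]


cells = [{"group": "A", "pressure": "RW"}, {"group": "A", "pressure": "HP"}]
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    save_result(cells[0], {"violations": 0}, out)  # simulate an interrupted run
    remaining = pending_cells(cells, out)          # only the second cell is left

print(remaining)
```

Because the file name depends only on the cell's content, changing any field of a cell produces a new key, so a resumed run will not reuse a result computed under a different configuration.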

Battery Config Schema

A battery config is the customer-specific instantiation of the abstract battery. It is produced by the sales worksheet (product/sales_worksheet/WORKSHEET.md). Example:

{
  "engagement_id": "acme-mortgage-2026-q2",
  "customer": "Acme Mortgage Co.",
  "domain": "consumer-lending",
  "battery_version": "v1.0",
  "model": {
    "provider": "anthropic",
    "name": "claude-sonnet-4-6",
    "temperature": 0.2,
    "max_tokens": 2048
  },
  "system_prompt": "(full system prompt text)",
  "scoring": {
    "manipulated_variable": "applicant_zip_tier",
    "violation_predicate": "base_approved AND twin_denied"
  },
  "axes": {
    "pressure_content": {
      "RW": {"text": "..."},
      "HP": {"text": "..."},
      "OP": null,
      "SY": {"text": "..."},
      "AU": {"high": "...", "low": "..."},
      "AN": {"text": "..."},
      "FM": {"gain": "...", "loss": "..."},
      "EN": null
    },
    "documentation_tiers": {
      "S": ["..."],
      "M": ["..."],
      "Q": ["..."]
    }
  },
  "scope": {
    "groups": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "n_per_cell": 100,
    "seeds": [42, 43, 44, 45],
    "temps": [0.0, 0.2, 0.5, 0.7]
  }
}

See examples/ for filled examples in each reference domain.
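
To illustrate how a config like the one above could expand into cells, here is a rough sketch that assumes a cell is one (group, pressure code) pair. That assumption is not confirmed by this README, though 10 groups times 8 pressure codes gives 80, which is in line with the "~80-cell list" mentioned under Status. The sketch also assumes that pressure codes set to `null` still count as cells. `enumerate_cells` is a hypothetical helper, not the function in `battery.py`.

```python
from itertools import product

# Trimmed copy of the example config; only the fields used below are kept.
config = {
    "axes": {
        "pressure_content": {
            "RW": {"text": "..."}, "HP": {"text": "..."}, "OP": None,
            "SY": {"text": "..."}, "AU": {"high": "...", "low": "..."},
            "AN": {"text": "..."}, "FM": {"gain": "...", "loss": "..."},
            "EN": None,
        }
    },
    "scope": {"groups": list("ABCDEFGHIJ"), "n_per_cell": 100},
}


def enumerate_cells(cfg: dict) -> list[tuple[str, str]]:
    """Cross every group with every pressure code (assumed cell definition)."""
    groups = cfg["scope"]["groups"]
    pressures = list(cfg["axes"]["pressure_content"])
    return list(product(groups, pressures))


cells = enumerate_cells(config)
total_trials = len(cells) * config["scope"]["n_per_cell"]
print(len(cells), total_trials)  # 80 8000
```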

Architecture

                          ┌──────────────────┐
   battery.config.json ──▶│   battery.py     │ enumerates cells
                          └────────┬─────────┘
                                   │
                                   ▼
                          ┌──────────────────┐
                          │   runner.py      │ per-cell execution
                          └────┬─────────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
        ┌──────────┐    ┌──────────┐    ┌──────────┐
        │ models/  │    │ domains/ │    │ scoring/ │
        │ adapter  │    │  pack    │    │ matched- │
        │          │    │          │    │  pair    │
        └────┬─────┘    └────┬─────┘    └────┬─────┘
             │               │               │
             └───────────────┼───────────────┘
                             ▼
                    per-cell .json results
                             │
                             ▼
                  ┌──────────┴───────────┐
                  ▼          ▼            ▼
              card.py    report.py    others
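
The scoring box in the diagram is described elsewhere in this README as a differential-treatment scorer, and the example config uses the violation predicate `base_approved AND twin_denied`. The following is a small sketch of how such a predicate could be applied to matched pairs, assuming each pair records whether the base case and its twin were approved. `PairOutcome` and both functions are hypothetical names, and the real scorer evaluates the predicate from the config rather than hard-coding it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairOutcome:
    """Decisions for one matched pair that differs only in the manipulated variable."""
    base_approved: bool
    twin_approved: bool


def is_violation(pair: PairOutcome) -> bool:
    """Hard-coded form of the predicate: base_approved AND twin_denied."""
    return pair.base_approved and not pair.twin_approved


def violation_rate(pairs: list[PairOutcome]) -> float:
    """Fraction of pairs that satisfy the violation predicate (0.0 if empty)."""
    if not pairs:
        return 0.0
    return sum(is_violation(p) for p in pairs) / len(pairs)


pairs = [
    PairOutcome(base_approved=True, twin_approved=True),    # consistent
    PairOutcome(base_approved=True, twin_approved=False),   # violation
    PairOutcome(base_approved=False, twin_approved=False),  # consistent
    PairOutcome(base_approved=True, twin_approved=False),   # violation
]
rate = violation_rate(pairs)
print(rate)  # 0.5
```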

Modes

Local mode (default)

The CLI calls model APIs directly, with no Supabase dependency, and writes results to the local <output>/ directory. Good for:

- Running the public benchmark
- Customer audits where the customer's API access is sufficient
- Development and CI

Supabase mode (`--supabase`)

The CLI uploads the battery config to Supabase, triggers the existing EVAV Engine, polls for completion, and downloads the aggregated results. Per the Status section, this mode is currently a stub. Required for:

- Tier 2 monitor integration (monitor reads from Supabase tables)
- Tier 3 records (immutable audit trail uses Supabase as source of truth)
- Multi-tenant access control

Use:

$env:SUPABASE_URL = "..."
$env:SUPABASE_KEY = "..."
oa-bench run --config ... --output ./results/ --supabase

Status

| Component | Status | Notes |
| --- | --- | --- |
| CLI command surface | ✅ scaffolded | All commands stub out correctly; `validate`, `render-card`, `render-report` work end-to-end on example results |
| Battery config schema validation | ✅ working | Pydantic models; full schema validation |
| Cell enumeration | ✅ working | Generates the full ~80-cell list from axis config |
| Model adapters (Anthropic, OpenAI, Google, OpenRouter) | ⚠️ Anthropic + OpenAI working; Google + OpenRouter stubbed | Pluggable interface in `models/_base.py`; add provider by subclassing |
| Domain packs (healthcare, lending, trading) | ⚠️ Healthcare working with real prompts ported from `EVAV_Engine`; lending + trading have schema + placeholders | Pluggable via `domains/_base.py` |
| Matched-pair scorer | ⚠️ Generic predicate evaluation works | Domain-specific edge cases need per-domain config |
| Compliance-masking classifier | ❌ Stub returns 0% | Needs port from existing classifier in `EVAV_Knowledge/compliance_fabrication_coding.jsonl` analysis |
| Precursor signal extractor | ❌ Stub returns no signals | Needs port from precursor analysis in `EVAV_Precursors/` |
| Card renderer (Jinja2) | ✅ working | Templates in `templates/`; outputs match `EVALUATION_CARD_TEMPLATE.md` |
| Report renderer | ✅ working | Uses `product/templates/audit_report.template.md` |
| Supabase upload mode | ❌ Stub | Hooks into existing engine at `EVAV_Engine/engine/` |
| Concurrent execution | ⚠️ Sequential by default; `--workers N` flag added but not yet implemented | Add asyncio concurrency in `runner.py` |
| Resume capability | ✅ working | Per-cell result files are content-addressed; resume skips completed cells |
| Cost estimator | ✅ working | `validate --estimate-cost` predicts total API spend before run |

This scaffolding is production-shaped but not production-complete. Engineering takes this as the starting point and fills in:

- Real masking classifier (port from existing analysis pipeline)
- Real precursor extractor (port from EVAV_Precursors/)
- Google + OpenRouter adapters (follow the Anthropic pattern in models/anthropic.py)
- Lending + trading domain packs (follow healthcare pattern in domains/healthcare.py)
- Concurrent cell execution (asyncio + semaphore)
- Supabase mode hookup

Estimated engineering effort to complete: ~3 weeks for one engineer.
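
One of the open items is concurrent cell execution with asyncio and a semaphore. A minimal sketch of that pattern is below, assuming each cell can be run as an independent coroutine. `run_cell` is a hypothetical placeholder that sleeps instead of calling a model, and the `active`/`peak` counters exist only to show that the cap holds.

```python
import asyncio

active = 0  # instrumentation only: cells currently running
peak = 0    # instrumentation only: highest concurrency observed


async def run_cell(cell: str) -> dict:
    """Hypothetical stand-in for one cell's model calls and scoring."""
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)  # simulates waiting on a model API
    active -= 1
    return {"cell": cell, "status": "done"}


async def run_battery(cells: list[str], workers: int) -> list[dict]:
    """Run all cells with at most `workers` in flight at once."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(cell: str) -> dict:
        async with semaphore:  # blocks while `workers` cells are already running
            return await run_cell(cell)

    # gather returns results in the same order as the input cells
    return await asyncio.gather(*(guarded(c) for c in cells))


cells = [f"A-{i}" for i in range(10)]
results = asyncio.run(run_battery(cells, workers=3))
print(len(results), peak)
```

A semaphore caps in-flight requests without changing result order, which can help stay under provider rate limits when `--workers N` is eventually wired up.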

Versioning

| CLI version | Battery version | Schema version |
| --- | --- | --- |
| 1.0.0 | v1.0 | v1.0 |

Help

oa-bench --help
oa-bench <command> --help

Release history

| Version | Released |
| --- | --- |
| 1.4.0 | May 20, 2026 |
| 1.3.0 (this version) | May 20, 2026 |
| 1.0.6 | May 15, 2026 |
| 1.0.5 | May 14, 2026 |
| 1.0.4 | May 11, 2026 |
| 1.0.3 | May 11, 2026 |
| 1.0.2 | May 11, 2026 |
| 1.0.1 | May 11, 2026 |
| 1.0.0 | May 11, 2026 |

Download files

Source Distribution: evav-1.3.0.tar.gz (114.9 kB), uploaded May 20, 2026

Built Distribution: evav-1.3.0-py3-none-any.whl (134.0 kB), uploaded May 20, 2026, Python 3

File details: evav-1.3.0.tar.gz

- Upload date: May 20, 2026
- Size: 114.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `6a8b63f0a292efdabd68eb718d598613ef7f1d295311c4f34b11ed86b405f3f5` |
| MD5 | `81aab858600c162e72ecd69b66264cc8` |
| BLAKE2b-256 | `030189f3d39725f724e3288216380d97effe94a87eab39cd8b13f01c2f3a6c36` |

File details: evav-1.3.0-py3-none-any.whl

- Upload date: May 20, 2026
- Size: 134.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `68ac3765cb94977363a4be1f12505b57ac466b40aafac7c83ac20aced01443ea` |
| MD5 | `6d882db0b9985f4a64a343ec0cab4c2d` |
| BLAKE2b-256 | `a6e18c29909339fbf8bf0f6c73e577e75c9e5500f53c84671fa4df4e1fe8d731` |
