vr.dev — Verifiable Rewards for Real-World AI Agent Tasks

v0.4.0 — Hosted API + Async

Evidence-bearing, auditable verification of AI agent completions across filesystem, API, email, calendar, code-quality, e-commerce, git, and telecom domains. Now with async support, a hosted FastAPI service, and evidence persistence.

pip install vrdev            # core
pip install vrdev[llm]       # + OpenAI judge
pip install vrdev[mcp]       # + MCP server
pip install vrdev[all]       # everything
pip install vrdev[dev]       # + pytest, pytest-asyncio & ruff

Quick Start

Python API

from vrdev import get_verifier, VerifierInput

v = get_verifier("vr/filesystem.file_created")
result = v.verify(VerifierInput(
    completions=["I created the file"],
    ground_truth={"expected_path": "/tmp/output.txt"},
))
print(result[0].verdict)   # PASS or FAIL
print(result[0].score)     # 0.0 – 1.0
print(result[0].evidence)  # {"file_exists": True, ...}

Async API

import asyncio
from vrdev import get_verifier, VerifierInput

async def main():
    v = get_verifier("vr/filesystem.file_created")
    result = await v.async_verify(VerifierInput(
        completions=["I created the file"],
        ground_truth={"expected_path": "/tmp/output.txt"},
    ))
    print(result[0].verdict)

asyncio.run(main())
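Because async_verify is awaited, several inputs can in principle be checked concurrently. The following is a minimal sketch of that fan-out pattern using asyncio.gather. To keep it runnable without vrdev installed, fake_async_verify is a hypothetical stand-in for `await v.async_verify(...)`, and the dict it returns is illustrative only, not the SDK's result type.

```python
import asyncio


async def fake_async_verify(expected_path: str) -> dict:
    """Hypothetical stand-in for `await v.async_verify(...)`."""
    await asyncio.sleep(0.01)  # simulate I/O latency (file check, HTTP call, ...)
    return {"expected_path": expected_path, "verdict": "PASS"}


async def verify_many(paths: list[str]) -> list[dict]:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(fake_async_verify(p) for p in paths))


results = asyncio.run(verify_many(["/tmp/a.txt", "/tmp/b.txt", "/tmp/c.txt"]))
for r in results:
    print(r["expected_path"], r["verdict"])
```

With the real SDK, the stand-in would be replaced by a call to async_verify on a verifier instance; whether a single verifier instance is safe to share across concurrent calls is not stated here and should be checked first.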

CLI

# Run a verifier
vr verify vr/filesystem.file_created \
  --completion "done" \
  --ground-truth '{"expected_path": "/tmp/out.txt"}'

# List all verifiers
vr registry list

# Search verifiers
vr registry search email

# Run test fixtures
vr test vr/filesystem.file_created

# Show config
vr config show

# Initialize config file
vr config init

MCP Server (Claude Desktop / Cursor)

vr mcp serve

Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "vrdev": {
      "command": "vr",
      "args": ["mcp", "serve"]
    }
  }
}
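If the Claude Desktop config already lists other MCP servers, the vrdev entry should be merged in rather than pasted over them. A minimal sketch of that merge on an in-memory dict follows; add_vrdev_server and the "other" server entry are hypothetical names used for illustration, and reading or writing the actual file is left out.

```python
import json


def add_vrdev_server(config: dict) -> dict:
    """Return a copy of `config` with the vrdev MCP entry added."""
    updated = dict(config)
    # Copy the existing server map so the caller's dict is not mutated.
    servers = dict(updated.get("mcpServers", {}))
    servers["vrdev"] = {"command": "vr", "args": ["mcp", "serve"]}
    updated["mcpServers"] = servers
    return updated


# Hypothetical existing config with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
merged = add_vrdev_server(existing)
print(json.dumps(merged, indent=2))
```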

The MCP server exposes 5 tools:

Tool              Description
list_verifiers    List all registered verifier IDs
run_verifier      Run a verifier with input
compose_chain     Run composed verifier chain
explain_failure   Get human-readable failure explanation
search_verifiers  Keyword search across verifiers

Configuration

Config lives at ~/.vrdev/config.toml with VRDEV_* env var overrides.

vr config init   # create default config
vr config show   # display current config

Example ~/.vrdev/config.toml:

[openai]
api_key = ""
model = "gpt-4o-mini"
temperature = 0.0
max_tokens = 1024

[imap]
host = "localhost"
port = 993
username = ""
password = ""

[http]
timeout = 15.0

Environment variable overrides (highest precedence):

export VRDEV_OPENAI_API_KEY="sk-..."
export VRDEV_OPENAI_MODEL="gpt-4o"
export VRDEV_IMAP_HOST="imap.example.com"
export VRDEV_HTTP_TIMEOUT="30.0"
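The precedence rule above (environment variables win over config.toml) can be sketched as a small merge function. This is an illustration, not vrdev's actual loader: the VRDEV_<SECTION>_<KEY> naming is inferred from the four examples, and apply_env_overrides is a hypothetical helper. The file config is represented as an already-parsed dict.

```python
import copy

PREFIX = "VRDEV_"


def apply_env_overrides(config: dict, environ: dict) -> dict:
    """Return a copy of `config` with VRDEV_<SECTION>_<KEY> values applied."""
    merged = copy.deepcopy(config)
    for name, raw in environ.items():
        if not name.startswith(PREFIX):
            continue
        # Assumed naming: the first token is the section, the rest is the key.
        section, _, key = name[len(PREFIX):].lower().partition("_")
        if section not in merged or key not in merged[section]:
            continue  # ignore variables that match no known setting
        current = merged[section][key]
        # Coerce the env string to the type of the file value (float, int, str).
        merged[section][key] = type(current)(raw)
    return merged


file_config = {
    "openai": {"api_key": "", "model": "gpt-4o-mini", "temperature": 0.0, "max_tokens": 1024},
    "imap": {"host": "localhost", "port": 993, "username": "", "password": ""},
    "http": {"timeout": 15.0},
}
env = {
    "VRDEV_OPENAI_MODEL": "gpt-4o",
    "VRDEV_IMAP_HOST": "imap.example.com",
    "VRDEV_HTTP_TIMEOUT": "30.0",
}
effective = apply_env_overrides(file_config, env)
print(effective["openai"]["model"], effective["http"]["timeout"])
```

Coercing to the type of the existing value keeps http.timeout a float and imap.port an int even though environment variables are always strings.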

Verifiers (12)

ID                                      Tier     Domain          Source
vr/filesystem.file_created              HARD     Filesystem      OSWorld
vr/tau2.retail.order_cancelled          HARD     Retail API      τ²-bench
vr/tau2.policy.constraint_not_violated  HARD     Policy          τ²-bench
vr/tau2.airline.rebooking_correct       HARD     Airline API     τ²-bench
vr/tau2.telecom.plan_changed            HARD     Telecom CRM     τ²-bench
vr/aiv.email.sent_folder_confirmed      AGENTIC  Email/IMAP      VAGEN
vr/aiv.calendar.event_created           AGENTIC  Calendar API    VAGEN
vr/rubric.email.tone_professional       SOFT     Email rubric    Proofs paper
vr/rubric.code.logic_correct            SOFT     Code logic      Proofs paper
vr/code.python.lint_ruff                HARD     Code quality    Zeno-bench
vr/git.commit_present                   HARD     Git history     SWE-bench
vr/web.ecommerce.order_placed           HARD     E-commerce API  WebArena

Verification Tiers

  • HARD — Deterministic, state-based checks (API calls, file existence, lint output)
  • SOFT — LLM-judged rubric evaluation (stochastic, requires vrdev[llm])
  • AGENTIC — Latent-state verification via external systems (IMAP, CalDAV)

Architecture

┌─────────────────────────────────────────────────┐
│                   Adapters                       │
│  CLI (click)  │  MCP Server  │  Python API      │
├───────────────┼──────────────┼──────────────────┤
│              Composition Engine                   │
│       compose() · z_score_normalize()            │
├──────────────────────────────────────────────────┤
│              Base Verifier (ABC)                  │
│   verify(VerifierInput) → [VerificationResult]   │
├──────────────────────────────────────────────────┤
│                   Runners                         │
│  Sandbox │ HTTP │ IMAP │ Managed IMAP │ Browser  │
│               LLM Judge (OpenAI)                  │
├──────────────────────────────────────────────────┤
│                Core Types (Pydantic)              │
│  Verdict · Tier · VerificationResult · Scorecard │
└──────────────────────────────────────────────────┘
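The composition layer in the diagram names z_score_normalize(). Its real signature is not shown in this README, so the following is only a generic sketch of z-score normalization over a batch of scores. The function name z_scores, the use of the population standard deviation, and the all-zeros result for a constant batch are assumptions made for the example.

```python
from statistics import mean, pstdev


def z_scores(scores: list[float]) -> list[float]:
    """Center scores on their mean and scale by the population std deviation."""
    if not scores:
        return []
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores identical: no spread to scale by, so return zeros.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


print(z_scores([0.2, 0.5, 0.8]))
```

Normalizing this way puts scores from different verifiers on a comparable scale, which is useful when they feed a group-relative method such as GRPO.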

Composition

Chain multiple verifiers with AND logic and policy control:

from vrdev import get_verifier, compose, VerifierInput
from vrdev.core.types import PolicyMode

chain = compose(
    [get_verifier("vr/filesystem.file_created"),
     get_verifier("vr/tau2.retail.order_cancelled")],
    policy_mode=PolicyMode.FAIL_CLOSED,
)
results = chain.verify(VerifierInput(
    completions=["done"],
    ground_truth={"expected_path": "/tmp/out.txt", "order_id": "ORD-001"},
    context={"api_base_url": "http://localhost:8080"},
))
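PolicyMode.FAIL_CLOSED is the only policy mode named in this README, and its exact semantics are not spelled out. One common reading of "fail closed" is that anything short of a clean PASS from every verifier fails the chain. The sketch below illustrates that reading with a local stand-in Verdict enum; combine_and, the fail_closed flag, and the fail-open branch are all hypothetical and may not match the library's behavior.

```python
from enum import Enum


class Verdict(str, Enum):
    """Local stand-in that mirrors the verdict names listed in this README."""
    PASS = "PASS"
    FAIL = "FAIL"
    UNVERIFIABLE = "UNVERIFIABLE"
    ERROR = "ERROR"


def combine_and(verdicts: list[Verdict], fail_closed: bool = True) -> Verdict:
    """AND-combine verdicts; assumed policy handling, not vrdev's own code."""
    if any(v is Verdict.FAIL for v in verdicts):
        return Verdict.FAIL  # an explicit failure always fails the chain
    inconclusive = [v for v in verdicts if v is not Verdict.PASS]
    if not inconclusive:
        return Verdict.PASS
    if fail_closed:
        return Verdict.FAIL  # assumed: treat ERROR/UNVERIFIABLE as failure
    return inconclusive[0]   # assumed fail-open: surface the inconclusive verdict


print(combine_and([Verdict.PASS, Verdict.ERROR]).value)
print(combine_and([Verdict.PASS, Verdict.ERROR], fail_closed=False).value)
```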

Registry Validation

Validate VERIFIER.json / SKILL.json specs against schemas:

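vr registry validate path/to/VERIFIER.json

import json
from vrdev import load_verifier_spec, validate_verifier_spec

# Parse the spec file into a dict. load_verifier_spec (imported above) may
# serve the same purpose; its signature is not shown in this README.
with open("path/to/VERIFIER.json") as f:
    spec_dict = json.load(f)

errors = validate_verifier_spec(spec_dict)
if not errors:
    print("Valid!")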

Training-Data Export

Export verification results as JSONL for GRPO / DPO pipelines:

# CLI — export to file
vr export vr/filesystem.file_created completions.txt \
  -g ground_truth.json -o train.jsonl

# CLI — pipe to stdout
vr export vr/code.python.lint_ruff code_samples.json

from vrdev import get_verifier, VerifierInput, export_jsonl

v = get_verifier("vr/filesystem.file_created")
inp = VerifierInput(completions=["done"], ground_truth={"expected_path": "/tmp/f"})
results = v.verify(inp)

with open("train.jsonl", "w") as f:
    export_jsonl(results, inp, "vr/filesystem.file_created", f)

Each JSONL line contains: completion, score, verdict, verifier_id, breakdown, provenance, ground_truth, artifact_hash, exported_at.
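A downstream training script would typically read the exported file back and turn each record into a (completion, reward) pair. The sketch below shows one way to do that using only the field names listed above. The sample records are invented for illustration and include only a subset of the fields, and skipping ERROR and UNVERIFIABLE records is a design choice assumed here, not something the exporter is documented to do.

```python
import io
import json

# Invented sample records; a real file would come from `vr export` or export_jsonl.
sample = io.StringIO(
    '{"completion": "done", "score": 1.0, "verdict": "PASS", "verifier_id": "vr/filesystem.file_created"}\n'
    '{"completion": "oops", "score": 0.0, "verdict": "FAIL", "verifier_id": "vr/filesystem.file_created"}\n'
    '{"completion": "n/a", "score": 0.0, "verdict": "ERROR", "verifier_id": "vr/filesystem.file_created"}\n'
)


def load_rewards(lines) -> list[tuple[str, float]]:
    """Parse JSONL records into (completion, reward) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumed policy: inconclusive verdicts carry no signal about the agent.
        if record["verdict"] in ("ERROR", "UNVERIFIABLE"):
            continue
        pairs.append((record["completion"], float(record["score"])))
    return pairs


rewards = load_rewards(sample)
print(rewards)
```

Dropping ERROR records follows from the verdict definitions in this README, where ERROR marks an infrastructure or config failure rather than an agent failure.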


Hosted API (vr-api)

The packages/vr-api/ directory contains a FastAPI service that wraps the vrdev SDK with authentication, rate limiting, and evidence persistence.

Endpoints

Method  Path              Auth  Description
GET     /health           No    Health check
POST    /verify           Yes   Run a verifier
POST    /compose          Yes   Run composed chain
GET     /verifiers        Yes   List all verifiers
POST    /export           Yes   Verify + export JSONL
GET     /evidence/{hash}  Yes   Retrieve stored evidence

Running locally

# With Docker
cp packages/vr-api/.env.example packages/vr-api/.env
docker compose up

# Without Docker
pip install packages/vrdev packages/vr-api
uvicorn vr_api.app:app --reload

Configuration

Env var                   Default                       Description
DATABASE_URL              sqlite+aiosqlite:///:memory:  PostgreSQL / NeonDB connection string
VR_API_KEYS               (empty = auth disabled)       Comma-separated valid API keys
VR_RATE_LIMIT_PER_MINUTE  60                            Per-key rate limit (requests per minute)
VR_EVIDENCE_TTL_DAYS      90                            Evidence retention period (days)
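The defaults in the table can be read as a small settings loader. The sketch below is not the service's actual code: ApiSettings and load_settings are hypothetical names, and the environment is passed in as a plain dict so the example runs without touching os.environ. It shows how an empty VR_API_KEYS value disables authentication and how the comma-separated list is split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiSettings:
    database_url: str
    api_keys: frozenset
    rate_limit_per_minute: int
    evidence_ttl_days: int

    @property
    def auth_enabled(self) -> bool:
        # Per the table: an empty key list means authentication is disabled.
        return bool(self.api_keys)


def load_settings(environ: dict) -> ApiSettings:
    """Build settings from an environment mapping, using the table's defaults."""
    raw_keys = environ.get("VR_API_KEYS", "")
    keys = frozenset(k.strip() for k in raw_keys.split(",") if k.strip())
    return ApiSettings(
        database_url=environ.get("DATABASE_URL", "sqlite+aiosqlite:///:memory:"),
        api_keys=keys,
        rate_limit_per_minute=int(environ.get("VR_RATE_LIMIT_PER_MINUTE", "60")),
        evidence_ttl_days=int(environ.get("VR_EVIDENCE_TTL_DAYS", "90")),
    )


defaults = load_settings({})
custom = load_settings({"VR_API_KEYS": "key-a, key-b", "VR_RATE_LIMIT_PER_MINUTE": "120"})
print(defaults.auth_enabled, custom.auth_enabled, custom.rate_limit_per_minute)
```

Note that the default DATABASE_URL is an in-memory SQLite database, so stored evidence would not survive a restart unless a persistent database is configured.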

Development

git clone <repo>
cd vr-dev/packages/vrdev
pip install -e ".[dev]"
pytest                  # run all tests
ruff check src/         # lint

Test Suite

tests/
├── test_types.py           # Core type validation
├── test_compose.py         # Composition engine
├── test_normalize.py       # Z-score normalization
├── test_sandbox.py         # Sandbox runner
├── test_artifact.py        # Artifact hashing
├── test_router.py          # Skill router
├── test_filesystem.py      # FileCreatedVerifier
├── test_tau2.py            # τ²-bench verifiers (3)
├── test_telecom.py         # PlanChangedVerifier
├── test_aiv_email.py       # SentFolderConfirmedVerifier
├── test_rubric_email.py    # ToneProfessionalVerifier
├── test_rubric_code.py     # LogicCorrectVerifier
├── test_registry.py        # Verifier registry
├── test_llm.py             # LLM judge protocol
├── test_e2e.py             # End-to-end integration
├── test_config.py          # Config system
├── test_registry_loader.py # Registry validation
├── test_lint_ruff.py       # LintRuffVerifier
├── test_git_commit.py      # CommitPresentVerifier
├── test_webarena.py        # OrderPlacedVerifier
├── test_calendar.py        # EventCreatedVerifier
├── test_mcp.py             # MCP server
├── test_trl.py             # TRL adapter
├── test_verl.py            # veRL adapter
├── test_export.py          # JSONL export
├── test_http_runner.py     # HTTP runner
├── test_imap_runner.py     # IMAP runner
├── test_openclaw.py        # OpenClaw adapter
├── test_async.py           # Async wrappers
├── test_browser_runner.py  # Browser runner stub
├── test_managed_imap.py    # Managed IMAP pool
└── mocks/
    ├── tau2_server.py      # τ²-bench mock API
    ├── webarena_server.py  # WebArena mock API
    ├── calendar_server.py  # Calendar mock API
    ├── telecom_server.py   # Telecom CRM mock API
    └── imap_mock.py        # IMAP mock runner

Verdict Enum

  • PASS — verification succeeded
  • FAIL — verification found a deficiency
  • UNVERIFIABLE — could not determine (ambiguous state)
  • ERROR — infrastructure/config failure (not an agent failure)

License

MIT
