Verifiable Rewards for Real-World AI Agent Tasks
vr.dev — Verifiable Rewards for Real-World AI Agent Tasks
v1.0.0 — 30 verifiers · 14 domains · composition engine · Merkle evidence
Evidence-bearing, auditable verification of AI agent completions across filesystem, API, email, calendar, code-quality, e-commerce, git, and telecom domains. Now with async support, a hosted FastAPI service, and evidence persistence.
pip install vrdev # core
pip install vrdev[llm] # + OpenAI judge
pip install vrdev[mcp] # + MCP server
pip install vrdev[all] # everything
pip install vrdev[dev] # + pytest, pytest-asyncio & ruff
Quick Start
Python API
from vrdev import get_verifier, VerifierInput
v = get_verifier("vr/filesystem.file_created")
result = v.verify(VerifierInput(
    completions=["I created the file"],
    ground_truth={"expected_path": "/tmp/output.txt"},
))
print(result[0].verdict)   # PASS or FAIL
print(result[0].score)     # 0.0 – 1.0
print(result[0].evidence)  # {"file_exists": True, ...}
Async API
import asyncio
from vrdev import get_verifier, VerifierInput
async def main():
    v = get_verifier("vr/filesystem.file_created")
    result = await v.async_verify(VerifierInput(
        completions=["I created the file"],
        ground_truth={"expected_path": "/tmp/output.txt"},
    ))
    print(result[0].verdict)
asyncio.run(main())
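When many completions need checking, the async entry point can be fanned out with a concurrency cap. The sketch below is only an illustration: `StubVerifier` is a hypothetical stand-in that mimics the `async_verify` coroutine name so the example runs without vrdev installed, and `verify_all` and its `limit` parameter are assumptions, not part of the library.

```python
import asyncio


class StubVerifier:
    """Hypothetical stand-in; a real verifier would come from get_verifier()."""

    async def async_verify(self, payload):
        await asyncio.sleep(0)  # yield control, as real I/O would
        return [{"verdict": "PASS", "input": payload}]


async def verify_all(verifier, payloads, limit=4):
    """Run async_verify over all payloads with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def one(payload):
        async with sem:
            return await verifier.async_verify(payload)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(one(p) for p in payloads))


results = asyncio.run(verify_all(StubVerifier(), ["a", "b", "c"]))
print(len(results))
```

Swapping `StubVerifier()` for a verifier returned by `get_verifier(...)` and the strings for `VerifierInput` objects should give the same shape of result, assuming `async_verify` behaves as shown in the snippet above.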
CLI
# Run a verifier
vr verify vr/filesystem.file_created \
    --completion "done" \
    --ground-truth '{"expected_path": "/tmp/out.txt"}'
# List all verifiers
vr registry list
# Search verifiers
vr registry search email
# Run test fixtures
vr test vr/filesystem.file_created
# Show config
vr config show
# Initialize config file
vr config init
MCP Server (Claude Desktop / Cursor)
vr mcp serve
Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json):
{
  "mcpServers": {
    "vrdev": {
      "command": "vr",
      "args": ["mcp", "serve"]
    }
  }
}
The MCP server exposes 5 tools:
| Tool | Description |
|---|---|
| `list_verifiers` | List all registered verifier IDs |
| `run_verifier` | Run a verifier with input |
| `compose_chain` | Run composed verifier chain |
| `explain_failure` | Get human-readable failure explanation |
| `search_verifiers` | Keyword search across verifiers |
Configuration
Config lives at ~/.vrdev/config.toml with VRDEV_* env var overrides.
vr config init # create default config
vr config show # display current config
[openai]
api_key = ""
model = "gpt-4o-mini"
temperature = 0.0
max_tokens = 1024
[imap]
host = "localhost"
port = 993
username = ""
password = ""
[http]
timeout = 15.0
Environment variable overrides (highest precedence):
export VRDEV_OPENAI_API_KEY="sk-..."
export VRDEV_OPENAI_MODEL="gpt-4o"
export VRDEV_IMAP_HOST="imap.example.com"
export VRDEV_HTTP_TIMEOUT="30.0"
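The precedence order above (environment variable, then config file, then built-in default) can be sketched as follows. This is not vrdev's actual loader: `resolve_setting` and `DEFAULTS` are hypothetical names, and the `VRDEV_<SECTION>_<KEY>` naming rule is inferred from the examples shown here.

```python
import os

# Assumed defaults, copied from the config.toml values shown above.
DEFAULTS = {
    "openai": {"model": "gpt-4o-mini", "temperature": 0.0},
    "http": {"timeout": 15.0},
}


def resolve_setting(section, key, file_config, env=None):
    """Return the env override if set, else the file value, else the default."""
    env = os.environ if env is None else env
    default = DEFAULTS.get(section, {}).get(key)
    env_name = f"VRDEV_{section}_{key}".upper()
    if env_name in env:
        raw = env[env_name]
        # Env values are always strings; coerce when the default is a float.
        return float(raw) if isinstance(default, float) else raw
    return file_config.get(section, {}).get(key, default)


file_config = {"http": {"timeout": 20.0}}
print(resolve_setting("http", "timeout", file_config, env={}))
print(resolve_setting("http", "timeout", file_config,
                      env={"VRDEV_HTTP_TIMEOUT": "30.0"}))
```

Passing `env` explicitly keeps the function easy to test; leaving it out falls back to the real process environment.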
Verifiers (19)
| ID | Tier | Domain | Source |
|---|---|---|---|
| `vr/filesystem.file_created` | HARD | Filesystem | OSWorld |
| `vr/git.commit_present` | HARD | Git | SWE-bench |
| `vr/code.python.lint_ruff` | HARD | Code quality | Zeno-bench |
| `vr/code.python.tests_pass` | HARD | Code quality | SWE-bench |
| `vr/tau2.retail.order_cancelled` | HARD | Retail API | τ²-bench |
| `vr/tau2.retail.refund_processed` | HARD | Retail API | τ²-bench |
| `vr/tau2.retail.inventory_updated` | HARD | Retail API | τ²-bench |
| `vr/tau2.policy.constraint_not_violated` | HARD | Policy | τ²-bench |
| `vr/tau2.airline.rebooking_correct` | HARD | Airline API | τ²-bench |
| `vr/tau2.telecom.plan_changed` | HARD | Telecom CRM | τ²-bench |
| `vr/web.ecommerce.order_placed` | HARD | E-commerce | WebArena |
| `vr/web.browser.element_visible` | HARD | Browser DOM | WebArena |
| `vr/web.browser.screenshot_match` | HARD | Browser visual | WebArena |
| `vr/aiv.email.sent_folder_confirmed` | AGENTIC | Email/IMAP | VAGEN |
| `vr/aiv.calendar.event_created` | AGENTIC | Calendar API | VAGEN |
| `vr/aiv.shell.state_probe` | AGENTIC | Shell | VAGEN |
| `vr/rubric.email.tone_professional` | SOFT | Email rubric | Proofs paper |
| `vr/rubric.code.logic_correct` | SOFT | Code logic | Proofs paper |
| `vr/rubric.summary.faithful` | SOFT | NLP | Proofs paper |
Verification Tiers
- HARD — Deterministic, state-based checks (API calls, file existence, lint output)
- SOFT — LLM-judged rubric evaluation (stochastic, requires vrdev[llm])
- AGENTIC — Latent-state verification via external systems (IMAP, CalDAV)
Architecture
┌──────────────────────────────────────────────────┐
│                     Adapters                     │
│  CLI (click)  │  MCP Server  │    Python API     │
├───────────────┴──────────────┴───────────────────┤
│                Composition Engine                │
│         compose() · z_score_normalize()          │
├──────────────────────────────────────────────────┤
│               Base Verifier (ABC)                │
│   verify(VerifierInput) → [VerificationResult]   │
├──────────────────────────────────────────────────┤
│                     Runners                      │
│  Sandbox │ HTTP │ IMAP │ Managed IMAP │ Browser  │
│                LLM Judge (OpenAI)                │
├──────────────────────────────────────────────────┤
│              Core Types (Pydantic)               │
│ Verdict · Tier · VerificationResult · Scorecard  │
└──────────────────────────────────────────────────┘
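The composition layer names a `z_score_normalize()` helper, but its signature is not shown here. As a rough guide to what z-score normalization of a batch of scores usually means, here is a generic sketch; `z_scores` is a hypothetical function, and the zero-variance handling is an assumption rather than the library's documented behavior.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Center scores on their mean and scale by the population std deviation."""
    if not scores:
        return []
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores identical: no signal to normalize, so return zeros.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


print(z_scores([0.0, 1.0]))
```

Normalizing within a group of completions for the same prompt is the usual pattern in GRPO-style training, where only relative quality within the group matters.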
Composition
Chain multiple verifiers with AND logic and policy control:
from vrdev import get_verifier, compose, VerifierInput
from vrdev.core.types import PolicyMode
chain = compose(
    [get_verifier("vr/filesystem.file_created"),
     get_verifier("vr/tau2.retail.order_cancelled")],
    policy_mode=PolicyMode.FAIL_CLOSED,
)
results = chain.verify(VerifierInput(
    completions=["done"],
    ground_truth={"expected_path": "/tmp/out.txt", "order_id": "ORD-001"},
    context={"api_base_url": "http://localhost:8080"},
))
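The exact semantics of `PolicyMode.FAIL_CLOSED` are not spelled out here. One plausible reading of AND logic with a fail-closed policy is sketched below; the `Verdict` enum is a local stand-in for vrdev's own type, and `and_chain` is a hypothetical helper, not the composition engine itself.

```python
from enum import Enum


class Verdict(Enum):
    """Local stand-in mirroring the four verdicts listed in this README."""
    PASS = "PASS"
    FAIL = "FAIL"
    UNVERIFIABLE = "UNVERIFIABLE"
    ERROR = "ERROR"


def and_chain(verdicts, fail_closed=True):
    """Combine per-verifier verdicts: every step must PASS for the chain to PASS."""
    if any(v is Verdict.FAIL for v in verdicts):
        return Verdict.FAIL
    inconclusive = [v for v in verdicts if v is not Verdict.PASS]
    if inconclusive:
        # Assumed policy: fail-closed turns any inconclusive step into FAIL;
        # otherwise the first inconclusive verdict is surfaced unchanged.
        return Verdict.FAIL if fail_closed else inconclusive[0]
    return Verdict.PASS


print(and_chain([Verdict.PASS, Verdict.UNVERIFIABLE]).value)
print(and_chain([Verdict.PASS, Verdict.UNVERIFIABLE], fail_closed=False).value)
```

Under this reading, fail-closed is the conservative choice for reward signals: a chain never reports success unless every verifier positively confirmed it.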
Registry Validation
Validate VERIFIER.json / SKILL.json specs against schemas:
vr registry validate path/to/VERIFIER.json
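import json
from vrdev import load_verifier_spec, validate_verifier_spec
# Parse the spec into a dict first (load_verifier_spec may do this for you; its signature is not shown here)
with open("path/to/VERIFIER.json") as f:
    spec_dict = json.load(f)
errors = validate_verifier_spec(spec_dict)
if not errors:
    print("Valid!")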
Training-Data Export
Export verification results as JSONL for GRPO / DPO pipelines:
# CLI — export to file
vr export vr/filesystem.file_created completions.txt \
    -g ground_truth.json -o train.jsonl
# CLI — pipe to stdout
vr export vr/code.python.lint_ruff code_samples.json
from vrdev import get_verifier, VerifierInput, export_jsonl
v = get_verifier("vr/filesystem.file_created")
inp = VerifierInput(completions=["done"], ground_truth={"expected_path": "/tmp/f"})
results = v.verify(inp)
with open("train.jsonl", "w") as f:
    export_jsonl(results, inp, "vr/filesystem.file_created", f)
Each JSONL line contains: completion, score, verdict, verifier_id,
breakdown, provenance, ground_truth, artifact_hash, exported_at.
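A downstream training script would typically read the exported file back and keep only the records it wants. The sketch below uses just the field names listed above; it assumes `verdict` is serialized as a plain string such as `"PASS"` and `score` as a number, which the source does not confirm. `load_passing` and the sample records are hypothetical.

```python
import io
import json


def load_passing(fp, min_score=0.5):
    """Return (completion, score) pairs for PASS records at or above min_score."""
    rows = []
    for line in fp:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        rec = json.loads(line)
        if rec["verdict"] == "PASS" and rec["score"] >= min_score:
            rows.append((rec["completion"], rec["score"]))
    return rows


# Made-up sample data standing in for a real train.jsonl file.
sample = io.StringIO(
    '{"completion": "done", "score": 1.0, "verdict": "PASS"}\n'
    '{"completion": "oops", "score": 0.0, "verdict": "FAIL"}\n'
)
print(load_passing(sample))
```

For a real export, replace the `StringIO` object with `open("train.jsonl")`.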
Hosted API (vr-api)
The packages/vr-api/ directory contains a FastAPI service that wraps the
vrdev SDK with authentication, rate limiting, and evidence persistence.
Endpoints
| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/health` | No | Health check |
| `POST` | `/v1/verify` | Yes | Run a verifier |
| `POST` | `/v1/compose` | Yes | Run composed chain |
| `GET` | `/v1/verifiers` | Yes | List all verifiers |
| `POST` | `/v1/export` | Yes | Verify + export JSONL |
| `POST` | `/v1/batch` | Yes | Batch verify multiple inputs |
| `GET` | `/v1/evidence/{hash}` | Yes | Retrieve stored evidence |
| `GET` | `/v1/evidence` | Yes | List evidence records |
| `GET` | `/v1/usage` | Yes | Usage statistics |
Running locally
# With Docker
cp packages/vr-api/.env.example packages/vr-api/.env
docker compose up
# Without Docker
pip install packages/vrdev packages/vr-api
uvicorn vr_api.app:app --reload
Configuration
| Env var | Default | Description |
|---|---|---|
| `DATABASE_URL` | `sqlite+aiosqlite:///:memory:` | PostgreSQL / NeonDB connection string |
| `VR_API_KEYS` | (empty = auth disabled) | Comma-separated valid API keys |
| `VR_RATE_LIMIT_PER_MINUTE` | `60` | Per-key rate limit |
| `VR_EVIDENCE_TTL_DAYS` | `90` | Evidence retention period |
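The `VR_API_KEYS` rule (comma-separated keys, with an empty value disabling auth) can be pictured with the small sketch below. It illustrates the rule as stated in the table and is not the service's actual code; `parse_api_keys` and `is_authorized` are hypothetical names.

```python
def parse_api_keys(raw):
    """Split a comma-separated string into a set of non-empty, trimmed keys."""
    return {key.strip() for key in raw.split(",") if key.strip()}


def is_authorized(presented_key, valid_keys):
    """Allow every request when no keys are configured; otherwise require a match."""
    if not valid_keys:
        return True  # empty VR_API_KEYS means auth is disabled
    return presented_key in valid_keys


keys = parse_api_keys("alpha, beta,")
print(sorted(keys))
print(is_authorized("beta", keys), is_authorized("gamma", keys))
```

Because an empty value turns authentication off, a deployment that expects auth should check that `VR_API_KEYS` is actually set before exposing the service.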
Development
git clone https://github.com/vr-dev-org/vr-dev.git
cd vr-dev/packages/vrdev
pip install -e ".[dev]"
pytest # run all tests
ruff check src/ # lint
Test Suite
tests/
├── test_types.py # Core type validation
├── test_compose.py # Composition engine
├── test_normalize.py # Z-score normalization
├── test_sandbox.py # Sandbox runner
├── test_artifact.py # Artifact hashing
├── test_router.py # Skill router
├── test_filesystem.py # FileCreatedVerifier
├── test_tau2.py # τ²-bench verifiers (3)
├── test_telecom.py # PlanChangedVerifier
├── test_aiv_email.py # SentFolderConfirmedVerifier
├── test_rubric_email.py # ToneProfessionalVerifier
├── test_rubric_code.py # LogicCorrectVerifier
├── test_registry.py # Verifier registry
├── test_llm.py # LLM judge protocol
├── test_e2e.py # End-to-end integration
├── test_config.py # Config system
├── test_registry_loader.py # Registry validation
├── test_lint_ruff.py # LintRuffVerifier
├── test_git_commit.py # CommitPresentVerifier
├── test_webarena.py # OrderPlacedVerifier
├── test_calendar.py # EventCreatedVerifier
├── test_mcp.py # MCP server
├── test_trl.py # TRL adapter
├── test_verl.py # veRL adapter
├── test_export.py # JSONL export
├── test_http_runner.py # HTTP runner
├── test_imap_runner.py # IMAP runner
├── test_openclaw.py # OpenClaw adapter
├── test_async.py # Async wrappers
├── test_browser_runner.py # Browser runner stub
├── test_managed_imap.py # Managed IMAP pool
└── mocks/
    ├── tau2_server.py # τ²-bench mock API
    ├── webarena_server.py # WebArena mock API
    ├── calendar_server.py # Calendar mock API
    ├── telecom_server.py # Telecom CRM mock API
    └── imap_mock.py # IMAP mock runner
Verdict Enum
- PASS — verification succeeded
- FAIL — verification found a deficiency
- UNVERIFIABLE — could not determine (ambiguous state)
- ERROR — infrastructure/config failure (not an agent failure)
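Since ERROR and UNVERIFIABLE do not reflect agent behavior, a training pipeline may prefer to drop such samples rather than penalize them. The sketch below shows one way to do that; it is a design suggestion and not vrdev behavior, `reward_for` is a hypothetical helper, and verdicts are assumed to be available as plain strings.

```python
def reward_for(verdict, score):
    """Map a verdict and score to a training reward, or None to skip the sample."""
    if verdict in ("ERROR", "UNVERIFIABLE"):
        # Not evidence about the agent: exclude from the reward signal.
        return None
    if verdict == "PASS":
        return score
    return 0.0  # FAIL


samples = [("PASS", 0.8), ("FAIL", 0.3), ("ERROR", 0.0)]
rewards = [r for r in (reward_for(v, s) for v, s in samples) if r is not None]
print(rewards)
```

Whether a FAIL should keep its partial score instead of collapsing to 0.0 depends on the training setup; the hard zero used here is only one option.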
License
MIT