# cane-eval

**The agent reliability layer.** LLM-as-Judge eval, schema validation, latency tracking, and reliability scoring for AI agents. Catch what breaks in production before it ships.

```bash
pip install cane-eval
```
## What it does

LLM-as-Judge eval + schema validation + latency tracking + reliability scoring. One tool, one score, one answer: would this break in production?

```text
Support Agent                                        28.4s
Overall:     [=========----------] 47
1 passed   1 warned   3 failed   (5 total)
Pass rate:   20%
Latency:     p50: 1.2s   p95: 8.4s   max: 12.1s
Schema:      3/5 valid (60%)
Reliability: [=======-----------] 52 (D)
```
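The p50/p95/max figures above are latency percentiles over all test calls. A minimal sketch of how such stats can be computed (nearest-rank method; not cane-eval's internals):

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    ordered = sorted(samples)
    # ceil(pct/100 * n) via negated floor division, clamped to at least 1
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[int(rank) - 1]

latencies_ms = [1100, 1200, 1300, 8400, 12100]  # per-test latencies
stats = {
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "max": max(latencies_ms),
}
```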
## 30-Second Demo

```bash
export ANTHROPIC_API_KEY=sk-ant-...
cane-eval demo
```
## Quick Start

**1. Define tests (`tests.yaml`):**

```yaml
name: Support Agent
criteria:
  - key: accuracy
    weight: 40
  - key: completeness
    weight: 30
  - key: hallucination
    weight: 30

# Optional: validate response structure
schema:
  type: object
  required: [answer, sources]
  properties:
    answer: { type: string }
    sources: { type: array }

# Optional: latency target for reliability scoring
latency_target_ms: 5000

tests:
  - question: What is the return policy?
    expected_answer: 30-day return policy for unused items with receipt
  - question: How do I reset my password?
    expected_answer: Go to Settings > Security > Reset Password
```
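The optional `schema` block is JSON Schema. A minimal stdlib check covering just the `required` and `properties.type` keywords shown above (a sketch for intuition — cane-eval presumably uses a full JSON Schema validator):

```python
# Map JSON Schema type names to Python types (subset used above)
TYPE_MAP = {"object": dict, "array": list, "string": str, "number": (int, float)}

def check_schema(response, schema):
    """Return a list of violations; an empty list means the response is valid."""
    errors = []
    for key in schema.get("required", []):
        if key not in response:
            errors.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key in response and not isinstance(response[key], TYPE_MAP[spec["type"]]):
            errors.append(f"{key}: expected {spec['type']}")
    return errors

schema = {
    "type": "object",
    "required": ["answer", "sources"],
    "properties": {"answer": {"type": "string"}, "sources": {"type": "array"}},
}
```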
**2. Run:**

```bash
cane-eval run tests.yaml
```
**3. Add production checks:**

```bash
# Validate responses against a JSON schema
cane-eval run tests.yaml --schema schema.json --fail-on-schema

# Fail if p95 latency exceeds 10 seconds
cane-eval run tests.yaml --latency-p95 10000

# Both, plus mine failures into training data
cane-eval run tests.yaml --schema schema.json --latency-p95 10000 --mine --export dpo
```
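Conceptually, `--mine --export dpo` turns each failed test into a preference pair: the expected answer wins ("chosen") against the agent's failing output ("rejected"). A sketch of that shape, using a hypothetical result-record format rather than cane-eval's exact export:

```python
def mine_dpo_pairs(results):
    """Convert failed eval results into DPO-style preference pairs."""
    pairs = []
    for r in results:
        if r["verdict"] == "fail":
            pairs.append({
                "prompt": r["question"],
                "chosen": r["expected_answer"],   # reference answer wins
                "rejected": r["response"],        # failing agent output loses
            })
    return pairs

results = [
    {"verdict": "pass", "question": "Return policy?",
     "expected_answer": "30 days", "response": "30 days"},
    {"verdict": "fail", "question": "Reset password?",
     "expected_answer": "Settings > Security", "response": "Contact support"},
]
```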
## Reliability Score

Every eval run produces an Agent Reliability Score (0-100) across three pillars:

| Pillar | What it measures | How |
|---|---|---|
| Correctness | Is the answer accurate and complete? | LLM judge (accuracy, completeness, hallucination) |
| Structural | Does the response match the expected format? | JSON schema validation |
| Performance | Is it fast enough for production? | p95 latency vs. target |

Grades: **A** (90+) production-ready, **B** (75+) mostly reliable, **C** (60+) needs work, **D** (40+) significant gaps, **F** (<40) not ready.
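The grade cut-offs above translate directly into code. The pillar weighting below is a hypothetical equal split, purely to illustrate how three 0-100 pillar scores could compose into one number (the library's actual formula isn't documented here):

```python
def reliability_score(correctness, structural, performance):
    """Combine three 0-100 pillar scores; equal weights are an assumption."""
    return round((correctness + structural + performance) / 3)

def grade(score):
    """Letter grade per the documented cut-offs: A 90+, B 75+, C 60+, D 40+, else F."""
    for cutoff, letter in [(90, "A"), (75, "B"), (60, "C"), (40, "D")]:
        if score >= cutoff:
            return letter
    return "F"
```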
## Multi-Model Judging

Use any LLM as the judge; the provider is auto-detected from the model name.

```bash
cane-eval run tests.yaml                                              # Claude (default)
cane-eval run tests.yaml --provider openai --model gpt-4o             # OpenAI
cane-eval run tests.yaml --provider gemini --model gemini-2.0-flash   # Gemini
cane-eval run tests.yaml --provider ollama --model llama3 --base-url http://localhost:11434/v1  # Local
```

```bash
pip install cane-eval[openai]          # OpenAI
pip install cane-eval[gemini]          # Google Gemini
pip install cane-eval[all-providers]   # everything
```
## CLI

```bash
cane-eval run tests.yaml                          # run eval
cane-eval run tests.yaml --schema schema.json     # + schema validation
cane-eval run tests.yaml --latency-p95 10000      # + latency threshold
cane-eval run tests.yaml --mine --export dpo      # + failure mining
cane-eval rca tests.yaml --targeted               # root cause analysis
cane-eval diff old.json new.json                  # regression diff
cane-eval demo                                    # try it in 30 seconds
```
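Conceptually, the regression diff compares two run summaries and flags tests whose score dropped. A sketch over a hypothetical per-test score mapping (not cane-eval's actual JSON layout):

```python
def diff_runs(old, new, threshold=5):
    """Report tests whose score regressed by more than `threshold` points."""
    regressions = []
    for name, new_score in new.items():
        old_score = old.get(name)
        if old_score is not None and old_score - new_score > threshold:
            regressions.append((name, old_score, new_score))
    return regressions

old = {"return_policy": 90, "reset_password": 85}
new = {"return_policy": 88, "reset_password": 60}
```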
## Python API

```python
from cane_eval import TestSuite, EvalRunner

suite = TestSuite.from_yaml("tests.yaml")
runner = EvalRunner(
    schema={"type": "object", "required": ["answer"]},
    latency_p95=10000,
)
summary = runner.run(suite, agent=lambda q: my_agent.ask(q))

print(f"Score: {summary.overall_score}")
print(f"Reliability: {summary.reliability_score} ({summary.reliability_grade})")
print(f"Latency p95: {summary.latency.p95_ms}ms")
print(f"Schema: {summary.schema_pass}/{summary.schema_pass + summary.schema_fail} valid")
```
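The `agent=` callable simply maps a question string to a response string. A hypothetical stub agent for local experimentation, returning the JSON shape the schema in Quick Start expects:

```python
import json

# Canned question -> answer mapping (stand-in for a real agent)
FAQ = {
    "What is the return policy?": "30-day return policy for unused items with receipt",
    "How do I reset my password?": "Go to Settings > Security > Reset Password",
}

def stub_agent(question):
    """Answer from the canned FAQ; unknown questions get an empty sources list."""
    answer = FAQ.get(question, "I don't know.")
    return json.dumps({"answer": answer, "sources": ["faq"] if question in FAQ else []})
```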
## Framework Integrations

```python
from cane_eval import evaluate_langchain, evaluate_llamaindex, evaluate_openai, evaluate_fastapi

results = evaluate_langchain(chain, suite="qa.yaml")
results = evaluate_llamaindex(query_engine, suite="qa.yaml")
results = evaluate_openai("http://localhost:11434/v1/chat/completions", suite="qa.yaml")
results = evaluate_fastapi("http://localhost:8000/ask", suite="qa.yaml")
```
## Eval Targets

Suites can also run against an HTTP endpoint or a CLI tool instead of a Python callable:

```yaml
# HTTP endpoint
target:
  type: http
  url: https://my-agent.com/api/ask
  payload_template: '{"query": "{{question}}"}'
  response_path: data.answer
```

```yaml
# CLI tool
target:
  type: command
  command: python my_agent.py --query "{{question}}"
```
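For HTTP targets, the `{{question}}` placeholder is substituted into `payload_template`, and `response_path` is a dotted path into the JSON reply. A minimal sketch of both mechanisms (illustrative only, not the library's parser):

```python
import json

def render_payload(template, question):
    """Substitute the question into the template, JSON-escaping it first."""
    escaped = json.dumps(question)[1:-1]  # escape quotes/backslashes, drop outer quotes
    return json.loads(template.replace("{{question}}", escaped))

def extract(reply, path):
    """Walk a dotted path like 'data.answer' through nested dicts."""
    node = reply
    for key in path.split("."):
        node = node[key]
    return node

payload = render_payload('{"query": "{{question}}"}', "What is the return policy?")
answer = extract({"data": {"answer": "30 days"}}, "data.answer")
```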
## CI

```yaml
# .github/workflows/eval.yml
- run: pip install cane-eval
- run: cane-eval run tests.yaml --schema schema.json --latency-p95 10000 --quiet
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

The run exits with code 1 on failures. Add `--fail-on-warn` or `--fail-on-schema` for stricter checks.
## How It Works

```text
YAML Suite --> Agent --> LLM Judge ------> Reliability Score (A-F)
                 |           |
                 |           v
                 |      Schema Check
                 |      Latency Stats
                 |           |
                 v           v
            Root Cause    Failure -----> Training Data
            Analysis      Mining         (DPO/SFT/OpenAI)
```
## License

MIT