
cane-eval

The agent reliability layer. Catch what would break in production before it ships.

pip install cane-eval

What it does

LLM-as-Judge eval + schema validation + latency tracking + reliability scoring. One tool, one score, one answer: would this break in production?

  Support Agent                              28.4s

  Overall: [=========----------] 47

  1 passed  1 warned  3 failed  (5 total)
  Pass rate: 20%

  Latency:  p50: 1.2s  p95: 8.4s  max: 12.1s
  Schema:   3/5 valid (60%)

  Reliability: [==========---------] 52 (D)

30-Second Demo

export ANTHROPIC_API_KEY=sk-ant-...
cane-eval demo

Quick Start

1. Define tests (tests.yaml):

name: Support Agent

criteria:
  - key: accuracy
    weight: 40
  - key: completeness
    weight: 30
  - key: hallucination
    weight: 30

# Optional: validate response structure
schema:
  type: object
  required: [answer, sources]
  properties:
    answer: { type: string }
    sources: { type: array }

# Optional: latency target for reliability scoring
latency_target_ms: 5000

tests:
  - question: What is the return policy?
    expected_answer: 30-day return policy for unused items with receipt
  - question: How do I reset my password?
    expected_answer: Go to Settings > Security > Reset Password

2. Run:

cane-eval run tests.yaml

3. Production checks:

# Validate responses against JSON schema
cane-eval run tests.yaml --schema schema.json --fail-on-schema

# Fail if p95 latency exceeds 10 seconds
cane-eval run tests.yaml --latency-p95 10000

# Both + mine failures into training data
cane-eval run tests.yaml --schema schema.json --latency-p95 10000 --mine --export dpo
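
The --export dpo flag turns judged failures into preference-training data. The exported schema isn't shown here, but DPO pairs conventionally take the shape below (field names are an assumption, not confirmed for cane-eval; the strings reuse the sample test above):

# Conventional DPO preference-pair shape; cane-eval's actual export may differ.
pair = {
    "prompt": "What is the return policy?",
    "chosen": "30-day return policy for unused items with receipt",
    "rejected": "(the agent's failing answer, as captured by the eval run)",
}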

Reliability Score

Every eval run produces an Agent Reliability Score (0-100) across three pillars:

Pillar       What it measures                           How
Correctness  Does the answer look good?                 LLM judge (accuracy, completeness, hallucination)
Structural   Does the response match expected format?   JSON schema validation
Performance  Is it fast enough for production?          p95 latency vs target

Grades: A (90+) production-ready, B (75+) mostly reliable, C (60+) needs work, D (40+) significant gaps, F (<40) not ready.
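
The exact pillar weighting is internal to cane-eval, but the shape of the computation is easy to sketch. A minimal illustration, assuming an equal-weight blend of three pillar scores already normalized to 0-100 (the weights are an assumption, not the library's formula; the grade cutoffs match the list above):

# Illustrative only: assumes equal pillar weights, which cane-eval may not use.
def reliability(correctness: float, structural: float, performance: float) -> tuple[int, str]:
    score = round((correctness + structural + performance) / 3)
    for cutoff, grade in [(90, "A"), (75, "B"), (60, "C"), (40, "D")]:
        if score >= cutoff:
            return score, grade
    return score, "F"

print(reliability(70, 60, 26))  # (52, 'D'), matching the sample output above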

Multi-Model Judging

Any LLM as judge. Auto-detects provider from model name.

cane-eval run tests.yaml                                                       # Claude (default)
cane-eval run tests.yaml --provider openai --model gpt-4o                      # OpenAI
cane-eval run tests.yaml --provider gemini --model gemini-2.0-flash            # Gemini
cane-eval run tests.yaml --provider ollama --model llama3 --base-url http://localhost:11434/v1  # Local

pip install cane-eval[openai]          # OpenAI
pip install cane-eval[gemini]          # Google Gemini
pip install cane-eval[all-providers]   # everything
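
The detection rule itself isn't documented; a plausible sketch, assuming a simple prefix match on the model name (the mapping below is hypothetical, not cane-eval's actual logic):

# Hypothetical name-prefix detection; not cane-eval's actual code.
PREFIXES = {"claude": "anthropic", "gpt": "openai", "gemini": "gemini", "llama": "ollama"}

def detect_provider(model: str) -> str:
    for prefix, provider in PREFIXES.items():
        if model.lower().startswith(prefix):
            return provider
    return "anthropic"  # Claude is the documented default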

CLI

cane-eval run tests.yaml                          # run eval
cane-eval run tests.yaml --schema schema.json     # + schema validation
cane-eval run tests.yaml --latency-p95 10000      # + latency threshold
cane-eval run tests.yaml --mine --export dpo      # + failure mining
cane-eval rca tests.yaml --targeted               # root cause analysis
cane-eval diff old.json new.json                  # regression diff
cane-eval demo                                    # try it in 30 seconds
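
cane-eval diff covers regression checks out of the box; if you need a custom gate on exported results, a rough equivalent looks like this (the reliability_score key in the result JSON is an assumption):

# Hypothetical custom regression gate; the JSON field name is assumed.
import json
import sys

def score(path: str) -> float:
    with open(path) as f:
        return json.load(f)["reliability_score"]  # assumed key

old, new = score("old.json"), score("new.json")
if new < old:
    sys.exit(f"Reliability regressed: {old} -> {new}")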

Python API

from cane_eval import TestSuite, EvalRunner

suite = TestSuite.from_yaml("tests.yaml")
runner = EvalRunner(
    schema={"type": "object", "required": ["answer"]},
    latency_p95=10000,
)
summary = runner.run(suite, agent=lambda q: my_agent.ask(q))

print(f"Score: {summary.overall_score}")
print(f"Reliability: {summary.reliability_score} ({summary.reliability_grade})")
print(f"Latency p95: {summary.latency.p95_ms}ms")
print(f"Schema: {summary.schema_pass}/{summary.schema_pass + summary.schema_fail} valid")

Framework Integrations

from cane_eval import evaluate_langchain, evaluate_llamaindex, evaluate_openai, evaluate_fastapi

results = evaluate_langchain(chain, suite="qa.yaml")
results = evaluate_llamaindex(query_engine, suite="qa.yaml")
results = evaluate_openai("http://localhost:11434/v1/chat/completions", suite="qa.yaml")
results = evaluate_fastapi("http://localhost:8000/ask", suite="qa.yaml")

Eval Targets

# HTTP endpoint
target:
  type: http
  url: https://my-agent.com/api/ask
  payload_template: '{"query": "{{question}}"}'
  response_path: data.answer

# CLI tool
target:
  type: command
  command: python my_agent.py --query "{{question}}"
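
An HTTP target presumably renders the payload template with each question and walks response_path through the JSON reply. A rough sketch of that behavior (illustrative, not cane-eval's implementation):

# Illustrative HTTP-target mechanics; not the library's actual code.
import json
import urllib.request

def ask_http_target(url, payload_template, response_path, question):
    escaped = json.dumps(question)[1:-1]  # escape for embedding in a JSON string
    body = payload_template.replace("{{question}}", escaped)
    req = urllib.request.Request(url, data=body.encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    for key in response_path.split("."):  # e.g. "data.answer"
        data = data[key]
    return data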

CI

# .github/workflows/eval.yml
- run: pip install cane-eval
- run: cane-eval run tests.yaml --schema schema.json --latency-p95 10000 --quiet
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Exit code 1 on failures. Add --fail-on-warn or --fail-on-schema for stricter checks.

How It Works

YAML Suite --> Agent --> LLM Judge -----> Reliability Score (A-F)
                  |          |                    |
                  |          v                    |
                  |   Schema Check                |
                  |   Latency Stats               |
                  |          |                    v
                  v          v              Training Data
            Root Cause    Failure           (DPO/SFT/OpenAI)
            Analysis      Mining

License

MIT
