
cane-eval

The agent reliability layer. Catch what would break in production before it ships.

pip install cane-eval

What it does

LLM-as-Judge eval + schema validation + latency tracking + reliability scoring. One tool, one score, one answer: would this break in production?

  Support Agent                              28.4s

  Overall: [=========----------] 47

  1 passed  1 warned  3 failed  (5 total)
  Pass rate: 20%

  Latency:  p50: 1.2s  p95: 8.4s  max: 12.1s
  Schema:   3/5 valid (60%)

  Reliability: [==========---------] 52 (D)

30-Second Demo

export ANTHROPIC_API_KEY=sk-ant-...
cane-eval demo

Quick Start

1. Define tests (tests.yaml):

name: Support Agent

criteria:
  - key: accuracy
    weight: 40
  - key: completeness
    weight: 30
  - key: hallucination
    weight: 30

# Optional: validate response structure
schema:
  type: object
  required: [answer, sources]
  properties:
    answer: { type: string }
    sources: { type: array }

# Optional: latency target for reliability scoring
latency_target_ms: 5000

tests:
  - question: What is the return policy?
    expected_answer: 30-day return policy for unused items with receipt
  - question: How do I reset my password?
    expected_answer: Go to Settings > Security > Reset Password

2. Run:

cane-eval run tests.yaml

3. Production checks:

# Validate responses against JSON schema
cane-eval run tests.yaml --schema schema.json --fail-on-schema

# Fail if p95 latency exceeds 10 seconds
cane-eval run tests.yaml --latency-p95 10000

# Both + mine failures into training data
cane-eval run tests.yaml --schema schema.json --latency-p95 10000 --mine --export dpo
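
The --export dpo flag turns judged failures into preference-training data. The exported schema isn't shown here, but DPO pairs conventionally take the shape below (field names are an assumption, not confirmed for cane-eval; the strings reuse the sample test above):

# Conventional DPO preference-pair shape; cane-eval's actual export may differ.
pair = {
    "prompt": "What is the return policy?",
    "chosen": "30-day return policy for unused items with receipt",
    "rejected": "(the agent's failing answer, as captured by the eval run)",
}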

Reliability Score

Every eval run produces an Agent Reliability Score (0-100) across three pillars:

Pillar       What it measures                           How
Correctness  Does the answer look good?                 LLM judge (accuracy, completeness, hallucination)
Structural   Does the response match expected format?   JSON schema validation
Performance  Is it fast enough for production?          p95 latency vs target

Grades: A (90+) production-ready, B (75+) mostly reliable, C (60+) needs work, D (40+) significant gaps, F (<40) not ready.
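
The exact pillar weighting is internal to cane-eval, but the shape of the computation is easy to sketch. A minimal illustration, assuming an equal-weight blend of three pillar scores already normalized to 0-100 (the weights are an assumption, not the library's formula; the grade cutoffs match the list above):

# Illustrative only: assumes equal pillar weights, which cane-eval may not use.
def reliability(correctness: float, structural: float, performance: float) -> tuple[int, str]:
    score = round((correctness + structural + performance) / 3)
    for cutoff, grade in [(90, "A"), (75, "B"), (60, "C"), (40, "D")]:
        if score >= cutoff:
            return score, grade
    return score, "F"

print(reliability(70, 60, 26))  # (52, 'D'), matching the sample output above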

Multi-Model Judging

Any LLM as judge. Auto-detects provider from model name.

cane-eval run tests.yaml                                                       # Claude (default)
cane-eval run tests.yaml --provider openai --model gpt-4o                      # OpenAI
cane-eval run tests.yaml --provider gemini --model gemini-2.0-flash            # Gemini
cane-eval run tests.yaml --provider ollama --model llama3 --base-url http://localhost:11434/v1  # Local

pip install cane-eval[openai]          # OpenAI
pip install cane-eval[gemini]          # Google Gemini
pip install cane-eval[all-providers]   # everything
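
The detection rule itself isn't documented; a plausible sketch, assuming a simple prefix match on the model name (the mapping below is hypothetical, not cane-eval's actual logic):

# Hypothetical name-prefix detection; not cane-eval's actual code.
PREFIXES = {"claude": "anthropic", "gpt": "openai", "gemini": "gemini", "llama": "ollama"}

def detect_provider(model: str) -> str:
    for prefix, provider in PREFIXES.items():
        if model.lower().startswith(prefix):
            return provider
    return "anthropic"  # Claude is the documented default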

CLI

cane-eval run tests.yaml                          # run eval
cane-eval run tests.yaml --schema schema.json     # + schema validation
cane-eval run tests.yaml --latency-p95 10000      # + latency threshold
cane-eval run tests.yaml --mine --export dpo      # + failure mining
cane-eval rca tests.yaml --targeted               # root cause analysis
cane-eval diff old.json new.json                  # regression diff
cane-eval demo                                    # try it in 30 seconds
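
cane-eval diff covers regression checks out of the box; if you need a custom gate on exported results, a rough equivalent looks like this (the reliability_score key in the result JSON is an assumption):

# Hypothetical custom regression gate; the JSON field name is assumed.
import json
import sys

def score(path: str) -> float:
    with open(path) as f:
        return json.load(f)["reliability_score"]  # assumed key

old, new = score("old.json"), score("new.json")
if new < old:
    sys.exit(f"Reliability regressed: {old} -> {new}")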

Python API

from cane_eval import TestSuite, EvalRunner

suite = TestSuite.from_yaml("tests.yaml")
runner = EvalRunner(
    schema={"type": "object", "required": ["answer"]},
    latency_p95=10000,
)
summary = runner.run(suite, agent=lambda q: my_agent.ask(q))

print(f"Score: {summary.overall_score}")
print(f"Reliability: {summary.reliability_score} ({summary.reliability_grade})")
print(f"Latency p95: {summary.latency.p95_ms}ms")
print(f"Schema: {summary.schema_pass}/{summary.schema_pass + summary.schema_fail} valid")

Framework Integrations

from cane_eval import evaluate_langchain, evaluate_llamaindex, evaluate_openai, evaluate_fastapi

results = evaluate_langchain(chain, suite="qa.yaml")
results = evaluate_llamaindex(query_engine, suite="qa.yaml")
results = evaluate_openai("http://localhost:11434/v1/chat/completions", suite="qa.yaml")
results = evaluate_fastapi("http://localhost:8000/ask", suite="qa.yaml")

Eval Targets

# HTTP endpoint
target:
  type: http
  url: https://my-agent.com/api/ask
  payload_template: '{"query": "{{question}}"}'
  response_path: data.answer

# CLI tool
target:
  type: command
  command: python my_agent.py --query "{{question}}"
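
An HTTP target presumably renders the payload template with each question and walks response_path through the JSON reply. A rough sketch of that behavior (illustrative, not cane-eval's implementation):

# Illustrative HTTP-target mechanics; not the library's actual code.
import json
import urllib.request

def ask_http_target(url, payload_template, response_path, question):
    escaped = json.dumps(question)[1:-1]  # escape for embedding in a JSON string
    body = payload_template.replace("{{question}}", escaped)
    req = urllib.request.Request(url, data=body.encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    for key in response_path.split("."):  # e.g. "data.answer"
        data = data[key]
    return data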

CI

# .github/workflows/eval.yml
- run: pip install cane-eval
- run: cane-eval run tests.yaml --schema schema.json --latency-p95 10000 --quiet
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Exit code 1 on failures. Add --fail-on-warn or --fail-on-schema for stricter checks.

How It Works

YAML Suite --> Agent --> LLM Judge -----> Reliability Score (A-F)
                  |          |                    |
                  |          v                    |
                  |   Schema Check                |
                  |   Latency Stats               |
                  |          |                    v
                  v          v              Training Data
            Root Cause    Failure           (DPO/SFT/OpenAI)
            Analysis      Mining

License

MIT
