
cane-eval

Eval toolkit for AI agents. YAML test suites, multi-model judging, failure mining, root cause analysis, and training data export.

pip install cane-eval

30-Second Demo

export ANTHROPIC_API_KEY=sk-ant-...
cane-eval demo
  Running a deliberately flawed support agent against 5 test cases...

   FAIL  52  What is your return policy?
         Missing all specific policy details customers need.
   FAIL   0  How do I reset my password?
         Entirely fabricated -- contradicts the expected process.
   WARN  66  Do you offer international shipping?
   FAIL  16  What payment methods do you accept?
         Fabricated a competitor recommendation with false claims.
   PASS 100  How do I contact customer support?

  Overall: 47/100  (28.4s)
  1 passed  1 warned  3 failed

Three failures in 28 seconds. That's what your agents are doing without evals.

Quick Start

1. Define tests (tests.yaml):

name: Support Agent
model: claude-sonnet-4-5-20250929

criteria:
  - key: accuracy
    label: Accuracy
    weight: 40
  - key: completeness
    label: Completeness
    weight: 30
  - key: hallucination
    label: Hallucination Check
    weight: 30

tests:
  - question: What is the return policy?
    expected_answer: 30-day return policy for unused items with receipt
  - question: How do I reset my password?
    expected_answer: Go to Settings > Security > Reset Password
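Since the CLI supports filtering by tags, individual tests presumably carry a tag list. A hedged sketch of how that might look (the per-test `tags` field is an assumption; check the actual schema with `cane-eval validate`):

```yaml
# Assumed shape for tagging tests so `--tags policy,account` can filter them.
tests:
  - question: What is the return policy?
    expected_answer: 30-day return policy for unused items with receipt
    tags: [policy]
  - question: How do I reset my password?
    expected_answer: Go to Settings > Security > Reset Password
    tags: [account]
```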

2. Run:

cane-eval run tests.yaml

3. Mine failures into training data:

cane-eval run tests.yaml --mine --export dpo

Multi-Model Judging

Use any LLM as the judge. The provider is auto-detected from the model name.

# Anthropic (default)
cane-eval run tests.yaml

# OpenAI
cane-eval run tests.yaml --provider openai --model gpt-4o

# Gemini
cane-eval run tests.yaml --provider gemini --model gemini-2.0-flash

# Ollama / vLLM / any OpenAI-compatible endpoint
cane-eval run tests.yaml --provider ollama --model llama3 --base-url http://localhost:11434/v1
Or use the judge directly from Python:

from cane_eval import Judge

# Auto-detects OpenAI from model name
judge = Judge(model="gpt-4o", api_key="sk-...")

# Gemini
judge = Judge(model="gemini-2.0-flash", api_key="...")

# Local Ollama
judge = Judge(provider="ollama", model="llama3", base_url="http://localhost:11434/v1")
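The auto-detection above presumably keys off model-name prefixes. A minimal sketch of that idea, where the exact prefix rules are an assumption rather than cane-eval's actual logic:

```python
def detect_provider(model: str) -> str:
    """Guess a judge provider from a model-name prefix (illustrative only)."""
    if model.startswith(("gpt-", "o1-", "o3-")):
        return "openai"
    if model.startswith("gemini-"):
        return "gemini"
    if model.startswith("claude-"):
        return "anthropic"
    # Fall back to the library default (Anthropic) when nothing matches.
    return "anthropic"
```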

Install provider dependencies:

pip install cane-eval[openai]          # OpenAI
pip install cane-eval[gemini]          # Google Gemini
pip install cane-eval[all-providers]   # everything

CLI

cane-eval run tests.yaml                          # run eval suite
cane-eval run tests.yaml --tags policy,account    # filter by tags
cane-eval run tests.yaml --export dpo             # export training data
cane-eval run tests.yaml --mine                   # mine failures + rewrite
cane-eval rca tests.yaml --threshold 60           # root cause analysis
cane-eval rca tests.yaml --targeted               # deep dive on worst failures
cane-eval diff results_v1.json results_v2.json    # regression diff
cane-eval validate tests.yaml                     # validate YAML
cane-eval run tests.yaml --quiet                  # CI mode (exit 1 on fail)

Python API

from cane_eval import TestSuite, EvalRunner, FailureMiner, RootCauseAnalyzer

suite = TestSuite.from_yaml("tests.yaml")
runner = EvalRunner()
summary = runner.run(suite, agent=lambda q: my_agent.ask(q))

print(f"Score: {summary.overall_score}")

# Root cause analysis
analyzer = RootCauseAnalyzer()
rca = analyzer.analyze(summary, max_score=60)
for rc in rca.root_causes:
    print(f"  [{rc.severity}] {rc.title} -- {rc.recommendation}")

# Mine failures into DPO training pairs
miner = FailureMiner()
mined = miner.mine(summary, max_score=60)
mined.to_file("training.jsonl", format="dpo")
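When scripting evals in Python rather than using the CLI's `--quiet` mode, you can turn the overall score into a CI exit code yourself. A minimal sketch; the 70-point threshold is an arbitrary assumption, not a library default:

```python
import sys

def gate(overall_score: float, threshold: float = 70.0) -> int:
    """Map an eval score to a process exit code: 0 passes, 1 fails the build."""
    return 0 if overall_score >= threshold else 1

# In a CI script you would end with: sys.exit(gate(summary.overall_score))
```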

Framework Integrations

One-line evaluation helpers for LangChain, LlamaIndex, OpenAI-compatible endpoints, and FastAPI agents.

from cane_eval import evaluate_langchain, evaluate_llamaindex, evaluate_openai, evaluate_fastapi

# LangChain
results = evaluate_langchain(chain, suite="qa.yaml")

# LlamaIndex
results = evaluate_llamaindex(query_engine, suite="qa.yaml")

# OpenAI-compatible (OpenAI, vLLM, Ollama, LiteLLM)
results = evaluate_openai("http://localhost:11434/v1/chat/completions", suite="qa.yaml", openai_model="llama3")

# FastAPI
results = evaluate_fastapi("http://localhost:8000/ask", suite="qa.yaml")
Install integration dependencies:

pip install cane-eval[langchain]      # LangChain
pip install cane-eval[llamaindex]     # LlamaIndex
pip install cane-eval[fastapi]        # FastAPI
pip install cane-eval[integrations]   # all of the above

Eval Targets

Point the eval at any HTTP endpoint or CLI tool:

# HTTP
target:
  type: http
  url: https://my-agent.com/api/ask
  method: POST
  payload_template: '{"query": "{{question}}"}'
  response_path: data.answer
  headers:
    Authorization: Bearer ${AGENT_API_KEY}

# CLI
target:
  type: command
  command: python my_agent.py --query "{{question}}"
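To make the HTTP target's fields concrete, here is a rough sketch of what payload_template substitution and response_path extraction amount to. This illustrates the semantics of those two fields, not cane-eval's internals:

```python
import json

def render_payload(template: str, question: str) -> dict:
    """Substitute the {{question}} placeholder and parse the result as JSON."""
    # json.dumps escapes quotes and newlines so the template stays valid JSON.
    return json.loads(template.replace('"{{question}}"', json.dumps(question)))

def extract(response: dict, path: str):
    """Walk a dotted response_path such as "data.answer" into a parsed response."""
    for key in path.split("."):
        response = response[key]
    return response
```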

Export Formats

Format   Use Case                         Structure
dpo      Direct Preference Optimization   {prompt, chosen, rejected}
sft      Supervised Fine-Tuning           {prompt, completion, metadata}
openai   OpenAI fine-tuning API           {messages: [{role, content}]}
raw      Debugging                        Full eval result with all scores
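For example, a single dpo record mined from the failing return-policy case might look like this (all values are illustrative):

```jsonl
{"prompt": "What is the return policy?", "chosen": "30-day return policy for unused items with receipt.", "rejected": "All sales are final."}
```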

CI

# .github/workflows/eval.yml
name: Agent Eval
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install cane-eval
      - run: cane-eval run tests/eval_suite.yaml --quiet
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

How It Works

YAML Suite --> Your Agent --> LLM Judge --> Training Data (DPO/SFT/OpenAI)
                                |
                                v
                         Root Cause Analysis --> fix recommendations
                                |
                                v
                         Failure Mining --> improved answer rewrites

License

MIT
