
cane-eval

Eval toolkit for AI agents. YAML test suites, multi-model judging, failure mining, root cause analysis, and training data export.

pip install cane-eval

30-Second Demo

export ANTHROPIC_API_KEY=sk-ant-...
cane-eval demo
  Running a deliberately flawed support agent against 5 test cases...

   FAIL  52  What is your return policy?
         Missing all specific policy details customers need.
   FAIL   0  How do I reset my password?
         Entirely fabricated -- contradicts the expected process.
   WARN  66  Do you offer international shipping?
   FAIL  16  What payment methods do you accept?
         Fabricated a competitor recommendation with false claims.
   PASS 100  How do I contact customer support?

  Overall: 47/100  (28.4s)
  1 passed  1 warned  3 failed

Three failures in 28 seconds. That's what your agents are doing without evals.

Quick Start

1. Define tests (tests.yaml):

name: Support Agent
model: claude-sonnet-4-5-20250929

criteria:
  - key: accuracy
    label: Accuracy
    weight: 40
  - key: completeness
    label: Completeness
    weight: 30
  - key: hallucination
    label: Hallucination Check
    weight: 30

tests:
  - question: What is the return policy?
    expected_answer: 30-day return policy for unused items with receipt
  - question: How do I reset my password?
    expected_answer: Go to Settings > Security > Reset Password
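Since the CLI supports filtering by tags, individual tests presumably carry a tag list. A hedged sketch of how that might look (the per-test `tags` field is an assumption; check the actual schema with `cane-eval validate`):

```yaml
# Assumed shape for tagging tests so `--tags policy,account` can filter them.
tests:
  - question: What is the return policy?
    expected_answer: 30-day return policy for unused items with receipt
    tags: [policy]
  - question: How do I reset my password?
    expected_answer: Go to Settings > Security > Reset Password
    tags: [account]
```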

2. Run:

cane-eval run tests.yaml

3. Mine failures into training data:

cane-eval run tests.yaml --mine --export dpo

Multi-Model Judging

Use any LLM as the judge. The provider is auto-detected from the model name.

# Anthropic (default)
cane-eval run tests.yaml

# OpenAI
cane-eval run tests.yaml --provider openai --model gpt-4o

# Gemini
cane-eval run tests.yaml --provider gemini --model gemini-2.0-flash

# Ollama / vLLM / any OpenAI-compatible endpoint
cane-eval run tests.yaml --provider ollama --model llama3 --base-url http://localhost:11434/v1
Or use the judge directly from Python:

from cane_eval import Judge

# Auto-detects OpenAI from model name
judge = Judge(model="gpt-4o", api_key="sk-...")

# Gemini
judge = Judge(model="gemini-2.0-flash", api_key="...")

# Local Ollama
judge = Judge(provider="ollama", model="llama3", base_url="http://localhost:11434/v1")
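The auto-detection above presumably keys off model-name prefixes. A minimal sketch of that idea, where the exact prefix rules are an assumption rather than cane-eval's actual logic:

```python
def detect_provider(model: str) -> str:
    """Guess a judge provider from a model-name prefix (illustrative only)."""
    if model.startswith(("gpt-", "o1-", "o3-")):
        return "openai"
    if model.startswith("gemini-"):
        return "gemini"
    if model.startswith("claude-"):
        return "anthropic"
    # Fall back to the library default (Anthropic) when nothing matches.
    return "anthropic"
```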

Install provider dependencies:

pip install cane-eval[openai]          # OpenAI
pip install cane-eval[gemini]          # Google Gemini
pip install cane-eval[all-providers]   # everything

CLI

cane-eval run tests.yaml                          # run eval suite
cane-eval run tests.yaml --tags policy,account    # filter by tags
cane-eval run tests.yaml --export dpo             # export training data
cane-eval run tests.yaml --mine                   # mine failures + rewrite
cane-eval rca tests.yaml --threshold 60           # root cause analysis
cane-eval rca tests.yaml --targeted               # deep dive on worst failures
cane-eval diff results_v1.json results_v2.json    # regression diff
cane-eval validate tests.yaml                     # validate YAML
cane-eval run tests.yaml --quiet                  # CI mode (exit 1 on fail)

Python API

from cane_eval import TestSuite, EvalRunner, FailureMiner, RootCauseAnalyzer

suite = TestSuite.from_yaml("tests.yaml")
runner = EvalRunner()
summary = runner.run(suite, agent=lambda q: my_agent.ask(q))

print(f"Score: {summary.overall_score}")

# Root cause analysis
analyzer = RootCauseAnalyzer()
rca = analyzer.analyze(summary, max_score=60)
for rc in rca.root_causes:
    print(f"  [{rc.severity}] {rc.title} -- {rc.recommendation}")

# Mine failures into DPO training pairs
miner = FailureMiner()
mined = miner.mine(summary, max_score=60)
mined.to_file("training.jsonl", format="dpo")
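When scripting evals in Python rather than using the CLI's `--quiet` mode, you can turn the overall score into a CI exit code yourself. A minimal sketch; the 70-point threshold is an arbitrary assumption, not a library default:

```python
import sys

def gate(overall_score: float, threshold: float = 70.0) -> int:
    """Map an eval score to a process exit code: 0 passes, 1 fails the build."""
    return 0 if overall_score >= threshold else 1

# In a CI script you would end with: sys.exit(gate(summary.overall_score))
```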

Framework Integrations

One-line evaluation helpers for LangChain, LlamaIndex, OpenAI-compatible endpoints, and FastAPI agents.

from cane_eval import evaluate_langchain, evaluate_llamaindex, evaluate_openai, evaluate_fastapi

# LangChain
results = evaluate_langchain(chain, suite="qa.yaml")

# LlamaIndex
results = evaluate_llamaindex(query_engine, suite="qa.yaml")

# OpenAI-compatible (OpenAI, vLLM, Ollama, LiteLLM)
results = evaluate_openai("http://localhost:11434/v1/chat/completions", suite="qa.yaml", openai_model="llama3")

# FastAPI
results = evaluate_fastapi("http://localhost:8000/ask", suite="qa.yaml")
Install integration dependencies:

pip install cane-eval[langchain]      # LangChain
pip install cane-eval[llamaindex]     # LlamaIndex
pip install cane-eval[fastapi]        # FastAPI
pip install cane-eval[integrations]   # all of the above

Eval Targets

Point the eval at any HTTP endpoint or CLI tool:

# HTTP
target:
  type: http
  url: https://my-agent.com/api/ask
  method: POST
  payload_template: '{"query": "{{question}}"}'
  response_path: data.answer
  headers:
    Authorization: Bearer ${AGENT_API_KEY}

# CLI
target:
  type: command
  command: python my_agent.py --query "{{question}}"
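To make the HTTP target's fields concrete, here is a rough sketch of what payload_template substitution and response_path extraction amount to. This illustrates the semantics of those two fields, not cane-eval's internals:

```python
import json

def render_payload(template: str, question: str) -> dict:
    """Substitute the {{question}} placeholder and parse the result as JSON."""
    # json.dumps escapes quotes and newlines so the template stays valid JSON.
    return json.loads(template.replace('"{{question}}"', json.dumps(question)))

def extract(response: dict, path: str):
    """Walk a dotted response_path such as "data.answer" into a parsed response."""
    for key in path.split("."):
        response = response[key]
    return response
```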

Export Formats

Format   Use Case                         Structure
dpo      Direct Preference Optimization   {prompt, chosen, rejected}
sft      Supervised Fine-Tuning           {prompt, completion, metadata}
openai   OpenAI fine-tuning API           {messages: [{role, content}]}
raw      Debugging                        Full eval result with all scores
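For example, a single dpo record mined from the failing return-policy case might look like this (all values are illustrative):

```jsonl
{"prompt": "What is the return policy?", "chosen": "30-day return policy for unused items with receipt.", "rejected": "All sales are final."}
```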

CI

# .github/workflows/eval.yml
name: Agent Eval
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install cane-eval
      - run: cane-eval run tests/eval_suite.yaml --quiet
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

How It Works

YAML Suite --> Your Agent --> LLM Judge --> Training Data (DPO/SFT/OpenAI)
                                |
                                v
                         Root Cause Analysis --> fix recommendations
                                |
                                v
                         Failure Mining --> improved answer rewrites

License

MIT
