multivon-eval

Docs · Website · PyPI · Changelog · Benchmark vs DeepEval + RAGAS

AI evaluation for teams that ship models to production.

Why we exist — the credibility story

The three popular eval frameworks (multivon-eval, DeepEval, RAGAS) agree on a binary hallucination judgment only 56% of the time on the same dataset and labels. Cohen's κ = 0.03 — statistically indistinguishable from chance. If your CI gate flips on which framework you adopted, your "regression" is framework noise, not model quality. We ran this study and published the raw data in eval-framework-benchmark.

On the cross-distribution held-out test we hold ourselves to (Hallucination evaluator calibrated on HaluEval-QA, tested without re-tuning on HaluEval-Sum, n=60), multivon-eval scores F1 0.830 [0.70–0.92]. On the separate in-distribution comparison on HaluEval-QA, the lower bound of our CI (0.71) clears DeepEval's upper bound (0.68): F1 0.804 [0.71–0.88] vs 0.586 [0.48–0.68]. The full methodology + raw counts are in benchmarks/README.md Benchmark 4.

The release sequence 0.9.4 → 0.9.5 → 0.9.6 → 0.9.7 is the audit trail. A peer-review round caught a "held-out" claim in 0.9.4 that was actually in-distribution. 0.9.5 corrected the framing and added a genuinely held-out test. 0.9.6 fixed three runtime blockers in the bootstrap template. 0.9.7 caught a threshold-vs-default mismatch that was inflating the held-out F1 from 0.830 (calibrated threshold 0.55) to 0.852 (init-time default 0.7). That is four releases in eight hours, and every prior release remains on PyPI as the historical record. The framework's discipline matches what we ask users to apply to their own systems: that is the pitch.

Run structured evals over your AI outputs — from simple string checks to LLM-as-judge scoring to agent trace validation — with a clean Python API, beautiful terminal reports, and CI/CD integration out of the box.

Quickstart — 30 seconds, no API key

```
pip install multivon-eval
python -m multivon_eval                       # runs a demo eval; no setup needed
multivon-eval init -t quickstart -d my-eval   # scaffold your own (offline)
cd my-eval && python eval.py
```

That's it. The quickstart template uses only deterministic evaluators (NotEmpty, Contains, WordCount) so the first eval runs without an API key.

Pick your path

You're…	Run this	Needs API key?
Brand new — just kicking the tires	`python -m multivon_eval`	No (LLM judges activate if a key is set)
Beginner writing your first eval	`multivon-eval init -t quickstart`	No — fully offline
Building an agent (hand-rolled or any framework)	`multivon-eval init -t agent`	No for default eval, optional for richer judging
Building a LangGraph agent	`multivon-eval init -t agent-langgraph`	Yes (or local Ollama via `ChatOpenAI(base_url=...)`)
Building an agent with the OpenAI Agents SDK	`multivon-eval init -t agent-openai-sdk`	Yes (OpenAI)
Building a RAG / QA system	`multivon-eval init -t rag`	Yes (or local Ollama)
Working a regulated domain	`multivon-eval init -t regulated`	Yes (or local Ollama)
Multi-turn dialogue eval	`multivon-eval init -t conversation`	Yes (or local Ollama)

LLM-judge evaluators auto-activate when ANTHROPIC_API_KEY, OPENAI_API_KEY, or a local server (Ollama on :11434, LM Studio on :1234, or OPENAI_BASE_URL) is detected — but every template runs without one in some form.
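The detection order described above can be pictured with a small sketch. This is not the library's code: `detect_judge_backends`, `port_open`, and the backend labels are hypothetical names, and only the environment variables and the two ports come from the text above.

```python
import socket
from typing import Callable, Mapping

# Ports taken from the README text; the labels are illustrative only.
LOCAL_SERVERS = {"ollama": 11434, "lm_studio": 1234}


def port_open(port: int, host: str = "127.0.0.1", timeout: float = 0.2) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def detect_judge_backends(
    env: Mapping[str, str],
    probe: Callable[[int], bool] = port_open,
) -> list:
    """List the judge backends that look usable, in the order they were checked."""
    found = []
    if env.get("ANTHROPIC_API_KEY"):
        found.append("anthropic")
    if env.get("OPENAI_API_KEY"):
        found.append("openai")
    if env.get("OPENAI_BASE_URL"):
        found.append("openai_compatible")
    for name, port in LOCAL_SERVERS.items():
        if probe(port):
            found.append(name)
    return found


# Fake environment and a probe that finds no local server.
print(detect_judge_backends({"OPENAI_BASE_URL": "http://localhost:8000/v1"}, probe=lambda port: False))
# ['openai_compatible']
```

Passing `os.environ` and leaving `probe` at its default would check the real machine. An empty result corresponds to the "deterministic evaluators only" path.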

What's new in 0.9.x

multivon-eval install-skills (new in 0.9.8) — one-command installer for the three bundled Claude Code skills (eval-bootstrap, eval-audit, eval-explain). The wheel ships them under multivon_eval/_skills/; this CLI symlinks them into ~/.claude/skills/ so pip install -U multivon-eval automatically propagates SKILL.md edits.
```
multivon-eval install-skills              # symlinks the three skills
multivon-eval install-skills --dry-run    # preview without touching anything
multivon-eval install-skills --force      # replace existing entries

ls ~/.claude/skills/
# eval-audit  eval-bootstrap  eval-explain
```
See multivon_eval/_skills/README.md for the full skill catalog and what each one does. Pairs with multivon-eval bootstrap (which eval-bootstrap wraps as a Claude Code workflow) and the eval-action GitHub Action (which eval-audit complements on the pre-PR side).
Bootstrap CLI expansions —
- --judge-provider ollama + --judge-provider litellm for fully-local bootstrap (was cloud-only before 0.9.4).
- --judge-base-url (0.9.4) for vLLM / LM Studio / custom Ollama endpoints — injects a dummy API key when paired with --judge-provider openai so OpenAI-shim servers Just Work.
- --validate (0.9.0) runs the N-shot judge-noise filter (auto.validate_adversarial_cases) on the generated seed cases — drops anything outside the (0.5, 1.0) hardness band. Adds ~$0.03 but removes 20–40% of synthetic noise.
- --validate-n-shots controls the rerun count for --validate (default 3).
multivon-eval doctor (new in 0.9.0) — preflight your setup. Reports detected API keys, local-judge availability (Ollama / LM Studio / OpenAI-compat base URL), Python + package versions, ~/.multivon/ writeability. --json for CI consumers, exit codes 0 / 1 / 2 for hard/soft failures.
Self-correction audit trail (0.9.4 → 0.9.7) — the four-release cadence that produced the F1 0.830 [0.70–0.92] held-out number is documented release-by-release in CHANGELOG.md. 0.9.5 corrected the "held-out" framing on a Faithfulness number that was actually in-distribution. 0.9.6 fixed three runtime blockers in the bootstrap-generated template. 0.9.7 caught a threshold-vs-default mismatch that inflated the held-out F1 from 0.830 (calibrated 0.55) to 0.852 (init-time default 0.7) — only the 0.830 figure is defensible as "held-out at the calibrated threshold." See benchmarks/README.md Benchmark 4 for the reproducibility note on resolving thresholds at runtime.

Carried forward from 0.8.x

multivon-eval bootstrap — cold-start eval generator. Describe your LLM product + hand over a JSONL of sample traces, get back a runnable EvalSuite + 30 adversarial seed cases + thresholds calibrated from your data + a forwardable DISCOVERY_REPORT.md. ~60 seconds, ~$0.12 per run. PII / secrets redacted locally before any LLM call. Best documented path is the bootstrap guide.
```
multivon-eval bootstrap --product PRODUCT.md --traces TRACES.jsonl --output ./eval-bootstrap/
```
multivon_eval.auto module — the programmatic primitives the bootstrap CLI composes:
- auto_evaluators(case) — pure-heuristic, infers the recommended evaluator set from EvalCase shape. 0 LLM cost, microseconds.
- generate_adversarial_cases(seed, mode, n) — LLM-generated stress cases across 10 named failure modes (ungrounded_claim, jailbreak, prompt_injection_direct/indirect, tool_injection, pii_leakage_invitation, etc.).
- validate_adversarial_cases(cases, baseline, n_shots=3) — N-shot judge-noise filter. Validated +0.80 mean failure-rate separation between weak vs strong baselines.
Reproducible head-to-head — multivon-eval F1 0.804 [0.71–0.88] vs DeepEval F1 0.586 [0.48–0.68] on HaluEval-QA, same N=100, same labels, same judge family. The lower bound of our CI clears DeepEval's upper bound. RAGAS errored on the same input. Run it yourself: eval-framework-benchmark.

Carried forward from 0.7.x

CaseResult.status enum distinguishes judge_error / model_error / evaluator_error from quality failures. pass_rate excludes errors from the denominator.
Per-evaluator error isolation — one judge outage no longer crashes the case.
JUnit XML output + multivon-eval view <report.json> HTML dashboard + multivon-eval init starter templates + EvalReport.assert_budget(...) cost/latency gates.
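
To make the denominator rule concrete, here is a minimal sketch of a pass rate that leaves errored cases out. The three error statuses are the ones named above; the `"passed"` / `"failed"` labels and the `pass_rate` helper are assumptions for illustration, not the library's actual enum values or method.

```python
# Error statuses named in the README; "passed" / "failed" are assumed labels.
ERROR_STATUSES = {"judge_error", "model_error", "evaluator_error"}


def pass_rate(statuses):
    """Pass rate over cases that produced a quality verdict.

    Errored cases are left out of the denominator, so an outage neither
    counts as a quality failure nor inflates the rate.
    Returns None when no case was scored.
    """
    scored = [s for s in statuses if s not in ERROR_STATUSES]
    if not scored:
        return None
    return sum(s == "passed" for s in scored) / len(scored)


statuses = ["passed", "passed", "failed", "judge_error", "model_error"]
print(pass_rate(statuses))  # 2 of the 3 scored cases passed
```

Counting the two errors as failures would report 40% here. Excluding them keeps the quality signal separate from infrastructure noise.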

See CHANGELOG.md for the complete release history.

The Multivon ecosystem

Five public + one early-access package, all built on a shared evaluation engine:

Repo	What it is
multivon-eval (you are here)	Python SDK — 44 evaluators + `bootstrap` CLI + `multivon_eval.auto`
pdfhell	Adversarial PDFs that break AI document readers — procedural ground truth, not LLM-as-judge
multivon-mcp	MCP server exposing 22 evaluation tools to Claude / Cursor / Cline / OpenCode
eval-action	GitHub Action — run a suite on every PR, post a comment, gate the merge on regressions
eval-framework-benchmark	Reproducible head-to-head benchmark vs DeepEval + RAGAS
multivon-guard (early access)	Local proxy that catches LLM coding agents leaking secrets / PII before the request hits the wire. `hello@multivon.ai`.

When NOT to use multivon-eval

You want…	Use
To call evals from inside Claude Code via SKILL files	bundled Claude Code skills — `multivon-eval install-skills`
To call evals from Cursor / Cline / Claude Desktop mid-edit	multivon-mcp
To gate every PR on eval regressions automatically	eval-action
Adversarial PDF benchmarking with code-based ground truth	pdfhell
To see how multivon-eval stacks up against DeepEval / RAGAS	eval-framework-benchmark
Just to gate on a single LLM judge call without a suite	call `Faithfulness(...).evaluate(case, output)` directly — overkill to spin up an `EvalSuite`

Three agent-facing surfaces, one engine. Claude Code skills run inside Claude Code; the MCP server runs alongside any MCP-compatible client (Cursor / Cline / Claude Desktop / OpenCode); the GitHub Action runs on every PR. All three call the same multivon-eval evaluators against the same calibration table — they differ only in where the agent lives.

```
# pip install multivon-eval anthropic
# export ANTHROPIC_API_KEY=sk-ant-...

import anthropic
from multivon_eval import EvalSuite, EvalCase

client = anthropic.Anthropic()

def support_bot(prompt: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

suite = EvalSuite("Support Bot Eval")
suite.add_check("Response explains how to resolve the issue")
suite.add_check("Tone is professional and not defensive", threshold=0.8)
suite.add_cases([
    EvalCase(
        input="How do I reset my password?",
        context="Users can reset their password by clicking 'Forgot Password' on the login page.",
    ),
])
report = suite.run(support_bot)
```

─────────────────────── Support Bot Eval ───────────────────────
  #  Input                      Output                   Score  Status    Latency
  1  How do I reset my pas...   Click 'Forgot Passwor…   0.92   PASS      843ms

                           By Evaluator
  Evaluator           Avg Score    Pass Rate
  response_explains      0.92        100%
  tone_is_profess…       0.88         88%

╭────────────────────────────────── Summary ───────────────────────────────────╮
│ Total: 1   Passed: 1   Failed: 0                                              │
│ Pass Rate: 100% [20%–100% 95% CI]   Avg Score: 0.90 [0.82–0.96]             │
╰──────────────────────────────────────────────────────────────────────────────╯
  ⚡ Power warning: 1 case(s) — minimum detectable change at 80% power is ~100%.
  Add ≥291 cases to reliably detect a 10pp shift.

Why multivon-eval

Every team building AI products hits the same problem: how do you know if your model is getting better or worse?

Feature	multivon-eval	DeepEval	RAGAS	Promptfoo
Plain-English checks (`add_check`)	✓	—	—	—
Multi-run + flakiness detection	✓	—	—	—
CI on every report (Wilson + bootstrap)	✓	—	—	—
Multiple-comparison correction (BH)	✓	—	—	—
Power warning + dataset size guidance	✓	—	—	—
Judge calibration against human labels	✓	—	—	—
QAG scoring (binary questions, not 1-10)	✓	—	—	—
Agent-native evaluators (8 metrics)	✓	✓	partial	—
LangChain / LangSmith integration	✓	✓	✓	partial
Compliance audit trail (EU AI Act / NIST)	✓	—	—	—
Local PII detection (zero API calls)	✓	partial	—	—
HTML reports (self-contained, shareable)	✓	—	—	—
Local-first, no account needed	✓	✓	✓	✓
Synthetic data generation	✓	✓	✓	—
Open source (Apache 2.0)	✓	✓	✓	✓

Comparison based on each project's public documentation as of June 2026 (last reviewed 2026-06-03; revisit every minor release). The benchmarks are public: see benchmarks/ for code + datasets and benchmarks/results/ for the raw output JSON. Found something wrong? Open an issue and we'll fix it.

Numbers, not adjectives

Hallucination detection, HaluEval QA, N=100, claude-haiku-4-5 judge, human labels:

Evaluator	Precision	False positives	F1
multivon-eval (QAG)	0.788	11	0.804
DeepEval (GPT-4o-mini)	0.456	49	0.586
Simple LLM judge (1-10)	0.617	31	0.763
Keyword overlap	0.605	15	0.523

Multi-judge agreement on the same task, N=50, all judges temperature=0:

Judge	Accuracy vs human	Precision	F1
gemini-2.5-flash	0.860	0.950	0.844
gpt-4o-mini	0.820	0.900	0.800
claude-haiku-4-5	0.800	0.895	0.773
gpt-4o	0.780	0.792	0.776
claude-sonnet-4-6	0.720	0.720	0.720

Pairwise Cohen's κ across the 5 judges: 0.60–0.80 (substantial on most pairs). Calibration provenance + per-(judge × evaluator) thresholds ship in multivon_eval/_calibration_data/v2.json. gemini-2.5-flash leads on every metric in this run; claude-haiku-4-5 and gpt-4o-mini are close seconds with cheaper tokens. Pick by your latency / cost / sovereignty constraints — all three are first-class providers.
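
For readers who want to check a κ figure themselves, the sketch below computes Cohen's κ for two judges from their label lists. It uses the standard textbook definition, not code from this package, and the toy labels are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each rater labelled at random with its own base rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


judge_a = [1, 1, 0, 0]
judge_b = [1, 0, 0, 0]
print(cohens_kappa(judge_a, judge_b))  # 0.5: 75% raw agreement, 50% expected by chance
```

Because κ subtracts chance agreement, two judges can agree on a majority of cases and still score near zero.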

Cost / latency (benchmarks/results/cost_latency.json) — 50 HaluEval QA cases × 4 LLM-judge evaluators with claude-haiku-4-5, workers=1:

Metric	Value
Cost per case (4 evaluators)	$0.00127
Total cost for the run	$0.0635
Judge calls per case	17.1 (QAG produces 3 questions × 4 evaluators + verification)
Wall clock for 50 cases	15 min
Linear extrapolation to 5,000 cases	$6.35

Cache hit speedup (benchmarks/results/reproducibility.json) — same suite, sequential reruns with set_cache(JudgeCache(...)) installed:

Run	Wall clock	Judge calls
Rep 1 (cold)	2.9 s	4
Rep 2 (hot)	0 ms	0

Cache speedup on the rep-1→rep-2 transition: 2,271×. Cache hits also produce identical scores by construction — flake-proof reruns. set_cache() auto-enables caching for every subsequent JudgeConfig; no need to thread cache=True through every evaluator.
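
Why a hot rerun makes zero judge calls and reproduces scores exactly can be seen in a toy cache. `SimpleJudgeCache` and `fake_judge` below are hypothetical stand-ins written for this sketch; they do not describe how `JudgeCache` is implemented or what it keys on.

```python
import hashlib
import json


class SimpleJudgeCache:
    """Hypothetical in-memory cache keyed by a hash of the judge request."""

    def __init__(self):
        self._store = {}
        self.judge_calls = 0

    @staticmethod
    def _key(model, prompt):
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def score(self, model, prompt, judge_fn):
        """Return the cached score, calling judge_fn only on a miss."""
        key = self._key(model, prompt)
        if key not in self._store:
            self.judge_calls += 1
            self._store[key] = judge_fn(model, prompt)
        return self._store[key]


def fake_judge(model, prompt):
    # Stand-in for a real LLM call.
    return 0.9 if "password" in prompt else 0.1


cache = SimpleJudgeCache()
prompts = ["reset password", "refund policy", "reset password"]
cold = [cache.score("some-judge", p, fake_judge) for p in prompts]
hot = [cache.score("some-judge", p, fake_judge) for p in prompts]
print(cache.judge_calls, cold == hot)  # 2 True
```

The second pass never reaches `fake_judge`, so its scores are identical by construction, which is the same property the rerun table above reports.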

What makes `multivon-eval` different

	What it is	One-line why
QAG scoring	Binary yes/no questions instead of 1-10 ratings	Eliminates scale ambiguity, fully auditable — every score traces to specific questions that passed or failed
Plain-English checks	`suite.add_check("Response explains the return policy")`	No evaluator class to pick, no prompt to craft. Questions auto-generated; pin them for reproducible CI
Bootstrap CLI	`multivon-eval bootstrap` (new in 0.8.0)	Cold-start from product description + traces → tuned suite in 60s
Agent-native	Tool-call accuracy, plan quality, step faithfulness, task completion	Works with traces from any framework (LangChain, LlamaIndex, OpenAI Agents SDK, custom)
Tiered evaluators	Deterministic / LLM-judge / agent-trace / conversation, plus compliance, multimodal, and consistency (seven tiers in all)	Mix freely; pay for LLM calls only where they matter
Reliability + flakiness	`suite.run(runs=5)` + statistical significance	Detects cases that pass on some runs and fail on others; separates real regressions from noise
Statistical rigor	Wilson CIs, bootstrap, p10/p50/p90, power warnings, BH correction	NAACL 2025: single-run eval scores are unreliable. CIs ship by default
No cold-start	`generate_from_file("docs/")` synthesises cases	No labeled data required to start
Local-first compliance	`PIIEvaluator` + `SchemaEvaluator` + `ComplianceReporter`	Hash-chained audit trails, EU AI Act / NIST AI RMF mappings, `EvalSuite.eu_ai_act_high_risk()` factory
Experiment tracking	`Experiment.record(report)` + `compare(a, b)`	p-values, CIs, McNemar across runs
Cache	`set_cache(JudgeCache(...))` — once	2,271× speedup on rep-2 (4 judge calls → 0), identical scores guaranteed
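
As a rough picture of QAG scoring, the sketch below aggregates yes/no answers into a score and keeps the failed questions for auditing. It assumes the score is simply the fraction of questions answered "yes". That aggregation rule, along with `QuestionResult` and `qag_score`, is an assumption made for illustration and may differ from what the library does.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    question: str
    passed: bool  # the judge's yes/no answer


def qag_score(results, threshold=0.7):
    """Aggregate binary answers into (score, verdict, failed questions).

    Assumption: the score is the fraction of questions answered "yes".
    """
    if not results:
        raise ValueError("at least one question is required")
    score = sum(r.passed for r in results) / len(results)
    failed = [r.question for r in results if not r.passed]
    return score, score >= threshold, failed


results = [
    QuestionResult("Does the answer mention the 'Forgot Password' link?", True),
    QuestionResult("Is every claim supported by the context?", True),
    QuestionResult("Does the answer avoid inventing steps?", False),
]
score, passed, failed = qag_score(results)
print(round(score, 2), passed, failed)
```

The returned list of failed questions is what makes a low score traceable to a specific question.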

Install

```
pip install multivon-eval

cp .env.example .env
# Add ANTHROPIC_API_KEY and/or OPENAI_API_KEY
```

Claude Code skills (optional)

If you use Claude Code, wire up the three bundled skills with one command:

```
multivon-eval install-skills        # symlinks eval-bootstrap / eval-audit / eval-explain into ~/.claude/skills/
```

What each one does:

eval-bootstrap — auto-invoked when Claude Code detects an LLM-touching codebase without an eval directory. Wraps the bootstrap CLI in a Claude Code workflow that fills in the stub model from the project's existing call sites.
eval-audit — auto-invoked between /review and /ship on diffs touching prompts / model calls / tool defs. Runs only the eval cases that stress the changed surface, blocks safety-class regressions.
eval-explain — auto-invoked after /eval-bootstrap (and on phrases like "why did multivon pick X"). Answers in three sentences using the DISCOVERY_REPORT.md rationale.

Full details in multivon_eval/_skills/README.md. Run multivon-eval install-skills --help for the --dry-run / --force flags.

Core concepts

Three primitives, one runner:

```
from multivon_eval import EvalSuite, EvalCase, Faithfulness, NotEmpty

case = EvalCase(
    input="What caused the 2008 financial crisis?",
    expected_output="Subprime mortgage collapse...",
    context="The 2008 crisis was triggered by widespread mortgage defaults...",
    tags=["finance"],
)

suite = EvalSuite("My eval")
suite.add_cases([case])
suite.add_evaluators(NotEmpty(), Faithfulness(threshold=0.7))

# model_fn is your own callable: it receives case.input and returns the model output
# Serial, parallel, multi-run, or async: pick what fits
report = suite.run(model_fn, fail_threshold=0.85)
report = suite.run(model_fn, workers=8)
report = suite.run(model_fn, runs=5)                 # flakiness detection
report = await suite.run_async(model_fn, concurrency=10)   # call from inside an async function

report.save_json("results.json")    # also save_csv, save_html, save_junit_xml
```

Agent cases use agent_trace=[AgentStep(...)] + expected_tool_calls=[...]. Conversation cases use conversation=[{"role": ..., "content": ...}]. Load existing datasets with load("cases.jsonl") or load("cases.csv").

ToolCallAccuracy three-shape semantics (0.9.0): expected_tool_calls=None skips the case (no expectation set), expected_tool_calls=[] asserts "no tools should have been called" (and a non-empty trace fails), and expected_tool_calls=[...] checks the trace contains the named calls in order. The skip variant is treated as skipped-pass in the report, not 0.0 — see the integrations/ tracers (LangGraphTracer, OpenAIAgentsTracer, ManualTracer) for how each tracer populates agent_trace.
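
The three shapes can be summarised in a few lines of plain Python. This is a sketch of the semantics described above, not the evaluator's source: `tool_call_verdict` is a hypothetical helper, traces are reduced to lists of tool names, and "in order" is read here as "appears as a subsequence, with other calls allowed in between", which is an assumption.

```python
from typing import Optional, Sequence


def tool_call_verdict(expected: Optional[Sequence[str]], trace: Sequence[str]) -> str:
    """Return "skipped", "pass", or "fail" for one case.

    `trace` is the ordered list of tool names the agent actually called.
    """
    if expected is None:
        return "skipped"  # no expectation set
    if len(expected) == 0:
        # Explicit assertion that no tool should have been called.
        return "pass" if len(trace) == 0 else "fail"
    # In-order containment: expected must be a subsequence of trace.
    remaining = iter(trace)
    return "pass" if all(name in remaining for name in expected) else "fail"


print(tool_call_verdict(None, ["search"]))                                  # skipped
print(tool_call_verdict([], ["search"]))                                    # fail
print(tool_call_verdict(["search", "fetch"], ["search", "rank", "fetch"]))  # pass
print(tool_call_verdict(["fetch", "search"], ["search", "fetch"]))          # fail
```

Keeping `None` and `[]` distinct is the point of the design: "nothing to check" and "must call nothing" are different claims about the agent.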

Evaluators — 44 across 7 tiers

Tier	Examples	Cost
Deterministic	`NotEmpty`, `ExactMatch`, `Contains`, `RegexMatch`, `JSONSchemaEval`, `WordCount`, `BLEU`, `ROUGE`, `Latency`, `BERTScore`, `Levenshtein`, `ChrfScore`	Free, instant
LLM-judge (QAG)	`Faithfulness`, `Hallucination`, `Relevance`, `Coherence`, `Toxicity`, `Bias`, `AnswerAccuracy`, `ContextPrecision`, `ContextRecall`, `CustomRubric`, `GEval`, `CheckEvaluator`	~$0.001 / case
Agent-trace	`ToolCallAccuracy`, `ToolArgumentAccuracy`, `ToolCallNecessity`, `TrajectoryEfficiency`, `AgentMemoryEval`, `PlanQuality`, `TaskCompletion`, `StepFaithfulness`	LLM-judge subset
Compliance	`PIIEvaluator` (zero API calls, multi-jurisdiction), `SchemaEvaluator` (Pydantic + JSON Schema)	Free
Conversation	`ConversationRelevance`, `KnowledgeRetention`, `ConversationCompleteness`, `TurnConsistency`	LLM-judge
Multimodal	`VQAFaithfulness`, `DocumentGrounding`	LLM-judge
Consistency	`SelfConsistency`	LLM-judge

Full reference + signatures + examples per evaluator: docs.multivon.ai/evaluators.

Compliance & privacy

For regulated industries (healthcare, finance, legal) where traces can't leave your environment.

PIIEvaluator — local regex-only detection across GDPR, CCPA, HIPAA, DPDP (India), PIPEDA jurisdictions. Email, phone, SSN, credit card (Luhn), passport, IBAN, Aadhaar (Verhoeff), PAN. redact=True masks in the report. Zero LLM calls.
SchemaEvaluator — validates outputs against Pydantic models or JSON Schema with per-field failures. Based on StructEval (2025): GPT-4 fails complex structured extraction ~12% of the time even with explicit format instructions.
ComplianceReporter — hash-chained NDJSON audit log (prev_hash linked, SHA-256). Each result annotated with EU AI Act articles (9(2)(b), 10, 15) or NIST AI RMF subcategories. reporter.coverage(suite) surfaces uncovered controls before you ship. EvalSuite.eu_ai_act_high_risk() factory + for_regulated(jurisdiction="hipaa").
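
The Luhn check mentioned for credit cards is a public checksum, and a standalone version fits in a few lines. The sketch below is the textbook algorithm, not `PIIEvaluator`'s code; `luhn_valid` is a name chosen for this example.

```python
def luhn_valid(number: str) -> bool:
    """Check a card-like number with the Luhn checksum, ignoring separators."""
    digits = [int(ch) for ch in number if ch.isdecimal()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True (a widely used test number)
print(luhn_valid("4111 1111 1111 1112"))  # False
```

Pairing a regex with a checksum like this is a common way to cut false positives: a random 16-digit string passes the regex but usually fails the checksum.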

```
from multivon_eval import EvalSuite, ComplianceReporter

suite = EvalSuite.eu_ai_act_high_risk(jurisdiction="gdpr")
reporter = ComplianceReporter(output_dir="./audit", framework="eu-ai-act")
reporter.record(suite.run(model_fn, runs=5))
reporter.verify(suite.name)  # tamper-evident chain check
```
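
To show what "hash-chained" and "tamper-evident" mean in practice, here is a minimal sketch of a `prev_hash`-linked NDJSON log with SHA-256. The record layout, the all-zero genesis value, and the helper names are assumptions for illustration. The real on-disk format of `ComplianceReporter` may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first record's prev_hash


def record_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_record(lines: list, payload: dict) -> None:
    """Append one NDJSON line whose prev_hash points at the previous line."""
    prev_hash = record_hash(json.loads(lines[-1])) if lines else GENESIS
    lines.append(json.dumps({"prev_hash": prev_hash, "payload": payload}, sort_keys=True))


def verify_chain(lines: list) -> bool:
    """Recompute every link; an edited or reordered earlier line breaks the chain."""
    expected = GENESIS
    for line in lines:
        record = json.loads(line)
        if record["prev_hash"] != expected:
            return False
        expected = record_hash(record)
    return True


log = []
append_record(log, {"case": 1, "status": "passed"})
append_record(log, {"case": 2, "status": "failed"})
print(verify_chain(log))  # True

log[0] = log[0].replace("passed", "failed")  # tamper with the first record
print(verify_chain(log))  # False
```

In this simple form, an edit to the very last record is only detectable if the final hash is stored somewhere else, so a real audit log would typically anchor the head hash outside the file.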

Full reference: docs.multivon.ai/compliance — jurisdictions, Article mappings, audit-pack generation, sample-audit-pack download.

Statistical rigor

Backed by NAACL 2025: single-run eval scores are unreliable — variance is large enough to reverse model rankings.

Pass Rate: 80% [69%–89% 95% CI]   Avg Score: 0.82 [0.74–0.90]
Score distribution  p10:0.41  p50:0.88  p90:0.96
⚡ Power warning: 12 cases — minimum detectable change at 80% power is ~45%.

What ships by default in every report:

Wilson 95% CI on pass rate · bootstrap 95% CI on avg score
p10 / p50 / p90 percentiles — exposes bimodal distributions that avg_score hides
Power warning when your test set is too small to detect the shift you care about
runs_needed(delta=0.10) + min_detectable_effect(n=50) for sample-size sizing
Benjamini-Hochberg correction auto-applied in exp.compare() for multi-evaluator runs
Judge calibration — suite.calibrate(labeled_pairs) reports F1 vs human labels per evaluator. Shipped calibration table in _calibration_data/v2.json with per-(judge × evaluator) thresholds (F1 0.66–1.00 range)
Judge reliability check — JudgeConfig(reliability_check=True) flags non-determinism in the judge itself
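
For reference, the Wilson score interval listed above can be computed by hand. The sketch below is the standard formula, written independently of this package; `wilson_interval` is a name chosen for the example.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.959963984540054):
    """Wilson score interval for a binomial proportion (default z gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")
    p = passes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(passes=1, n=1)
print(f"{low:.3f} {high:.3f}")  # 0.207 1.000
```

Unlike the naive normal approximation, which collapses to a zero-width interval at 0% or 100%, the Wilson interval stays wide for tiny test sets. A single passing case still leaves the true pass rate almost entirely undetermined.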

Full reference: docs.multivon.ai/guides/statistical-rigor.

Synthetic dataset generation

No labeled data? Point generate_from_file() at your docs:

```
from multivon_eval import generate_from_file, generate_hallucination_pairs

cases = generate_from_file("docs/faq.md", n=20, task="qa")
cases = generate_from_file("docs/whitepaper.txt", n=10, task="summarization")
pairs = generate_hallucination_pairs(my_docs, n=20)
```

CLI: multivon-eval generate --from docs/faq.md --n 20 --task qa --output cases.jsonl.

For more sophisticated cold-start, the multivon-eval bootstrap CLI composes generation + heuristic anchoring + N-shot judge-noise filtering into one command — see What's new in 0.9.x above for the full flag set (including 0.9.4's --judge-base-url and 0.9.0's --validate) and the bootstrap guide. Run multivon-eval bootstrap --help for the canonical flag reference.

Experiment tracking

Record every run, compare across model / prompt versions, surface regressions before they ship. Stored locally in ~/.multivon/experiments/ — no cloud, no account.

```
from multivon_eval import Experiment

exp = Experiment("rag-pipeline")
run_a = exp.record(suite.run(old_model_fn), tags={"prompt_v": "2"})
run_b = exp.record(suite.run(new_model_fn), tags={"prompt_v": "3"})
exp.compare(run_a, run_b)  # prints CIs + McNemar p + BH-corrected per-evaluator deltas
```
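
The BH correction that `exp.compare()` applies can be illustrated on its own. The sketch below is the standard Benjamini-Hochberg adjustment, written independently of the package. The per-evaluator p-values are invented for the example and `bh_adjust` is a hypothetical helper name.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented p-values, one per evaluator, for illustration only.
per_evaluator_p = {"faithfulness": 0.01, "relevance": 0.04, "toxicity": 0.03, "coherence": 0.20}
adjusted = bh_adjust(list(per_evaluator_p.values()))
for name, q in zip(per_evaluator_p, adjusted):
    print(f"{name}: q={q:.3f} {'significant' if q <= 0.05 else 'not significant'}")
```

Three of the four raw p-values are below 0.05, but only one survives the correction. That is the effect the correction is meant to have when many evaluators are compared at once.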

CLI: multivon-eval experiments list / history / compare.

Full reference: docs.multivon.ai/guides/experiments.

CLI

```
multivon-eval init -t <template> -d <dir>     # scaffold a starter eval suite (templates: quickstart, agent, rag, regulated, conversation, agent-langgraph, agent-openai-sdk)
multivon-eval run eval.py                     # execute an eval file
multivon-eval report results.json             # print a saved JSON report
multivon-eval view results.json [--open]      # render the JSON as an HTML dashboard
multivon-eval compare a.json b.json           # diff two reports, McNemar + BH-corrected per-evaluator deltas
multivon-eval generate --from docs/ --n 20    # synthetic case generation from a file/dir
multivon-eval bootstrap --product PRODUCT.md --traces TRACES.jsonl   # cold-start a tuned suite
multivon-eval doctor [--json]                 # preflight: API keys, local judges, versions, dirs
multivon-eval install-skills [--dry-run] [--force]    # symlink the three Claude Code skills
multivon-eval experiments list | history <name> | compare <run_a> <run_b>
multivon-eval attribution scan <repo> | diff <base> <head>   # Phase 1 prompt-fingerprint diff
```

multivon-eval --help enumerates every flag. Each subcommand has its own --help with examples.

CI/CD integration

```
# eval.py
report = suite.run(model_fn, fail_threshold=0.85)  # exits 1 if < 85% pass
```

```
# .github/workflows/eval.yml
- name: Run evals
  run: python eval.py
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Architecture

```
EvalSuite.run(model_fn)
  → for each case: model_fn(case.input) → output
  → for each evaluator: deterministic | LLM-judge (QAG) | agent-trace | conversation
  → EvalReport (CaseResults + per-evaluator scores + CIs + rich terminal report)
  → save_json / save_csv / save_html / save_junit_xml
```

Judges: claude-haiku-4-5 by default (configurable via JUDGE_MODEL + JUDGE_PROVIDER). Local + self-hosted models supported via OPENAI_BASE_URL (Ollama, LM Studio, vLLM, any OpenAI-compatible server). Per-(judge × evaluator) thresholds calibrated against human-labeled benchmarks — see _calibration_data/v2.json for the shipped table with provenance.

Examples

File	What it shows
`basic_eval.py`	Deterministic evaluators only — zero API cost, instant sanity check
`rag_eval.py`	Faithfulness + hallucination for RAG pipelines
`ci_eval.py`	CI/CD integration — `fail_threshold` exits 1 on regression
`check_eval.py`	`add_check()` — write criteria in English, no evaluator class needed
`agent_eval.py`	Agent tool call accuracy with `ManualTracer` — surfaces flaky tool selection

Tests

```
pip install -e ".[dev]"
pytest tests/ -v
```

Roadmap

See ROADMAP.md for the full shipped + in-flight list. The headline open items: LlamaIndex / CrewAI tracers, pytest plugin, LiteLLM adapter, tiered cost optimizer, agent simulation. File an issue if you want one prioritized.

Contributing

Issues and PRs welcome.

Small changes (docs, bug fixes): open a PR directly. Large changes (new evaluators, architecture): open an issue first.

```
git clone https://github.com/multivon-ai/multivon-eval
cd multivon-eval
pip install -e ".[dev]"
pytest tests/
```

License

Apache 2.0 — built by Multivon

Release history

Version	Released
0.12.0	Jun 11, 2026
0.11.1	Jun 11, 2026
0.11.0	Jun 11, 2026
0.10.1	Jun 11, 2026
0.10.0	Jun 10, 2026
0.9.8	Jun 2, 2026
0.9.7	Jun 2, 2026
0.9.6	Jun 2, 2026
0.9.5	Jun 2, 2026
0.9.4	Jun 2, 2026
0.9.3	May 27, 2026
0.9.2	May 26, 2026
0.9.1	May 24, 2026
0.9.0	May 22, 2026
0.8.2	May 19, 2026
0.8.1	May 19, 2026
0.8.0	May 19, 2026
0.7.8	May 17, 2026
0.7.7	May 17, 2026
0.7.6	May 17, 2026
0.7.5	May 17, 2026
0.7.4	May 17, 2026
0.7.3	May 16, 2026
0.7.2	May 16, 2026
0.7.1	May 16, 2026
0.7.0	May 16, 2026
0.6.1	May 14, 2026
0.6.0	May 14, 2026
0.5.0	May 12, 2026
0.4.0	Apr 29, 2026
0.3.0	Apr 26, 2026
0.2.0	Apr 26, 2026
0.1.3	Apr 26, 2026
0.1.2	Apr 26, 2026
0.1.1	Apr 26, 2026
0.1.0	Apr 25, 2026
