AI evaluation for teams that ship models to production

multivon-eval

Run structured evals over your AI outputs — from simple string checks to LLM-as-judge scoring to agent trace validation — with a clean Python API, beautiful terminal reports, and CI/CD integration out of the box.

from multivon_eval import EvalSuite, EvalCase, Relevance, Faithfulness, NotEmpty

suite = EvalSuite("Support Bot Eval")
suite.add_cases([
    EvalCase(
        input="How do I reset my password?",
        context="Users can reset their password by clicking 'Forgot Password' on the login page.",
    ),
])
suite.add_evaluators(NotEmpty(), Relevance(), Faithfulness())
report = suite.run(my_model_fn)

─────────────────────── Support Bot Eval ───────────────────────
  #  Input                      Output                   Score  Status    Latency
  1  How do I reset my pas...   Click 'Forgot Passwor…   0.92   PASS      843ms

                           By Evaluator
  Evaluator       Avg Score    Pass Rate
  not_empty          1.00        100%
  relevance          0.88         88%
  faithfulness       0.87         87%

╭─────────────────────── Summary ───────────────────────╮
│ Total: 1   Passed: 1   Failed: 0   Pass Rate: 100%   │
╰────────────────────────────────────────────────────────╯

Why multivon-eval

Every team building AI products hits the same problem: how do you know if your model is getting better or worse?

Existing tools have real limitations:

DeepEval — powerful but LLM-as-judge for everything is expensive, slow, and hard to audit
RAGAS — excellent, but RAG-only
Promptfoo — YAML-driven, feels rigid for Python teams

multivon-eval is different in five ways:

QAG scoring — Instead of asking a judge "rate this 1-10", we generate yes/no questions about the output and score by the fraction answered correctly. Binary questions are easier for LLMs to get right, fully auditable, and cheaper.

Agent-native — Built-in evaluators for tool call accuracy, plan quality, step faithfulness, and task completion. Covers agent traces from any framework (LangChain, LlamaIndex, custom).

Four tiers — Deterministic (free, instant), LLM-judge (QAG), agent-trace, and conversation evaluators. Mix and match; pay for LLM calls only where it matters.

No cold-start — Generate eval cases from your docs with generate_from_file(). No labeled data required to get started.

Experiment tracking — Record every run, compare across model versions, catch regressions before they reach users.

Install

pip install multivon-eval

cp .env.example .env
# Add ANTHROPIC_API_KEY and/or OPENAI_API_KEY

Core concepts

`EvalCase` — A test case

from multivon_eval import EvalCase

case = EvalCase(
    input="What caused the 2008 financial crisis?",          # required
    expected_output="Subprime mortgage collapse...",          # for ExactMatch, Contains
    context="The 2008 crisis was triggered by...",           # for Faithfulness, Hallucination
    tags=["finance", "history"],                             # for filtering reports
    metadata={"source": "test_set_v2", "difficulty": "hard"},
)

For agent evals:

from multivon_eval import EvalCase, AgentStep, ToolCall

case = EvalCase(
    input="search for recent AI papers and summarize",
    agent_trace=[
        AgentStep(tool_calls=[ToolCall(name="search", arguments={"query": "AI papers 2025"})]),
        AgentStep(tool_calls=[ToolCall(name="summarize")]),
    ],
    expected_tool_calls=["search", "summarize"],
)

Evaluators

Four tiers — pick what fits your use case.

Tier 1: Deterministic (free, instant, no LLM needed)

Evaluator	What it checks
`NotEmpty`	Response is non-empty
`ExactMatch`	Response matches `expected_output` exactly
`Contains(substrings)`	Response contains all required strings
`RegexMatch(pattern)`	Response matches a regex pattern
`JSONSchemaEval(schema)`	Response is valid JSON matching a schema
`WordCount(min, max)`	Word count within range
`Latency(max_ms)`	Response time under limit
`BLEU(n)`	BLEU-n score vs expected output
`ROUGE`	ROUGE-L F1 vs expected output
`StartsWith(prefix)`	Response starts with prefix

Tier 2: LLM-as-judge (QAG scoring)

Evaluator	What it measures	Requires `context`
`Faithfulness`	Response is grounded in context	Yes
`Hallucination`	Response doesn't invent facts	Yes
`Relevance`	Response addresses the question	No
`Coherence`	Response is clear and well-structured	No
`Toxicity`	Response is safe and non-harmful	No
`Bias`	Response is free of demographic bias	No
`Summarization`	Summary captures key points faithfully	Yes
`AnswerAccuracy`	Factual correctness vs expected	No
`ContextPrecision`	Relevant context retrieved	Yes
`ContextRecall`	All needed context retrieved	Yes
`CustomRubric`	Your own yes/no criteria	Optional
`GEval`	Holistic numeric quality score	Optional

from multivon_eval import CustomRubric

support_quality = CustomRubric(
    name="support_quality",
    criteria=[
        ("Does the response acknowledge the customer's problem?", True),
        ("Does the response provide a concrete next step?", True),
        ("Does the response use apologetic or defensive language?", False),
    ],
    threshold=0.8,
)
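
As a rough illustration of QAG scoring, the sketch below turns a list of (question, expected answer) criteria like the rubric above into a score: the fraction of yes/no questions the judge answered as expected, compared against a threshold. The function name and the shape of the `answers` dictionary are hypothetical, and the judge call itself is left out; this is not the library's implementation.

```python
def qag_score(criteria, answers, threshold=0.8):
    """Score yes/no criteria by the fraction answered as expected.

    criteria: list of (question, expected_bool) pairs.
    answers:  dict mapping question -> bool verdict from a judge (hypothetical shape).
    Returns (score, passed).
    """
    if not criteria:
        raise ValueError("at least one criterion is required")
    # A question the judge did not answer counts as a miss.
    correct = sum(1 for question, expected in criteria
                  if answers.get(question) == expected)
    score = correct / len(criteria)
    return score, score >= threshold


criteria = [
    ("Does the response acknowledge the customer's problem?", True),
    ("Does the response provide a concrete next step?", True),
    ("Does the response use apologetic or defensive language?", False),
]
```

Because every question and verdict is kept, a failing score can be traced back to the exact criterion that was missed.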

Tier 3: Agent trace evaluators

Evaluator	What it checks
`ToolCallAccuracy`	Expected tools called (ordered or unordered)
`ToolArgumentAccuracy`	Quality of tool arguments
`PlanQuality`	Plan logic, completeness, efficiency
`TaskCompletion`	Final output satisfies the task goal
`StepFaithfulness`	Each step follows logically from prior

from multivon_eval import ToolCallAccuracy

ToolCallAccuracy(require_order=True)  # strict ordering
ToolCallAccuracy(require_order=False) # set match (default)
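
The difference between the two modes can be sketched in plain Python. This is an assumption about how ordered and set matching might be scored, not the library's code; `tool_call_score` is a hypothetical helper.

```python
def tool_call_score(called: list[str], expected: list[str],
                    require_order: bool = False) -> float:
    """Fraction of expected tool names that were called.

    require_order=False: set match, order and repeats are ignored.
    require_order=True:  expected names must appear in order among the
                         called names (extra calls in between are allowed).
    """
    if not expected:
        return 1.0
    if not require_order:
        wanted = set(expected)
        return len(wanted & set(called)) / len(wanted)
    matched = 0
    for name in called:
        # Advance only when the next expected tool shows up.
        if matched < len(expected) and name == expected[matched]:
            matched += 1
    return matched / len(expected)
```

With `called=["summarize", "search"]` and `expected=["search", "summarize"]`, the set match gives full credit while the ordered match credits only the first expected tool.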

Tier 4: Conversation evaluators

Evaluator	What it checks
`ConversationRelevance`	Each response stays on topic
`KnowledgeRetention`	Model remembers earlier context
`ConversationCompleteness`	Conversation resolves the original goal
`TurnConsistency`	No contradictions across turns

from multivon_eval import EvalCase

case = EvalCase(
    input="Is this product available in blue?",
    conversation=[
        {"role": "user", "content": "I need a new laptop"},
        {"role": "assistant", "content": "I can help you find a laptop. What's your budget?"},
        {"role": "user", "content": "Around $1000"},
        {"role": "assistant", "content": "Here are some options around $1000..."},
    ],
)

`EvalSuite` — The runner

from multivon_eval import EvalSuite, NotEmpty, Relevance, Faithfulness

suite = EvalSuite("My Eval", model_id="gpt-4o")
suite.add_cases(cases)
suite.add_evaluators(NotEmpty(), Relevance(), Faithfulness(threshold=0.7))

# Serial
report = suite.run(model_fn, verbose=True, fail_threshold=0.8)

# Parallel (thread-based)
report = suite.run(model_fn, workers=8)

# Async
import asyncio
report = asyncio.run(suite.run_async(model_fn, concurrency=10))
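
For readers wondering what a concurrency limit means in the async case, here is a minimal sketch using `asyncio.Semaphore`. It assumes an async `model_fn`; `run_bounded` and `demo` are hypothetical names, and the real runner may work differently.

```python
import asyncio


async def run_bounded(model_fn, inputs, concurrency=10):
    """Call an async model_fn on every input, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def call_one(text):
        async with semaphore:  # wait here while the limit is reached
            return await model_fn(text)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(call_one(text) for text in inputs))


def demo():
    """Run five fake model calls with a limit of two and record the peak."""
    in_flight = {"now": 0, "peak": 0}

    async def fake_model(text):
        in_flight["now"] += 1
        in_flight["peak"] = max(in_flight["peak"], in_flight["now"])
        await asyncio.sleep(0.01)  # stand-in for a network call
        in_flight["now"] -= 1
        return text.upper()

    outputs = asyncio.run(run_bounded(fake_model, ["a", "b", "c", "d", "e"], concurrency=2))
    return outputs, in_flight["peak"]
```

A limit like this keeps an eval run from exceeding provider rate limits while still overlapping slow requests.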

Loading datasets

from multivon_eval import load

cases = load("tests/dataset.jsonl")  # auto-detects format
cases = load("tests/dataset.csv")

JSONL format:

{"input": "What is the capital of France?", "expected_output": "Paris", "tags": ["factual"]}
{"input": "Summarize this document.", "context": "Document text here...", "tags": ["summarization"]}

CSV format:

input,expected_output,context,tags
What is 2+2?,4,,math
Summarize this.,,Long text here,summarization
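
If you want to sanity-check a dataset file before handing it to `load()`, the two formats above can be parsed with the standard library alone. This is a sketch, not how `load()` is implemented; dropping empty cells and splitting the `tags` cell on `;` are assumptions.

```python
import csv
import io
import json


def parse_jsonl(text: str) -> list[dict]:
    """One JSON object per non-blank line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def parse_csv(text: str) -> list[dict]:
    """One case per row; empty cells are dropped."""
    cases = []
    for row in csv.DictReader(io.StringIO(text)):
        case = {key: value for key, value in row.items() if value}
        if "tags" in case:
            # Assumption: multiple tags in one cell are separated by ';'.
            case["tags"] = case["tags"].split(";")
        cases.append(case)
    return cases
```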

Exporting results

report.save_json("results.json")
report.save_csv("results.csv")

Synthetic dataset generation

No labeled data? No problem. Point generate_from_file() at your docs and get eval cases ready to run in seconds.

from multivon_eval import generate_from_file

# Generate QA pairs from your docs
cases = generate_from_file("docs/faq.md", n=20, task="qa")

# Generate summarization cases
cases = generate_from_file("docs/whitepaper.txt", n=10, task="summarization")

suite.add_cases(cases)
report = suite.run(my_model_fn)

From raw text:

from multivon_eval import generate_from_text

cases = generate_from_text(my_knowledge_base, n=50, task="qa")

Build a hallucination benchmark from your own content:

from multivon_eval import generate_hallucination_pairs

pairs = generate_hallucination_pairs(my_docs, n=20)
# Returns: [{question, context, faithful_answer, hallucinated_answer}, ...]
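
One way to use such pairs is to check how often a scorer ranks the faithful answer above the hallucinated one. The sketch below assumes the pair keys listed in the comment above and uses a deliberately naive token-overlap scorer as a stand-in; both helper names are hypothetical.

```python
def token_overlap(answer: str, context: str) -> float:
    """Naive groundedness proxy: share of answer tokens that occur in the context."""
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return 0.0
    context_tokens = set(context.lower().split())
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def pairwise_accuracy(pairs: list[dict], scorer=token_overlap) -> float:
    """Fraction of pairs where the faithful answer outscores the hallucinated one."""
    if not pairs:
        raise ValueError("at least one pair is required")
    wins = sum(
        1 for pair in pairs
        if scorer(pair["faithful_answer"], pair["context"])
        > scorer(pair["hallucinated_answer"], pair["context"])
    )
    return wins / len(pairs)
```

Swapping `token_overlap` for a call to your own evaluator gives a quick read on whether it separates grounded answers from invented ones on your content.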

CLI:

multivon-eval generate --from docs/faq.md --n 20 --task qa --output cases.jsonl

Experiment tracking

Record every suite run and compare results across model versions, prompt changes, or time. Stored locally in ~/.multivon/experiments/ — no cloud, no account.

from multivon_eval import Experiment

exp = Experiment("rag-pipeline")

# Run A — baseline
report_a = suite.run(old_model_fn)
run_a = exp.record(report_a, tags={"model": "gpt-4o", "prompt_v": "2"})

# Run B — new version
report_b = suite.run(new_model_fn)
run_b = exp.record(report_b, tags={"model": "gpt-4o", "prompt_v": "3"})

# Compare
exp.compare(run_a, run_b)

  ============================================================
  Experiment comparison: a1b2c3d4 → e5f6g7h8
  ============================================================

  Metric                   Before           After
  ------------------------------------------------------------
  Model                    gpt-4o           gpt-4o
  Pass rate                  84.0%  →   92.0%  ↑   +8.0%
  Avg score                 0.8210  →   0.8890  ↑  +0.0680
  Passed                        42  →       46
  Failed                         8  →        4

  Evaluator scores         Before           After
  ------------------------------------------------------------
  faithfulness             0.7800  →   0.8600  ↑  +0.0800
  relevance                0.9100  →   0.9300  ↑  +0.0200

  Verdict: IMPROVED — pass rate up +8.0%

CLI:

multivon-eval experiments list
multivon-eval experiments history rag-pipeline
multivon-eval experiments compare rag-pipeline a1b2c3d4 e5f6g7h8
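
The arithmetic behind a run comparison is simple enough to sketch. The function below derives pass rates from passed and failed counts, diffs per-evaluator averages, and picks a verdict from the pass-rate change. The input shape, the function name, and the `REGRESSED` and `UNCHANGED` labels are assumptions for illustration, not the format stored in ~/.multivon/experiments/.

```python
def compare_runs(before: dict, after: dict) -> dict:
    """Compare two runs shaped like
    {"passed": int, "failed": int, "scores": {evaluator_name: avg_score}}.
    """
    def pass_rate(run: dict) -> float:
        total = run["passed"] + run["failed"]
        return 100.0 * run["passed"] / total if total else 0.0

    rate_before, rate_after = pass_rate(before), pass_rate(after)
    delta = round(rate_after - rate_before, 1)
    # Only evaluators present in both runs can be compared.
    score_deltas = {
        name: round(after["scores"][name] - before["scores"][name], 4)
        for name in before["scores"] if name in after["scores"]
    }
    if delta > 0:
        verdict = "IMPROVED"
    elif delta < 0:
        verdict = "REGRESSED"  # hypothetical label
    else:
        verdict = "UNCHANGED"  # hypothetical label
    return {
        "pass_rate_before": rate_before,
        "pass_rate_after": rate_after,
        "pass_rate_delta": delta,
        "score_deltas": score_deltas,
        "verdict": verdict,
    }
```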

CLI

multivon-eval run eval.py
multivon-eval report results.json

CI/CD integration

# eval.py
report = suite.run(model_fn, fail_threshold=0.85)  # exits 1 if < 85% pass

# .github/workflows/eval.yml
- name: Run evals
  run: python eval.py
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
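
The CI step fails because the script exits with a non-zero code. As a sketch of that gate (the `gate` helper is hypothetical, and treating an empty run as a failure is an assumption), the logic amounts to:

```python
def gate(passed_flags: list[bool], fail_threshold: float) -> int:
    """Return a process exit code: 0 if the pass rate meets the threshold, else 1."""
    if not passed_flags:
        return 1  # assumption: an empty run should not pass silently
    pass_rate = sum(passed_flags) / len(passed_flags)
    return 0 if pass_rate >= fail_threshold else 1
```

Calling `sys.exit(gate(flags, 0.85))` at the end of a script gives GitHub Actions the signal it needs to mark the job red.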

Architecture

EvalSuite.run(model_fn)
     │
     ├── for each EvalCase:
     │     ├── call model_fn(case.input) → output
     │     └── for each Evaluator:
     │           ├── Deterministic → no LLM, instant
     │           ├── LLM Judge → QAG yes/no questions → fraction score
     │           ├── Agent → trace inspection + LLM judge
     │           └── Conversation → multi-turn analysis
     │
     └── EvalReport
           ├── CaseResult × N
           ├── per-evaluator scores
           ├── terminal report (rich)
           └── export → JSON / CSV
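
The flow in the diagram can be approximated in a few lines of plain Python. This is a simplified sketch under stated assumptions, not the actual `EvalSuite`: cases are bare input strings, evaluators are functions returning a score in [0, 1], and a case passes only if every evaluator meets the threshold.

```python
from collections import defaultdict
from statistics import mean


def run_suite(cases, evaluators, model_fn, threshold=0.5):
    """cases: list of input strings.
    evaluators: dict of name -> fn(input, output) returning a float in [0, 1].
    """
    per_evaluator = defaultdict(list)
    results = []
    for case_input in cases:
        output = model_fn(case_input)  # call the model under test
        scores = {name: fn(case_input, output) for name, fn in evaluators.items()}
        for name, score in scores.items():
            per_evaluator[name].append(score)
        results.append({
            "input": case_input,
            "output": output,
            "scores": scores,
            # Assumption: every evaluator must meet the threshold.
            "passed": all(score >= threshold for score in scores.values()),
        })
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return {
        "results": results,
        "by_evaluator": {name: mean(vals) for name, vals in per_evaluator.items()},
        "pass_rate": pass_rate,
    }
```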

Judge model: Configured via JUDGE_MODEL and JUDGE_PROVIDER env vars. Defaults to claude-sonnet-4-6. The model under test and the judge model can be different providers.

Examples

File	What it shows
`basic_eval.py`	Deterministic evaluators, no LLM judge
`rag_eval.py`	Faithfulness + hallucination for RAG systems
`ci_eval.py`	CI/CD integration with pass threshold

Tests

pip install -e ".[dev]"
pytest tests/ -v

Roadmap

Deterministic evaluators (BLEU, ROUGE, regex, JSON schema, latency)
LLM-as-judge with QAG scoring
Agent trace evaluators (tool call accuracy, plan quality)
Conversation evaluators
Parallel + async runners
CLI (multivon-eval run, multivon-eval report)
HTML report export
Pytest plugin (@eval_case decorator)
Model comparison mode — diff two models on same cases
Eval versioning — track scores over time

Contributing

Issues and PRs welcome.

Small changes (docs, bug fixes): open a PR directly. Large changes (new evaluators, architecture): open an issue first.

git clone https://github.com/multivon-ai/multivon-eval
cd multivon-eval
pip install -e ".[dev]"
pytest tests/

License

Apache 2.0 — built by Multivon

Release history

Version	Released
0.8.2	May 19, 2026
0.8.1	May 19, 2026
0.8.0	May 19, 2026
0.7.8	May 17, 2026
0.7.7	May 17, 2026
0.7.6	May 17, 2026
0.7.5	May 17, 2026
0.7.4	May 17, 2026
0.7.3	May 16, 2026
0.7.2	May 16, 2026
0.7.1	May 16, 2026
0.7.0	May 16, 2026
0.6.1	May 14, 2026
0.6.0	May 14, 2026
0.5.0	May 12, 2026
0.4.0	Apr 29, 2026
0.3.0	Apr 26, 2026
0.2.0	Apr 26, 2026
0.1.3 (this version)	Apr 26, 2026
0.1.2	Apr 26, 2026
0.1.1	Apr 26, 2026
0.1.0	Apr 25, 2026

Download files

Source Distribution

multivon_eval-0.1.3.tar.gz (39.9 kB view details)

Uploaded Apr 26, 2026 Source

Built Distribution

multivon_eval-0.1.3-py3-none-any.whl (39.5 kB view details)

Uploaded Apr 26, 2026 Python 3

File details

Details for the file multivon_eval-0.1.3.tar.gz.

File metadata

Download URL: multivon_eval-0.1.3.tar.gz
Upload date: Apr 26, 2026
Size: 39.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for multivon_eval-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`eb1acca9d5841d4bfe4ba899be0ad8ff868d6e6c5e1ac919ea5af4fedf19a927`
MD5	`90b13797a58c09f617a6e8b4f8b9e9a7`
BLAKE2b-256	`0d4835951eb7f5183bba42bc5ced91026521de1b97cf982d3e2668468df06b0f`

File details

Details for the file multivon_eval-0.1.3-py3-none-any.whl.

File metadata

Download URL: multivon_eval-0.1.3-py3-none-any.whl
Upload date: Apr 26, 2026
Size: 39.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for multivon_eval-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`33c7040496bac2edfb79e160856bb05e566b92472bf8b5f40c8d815daa5c8d71`
MD5	`117a87cb14df43966634246d59c745bc`
BLAKE2b-256	`dd6ddd56e11bd427edd4c0072e8588220edd36bda86774abd9b45ebb1b10a6bf`
