AI evaluation for teams that ship models to production

multivon-eval

Run structured evals over your AI outputs — from simple string checks to LLM-as-judge scoring to agent trace validation — with a clean Python API, beautiful terminal reports, and CI/CD integration out of the box.


from multivon_eval import EvalSuite, EvalCase, Relevance, Faithfulness, NotEmpty

suite = EvalSuite("Support Bot Eval")
suite.add_cases([
    EvalCase(
        input="How do I reset my password?",
        context="Users can reset their password by clicking 'Forgot Password' on the login page.",
    ),
])
suite.add_evaluators(NotEmpty(), Relevance(), Faithfulness())
report = suite.run(my_model_fn)  # my_model_fn: your callable that maps an input to the model's output

─────────────────────── Support Bot Eval ───────────────────────
  #  Input                      Output                   Score  Status    Latency
  1  How do I reset my pas...   Click 'Forgot Passwor…   0.92   PASS      843ms

                           By Evaluator
  Evaluator       Avg Score    Pass Rate
  not_empty          1.00        100%
  relevance          0.88        100%
  faithfulness       0.87        100%

╭─────────────────────── Summary ───────────────────────╮
│ Total: 1   Passed: 1   Failed: 0   Pass Rate: 100%   │
╰────────────────────────────────────────────────────────╯

Why multivon-eval

Every team building AI products hits the same problem: how do you know if your model is getting better or worse?

Existing tools have real limitations:

  • DeepEval — powerful but LLM-as-judge for everything is expensive, slow, and hard to audit
  • RAGAS — excellent, but RAG-only
  • Promptfoo — YAML-driven, feels rigid for Python teams

multivon-eval is different in three ways:

QAG scoring — Instead of asking a judge "rate this 1-10", we generate yes/no questions about the output and score by the fraction answered correctly. Binary questions are easier for LLMs to get right, fully auditable, and cheaper.

Agent-native — Built-in evaluators for tool call accuracy, plan quality, step faithfulness, and task completion. Covers agent traces from any framework (LangChain, LlamaIndex, custom).

Four tiers — Deterministic (free, instant), LLM-judge (QAG), agent-trace, and conversation evaluators. Mix and match; pay for LLM calls only where it matters.
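
The QAG scoring described above can be sketched in a few lines of plain Python. This is an illustration only, not multivon-eval's implementation: `qag_score` and the `answers` structure are hypothetical names, and the judge's yes/no answers are hard-coded here instead of coming from an LLM.

```python
from typing import List, Tuple

# Hypothetical illustration of QAG-style scoring; not multivon-eval's implementation.
# Each entry: (question, expected answer, answer the judge actually gave).
Answer = Tuple[str, bool, bool]


def qag_score(answers: List[Answer]) -> float:
    """Return the fraction of yes/no questions whose judged answer matches the expected one."""
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(1 for _, expected, judged in answers if expected == judged)
    return correct / len(answers)


answers = [
    ("Does the response mention the 'Forgot Password' link?", True, True),
    ("Does the response say where the link is located?", True, True),
    ("Does the response invent steps not found in the context?", False, True),
    ("Does the response address the user's question?", True, True),
]

score = qag_score(answers)
# The per-question records double as an audit trail: list the misses.
misses = [q for q, expected, judged in answers if expected != judged]
print(f"score={score:.2f}", misses)
```

Because each question is recorded with its expected and judged answer, a low score can be traced back to the specific questions that failed, which is the auditability argument made above.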


Install

pip install multivon-eval

# From a checkout of the repository (where .env.example is presumably found):
cp .env.example .env
# Then add ANTHROPIC_API_KEY and/or OPENAI_API_KEY to .env

Core concepts

EvalCase — A test case

from multivon_eval import EvalCase

case = EvalCase(
    input="What caused the 2008 financial crisis?",          # required
    expected_output="Subprime mortgage collapse...",          # for ExactMatch, Contains
    context="The 2008 crisis was triggered by...",           # for Faithfulness, Hallucination
    tags=["finance", "history"],                             # for filtering reports
    metadata={"source": "test_set_v2", "difficulty": "hard"},
)

For agent evals:

from multivon_eval import EvalCase, AgentStep, ToolCall

case = EvalCase(
    input="search for recent AI papers and summarize",
    agent_trace=[
        AgentStep(tool_calls=[ToolCall(name="search", arguments={"query": "AI papers 2025"})]),
        AgentStep(tool_calls=[ToolCall(name="summarize")]),
    ],
    expected_tool_calls=["search", "summarize"],
)

Evaluators

Four tiers — pick what fits your use case.

Tier 1: Deterministic (free, instant, no LLM needed)

Evaluator               What it checks
NotEmpty                Response is non-empty
ExactMatch              Response matches expected_output exactly
Contains(substrings)    Response contains all required strings
RegexMatch(pattern)     Response matches a regex pattern
JSONSchemaEval(schema)  Response is valid JSON matching a schema
WordCount(min, max)     Word count within range
Latency(max_ms)         Response time under limit
BLEU(n)                 BLEU-n score vs expected output
ROUGE                   ROUGE-L F1 vs expected output
StartsWith(prefix)      Response starts with prefix
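
Checks in this tier come down to ordinary string and counting logic, which is why they need no LLM. The sketch below shows the idea with two hypothetical helpers, `contains_all` and `word_count_in_range`. They are stand-ins written for this example, not the library's `Contains` and `WordCount` classes, which may differ in details such as case sensitivity or tokenization.

```python
from typing import Iterable

# Hypothetical stand-ins for illustration; the library's Contains and WordCount
# evaluators may differ in details such as case sensitivity or tokenization.


def contains_all(response: str, substrings: Iterable[str]) -> bool:
    """True when every required substring appears in the response."""
    return all(s in response for s in substrings)


def word_count_in_range(response: str, min_words: int, max_words: int) -> bool:
    """True when the whitespace-separated word count falls within [min_words, max_words]."""
    return min_words <= len(response.split()) <= max_words


response = "Click 'Forgot Password' on the login page."
print(contains_all(response, ["Forgot Password", "login page"]))
print(word_count_in_range(response, 3, 20))
```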

Tier 2: LLM-as-judge (QAG scoring)

Evaluator         What it measures                        Requires context
Faithfulness      Response is grounded in context         Yes
Hallucination     Response doesn't invent facts           Yes
Relevance         Response addresses the question         No
Coherence         Response is clear and well-structured   No
Toxicity          Response is safe and non-harmful        No
Bias              Response is free of demographic bias    No
Summarization     Summary captures key points faithfully  Yes
AnswerAccuracy    Factual correctness vs expected         No
ContextPrecision  Relevant context retrieved              Yes
ContextRecall     All needed context retrieved            Yes
CustomRubric      Your own yes/no criteria                Optional
GEval             Holistic numeric quality score          Optional

from multivon_eval import CustomRubric

# Each criterion pairs a yes/no question with the answer a good response should earn.
rubric = CustomRubric(
    name="support_quality",
    criteria=[
        ("Does the response acknowledge the customer's problem?", True),
        ("Does the response provide a concrete next step?", True),
        ("Does the response use apologetic or defensive language?", False),
    ],
    threshold=0.8,
)
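
One consequence of pairing a short rubric with a high threshold: with three questions and `threshold=0.8`, a single mismatch gives 2/3 ≈ 0.67 and fails. The sketch below works this out for a few rubric sizes. It assumes a rubric passes when the fraction of matching answers is at least the threshold, a comparison rule the text above does not spell out. `min_matches_to_pass` is a hypothetical helper.

```python
import math

# Assumption for illustration: a rubric passes when
# (matching answers / total questions) >= threshold.


def min_matches_to_pass(num_questions: int, threshold: float) -> int:
    """Smallest number of matching answers whose fraction reaches the threshold."""
    # The small epsilon guards against float noise such as 10 * 0.7 == 7.000000000000001.
    return math.ceil(num_questions * threshold - 1e-9)


for n in (3, 5, 10):
    print(n, min_matches_to_pass(n, 0.8))
```

If partial credit should be able to pass, add more questions or lower the threshold.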

Tier 3: Agent trace evaluators

Evaluator             What it checks
ToolCallAccuracy      Expected tools called (ordered or unordered)
ToolArgumentAccuracy  Quality of tool arguments
PlanQuality           Plan logic, completeness, efficiency
TaskCompletion        Final output satisfies the task goal
StepFaithfulness      Each step follows logically from prior

from multivon_eval import ToolCallAccuracy

ToolCallAccuracy(require_order=True)  # strict ordering
ToolCallAccuracy(require_order=False) # set match (default)
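
The difference between the two modes can be sketched in plain Python. This is one plausible reading, not the library's code: `tools_match` is a hypothetical helper, "ordered" is interpreted here as "expected names appear as a subsequence of the calls", and the real `ToolCallAccuracy` may score differently (for example, with partial credit).

```python
from typing import List

# Hypothetical sketch; multivon-eval's ToolCallAccuracy may score differently
# (for example, by giving partial credit).


def tools_match(called: List[str], expected: List[str], require_order: bool) -> bool:
    """Check expected tool names against the names actually called."""
    if not require_order:
        # Set match: every expected tool appears somewhere in the trace.
        return set(expected) <= set(called)
    # Ordered match: expected names appear in the trace as a subsequence.
    remaining = iter(called)
    return all(name in remaining for name in expected)


called = ["summarize", "search"]
expected = ["search", "summarize"]
print(tools_match(called, expected, require_order=False))
print(tools_match(called, expected, require_order=True))
```

Here the agent called the right tools in the wrong order, so the set match succeeds while the ordered match fails.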

Tier 4: Conversation evaluators

Evaluator                 What it checks
ConversationRelevance     Each response stays on topic
KnowledgeRetention        Model remembers earlier context
ConversationCompleteness  Conversation resolves the original goal
TurnConsistency           No contradictions across turns

from multivon_eval import EvalCase

case = EvalCase(
    input="Is this product available in blue?",
    conversation=[
        {"role": "user", "content": "I need a new laptop"},
        {"role": "assistant", "content": "I can help you find a laptop. What's your budget?"},
        {"role": "user", "content": "Around $1000"},
        {"role": "assistant", "content": "Here are some options around $1000..."},
    ],
)

EvalSuite — The runner

from multivon_eval import EvalSuite, NotEmpty, Relevance, Faithfulness

suite = EvalSuite("My Eval", model_id="gpt-4o")
suite.add_cases(cases)
suite.add_evaluators(NotEmpty(), Relevance(), Faithfulness(threshold=0.7))

# Serial
report = suite.run(model_fn, verbose=True, fail_threshold=0.8)

# Parallel (thread-based)
report = suite.run(model_fn, workers=8)

# Async
import asyncio
report = asyncio.run(suite.run_async(model_fn, concurrency=10))
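
The snippets above pass a `model_fn` without defining it. Going by the architecture description, it is called with the case input and returns the output, so a stub might look like the sketch below. The string-in, string-out signature is inferred, and the async variant is a guess at what `run_async` could accept. Neither is confirmed API.

```python
import asyncio

# Assumption: model_fn takes the case input string and returns the output string.
# The async signature is a guess at what run_async could accept.


def model_fn(prompt: str) -> str:
    """Stub model: replace the body with a call to your own model or API client."""
    return f"(stub answer to: {prompt})"


async def async_model_fn(prompt: str) -> str:
    """Async stub: simulates a non-blocking model call."""
    await asyncio.sleep(0)  # stands in for network I/O
    return f"(stub answer to: {prompt})"


print(model_fn("How do I reset my password?"))
print(asyncio.run(async_model_fn("How do I reset my password?")))
```

A stub like this is also handy for dry-running a suite with deterministic evaluators before spending on real model calls.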

Loading datasets

from multivon_eval import load

cases = load("tests/dataset.jsonl")  # auto-detects format
cases = load("tests/dataset.csv")

JSONL format:

{"input": "What is the capital of France?", "expected_output": "Paris", "tags": ["factual"]}
{"input": "Summarize this document.", "context": "Document text here...", "tags": ["summarization"]}

CSV format:

input,expected_output,context,tags
What is 2+2?,4,,math
Summarize this.,,Long text here,summarization
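
If your cases live in Python structures, a JSONL file in the shape shown above can be produced with the standard library alone. The sketch below writes the two example rows and reads them back as a sanity check. It uses only `json`, `tempfile`, and `pathlib`, and the file location is arbitrary.

```python
import json
import tempfile
from pathlib import Path

# Field names mirror the JSONL example above; the file name is arbitrary.
rows = [
    {"input": "What is the capital of France?", "expected_output": "Paris", "tags": ["factual"]},
    {"input": "Summarize this document.", "context": "Document text here...", "tags": ["summarization"]},
]

path = Path(tempfile.mkdtemp()) / "dataset.jsonl"
with path.open("w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")  # one JSON object per line

# Sanity check: every line parses back and carries the required "input" field.
with path.open(encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f if line.strip()]
print(len(parsed), all("input" in row for row in parsed))
```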

Exporting results

report.save_json("results.json")
report.save_csv("results.csv")

CLI

multivon-eval run eval.py
multivon-eval report results.json

CI/CD integration

# eval.py
report = suite.run(model_fn, fail_threshold=0.85)  # exits with status 1 if the pass rate is below 85%

# .github/workflows/eval.yml
- name: Run evals
  run: python eval.py
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
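
The gating behaviour amounts to comparing the pass rate with the threshold and returning a non-zero exit status so the CI job fails. The sketch below shows that logic in isolation. `exit_code_for` is a hypothetical helper, not part of multivon-eval, and treating an empty run as a failure is a design choice made for this example.

```python
# Hypothetical sketch of threshold gating; not multivon-eval's own code.


def exit_code_for(passed: int, total: int, fail_threshold: float) -> int:
    """Return 1 (fail the CI job) when the pass rate is below the threshold, else 0."""
    if total == 0:
        return 1  # treat an empty run as a failure rather than a silent pass
    return 1 if passed / total < fail_threshold else 0


print(exit_code_for(17, 20, 0.85))  # 17/20 = 0.85, not below the threshold
print(exit_code_for(16, 20, 0.85))  # 16/20 = 0.80, below the threshold
```

In a script that does its own gating, the result would be handed to `sys.exit()`.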

Architecture

EvalSuite.run(model_fn)
     │
     ├── for each EvalCase:
     │     ├── call model_fn(case.input) → output
     │     └── for each Evaluator:
     │           ├── Deterministic → no LLM, instant
     │           ├── LLM Judge → QAG yes/no questions → fraction score
     │           ├── Agent → trace inspection + LLM judge
     │           └── Conversation → multi-turn analysis
     │
     └── EvalReport
           ├── CaseResult × N
           ├── per-evaluator scores
           ├── terminal report (rich)
           └── export → JSON / CSV
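
The flow in the diagram can be approximated with a short, self-contained loop. This is a simplified sketch under stated assumptions, not the library's code: `CaseResult` here is a hypothetical dataclass, evaluators are modelled as plain callables returning a score between 0 and 1, and reporting and export are left out.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical, simplified rendering of the flow above; names and types are
# illustrative and do not mirror multivon-eval's real classes.
Evaluator = Callable[[str, str], float]  # (input, output) -> score in [0, 1]


@dataclass
class CaseResult:
    input: str
    output: str
    latency_ms: float
    scores: Dict[str, float] = field(default_factory=dict)


def run(inputs: List[str], model_fn: Callable[[str], str],
        evaluators: Dict[str, Evaluator]) -> List[CaseResult]:
    results = []
    for text in inputs:
        start = time.perf_counter()
        output = model_fn(text)  # call the model under test
        latency_ms = (time.perf_counter() - start) * 1000
        result = CaseResult(text, output, latency_ms)
        for name, evaluate in evaluators.items():
            result.scores[name] = evaluate(text, output)  # one score per evaluator
        results.append(result)
    return results


results = run(
    ["How do I reset my password?"],
    model_fn=lambda q: "Click 'Forgot Password' on the login page.",
    evaluators={"not_empty": lambda _q, out: 1.0 if out.strip() else 0.0},
)
print(results[0].scores)
```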

Judge model: Configured via the JUDGE_MODEL and JUDGE_PROVIDER environment variables; the default model is claude-sonnet-4-6. The model under test and the judge model can come from different providers.
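
To check which judge settings a process would see, the two variables can be read directly. The sketch below only inspects the environment and does not call multivon-eval. The fallback for `JUDGE_MODEL` is the default stated above, and since no default provider is documented, none is assumed.

```python
import os

# Assumption: the judge is selected purely from these two environment variables.
# The JUDGE_MODEL default comes from the text above; no default provider is
# documented there, so none is assumed here.
judge_model = os.environ.get("JUDGE_MODEL", "claude-sonnet-4-6")
judge_provider = os.environ.get("JUDGE_PROVIDER")  # None when unset

print(f"judge model: {judge_model}, provider: {judge_provider or '(unset)'}")
```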


Examples

File           What it shows
basic_eval.py  Deterministic evaluators, no LLM judge
rag_eval.py    Faithfulness + hallucination for RAG systems
ci_eval.py     CI/CD integration with pass threshold

Tests

pip install -e ".[dev]"
pytest tests/ -v

Roadmap

  • Deterministic evaluators (BLEU, ROUGE, regex, JSON schema, latency)
  • LLM-as-judge with QAG scoring
  • Agent trace evaluators (tool call accuracy, plan quality)
  • Conversation evaluators
  • Parallel + async runners
  • CLI (multivon-eval run, multivon-eval report)
  • HTML report export
  • Pytest plugin (@eval_case decorator)
  • Model comparison mode — diff two models on same cases
  • Eval versioning — track scores over time

Contributing

Issues and PRs welcome.

Small changes (docs, bug fixes): open a PR directly. Large changes (new evaluators, architecture): open an issue first.

git clone https://github.com/multivon-ai/multivon-eval
cd multivon-eval
pip install -e ".[dev]"
pytest tests/

License

MIT — built by Multivon
