AI evaluation for teams that ship models to production
multivon-eval
Run structured evals over your AI outputs — from simple string checks to LLM-as-judge scoring to agent trace validation — with a clean Python API, beautiful terminal reports, and CI/CD integration out of the box.
from multivon_eval import EvalSuite, EvalCase, Relevance, Faithfulness, NotEmpty
suite = EvalSuite("Support Bot Eval")
suite.add_cases([
    EvalCase(
        input="How do I reset my password?",
        context="Users can reset their password by clicking 'Forgot Password' on the login page.",
    ),
])
suite.add_evaluators(NotEmpty(), Relevance(), Faithfulness())
# my_model_fn is your own callable: it receives case.input and returns the model output
report = suite.run(my_model_fn)
─────────────────────── Support Bot Eval ───────────────────────
# Input Output Score Status Latency
1 How do I reset my pas... Click 'Forgot Passwor… 0.92 PASS 843ms
By Evaluator
Evaluator Avg Score Pass Rate
not_empty 1.00 100%
relevance 0.88 88%
faithfulness 0.87 87%
╭─────────────────────── Summary ───────────────────────╮
│ Total: 1 Passed: 1 Failed: 0 Pass Rate: 100% │
╰────────────────────────────────────────────────────────╯
Why multivon-eval
Every team building AI products hits the same problem: how do you know if your model is getting better or worse?
Existing tools have real limitations:
- DeepEval — powerful but LLM-as-judge for everything is expensive, slow, and hard to audit
- RAGAS — excellent, but RAG-only
- Promptfoo — YAML-driven, feels rigid for Python teams
multivon-eval is different in three ways:
- QAG scoring: Instead of asking a judge to "rate this 1-10", we generate yes/no questions about the output and score by the fraction answered correctly. Binary questions are easier for LLMs to get right, fully auditable, and cheaper.
- Agent-native: Built-in evaluators for tool call accuracy, plan quality, step faithfulness, and task completion. Covers agent traces from any framework (LangChain, LlamaIndex, custom).
- Four tiers: Deterministic (free, instant), LLM-judge (QAG), agent-trace, and conversation evaluators. Mix and match; pay for LLM calls only where it matters.
Install
pip install multivon-eval
cp .env.example .env
# Add ANTHROPIC_API_KEY and/or OPENAI_API_KEY
Core concepts
EvalCase — A test case
from multivon_eval import EvalCase
case = EvalCase(
    input="What caused the 2008 financial crisis?",        # required
    expected_output="Subprime mortgage collapse...",       # for ExactMatch, Contains
    context="The 2008 crisis was triggered by...",         # for Faithfulness, Hallucination
    tags=["finance", "history"],                           # for filtering reports
    metadata={"source": "test_set_v2", "difficulty": "hard"},
)
For agent evals:
from multivon_eval import EvalCase, AgentStep, ToolCall
case = EvalCase(
    input="search for recent AI papers and summarize",
    agent_trace=[
        AgentStep(tool_calls=[ToolCall(name="search", arguments={"query": "AI papers 2025"})]),
        AgentStep(tool_calls=[ToolCall(name="summarize")]),
    ],
    expected_tool_calls=["search", "summarize"],
)
Evaluators
Four tiers — pick what fits your use case.
Tier 1: Deterministic (free, instant, no LLM needed)
| Evaluator | What it checks |
|---|---|
| NotEmpty | Response is non-empty |
| ExactMatch | Response matches expected_output exactly |
| Contains(substrings) | Response contains all required strings |
| RegexMatch(pattern) | Response matches a regex pattern |
| JSONSchemaEval(schema) | Response is valid JSON matching a schema |
| WordCount(min, max) | Word count within range |
| Latency(max_ms) | Response time under limit |
| BLEU(n) | BLEU-n score vs expected output |
| ROUGE | ROUGE-L F1 vs expected output |
| StartsWith(prefix) | Response starts with prefix |
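To make the deterministic tier concrete, here is a minimal pure-Python sketch of the kind of check that Contains, RegexMatch, and WordCount describe. The function names and the 0.0/1.0 scoring are assumptions for illustration; this is not multivon-eval's implementation.

```python
import re

# Hypothetical stand-ins for illustration; not the library's code.
def contains_all(output: str, substrings: list[str]) -> float:
    """Return 1.0 if every required substring appears in the output, else 0.0."""
    return 1.0 if all(s in output for s in substrings) else 0.0

def regex_match(output: str, pattern: str) -> float:
    """Return 1.0 if the pattern is found anywhere in the output, else 0.0."""
    return 1.0 if re.search(pattern, output) else 0.0

def word_count_in_range(output: str, min_words: int, max_words: int) -> float:
    """Return 1.0 if the whitespace-delimited word count is within [min, max]."""
    n = len(output.split())
    return 1.0 if min_words <= n <= max_words else 0.0

output = "Click 'Forgot Password' on the login page."
scores = {
    "contains": contains_all(output, ["Forgot Password", "login"]),
    "regex": regex_match(output, r"(?i)forgot\s+password"),
    "word_count": word_count_in_range(output, 3, 20),
}
print(scores)
```

Checks like these need no API key and run in microseconds, which is why they make a sensible first gate before any LLM-judged evaluator.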
Tier 2: LLM-as-judge (QAG scoring)
| Evaluator | What it measures | Requires context |
|---|---|---|
| Faithfulness | Response is grounded in context | Yes |
| Hallucination | Response doesn't invent facts | Yes |
| Relevance | Response addresses the question | No |
| Coherence | Response is clear and well-structured | No |
| Toxicity | Response is safe and non-harmful | No |
| Bias | Response is free of demographic bias | No |
| Summarization | Summary captures key points faithfully | Yes |
| AnswerAccuracy | Factual correctness vs expected | No |
| ContextPrecision | Relevant context retrieved | Yes |
| ContextRecall | All needed context retrieved | Yes |
| CustomRubric | Your own yes/no criteria | Optional |
| GEval | Holistic numeric quality score | Optional |
from multivon_eval import Faithfulness, CustomRubric
CustomRubric(
    name="support_quality",
    criteria=[
        ("Does the response acknowledge the customer's problem?", True),
        ("Does the response provide a concrete next step?", True),
        ("Does the response use apologetic or defensive language?", False),
    ],
    threshold=0.8,
)
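The rubric pairs each yes/no question with its expected answer. Below is a minimal sketch of how such answers could become a score, assuming the score is the fraction of questions whose judged answer matches the expected one (as the QAG description suggests) and that a case passes when the score is at or above the threshold. The judge answers are hard-coded here, and `qag_score` is a hypothetical helper, not part of multivon-eval.

```python
# Hypothetical helper for illustration; the library's internals may differ.
def qag_score(criteria, judged_answers):
    """Fraction of questions whose judged answer equals the expected answer."""
    if len(criteria) != len(judged_answers):
        raise ValueError("one judged answer is needed per criterion")
    if not criteria:
        return 0.0
    matches = sum(
        1
        for (_question, expected), got in zip(criteria, judged_answers)
        if expected == got
    )
    return matches / len(criteria)

criteria = [
    ("Does the response acknowledge the customer's problem?", True),
    ("Does the response provide a concrete next step?", True),
    ("Does the response use apologetic or defensive language?", False),
]

# Pretend a judge answered yes to all three: the third misses its expected "no".
judged = [True, True, True]
score = qag_score(criteria, judged)
threshold = 0.8  # same value as the rubric above
passed = score >= threshold  # ">=" is an assumption about the comparison
print(f"score={score:.2f} passed={passed}")
```

With two of three answers matching, the score is about 0.67, which falls below the 0.8 threshold. Each question's answer can be inspected on its own, which is what makes this style of scoring auditable.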
Tier 3: Agent trace evaluators
| Evaluator | What it checks |
|---|---|
| ToolCallAccuracy | Expected tools called (ordered or unordered) |
| ToolArgumentAccuracy | Quality of tool arguments |
| PlanQuality | Plan logic, completeness, efficiency |
| TaskCompletion | Final output satisfies the task goal |
| StepFaithfulness | Each step follows logically from prior |
from multivon_eval import ToolCallAccuracy
ToolCallAccuracy(require_order=True) # strict ordering
ToolCallAccuracy(require_order=False) # set match (default)
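The difference between the two modes can be illustrated with a small stand-alone function. This is a sketch under stated assumptions: unordered mode is taken to mean "fraction of expected tools that were called at all", and ordered mode "fraction of expected tools matched in sequence". The source does not spell out either rule, and `tool_call_accuracy` is a hypothetical name.

```python
# Hypothetical sketch; multivon-eval's actual matching rules may differ.
def tool_call_accuracy(called, expected, require_order=False):
    """Score how well the called tool names cover the expected ones.

    Unordered: fraction of distinct expected tools that were called at all.
    Ordered: fraction of expected tools matched, in order, as a subsequence.
    """
    if not expected:
        return 1.0
    if not require_order:
        wanted = set(expected)
        return len(wanted & set(called)) / len(wanted)
    matched = 0
    for name in called:
        if matched < len(expected) and name == expected[matched]:
            matched += 1
    return matched / len(expected)

expected = ["search", "summarize"]
in_order = ["search", "summarize"]
reversed_calls = ["summarize", "search"]

print(tool_call_accuracy(in_order, expected, require_order=True))
print(tool_call_accuracy(reversed_calls, expected, require_order=False))
print(tool_call_accuracy(reversed_calls, expected, require_order=True))
```

Under these assumptions the reversed trace still scores 1.0 as a set match but only 0.5 when order is required, since only "search" can be matched in sequence.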
Tier 4: Conversation evaluators
| Evaluator | What it checks |
|---|---|
| ConversationRelevance | Each response stays on topic |
| KnowledgeRetention | Model remembers earlier context |
| ConversationCompleteness | Conversation resolves the original goal |
| TurnConsistency | No contradictions across turns |
from multivon_eval import EvalCase
case = EvalCase(
    input="Is this product available in blue?",
    conversation=[
        {"role": "user", "content": "I need a new laptop"},
        {"role": "assistant", "content": "I can help you find a laptop. What's your budget?"},
        {"role": "user", "content": "Around $1000"},
        {"role": "assistant", "content": "Here are some options around $1000..."},
    ],
)
EvalSuite — The runner
from multivon_eval import EvalSuite, NotEmpty, Relevance, Faithfulness
suite = EvalSuite("My Eval", model_id="gpt-4o")
suite.add_cases(cases)
suite.add_evaluators(NotEmpty(), Relevance(), Faithfulness(threshold=0.7))
# Serial
report = suite.run(model_fn, verbose=True, fail_threshold=0.8)
# Parallel (thread-based)
report = suite.run(model_fn, workers=8)
# Async
import asyncio
report = asyncio.run(suite.run_async(model_fn, concurrency=10))
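The three run modes differ only in how model calls are scheduled. Here is a rough sketch of the thread-based and async patterns using only the standard library. `fake_model`, `fake_model_async`, `run_threaded`, and `run_async` are hypothetical names, and the sketch says nothing about how multivon-eval implements `workers` or `concurrency` internally.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def fake_model(prompt: str) -> str:
    """Stand-in for a synchronous model call."""
    return prompt.upper()

async def fake_model_async(prompt: str) -> str:
    """Stand-in for an asynchronous model call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return prompt.upper()

def run_threaded(inputs, workers=8):
    """Call the model from a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_model, inputs))

async def run_async(inputs, concurrency=10):
    """Call the async model with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(prompt):
        async with semaphore:
            return await fake_model_async(prompt)

    return await asyncio.gather(*(bounded(p) for p in inputs))

inputs = ["a", "b", "c"]
threaded_outputs = run_threaded(inputs, workers=2)
async_outputs = asyncio.run(run_async(inputs, concurrency=2))
print(threaded_outputs, async_outputs)
```

Threads suit model functions that block on network I/O; the async pattern needs a model function written as a coroutine. In both cases the cap matters mostly for staying under provider rate limits.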
Loading datasets
from multivon_eval import load
cases = load("tests/dataset.jsonl") # auto-detects format
cases = load("tests/dataset.csv")
JSONL format:
{"input": "What is the capital of France?", "expected_output": "Paris", "tags": ["factual"]}
{"input": "Summarize this document.", "context": "Document text here...", "tags": ["summarization"]}
CSV format:
input,expected_output,context,tags
What is 2+2?,4,,math
Summarize this.,,Long text here,summarization
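Both formats map onto the same per-case fields. The sketch below parses them into plain dicts with the standard library only; it does not call multivon-eval. Dropping empty CSV cells and treating the tags column as a single tag are assumptions, since the source does not say how `load` handles them.

```python
import csv
import io
import json

JSONL_TEXT = '''{"input": "What is the capital of France?", "expected_output": "Paris", "tags": ["factual"]}
{"input": "Summarize this document.", "context": "Document text here...", "tags": ["summarization"]}
'''

CSV_TEXT = '''input,expected_output,context,tags
What is 2+2?,4,,math
Summarize this.,,Long text here,summarization
'''

def parse_jsonl(text):
    """One JSON object per non-blank line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def parse_csv(text):
    """One dict per row; empty cells are dropped (an assumption)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        record = {key: value for key, value in row.items() if value}
        if "tags" in record:
            record["tags"] = [record["tags"]]  # single tag per row, by assumption
        rows.append(record)
    return rows

jsonl_cases = parse_jsonl(JSONL_TEXT)
csv_cases = parse_csv(CSV_TEXT)
print(jsonl_cases[0]["expected_output"], csv_cases[0])
```

Note that CSV values always arrive as strings (the expected output here is "4", not the integer 4), which matters for exact-match style checks.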
Exporting results
report.save_json("results.json")
report.save_csv("results.csv")
CLI
multivon-eval run eval.py
multivon-eval report results.json
CI/CD integration
# eval.py
report = suite.run(model_fn, fail_threshold=0.85) # exits 1 if < 85% pass
# .github/workflows/eval.yml
- name: Run evals
  run: python eval.py
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
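The gate relies on the process exit code: a nonzero code fails the workflow step. Here is a small sketch of that logic, assuming the pass rate is compared against the threshold as a fraction. `exit_code_for` is a hypothetical helper, not a multivon-eval function.

```python
def exit_code_for(passed: int, total: int, fail_threshold: float) -> int:
    """Return 1 when the pass rate falls below the threshold, else 0."""
    pass_rate = passed / total if total else 0.0
    return 1 if pass_rate < fail_threshold else 0

# 17 of 20 cases passing is exactly 85%, which meets a 0.85 threshold.
print(exit_code_for(passed=17, total=20, fail_threshold=0.85))
# 16 of 20 is 80%, which falls short.
print(exit_code_for(passed=16, total=20, fail_threshold=0.85))
# A hand-rolled eval script would finish with sys.exit(exit_code_for(...)).
```

Treating an empty run as a failure (pass rate 0.0) is a design choice in this sketch: a CI job that silently evaluates zero cases should not go green.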
Architecture
EvalSuite.run(model_fn)
│
├── for each EvalCase:
│ ├── call model_fn(case.input) → output
│ └── for each Evaluator:
│ ├── Deterministic → no LLM, instant
│ ├── LLM Judge → QAG yes/no questions → fraction score
│ ├── Agent → trace inspection + LLM judge
│ └── Conversation → multi-turn analysis
│
└── EvalReport
├── CaseResult × N
├── per-evaluator scores
├── terminal report (rich)
└── export → JSON / CSV
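The flow above can be sketched in a few lines of plain Python. This is a simplified stand-in, not the library's code: the `Case` dataclass, the evaluator callables, and the rule that a case passes when its mean score reaches a threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Case:  # hypothetical, far smaller than the real EvalCase
    input: str
    expected_output: Optional[str] = None

def not_empty(case: Case, output: str) -> float:
    return 1.0 if output.strip() else 0.0

def exact_match(case: Case, output: str) -> float:
    return 1.0 if output == case.expected_output else 0.0

def run(cases, evaluators, model_fn: Callable[[str], str], threshold=0.5):
    """Call the model once per case, score with every evaluator, then aggregate.

    Assumes at least one evaluator is supplied.
    """
    results = []
    for case in cases:
        output = model_fn(case.input)
        scores = {fn.__name__: fn(case, output) for fn in evaluators}
        mean_score = sum(scores.values()) / len(scores)
        results.append({
            "input": case.input,
            "output": output,
            "scores": scores,
            "passed": mean_score >= threshold,
        })
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return {"results": results, "pass_rate": pass_rate}

cases = [Case("What is 2+2?", "4"), Case("Capital of France?", "Paris")]
report = run(cases, [not_empty, exact_match], lambda q: "4", threshold=0.75)
print(report["pass_rate"])
```

The model is called once per case and its output is shared by every evaluator, so adding cheap deterministic evaluators costs no extra model calls.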
Judge model: Configured via the JUDGE_MODEL and JUDGE_PROVIDER environment variables. Defaults to claude-sonnet-4-6. The model under test and the judge model can come from different providers.
Examples
| File | What it shows |
|---|---|
| basic_eval.py | Deterministic evaluators, no LLM judge |
| rag_eval.py | Faithfulness + hallucination for RAG systems |
| ci_eval.py | CI/CD integration with pass threshold |
Tests
pip install -e ".[dev]"
pytest tests/ -v
Roadmap
- Deterministic evaluators (BLEU, ROUGE, regex, JSON schema, latency)
- LLM-as-judge with QAG scoring
- Agent trace evaluators (tool call accuracy, plan quality)
- Conversation evaluators
- Parallel + async runners
- CLI (multivon-eval run, multivon-eval report)
- HTML report export
- Pytest plugin (@eval_case decorator)
- Model comparison mode: diff two models on same cases
- Eval versioning: track scores over time
Contributing
Issues and PRs welcome.
Small changes (docs, bug fixes): open a PR directly. Large changes (new evaluators, architecture): open an issue first.
git clone https://github.com/multivon-ai/multivon-eval
cd multivon-eval
pip install -e ".[dev]"
pytest tests/
License
MIT — built by Multivon