MCP server exposing multivon-eval + pdfhell as agent-callable tools. Drop into Claude Desktop, Cursor, Cline, or any MCP-compatible AI coding agent.


multivon-mcp


These 22 tools are what an autonomous eval agent needs to do its job: discover its own capabilities (eval_discover), normalize traces from any source (eval_ingest_trace), and run calibrated evaluators against them. The framework lives behind an MCP boundary because that's the future shape of eval: a swarm of specialized eval agents coordinating through the protocol, not a SaaS dashboard.

MCP server that gives AI coding agents direct access to evaluation tools. Drop into Claude Desktop, Claude Code, Cursor, Cline, or any Model Context Protocol–compatible agent.

When the agent is helping you build an LLM product, it can:

Score a RAG output for hallucination without you writing the scaffolding
Generate an adversarial PDF on demand to test your document AI
Run the full pdfhell mini-suite against a model and analyse the results
Produce a hash-chained audit pack for procurement diligence
Discover the full evaluation capability catalog as JSON

No copy-paste, no python -c "...", no asking the agent to figure out the SDK calls.

Install

pip install multivon-mcp

A bare install pulls in multivon-eval, pdfhell, and the MCP SDK. The provider SDKs (anthropic, openai, google-genai) are installed as well; supply your own API keys through environment variables.

Configure your agent

Claude Code

claude mcp add multivon --env ANTHROPIC_API_KEY=sk-ant-... -- multivon-mcp

(Or add the same mcpServers snippet below to a project-level .mcp.json — Claude Code does not read claude_desktop_config.json.)

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "multivon": {
      "command": "multivon-mcp",
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-proj-...",
        "GOOGLE_API_KEY": "AIza..."
      }
    }
  }
}
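If you would rather script the edit than hand-edit the file, a minimal sketch follows. It assumes only that the config is plain JSON with a top-level mcpServers object, as in the snippet above. The helper names add_mcp_server and update_config_file are hypothetical and are not part of multivon-mcp.

```python
import json
from pathlib import Path
from typing import Optional


def add_mcp_server(config: dict, name: str, command: str,
                   env: Optional[dict] = None) -> dict:
    """Return a copy of `config` with one entry added under mcpServers.

    Hypothetical helper: existing server entries are preserved, and an
    entry with the same name is overwritten.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    entry = {"command": command}
    if env:
        entry["env"] = dict(env)
    servers[name] = entry
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path, name: str, command: str,
                       env: Optional[dict] = None) -> None:
    """Read the JSON config at `path` (if present), add the entry, write it back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    updated = add_mcp_server(config, name, command, env)
    path.write_text(json.dumps(updated, indent=2))
```

Call update_config_file with the platform-specific path given above, for example `update_config_file(Path(...), "multivon", "multivon-mcp", {"ANTHROPIC_API_KEY": "sk-ant-..."})`, and back up the file first.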

Restart Claude Desktop. The 22 tools become available; ask Claude "use multivon to evaluate this RAG output" and it picks the tool to call.

Cursor

Add to .cursor/mcp.json (project-level) or ~/.cursor/mcp.json (global), or configure via Settings → MCP:

{ "mcpServers": { "multivon": { "command": "multivon-mcp" } } }

Cline / OpenCode / any MCP-compatible agent

Same shape — point at the multivon-mcp console script.

Local dev / debugging

From a clone of this repo:

mcp dev multivon_mcp/server.py

From a pip install (the file lives in site-packages, so resolve it):

mcp dev "$(python -c 'import multivon_mcp.server as s; print(s.__file__)')"

Opens the MCP Inspector UI in your browser. You can call any tool by name, see the JSON schemas, and watch the requests/responses.

The 22 tools

Discovery & document AI

Tool	What it does	API key
`eval_discover`	Full machine-readable capability catalog (evaluators, traps, suites, calibration data, versions). Call first.	No
`pdfhell_make`	Generate one adversarial PDF + its answer key.	No
`pdfhell_run`	Run the pdfhell adversarial-PDF benchmark against a vision model. Returns pass rate, per-trap CIs, suite hash.	Yes (vision)
`eval_audit_pack`	Build a hash-chained, procurement-ready ZIP from a pdfhell run.	No
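For readers unfamiliar with the term, "hash-chained" generally means each record's hash covers both its own payload and the previous record's hash, so editing or reordering any record invalidates everything after it. The sketch below illustrates that idea only. The actual audit-pack layout is not described here, so the field names (payload, prev_hash, hash) and both function names are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def chain_records(records):
    """Link records so each entry's hash covers its payload and the previous hash."""
    chained, prev = [], GENESIS
    for payload in records:
        body = json.dumps(payload, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        chained.append({"payload": payload, "prev_hash": prev, "hash": digest})
        prev = digest
    return chained


def verify_chain(chained):
    """Recompute every hash; an edited or reordered record breaks the chain."""
    prev = GENESIS
    for entry in chained:
        body = json.dumps(entry["payload"], sort_keys=True)
        expected = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        if entry["prev_hash"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

A verifier only needs the records themselves to re-derive every hash, which is what makes this shape convenient for third-party diligence.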

RAG generation & retrieval

Tool	What it does	API key
`eval_faithfulness`	QAG-graded faithfulness — is a RAG output grounded in the retrieved context?	Yes
`eval_hallucination`	QAG-graded hallucination — does the output contain content NOT in context?	Yes
`eval_relevance`	QAG-graded answer-vs-question relevance.	Yes
`eval_answer_accuracy`	QAG-graded semantic equivalence vs ground truth.	Yes
`eval_context_precision`	RAG retrieval quality — are the retrieved chunks on-topic?	Yes
`eval_context_recall`	RAG retrieval completeness — does context contain enough info to answer?	Yes

Safety, compliance, fairness

Tool	What it does	API key
`eval_toxicity`	QAG-graded toxicity / harmful-content detection.	Yes
`eval_bias`	QAG-graded bias across gender, race, politics, age, socioeconomic axes.	Yes
`eval_pii_detection`	Local-only regex scan for PII (GDPR / CCPA / PIPEDA / HIPAA packs).	No
`eval_schema_compliance`	Validate an LLM output against a JSON Schema.	No
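To give a feel for what a local-only regex scan involves, here is a minimal sketch. The two patterns and the scan_pii function are illustrative assumptions, not the patterns shipped in the GDPR / CCPA / PIPEDA / HIPAA packs, which are not reproduced here.

```python
import re

# Illustrative patterns only; a real regulation pack covers far more categories.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_pii(text):
    """Return (category, matched_text) pairs found in `text`, grouped by pattern."""
    findings = []
    for category, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((category, match.group(0)))
    return findings


print(scan_pii("Contact jane@example.com, SSN 123-45-6789."))
```

Because the scan is pure pattern matching, nothing leaves the machine, which is why the tool needs no API key.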

Agent & multimodal

Tool	What it does	API key
`eval_tool_call_accuracy`	Deterministic agent tool-call correctness. No LLM.	No
`eval_vqa_faithfulness`	Image-grounded visual-QA faithfulness.	Yes (vision)
`eval_document_grounding`	Multi-page document-grounded faithfulness for document-AI agents.	Yes (vision)

Agent traces. eval_tool_call_accuracy and the other agent-trace evaluators in multivon-eval (ToolArgumentAccuracy, ToolCallNecessity, TrajectoryEfficiency, AgentMemoryEval, PlanQuality, TaskCompletion, StepFaithfulness) take an agent_trace=[AgentStep(...)] plus expected_tool_calls=[...] on the case. Three-shape semantics matter: expected_tool_calls=None skips, [] asserts "no tools called", and [...] checks the trace contains the named calls in order. The MCP tool wraps this — pass the trace JSON via eval_ingest_trace first to normalize it from LangGraph / OpenAI Agents SDK / manual shapes. See the multivon-eval agent integrations for the source-of-truth tracer code.
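The three-shape semantics can be sketched in a few lines. This is an illustration, not the multivon-eval implementation: check_tool_calls is a hypothetical name, tool calls are compared by name only, and "in order" is read as an ordered subsequence (extra calls in between are allowed), which is an assumption.

```python
def check_tool_calls(actual, expected):
    """Return None when the check is skipped, otherwise True or False.

    actual:   list of tool names the agent called, in call order
    expected: None (skip), [] (assert no tools), or a list of names
    """
    if expected is None:
        return None                  # no expectation recorded: skip the check
    if expected == []:
        return len(actual) == 0      # assert that no tools were called
    remaining = iter(actual)
    # `name in remaining` consumes the iterator up to the first match, so each
    # expected name must appear after the previous one was found.
    return all(name in remaining for name in expected)
```

The distinction between None and [] is the part that is easy to get wrong: an empty list is a real assertion, not a missing one.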

Flexible scoring

Tool	What it does	API key
`eval_g_eval`	G-Eval holistic 0.0-1.0 scoring against a plain-English criterion.	Yes
`eval_custom_rubric`	Score against your own list of yes/no quality checks.	Yes

Agent workflows (new in 0.3.0)

Tool	What it does	API key
`eval_compare_runs`	Diff two eval report JSONs — pass-rate delta, per-case regressions/improvements, McNemar p-value. Use after every fix to confirm it actually helped.	No
`eval_generate_cases`	Generate N eval cases (input / expected_output / context) from a chunk of source text. Eliminates the cold-start when building a new suite.	Yes (judge)
`eval_ingest_trace`	Convert a JSON agent trace (LangGraph / OpenAI Agents / manual) into an EvalCase payload. Use to score trajectories your agent just executed.	No
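For intuition about the McNemar p-value that eval_compare_runs reports: the test looks only at cases whose pass/fail outcome changed between the two runs. The sketch below uses the exact binomial form. Whether multivon-eval uses the exact or the chi-square variant is not stated here, and mcnemar_exact is a hypothetical name.

```python
from math import comb


def mcnemar_exact(baseline, candidate):
    """Two-sided exact McNemar p-value for paired pass/fail results.

    baseline, candidate: equal-length sequences of booleans, one per case,
    in the same case order.
    """
    regressions = sum(1 for b, c in zip(baseline, candidate) if b and not c)
    improvements = sum(1 for b, c in zip(baseline, candidate) if c and not b)
    n = regressions + improvements   # only discordant pairs carry information
    if n == 0:
        return 1.0                   # nothing changed between the runs
    k = min(regressions, improvements)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With six improvements and no regressions the p-value is 2 × (1/64) ≈ 0.031; with one improvement and one regression it is 1.0, so a pass-rate delta alone does not tell you whether a fix helped.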

Example session

User: I just shipped a RAG endpoint. Can you check it for hallucinations?

Claude: I'll use multivon to evaluate it.
        [calls eval_discover to see what's available]
        [calls eval_faithfulness with your input/context/output]

→ score: 0.667 (passed: False), threshold: 0.9
  reason: 2/3 claims grounded
    ✓ "annual renewal" — supported by context
    ✓ "30-day notice" — supported by context
    ✗ "automatic upgrade" — NOT in context

Claude: Your RAG hallucinated the "automatic upgrade" detail. The context
        doesn't mention upgrades. I'd add a Hallucination evaluator to your CI
        gate, threshold ≥0.85, and re-prompt with explicit "only use facts
        from context" instructions.
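The numbers in this session are consistent with a score equal to the fraction of claims judged grounded, compared against a threshold. A minimal sketch under that assumption follows. It is not the evaluator's actual grading code, grade_claims is a hypothetical name, and returning 0.0 for an empty claim set is an arbitrary choice.

```python
def grade_claims(verdicts, threshold=0.9):
    """Score a set of per-claim verdicts as the fraction that are grounded.

    verdicts: mapping of claim text -> True if supported by the context.
    """
    grounded = sum(1 for ok in verdicts.values() if ok)
    score = grounded / len(verdicts) if verdicts else 0.0
    return {
        "score": round(score, 3),
        "passed": score >= threshold,
        "reason": f"{grounded}/{len(verdicts)} claims grounded",
    }


result = grade_claims({
    "annual renewal": True,
    "30-day notice": True,
    "automatic upgrade": False,
})
print(result)
```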

Why these 22 tools (not all 44)

eval_discover returns the full 44-evaluator catalog, so the agent can always introspect everything. The 22 tools we expose directly are the ones agents actually call mid-edit:

RAG generation checks (faithfulness, hallucination, relevance, answer_accuracy)
RAG retrieval checks (context_precision, context_recall)
Safety / fairness guardrails (toxicity, bias)
Compliance (pii_detection, schema_compliance) — local-only, no API egress
Flexible scoring (g_eval, custom_rubric) for user-defined rubrics
Multimodal (vqa_faithfulness, document_grounding) for vision agents
Agent traces (tool_call_accuracy)
Document AI (pdfhell_run, pdfhell_make) — for any RAG-on-PDFs flow
Audit pack — when procurement is involved
Discover — meta-capability for planning
Agent workflows (compare_runs, generate_cases, ingest_trace) — the loop that turns one-shot scoring into iterative improvement

The three new 0.3.0 tools matter because evals are most useful as a loop, not a single call: generate a starting suite from your own docs (eval_generate_cases), run your agent over it, score the trace (eval_ingest_trace → eval_*), make a fix, then verify the fix improved things vs. the baseline (eval_compare_runs). Agents need that whole loop callable from within a conversation — otherwise they fall back to ad-hoc judgment.

Exposing all 44 evaluators as MCP tools would bloat the agent's context window and overwhelm tool-selection. If you need an evaluator that's not directly exposed, the agent can still use multivon-eval as a library — eval_discover returns the import paths.

Dependencies

Hard requirements (minimum versions, from pyproject.toml):

mcp[cli] >= 1.0 — official MCP Python SDK + the mcp dev inspector
multivon-eval >= 0.9.4 — the evaluator surface this wraps
pdfhell >= 0.1.0 — the adversarial-PDF benchmark this wraps

Recommended (effective floor for full feature parity):

multivon-eval >= 0.9.8 — pulls in the corrected calibrated-threshold logic from the 0.9.7 hotfix (which affects what eval_discover reports and any tool that surfaces benchmark numbers in its docstring), plus the bundled Claude Code skills + multivon-eval install-skills CLI from 0.9.8.
pdfhell >= 0.5.4 — pulls in the mini-v4 17-trap suite and the pdfhell.research autoresearch loop. The pdfhell_run --suite mini-v4 tool path assumes these are present.

The pyproject pins are kept loose so existing deployments don't break; pin the recommended floors yourself if you care about the corrected benchmark numbers or the new suites.

All Apache 2.0.

MCP server vs Claude Code skills vs eval-action — which one do I use?

multivon-eval ships three agent-facing surfaces. They overlap on what they call (the same evaluator catalog) but differ on where the agent lives.

Surface	Where the agent runs	Best for
multivon-mcp (this repo)	Any MCP-compatible client — Claude Desktop, Cursor, Cline, OpenCode, Claude Code	Mid-edit scoring inside an IDE or chat app. Agent calls `eval_faithfulness` / `eval_hallucination` / etc. directly as tools.
Claude Code skills — `eval-bootstrap`, `eval-audit`, `eval-explain` (bundled in `multivon-eval >= 0.9.8`; install with `multivon-eval install-skills`)	Claude Code only	Workflow-shaped tasks: scaffold an eval suite from a project description, pre-PR regression checks against a baseline, explaining why a particular evaluator was picked. The skills know how to call `multivon-eval bootstrap` / use `compare_reports` / etc. so the agent doesn't have to figure it out from docs.
eval-action	GitHub CI	Gate every PR on eval regressions automatically. Posts the Wilson-CI + McNemar verdict as a PR comment.

If you're building an LLM product and want the agent in your editor to score a RAG output without copy-pasting Python, use multivon-mcp. If you live in Claude Code and want the bootstrap → audit → explain loop wired up as native commands, use the bundled skills. If you want PR-time gating, use the GitHub Action. The three are complementary — most projects end up using all three.
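The Wilson CI mentioned in the eval-action row is a confidence interval for a pass rate that behaves sensibly on small suites and near 0% or 100%. A minimal sketch of the standard formula follows. The function name and the z = 1.96 default (roughly 95%) are assumptions, and this is not taken from eval-action's source.

```python
from math import sqrt


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate of `passes` out of `total` cases."""
    if total == 0:
        return (0.0, 1.0)            # no data: the rate could be anything
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


low, high = wilson_interval(8, 10)
print(f"{low:.3f} to {high:.3f}")
```

For 8 passes out of 10 the interval is roughly 0.49 to 0.94, which is a useful reminder of how little a ten-case suite can distinguish.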

The Multivon ecosystem

Five public + one early-access package, all built on a shared evaluation engine:

Repo	What it is
multivon-eval	Python SDK — 44 evaluators + `bootstrap` CLI + `multivon_eval.auto`. The engine multivon-mcp wraps.
pdfhell	Adversarial PDFs that break AI document readers — exposed here as `pdfhell_run` + `pdfhell_make` tools
multivon-mcp (you are here)	MCP server — 22 tools from multivon-eval + pdfhell
eval-action	GitHub Action — runs the same evals on every PR
eval-framework-benchmark	Reproducible head-to-head benchmark vs DeepEval + RAGAS
multivon-guard (early access)	Local proxy that catches LLM coding agents leaking secrets / PII

License

Apache 2.0.

Citing

@software{multivon_mcp,
  title  = {multivon-mcp: MCP server exposing multivon-eval + pdfhell as agent-callable tools},
  author = {Multivon},
  year   = {2026},
  url    = {https://github.com/multivon-ai/multivon-mcp},
}

Release history

Version	Released
0.3.2 (this version)	Jun 12, 2026
0.3.1	Jun 2, 2026
0.3.0	May 17, 2026
0.2.1	May 17, 2026
0.2.0	May 17, 2026
0.1.0	May 17, 2026

Download files

Source Distribution

multivon_mcp-0.3.2.tar.gz (34.9 kB), uploaded Jun 12, 2026, Source

Built Distribution

multivon_mcp-0.3.2-py3-none-any.whl (35.3 kB), uploaded Jun 12, 2026, Python 3

File details

Details for the file multivon_mcp-0.3.2.tar.gz.

File metadata

Download URL: multivon_mcp-0.3.2.tar.gz
Upload date: Jun 12, 2026
Size: 34.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for multivon_mcp-0.3.2.tar.gz
Algorithm	Hash digest
SHA256	`9785adab83cc1eed620e9a8c80f0060cada321938d50a61d484c426cfb4af3dd`
MD5	`c5dc1ae82560395b8a19b4e1818e4b2c`
BLAKE2b-256	`e5e1a2ecfe2808e107f98401822aa03ea4d812d1322048031dd5fd9dc7618a0d`


File details

Details for the file multivon_mcp-0.3.2-py3-none-any.whl.

File metadata

Download URL: multivon_mcp-0.3.2-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 35.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for multivon_mcp-0.3.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`202735ffbc6c057dcecf08919fd8b73f176732d3e8e3b3eca6d09e983e7c3fdd`
MD5	`6d08e2358fd968c039a34d9538bf95d6`
BLAKE2b-256	`3a3702512ed67bcc8045845be206457cfab825b2be645037f356f52010b4c600`
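The published SHA256 digests above can be checked against a downloaded file with the standard library. A minimal sketch follows; the function names are hypothetical and the file path is whatever you downloaded.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Stream the file at `path` through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """True when the file's SHA-256 equals the digest listed on the index page."""
    return sha256_of(path) == published_hex.strip().lower()
```

For example, `matches_published("multivon_mcp-0.3.2.tar.gz", "9785adab...")` with the full digest from the table should return True for an untampered download.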

