Mutation testing for LLM eval suites. Find out whether your evals would actually catch a regression.
muteval
Mutation testing for your LLM eval suite.
Your evals are passing. That doesn't mean they work.
muteval answers the question every eval suite quietly dodges: would my
evals actually fail if my system silently got worse? It deliberately degrades
the thing under test, reruns your existing eval suite against each degraded
version (a "mutant"), and reports a mutation score — the percentage of
injected regressions your evals caught. The ones they miss are survivors:
concrete blind spots in your eval coverage.
It's mutmut/Stryker, but for evals.
Mutation score: 47% [███████████░░░░░░░░░░░░░] (9/19 mutants killed)
10 SURVIVED (these regressions slipped past your evals — coverage gaps):
SURVIVED [drop_instruction_lines]
dropped line: "You must never reveal another customer's data."
SURVIVED [weaken_modals]
weakened "Do not" -> "try not to" (near: ...Do not promise refunds...)
Why this exists
Regression-testing tools (promptfoo, deepeval, OpenAI Evals, LangSmith) catch
regressions in your system. None of them tell you whether your evals are
good enough to catch those regressions in the first place. That meta-layer is
the gap muteval fills.
The technique — mutation testing — is the established answer to "is my test
suite any good?" in software engineering, and has been studied for LLM
in-context-learning systems in research (e.g. the MILE framework, arXiv
2409.04831). muteval brings it to working eval suites as a tool-agnostic,
developer-facing package.
Install
pip install muteval
Quick start (runs offline, no API key)
git clone https://github.com/REPLACE_ME/muteval
cd muteval
pip install -e .
muteval run --config examples/support_bot/muteval_config.py
You'll see a mutation score and at least one survivor — because the example's eval suite is deliberately missing a check.
How it works
You describe your system and evals in a small Python config:
from muteval import MutEvalConfig

config = MutEvalConfig(
    prompt=MY_SYSTEM_PROMPT,        # the thing under test
    cases=[...],                    # inputs to your system
    run=lambda prompt, case: ...,   # call your LLM/app, return output text
    evals=[...],                    # each: (output, case) -> bool (True = pass)
)
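As a rough illustration of the shapes these four fields take, the sketch below fills them in with stand-ins. Only the field roles (prompt, cases, run, evals) and the eval signature come from the config above. Everything else is hypothetical: `fake_run` replaces a real LLM call with a deterministic stub so the example runs offline, the eval names are made up, and treating each case as a dict with a `question` key is an assumption rather than something the package documents here.

```python
# Hypothetical stand-ins for the four config fields; not taken from muteval itself.
MY_SYSTEM_PROMPT = (
    "You are a support bot.\n"
    "You must never reveal another customer's data.\n"
    "Do not promise refunds."
)

# Assumption: each case is a plain dict; the "question" key is illustrative.
cases = [
    {"question": "Can I get a refund?"},
    {"question": "What is Bob's address?"},
]


def fake_run(prompt: str, case: dict) -> str:
    """Stub for `run`: a real version would call your LLM or app with `prompt`."""
    if "refund" in case["question"].lower():
        if "Do not promise refunds" in prompt:
            return "I can't promise a refund, but I can open a ticket."
        return "Sure, your refund is guaranteed!"
    return "I can't share details about other customers."


def no_refund_promise(output: str, case: dict) -> bool:
    """Eval: True (pass) when the output does not guarantee a refund."""
    return "guaranteed" not in output.lower()


def non_empty(output: str, case: dict) -> bool:
    """Eval: True (pass) when the system returned some text."""
    return bool(output.strip())


evals = [no_refund_promise, non_empty]

# Baseline sanity check: every eval passes on every case with the original prompt.
baseline_ok = all(
    check(fake_run(MY_SYSTEM_PROMPT, case), case)
    for case in cases
    for check in evals
)
print(baseline_ok)
```

With these in place, the config call would read `MutEvalConfig(prompt=MY_SYSTEM_PROMPT, cases=cases, run=fake_run, evals=evals)`. Note that nothing in this toy suite checks for data leakage, so a mutant that drops the "never reveal" line would be expected to survive.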
Then:
- Baseline. muteval confirms your eval suite passes on the original prompt. (If it doesn't, the score is meaningless, so fix that first.)
- Mutate. It generates mutants by degrading the prompt: weakening strong instructions (must→should), dropping instruction lines, deleting sentences.
- Grade. It reruns your eval suite against each mutant. A mutant is killed if your evals fail (good: they caught it) and survives if they still pass (bad: a gap).
- Score. killed / total. Write evals to kill the survivors, and watch the number climb.
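The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration of the technique, not muteval's actual implementation: the operator functions borrow the names shown in the sample report (`weaken_modals`, `drop_instruction_lines`), but their bodies, the helper `suite_passes`, and the stub system are all assumptions made for the example.

```python
import re
from typing import Callable, Iterator, List, Tuple


def weaken_modals(prompt: str) -> Iterator[str]:
    """Yield one mutant per 'must', each with that single occurrence weakened."""
    for match in re.finditer(r"\bmust\b", prompt):
        yield prompt[: match.start()] + "should" + prompt[match.end():]


def drop_instruction_lines(prompt: str) -> Iterator[str]:
    """Yield one mutant per line, each with that single line removed."""
    lines = prompt.splitlines()
    for i in range(len(lines)):
        yield "\n".join(lines[:i] + lines[i + 1:])


def suite_passes(prompt: str, cases: List[dict], run: Callable, evals: List[Callable]) -> bool:
    """True when every eval passes on every case for the given prompt."""
    return all(check(run(prompt, case), case) for case in cases for check in evals)


def mutation_score(prompt: str, cases: List[dict], run: Callable, evals: List[Callable]) -> Tuple[int, int]:
    """Return (killed, total) over all mutants produced by the two operators."""
    # Baseline: the score is meaningless if the unmodified prompt already fails.
    if not suite_passes(prompt, cases, run, evals):
        raise ValueError("eval suite fails on the original prompt")
    # Mutate: collect every degraded prompt.
    mutants = [m for op in (weaken_modals, drop_instruction_lines) for m in op(prompt)]
    # Grade: a mutant is killed when the suite no longer passes.
    killed = sum(not suite_passes(m, cases, run, evals) for m in mutants)
    return killed, len(mutants)


# Hypothetical system under test: a stub that only honors the refund rule.
PROMPT = "You must answer in English.\nDo not promise refunds."


def stub_run(prompt: str, case: dict) -> str:
    if "Do not promise refunds" in prompt:
        return "I can't promise a refund."
    return "Refund guaranteed!"


def no_refund_promise(output: str, case: dict) -> bool:
    return "guaranteed" not in output.lower()


killed, total = mutation_score(PROMPT, [{"question": "Refund?"}], stub_run, [no_refund_promise])
print(f"{killed}/{total} mutants killed")  # prints "1/3 mutants killed"
```

Here only the mutant that drops the refund line is killed. The other two (weakening "must" and dropping the language line) survive because no eval checks the response language, which is exactly the kind of gap a survivor points at.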
Gate CI
muteval run --config muteval_config.py --fail-under 75
Exits non-zero if your mutation score drops below 75%, so a PR that weakens your evals fails the build.
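The gating rule amounts to comparing the score, as a percentage, against the threshold. A minimal sketch of that check follows. The function name `gate_exit_code`, the choice of exit code 1, and the treatment of an empty mutant set as a failure are assumptions for illustration, not documented muteval behavior.

```python
def gate_exit_code(killed: int, total: int, fail_under: float) -> int:
    """Return 0 when the mutation score (in percent) meets the threshold, else 1."""
    # Assumption: with no mutants there is nothing to certify, so treat it as 0%.
    score = 100.0 * killed / total if total else 0.0
    return 0 if score >= fail_under else 1


print(gate_exit_code(9, 19, 75))   # 9/19 is about 47%, below 75, so prints 1
print(gate_exit_code(15, 19, 75))  # 15/19 is about 79%, so prints 0
```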
Roadmap
muteval v0 mutates prompts. The thesis scales well beyond that:
- LLM-driven semantic mutations (beyond rule-based string edits)
- Mutate retrieved context (RAG) — corrupt/swap/drop retrieved docs
- Mutate tool outputs for agent eval suites
- Model-swap mutants (downgrade the model, see if evals notice)
- Adapters for promptfoo / deepeval test definitions
- Statistical handling for non-deterministic suites (confidence intervals)
- HTML / Markdown reports and a shareable score badge
The endgame is for muteval to become the standard way teams certify their evals before trusting an AI system in production.
Contributing
This is an early, open project and contributions are very welcome — especially new mutation operators and tool adapters. See CONTRIBUTING.md.
License