Mutation testing for LLM eval suites. Find out whether your evals would actually catch a regression.

muteval

Mutation testing for your LLM eval suite.

Your evals are passing. That doesn't mean they work.

muteval answers the question every eval suite quietly dodges: would my evals actually fail if my system silently got worse? It deliberately degrades the thing under test, reruns your existing eval suite against each degraded version (a "mutant"), and reports a mutation score — the percentage of injected regressions your evals caught. The ones they miss are survivors: concrete blind spots in your eval coverage.

It's mutmut/Stryker, but for evals.

Mutation score: 47%  [███████████░░░░░░░░░░░░░]  (9/19 mutants killed)

10 SURVIVED  (these regressions slipped past your evals — coverage gaps):

  SURVIVED  [drop_instruction_lines]
            dropped line: "You must never reveal another customer's data."
  SURVIVED  [weaken_modals]
            weakened "Do not" -> "try not to" (near: ...Do not promise refunds...)
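
As a rough illustration of how the headline number in a report like the one above can be derived, here is a minimal sketch. The function names and the 24-character bar width are assumptions for illustration, not muteval's actual internals.

```python
def mutation_score(killed: int, total: int) -> float:
    """Return the percentage of mutants that the eval suite killed."""
    if total <= 0:
        raise ValueError("need at least one mutant to compute a score")
    return 100.0 * killed / total


def render_bar(killed: int, total: int, width: int = 24) -> str:
    """Draw a fixed-width bar; `width` is an assumed value, not a muteval setting."""
    filled = round(width * killed / total)
    return "█" * filled + "░" * (width - filled)


def summary_line(killed: int, total: int) -> str:
    """Format a one-line summary in the style of the sample report."""
    score = mutation_score(killed, total)
    bar = render_bar(killed, total)
    return f"Mutation score: {score:.0f}%  [{bar}]  ({killed}/{total} mutants killed)"


print(summary_line(9, 19))
```

With 9 of 19 mutants killed, the score is 9 / 19, which rounds to 47%, and the remaining 10 mutants are the survivors.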

Why this exists

Regression-testing tools (promptfoo, deepeval, OpenAI Evals, LangSmith) catch regressions in your system. None of them tell you whether your evals are good enough to catch those regressions in the first place. That meta-layer is the gap muteval fills.

The technique — mutation testing — is the established answer to "is my test suite any good?" in software engineering, and has been studied for LLM in-context-learning systems in research (e.g. the MILE framework, arXiv 2409.04831). muteval brings it to working eval suites as a tool-agnostic, developer-facing package.

Install

pip install muteval

Quick start (runs offline, no API key)

git clone https://github.com/REPLACE_ME/muteval
cd muteval
pip install -e .
muteval run --config examples/support_bot/muteval_config.py

You'll see a mutation score and at least one survivor — because the example's eval suite is deliberately missing a check.

How it works

You describe your system and evals in a small Python config:

from muteval import MutEvalConfig

config = MutEvalConfig(
    prompt=MY_SYSTEM_PROMPT,        # the thing under test
    cases=[...],                    # inputs to your system
    run=lambda prompt, case: ...,   # call your LLM/app, return output text
    evals=[...],                    # each: (output, case) -> bool  (True = pass)
)

Then:

Baseline. muteval confirms your eval suite passes on the original prompt. (If it doesn't, the score is meaningless — fix that first.)
Mutate. It generates mutants by degrading the prompt — weakening strong instructions (must → should), dropping instruction lines, deleting sentences.
Grade. It reruns your eval suite against each mutant. A mutant is killed if your evals fail (good — they caught it) and survives if they still pass (bad — a gap).
Score. killed / total. Write evals to kill the survivors, and watch the number climb.
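
The four steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not muteval's implementation: the operator names are borrowed from the sample report, but their bodies, the `mutation_run` helper, the toy `fake_bot`, and the dict-shaped cases are all hypothetical.

```python
import re
from typing import Callable, Dict, Iterator, List, Tuple

Case = Dict[str, str]                  # assumed case shape, for illustration
RunFn = Callable[[str, Case], str]     # (prompt, case) -> output text
EvalFn = Callable[[str, Case], bool]   # (output, case) -> True if the eval passes


def weaken_modals(prompt: str) -> Iterator[Tuple[str, str]]:
    """Yield (description, mutant) pairs, weakening one strong phrase at a time."""
    for strong, weak in (("must", "should"), ("Do not", "try not to")):
        for m in re.finditer(rf"\b{re.escape(strong)}\b", prompt):
            mutant = prompt[:m.start()] + weak + prompt[m.end():]
            yield f'weakened "{strong}" -> "{weak}"', mutant


def drop_instruction_lines(prompt: str) -> Iterator[Tuple[str, str]]:
    """Yield one mutant per non-empty line, with that line removed."""
    lines = prompt.splitlines()
    for i, line in enumerate(lines):
        if line.strip():
            yield f'dropped line: "{line.strip()}"', "\n".join(lines[:i] + lines[i + 1:])


def suite_passes(prompt: str, cases: List[Case], run: RunFn, evals: List[EvalFn]) -> bool:
    """Run the system once per case and require every eval to pass."""
    for case in cases:
        output = run(prompt, case)
        if not all(ev(output, case) for ev in evals):
            return False
    return True


def mutation_run(prompt: str, cases: List[Case], run: RunFn, evals: List[EvalFn]):
    # Baseline: a suite that fails on the original prompt makes the score meaningless.
    if not suite_passes(prompt, cases, run, evals):
        raise RuntimeError("baseline fails; fix the eval suite before scoring")
    killed, survivors = 0, []
    for operator in (weaken_modals, drop_instruction_lines):    # Mutate
        for description, mutant in operator(prompt):
            if suite_passes(mutant, cases, run, evals):         # Grade
                survivors.append((operator.__name__, description))
            else:
                killed += 1
    return killed, killed + len(survivors), survivors           # Score inputs


# Toy usage: a fake offline "bot" that only refuses refunds when told firmly.
PROMPT = "You must never reveal another customer's data.\nDo not promise refunds."


def fake_bot(prompt: str, case: Case) -> str:
    return "No refunds." if "Do not promise refunds" in prompt else "Refund granted!"


def no_refund(output: str, case: Case) -> bool:
    return "granted" not in output


killed, total, survivors = mutation_run(PROMPT, [{"q": "refund?"}], fake_bot, [no_refund])
print(f"{killed}/{total} killed")
for name, description in survivors:
    print("SURVIVED", f"[{name}]", description)
```

In this toy setup the suite only checks refund behaviour, so both mutants that touch the data-privacy line survive. That is the kind of gap a survivor is meant to expose: an instruction in the prompt that no eval exercises.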

Gate CI

muteval run --config muteval_config.py --fail-under 75

Exits non-zero if your mutation score falls below 75%, so a PR that weakens your evals fails the build.
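
The gating rule can be pictured as a simple threshold check. The sketch below is an assumption about the logic, not muteval's code: `gate_exit_code` is a hypothetical helper, and treating a score exactly equal to the threshold as passing is a guess based on the wording "below 75%".

```python
def gate_exit_code(killed: int, total: int, fail_under: float) -> int:
    """Return a process exit code: 0 if the score meets the threshold, 1 otherwise."""
    if total <= 0:
        raise ValueError("need at least one mutant to compute a score")
    score = 100.0 * killed / total
    # Assumption: only scores strictly below the threshold fail the build.
    return 1 if score < fail_under else 0


print(gate_exit_code(9, 19, 75))   # score is about 47%, so the gate fails
print(gate_exit_code(15, 20, 75))  # score is exactly 75%, so the gate passes here
```

A CI job would pass this value to `sys.exit`, which is what makes the pipeline step fail.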

Roadmap

muteval v0 mutates prompts. The thesis scales well beyond that:

LLM-driven semantic mutations (beyond rule-based string edits)
Mutate retrieved context (RAG) — corrupt/swap/drop retrieved docs
Mutate tool outputs for agent eval suites
Model-swap mutants (downgrade the model, see if evals notice)
Adapters for promptfoo / deepeval test definitions
Statistical handling for non-deterministic suites (confidence intervals)
HTML / Markdown reports and a shareable score badge

The end goal is for muteval to become the standard way teams certify their evals before trusting an AI system in production.

Contributing

This is an early, open project and contributions are very welcome — especially new mutation operators and tool adapters. See CONTRIBUTING.md.

License

Apache-2.0.

Project details

This version: 0.0.1, released Jun 15, 2026
