Production-grade mutation testing for Python.
Mutagen
LLM-assisted test generation, validated by mutation testing.
Mutagen ingests a Python repository, finds under-covered functions, generates pytest tests for them with an LLM, and keeps only the tests that actually kill mutants of the target. It is built on Clean Architecture with a strict dependency rule, two explicit state machines, full async I/O, SQLite-backed resume, and structured logging.
repo ──► ingest ──► select targets ──► generate tests ──► run ──► mutate ──► keep / discard ──► report
                                       ▲                               │
                                       └── repair / strengthen loops ──┘
What it does
For each selected target the pipeline:
- Generates a pytest module from the function's source, imports, and surrounding context — matching your project's existing test style.
- Runs it in an isolated subprocess sandbox (timeout + resource limits, flakiness detection via a double-run).
- Mutation-gates it with mutmut: if the tests don't kill enough mutants, the surviving mutants become feedback for a regeneration attempt (the strengthening loop). If the tests fail to run, the failure output drives a repair loop.
- Keeps or discards the tests based on the mutation-score threshold, and persists the outcome immediately so an interrupted run resumes cleanly.
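The double-run flakiness check in the sandbox step can be sketched as follows. This is a minimal illustration, not Mutagen's actual runner: `SandboxResult` and `run_twice` are hypothetical names, resource limits are left out, and a timeout is simply treated as a failed run.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxResult:
    passed: bool  # both runs exited with code 0
    flaky: bool   # the two runs disagreed with each other
    output: str   # combined stdout/stderr of the last run


def run_twice(cmd: list[str], timeout_s: float = 60.0) -> SandboxResult:
    """Run `cmd` twice in a child process and flag disagreement as flaky."""
    codes = []
    output = ""
    for _ in range(2):
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
            codes.append(proc.returncode)
            output = proc.stdout + proc.stderr
        except subprocess.TimeoutExpired:
            codes.append(None)  # assumption: a timeout counts as a failure
            output = f"timed out after {timeout_s}s"
    return SandboxResult(
        passed=all(code == 0 for code in codes),
        flaky=codes[0] != codes[1],
        output=output,
    )
```

In real use `cmd` would be something like `[sys.executable, "-m", "pytest", test_path]`; any command works for trying the sketch out.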
Context enrichment (optional)
The generation step can fold in two extra signals, both off by default and
both configured under [generation]:
- Semantic code understanding (call graph). An AST-based CallGraphAnalyzer builds a repo-wide call graph and extracts each target's execution path (its transitive callees), so the model writes tests that exercise the whole tree end-to-end rather than just the entry function:

  process_order
  ├── validate_order
  ├── calculate_tax
  └── save_order

  The rendered tree and the callee sources are added to the prompt. The analyzer resolves only unambiguous in-repo calls (plain, self/cls methods, imported names) and omits anything it can't pin down, so the graph contains no misleading edges.
- Retrieval-augmented generation (RAG). Instead of seeding the prompt with the first couple of test files, an EmbeddingTestRetriever indexes the project's existing tests (one chunk per test_* function) and retrieves the ones most similar to the target by embedding similarity:

  target function ─► vector search ─► relevant existing tests ─► prompt

  The default HashingEmbeddingProvider is dependency-free and deterministic (no model download, no API key); a real embedding model can drop in behind the same port. Retrieved examples make generated tests far more consistent with the conventions of genuinely related code.
Architecture
Mutagen follows a strict dependency rule: dependencies point inward,
toward the domain. The domain (core) knows nothing about infrastructure; the
composition root (config/container.py) is the only place concrete adapters
are imported.
┌─────────────────────────────────────────────────────────────────────┐
│ cli/ mutagen run <repo> · Rich progress UI · dashboard │
├─────────────────────────────────────────────────────────────────────┤
│ services/ orchestrator · target_processor · budget · reporting │
│ (depend only on core.interfaces — the ports) │
├─────────────────────────────────────────────────────────────────────┤
│ core/ models (frozen dataclasses) · interfaces (ports) │
│ exceptions · state_machine (run + target FSMs) │
├─────────────────────────────────────────────────────────────────────┤
│ infrastructure/ ingest · selection · generation · llm · sandbox │
│ reporting/ gate · store (SQLite) · md/json/terminal reporters │
│ (implement the ports; only layer that does real I/O) │
└─────────────────────────────────────────────────────────────────────┘
        ▲                                                     │
        └────────── config/container.py wires it all ─────────┘
Ports → adapters
Every infrastructure concern is an abstract port in core/interfaces/, with a
concrete adapter in infrastructure/:
| Port | Adapter | Role |
|---|---|---|
| RepoIngestor | ingest/FilesystemRepoIngestor | Clone/copy repo → isolated workspace, venv, deps |
| TargetSelector | selection/AstTargetSelector | Coverage-guided, AST-based target ranking |
| CallGraphAnalyzer | selection/AstCallGraphAnalyzer | Build a repo call graph → a target's execution path |
| TestGenerator | generation/LLMTestGenerator | Gather context → prompt → validate generated tests |
| EmbeddingProvider | retrieval/HashingEmbeddingProvider | Embed text into vectors (dependency-free default) |
| TestRetriever | retrieval/EmbeddingTestRetriever | Index existing tests → retrieve the most similar ones |
| LLMClient | llm/AnthropicLLMClient | Anthropic API (retries, backoff, cost tracking) |
| SandboxRunner | sandbox/SubprocessSandboxRunner | Run pytest isolated (timeout, rlimits, flakiness) |
| MutationGate | gate/MutmutMutationGate | Drive mutmut, score, survivor feedback, keep/discard |
| Store | store/SqliteStore | Persist final runs + artifacts |
| CheckpointStore | store/SqliteCheckpointStore | Per-target progress for resume |
| Reporter | reporting/{Markdown,Json,Terminal,Composite} | report.md + report.json + dashboard |
Two state machines
RUN lifecycle (RunStateMachine)
PENDING → INITIALIZING → INGESTING → SELECTING_TARGETS
→ GENERATING_TESTS → GATING → REPORTING → COMPLETED
(any active state → FAILED / CANCELLED)
TARGET lifecycle (TargetStateMachine), one per target
SELECTED → GENERATED → RAN → MUTATED → KEPT
(any active state → DISCARDED)
Both are data-driven tables that reject illegal transitions rather than silently proceeding.
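A data-driven table of this kind might look like the following for the target lifecycle. The state names are taken from the diagram above; `TRANSITIONS`, `IllegalTransition`, and `TargetFsmSketch` are hypothetical names for this sketch and may not match the real TargetStateMachine.

```python
from enum import Enum


class TargetState(Enum):
    SELECTED = "selected"
    GENERATED = "generated"
    RAN = "ran"
    MUTATED = "mutated"
    KEPT = "kept"
    DISCARDED = "discarded"


# Happy path plus "any active state -> DISCARDED"; terminal states have no exits.
TRANSITIONS = {
    TargetState.SELECTED: frozenset({TargetState.GENERATED, TargetState.DISCARDED}),
    TargetState.GENERATED: frozenset({TargetState.RAN, TargetState.DISCARDED}),
    TargetState.RAN: frozenset({TargetState.MUTATED, TargetState.DISCARDED}),
    TargetState.MUTATED: frozenset({TargetState.KEPT, TargetState.DISCARDED}),
    TargetState.KEPT: frozenset(),
    TargetState.DISCARDED: frozenset(),
}


class IllegalTransition(Exception):
    """Raised instead of silently moving to a state the table does not allow."""


class TargetFsmSketch:
    def __init__(self) -> None:
        self.state = TargetState.SELECTED

    def advance(self, new_state: TargetState) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise IllegalTransition(f"{self.state.name} -> {new_state.name}")
        self.state = new_state
```

Keeping the rules in a table means adding a state is a data change, and the single `advance` check covers every transition.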
Orchestration loop
for each selected target (skipping ones already done on a prior run):
    if budget/cost exhausted: stop cleanly → PARTIAL result (resumable)
    ┌─ TargetProcessor ───────────────────────────────────────────┐
    │ generate ──► run ──► (repair loop on failure)               │
    │          └─► gate ──► (strengthen loop on surviving mutants)│
    │                   └─► KEPT (score ≥ threshold) or DISCARDED │
    └─────────────────────────────────────────────────────────────┘
    persist the target's checkpoint IMMEDIATELY (resume-safe)
finalize RunResult → summarize → write report.md + report.json → save run
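The per-target part of the loop above can be approximated in a few lines. This is a synchronous sketch with injected callables, not the real TargetProcessor: `GateResult`, `process_target`, and the three callable signatures are hypothetical, and treating "no mutants" as a score of 0.0 is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GateResult:
    killed: int
    total: int
    survivors: tuple = ()  # descriptions of surviving mutants

    @property
    def score(self) -> float:
        # Assumption: with no mutants at all, the score counts as 0.0.
        return self.killed / self.total if self.total else 0.0


def process_target(
    generate: Callable[[str], str],  # feedback -> test module source
    run: Callable[[str], tuple],     # tests -> (passed, output)
    gate: Callable[[str], GateResult],
    threshold: float = 0.8,
    max_repair_attempts: int = 2,
    max_strengthen_attempts: int = 2,
) -> str:
    feedback = ""
    repairs = strengthens = 0
    while True:
        tests = generate(feedback)
        passed, output = run(tests)
        if not passed:
            # Repair loop: feed the failure output back into generation.
            if repairs == max_repair_attempts:
                return "DISCARDED"
            repairs += 1
            feedback = f"tests failed:\n{output}"
            continue
        result = gate(tests)
        if result.score >= threshold:
            return "KEPT"
        # Strengthen loop: surviving mutants become the next prompt's feedback.
        if strengthens == max_strengthen_attempts:
            return "DISCARDED"
        strengthens += 1
        feedback = "surviving mutants:\n" + "\n".join(result.survivors)
```

Both loops are bounded by their attempt counters, so a target always ends in KEPT or DISCARDED after a finite number of generation calls.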
Project layout
mutagen/
├── cli/ # argparse CLI + Rich progress UI
├── config/ # RunConfig, TOML loader, logging, DI container
├── core/
│ ├── models/ # frozen domain dataclasses (RunResult, TargetOutcome, …)
│ ├── interfaces/ # abstract ports (ABCs)
│ ├── exceptions/ # MutagenError hierarchy
│ └── state_machine/ # run + target FSMs
├── services/ # orchestrator, target_processor, budget, reporting, progress
├── infrastructure/
│ ├── ingest/ selection/ generation/ llm/ sandbox/ gate/ store/
│ └── process.py # shared subprocess-safety helper
├── reporting/ # markdown / json / terminal / composite reporters
├── tests/ # 238 tests (unit + integration, mock-driven)
└── main.py # entrypoint
Setup
Requires Python 3.11+ and git (for ingesting remote repositories).
# Install with every integration (Anthropic + OpenAI SDKs, coverage, mutmut, …)
pip install "mutagen[all]" # or: pipx install "mutagen[all]"
Then provide an API key for whichever provider you use — via your shell or a
.env file in your project (loaded automatically):
export ANTHROPIC_API_KEY=sk-ant-... # Anthropic (Windows: $env:ANTHROPIC_API_KEY=...)
export OPENAI_API_KEY=sk-... # OpenAI
export GEMINI_API_KEY=... # Google Gemini
export OPENROUTER_API_KEY=sk-or-... # OpenRouter
# .env (kept out of source control; never committed)
OPENAI_API_KEY=sk-...
Verify everything at once:
mutagen doctor # checks Python, git, optional deps, and which provider key is set
Lighter installs are available via extras: pip install mutagen (CLI +
reporting only), then add [llm] (Anthropic), [openai] (OpenAI / Gemini /
OpenRouter), [sandbox], [mutation], or [coverage] as needed. mutagen doctor tells you exactly which extra to install for anything missing.
Usage
# Run against a local path or a git URL
mutagen run ./path/to/project
mutagen run https://github.com/org/repo
# With a config file and a score threshold
mutagen -c mutagen.toml run ./project --threshold 0.8
# Resume an interrupted run (reuse its id)
mutagen run ./project --run-id my-run-123
# Re-render the most recent run's report
mutagen report
# Diagnose the environment (Python, git, optional deps, provider key)
mutagen doctor
mutagen run exits 0 on success, 1 on a handled failure, and 2
when the achieved mutation score is below the configured threshold (useful as a
CI gate).
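A CI wrapper could map those exit codes to messages along these lines. The helper names `describe_exit` and `run_gate` are hypothetical; only the three exit codes come from the text above.

```python
import subprocess
import sys

# Exit codes as documented above; anything else is treated as unexpected.
EXIT_MEANINGS = {
    0: "success",
    1: "handled failure",
    2: "mutation score below threshold",
}


def describe_exit(code: int) -> str:
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_gate(cmd: list[str]) -> str:
    """Run a command and translate its exit code into a readable status."""
    return describe_exit(subprocess.run(cmd).returncode)
```

For example, `run_gate(["mutagen", "run", "./project", "--threshold", "0.8"])` would return "mutation score below threshold" when the run exits with 2, which lets a pipeline report a weak score separately from a crash.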
Live progress & dashboard
On a TTY the CLI shows a Rich progress bar and a per-phase status line, then a
summary table. In CI / piped output it falls back to plain line logging
automatically. Use --no-progress to force plain output.
Mutagen Run a1b2c3 [succeeded]
┌──────────────────────────────────┬──────────────┐
│ Mutation score (before -> after) │ n/a -> 84% │
│ Targets kept / discarded │ 12 / 3 │
│ Tests generated │ 15 │
│ API cost │ $0.4210 │
│ Execution time │ 182.4s │
└──────────────────────────────────┴──────────────┘
Reports
Every run writes two files under <storage.root>/reports/ (default
.mutagen/reports/):
- report.md: human-readable dashboard with mutation score before/after, kept vs. discarded targets, API cost, execution time, and a per-target table.
- report.json: the same data, machine-readable for CI and archival.
Both include: mutation score before/after, kept / discarded tests, API cost (USD + tokens + requests), execution time, and per-target statistics.
Note on "before": the after score is always measured. The before (baseline) score, meaning what the repo's pre-existing tests already kill, is wired through the model and rendered as n/a until a baseline gate pass is enabled; it is best-effort by design.
Configuration
Configuration is TOML, mirroring the config dataclass tree. See
mutagen.example.toml for the fully-annotated template.
CLI flags (--threshold) override file values. Highlights:
project_root = "."
score_threshold = 0.8
[llm]
model = "claude-opus-4-8"
effort = "high"
[orchestrator] # budget & cost ceilings (0 = unlimited)
max_targets = 50
max_cost_usd = 5.0
max_repair_attempts = 2
max_strengthen_attempts = 2
max_parallel_targets = 4 # process this many targets at once (1 = sequential)
[storage]
backend = "sqlite"
root = ".mutagen"
Hitting any budget/cost limit stops the run cleanly with a PARTIAL,
resumable result — the in-flight target finishes and everything completed is
already persisted.
Parallelism
Targets are independent — each runs in its own isolated sandbox and mutation
workspace — so the orchestrator processes up to max_parallel_targets of them
at once via a bounded worker pool (default 1, i.e. sequential). Budget and
cost limits are enforced with an atomic reservation before each target is
scheduled, so concurrency never overshoots max_targets; once a limit trips,
no new targets start but those already in flight finish cleanly. Per-target
checkpoints are still written immediately, so resume works identically whether
the run was sequential or parallel.
Because the dominant cost is CPU-bound (pytest + mutmut), the practical
sweet spot for max_parallel_targets is roughly the host's core count — higher
values mostly cause subprocess thrashing rather than further speedup.
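A bounded pool with a reservation step can be sketched with asyncio. The names `Budget`, `try_reserve`, and `run_all` are hypothetical, and only the max_targets limit is modelled; the reservation is atomic here simply because there is no `await` between the check and the increment on a single event loop.

```python
import asyncio


class Budget:
    """Hands out at most `max_targets` slots (0 = unlimited)."""

    def __init__(self, max_targets: int) -> None:
        self._max = max_targets
        self._reserved = 0

    def try_reserve(self) -> bool:
        # No await between check and increment, so no other task can
        # interleave here and the limit cannot be overshot.
        if self._max and self._reserved >= self._max:
            return False
        self._reserved += 1
        return True


async def run_all(targets, process, budget: Budget, max_parallel_targets: int = 1) -> list:
    semaphore = asyncio.Semaphore(max_parallel_targets)
    done = []

    async def worker(target):
        async with semaphore:  # at most max_parallel_targets run at once
            if not budget.try_reserve():
                return  # limit tripped: start nothing new
            done.append(await process(target))

    await asyncio.gather(*(worker(t) for t in targets))
    return done
```

Targets that already hold a reservation run to completion even after the limit trips, which mirrors the "in flight finish cleanly" behaviour described above.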
Persistence & resume
State lives in a single SQLite database at <storage.root>/mutagen.db:
- runs: final RunResult records (JSON payload).
- run_checkpoints / target_checkpoints: per-target progress, upserted the moment each target finishes.
Re-running with the same --run-id loads the checkpoint, skips targets
already in a terminal state, and carries their outcomes forward.
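The upsert-then-skip behaviour can be sketched with the standard sqlite3 module. The table name target_checkpoints comes from the list above, but the column layout and the helpers `open_store`, `save_checkpoint`, and `finished_targets` are assumptions for illustration, not the real schema.

```python
import json
import sqlite3

TERMINAL = {"KEPT", "DISCARDED"}


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS target_checkpoints (
               run_id    TEXT NOT NULL,
               target_id TEXT NOT NULL,
               state     TEXT NOT NULL,
               payload   TEXT NOT NULL,
               PRIMARY KEY (run_id, target_id))"""
    )
    return conn


def save_checkpoint(conn, run_id: str, target_id: str, state: str, payload: dict) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """INSERT INTO target_checkpoints (run_id, target_id, state, payload)
               VALUES (?, ?, ?, ?)
               ON CONFLICT(run_id, target_id)
               DO UPDATE SET state = excluded.state, payload = excluded.payload""",
            (run_id, target_id, state, json.dumps(payload)),
        )


def finished_targets(conn, run_id: str) -> set:
    """Targets a resumed run can skip because they reached a terminal state."""
    rows = conn.execute(
        "SELECT target_id, state FROM target_checkpoints WHERE run_id = ?",
        (run_id,),
    )
    return {target_id for target_id, state in rows if state in TERMINAL}
```

On resume, the orchestrator would call something like `finished_targets(conn, "my-run-123")` and process only the targets that are not in the returned set.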
Docker
docker build -t mutagen .
docker run --rm \
-e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
-v "$PWD:/workspace" \
mutagen run /workspace
The image is a slim multi-stage build with git for cloning targets, runs as a
non-root user, and uses /workspace as the working directory.
Development
pip install -e ".[dev,sandbox]"
pytest # 238 tests, mock-driven (no network, no real LLM)
ruff check mutagen # lint
ruff format mutagen # format
mypy mutagen # type-check (strict; aspirational)
CI (.github/workflows/ci.yml) runs the suite on
Python 3.11 & 3.12, lints/formats with ruff, type-checks with mypy, and builds
the Docker image on every push and PR.
Testing philosophy
The whole suite runs without a network or a real LLM: ports are mocked, subprocess calls are faked, and the few genuine integration tests (the sandbox runner) drive real pytest against tiny fixtures and skip cleanly when their optional tools are absent.
License
MIT.