AlphaEvolve with fuzzy evaluation. Evolve anything, not just code.
fuzzyevolve
Inspired by AlphaEvolve, but designed for "fuzzy" tasks like "write an evocative sci-fi short story".
What you get in practice:
- A repeatable loop that steadily improves a draft when “good” is subjective.
- A population of diverse candidates (not 50 near-identical paraphrases).
- A full run record you can resume, audit, and browse in a TUI.
At the end you get a diverse, high-quality set of outputs for your goal. It's especially fun to see which "lineages" survive and which get pruned.
Potential applications
This is all an experiment, but here are some things I've played with or plan to try:
- Creative writing/short stories
- Prompts for use in downstream tasks/agents
- Prompts for image/video models, where the judge generates the image/video and then evaluates the actual output
- Safety/jailbreaking tests: can you find a niche, diverse set of inputs that jailbreak LLMs?
Quick start
```shell
export GOOGLE_API_KEY=...  # default config uses google-gla:* models
uv sync
# Uses ./config.toml if present (or defaults)
uv run fuzzyevolve "This is my starting prompt."
```
fuzzyevolve uses pydantic-ai for LLM calls, so it should work with Google, OpenAI, or Anthropic models (and anything else pydantic-ai supports). Configure models via [llm].judge_model and [[llm.ensemble]].model in config.toml, and set the corresponding API key env var.
Included examples
`config.toml` is a working example config you can start from (and fuzzyevolve will auto-detect it if it's in your CWD). `best.md` is a real output report from a run (top individuals by fitness + per-metric μ/σ).
Example config.toml switch:
```toml
[llm]
judge_model = "openai:gpt-4o-mini"

[[llm.ensemble]]
model = "openai:gpt-4o-mini"
weight = 1.0
temperature = 1.0
```
Input can be a string, a file path, or stdin:
```shell
uv run fuzzyevolve seed.txt
cat seed.txt | uv run fuzzyevolve
```
Output goes to best.md by default (override with --output). By default it includes the top 20 individuals by fitness (override with --top).
Override the goal/metrics quickly from the CLI:
```shell
uv run fuzzyevolve \
  --goal "Write a punchy, helpful README section about caching." \
  --metric clarity --metric usefulness --metric concision \
  "Draft text goes here..."
```
By default, each run is recorded under .fuzzyevolve/runs/<run_id>/ (checkpoints, events, and raw LLM prompts/outputs). Resume with:
```shell
uv run fuzzyevolve --resume .fuzzyevolve/runs/<run_id> --iterations 100
```
Browse runs in the TUI:
```shell
uv run fuzzyevolve tui
# or open a specific run/checkpoint:
uv run fuzzyevolve tui --run .fuzzyevolve/runs/<run_id>
```
Disable recording with --no-store.
Embeddings use sentence-transformers (installed by default). Configure the model via [embeddings].model in config.toml.
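The crowding/pruning idea is simple at its core: when the pool is over capacity, find the most similar pair of individuals in embedding space and drop the weaker one. A minimal sketch of that idea in pure Python (the `embedding`/`score` field names and the `prune_most_crowded` helper are illustrative, not fuzzyevolve's actual API):

```python
from math import sqrt


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))


def prune_most_crowded(pool: list[dict], max_size: int) -> list[dict]:
    """Drop the lower-scoring member of the closest pair until the pool fits."""
    pool = list(pool)
    while len(pool) > max_size:
        # Find the most similar (i.e. most crowded) pair.
        best_pair, best_sim = (0, 1), -1.0
        for i in range(len(pool)):
            for j in range(i + 1, len(pool)):
                sim = cosine(pool[i]["embedding"], pool[j]["embedding"])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        i, j = best_pair
        # Of the crowded pair, remove whichever scores worse.
        loser = i if pool[i]["score"] < pool[j]["score"] else j
        pool.pop(loser)
    return pool
```

This is what keeps the population from collapsing into near-identical paraphrases: a strong individual cannot crowd out its own clones.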
What it does (high level)
- Critique: a structured critique of the current parent (preserve / issues / rewrite routes).
- Mutate: multiple LLM “operators” propose children (e.g. conservative improvement vs high-variance exploration).
- Judge: an LLM ranks parent/children (and optional anchors/opponent) per metric using tiered rankings (ties allowed).
- Learn: per-metric TrueSkill updates convert rankings into ratings (μ/σ), then a conservative score selects “best so far”.
- Stay diverse: a fixed-size population is maintained using embedding-space crowding/pruning.
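The loop above can be sketched as follows. The stub functions stand in for the real LLM calls, and all names are illustrative rather than fuzzyevolve's actual API:

```python
import random

# Stubs standing in for LLM calls (illustration only).
def critique(parent: str) -> str:
    return "keep the opening; vary the ending"

def mutate(parent: str, guidance: str, op: str) -> str:
    return parent + f" [{op}]"

def judge_rank(candidates: list[str]) -> list[int]:
    # Real judge: tiered per-metric rankings from an LLM. Here: random tiers.
    return [random.randrange(len(candidates)) for _ in candidates]


def evolve(seed: str, iterations: int = 3) -> list[str]:
    pool = [seed]
    for _ in range(iterations):
        parent = random.choice(pool)                   # select
        guidance = critique(parent)                    # critique
        children = [mutate(parent, guidance, op)       # mutate
                    for op in ("conservative", "explore")]
        ranks = judge_rank([parent] + children)        # judge
        # In the real loop, `ranks` feeds per-metric TrueSkill updates,
        # and embedding-space pruning caps the pool size.
        pool.extend(children)
    return pool
```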
Mental model (the important bits)
- An “individual” is a text plus:
  - an embedding (for diversity), and
  - a TrueSkill rating per metric (for quality).
- The judge doesn’t assign absolute scores; it ranks candidates relative to each other per metric.
- The population is a fixed-size “portfolio” spread out in embedding space.
- Exploration is encouraged via an optimistic parent selector (μ + β·σ), while reporting uses a conservative score (μ - c·σ).
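In code, that optimistic/conservative split looks something like this (the constants are illustrative, not the project's defaults):

```python
def optimistic(mu: float, sigma: float, beta: float = 1.0) -> float:
    """Selection score: reward high mean OR high uncertainty (exploration)."""
    return mu + beta * sigma

def conservative(mu: float, sigma: float, c: float = 3.0) -> float:
    """Reporting score: only count quality we are reasonably sure of."""
    return mu - c * sigma

# A well-tested veteran vs. an uncertain newcomer, as (mu, sigma):
veteran = (28.0, 1.0)
newcomer = (25.0, 8.0)
assert optimistic(*newcomer) > optimistic(*veteran)      # worth exploring
assert conservative(*veteran) > conservative(*newcomer)  # safer "best so far"
```

The same (μ, σ) pair drives both decisions; only the sign on σ changes depending on whether you are picking a parent or declaring a winner.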
How it works (one iteration, step by step)
- Select parent from the population (mixture policy: uniform sampling or optimistic tournament).
- Critique parent into reusable guidance: what to preserve, what to fix, distinct rewrite routes.
- Plan mutation jobs across operators (minimums + weighted sampling).
- Generate children (LLM rewrites). Exploration operators can intentionally omit the parent text to avoid “paraphrase gravity”.
- Assemble a battle: parent + children (+ optional frozen anchors) (+ optional opponent from the pool).
- Judge by ranking: the LLM returns tiered rankings for each metric (ties allowed; outputs are validated and optionally repaired).
- Update ratings with per-metric TrueSkill, freezing anchors.
- Insert children into the fixed-size pool; enforce diversity with embedding-space crowding/pruning.
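The rating-update step (step 7) has roughly this shape. This is **not** the actual TrueSkill math (real TrueSkill is a Bayesian factor-graph update); it is a simplified rank-based nudge to show how tiered rankings become μ/σ changes:

```python
def update_ratings(ratings: list[tuple[float, float]],
                   ranks: list[int],
                   lr: float = 0.5) -> list[tuple[float, float]]:
    """Simplified rank-based update (illustration only, not real TrueSkill).

    ratings: list of (mu, sigma); ranks: list of ints, 0 = best tier.
    Winners move up, losers move down, and sigma shrinks as evidence accrues.
    """
    n = len(ratings)
    mean_rank = sum(ranks) / n
    out = []
    for (mu, sigma), r in zip(ratings, ranks):
        # Better-than-average rank -> positive nudge, scaled by uncertainty:
        # highly uncertain individuals move further on new evidence.
        delta = lr * sigma * (mean_rank - r) / max(n - 1, 1)
        out.append((mu + delta, sigma * 0.95))  # each battle reduces sigma a bit
    return out

# Three fresh individuals ranked 1st, 2nd, 3rd in one battle:
updated = update_ratings([(25.0, 8.33)] * 3, ranks=[0, 1, 2])
```

Anchors are "frozen" by simply skipping this update for them, so they stay fixed reference points across battles.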
Configuration
Config is a single TOML/JSON file. If config.toml or config.json exists in the current directory it’s auto-detected; pass an explicit file with --config.
See config.toml for a complete example. The structure is intentionally nested:
- `[task]` and `[metrics]` define what “good” means (goal + metric names/descriptions).
- `[mutation]` defines the operator set, job budget, and per-operator uncertainty.
- `[judging]` controls judge retries + optional opponents.
- `[rating]` controls TrueSkill parameters and the score’s LCB constant.
- `[embeddings]` defines the sentence-transformers model to use for diversity.
- `[population]` defines the fixed pool size.
- `[selection]` configures the parent-selection mixture policy.
- `[anchors]` optionally injects frozen reference anchors (seed + periodic “ghosts”) into battles.
- `[llm]` chooses the judge model and the mutation ensemble.
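A fragment showing how a few of these sections nest together (the goal, metric names, and model strings are illustrative values, not defaults):

```toml
[task]
goal = "Write an evocative sci-fi short story."

[metrics]
names = ["imagery", "originality", "pacing"]

[llm]
judge_model = "google-gla:gemini-2.0-flash"

[[llm.ensemble]]
model = "google-gla:gemini-2.0-flash"
weight = 1.0
temperature = 1.0
```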
Config Tips
- Cost/latency
  - Reduce `[mutation].jobs_per_iteration` and/or `[mutation].max_children`.
  - Use cheaper models in `[[llm.ensemble]]` and/or for `[llm].judge_model`.
  - Disable `[critic].enabled` if you want “mutate + judge” only.
- Diversity
  - Tune `[embeddings].model` if you want a different embedding model.
  - Increase population size, or use `population.pruning = "knn_local_competition"` to preserve niches.
- Stability
  - Increase `[judging].max_attempts` if the judge sometimes returns invalid structure.
  - Use anchors and/or opponents for better cross-population calibration.
Run data
When --store is enabled (default), each run is recorded under .fuzzyevolve/runs/<run_id>/:
- `checkpoints/latest.json` and `checkpoints/it000123.json` (periodic checkpoints)
- `texts/<sha256>.txt` (deduped text blobs)
- `events.jsonl` (structured iteration events)
- `stats.jsonl` (best score + pool size over time)
- `llm/` + `llm.jsonl` (raw prompts/outputs, indexed)
This is great for debugging and iteration, but it also means your prompts and model outputs are stored locally. Avoid evolving sensitive content if you don’t want it written to disk.
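Since the run record is plain JSONL, it is easy to inspect programmatically. A minimal sketch for loading `stats.jsonl` (the per-row field names are an assumption for illustration; check a real `stats.jsonl` for the actual schema):

```python
import json
from pathlib import Path


def load_stats(run_dir: str) -> list[dict]:
    """Load per-iteration stats rows from a run directory's stats.jsonl."""
    rows = []
    for line in (Path(run_dir) / "stats.jsonl").read_text().splitlines():
        if line.strip():  # skip blank lines defensively
            rows.append(json.loads(line))
    return rows


# Hypothetical usage: plot or print best-score progress over a run.
# for row in load_stats(".fuzzyevolve/runs/<run_id>"):
#     print(row.get("iteration"), row.get("best_score"))
```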
CLI
run is the default command, so these are equivalent:
uv run fuzzyevolve "Seed text..."
uv run fuzzyevolve run "Seed text..."
To open the run browser:
```shell
uv run fuzzyevolve tui
```
run options
- `--config` / `-c`: Path to TOML/JSON config
- `--output` / `-o`: Output path (default `best.md`)
- `--top`: How many top individuals to include (default 20; `0` = all)
- `--iterations` / `-i`: Override `run.iterations`
- `--goal` / `-g`: Override `task.goal`
- `--metric` / `-m`: Override `metrics.names` (repeatable)
- `--resume`: Resume from a previous run directory (or checkpoint file)
- `--store` / `--no-store`: Enable/disable recording under `.fuzzyevolve/`
- `--log-level` / `-l`: Logging level (`debug|info|warning|error|critical` or a number)
- `--log-file`: Write logs to a specific file
- `--quiet` / `-q`: Hide the progress bar and non-essential logging
Requirements
- Python 3.10+
- uv (recommended)
- Any model supported by `pydantic-ai` (Google/OpenAI/Anthropic all work; configure via `[llm].judge_model` and `[[llm.ensemble]].model`)
- An API key for the provider you choose
```shell
export GOOGLE_API_KEY=...     # e.g. google-gla:*
export OPENAI_API_KEY=...     # e.g. openai:*
export ANTHROPIC_API_KEY=...  # e.g. anthropic:*
```
Troubleshooting
- `ImportError: sentence-transformers is required`
  - Run `uv sync` (or `pip install sentence-transformers`).
- Judge returns invalid rankings / retries fail
  - Increase `[judging].max_attempts`, or switch to a more reliable judge model.
- Runs are expensive
  - Start with fewer metrics, fewer mutation jobs, and a smaller population. Then scale up.
- Resume isn’t picking up where you expect
  - Point `--resume` at a run directory (or a checkpoint file). The latest checkpoint is `checkpoints/latest.json`.
Development
```shell
uv sync --extra dev
uv run ruff format .
uv run ruff check .
uv run pytest -q
```
License
Apache 2.0 — see LICENSE.
File details
Details for the file fuzzyevolve-0.2.2-py3-none-any.whl.
File metadata
- Download URL: fuzzyevolve-0.2.2-py3-none-any.whl
- Upload date:
- Size: 52.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.9.26 {"installer":{"name":"uv","version":"0.9.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `8d9546f11d7d6fa1605d47cb8a8956e000f38bd276caf07e8e15b329996d2958` |
| MD5 | `1e2279fe7fbfc62cc2db52aa4868a7fa` |
| BLAKE2b-256 | `f46b0b53d6f9331203c7bb20bd22dafd6564027abe5f15776fd7e9083c1ea059` |