🧬 HELIX
Hierarchical Evolution via LLM-Informed eXploration
DNA evolves. So does your codebase.
Evolutionary optimization for full codebases using agentic coding tools as the mutation engine and git worktrees as the population pool.
HELIX brings reflective Pareto evolution out of the single-artifact setting and into real software projects: entire repositories, multi-turn agentic mutation, tool use, web research, and verification loops, all inside a single evolutionary stage. Currently supports Claude Code, with support for OpenCode, Codex CLI, and Cursor CLI coming in the next release.
Quick Start · How It Works · Configuration · CLI Reference · Results
Safety: HELIX never modifies your working branch, HEAD, staging area, or remote. All mutations live in detached worktrees under .helix/worktrees/ and branches named helix/*. If your checkout is dirty, HELIX snapshots the current tracked and untracked changes into the seed worktree while leaving your original checkout untouched. Run helix clean to remove saved state and worktrees when you are done.
Why HELIX?
HELIX is built for a setting that today's evolution systems, including KISS and OpenEvolve, still do not really handle: improving real, multi-file codebases where useful mutations require exploration, iteration, and tooling, not just a single blind rewrite.
Instead of treating one file or one patch as the candidate, HELIX treats the entire repository as the evolving organism. Each mutation is a full agentic coding session running inside an isolated git worktree, so a candidate can:
- Read across the codebase to understand architecture and dependencies.
- Edit multiple files coherently in one mutation.
- Use tools mid-mutation like tests, linters, shell commands, and web search.
- Take multiple turns to diagnose and self-correct before the mutation is scored.
- Stay inside one evolutionary stage rather than requiring an outer orchestration loop to get tool use or iteration.
The result is a new kind of evolutionary optimizer: one that preserves the reflective Pareto-evolutionary core while making it practical for whole repositories and realistic software engineering tasks.
Coding agent as the mutation engine
The difference between HELIX and /chat/completions-style evolvers (GEPA, DSPy-Refine, ShinkaEvolve) is that HELIX's mutation is driven by a coding agent, not a single LLM call. A GEPA-style mutation is one prompt → one completion → apply the diff. HELIX's mutation is a full agentic session bounded only by max_turns:
| | GEPA / chat-completion evolvers | HELIX |
|---|---|---|
| Mutation shape | Single request/response | Multi-step agentic session |
| Working surface | A single prompt / predictor string | The entire repository in a git worktree |
| Mid-mutation introspection | None | Read any file, grep, glob, find, follow imports |
| Mid-mutation verification | None | Run the test suite, type-checker, linter; read failures and react |
| External information | None | Fetch the web, hit GitHub API, query package indexes live |
| Self-correction | None per proposal (retries are separate generations) | Inside one mutation: diagnose a test failure, edit another file, re-run, commit only if green |
| Cost accounting | 1 LLM call = 1 proposal | 1 proposal = N turns, bounded by max_turns and the agent's own stopping decision |
This is why solver/solution.py on cap-x or a shrinkwrap of an ML kernel on GPT-OSS behaves qualitatively differently from a GEPA run on the same task: HELIX's candidate is the program a team of N humans could edit over an afternoon, not a single text blob produced in one shot.
✨ Key Features
| | Feature | Description |
|---|---|---|
| 🧬 | Whole-codebase evolution | The candidate is your repository, not a single file, prompt, or patch |
| 📂 | Multi-file editing | Mutate entire directory trees — edit auth.py:42 and routes.py:18 in one coherent session |
| 🔁 | Multi-turn mutations | A single mutation can inspect, edit, test, revise, and continue before being evaluated |
| 🔧 | Tool access during mutation | Claude Code can read, grep, run tests, inspect the codebase, and use the web mid-mutation |
| ✅ | Self-verification | Mutations verify themselves by running commands before committing |
| 📊 | Pareto frontier | Instance-level Pareto selection across test cases — no single metric bottleneck |
| ⚡ | Parallel evaluation | Worktrees are isolated → parallel proposals via ThreadPoolExecutor (GEPA parity, bounded by evolution.max_workers) |
| 🔀 | Merge / crossover | Combine two frontier candidates that excel on different instances |
| 💾 | State persistence & resume | Crash-safe — resume from any generation with helix resume |
| 🚦 | Gated mutations | Train-set gating rejects regressions before Pareto evaluation |
| 📋 | Semantic mutation log | Full trajectory with root-cause analysis, changes made, and parent lineage |
🔄 How It Works
┌──────────────┐
│ Seed Code │
└──────┬───────┘
│
▼
┌──────────────┐
│ Evaluate │◄──────────────────────┐
│ (parallel) │ │
└──────┬───────┘ │
│ │
▼ │
┌──────────────┐ │
│ Select Parent│ │
│ (Pareto) │ │
└──────┬───────┘ │
│ │
▼ │
┌───────────────────────┐ │
│ Mutate via Claude │ │
│ Code in Worktree │ │
│ │ │
│ • Read files │ │
│ • Edit multi-file │ │
│ • Run tests │ │
│ • Self-correct │ │
└───────────┬───────────┘ │
│ │
▼ │
┌──────────────┐ ┌──────────┐ │
│Gate on Train │────▶│ Reject │ │
│ (regress?) │ yes └──────────┘ │
└──────┬───────┘ │
no │ │
▼ │
┌──────────────┐ │
│Pareto Update │ │
│ (val scores) │ │
└──────┬───────┘ │
│ │
┌────┴────┐ │
│ Merge? │ every N gens │
└────┬────┘ │
│ │
▼ │
┌──────────────┐ │
│ Cleanup │ │
│ dominated │───────────────────────┘
└──────────────┘
The loop in detail:
- Seed — Your starting code is copied into a git worktree and evaluated
- Evaluate — Run your evaluator command; parse scores per test/instance
- Select — Pick a parent from the Pareto frontier (weighted by instance wins)
- Mutate — Spawn Claude Code in an isolated worktree with full tool access. It reads files, diagnoses failures, makes surgical multi-file edits, and runs commands to verify
- Gate — Re-evaluate on the train set. Reject if the mutation caused regressions
- Pareto Update — Evaluate on the val set and update the Pareto frontier
- Merge — Periodically combine two complementary frontier candidates via Claude Code
- Cleanup — Remove dominated worktrees; persist state; repeat
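The loop above can be sketched in a few lines of Python. This is an illustrative reduction, not HELIX's actual internals: evaluate, mutate, and the helper names are hypothetical, candidates are opaque objects, and evaluate returns a dict of per-instance scores.

```python
# Illustrative sketch of the HELIX generation loop. Names are hypothetical
# and do not mirror HELIX's modules; it shows the gate -> Pareto-update shape only.
import random

def dominates(a, b):
    """a dominates b if a >= b on every instance and > on at least one."""
    keys = a.keys() | b.keys()
    return (all(a.get(k, 0) >= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) > b.get(k, 0) for k in keys))

def prune_dominated(frontier):
    return {c: s for c, s in frontier.items()
            if not any(dominates(other, s)
                       for oc, other in frontier.items() if oc != c)}

def evolve(seed, evaluate, mutate, max_generations=20, rng=random.Random(0)):
    frontier = {seed: evaluate(seed, split="val")}
    for _ in range(max_generations):
        parent = rng.choice(list(frontier))              # pick from the frontier
        child = mutate(parent)                           # agentic session in a worktree
        if sum(evaluate(child, split="train").values()) < \
           sum(evaluate(parent, split="train").values()):
            continue                                     # train-set gate: reject regressions
        frontier[child] = evaluate(child, split="val")   # Pareto update on val
        frontier = prune_dominated(frontier)
    return frontier
```

The real loop adds minibatch sampling, merge operations, budget caps, and persistence, but the accept/reject shape is the same.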
🚀 Quick Start
Installation
# Clone and install
git clone https://github.com/KE7/helix.git
cd helix
pip install -e .
# Verify
helix --help
Initialize a Project
cd your-project/
helix init
This creates a helix.toml config file and a .helix/ directory. Edit helix.toml to set your objective and evaluator.
Whole-repo-as-candidate Model
HELIX treats your entire working tree as the candidate. There is no target_file — Claude Code may read, edit, create, or delete any file in the project tree during each mutation. A minimal project layout looks like:
my-project/
├── helix.toml # HELIX config (run `helix init` to generate)
├── evaluate.py # Your evaluator script (must output JSON with a "score" key)
├── solve.py # File(s) you want to evolve (Claude Code will find them)
└── ... # Any other files; HELIX will consider them too
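A minimal evaluate.py matching this layout could look like the sketch below. The solve module and its solve() function are hypothetical placeholders for whatever your project exposes; the only contract HELIX imposes (for the json_score parser) is JSON on stdout with a "score" key.

```python
# evaluate.py -- minimal sketch. Assumes a hypothetical solve.py exposing
# solve() that returns a numeric result; adapt to your own project.
import json

def main():
    try:
        import solve
        score = float(solve.solve())  # hypothetical entry point
    except Exception:
        score = 0.0                   # a crashing or missing candidate scores zero
    print(json.dumps({"score": score}))

if __name__ == "__main__":
    main()
```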
To restrict what Claude Code touches, set claude.background in helix.toml:
[claude]
background = "Only modify files under src/. Do not edit tests/ or config/."
Run Evolution
helix evolve
View Results
# Show the Pareto frontier
helix frontier
# Show the best candidate
helix best
# Export best candidate to a directory
helix best --export ./best-solution
# View full mutation log
helix log
⚙️ Configuration
HELIX is configured via helix.toml in your project root.
Minimal Example
objective = "Maximize test pass rate and code coverage"
[evaluator]
command = "pytest --tb=short -q"
When your evaluator needs project dependencies, make evaluator.command use the
same environment those dependencies are installed in. Good patterns are
uv run python evaluate.py or a wrapper like bash run_eval.sh. Avoid bare
python3 evaluate.py unless that interpreter already has everything your
evaluator imports.
Full Example
# What you want the code to do better
objective = "Maximize sum of radii of 26 non-overlapping circles packed in a unit square"
# Starting directory (default: current directory)
seed = "."
# RNG seed for deterministic parent selection (default: 0)
rng_seed = 0
[evaluator]
command = "uv run python evaluate.py"
# Available parsers: "pytest" | "exitcode" | "json_accuracy" | "json_score"
score_parser = "json_score"
include_stdout = true
include_stderr = true
extra_commands = [] # additional commands to run for context
protected_files = ["evaluate.py"] # optional extra files HELIX must keep immutable
[dataset]
# Cardinality of the train / val splits. Used by HELIX's minibatch
# sampler to generate integer indices that the evaluator (running in
# the worktree) filters against its own dataset via helix_batch.json.
# Leave both unset for single-task mode (legacy full-batch path).
# train_size = 200
# val_size = 200
[seedless]
# Seedless mode: generate initial candidate from objective via LLM
enabled = false
# Optional prompt-grounding training dataset (used only in seedless
# seed generation). Accepts a JSON array file, a JSONL file, or a
# directory of JSON files. When provided, the first 3 examples are
# included in the seed-generation prompt for representative grounding.
# train_path = "puzzles/train"
# val_path = "puzzles/val"
[evolution]
max_generations = 20
perfect_score_threshold = 1.0 # skip proposals whose instance_scores all reach this
max_evaluations = -1 # evaluation budget cap (-1 = no cap)
merge_enabled = false # enable merge/crossover operations
max_merge_invocations = 5 # total merge cap across entire run
merge_val_overlap_floor = 5 # minimum val-set overlap for merge candidates
merge_subsample_size = 5 # stratified val subsample size for merge acceptance (GEPA parity)
max_workers = 8 # thread-pool cap for parent-eval + mutation pools
# (default: os.cpu_count(), or 32 if that returns None)
num_parallel_proposals = 1 # parallel mutations per generation; "auto" resolves to max_workers // minibatch_size
minibatch_size = 3 # train-set minibatch gate size
cache_evaluation = true # reuse per-instance evaluator results
acceptance_criterion = "strict_improvement"
val_stage_size = 0 # optional first-N val gate before full val
[claude]
model = "sonnet" # or "opus", "haiku", full model name
effort = "medium" # optional: "low" | "medium" | "high" | "max"
max_turns = 20
allowed_tools = ["Read", "Edit", "Write", "Bash", "Glob", "Grep"]
# background = "Only modify files under src/. Do not touch tests/ or config/."
[worktree]
base_dir = ".helix/worktrees"
Dataset Modes
HELIX splits dataset concerns across two TOML sections:
| Section | Purpose |
|---|---|
| [dataset] | Cardinality only — train_size / val_size — drives the minibatch sampler when the evaluator owns the dataset and HELIX hands off positional indices via helix_batch.json (Architecture A). |
| [seedless] | Seedless-mode toggle + optional prompt-grounding paths (train_path / val_path) — used only during seed generation to show the LLM representative inputs. |

| Mode | Config | Description |
|---|---|---|
| Single-task | neither set | Optimize for a single task. Legacy full-batch evaluator path. |
| Positional-index handoff | dataset.train_size / dataset.val_size set | HELIX samples integer indices into range(train_size); the evaluator reads helix_batch.json from cwd and filters its own dataset. |
| Seedless multi-task | seedless.enabled = true, seedless.train_path set | Seed generation prompt includes the first 3 training examples for grounding. |
HELIX does not own separate dataset files for train/val; your evaluator remains
the source of truth. During evolution HELIX sets HELIX_SPLIT (train or val)
so evaluator-owned datasets can switch behavior by phase, mirroring GEPA's
trainset / valset duality.
When evolution.val_stage_size is set to a positive value and dataset.val_size is also set, accepted mutation proposals run a deterministic first-N validation stage before the full validation sweep. Stage-only results are never added to the frontier; HELIX still persists only full-val results for Pareto ranking and resume stability.
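Putting the handoff together, an evaluator on the positional-index path reads helix_batch.json from its working directory and checks HELIX_SPLIT. The sketch below assumes helix_batch.json is a JSON array of integer indices (dataset loading and scoring are entirely your project's concern):

```python
# Sketch of an evaluator on the positional-index handoff path.
# helix_batch.json and HELIX_SPLIT come from HELIX; the batch-file format
# (a JSON array of integers) is an assumption here -- everything else is yours.
import json
import os

def read_batch(path="helix_batch.json"):
    """Return the sampled instance indices, or None in full-batch mode."""
    if not os.path.exists(path):
        return None                  # single-task / legacy full-batch mode
    with open(path) as f:
        return json.load(f)          # assumed: JSON array of integer indices

def run(dataset, score_fn):
    split = os.environ.get("HELIX_SPLIT", "val")   # "train" or "val"
    ids = read_batch()
    ids = range(len(dataset)) if ids is None else ids
    scores = {str(i): score_fn(dataset[i], split) for i in ids}
    print(json.dumps({"accuracy": sum(scores.values()) / max(len(scores), 1),
                      "instance_scores": scores}))
```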
Evaluator Integrity
HELIX can lock evaluator-critical files so mutations and merges cannot game the score by editing the benchmark itself.
[evaluator]
command = "uv run python evaluate.py"
score_parser = "json_accuracy"
protected_files = [
"evaluate.py",
"goldens.json",
"helpers/evaluator_utils.py",
]
At run start, HELIX hashes the evaluator command target plus any
evaluator.protected_files entries and writes the manifest to
.helix/evaluator_manifest.json. Candidates that modify any protected file are
rejected before evaluation.
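The integrity check boils down to hashing each protected file at run start and re-hashing before accepting a candidate. A minimal sketch of that idea (the real manifest at .helix/evaluator_manifest.json may use a different schema):

```python
# Minimal sketch of the protected-file integrity idea: hash files at run
# start, re-hash later, flag any mismatch. Schema details are illustrative.
import hashlib

def sha256_file(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):  # stream to handle large files
            h.update(chunk)
    return h.hexdigest()

def build_manifest(paths):
    return {p: sha256_file(p) for p in paths}

def violations(manifest):
    """Return the protected files whose contents changed since the manifest."""
    return [p for p, digest in manifest.items() if sha256_file(p) != digest]
```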
Per-example Parallelism Inside the Evaluator
HELIX parallelises across proposals (num_parallel_proposals) and across
worktrees, but each evaluator invocation sees one candidate and a batch of
instance ids as a single subprocess. Per-example parallelism — evaluating
multiple ids of one candidate concurrently — lives inside the evaluator,
not inside HELIX's engine.
This is a deliberate architectural split: GEPA's reference adapter fans out
per-example in-process, which is essentially free; HELIX's subprocess model
would pay full subprocess-startup cost for each example. If you want N-way
parallelism per batch, your evaluate.py should do it directly:
import json
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

instance_ids = load_batch_from_helix()  # your loader: helix_batch.json, argv, or HELIX_SPLIT
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(zip(instance_ids, pool.map(evaluate_one, instance_ids)))
print(json.dumps({"accuracy": mean(results.values()), "instance_scores": results}))
Pick the worker count however you like (constant, CLI arg, derived from
os.cpu_count()). HELIX remains agnostic — it just consumes the per-instance
scores the evaluator returns.
Score Parsers
HELIX includes 4 built-in score parsers to extract metrics from evaluator output:
| Parser | Input | Output | Use Case |
|---|---|---|---|
| pytest | Parses pytest -q stdout | scores: pass_rate, duration; instance_scores: per-test pass/fail | Unit test suites |
| exitcode | Exit code only | scores: success (1.0 or 0.0) | Simple pass/fail evaluators |
| json_accuracy | JSON stdout with accuracy field | scores: accuracy; instance_scores: per-instance scores | Classification and benchmark tasks |
| json_score | JSON stdout with score field | scores: score; instance_scores: score | Optimization tasks (e.g., circle packing) |
Example evaluator outputs:
# json_score parser expects:
print(json.dumps({"score": 2.63}))
# json_accuracy parser expects:
print(json.dumps({
"accuracy": 0.85,
"instance_scores": {"puzzle_001": 1.0, "puzzle_002": 0.0}
}))
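On the parsing side, json_score-style extraction amounts to finding the last JSON object on stdout that carries a score key. The sketch below shows that core idea only; HELIX's built-in parser may handle more edge cases:

```python
# Sketch of json_score-style extraction from evaluator stdout.
# Illustrative only -- HELIX's built-in parser may differ in details.
import json

def parse_json_score(stdout: str):
    # Scan from the last line backwards so log noise before the JSON is tolerated.
    for line in reversed(stdout.strip().splitlines()):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "score" in obj:
            score = float(obj["score"])
            return {"scores": {"score": score},
                    "instance_scores": obj.get("instance_scores", {"score": score})}
    raise ValueError("no JSON object with a 'score' key found on stdout")
```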
📖 CLI Reference
| Command | Description |
|---|---|
| helix init | Initialize HELIX in the current directory — creates helix.toml and .helix/ |
| helix evolve | Run the evolutionary loop |
| helix frontier | Display the current Pareto frontier as a table |
| helix best | Show the best candidate; --export PATH to copy it out |
| helix history | Show the candidate lineage as a tree |
| helix resume | Resume a previously interrupted evolution run |
| helix clean | Remove all worktrees and .helix/ state (with confirmation) |
| helix log | Show semantic mutation log — full trajectory with parent lineage |
helix evolve Options
--dir PATH Project directory containing helix.toml (default: .)
--config PATH Path to config file (default: helix.toml)
--objective TEXT Override the objective string
--evaluator TEXT Override the evaluator command
--generations INT Override max_generations
--no-merge Disable merge operations
--model TEXT Claude model (e.g. sonnet, opus, claude-sonnet-4-5)
--effort LEVEL Reasoning effort: low | medium | high | max
🧪 Results
🔵 Circle Packing
Pack 26 non-overlapping circles in a unit square, maximizing sum of radii.
| | Score | Config |
|---|---|---|
| Seed (naive concentric grid) | 0.9798 | — |
| HELIX best (gen 14 of 30) | 2.6360 | haiku · low effort · max_turns=20 |
| GEPA optimize_anything (blog) | 2.635 | gemini-3-flash |
Note: HELIX beat the GEPA blog benchmark (2.6360 vs 2.635) using Claude Haiku with low reasoning effort and a 20-turn per-mutation budget. See examples/circle_packing/ for the full fixture including solve_optimized.py (the best evolved solution).
🏗️ Architecture
.helix/
├── config.toml # Snapshot of helix.toml at run start
├── evaluator_manifest.json # Protected evaluator file hashes
├── state.json # Generation, frontier, budget
├── lineage.json # Full ancestry graph
├── log/ # Semantic mutation logs
│ ├── g1-m0.json
│ └── g2-x0.json
├── worktrees/
│ ├── g0-s0/ # Seed
│ ├── g1-m1/ # Gen 1 Mutation 1
│ └── g2-x1/ # Gen 2 Merge 1
└── evaluations/
└── g0-s0.json # EvalResult per candidate
Module Overview
| Module | Role |
|---|---|
| cli.py | Click CLI — init, evolve, frontier, best, history, resume, clean, log |
| config.py | TOML config parsing via Pydantic v2 |
| evolution.py | Main generation loop with gating, merge, and termination on max_generations / max_evaluations |
| population.py | Candidate, EvalResult, ParetoFrontier |
| worktree.py | Git worktree lifecycle (create, clone, snapshot, remove) |
| executor.py | Run evaluator commands |
| mutator.py | Claude Code mutation invocation with autonomous system prompt |
| merger.py | Claude Code merge/crossover between complementary candidates |
| lineage.py | Ancestry graph tracking |
| state.py | Atomic state persistence and resume |
| display.py | Rich terminal UI with phase tracking |
📚 Citation
@software{helix2026,
title={HELIX: Hierarchical Evolution via LLM-Informed eXploration},
author={Elmaaroufi, Karim and OMAR},
year={2026},
url={https://github.com/KE7/helix}
}
📄 License
BSD 3-Clause License. See LICENSE for details.
🙏 Acknowledgments
HELIX's core evolutionary algorithm is based on GEPA optimize_anything by Agrawal, Lee, Ma, Elmaaroufi, Tan, Seshia, Sen, Klein, Stoica, Gonzalez, Khattab, Dimakis, and Zaharia. Their work on applying reflective Pareto evolution to any text made HELIX possible — we extended their algorithm to full codebases and agentic mutation but the foundation is theirs.
- GEPA optimize_anything — The algorithmic foundation: minibatch-gated Pareto evolution with reflective LLM mutation
- Claude Code — The agentic coding tool powering HELIX's mutation and merge engine
- OMAR — The multi-agent orchestration system used to build HELIX
@article{gepa_optimize_anything2026,
title={Introducing optimize\_anything},
author={Agrawal, Lakshya A and Lee, Donghyun and Ma, Wenjie and Elmaaroufi, Karim and Tan, Shangyin and Seshia, Sanjit A. and Sen, Koushik and Klein, Dan and Stoica, Ion and Gonzalez, Joseph E. and Khattab, Omar and Dimakis, Alexandros G. and Zaharia, Matei},
year={2026},
url={https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/}
}