Self-improvement engine for AI agents — evolve any agent autonomously using the Darwin Gödel Machine algorithm.
darwinloop
Point darwinloop at any Python agent, define what "good" looks like with benchmark tasks, and darwinloop will autonomously improve the code — iteration by iteration — using an LLM, without you writing a single patch.
Based on the Darwin Gödel Machine (Zhang et al., ICLR 2026).
Why darwinloop?
- Measurable gains. The football-score-agent router went from 51% → 80% accuracy in 5 iterations — 16 more teams recognised, pronoun follow-ups fixed, competition-vs-team routing corrected. Zero manual patches.
- Fully auditable. Every change is recorded as a unified diff. Every generation is preserved in an immutable JSON archive. Roll back anytime.
- Works on any Python agent. uAgents, LangChain, LangGraph, or raw Python: if it's a .py file, darwinloop can evolve it.
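To make the auditability claim concrete, here is a minimal sketch of how a change could be stored as a unified diff inside an immutable, JSON-serialisable record. `ArchiveRecord` and `make_diff` are hypothetical names for illustration, not darwinloop's actual classes, and the real archive format may differ.

```python
import difflib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ArchiveRecord:
    """Hypothetical stand-in for one archived generation."""
    agent_id: str
    parent_id: Optional[str]
    score: float
    diff: str


def make_diff(old_src: str, new_src: str, path: str = "router.py") -> str:
    """Return a unified diff between two versions of a source file."""
    return "".join(
        difflib.unified_diff(
            old_src.splitlines(keepends=True),
            new_src.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )


# Two invented versions of a tiny router, used only to produce a diff.
old = "def route(text):\n    return 'team'\n"
new = (
    "def route(text):\n"
    "    if 'live' in text:\n"
    "        return 'live'\n"
    "    return 'team'\n"
)

record = ArchiveRecord("agent_0001", "agent_0000", 0.60, make_diff(old, new))
archive_json = json.dumps([asdict(record)], indent=2)
print(record.diff)
```

Because the record is frozen, an accidental `record.score = ...` raises an error instead of silently rewriting history, which is one simple way to get the "never modified after creation" property.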
Quickstart
pip install darwinloop
from darwinloop import DarwinLoop, BenchmarkTask

tasks = [
    BenchmarkTask(id="t1", name="live_scores",
                  input="live scores now", expected="live"),
    BenchmarkTask(id="t2", name="vs_competition",
                  input="Arsenal vs Chelsea result", expected="competition"),
]

dl = DarwinLoop(target="my_agent/router.py", tasks=tasks, model="asi1")
result = dl.run(iterations=5)
print(f"Score: {result.base_score:.2f} → {result.best_score:.2f} (+{result.score_delta:.2f})")
result.apply()        # write best version back to router.py
result.save_report()  # save darwinloop_report.md
Expected output:
darwinloop — self-improvement engine for AI agents
Evaluating base agent (router.py)…
✓ agent_0000: score=0.51 (5/10 tasks passed)
── ITERATION 1/5 ──────────────────────────────────────────
Selected parents: ['agent_0000']
Proposal: Add pronoun (they/them/their) follow-up handling using ctx.last_team
Evaluating… score=0.60
agent_0000 → agent_0001 Score: 0.51 → 0.60 (+0.09)
[… 4 more iterations …]
Evolution complete! Score: 0.51 → 0.80 (+0.29 best: agent_0004 gen 4)
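The quickstart does not show what the target agent or the scoring looks like, so the following is a rough sketch under stated assumptions: the agent exposes a `route(text)` function, and a task passes when the returned label equals `expected`. `Task`, `route`, `pass_rate`, and the third task are all invented for illustration; darwinloop's own scoring may use a different rule.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in mirroring the BenchmarkTask fields used above."""
    id: str
    name: str
    input: str
    expected: str


def route(text: str) -> str:
    """Toy router: return an intent label for a user message."""
    lowered = text.lower()
    if "live" in lowered:
        return "live"
    if " vs " in lowered:
        return "competition"
    return "team"


def pass_rate(tasks: list) -> float:
    """Fraction of tasks whose routed label equals the expected label."""
    passed = sum(route(t.input) == t.expected for t in tasks)
    return passed / len(tasks)


tasks = [
    Task("t1", "live_scores", "live scores now", "live"),
    Task("t2", "vs_competition", "Arsenal vs Chelsea result", "competition"),
    # Invented third task that the toy router gets wrong on purpose.
    Task("t3", "fixtures", "when is the next game", "fixtures"),
]
print(f"{pass_rate(tasks):.2f}")
```

The failing third task is the kind of signal an improvement loop can act on: it names an input, the label that came back, and the label that was wanted.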
How it works
Your agent code
       │
       ▼
┌─────────────┐
│  Benchmark  │  Run tasks in isolated sandbox → score (0.0–1.0)
└──────┬──────┘
       │ failures
       ▼
┌─────────────┐
│  Diagnose   │  LLM analyses code + failures → improvement proposal
└──────┬──────┘
       │ proposal
       ▼
┌─────────────┐
│   Improve   │  LLM uses editor tools (str_replace) to apply change
└──────┬──────┘
       │ new code
       ▼
┌─────────────┐
│  Re-score   │  Run benchmarks again on new code
└──────┬──────┘
       │ score > old?
       YES → keep it (add to archive)
       NO  → discard it (archive still records it for open-ended exploration)
       │
       └── repeat N iterations
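The loop in the diagram can be sketched in a few lines. This is a toy under loud assumptions: the "agent" is a tuple of keyword rules rather than real source code, `propose` is a deterministic stub standing in for the LLM diagnose and improve steps, and the parent is chosen greedily as the best kept entry, which is a simplification of how an archive-based search would pick parents. None of these names come from darwinloop.

```python
from dataclasses import dataclass

# Each task is (input text, expected intent). Invented for illustration.
TASKS = [
    ("live scores now", "live"),
    ("Arsenal vs Chelsea result", "competition"),
    ("next game for Napoli", "fixtures"),
]


@dataclass(frozen=True)
class Entry:
    agent_id: str
    rules: tuple  # ((keyword, intent), ...): stands in for agent source code
    score: float
    kept: bool


def run_agent(rules, text):
    """Return the intent of the first matching keyword rule, else 'team'."""
    for keyword, intent in rules:
        if keyword in text.lower():
            return intent
    return "team"


def benchmark(rules):
    """Return (score between 0.0 and 1.0, list of failed tasks)."""
    failures = [(t, exp) for t, exp in TASKS if run_agent(rules, t) != exp]
    return 1 - len(failures) / len(TASKS), failures


def propose(rules, failures):
    """Stub for the LLM steps: add a rule that patches the first failure."""
    text, expected = failures[0]
    keyword = text.lower().split()[0]
    return rules + ((keyword, expected),)


def evolve(iterations):
    base_score, _ = benchmark(())
    archive = [Entry("agent_0000", (), base_score, True)]
    for i in range(1, iterations + 1):
        parent = max((e for e in archive if e.kept), key=lambda e: e.score)
        _, failures = benchmark(parent.rules)
        if not failures:
            break  # nothing left to fix
        child_rules = propose(parent.rules, failures)
        child_score, _ = benchmark(child_rules)
        # Every candidate is archived; it is "kept" only on strict improvement.
        archive.append(Entry(f"agent_{i:04d}", child_rules, child_score,
                             kept=child_score > parent.score))
    return archive


archive = evolve(iterations=5)
best = max(archive, key=lambda e: e.score)
print(best.agent_id, f"{best.score:.2f}")
```

The stub overfits by keying on the first word of each failing input, which a real proposal step should avoid; the point here is only the control flow of benchmark, propose, re-score, and archive.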
LLM Support
| Provider | Model | Env var or flag |
|---|---|---|
| ASI:One (default) | asi1 | ASI1_API_KEY |
| Anthropic | claude-3-5-sonnet-20241022 | ANTHROPIC_API_KEY |
| OpenAI | gpt-4o | OPENAI_API_KEY |
| Mock (free) | n/a | --dry-run |
Get an ASI:One API key at asi1.ai — it's the Fetch.ai ecosystem LLM.
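Before starting a paid run it can help to check that the right key is set. The helper below is a hypothetical pre-flight check, not part of darwinloop; the model names and variable names are taken from the table above, and how darwinloop itself reads them is not specified here.

```python
import os

# Model name -> environment variable, as listed in the LLM Support table.
REQUIRED_ENV = {
    "asi1": "ASI1_API_KEY",
    "claude-3-5-sonnet-20241022": "ANTHROPIC_API_KEY",
    "gpt-4o": "OPENAI_API_KEY",
}


def missing_key(model, env=None):
    """Return the name of the unset variable for `model`, or None if ready."""
    env = os.environ if env is None else env
    var = REQUIRED_ENV.get(model)
    if var is None:
        raise ValueError(f"unknown model: {model!r}")
    return None if env.get(var) else var


needed = missing_key("asi1")
if needed:
    print(f"Set {needed} first, or use --dry-run to test without a key.")
```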
Benchmark Packs
Pre-built domain packs so you don't need to write tasks from scratch:
from darwinloop import DarwinLoop
from darwinloop.packs import RoutingPack, CommercePack, SupportPack
# Routing agent (intent classification)
dl = DarwinLoop(target="agent/router.py",
                pack=RoutingPack(intents=["live", "team", "competition", "fixtures"]))
# Commerce agent (product search, cart, checkout)
dl = DarwinLoop(target="agent/shop.py", pack=CommercePack())
# Customer support agent
dl = DarwinLoop(target="agent/support.py", pack=SupportPack())
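How a pack turns its arguments into tasks is not documented here. As a purely illustrative guess, a routing pack could expand each requested intent into one task per example phrasing. `Task`, `TEMPLATES`, and `build_tasks` are invented names, and the phrasings are made up.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in with the same fields as BenchmarkTask."""
    id: str
    name: str
    input: str
    expected: str


# Invented example phrasings per intent; a real pack would ship its own.
TEMPLATES = {
    "live": ["live scores now", "what's the score right now"],
    "fixtures": ["when is the next game"],
}


def build_tasks(intents):
    """Expand each requested intent into one task per known phrasing."""
    tasks = []
    for intent in intents:
        for n, phrase in enumerate(TEMPLATES.get(intent, []), start=1):
            tasks.append(Task(f"{intent}_{n}", intent, phrase, intent))
    return tasks


# "team" has no templates in this sketch, so it contributes no tasks.
tasks = build_tasks(["live", "fixtures", "team"])
print([t.id for t in tasks])
```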
CLI Reference
# Evolve a specific file
darwinloop evolve agent/router.py --iterations 5 --model asi1
# Dry run (free, no API key needed)
darwinloop evolve agent/ --dry-run --auto
# Use a built-in benchmark pack
darwinloop evolve agent/router.py --pack routing --iterations 5
# Load benchmarks from a file
darwinloop evolve agent/router.py --tasks benchmarks.py --iterations 10
# Auto-generate benchmarks from agent code
darwinloop scaffold agent/router.py --output benchmarks.py
# View a previous run report
darwinloop report darwinloop_output/
# Diff two generations
darwinloop diff darwinloop_output/ --from agent_0000 --to agent_0004
Real Example: Football Agent
The examples/football/ directory contains the real football-score-agent router and its benchmark tasks.
darwinloop evolve examples/football/football_router.py \
--tasks examples/football/benchmarks.py \
--iterations 5 --model asi1
DGM-discovered improvements in 5 iterations:
| # | Improvement | Score impact |
|---|---|---|
| 1 | Pronoun follow-up (they/their/them → last team) | +0.09 |
| 2 | +16 clubs (Juventus, Atletico, Napoli, Dortmund…) | +0.08 |
| 3 | Competition-signal priority (vs, result, score) | +0.07 |
| 4 | Fixture regex expansion (next game, upcoming game) | +0.05 |
Total: 0.51 → 0.80 (+0.29)
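The reported total can be checked by adding the four per-improvement gains to the base score. The snippet uses `Decimal` so the two-decimal figures add up exactly instead of picking up binary floating-point noise.

```python
from decimal import Decimal

base = Decimal("0.51")
# Score impacts of improvements 1 to 4, as listed in the table.
gains = [Decimal("0.09"), Decimal("0.08"), Decimal("0.07"), Decimal("0.05")]

score = base
for step, gain in enumerate(gains, start=1):
    score += gain
    print(f"after improvement {step}: {score}")

total_gain = sum(gains)
print(f"total: {base} -> {score} (+{total_gain})")
```

Four gains over five iterations is consistent with the best agent being `agent_0004`: one iteration produced no kept improvement.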
Safety
darwinloop validates and sandboxes every candidate before running it, records every change, and preserves every generation so that any step can be audited or reverted.
| Guarantee | Implementation |
|---|---|
| AST validation before execution | sandbox/validator.py blocks eval, exec, shell=True |
| Subprocess isolation | All agent code runs in a child process, never in the darwinloop process |
| Hard timeouts | Sandbox default 30s, configurable via sandbox_timeout |
| No network in sandbox | Network imports trigger warnings; calls fail at runtime |
| Immutable archive | AgentEntry records are never modified after creation |
| Diff transparency | Every change recorded as unified diff |
| Revert anytime | All generations preserved; load archive and roll back |
| Dry run mode | MockLLMClient tests full pipeline at zero cost |
| Score regression protection | New code kept only if score strictly improves |
| Human checkpoints | Without --auto, prompts before each iteration |
See SECURITY.md for full details.
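As a rough illustration of the first three guarantees, the sketch below rejects source that calls `eval` or `exec` or passes `shell=True`, then runs accepted source in a child interpreter with a hard timeout. It is not the code in sandbox/validator.py; `find_violations` and `run_isolated` are hypothetical names, and a static check like this is easy to bypass (for example via `getattr`), so it complements process isolation rather than replacing it.

```python
import ast
import subprocess
import sys

BANNED_CALLS = {"eval", "exec"}


def find_violations(source: str) -> list:
    """Return human-readable reasons the source should be rejected."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        # Direct calls such as eval(...) or exec(...).
        if isinstance(node.func, ast.Name) and node.func.id in BANNED_CALLS:
            problems.append(f"line {node.lineno}: call to {node.func.id}()")
        # Any call that passes the literal keyword argument shell=True.
        for kw in node.keywords:
            if (kw.arg == "shell" and isinstance(kw.value, ast.Constant)
                    and kw.value.value is True):
                problems.append(f"line {node.lineno}: shell=True")
    return problems


def run_isolated(source: str, timeout: float = 30.0) -> str:
    """Validate, then run the source in a child interpreter with a timeout."""
    problems = find_violations(source)
    if problems:
        raise ValueError("; ".join(problems))
    done = subprocess.run(
        [sys.executable, "-I", "-c", source],  # -I: isolated mode
        capture_output=True, text=True, timeout=timeout,
    )
    return done.stdout


bad = "import subprocess\nsubprocess.run('ls', shell=True)\neval('1+1')"
print(find_violations(bad))
print(run_isolated("print(2 + 2)"))
```

If the child exceeds the limit, `subprocess.run` kills it and raises `subprocess.TimeoutExpired`, so a runaway candidate cannot stall the parent process.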
Contributing
See CONTRIBUTING.md. PRs welcome.
License
Apache 2.0 — see LICENSE.