Self-improvement engine for AI agents — evolve any agent autonomously using the Darwin Gödel Machine algorithm.

darwinloop

Requires Python 3.11+. Licensed under Apache 2.0.

darwinloop — self-improvement engine for AI agents

Point darwinloop at any Python agent, define what "good" looks like with benchmark tasks, and darwinloop will autonomously improve the code — iteration by iteration — using an LLM, without you writing a single patch.

Based on the Darwin Gödel Machine (Zhang et al., ICLR 2026).


Why darwinloop?

  • Measurable gains. The football-score-agent router went from 51% → 80% accuracy in 5 iterations — 16 more teams recognised, pronoun follow-ups fixed, competition-vs-team routing corrected. Zero manual patches.
  • Fully auditable. Every change is recorded as a unified diff. Every generation is preserved in an immutable JSON archive. Roll back anytime.
  • Works on any Python agent. uAgents, LangChain, LangGraph, raw Python — if it's a .py file, darwinloop can evolve it.

Quickstart

pip install darwinloop

from darwinloop import DarwinLoop, BenchmarkTask

tasks = [
    BenchmarkTask(id="t1", name="live_scores",
                  input="live scores now", expected="live"),
    BenchmarkTask(id="t2", name="vs_competition",
                  input="Arsenal vs Chelsea result", expected="competition"),
]

dl = DarwinLoop(target="my_agent/router.py", tasks=tasks, model="asi1")
result = dl.run(iterations=5)

print(f"Score: {result.base_score:.2f} → {result.best_score:.2f} (+{result.score_delta:.2f})")
result.apply()            # write best version back to router.py
result.save_report()      # save darwinloop_report.md

Expected output:

darwinloop  — self-improvement engine for AI agents

Evaluating base agent (router.py)…
  ✓ agent_0000: score=0.51 (5/10 tasks passed)

── ITERATION 1/5 ──────────────────────────────────────────
  Selected parents: ['agent_0000']
  Proposal: Add pronoun (they/them/their) follow-up handling using ctx.last_team
    Evaluating… score=0.60
  agent_0000 → agent_0001  Score: 0.51 → 0.60 (+0.09)

[… 4 more iterations …]

Evolution complete!  Score: 0.51 → 0.80  (+0.29  best: agent_0004  gen 4)

How it works

Your agent code
      │
      ▼
┌─────────────┐
│  Benchmark  │  Run tasks in isolated sandbox → score (0.0–1.0)
└──────┬──────┘
       │ failures
       ▼
┌─────────────┐
│  Diagnose   │  LLM analyses code + failures → improvement proposal
└──────┬──────┘
       │ proposal
       ▼
┌─────────────┐
│   Improve   │  LLM uses editor tools (str_replace) to apply change
└──────┬──────┘
       │ new code
       ▼
┌─────────────┐
│  Re-score   │  Run benchmarks again on new code
└──────┬──────┘
       │ score > old?
      YES → keep it (add to archive)
       NO → discard it (archive still records it for open-ended exploration)
       │
       └── repeat N iterations

LLM Support

| Provider | Model | Set env var |
|---|---|---|
| ASI:One (default) | asi1 | ASI1_API_KEY |
| Anthropic | claude-3-5-sonnet-20241022 | ANTHROPIC_API_KEY |
| OpenAI | gpt-4o | OPENAI_API_KEY |
| Mock (free) | (use --dry-run) | none needed |

Get an ASI:One API key at asi1.ai — it's the Fetch.ai ecosystem LLM.


Benchmark Packs

Pre-built domain packs so you don't need to write tasks from scratch:

from darwinloop import DarwinLoop
from darwinloop.packs import RoutingPack, CommercePack, SupportPack

# Routing agent (intent classification)
dl = DarwinLoop(target="agent/router.py",
                pack=RoutingPack(intents=["live", "team", "competition", "fixtures"]))

# Commerce agent (product search, cart, checkout)
dl = DarwinLoop(target="agent/shop.py", pack=CommercePack())

# Customer support agent
dl = DarwinLoop(target="agent/support.py", pack=SupportPack())

CLI Reference

# Evolve a specific file
darwinloop evolve agent/router.py --iterations 5 --model asi1

# Dry run (free, no API key needed)
darwinloop evolve agent/ --dry-run --auto

# Use a built-in benchmark pack
darwinloop evolve agent/router.py --pack routing --iterations 5

# Load benchmarks from a file
darwinloop evolve agent/router.py --tasks benchmarks.py --iterations 10

# Auto-generate benchmarks from agent code
darwinloop scaffold agent/router.py --output benchmarks.py

# View a previous run report
darwinloop report darwinloop_output/

# Diff two generations
darwinloop diff darwinloop_output/ --from agent_0000 --to agent_0004

Real Example: Football Agent

The examples/football/ directory contains the real football-score-agent router and its benchmark tasks.

darwinloop evolve examples/football/football_router.py \
    --tasks examples/football/benchmarks.py \
    --iterations 5 --model asi1

DGM-discovered improvements in 5 iterations:

| # | Improvement | Score impact |
|---|---|---|
| 1 | Pronoun follow-up (they/their/them → last team) | +0.09 |
| 2 | +16 clubs (Juventus, Atletico, Napoli, Dortmund…) | +0.08 |
| 3 | Competition-signal priority (vs, result, score) | +0.07 |
| 4 | Fixture regex expansion (next game, upcoming game) | +0.05 |

Total: 0.51 → 0.80 (+0.29)
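As a quick consistency check, the four per-improvement gains listed above do add up to the reported total. The snippet uses `Decimal` so the sum is exact; the dictionary keys are shorthand labels chosen for this example.

```python
from decimal import Decimal

# Score impacts copied from the list of improvements above.
impacts = {
    "pronoun follow-up": Decimal("0.09"),
    "+16 clubs": Decimal("0.08"),
    "competition-signal priority": Decimal("0.07"),
    "fixture regex expansion": Decimal("0.05"),
}

base_score = Decimal("0.51")
total_gain = sum(impacts.values(), Decimal("0"))
final_score = base_score + total_gain

print(total_gain, final_score)  # 0.29 0.80
```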


Safety

darwinloop is designed so that every change is validated before it runs, isolated while it runs, and reversible afterwards.

| Guarantee | Implementation |
|---|---|
| AST validation before execution | sandbox/validator.py blocks eval, exec, shell=True |
| Subprocess isolation | All agent code runs in a child process, never in the darwinloop process |
| Hard timeouts | Sandbox default 30s, configurable via sandbox_timeout |
| No network in sandbox | Network imports trigger warnings; calls fail at runtime |
| Immutable archive | AgentEntry records are never modified after creation |
| Diff transparency | Every change recorded as unified diff |
| Revert anytime | All generations preserved; load archive and roll back |
| Dry run mode | MockLLMClient tests full pipeline at zero cost |
| Score regression protection | New code kept only if score strictly improves |
| Human checkpoints | When --auto is not set, prompts before each iteration |
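To give a feel for what AST validation before execution involves, here is a minimal sketch using Python's `ast` module. It is an illustration only, not the contents of sandbox/validator.py: `find_violations` and `BANNED_CALLS` are hypothetical names, and a production validator would need to cover far more cases.

```python
import ast

BANNED_CALLS = {"eval", "exec"}


def find_violations(source: str) -> list[str]:
    """Return human-readable descriptions of banned constructs found in `source`."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        # Direct calls such as eval(...) or exec(...)
        if isinstance(node.func, ast.Name) and node.func.id in BANNED_CALLS:
            violations.append(f"line {node.lineno}: call to {node.func.id}()")
        # Any call that passes the literal keyword argument shell=True
        for kw in node.keywords:
            if (
                kw.arg == "shell"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value is True
            ):
                violations.append(f"line {node.lineno}: shell=True")
    return violations


sample = (
    "import subprocess\n"
    "subprocess.run('ls', shell=True)\n"
    "print(eval('1 + 1'))\n"
)
for problem in find_violations(sample):
    print(problem)
```

A static check like this only sees literal names and literal `True`. Aliased or dynamically constructed calls slip past it, which is presumably why the list above pairs validation with subprocess isolation and timeouts.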

See SECURITY.md for full details.
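Subprocess isolation with a hard timeout can be approximated with `subprocess.run`. The sketch below is an assumption about the general technique, not darwinloop's sandbox: `run_isolated` is a hypothetical helper, its 30-second default only mirrors the documented sandbox default, and it does nothing to restrict network access.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 30.0) -> tuple[bool, str]:
    """Run `code` in a child Python process; return (finished_ok, output_or_reason)."""
    try:
        proc = subprocess.run(
            # -I starts Python in isolated mode (ignores PYTHON* env vars and user site-packages)
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout}s"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


print(run_isolated("print(2 + 2)"))
print(run_isolated("while True: pass", timeout=1))
```

A crash or infinite loop in the child cannot take down the parent process; the parent only ever sees an exit code, captured output, or a timeout.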


Contributing

See CONTRIBUTING.md. PRs welcome.


License

Apache 2.0 — see LICENSE.
