Agents compete, the best solution survives

Welcome to the Darwin Derby

This project started from Andrej Karpathy's autoresearch — a single AI agent in a loop, optimizing a GPT training script against validation bits-per-byte on one GPU. The agent would modify train.py, run training, check the score, and keep the change if it improved. A kind of "intelligent" evolutionary search.

Darwin Derby generalizes the Karpathy loop. The state is any set of files, and the scoring function is anything that outputs a number. As long as your score measures something you want to maximize or minimize, Darwin Derby can tune it across iterative or swarm experiments while you sleep.

In the Darwin Derby, you can optimize anything, even squishy things like "the quality of my essay", so long as you have an agent that can score it numerically against explicit criteria.

The name comes from a track on Vulfpeck's excellent and exuberant Hill Climber album.

[Figure: six progress charts, one per example run. Panels: GPT Training (val BPB); Rastrigin Function (10-D); Traveling Salesman (20 cities); Rectangle Packing (12 rects); Improve my Essay; Make Website Better.]

Agents propose changes (grey dots), the evaluator keeps only improvements (green dots), and the best score ratchets monotonically in one direction.

Install

Note: The Python package is called darwinderby. The CLI command it installs is derby.

From PyPI:

uv tool install "darwinderby[llm]"   # quoted so shells like zsh do not glob the brackets

From source:

git clone https://github.com/kousun12/darwin-derby
cd darwin-derby
uv tool install -e ".[llm]"

Quick start

Try an example problem in one command:

derby try fib              # built-in demo agent, generates progress chart
derby try rastrigin --claude  # use Claude as the agent

Or create your own problem:

# Create a new problem
derby init my-problem --direction minimize
cd my-problem

# Edit the scaffolded files
#   problem.yaml       — describe the problem
#   state/             — set up the initial mutable state (any files)
#   scoring/score.py   — implement your score() function

# Check everything is wired up
derby validate

# Run scoring once as a sanity check
derby score

# Run a local optimization loop — single machine, one agent
derby run -a "claude -p 'read agent_instructions.md and improve the solution'"

The run command handles everything: it runs your agent, scores the result, keeps improvements, updates the leaderboard, and loops. The scoring directory is hidden from the agent during execution.

How it works

flowchart TD
    agent["Agent modifies state/ files"] --> score["Run score.py"]
    score --> check{"Improved?"}
    check -- Yes --> keep["Keep changes + update leaderboard"]
    check -- No --> revert["Revert changes"]
    keep --> agent
    revert --> agent

derby run handles the entire loop: it invokes your agent, scores the result, keeps improvements, reverts failures, updates the leaderboard, and repeats. The scoring directory is hidden from the agent during execution, so agents never see the scoring code. The agent can be any command: a shell script, a Python program, a call to Claude.
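The keep-or-revert cycle can be sketched in a few lines of Python. This is an illustration of the idea, not derby's actual implementation: `run_loop`, its parameters, and the snapshot-by-copy strategy are all assumptions made for the example, and the sketch leaves out the leaderboard and the hiding of the scoring directory.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_loop(
    state_dir: Path,
    agent: Callable[[Path], None],
    score: Callable[[Path], float],
    iterations: int,
    direction: str = "minimize",
) -> float:
    """Hypothetical keep-if-improved loop: snapshot, run agent, score, keep or revert."""
    best = score(state_dir)  # baseline score of the untouched state
    for _ in range(iterations):
        with tempfile.TemporaryDirectory() as tmp:
            backup = Path(tmp) / "state"
            shutil.copytree(state_dir, backup)  # snapshot before the agent runs
            agent(state_dir)                    # agent edits files in place
            candidate = score(state_dir)
            if direction == "minimize":
                improved = candidate < best
            else:
                improved = candidate > best
            if improved:
                best = candidate                # keep the agent's changes
            else:
                shutil.rmtree(state_dir)        # revert to the snapshot
                shutil.copytree(backup, state_dir)
    return best
```

Here a tie counts as "not improved" and is reverted; whether derby treats ties the same way is not stated above.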

To scale to multiple agents working in parallel, you can use the git-based evaluator instead — agents push proposal branches and the evaluator scores and merges them.

Problem structure

A problem is a self-contained directory:

my-problem/
├── problem.yaml            # Problem definition + framework config
├── agent_instructions.md   # Protocol for agents (generated by init)
├── state/                  # Mutable files — agents can create, modify, or delete
│   └── ...                 # Any files; the scoring function decides how to read them
├── context/                # Read-only background for agents
├── scoring/                # GITIGNORED — private scoring code
│   └── score.py            # Implement score() → dict
├── leaderboard.md          # Auto-updated by the evaluator
└── .derby/                 # GITIGNORED — local evaluator state
    └── history.db          # SQLite evaluation history

flowchart LR
    subgraph visible ["Agents See"]
        A["problem.yaml"]
        B["state/"]
        C["context/"]
        D["leaderboard.md"]
    end
    subgraph hidden ["Agents Never See"]
        E["scoring/score.py"]
        F[".derby/history.db"]
    end
    B -->|"modified"| E
    E -->|"results"| D

The scoring/ directory is gitignored and hidden from agents during execution. Agents see the metric name and direction (from problem.yaml) and previous scores (from leaderboard.md), but never the scoring implementation.
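If you maintain a problem repo by hand, a quick check that both private paths are ignored can catch an accidental commit of the scoring code. The helper below is a hypothetical sketch, not part of derby: `missing_ignores` does a plain line comparison and does not implement full gitignore pattern semantics.

```python
from typing import Iterable, List

# Paths the README says must stay out of the repo.
REQUIRED = ("scoring/", ".derby/")


def missing_ignores(gitignore_text: str, required: Iterable[str] = REQUIRED) -> List[str]:
    """Return the required entries that do not appear as lines in a .gitignore."""
    entries = set()
    for line in gitignore_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):  # skip blanks and comments
            entries.add(line.strip("/"))       # treat "scoring" and "scoring/" alike
    return [r for r in required if r.strip("/") not in entries]
```

For example, `missing_ignores(Path(".gitignore").read_text())` (with `pathlib.Path` imported) returns an empty list when both entries are present.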

CLI reference

Command	Description
`derby try <problem>`	Try an example problem (demo agent, generates chart)
`derby try <problem> --claude`	Try an example with Claude as the agent
`derby init <name>`	Scaffold a new problem directory
`derby validate`	Check that the problem directory is well-formed
`derby score`	Run `scoring/score.py` once and print the result
`derby run -a "<cmd>"`	Run the local optimization loop with an agent command
`derby evaluate`	Start the polling evaluator (watches for proposal branches)
`derby evaluate --baseline-only`	Establish baseline score and exit
`derby serve`	Start the webhook server (receives PR events)
`derby history`	Print evaluation history from the DB
`derby leaderboard`	Regenerate `leaderboard.md` from history
`derby plot`	Generate a progress chart from evaluation history

All commands operate on the current directory by default (overridable with --dir).

Local loop

The default way to run. Single machine, one agent, fully automated:

derby run -a "./my_agent.sh"                           # run until stopped
derby run -a "python optimize.py" -n 50                # limit to 50 iterations
derby run -a "claude -p 'improve the solution'" -n 10  # use any command as the agent

Progress charts

derby plot                         # chart from .derby/history.db
derby plot --db path/to/history.db  # chart from a specific database
derby plot -o chart.png            # save to a specific path

Running agents

An agent is any command that reads the problem and modifies files in state/. Point it at the problem directory and let derby run handle the rest:

derby run -a "claude -p 'read agent_instructions.md and improve the solution'" -n 10
derby run -a "./my_agent.sh"
derby run -a "python optimize.py" -n 50

Agent environment variables

The framework sets these environment variables before each agent invocation:

Variable	Description	Example
`DERBY_ITERATION`	Current iteration number (1-indexed)	`3`
`DERBY_SCORE`	Current best score	`169.743`
`DERBY_DIRECTION`	Optimization direction	`minimize`
`DERBY_METRIC`	Name of the score metric	`score`
`DERBY_PROBLEM`	Problem name from `problem.yaml`	`rastrigin`
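A Python agent can read these variables with `os.environ`. The sketch below is one way to do it, with some assumptions: the `DerbyContext` class and `read_context` function are hypothetical names, and the fallback values used when a variable is missing are choices made for this example rather than documented framework behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class DerbyContext:
    iteration: int
    best_score: Optional[float]
    direction: str
    metric: str
    problem: str


def read_context(env: Mapping[str, str] = os.environ) -> DerbyContext:
    """Parse the DERBY_* variables into typed fields (fallbacks are assumptions)."""
    try:
        best: Optional[float] = float(env.get("DERBY_SCORE", ""))
    except ValueError:
        best = None  # no usable score available
    return DerbyContext(
        iteration=int(env.get("DERBY_ITERATION", "1")),
        best_score=best,
        direction=env.get("DERBY_DIRECTION", "minimize"),
        metric=env.get("DERBY_METRIC", "score"),
        problem=env.get("DERBY_PROBLEM", ""),
    )
```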

Writing a custom agent

An agent can be any command — a shell script, a Python script, a call to an AI tool. The agent runs in the problem directory, modifies files in state/, and exits. The framework handles scoring, keeping improvements, and looping.

A minimal shell script agent:

#!/bin/bash
# agent.sh — read the current score, tweak state/solution.py
echo "Iteration $DERBY_ITERATION, current best: $DERBY_SCORE"

python3 -c "
import random
# Read current state, make a random perturbation
exec(open('state/solution.py').read())
x = [v + random.gauss(0, 0.5) for v in x]
with open('state/solution.py', 'w') as f:
    f.write(f'x = {x}\n')
"

derby run -a "./agent.sh" -n 20
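The same random-perturbation agent can be written as a standalone Python script. This is a sketch under the same assumption as the shell version, namely that state/solution.py holds a single assignment `x = [...]`. The function names are hypothetical, and the file is parsed with `ast.literal_eval` instead of `exec` so the agent never executes state contents.

```python
import ast
import random
from pathlib import Path
from typing import List, Optional


def load_x(path: Path) -> List[float]:
    """Read the list assigned to `x` in a Python file without executing it."""
    tree = ast.parse(path.read_text())
    for node in tree.body:
        if isinstance(node, ast.Assign) and any(
            isinstance(t, ast.Name) and t.id == "x" for t in node.targets
        ):
            return [float(v) for v in ast.literal_eval(node.value)]
    raise ValueError(f"no assignment to x found in {path}")


def perturb(path: Path, sigma: float = 0.5, rng: Optional[random.Random] = None) -> None:
    """Add Gaussian noise to every coordinate and write the file back."""
    if rng is None:
        rng = random.Random()
    x = [v + rng.gauss(0.0, sigma) for v in load_x(path)]
    path.write_text(f"x = {x}\n")


if __name__ == "__main__":
    target = Path("state/solution.py")
    if target.exists():  # do nothing when run outside a problem directory
        perturb(target)
```

Saved as, say, agent.py in the problem directory, it would be run the same way as any other agent command: derby run -a "python agent.py" -n 20.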

For AI-powered agents, the command can be anything that reads the problem and modifies state:

derby run -a "claude -p 'read agent_instructions.md and improve the solution'" -n 10

Example problems

The examples/ directory contains five reference problems showing the structure:

Problem	Description	Starting → Optimum	Requirements
`rastrigin`	Minimize 10-D Rastrigin function	~169.7 → 0.0	None
`tsp`	Shortest tour of 20 cities	~1914 → ~680	None
`packing`	Pack 12 rectangles into smallest box	13250 → ~6975	None
`fib`	Optimize Fibonacci for speed	~1.0s → ~0.000001s	None
`gpt`	Optimize GPT training (val_bpb)	~1.15 → ?	NVIDIA GPU

The first four score instantly or near-instantly and need no GPU.

For runnable problems with evaluator support and simulated test runs, see derby-examples. See examples/README.md for details on each problem's structure.

Scaling and Swarming with git

For running many agents in parallel — a swarm — you can use the git-based evaluator. Agents clone the repo, push proposal branches (proposals/<name>/<description>) or open PRs, and the evaluator scores and merges them serially.

Polling — watches for proposal branches:

derby evaluate --baseline-only   # establish baseline
derby evaluate                   # start evaluation loop
derby evaluate --push            # push leaderboard updates to origin

Webhook — receives GitHub PR events via HTTP:

derby evaluate --baseline-only   # establish baseline first
derby serve --push               # start webhook server

# Configure the GitHub webhook:
#   URL: https://<your-domain>/webhook
#   Content type: application/json
#   Secret: (set matching WEBHOOK_SECRET env var on the server)
#   Events: Pull requests only

Evaluation is serial — one proposal at a time, so the comparison is always clean. Proposal generation is massively parallel: hundreds of agents can push branches simultaneously, and the evaluator processes them one by one. Anything that can git push can be an agent — no SDK, no registration, no custom API.
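To make the serial model concrete, here is a small sketch. Both functions are hypothetical helpers, not derby APIs: `proposal_branch` builds a name in the proposals/<name>/<description> form (the slug rules are an assumption), and `evaluate_serially` assumes each proposal's score is already known and simply walks the queue in arrival order.

```python
import re
from typing import Iterable, List, Tuple


def proposal_branch(agent_name: str, description: str) -> str:
    """Build a branch name of the form proposals/<name>/<description>."""
    def slug(text: str) -> str:
        # Lowercase and collapse anything that is not a letter or digit to "-".
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return f"proposals/{slug(agent_name)}/{slug(description)}"


def evaluate_serially(
    proposals: Iterable[Tuple[str, float]],
    baseline: float,
    direction: str = "minimize",
) -> Tuple[float, List[str]]:
    """Compare proposals one at a time against the best score at that moment."""
    best = baseline
    merged: List[str] = []
    for branch, candidate in proposals:
        improved = candidate < best if direction == "minimize" else candidate > best
        if improved:
            best = candidate       # later proposals must now beat this score
            merged.append(branch)
    return best, merged
```

Because only one proposal is compared at a time, every merge decision is made against a single, unambiguous current best, however many agents pushed in parallel.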

Creating your own problem

The fastest way:

derby init my-problem --direction minimize
cd my-problem

This scaffolds the full directory structure, initializes a git repo, and sets up .gitignore to exclude scoring/ and .derby/. Then:

1. Edit problem.yaml: describe the problem.
2. Edit files in state/: set up the initial mutable state. You can rename, add, or remove files here; the scoring function decides what to read.
3. Edit scoring/score.py: implement your score() function. It must return a dict with at least the primary metric key (default: "score").
4. Run derby validate: check everything is wired up.
5. Run derby score: run scoring once as a sanity check.
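As an illustration of the score() contract described above, here is what a scoring/score.py for a Rastrigin-style problem might look like. Treat it as a sketch: the exact signature the framework expects is not stated here, so the optional `state_dir` parameter is an assumption, as is the layout of state/solution.py (a single line `x = [...]`).

```python
import ast
import math
from pathlib import Path
from typing import Dict, List


def rastrigin(x: List[float]) -> float:
    """Standard Rastrigin function; its global minimum is 0.0 at the origin."""
    return 10.0 * len(x) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) for v in x)


def score(state_dir: Path = Path("state")) -> Dict[str, float]:
    """Read the candidate from state/ and return the primary metric in a dict."""
    text = (state_dir / "solution.py").read_text()
    # Assumes the file is exactly one assignment of the form `x = [...]`.
    x = ast.literal_eval(text.split("=", 1)[1].strip())
    return {"score": rastrigin([float(v) for v in x])}
```

The returned dict can carry extra keys for diagnostics, as long as the primary metric key is present.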

A minimal problem.yaml:

name: my-problem
description: Minimize the cost function.
score:
  direction: minimize

For a full walkthrough with a complete runnable example, see docs/create-problem.md. For guidance on writing scoring functions (including LLM-as-judge patterns), see docs/scoring.md.

Design principles

Minimum time to optimization

The hardest part of any optimization problem isn't the search — it's defining what "better" means. Darwin Derby is designed so the time between "I have a problem" and "agents are working on it" is as short as possible. init scaffolds the structure. You fill in three things: what the problem is, what the starting state looks like, and how to score it. Then run handles everything else — scoring, keeping improvements, updating the leaderboard, looping. No infrastructure to set up, no agents to configure, no evaluation pipeline to build.

The goal is that your time goes to the only part that requires human judgment: thinking carefully about the scoring function and what values it encodes. Once that's right, the system runs without oversight. Agents propose, the evaluator decides, and the score ratchets forward.

Blind scoring

Agents never see the scoring code. This is the single most important design decision.

If an optimizer can see the evaluation function, it will overfit to it — exploiting quirks in the metric, hardcoding known-good outputs, gaming the test set. This is the same reason you don't let students write the exam.

The separation is structural, not conventional. The scoring code is never committed to the problem repo. It exists only on the evaluation machine. Agents know what metric they're optimizing and what scores others have achieved, but they have zero information about how the score is computed. They modify state, and a number comes back.

Only forward, only better

When a proposal doesn't improve the score, it's discarded forever. No second chances, no combining near-misses. The best score only moves forward — a ratchet that clicks in one direction.

This works because the search space is effectively unbounded, so spending an evaluation on a previously failed proposal is usually worse than spending it on a new idea. And agents can see the leaderboard: if an idea was close, an agent can read about it and try a refined version.
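The ratchet is easy to state in code. Given the sequence of scores in evaluation order, the best-so-far curve is a running minimum (or maximum), and a proposal is kept only when it moves that curve. The `ratchet` function below is a hypothetical sketch; it treats the first score as the baseline and counts ties as "not improved", which is an assumption.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def ratchet(
    scores: Sequence[float], direction: str = "minimize"
) -> Tuple[List[float], List[bool]]:
    """Return (best-so-far curve, kept flags) for scores in evaluation order."""
    if not scores:
        return [], []
    pick = min if direction == "minimize" else max
    best_so_far = list(accumulate(scores, pick))  # running min or max
    # The baseline is always kept; afterwards keep only strict improvements.
    kept = [True] + [b != prev for prev, b in zip(best_so_far, best_so_far[1:])]
    return best_so_far, kept
```

For the scores 5, 7, 4, 4, 6, 3 under minimization, the curve is 5, 5, 4, 4, 4, 3 and only the first, third, and last entries are kept.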

What you could optimize

Anything with a scoring function:

A prompt template (scored by LLM-as-judge accuracy)
A web app's Lighthouse performance score
A compiler optimization pass (scored by benchmark runtime)
A trading strategy (scored by backtested Sharpe ratio)
A game AI (scored by win rate against a baseline)
An ML training script (scored by validation loss)

But the more interesting frontier is things that don't have a natural number yet. Now that LLMs can act as judges, you can define a rubric across multiple dimensions — clarity, originality, tone, argument strength — have an LLM score each one, apply hidden weights, and collapse it into a single number. The agents never see the rubric or the weights. They just modify state and get back a score.

This means you can optimize subjective artifacts the same way:

An essay (scored across argument structure, evidence quality, readability, originality)
A short story (scored on narrative tension, character voice, prose style)
A product landing page (scored on persuasiveness, clarity, emotional resonance)
An API design (scored on consistency, discoverability, naming conventions)

The weights encode values the agents can't see. Weight originality at 3x and the swarm converges on bold writing. Change the weights and the same agents produce something entirely different — without changing any agent instructions. The values live in the scoring function, not in the agents.
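A weighted rubric collapses to one number with a weighted average. The sketch below assumes the per-dimension ratings have already been produced (for instance by an LLM judge, which is not shown); `weighted_score` and the dimension names are illustrative, not part of derby.

```python
from typing import Dict


def weighted_score(ratings: Dict[str, float], weights: Dict[str, float]) -> float:
    """Collapse per-dimension ratings into a single weighted average."""
    missing = set(weights) - set(ratings)
    if missing:
        raise KeyError(f"ratings missing for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[k] * ratings[k] for k in weights) / total_weight


# Two hypothetical essays rated on the same rubric.
polished = {"clarity": 9.0, "originality": 3.0, "tone": 9.0}
bold = {"clarity": 6.0, "originality": 8.0, "tone": 6.0}

equal = {"clarity": 1.0, "originality": 1.0, "tone": 1.0}
original_3x = {"clarity": 1.0, "originality": 3.0, "tone": 1.0}
```

With equal weights the polished essay scores higher (7.0 against roughly 6.67); with originality weighted at 3x the ordering flips (5.4 against 7.2). The agents and their instructions are unchanged, only the hidden weights differ.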

The Goodhart warning

"When a measure becomes a target, it ceases to be a good measure."

The quality of the scoring function is the ceiling on the quality of the results. A bad metric optimized ruthlessly produces paperclips — a system that scores well but misses the point. Whatever number you pick, agents will exploit every degree of freedom it leaves open.

This is a feature, not a bug. It forces you to think hard about what "better" means before you start. And if your metric is good, relentless optimization is exactly what you want.

Docs

Document	Description
Getting started	Install, try a demo, create your first problem
Create a problem	Step-by-step walkthrough with a runnable example
Scoring	Writing scoring functions, LLM-as-judge patterns
Agent protocol	How agents participate in a problem
Design	Philosophy and principles behind the framework

License

MIT

Project details

Version 0.1.0, released Mar 16, 2026.
