autoresearch for everything — autonomous iterative improvement for any system

autoloop 🔄

autoresearch for everything.

Karpathy's autoresearch showed us the loop: point an AI agent at a problem, give it a metric, let it run 100 experiments overnight. Wake up to a better system.

That loop was hardcoded to ML training. autoloop generalizes it to any domain.

from autoloop import AutoLoop

loop = AutoLoop(
    target="optimize.py",        # what the agent edits
    metric=my_eval_function,     # returns a float
    directives="program.md",     # research goals in plain English
    budget_seconds=300,          # per experiment (default: 5 min)
)

loop.run(experiments=100)        # go to sleep
# wake up to a git log of 100 experiments and a better system

It Works — Here's a Real Test Run

We ran autoloop on a naive recursive fibonacci function, giving it 4 experiments to find a faster implementation. No human involved after the initial setup:

📊 Baseline score: -0.1717s  (naive recursion, fibonacci(30))

🔬 Experiment 1/4
✅ KEPT     | Score: -0.0249 (+0.1467) | Add memoization with dict cache

🔬 Experiment 2/4
❌ DISCARDED | Score: -0.0280 (-0.0030) | Switch to iterative approach

🔬 Experiment 3/4
❌ DISCARDED | Score: -999.000 (-998.97) | Wrong shortcut — should be discarded

🔬 Experiment 4/4
✅ KEPT     | Score: -0.0217 (+0.0032) | Use functools.lru_cache decorator

🏁 Run complete: 4 experiments | 2 improvements | Best: -0.0217s

7.9x speedup from baseline (0.1717s down to 0.0217s). Scores are negated runtimes in seconds, so higher is better. The broken code in experiment 3 was automatically detected and discarded by the correctness check in the metric. The loop kept every genuine improvement and rejected everything else.
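A correctness check of the kind mentioned above could look roughly like the sketch below. It is an assumption about how such a metric might be written, not the code used in this run: the function name `fibonacci`, the helper names `load_fibonacci` and `fib_metric`, and the spot-check table are illustrative, and the `-999.0` sentinel simply mirrors the score logged for experiment 3.

```python
import importlib.util
import time

FAILURE_SCORE = -999.0  # sentinel for broken candidates (assumed, mirrors the log)
EXPECTED = {0: 0, 1: 1, 2: 1, 10: 55, 20: 6765}  # known Fibonacci values


def load_fibonacci(target_path: str):
    """Import the target file fresh and return its `fibonacci` function (assumed name)."""
    spec = importlib.util.spec_from_file_location("candidate", target_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.fibonacci


def fib_metric(target_path: str, n: int = 30) -> float:
    """Return negated runtime in seconds, or FAILURE_SCORE if the candidate is wrong."""
    try:
        fib = load_fibonacci(target_path)
        # Correctness gate: a fast but wrong implementation must never win.
        if any(fib(k) != v for k, v in EXPECTED.items()):
            return FAILURE_SCORE
        start = time.perf_counter()
        fib(n)
        elapsed = time.perf_counter() - start
    except Exception:
        return FAILURE_SCORE  # crashes and import errors count as failures
    return -elapsed
```

Negating the runtime keeps the metric in "higher is better" form, and the sentinel is far below any real timing, so a wrong answer can never be kept by accident.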

Why This Exists

autoresearch works because of three design decisions:

Single file to modify — keeps scope manageable, diffs reviewable
Fixed time/compute budget — makes experiments directly comparable
One unambiguous metric — enables full autonomy, no human judgment needed

These decisions aren't specific to ML training. They apply to any system you want to improve autonomously. autoloop is just the abstraction.
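One way to enforce a fixed budget is to run each evaluation in a child process and stop it when time runs out. The sketch below illustrates that idea under stated assumptions and is not autoloop's implementation: `run_with_budget` and `FAILURE_SCORE` are hypothetical names, and the evaluation command is assumed to print a single float on its last line of output.

```python
import subprocess
import sys

FAILURE_SCORE = float("-inf")  # hypothetical score for timed-out or crashed runs


def run_with_budget(cmd: list[str], budget_seconds: float) -> float:
    """Run an evaluation command and parse a float from the last line of stdout.

    A run that exceeds the budget, exits non-zero, or prints something
    unparseable gets FAILURE_SCORE, so every experiment is judged on equal terms.
    """
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=budget_seconds, check=True
        )
        return float(done.stdout.strip().splitlines()[-1])
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError,
            ValueError, IndexError):
        return FAILURE_SCORE


# Usage: a fast evaluation returns its score, a slow one is cut off.
fast = run_with_budget([sys.executable, "-c", "print(0.93)"], budget_seconds=10)
slow = run_with_budget(
    [sys.executable, "-c", "import time; time.sleep(5)"], budget_seconds=0.5
)
```

Because the child process is killed at the deadline, a candidate cannot win by taking more time than the others.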

What You Can Optimize

Domain	Target file	Metric
Prompt optimization	`prompt.md`	LLM-as-judge score / task accuracy
SQL queries	`query.sql`	Execution time / rows returned
Trading strategies	`strategy.py`	Sharpe ratio / win rate
API pipelines	`pipeline.py`	Latency / success rate
Test suites	`tests.py`	Coverage / mutation score
Compiler flags	`build.sh`	Binary size / compile time
Agent system prompts	`system_prompt.md`	Task completion rate
RAG pipelines	`retrieval.py`	RAGAS score / hit rate
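As an illustration of the SQL row, a metric for `query.sql` might time the query against a fixed dataset. This sketch uses Python's built-in `sqlite3` with an in-memory table; the `orders` schema, the `sql_metric` name, and the choice to return the negated best-of-N runtime are all assumptions made for the example.

```python
import sqlite3
import time
from pathlib import Path


def sql_metric(target_path: str, repeats: int = 5) -> float:
    """Return the negated best-of-N runtime (seconds) of the query in target_path."""
    query = Path(target_path).read_text()
    conn = sqlite3.connect(":memory:")
    try:
        # Fixed, deterministic dataset so every experiment sees the same input.
        conn.execute(
            "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer INTEGER, total REAL)"
        )
        conn.executemany(
            "INSERT INTO orders (customer, total) VALUES (?, ?)",
            [(i % 100, float(i)) for i in range(10_000)],
        )
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            conn.execute(query).fetchall()
            timings.append(time.perf_counter() - start)
    finally:
        conn.close()
    return -min(timings)  # closer to zero is better
```

A production metric would also compare the returned rows against a reference result, so that a faster query returning the wrong rows is rejected.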

Install

pip install autoloop-ai

Requires Python 3.10+. Works with any LLM agent backend (Claude Code, Codex, local models via Ollama).

Quickstart

1. Define your target

The file your agent will edit. Start small — one function, one prompt, one query.

# optimize.py — your agent edits this
SYSTEM_PROMPT = """You are a helpful assistant."""

2. Define your metric

A Python function that returns a float. Either direction can count as better: set higher_is_better (see step 4) to tell autoloop which.

def my_metric(target_path: str) -> float:
    """Run eval and return score. autoloop calls this after every experiment."""
    # run_eval is a placeholder for your own evaluation harness
    result = run_eval(target_path)
    return result.accuracy  # higher is better

3. Write your directives

Plain English research goals in program.md. This is what you iterate on over time.

# Research Directives

## Goal
Improve the system prompt to increase task completion rate on customer support queries.

## Hypotheses to explore
- More specific role definition
- Explicit handling of edge cases
- Chain-of-thought instructions
- Tone adjustments for different query types

## Constraints
- Keep under 500 tokens
- Must pass safety checks

4. Run

from autoloop import AutoLoop

loop = AutoLoop(
    target="optimize.py",
    metric=my_metric,
    directives="program.md",
    budget_seconds=300,
    agent="claude",           # "claude", "codex", "ollama"
    higher_is_better=True,
)

loop.run(experiments=100)

5. Review

autoloop history          # git log of all experiments
autoloop best             # show the best-performing version
autoloop diff 12 best     # compare experiment 12 to best
autoloop rollback 12      # restore experiment 12

How It Works

┌─────────────────────────────────────────────────────────┐
│                      autoloop                           │
│                                                         │
│  Read directives (program.md)                           │
│         │                                               │
│         ▼                                               │
│  Agent proposes modification to target file             │
│         │                                               │
│         ▼                                               │
│  Apply modification                                     │
│         │                                               │
│         ▼                                               │
│  Run metric() with fixed budget                         │
│         │                                               │
│         ▼                                               │
│  Score improved? ──YES──▶ git commit + update best      │
│         │                                               │
│        NO                                               │
│         │                                               │
│         ▼                                               │
│  Discard + log                                          │
│         │                                               │
│         ▼                                               │
│  Repeat N times                                         │
└─────────────────────────────────────────────────────────┘

Each experiment is logged with: timestamp, modification description, score delta, and the full diff. The git history is your research log.
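The keep-or-discard flow in the diagram can be sketched in a few lines. This is a simplified illustration, not autoloop's source: `propose` stands in for the agent call, `run_loop` and `Record` are hypothetical names, and the git commit is replaced by holding the best text in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class Record:
    """One log entry per experiment: timestamp, description, score, delta, outcome."""
    timestamp: str
    description: str
    score: float
    delta: float
    kept: bool


def run_loop(
    initial: str,
    propose: Callable[[str], tuple[str, str]],  # best text -> (new text, description)
    metric: Callable[[str], float],
    experiments: int,
    higher_is_better: bool = True,
) -> tuple[str, float, list[Record]]:
    best, best_score = initial, metric(initial)  # baseline
    log: list[Record] = []
    for _ in range(experiments):
        candidate, description = propose(best)
        score = metric(candidate)
        delta = score - best_score
        improved = delta > 0 if higher_is_better else delta < 0
        log.append(Record(datetime.now(timezone.utc).isoformat(),
                          description, score, delta, improved))
        if improved:  # keep: a real loop would git commit here
            best, best_score = candidate, score
        # otherwise discard: the next proposal starts from the unchanged best
    return best, best_score, log
```

Every proposal starts from the current best, so a discarded experiment never contaminates later ones.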

Advanced Usage

Parallel experiments

loop.run(experiments=100, parallel=4)  # 4 agents running simultaneously
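How parallel results are merged is not described here, so the following is only a sketch of one plausible scheme: score a batch of candidates concurrently and keep the best of the batch. `best_of_batch` is a hypothetical helper, and the thread pool is an assumption that suits I/O-bound metrics such as LLM calls or subprocess runs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def best_of_batch(
    candidates: list[str],
    metric: Callable[[str], float],
    workers: int = 4,
) -> tuple[str, float]:
    """Score all candidates concurrently and return the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(metric, candidates))  # order matches candidates
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]
```

If the metric is a wall-clock timing, concurrent runs compete for the same CPU and can distort each other's scores, so parallelism fits best when the metric measures quality instead of speed.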

Custom agent backends

from autoloop.backends import OllamaBackend

loop = AutoLoop(
    target="prompt.md",
    metric=my_metric,
    directives="program.md",
    backend=OllamaBackend(model="llama3.1:70b"),
)

Warm starts

# Resume from a previous run's best result
loop.run(experiments=50, warm_start="./autoloop-results/best.py")

Metric composition

from autoloop import CompositeMetric

metric = CompositeMetric([
    (accuracy_metric, 0.7),   # 70% weight
    (latency_metric, 0.3),    # 30% weight
])
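A weighted composite is only meaningful when every component points the same way and sits on a comparable scale; raw accuracy (higher is better) and raw latency (lower is better) do not. The sketch below shows one way to handle that by hand. It is not the `CompositeMetric` implementation, and `weighted_metric`, `accuracy_metric`, and `latency_metric` here are stand-ins defined for the example.

```python
from typing import Callable

Metric = Callable[[str], float]


def weighted_metric(parts: list[tuple[Metric, float]]) -> Metric:
    """Combine metrics into one weighted sum; weights are normalized to sum to 1."""
    total = sum(weight for _, weight in parts)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    def combined(target_path: str) -> float:
        return sum(metric(target_path) * weight / total for metric, weight in parts)

    return combined


# Stand-in components, both mapped to "higher is better" on a 0..1 scale.
def accuracy_metric(target_path: str) -> float:
    return 0.80  # pretend 80% task accuracy


def latency_metric(target_path: str) -> float:
    latency_s, worst_s = 0.5, 2.0  # pretend measurement and worst acceptable latency
    return max(0.0, 1.0 - latency_s / worst_s)  # 0.5 s maps to 0.75


metric = weighted_metric([(accuracy_metric, 0.7), (latency_metric, 0.3)])
score = metric("optimize.py")  # 0.7 * 0.80 + 0.3 * 0.75
```

Mapping latency onto a 0..1 "higher is better" scale before weighting keeps the 70/30 split honest; without it, whichever component has the larger numeric range dominates the sum.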

Examples

examples/prompt_optimization/ — optimize a Claude system prompt for customer support
examples/sql_optimization/ — optimize a slow SQL query
examples/trading_strategy/ — evolve a trading strategy (inspired by AutoStrategy)
examples/rag_pipeline/ — optimize a RAG retrieval pipeline

Comparison to autoresearch

	autoresearch	autoloop
Domain	ML training only	Any
Target	`train.py`	Any file
Metric	`val_bpb`	Any Python function
Budget	5-min wall clock	Configurable
Agent	Claude Code / Codex	Any
Parallel	No	Yes

autoloop is autoresearch with the ML-specific parts removed and replaced with a general interface.

Philosophy

The insight from autoresearch isn't about ML. It's about loop design:

Unambiguous feedback — the metric must be objective and quantitative
Fixed budget — experiments must be comparable
Narrow scope — one file, reviewable diffs
Overnight scale — 100 experiments while you sleep

Wherever you can satisfy these four conditions, you can run autonomous improvement. autoloop makes that loop accessible without writing the scaffolding yourself.

Roadmap

Web UI for experiment visualization
Multi-file optimization with dependency tracking
MCP server (use autoloop as a tool inside Claude Code)
Hosted experiment tracking (autoloop cloud)
Pre-built metric libraries (RAGAS, finance, code quality)

Contributing

PRs welcome. See CONTRIBUTING.md.

License

MIT

Inspired by karpathy/autoresearch. autoloop generalizes the loop.

Release: autoloop-ai 0.1.0 (Apr 1, 2026)

