
autoresearch for everything: autonomous iterative improvement for any system


autoloop 🔄

autoresearch for everything.

Karpathy's autoresearch showed us the loop: point an AI agent at a problem, give it a metric, let it run 100 experiments overnight. Wake up to a better system.

That loop was hardcoded to ML training. autoloop generalizes it to any domain.

from autoloop import AutoLoop

loop = AutoLoop(
    target="optimize.py",        # what the agent edits
    metric=my_eval_function,     # returns a float
    directives="program.md",     # research goals in plain English
    budget_seconds=300,          # per experiment (default: 5 min)
)

loop.run(experiments=100)        # go to sleep
# wake up to a git log of 100 experiments and a better system

It Works: Here's a Real Test Run

We ran autoloop on a naive recursive Fibonacci function and gave it 4 experiments to find a faster implementation. No human was involved after the initial setup:

📊 Baseline score: -0.1717s  (naive recursion, fibonacci(30))

🔬 Experiment 1/4
✅ KEPT     | Score: -0.0249 (+0.1467) | Add memoization with dict cache

🔬 Experiment 2/4
❌ DISCARDED | Score: -0.0280 (-0.0030) | Switch to iterative approach

🔬 Experiment 3/4
❌ DISCARDED | Score: -999.000 (-998.97) | Wrong shortcut - should be discarded

🔬 Experiment 4/4
✅ KEPT     | Score: -0.0217 (+0.0032) | Use functools.lru_cache decorator

🏁 Run complete: 4 experiments | 2 improvements | Best: -0.0217s

7.9x speedup from baseline (0.1717s down to 0.0217s; the first kept experiment alone gave 6.9x). Broken code (exp 3) was automatically detected and discarded by the correctness check in the metric. The loop kept every genuine improvement and rejected everything else.
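A metric with a built-in correctness check, like the one this run relied on, might look roughly like the sketch below. It is not the metric used in the test run: the names `fib_metric`, `FAILURE_SCORE`, and `EXPECTED_FIB_30` are hypothetical, and it assumes the target file defines a function called `fibonacci`. The score is the negative runtime, so higher is better, and any wrong answer or crash gets the -999 sentinel seen in the log.

```python
import importlib.util
import time

FAILURE_SCORE = -999.0     # hypothetical sentinel for broken candidates
EXPECTED_FIB_30 = 832040   # known value of fibonacci(30), with fibonacci(0) == 0


def fib_metric(target_path: str) -> float:
    """Return the negative runtime in seconds, or FAILURE_SCORE if the candidate is wrong."""
    try:
        # Load the candidate file as a fresh module on every call.
        spec = importlib.util.spec_from_file_location("candidate", target_path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

        start = time.perf_counter()
        result = module.fibonacci(30)
        elapsed = time.perf_counter() - start
    except Exception:
        # Syntax errors, missing functions, and runtime crashes all count as failures.
        return FAILURE_SCORE

    if result != EXPECTED_FIB_30:
        return FAILURE_SCORE  # fast but wrong is still wrong
    return -elapsed
```

Checking correctness inside the metric means the loop itself never needs to know what "correct" means for a given domain; it only compares floats.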

Why This Exists

autoresearch works because of three design decisions:

  1. Single file to modify: keeps scope manageable, diffs reviewable
  2. Fixed time/compute budget: makes experiments directly comparable
  3. One unambiguous metric: enables full autonomy, no human judgment needed

These decisions aren't specific to ML training. They apply to any system you want to improve autonomously. autoloop is just the abstraction.
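As an illustration of the second decision, a fixed wall-clock budget could be enforced by running each evaluation in a subprocess with a timeout. This is a minimal sketch, not a description of how autoloop enforces `budget_seconds`: `run_with_budget` and `FAILURE_SCORE` are hypothetical names, and it assumes the evaluation script prints its score as a float on the last line of stdout.

```python
import subprocess
import sys

FAILURE_SCORE = float("-inf")  # hypothetical: worst possible score when higher is better


def run_with_budget(script_path: str, budget_seconds: float) -> float:
    """Run an evaluation script under a wall-clock budget and return its printed score."""
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            text=True,
            timeout=budget_seconds,  # the child process is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return FAILURE_SCORE  # over budget: treat as a failed experiment

    if proc.returncode != 0:
        return FAILURE_SCORE  # the script crashed

    try:
        # Assumption: the score is the last line the script prints.
        return float(proc.stdout.strip().splitlines()[-1])
    except (IndexError, ValueError):
        return FAILURE_SCORE  # no output, or output that is not a number
```

Because every experiment gets the same ceiling, a candidate that only wins by running longer cannot score better than one that finishes in time.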

What You Can Optimize

| Domain | Target file | Metric |
|---|---|---|
| Prompt optimization | prompt.md | LLM-as-judge score / task accuracy |
| SQL queries | query.sql | Execution time / rows returned |
| Trading strategies | strategy.py | Sharpe ratio / win rate |
| API pipelines | pipeline.py | Latency / success rate |
| Test suites | tests.py | Coverage / mutation score |
| Compiler flags | build.sh | Binary size / compile time |
| Agent system prompts | system_prompt.md | Task completion rate |
| RAG pipelines | retrieval.py | RAGAS score / hit rate |

Install

pip install autoloop

Requires Python 3.10+. Works with any LLM agent backend (Claude Code, Codex, local models via Ollama).

Quickstart

1. Define your target

The file your agent will edit. Start small: one function, one prompt, one query.

# optimize.py: your agent edits this
SYSTEM_PROMPT = """You are a helpful assistant."""

2. Define your metric

A Python function that returns a float. Either direction can count as better: set higher_is_better to tell autoloop whether to maximize or minimize the score.

def my_metric(target_path: str) -> float:
    """Run eval and return score. autoloop calls this after every experiment."""
    # run_eval is a placeholder for your own evaluation harness
    result = run_eval(target_path)
    return result.accuracy  # higher is better

3. Write your directives

Plain English research goals in program.md. This is what you iterate on over time.

# Research Directives

## Goal
Improve the system prompt to increase task completion rate on customer support queries.

## Hypotheses to explore
- More specific role definition
- Explicit handling of edge cases
- Chain-of-thought instructions
- Tone adjustments for different query types

## Constraints
- Keep under 500 tokens
- Must pass safety checks

4. Run

from autoloop import AutoLoop

loop = AutoLoop(
    target="optimize.py",
    metric=my_metric,
    directives="program.md",
    budget_seconds=300,
    agent="claude",           # "claude", "codex", "ollama"
    higher_is_better=True,
)

loop.run(experiments=100)

5. Review

autoloop history          # git log of all experiments
autoloop best             # show the best-performing version
autoloop diff 12 best     # compare experiment 12 to best
autoloop rollback 12      # restore experiment 12

How It Works

┌─────────────────────────────────────────────────────────┐
│                      autoloop                           │
│                                                         │
│  Read directives (program.md)                           │
│         │                                               │
│         ▼                                               │
│  Agent proposes modification to target file             │
│         │                                               │
│         ▼                                               │
│  Apply modification                                     │
│         │                                               │
│         ▼                                               │
│  Run metric() with fixed budget                         │
│         │                                               │
│         ▼                                               │
│  Score improved? ──YES──▶ git commit + update best      │
│         │                                               │
│        NO                                               │
│         │                                               │
│         ▼                                               │
│  Discard + log                                          │
│         │                                               │
│         ▼                                               │
│  Repeat N times                                         │
└─────────────────────────────────────────────────────────┘

Each experiment is logged with: timestamp, modification description, score delta, and the full diff. The git history is your research log.
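The keep-or-discard cycle described above can be sketched in a few lines. This is an illustration under stated assumptions, not autoloop's implementation: `run_loop` and `propose` are hypothetical names, `propose` stands in for the LLM agent, and restoring the previous file contents replaces the git commit and discard steps.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def run_loop(
    target: str,
    metric: Callable[[str], float],
    propose: Callable[[str], str],  # stand-in for the agent: best text in, candidate text out
    experiments: int,
    higher_is_better: bool = True,
) -> Tuple[float, List[Dict]]:
    """Propose, apply, score, then keep the change only if the score improved."""
    path = Path(target)
    best_text = path.read_text()
    best_score = metric(target)  # baseline
    log: List[Dict] = []

    for number in range(1, experiments + 1):
        candidate = propose(best_text)
        path.write_text(candidate)  # apply the modification
        score = metric(target)
        delta = score - best_score

        improved = score > best_score if higher_is_better else score < best_score
        if improved:
            best_text, best_score = candidate, score  # keep (a real run would commit here)
        else:
            path.write_text(best_text)  # discard: restore the best known version

        log.append({"experiment": number, "score": score, "delta": delta, "kept": improved})

    return best_score, log
```

Each proposal starts from the best version so far, so a discarded experiment never contaminates later ones.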

Advanced Usage

Parallel experiments

loop.run(experiments=100, parallel=4)  # 4 agents running simultaneously
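One way to picture parallel experiments is to score several candidates at once, each in its own copy of the target so they cannot overwrite each other. The sketch below is an assumption about the general technique, not a description of what `parallel=4` does internally; `evaluate_batch` is a hypothetical helper.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Sequence


def evaluate_batch(
    candidates: Sequence[str],
    metric: Callable[[str], float],
    workers: int = 4,
    suffix: str = ".py",
) -> List[float]:
    """Score candidate file contents concurrently, one isolated temp file per candidate."""

    def score_one(text: str) -> float:
        # Each candidate lives in its own temporary directory.
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / f"candidate{suffix}"
            path.write_text(text)
            return metric(str(path))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(score_one, candidates))
```

Threads suit metrics that wait on I/O or subprocesses; a CPU-bound metric written in pure Python would need processes instead, and timing-based metrics can be distorted when experiments compete for the same cores.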

Custom agent backends

from autoloop.backends import OllamaBackend

loop = AutoLoop(
    target="prompt.md",
    metric=my_metric,
    directives="program.md",
    backend=OllamaBackend(model="llama3.1:70b"),
)

Warm starts

# Resume from a previous run's best result
loop.run(experiments=50, warm_start="./autoloop-results/best.py")

Metric composition

from autoloop import CompositeMetric

metric = CompositeMetric([
    (accuracy_metric, 0.7),   # 70% weight
    (latency_metric, 0.3),    # 30% weight
])
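Conceptually, a composite like this reduces to a weighted sum of the component scores. The sketch below shows that idea only; it is not the source of `CompositeMetric`, and `weighted_metric` is a hypothetical name.

```python
from typing import Callable, Sequence, Tuple

Metric = Callable[[str], float]


def weighted_metric(parts: Sequence[Tuple[Metric, float]]) -> Metric:
    """Combine (metric, weight) pairs into a single metric that returns their weighted sum."""

    def combined(target_path: str) -> float:
        return sum(weight * metric(target_path) for metric, weight in parts)

    return combined
```

A weighted sum is only meaningful when all components point the same way and sit on comparable scales. If accuracy is better when higher, a latency component would need to be negated and normalized before it is mixed in.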


Comparison to autoresearch

| | autoresearch | autoloop |
|---|---|---|
| Domain | ML training only | Any |
| Target | train.py | Any file |
| Metric | val_bpb | Any Python function |
| Budget | 5-min wall clock | Configurable |
| Agent | Claude Code / Codex | Any |
| Parallel | No | Yes |

autoloop is autoresearch with the ML-specific parts removed and replaced with a general interface.

Philosophy

The insight from autoresearch isn't about ML. It's about loop design:

  1. Unambiguous feedback: the metric must be objective and quantitative
  2. Fixed budget: experiments must be comparable
  3. Narrow scope: one file, reviewable diffs
  4. Overnight scale: 100 experiments while you sleep

Wherever you can satisfy these four conditions, you can run autonomous improvement. autoloop makes that loop accessible without writing the scaffolding yourself.

Roadmap

  • Web UI for experiment visualization
  • Multi-file optimization with dependency tracking
  • MCP server (use autoloop as a tool inside Claude Code)
  • Hosted experiment tracking (autoloop cloud)
  • Pre-built metric libraries (RAGAS, finance, code quality)

Contributing

PRs welcome. See CONTRIBUTING.md.

License

MIT


Inspired by karpathy/autoresearch. autoloop generalizes the loop.
