autoresearch for everything: autonomous iterative improvement for any system
autoloop
autoresearch for everything.
Karpathy's autoresearch showed us the loop: point an AI agent at a problem, give it a metric, let it run 100 experiments overnight. Wake up to a better system.
That loop was hardcoded to ML training. autoloop generalizes it to any domain.
from autoloop import AutoLoop
loop = AutoLoop(
    target="optimize.py",       # what the agent edits
    metric=my_eval_function,    # returns a float
    directives="program.md",    # research goals in plain English
    budget_seconds=300,         # per experiment (default: 5 min)
)
loop.run(experiments=100) # go to sleep
# wake up to a git log of 100 experiments and a better system
It Works: Here's a Real Test Run
We ran autoloop on a naive recursive fibonacci function, giving it 4 experiments to find a faster implementation. No human involved after the initial setup:
Baseline score: -0.1717s (naive recursion, fibonacci(30))
Experiment 1/4
KEPT | Score: -0.0249 (+0.1467) | Add memoization with dict cache
Experiment 2/4
DISCARDED | Score: -0.0280 (-0.0030) | Switch to iterative approach
Experiment 3/4
DISCARDED | Score: -999.000 (-998.97) | Wrong shortcut - should be discarded
Experiment 4/4
KEPT | Score: -0.0217 (+0.0032) | Use functools.lru_cache decorator
Run complete: 4 experiments | 2 improvements | Best: -0.0217s
Roughly a 7.9x speedup over the baseline (0.1717s down to 0.0217s). Broken code (experiment 3) was automatically detected and discarded by the correctness check in the metric. The loop kept every genuine improvement and rejected everything else.
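The correctness check can live inside the metric itself. Here is a minimal sketch, assuming the target file defines a function named `fibonacci` and that the metric signals failure with a sentinel of -999.0 (matching the score logged for experiment 3). The helper names `load_target` and `fib_metric` and the chosen reference values are illustrative, not part of autoloop:

```python
import importlib.util
import time

FAILURE_SCORE = -999.0  # sentinel seen in the log above; assumed to mean "broken"

def load_target(target_path: str):
    """Import the target file as a throwaway module (hypothetical helper)."""
    spec = importlib.util.spec_from_file_location("target_module", target_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

def fib_metric(target_path: str, n: int = 30) -> float:
    """Return negative runtime in seconds, or FAILURE_SCORE if results are wrong."""
    try:
        fib = load_target(target_path).fibonacci
        # Correctness first: a fast but wrong answer must never win.
        known = {0: 0, 1: 1, 2: 1, 10: 55, 20: 6765}
        if any(fib(k) != v for k, v in known.items()):
            return FAILURE_SCORE
        start = time.perf_counter()
        fib(n)
        elapsed = time.perf_counter() - start
    except Exception:
        return FAILURE_SCORE  # crashes and import errors count as broken
    return -elapsed  # higher (closer to zero) is better
```

Returning the negated time means a higher score is always better, which is why the scores in the log are negative.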
Why This Exists
autoresearch works because of three design decisions:
- Single file to modify: keeps scope manageable and diffs reviewable
- Fixed time/compute budget: makes experiments directly comparable
- One unambiguous metric: enables full autonomy with no human judgment needed
These decisions aren't specific to ML training. They apply to any system you want to improve autonomously. autoloop is just the abstraction.
What You Can Optimize
| Domain | Target file | Metric |
|---|---|---|
| Prompt optimization | prompt.md | LLM-as-judge score / task accuracy |
| SQL queries | query.sql | Execution time / rows returned |
| Trading strategies | strategy.py | Sharpe ratio / win rate |
| API pipelines | pipeline.py | Latency / success rate |
| Test suites | tests.py | Coverage / mutation score |
| Compiler flags | build.sh | Binary size / compile time |
| Agent system prompts | system_prompt.md | Task completion rate |
| RAG pipelines | retrieval.py | RAGAS score / hit rate |
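Every row in the table reduces to a function that takes the target path and returns a float. As one illustration for the SQL row, here is a sketch that times a query with Python's built-in `sqlite3` module. The `db_path` argument, the name `sql_time_metric`, and the best-of-`repeats` timing are assumptions made for this example, not autoloop features:

```python
import sqlite3
import time
from pathlib import Path

def sql_time_metric(target_path: str, db_path: str, repeats: int = 5) -> float:
    """Return the best-of-N execution time in seconds for the query in target_path."""
    query = Path(target_path).read_text()
    conn = sqlite3.connect(db_path)
    try:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            conn.execute(query).fetchall()  # fetch all rows so the work is really done
            timings.append(time.perf_counter() - start)
    finally:
        conn.close()
    return min(timings)  # lower is better
```

Because the metric shown elsewhere in this README takes a single argument, `db_path` would need to be bound first (for example with `functools.partial`), and `higher_is_better` set to `False`. A production metric would presumably also compare the returned rows against a reference result so that a fast but wrong query cannot win.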
Install
pip install autoloop-ai
Requires Python 3.10+. Works with any LLM agent backend (Claude Code, Codex, local models via Ollama).
Quickstart
1. Define your target
The file your agent will edit. Start small: one function, one prompt, one query.
# optimize.py: your agent edits this
SYSTEM_PROMPT = """You are a helpful assistant."""
2. Define your metric
A Python function that returns a float. Either direction can count as better: set higher_is_better to tell autoloop whether to maximize or minimize.
def my_metric(target_path: str) -> float:
    """Run eval and return score. autoloop calls this after every experiment."""
    result = run_eval(target_path)
    return result.accuracy  # higher is better
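The snippet above calls `run_eval`, which you supply yourself. The following is one possible sketch, assuming the target file defines `SYSTEM_PROMPT` as in step 1. The `EvalResult` class, the `CASES` list, and the `model` callable (a stand-in for whatever LLM call you use) are all hypothetical names introduced for illustration:

```python
import runpy
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    accuracy: float

# Hypothetical labeled cases: (user query, substring the reply must contain).
CASES = [
    ("Where is my order?", "order"),
    ("I want a refund", "refund"),
]

def run_eval(target_path: str, model: Callable[[str, str], str]) -> EvalResult:
    """Score the SYSTEM_PROMPT defined in target_path against CASES."""
    namespace = runpy.run_path(target_path)  # executes the target file
    prompt = namespace["SYSTEM_PROMPT"]
    hits = sum(expected in model(prompt, query).lower() for query, expected in CASES)
    return EvalResult(accuracy=hits / len(CASES))
```

Here `model(prompt, query)` returns the reply text. To match the one-argument call in `my_metric`, bind `model` ahead of time or close over it.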
3. Write your directives
Plain English research goals in program.md. This is what you iterate on over time.
# Research Directives
## Goal
Improve the system prompt to increase task completion rate on customer support queries.
## Hypotheses to explore
- More specific role definition
- Explicit handling of edge cases
- Chain-of-thought instructions
- Tone adjustments for different query types
## Constraints
- Keep under 500 tokens
- Must pass safety checks
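Directives are instructions to the agent, and nothing in this README says autoloop verifies them. One way to make a constraint binding is to enforce it in the metric. This is a sketch under stated assumptions: the wrapper name `with_token_limit` is hypothetical, the token count is a crude whitespace split over the whole file and not a real tokenizer, and the wrapped metric is assumed to be higher-is-better:

```python
from pathlib import Path
from typing import Callable

Metric = Callable[[str], float]

def with_token_limit(metric: Metric, max_tokens: int = 500,
                     penalty: float = float("-inf")) -> Metric:
    """Wrap a higher-is-better metric so over-length targets score `penalty`."""
    def guarded(target_path: str) -> float:
        text = Path(target_path).read_text()
        approx_tokens = len(text.split())  # whitespace proxy, not a real tokenizer
        if approx_tokens >= max_tokens:    # "keep under 500" means 500 already fails
            return penalty
        return metric(target_path)
    return guarded
```

With this wrapper, an experiment that violates the limit can never be recorded as an improvement, whatever its raw score.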
4. Run
from autoloop import AutoLoop
loop = AutoLoop(
    target="optimize.py",
    metric=my_metric,
    directives="program.md",
    budget_seconds=300,
    agent="claude",  # "claude", "codex", "ollama"
    higher_is_better=True,
)
loop.run(experiments=100)
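How autoloop enforces `budget_seconds` internally is not described here. If your own evaluation can hang, one way to cap it is to run it as a separate process with a timeout. This sketch uses the standard `subprocess` module. The function name `run_with_budget` and the convention that the evaluation script prints its score as the last line of stdout are assumptions for illustration:

```python
import subprocess
import sys
from typing import Optional

def run_with_budget(script_path: str, budget_seconds: float) -> Optional[float]:
    """Run an evaluation script; return the float it prints, or None on timeout/failure."""
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            capture_output=True, text=True, timeout=budget_seconds,
        )
    except subprocess.TimeoutExpired:
        return None  # over budget: treat as a failed experiment
    if proc.returncode != 0:
        return None  # the script crashed
    try:
        return float(proc.stdout.strip().splitlines()[-1])
    except (IndexError, ValueError):
        return None  # no parseable score on the last line
```

A metric built on this would map `None` to a failure score so the experiment is discarded.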
5. Review
autoloop history # git log of all experiments
autoloop best # show the best-performing version
autoloop diff 12 best # compare experiment 12 to best
autoloop rollback 12 # restore experiment 12
How It Works
autoloop

Read directives (program.md)
      |
      v
Agent proposes modification to target file
      |
      v
Apply modification
      |
      v
Run metric() with fixed budget
      |
      v
Score improved? --YES--> git commit + update best
      |
      NO
      |
      v
Discard + log
      |
      v
Repeat N times
Each experiment is logged with: timestamp, modification description, score delta, and the full diff. The git history is your research log.
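The keep-or-discard cycle can be sketched in a few lines of plain Python. This is not autoloop's implementation: the agent is replaced by a hypothetical `propose` callable that maps the current best text to `(new_text, description)`, git is replaced by an in-memory copy of the best version, and the log omits the full diff:

```python
import time
from pathlib import Path
from typing import Callable

def run_loop(target: str, metric: Callable[[str], float],
             propose: Callable[[str], tuple[str, str]],
             experiments: int, higher_is_better: bool = True) -> list[dict]:
    """Keep-if-better loop over a single target file; returns the experiment log."""
    path = Path(target)
    best_text = path.read_text()
    best_score = metric(target)                # baseline
    log = []
    for i in range(1, experiments + 1):
        new_text, description = propose(best_text)
        path.write_text(new_text)              # apply modification
        score = metric(target)                 # evaluate
        # Positive delta always means "better", whichever direction is configured.
        delta = score - best_score if higher_is_better else best_score - score
        kept = delta > 0
        if kept:
            best_text, best_score = new_text, score  # autoloop commits to git here
        else:
            path.write_text(best_text)         # discard: restore the best version
        log.append({"experiment": i, "timestamp": time.time(),
                    "description": description, "score": score,
                    "delta": delta, "kept": kept})
    return log
```

Each proposal starts from the current best version, so a discarded experiment leaves no trace in the target file.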
Advanced Usage
Parallel experiments
loop.run(experiments=100, parallel=4) # 4 agents running simultaneously
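With a single target file, concurrent experiments need separate working copies. How autoloop isolates and merges parallel runs is not documented here, so the following only sketches the general idea: scoring several candidate versions at once, each in its own temporary directory. The function name `score_candidates` and its parameters are hypothetical:

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable

def score_candidates(candidates: list[str], metric: Callable[[str], float],
                     filename: str = "optimize.py", workers: int = 4) -> list[float]:
    """Score each candidate text in its own temporary directory, concurrently."""
    def score_one(text: str) -> float:
        with tempfile.TemporaryDirectory() as workdir:
            path = Path(workdir) / filename
            path.write_text(text)
            return metric(str(path))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_one, candidates))  # results keep input order
```

Threads suit metrics that mostly wait on I/O, such as API calls or subprocesses. Timing-based metrics can distort each other when run side by side, which undermines the comparability that a fixed budget is meant to provide.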
Custom agent backends
from autoloop.backends import OllamaBackend
loop = AutoLoop(
    target="prompt.md",
    metric=my_metric,
    directives="program.md",
    backend=OllamaBackend(model="llama3.1:70b"),
)
Warm starts
# Resume from a previous run's best result
loop.run(experiments=50, warm_start="./autoloop-results/best.py")
Metric composition
from autoloop import CompositeMetric
metric = CompositeMetric([
    (accuracy_metric, 0.7),  # 70% weight
    (latency_metric, 0.3),   # 30% weight
])
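The README does not spell out how `CompositeMetric` combines its parts. A plain weighted sum is the most likely reading, and it can be sketched as follows. The name `weighted_sum` is hypothetical and this is not the library's code:

```python
from typing import Callable

Metric = Callable[[str], float]

def weighted_sum(parts: list[tuple[Metric, float]]) -> Metric:
    """Combine metrics into one score: sum of weight * metric(target_path)."""
    def combined(target_path: str) -> float:
        return sum(weight * metric(target_path) for metric, weight in parts)
    return combined
```

A weighted sum is only meaningful when every component points the same way and sits on a comparable scale. Accuracy is higher-is-better while latency is lower-is-better, so under this reading the latency component would need to be negated or normalized before being combined.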
Examples
- examples/prompt_optimization/: optimize a Claude system prompt for customer support
- examples/sql_optimization/: optimize a slow SQL query
- examples/trading_strategy/: evolve a trading strategy (inspired by AutoStrategy)
- examples/rag_pipeline/: optimize a RAG retrieval pipeline
Comparison to autoresearch
| | autoresearch | autoloop |
|---|---|---|
| Domain | ML training only | Any |
| Target | train.py | Any file |
| Metric | val_bpb | Any Python function |
| Budget | 5-min wall clock | Configurable |
| Agent | Claude Code / Codex | Any |
| Parallel | No | Yes |
autoloop is autoresearch with the ML-specific parts removed and replaced with a general interface.
Philosophy
The insight from autoresearch isn't about ML. It's about loop design:
- Unambiguous feedback: the metric must be objective and quantitative
- Fixed budget: experiments must be comparable
- Narrow scope: one file, reviewable diffs
- Overnight scale: 100 experiments while you sleep
Wherever you can satisfy these four conditions, you can run autonomous improvement. autoloop makes that loop accessible without writing the scaffolding yourself.
Roadmap
- Web UI for experiment visualization
- Multi-file optimization with dependency tracking
- MCP server (use autoloop as a tool inside Claude Code)
- Hosted experiment tracking (autoloop cloud)
- Pre-built metric libraries (RAGAS, finance, code quality)
Contributing
PRs welcome. See CONTRIBUTING.md.
License
MIT
Inspired by karpathy/autoresearch. autoloop generalizes the loop.