Framework for generic AI research loops with measurable benchmarks

Project description

Autoresearch Lab

A framework for running automated AI research loops where an agent iteratively improves code against a measurable benchmark. Currently powered by Claude Code.

You define a pipeline (the code being optimized), a backend implemented in Python (how to evaluate it), and data (what to evaluate against). The framework handles sandboxing, orchestration, git commits/reverts, and stopping conditions.

Background

This project is inspired by Andrej Karpathy's autoresearch pattern — the idea that an AI agent can autonomously run a research loop of "change → evaluate → keep/discard" against a benchmark, accumulating improvements over time.

Autoresearch Lab implements the pattern in a somewhat generic way:

Language and domain agnostic. It treats the pipeline (the code being optimized) as a black box: it can be written in any language and run in any environment, as long as evaluating it yields a single score to optimize. The tradeoff is that you have to implement an evaluation backend (in Python) as a bridge to your code.
Sandboxed by default. The agent runs inside a Docker container with only the pipeline code mounted as writable, protecting the host from most rogue agent behavior. An orchestrator on the host manages git commits and reverts, so the agent loop can't corrupt your repo.
Integrates into existing git repos. You can create a "lab" inside an existing git repo to persist the research loop's configuration and state. This makes it easy to stop and resume the loop, to collaborate on it with others, and to keep developing a piece of code through a combination of autoresearch and human input.
Host service support. Some pipelines can't run inside a Docker container (e.g. mobile code that needs an emulator or device). The framework can manage host-side services and forward ports into the sandbox, letting the agent's code run on real hardware while still being orchestrated.
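
The "change → evaluate → keep/discard" pattern described above can be sketched as a small greedy loop. This is a toy illustration, not the framework's implementation: research_loop, propose and evaluate are hypothetical names, the "pipeline" is just a value held in memory, and keeping or discarding stands in for the git commit or revert that the orchestrator performs. As in the backend example further down, lower scores are treated as better.

```python
import random
from typing import Callable, TypeVar

S = TypeVar("S")


def research_loop(
    initial: S,
    propose: Callable[[S], S],
    evaluate: Callable[[S], float],
    max_iterations: int,
) -> tuple[S, float, list[str]]:
    """Greedy keep/discard loop over candidate states (lower score is better)."""
    best, best_score = initial, evaluate(initial)
    history: list[str] = []
    for _ in range(max_iterations):
        candidate = propose(best)      # "change": derive a variant of the best state
        score = evaluate(candidate)    # "evaluate": reduce it to a single number
        if score < best_score:         # "keep": strictly better than the best so far
            best, best_score = candidate, score
            history.append("keep")
        else:                          # "discard": fall back to the previous best
            history.append("discard")
    return best, best_score, history


if __name__ == "__main__":
    rng = random.Random(0)
    best, score, history = research_loop(
        initial=10.0,
        propose=lambda x: x + rng.uniform(-1.0, 1.0),  # random perturbation
        evaluate=lambda x: abs(x - 3.0),                # distance to a toy optimum
        max_iterations=200,
    )
    print(f"best={best:.3f} score={score:.3f} kept={history.count('keep')}")
```

Because discarded candidates never replace the best state, improvements can only accumulate, which is the property the pattern relies on.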

Warning: Sandbox limitations

The agent runs in a Docker container, which provides process-level isolation but is not a true security sandbox. A Docker container is not suitable for running untrusted or adversarial code, so it is risky to run Autoresearch Lab on a machine that holds data or credentials you cannot afford to expose. If you need stronger isolation, run Autoresearch Lab itself inside a VM.

No matter how good your isolation, the sandbox has network access because the agent needs to reach the API of the AI provider. This means the agent can make arbitrary HTTP requests, is vulnerable to prompt injection and may exfiltrate your pipeline code and data. Do not include any secrets in your pipeline code and data.

Getting started

Prerequisites

Docker running (the agent runs in a sandboxed container)
Git repository (the orchestrator commits/reverts pipeline changes)
Anthropic API key set as environment variable ANTHROPIC_API_KEY (or use --use-oauth-osx if logged in via claude login)

Quick start

This assumes you have uv for your Python dependency needs (but any other Python packaging solution will do).

# Initialize a lab in your project
cd my-project
uv init
uv add autoresearch-lab
uv run arl init --name "my-lab"

# Edit the generated files:
#   lab.toml      — configure backend and pipeline location
#   backend.py    — implement evaluation logic
#   AGENT.md      — write agent instructions

# Run the research loop
uv run arl run --data ./my-data --max-iterations 20

# Pass extra arguments to Claude Code after --
uv run arl run --data ./my-data -- --effort high

Usage

How it works

Host (arl run)                      Docker Container (Claude Code agent)
├─ Start host service (if any)       ├─ Read AGENT.md
├─ Launch container                  ├─ arl diagnose → understand failures
├─ Poll for verdict.json             ├─ Modify pipeline/ code
│   ◄── {"action": "keep", ...}  ◄──├─ arl eval → get metrics
├─ git commit or revert              ├─ Write verdict.json
├─ Delete verdict.json               ├─ Wait for deletion
├─ Check stopping conditions         ├─ Repeat
└─ Repeat                            └─ Repeat

The agent runs inside a sandboxed Docker container with restricted filesystem access. It can only modify the pipeline code and signal verdicts. Git operations happen on the host.
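
The host side of the handshake in the diagram above could look roughly like the following sketch. It is written under assumptions rather than taken from the framework: wait_for_verdict and handle_verdict are hypothetical helpers, the only verdict field relied on is "action" (the one shown in the diagram), any action other than "keep" is treated as a discard, and the git commit or revert is replaced by plain callbacks.

```python
import json
import time
from pathlib import Path


def wait_for_verdict(path: Path, poll_interval: float = 0.5,
                     timeout: float = 60.0) -> dict:
    """Poll until `path` exists and contains valid JSON, then return it."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            return json.loads(path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            # Not written yet, or only partially written: try again shortly.
            time.sleep(poll_interval)
    raise TimeoutError(f"no verdict at {path} after {timeout}s")


def handle_verdict(path: Path, on_keep, on_discard) -> str:
    """Apply one verdict, then delete the file so the agent can continue."""
    verdict = wait_for_verdict(path)
    action = verdict["action"]
    if action == "keep":
        on_keep(verdict)       # stand-in for a git commit on the host
    else:
        on_discard(verdict)    # stand-in for a git revert on the host (assumed)
    path.unlink()              # deletion is what the agent waits for
    return action
```

Using the file's deletion as the acknowledgement keeps the protocol one-directional per step: the agent only writes the file, and the host only removes it.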

CLI commands

Command	Description
`arl init`	Initialize a new lab in the current directory
`arl run`	Start the autonomous research loop
`arl eval`	Evaluate pipeline (prints JSON metrics)
`arl diagnose`	Per-sample error analysis, worst first
`arl results`	Print experiment history from results.tsv

Configuration

Each lab is configured via a lab.toml file:

[lab]
name = "my-lab"
pipeline_dir = "pipeline"           # The only code the agent can modify (mounted read-write)
agent_instructions = "AGENT.md"     # What the agent reads on startup
results_file = "results.tsv"        # Experiment log

[backend]
module = "backend.py"               # Python file implementing EvalBackend
class = "MyBackend"                 # Class name within that file

# Optional: host-side service started before the container
[backend.host_service]
command = "python daemon.py --port 9100"
port = 9100                         # Forwarded into the container

# Optional: custom Dockerfile (must use FROM arl-agent-base)
[sandbox]
dockerfile = "Dockerfile"

Writing a backend

Implement EvalBackend from autoresearch_lab.harness.backend:

from pathlib import Path
from autoresearch_lab.harness.backend import EvalBackend, EvalResult, SampleResult

class MyBackend(EvalBackend):
    def evaluate(self, pipeline_dir: Path, data_dir: Path,
                 sample_ids: list[str] | None = None) -> EvalResult:
        # Run pipeline, compare against ground truth, compute your score.
        # The score is the single number the framework tracks (lower is better).
        # Extra metrics are logged but the framework only uses the score.
        #
        # If sample_ids is provided, you can optionally restrict evaluation
        # to just those samples to speed up `arl diagnose --sample <id>`.
        return EvalResult(
            score=0.042,
            metrics={"my_metric": 0.042, "latency_ms": 150.0},
            sample_results=[
                SampleResult(sample_id="img_001", score=0.03),
                SampleResult(sample_id="img_002", score=0.05),
            ],
        )

The framework is metric-agnostic — your backend defines what the score means (CER, loss, error rate, etc.) and how to compute it. Additional metrics in the metrics dict are logged to results.tsv for reference but the framework only tracks score for keep/discard decisions and stopping conditions.
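
As one concrete example of a "lower is better" score, a backend for a text-recognition pipeline might compute a character error rate (CER). The sketch below is independent of the framework: edit_distance and character_error_rate are hypothetical helpers, and the dict-based inputs are an assumption about how predictions and ground truth might be held in memory. The first return value would be a candidate for the overall score, the second for the per-sample scores.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(
    predictions: dict[str, str],
    references: dict[str, str],
) -> tuple[float, dict[str, float]]:
    """Return (corpus-level CER, per-sample CER); lower is better."""
    per_sample: dict[str, float] = {}
    total_edits = total_chars = 0
    for sample_id, ref in references.items():
        # A missing prediction counts as an empty string, i.e. all errors.
        edits = edit_distance(predictions.get(sample_id, ""), ref)
        per_sample[sample_id] = edits / max(len(ref), 1)
        total_edits += edits
        total_chars += len(ref)
    return total_edits / max(total_chars, 1), per_sample


if __name__ == "__main__":
    refs = {"img_001": "hello", "img_002": "world"}
    preds = {"img_001": "hallo", "img_002": "world"}
    print(character_error_rate(preds, refs))
```

The corpus-level value weights samples by length, so long samples are not averaged equally with short ones; a plain mean of the per-sample values is a reasonable alternative if every sample should count the same.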

The backend runs inside the Docker container by default. If evaluation requires host resources (e.g. an Android emulator), declare a [backend.host_service] in lab.toml — the framework starts it on the host, waits for the port to be reachable, and forwards it into the container.
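
Waiting for a port to become reachable is typically done by retrying a TCP connection until it succeeds or a deadline passes. The following sketch shows that general technique with the standard socket module; wait_for_port and its parameters are hypothetical and not the framework's own code.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.2) -> bool:
    """Return True once a TCP connection succeeds, False if `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: the service is not up yet.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    # 9100 is the port used in the lab.toml example.
    print(wait_for_port("127.0.0.1", 9100, timeout=5.0))
```

A successful connection only shows that the port accepts connections, not that the service behind it is ready to do useful work, so a backend may still want its own health check.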

Data

The --data argument passed to arl run, arl eval, and arl diagnose is a path to a directory containing whatever your backend needs to evaluate the pipeline — ground truth files, test inputs, reference images, benchmark configs, etc. The framework treats it as an opaque, read-only directory.

Each item in the data directory that produces a score is a sample. The backend returns per-sample results via SampleResult, which arl diagnose uses to show the worst-performing samples first. The structure of samples is entirely up to you — a sample could be a single file, a subdirectory, or an entry in a manifest file.

The data directory is not configured in lab.toml because it's common to evaluate against different datasets (e.g. a small, fast set during development vs. a full set for final evaluation), and datasets may not be part of the code's git repo.

Custom container

By default the agent runs in a base container with Python 3.12, Node 22, uv, and Claude Code. If your lab needs extra dependencies (system packages, runtimes, tools), create a Dockerfile that extends the base:

FROM arl-agent-base

USER root
RUN apt-get update && apt-get install -y my-package
USER agent

Then point to it in lab.toml:

[sandbox]
dockerfile = "Dockerfile"

Stopping conditions

arl run stops when any condition is met:

--max-iterations — maximum number of experiments (default: 50)
--max-hours — maximum session duration in hours (default: 8)
--max-cost — maximum estimated API spend in USD (default: unlimited)
--target-score — stop when score reaches this value (default: disabled)
--plateau-threshold — consecutive discards without improvement (default: 10)
--max-restarts — container crash or timeout restarts before stopping (default: 3)

Development

uv sync              # Install dependencies (includes dev tools)
uv run pytest        # Run tests
uv run ruff check    # Lint
uv run ruff format   # Format code
uv run pyright       # Type check

License

Autoresearch Lab is distributed under an MIT license.

Project details

Maintainers

nikhaldi

Release history

0.2.6

Apr 13, 2026

0.2.5

Apr 2, 2026

0.2.4

Mar 27, 2026

0.2.3

Mar 27, 2026

This version

0.2.2

Mar 27, 2026

0.2.1

Mar 27, 2026

0.2.0

Mar 26, 2026

0.1.2

Mar 25, 2026

0.1.1

Mar 24, 2026

0.1.0

Mar 24, 2026

Download files

Source Distribution

autoresearch_lab-0.2.2.tar.gz (32.8 kB)

Uploaded Mar 27, 2026 Source

Built Distribution

autoresearch_lab-0.2.2-py3-none-any.whl (27.1 kB)

Uploaded Mar 27, 2026 Python 3

File details

Details for the file autoresearch_lab-0.2.2.tar.gz.

File metadata

Download URL: autoresearch_lab-0.2.2.tar.gz
Upload date: Mar 27, 2026
Size: 32.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for autoresearch_lab-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`861526ec88131d11dae2b8a46451778fcbce5ec0c190be7037a40d7f20cfe47f`
MD5	`0f1b11be6ca5bf782e09d8ec022bfbe3`
BLAKE2b-256	`57a9470bfa3ca035301dca38c57482c7f1978091bef3c4309ae30e044e6e3bc2`

Provenance

The following attestation bundles were made for autoresearch_lab-0.2.2.tar.gz:

Publisher: publish.yml on nikhaldi/autoresearch-lab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: autoresearch_lab-0.2.2.tar.gz
- Subject digest: 861526ec88131d11dae2b8a46451778fcbce5ec0c190be7037a40d7f20cfe47f
- Sigstore transparency entry: 1188884932
- Sigstore integration time: Mar 27, 2026
Source repository:
- Permalink: nikhaldi/autoresearch-lab@d25255961054ee4bac31c632bb6cf0765dcfab02
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/nikhaldi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d25255961054ee4bac31c632bb6cf0765dcfab02
- Trigger Event: release

File details

Details for the file autoresearch_lab-0.2.2-py3-none-any.whl.

File metadata

Download URL: autoresearch_lab-0.2.2-py3-none-any.whl
Upload date: Mar 27, 2026
Size: 27.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for autoresearch_lab-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c75f9479be78d007fc9f3119a2da308afa1b396e31099c11a04f06da4c49d571`
MD5	`d7ed07e5a5773c679dc4e8ffb81df288`
BLAKE2b-256	`5c3b0dee3ee92bc8100969662ec32d0f4dd060128540a8c567aa0843d7576116`

Provenance

The following attestation bundles were made for autoresearch_lab-0.2.2-py3-none-any.whl:

Publisher: publish.yml on nikhaldi/autoresearch-lab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: autoresearch_lab-0.2.2-py3-none-any.whl
- Subject digest: c75f9479be78d007fc9f3119a2da308afa1b396e31099c11a04f06da4c49d571
- Sigstore transparency entry: 1188884956
- Sigstore integration time: Mar 27, 2026
Source repository:
- Permalink: nikhaldi/autoresearch-lab@d25255961054ee4bac31c632bb6cf0765dcfab02
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/nikhaldi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d25255961054ee4bac31c632bb6cf0765dcfab02
- Trigger Event: release
