Autonomous research agent CLI — automates the experiment loop: propose → implement → test → run → evaluate

ABREKA — Autonomous Research Agent

ABREKA is a CLI tool that automates the research experiment loop: propose → implement → test → run → evaluate. It runs autonomously for hours or days, building on its own results, while maintaining full auditability of every decision.

Inspired by autoresearch, ABREKA is agent-driven end-to-end and uses pi-rpc as the coding agent backend.

Core Principles

  • Experiments are the unit of work. Every action produces an experiment with a hypothesis, method, metrics, and findings.
  • Fully automated by default, human-in-the-loop optional. abreka run 24h runs autonomously. abreka run --step does one iteration and stops.
  • TDD enforced. Code must pass tests before experiments run.
  • Full auditability. Every agent session transcript is saved. Every experiment has structured metrics and narrative findings.
  • Experiments form a DAG, not a list. Ideas branch and converge through parent references.

Installation

ABREKA is a standalone tool, installed separately from your project:

uv tool install abreka

Or run directly:

uvx abreka --help

The web UI user guide is maintained in src/abreka/docs/user_guide.md and is available at /guide.

Prerequisites

  • Python 3.11+
  • uv package manager
  • pi coding agent (for agent-driven commands)

Quick Start

# 1. Inside any uv-managed Python project:
cd my-research-project/
abreka init .

# 2. Edit the research goal:
$EDITOR abreka/goal.md

# 3. Run a single experiment cycle:
abreka run --step

# 4. Or let it run autonomously:
abreka run 24h

What abreka init . Creates

ABREKA reads your pyproject.toml to discover the project name and src/ layout, then creates:

my-research-project/
  pyproject.toml              # yours — abreka reads but doesn't modify
  src/my_lib/                 # your package
  abreka/                     # created by abreka init
    config.toml               # project settings, metrics, agent config
    goal.md                   # your research objective (edit this!)
    index.json                # experiment metadata cache
    prompts/                  # agent prompt templates (editable)
      propose_exploit.md
      propose_explore.md
      implement.md
      test.md
      repair_run.md
      evaluate.md
    experiments/
      0001/
        experiment.toml       # structured metadata
        findings.md           # narrative findings
        run.py                # experiment entry point
        code/                 # experiment-specific modules
        artifacts/
          checkpoints/
          plots/
          pi_session.json     # agent transcript

Configuration

abreka/config.toml

[project]
name = "mnist-exploration"        # discovered from pyproject.toml
lib = "mnist_exp"                 # discovered from src/ layout
test_command = "uv run pytest"

[experiment]
primary_metric = "val:accuracy"   # default sorting and "best" tracking
metric_direction = "maximize"     # or "minimize"
run_template = "./run.py"

[run]
mode = "local"                    # or "remote" to offload only the run step
timeout_minutes = 30
require_tests = true
max_repair_attempts = 3           # LLM repairs after non-timeout run failures

# [run.remote]
# host = "gpu-box"
# project_dir = "/workspace/mnist-exploration"
# bootstrap_command = "uv sync --frozen"
# poll_seconds = 5
# startup_command = "./scripts/start-gpu-vm.sh"
# teardown_command = "./scripts/stop-gpu-vm.sh"

[agent]
provider = "anthropic"
model = "claude-sonnet-4-20250514"
# Per-agent overrides (optional):
# propose_model = "claude-opus-4-20250514"

abreka/goal.md

Free-form markdown describing the research objective. Injected into all agent prompts:

# Research Goal

Achieve the highest possible validation accuracy on MNIST using models
that train in under 60 seconds on a single GPU.

## Success Criteria
- val:accuracy > 0.995
- Training time < 60 seconds

CLI Reference

Project Commands

abreka init .           # Initialize in current directory
abreka goal             # View the research goal
abreka status           # Summary: experiment counts, best result

The Autonomous Loop

abreka run 24h          # Run autonomously for 24 hours
abreka run 2h           # Run for 2 hours
abreka run --step       # Single iteration (human-in-the-loop)
abreka run --recover    # Recover exactly one interrupted remote run

Each iteration runs propose → implement → test → run → evaluate. If run.py exits non-zero, ABREKA preserves the failure output, asks an LLM to repair the existing experiment, and retries the run, up to max_repair_attempts times (default 3). A run that exceeds timeout_minutes counts as a failure and is not repaired. If a step ultimately fails, the experiment is marked failed and the loop continues with a new proposal.
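
The run-and-repair behaviour described above can be sketched roughly as follows. This is an illustration only, not ABREKA's implementation: run_with_repairs and the repair callback are hypothetical names, and the callback stands in for the LLM repair step.

```python
import subprocess
import sys
from typing import Callable


def run_with_repairs(
    command: list[str],
    repair: Callable[[str], None],
    max_repair_attempts: int = 3,
    timeout_seconds: float = 30 * 60,
) -> tuple[str, str]:
    """Run command; on a non-zero exit call repair(output) and retry.

    Returns (status, output) where status is "completed" or "failed".
    """
    output = ""
    for attempt in range(max_repair_attempts + 1):
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=timeout_seconds
            )
        except subprocess.TimeoutExpired:
            # Timeouts count as failures and are not repaired.
            return "failed", "timeout"
        output = result.stdout + result.stderr
        if result.returncode == 0:
            return "completed", output
        if attempt < max_repair_attempts:
            repair(output)  # stand-in for the LLM repair step
    return "failed", output


# Demo with a stand-in "experiment" that fails until the fake repair rewrites it.
command = [sys.executable, "-c", "raise SystemExit(1)"]
repairs: list[str] = []


def fake_repair(failure_output: str) -> None:
    repairs.append(failure_output)
    command[2] = "print('METRIC:val:accuracy=0.99')"


status, output = run_with_repairs(command, fake_repair, timeout_seconds=60)
print(status, len(repairs))
```

The failure output is passed to the repair step so that it can see what went wrong, which mirrors the "preserves the failure output" wording above.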

Agent-Driven Commands

Run individual steps manually:

abreka propose          # Agent proposes the next experiment
abreka implement 0003   # Agent writes code for experiment 0003
abreka test 0003        # Agent validates code, runs test suite
abreka evaluate 0003    # Agent interprets results, writes findings

Experiment Management

# Create and update
abreka exp new --hypothesis "Add dropout" --parent 0001 --tag regularization
abreka exp update 0003 --metric val:accuracy=0.9934 --metric train:loss=0.005
abreka exp finish 0003
abreka exp fail 0003 --reason "OOM on batch size 256"

# Remote-run recovery
abreka exp remote list
abreka exp remote status 0003
abreka exp remote sync 0003
abreka exp remote resume 0003
abreka exp remote stop 0003

# Query
abreka exp list                                        # all experiments
abreka exp list --status completed --sort metric:val:accuracy
abreka exp list --where "metric:val:accuracy > 0.95"
abreka exp list --parent 0001                          # children of 0001
abreka exp show 0003                                   # full detail view
abreka exp search "augmentation"                       # keyword search
abreka exp diff 0003 0005                              # side-by-side comparison
abreka exp tree                                        # DAG visualization
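
To make the `--where` and `--sort` queries more concrete, here is a small sketch of filtering and ordering experiments by a metric. The filter grammar, the record layout, and the choice to sort best-first are assumptions made for illustration; the actual CLI may parse and order things differently. The sample data comes from the example tree in the Experiment DAG section.

```python
import operator
import re

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}
# Assumed grammar: metric:<split>:<name> <op> <number>
WHERE = re.compile(r"^metric:(?P<key>[\w:]+)\s*(?P<op>>=|<=|==|>|<)\s*(?P<value>\S+)$")


def parse_where(expression: str):
    """Turn a filter string into a predicate over experiment records."""
    match = WHERE.match(expression.strip())
    if match is None:
        raise ValueError(f"unsupported filter: {expression!r}")
    key, compare = match["key"], OPS[match["op"]]
    threshold = float(match["value"])
    return lambda exp: key in exp["metrics"] and compare(exp["metrics"][key], threshold)


# Hypothetical in-memory records for the experiments in the example tree.
EXPERIMENTS = [
    {"id": "0001", "status": "completed", "metrics": {"val:accuracy": 0.9912}},
    {"id": "0002", "status": "completed", "metrics": {"val:accuracy": 0.9923}},
    {"id": "0003", "status": "completed", "metrics": {"val:accuracy": 0.9934}},
    {"id": "0005", "status": "completed", "metrics": {"val:accuracy": 0.9961}},
    {"id": "0006", "status": "running", "metrics": {}},
]

matches = parse_where("metric:val:accuracy > 0.95")
selected = [e for e in EXPERIMENTS if e["status"] == "completed" and matches(e)]
# With metric_direction = "maximize", best-first means descending order.
selected.sort(key=lambda e: e["metrics"]["val:accuracy"], reverse=True)
ids = [e["id"] for e in selected]
print(ids)
```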

Experiment Data Model

Status Lifecycle

proposed → implementing → testing → running → completed
                                          ↘ failed
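
One way to read the lifecycle above is as a small state machine. The sketch below is an interpretation, not ABREKA's code: it assumes that any unfinished step may move to failed (in line with "if a step ultimately fails") and that completed and failed are terminal.

```python
from enum import Enum


class Status(str, Enum):
    PROPOSED = "proposed"
    IMPLEMENTING = "implementing"
    TESTING = "testing"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Happy path, in the order shown in the lifecycle diagram.
HAPPY_PATH = [
    Status.PROPOSED,
    Status.IMPLEMENTING,
    Status.TESTING,
    Status.RUNNING,
    Status.COMPLETED,
]
TERMINAL = {Status.COMPLETED, Status.FAILED}


def can_transition(current: Status, new: Status) -> bool:
    """Allow one step forward on the happy path, or failure from any open state."""
    if current in TERMINAL:
        return False
    if new is Status.FAILED:
        # Assumption: any unfinished step may fail.
        return True
    return HAPPY_PATH.index(new) == HAPPY_PATH.index(current) + 1


print(can_transition(Status.TESTING, Status.RUNNING))
print(can_transition(Status.PROPOSED, Status.RUNNING))
```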

Metrics Protocol

Experiments print metrics to stdout during training:

METRIC:train:loss=0.005
METRIC:val:accuracy=0.9934
METRIC:accuracy=0.85          # split defaults to "val"

ABREKA parses these lines and writes them into experiment.toml. Metrics are namespaced by split (train, val, test).
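
A minimal sketch of parsing this protocol is shown below. The regular expression and the parse_metrics helper are illustrative assumptions rather than ABREKA's actual parser. In particular, the sketch assumes that metric names are word characters only and that the last value printed for a metric wins.

```python
import re

# METRIC:<split>:<name>=<value>, where "<split>:" is optional.
METRIC_LINE = re.compile(
    r"^METRIC:(?:(?P<split>\w+):)?(?P<name>\w+)=(?P<value>\S+)$"
)


def parse_metrics(stdout: str, default_split: str = "val") -> dict[str, float]:
    """Collect METRIC lines from captured output, keyed as "<split>:<name>"."""
    metrics: dict[str, float] = {}
    for line in stdout.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match is None:
            continue  # ordinary log line
        try:
            value = float(match["value"])
        except ValueError:
            continue  # not a number; ignored in this sketch
        split = match["split"] or default_split
        metrics[f"{split}:{match['name']}"] = value  # last value wins
    return metrics


captured = "epoch 3 done\nMETRIC:train:loss=0.005\nMETRIC:val:accuracy=0.9934\n"
print(parse_metrics(captured))
```

On the experiment side, emitting a metric is just a print call, for example `print(f"METRIC:val:accuracy={acc}", flush=True)`.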

Experiment DAG

Experiments reference parents, forming a directed acyclic graph:

Experiments
├── 0001 completed Vanilla CNN baseline (val:accuracy=0.9912)
│   ├── 0002 completed Add dropout (val:accuracy=0.9923)
│   │   └── 0005 completed Dropout + CutMix (val:accuracy=0.9961)
│   └── 0003 completed Random augmentation (val:accuracy=0.9934)
└── 0006 running MLP baseline
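
The parent references can be modelled as a mapping from experiment ID to parent IDs. The sketch below uses the IDs from the tree above and the standard-library graphlib module. The helpers children and ancestors are hypothetical names, and experiment 0007 is an invented example that shows two branches converging.

```python
from graphlib import TopologicalSorter

# Experiment ID -> parent IDs, taken from the example tree above.
PARENTS = {
    "0001": [],
    "0002": ["0001"],
    "0003": ["0001"],
    "0005": ["0002"],
    "0006": [],
}


def children(graph: dict[str, list[str]], exp_id: str) -> list[str]:
    """Experiments that list exp_id as a parent."""
    return sorted(child for child, parents in graph.items() if exp_id in parents)


def ancestors(graph: dict[str, list[str]], exp_id: str) -> set[str]:
    """All experiments reachable by following parent references."""
    seen: set[str] = set()
    stack = list(graph[exp_id])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph[parent])
    return seen


# Hypothetical experiment 0007 combines two branches, which makes this a DAG
# rather than a tree.
merged = {**PARENTS, "0007": ["0003", "0005"]}

# static_order() yields parents before children and raises CycleError on a cycle.
order = list(TopologicalSorter(merged).static_order())

print(children(PARENTS, "0001"))
print(sorted(ancestors(merged, "0007")))
```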

Shared Library & Copy-on-Write

Your code in src/<lib_name>/ is shared across all experiments:

  1. The agent can add new modules to the shared library.
  2. The agent can modify existing code if all tests still pass.
  3. If a modification breaks tests, the agent copies the code into the experiment's own directory (abreka/experiments/<id>/code/) and changes the copy instead.
  4. The test suite enforces this automatically.
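
The rules above amount to a copy-on-write decision. The sketch below shows one way to express it. In ABREKA the agent makes this choice, guided by its prompts and the test suite, so apply_shared_edit and the tests_pass callback are hypothetical and serve only to illustrate the logic.

```python
import tempfile
from pathlib import Path
from typing import Callable


def apply_shared_edit(
    shared_file: Path,
    new_source: str,
    experiment_code_dir: Path,
    tests_pass: Callable[[], bool],
) -> Path:
    """Try an edit in the shared library; fall back to an experiment-local copy."""
    original = shared_file.read_text()
    shared_file.write_text(new_source)
    if tests_pass():
        return shared_file  # rule 2: the edit is safe for every experiment
    # Rule 3: restore the shared file and keep the change private.
    shared_file.write_text(original)
    experiment_code_dir.mkdir(parents=True, exist_ok=True)
    private_copy = experiment_code_dir / shared_file.name
    private_copy.write_text(new_source)
    return private_copy


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    shared = root / "src" / "my_lib" / "model.py"
    shared.parent.mkdir(parents=True)
    shared.write_text("WIDTH = 32\n")
    exp_code = root / "abreka" / "experiments" / "0003" / "code"

    # Pretend the edit breaks the test suite.
    written = apply_shared_edit(shared, "WIDTH = 64\n", exp_code, tests_pass=lambda: False)
    shared_after = shared.read_text()
    copy_text = written.read_text()
    copied_into_experiment = written.parent == exp_code

print(copied_into_experiment)
```

In a real project, tests_pass would typically run the configured test_command and check its exit code.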

Agent Prompts

Default prompts in abreka/prompts/ are fully editable. They use {{variable}} placeholders:

Variable                  Description
{{goal}}                  Contents of goal.md
{{exp_id}}                Current experiment ID
{{hypothesis}}            Experiment hypothesis
{{method}}                Experiment method
{{parents}}               Parent experiment IDs
{{experiment_context}}    Summary of all prior experiments
{{lib_name}}              Your package name
{{run_output}}            Captured stdout/stderr from the run
{{artifacts_list}}        List of artifact files
{{primary_metric}}        Primary metric from config
{{metric_direction}}      maximize or minimize
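
Placeholder substitution of this kind can be sketched in a few lines. This is not ABREKA's renderer: the render helper is hypothetical, and both the tolerance for spaces inside the braces and the error on unknown names are assumptions.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: dict[str, str]) -> str:
    """Replace each {{name}} with its value; fail loudly on unknown names."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"unknown placeholder: {name}")
        return str(variables[name])

    return PLACEHOLDER.sub(substitute, template)


template = "Goal:\n{{goal}}\n\nImplement experiment {{exp_id}} using {{lib_name}}."
prompt = render(
    template,
    {"goal": "Reach val:accuracy > 0.995", "exp_id": "0003", "lib_name": "mnist_exp"},
)
print(prompt)
```

Failing on an unknown placeholder catches typos in edited prompt files early, at the cost of rejecting templates that use double braces for other purposes.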

Architecture

abreka CLI
  └── propose / implement / test / evaluate
        └── pi-rpc client (spawns pi --mode rpc subprocess)
              └── pi agent uses tools: bash, read, write, edit
                    └── agent calls abreka exp ... to manage experiments

ABREKA communicates with pi via JSON lines over stdin/stdout. Every agent session is saved as an artifact for full auditability.
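
The JSON-lines transport can be illustrated with a stand-in child process. The sketch below does not talk to pi and does not reproduce its message schema: the child is a small echo script, and the message fields are invented for the example. It only shows the pattern of writing one JSON object per line, reading one back, and keeping a transcript.

```python
import json
import subprocess
import sys

# Stand-in child: reads one JSON object per line and echoes a reply.
# It is NOT pi; the real agent is spawned as `pi --mode rpc`.
CHILD = r"""
import json, sys
for line in sys.stdin:
    message = json.loads(line)
    print(json.dumps({"echo": message}), flush=True)
"""


def exchange(messages: list[dict]) -> list[dict]:
    """Send each message as one JSON line and record the reply."""
    transcript = []
    with subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    ) as proc:
        for message in messages:
            proc.stdin.write(json.dumps(message) + "\n")
            proc.stdin.flush()  # one complete line per message
            reply = json.loads(proc.stdout.readline())
            transcript.append({"sent": message, "received": reply})
    return transcript


# Hypothetical message shape, used only for this demo.
transcript = exchange([{"type": "prompt", "text": "propose the next experiment"}])
print(json.dumps(transcript, indent=2))
```

Saving a transcript like this as JSON is one plausible way to end up with an artifact such as pi_session.json, although the real file's structure is not documented here.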

Development

git clone https://github.com/your-org/abreka.git
cd abreka
uv sync
uv run pytest           # 55 tests
uv run abreka --help

License

MIT
