Autonomous research agent CLI — automates the experiment loop: propose → implement → test → run → evaluate


ABREKA — Autonomous Research Agent

ABREKA is a CLI tool that automates the research experiment loop: propose → implement → test → run → evaluate. It runs autonomously for hours or days, building on its own results, while maintaining full auditability of every decision.

Inspired by autoresearch, ABREKA is agent-driven end-to-end and uses pi-rpc as the coding agent backend.

Core Principles

Experiments are the unit of work. Every action produces an experiment with a hypothesis, method, metrics, and findings.
Fully automated by default, human-in-the-loop optional. `abreka run 24h` runs autonomously; `abreka run --step` does one iteration and stops.
TDD enforced. Code must pass tests before experiments run.
Full auditability. Every agent session transcript is saved. Every experiment has structured metrics and narrative findings.
Experiments form a DAG, not a list. Ideas branch and converge through parent references.

Installation

ABREKA is a standalone tool, installed separately from your project:

uv tool install abreka

Or run directly:

uvx abreka --help

The web UI user guide is maintained in src/abreka/docs/user_guide.md and is served by the web UI at /guide.

Prerequisites

Python 3.11+
uv package manager
pi coding agent (for agent-driven commands)

Quick Start

# 1. Inside any uv-managed Python project:
cd my-research-project/
abreka init .

# 2. Edit the research goal:
$EDITOR abreka/goal.md

# 3. Run a single experiment cycle:
abreka run --step

# 4. Or let it run autonomously:
abreka run 24h

What `abreka init .` Creates

ABREKA reads your pyproject.toml to discover the project name and src/ layout, then creates:

my-research-project/
  pyproject.toml              # yours — abreka reads but doesn't modify
  src/my_lib/                 # your package
  abreka/                     # created by abreka init
    config.toml               # project settings, metrics, agent config
    goal.md                   # your research objective (edit this!)
    index.json                # experiment metadata cache
    prompts/                  # agent prompt templates (editable)
      propose_exploit.md
      propose_explore.md
      implement.md
      test.md
      repair_run.md
      evaluate.md
    experiments/
      0001/
        experiment.toml       # structured metadata
        findings.md           # narrative findings
        run.py                # experiment entry point
        code/                 # experiment-specific modules
        artifacts/
          checkpoints/
          plots/
          pi_session.json     # agent transcript

Configuration

`abreka/config.toml`

[project]
name = "mnist-exploration"        # discovered from pyproject.toml
lib = "mnist_exp"                 # discovered from src/ layout
test_command = "uv run pytest"

[experiment]
primary_metric = "val:accuracy"   # default sorting and "best" tracking
metric_direction = "maximize"     # or "minimize"
run_template = "./run.py"

[run]
mode = "local"                  # or "remote" to offload only the run step
timeout_minutes = 30
require_tests = true
max_repair_attempts = 3          # LLM repairs after non-timeout run failures

# [run.remote]
# host = "gpu-box"
# project_dir = "/workspace/mnist-exploration"
# bootstrap_command = "uv sync --frozen"
# poll_seconds = 5
# startup_command = "./scripts/start-gpu-vm.sh"
# teardown_command = "./scripts/stop-gpu-vm.sh"

[agent]
provider = "anthropic"
model = "claude-sonnet-4-20250514"
# Per-agent overrides (optional):
# propose_model = "claude-opus-4-20250514"

`abreka/goal.md`

Free-form markdown describing the research objective. It is injected into all agent prompts:

# Research Goal

Achieve the highest possible validation accuracy on MNIST using models
that train in under 60 seconds on a single GPU.

## Success Criteria
- val:accuracy > 0.995
- Training time < 60 seconds

CLI Reference

Project Commands

abreka init .           # Initialize in current directory
abreka goal             # View the research goal
abreka status           # Summary: experiment counts, best result

The Autonomous Loop

abreka run 24h          # Run autonomously for 24 hours
abreka run 2h           # Run for 2 hours
abreka run --step       # Single iteration (human-in-the-loop)
abreka run --recover    # Recover exactly one interrupted remote run

Each iteration runs propose → implement → test → run → evaluate. If run.py exits non-zero, ABREKA preserves the failure output, asks an LLM to repair the existing experiment, and retries the run, up to max_repair_attempts times (default 3). Run timeouts count as failures and are not repaired. If a step still fails, the experiment is marked failed and the loop continues with a new proposal.
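
The repair-and-retry behaviour of the run step can be sketched as below. This is a simplified model under stated assumptions, not ABREKA's implementation: `RunResult`, `run_with_repairs`, and the `run` / `repair` callables are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    returncode: int
    output: str
    timed_out: bool = False


def run_with_repairs(
    run: Callable[[], RunResult],
    repair: Callable[[str], None],
    max_repair_attempts: int = 3,
) -> tuple[str, list[str]]:
    """Run once, then repair and retry on failure; timeouts are never repaired."""
    failures: list[str] = []  # preserved failure output, oldest first
    attempts = 0
    result = run()
    while result.returncode != 0 or result.timed_out:
        failures.append(result.output)
        if result.timed_out or attempts >= max_repair_attempts:
            return "failed", failures
        repair(result.output)  # e.g. hand the failure output to the agent
        attempts += 1
        result = run()
    return "ok", failures
```

With the default of 3 repair attempts, a run that keeps failing is executed at most four times before the experiment is marked failed.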

Agent-Driven Commands

Run individual steps manually:

abreka propose          # Agent proposes the next experiment
abreka implement 0003   # Agent writes code for experiment 0003
abreka test 0003        # Agent validates code, runs test suite
abreka evaluate 0003    # Agent interprets results, writes findings

Experiment Management

# Create and update
abreka exp new --hypothesis "Add dropout" --parent 0001 --tag regularization
abreka exp update 0003 --metric val:accuracy=0.9934 --metric train:loss=0.005
abreka exp finish 0003
abreka exp fail 0003 --reason "OOM on batch size 256"

# Remote-run recovery
abreka exp remote list
abreka exp remote status 0003
abreka exp remote sync 0003
abreka exp remote resume 0003
abreka exp remote stop 0003

# Query
abreka exp list                                        # all experiments
abreka exp list --status completed --sort metric:val:accuracy
abreka exp list --where "metric:val:accuracy > 0.95"
abreka exp list --parent 0001                          # children of 0001
abreka exp show 0003                                   # full detail view
abreka exp search "augmentation"                       # keyword search
abreka exp diff 0003 0005                              # side-by-side comparison
abreka exp tree                                        # DAG visualization

Experiment Data Model

Status Lifecycle

proposed → implementing → testing → running → completed
                                          ↘ failed
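
The lifecycle can be read as a small state machine. The sketch below assumes that `failed` is reachable from any non-terminal status (the loop description says an experiment is marked failed when a step fails) and that the happy path never skips a status. `Status` and `can_transition` are illustrative names only.

```python
from enum import Enum


class Status(str, Enum):
    PROPOSED = "proposed"
    IMPLEMENTING = "implementing"
    TESTING = "testing"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


HAPPY_PATH = [
    Status.PROPOSED,
    Status.IMPLEMENTING,
    Status.TESTING,
    Status.RUNNING,
    Status.COMPLETED,
]
TERMINAL = {Status.COMPLETED, Status.FAILED}


def can_transition(current: Status, new: Status) -> bool:
    """Allow one step forward on the happy path, or a move to failed."""
    if current in TERMINAL:
        return False
    if new is Status.FAILED:
        return True  # assumption: any in-progress step may fail
    return HAPPY_PATH.index(new) == HAPPY_PATH.index(current) + 1
```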

Metrics Protocol

Experiments print metrics to stdout during training:

METRIC:train:loss=0.005
METRIC:val:accuracy=0.9934
METRIC:accuracy=0.85          # split defaults to "val"

ABREKA parses these lines and writes them into experiment.toml. Metrics are namespaced by split (train, val, test).
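
A parser for this protocol might look like the sketch below. The exact grammar ABREKA accepts is not documented here, so the regular expression, the "last value wins" rule, and the `parse_metrics` name are all assumptions based on the three example lines.

```python
import re

# Assumed line shape: METRIC:[split:]name=value, with split in train/val/test.
METRIC_LINE = re.compile(
    r"^METRIC:(?:(?P<split>train|val|test):)?(?P<name>[^=:\s]+)=(?P<value>\S+)$"
)


def parse_metrics(stdout: str) -> dict[str, float]:
    """Collect metrics from captured stdout; later lines overwrite earlier ones."""
    metrics: dict[str, float] = {}
    for line in stdout.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match is None:
            continue  # ordinary log output, not a metric line
        try:
            value = float(match["value"])
        except ValueError:
            continue  # value is not numeric; skip it
        split = match["split"] or "val"  # split defaults to "val"
        metrics[f"{split}:{match['name']}"] = value
    return metrics
```

On the emitting side, a `run.py` only needs something like `print(f"METRIC:val:accuracy={accuracy}", flush=True)`.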

Experiment DAG

Experiments reference parents, forming a directed acyclic graph:

Experiments
├── 0001 completed Vanilla CNN baseline (val:accuracy=0.9912)
│   ├── 0002 completed Add dropout (val:accuracy=0.9923)
│   │   └── 0005 completed Dropout + CutMix (val:accuracy=0.9961)
│   └── 0003 completed Random augmentation (val:accuracy=0.9934)
└── 0006 running MLP baseline
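
The tree above only shows single-parent experiments; with several parents the structure becomes a true DAG. The sketch below inverts parent references into child lists and walks ancestors. Experiment `0007` with two parents is a hypothetical addition for illustration, and the helper names are not ABREKA's.

```python
from collections import defaultdict


def build_children(parents: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert parent references into a parent -> children mapping."""
    children: dict[str, list[str]] = defaultdict(list)
    for exp_id in sorted(parents):
        for parent in parents[exp_id]:
            children[parent].append(exp_id)
    return dict(children)


def ancestors(exp_id: str, parents: dict[str, list[str]]) -> set[str]:
    """Return every experiment reachable by following parent references."""
    seen: set[str] = set()
    stack = list(parents.get(exp_id, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited via another branch
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


parents = {
    "0001": [],
    "0002": ["0001"],
    "0003": ["0001"],
    "0005": ["0002"],
    "0006": [],
    "0007": ["0003", "0005"],  # hypothetical: merges two branches
}
print(build_children(parents)["0001"])
print(sorted(ancestors("0007", parents)))
```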

Shared Library & Copy-on-Write

Your code in src/<lib_name>/ is shared across all experiments:

The agent can add new modules to the shared library.
The agent can modify existing code if all tests still pass.
If modifications break tests, the agent instead places its modified copy in experiments/<id>/code/, leaving the shared library unchanged.
The test suite enforces this automatically.

Agent Prompts

Default prompts in abreka/prompts/ are fully editable. They use {{variable}} placeholders:

| Variable | Description |
| --- | --- |
| `{{goal}}` | Contents of `goal.md` |
| `{{exp_id}}` | Current experiment ID |
| `{{hypothesis}}` | Experiment hypothesis |
| `{{method}}` | Experiment method |
| `{{parents}}` | Parent experiment IDs |
| `{{experiment_context}}` | Summary of all prior experiments |
| `{{lib_name}}` | Your package name |
| `{{run_output}}` | Captured stdout/stderr from the run |
| `{{artifacts_list}}` | List of artifact files |
| `{{primary_metric}}` | Primary metric from config |
| `{{metric_direction}}` | `maximize` or `minimize` |
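
Placeholder substitution of this kind can be sketched in a few lines. How ABREKA actually renders templates is not shown here; `render_prompt` is a hypothetical helper, and raising on unknown placeholders is a design choice of this sketch.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Replace each {{name}} with its value; fail loudly on unknown names."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"unknown placeholder: {name}")
        return variables[name]

    # A function replacement avoids backslash-escape handling in the values.
    return PLACEHOLDER.sub(substitute, template)


template = "Goal:\n{{goal}}\n\nEvaluate experiment {{exp_id}} on {{primary_metric}}."
print(render_prompt(template, {
    "goal": "Reach val:accuracy > 0.995",
    "exp_id": "0003",
    "primary_metric": "val:accuracy",
}))
```

Failing on an unknown name makes typos in an edited prompt visible immediately instead of sending a literal `{{...}}` to the agent.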

Architecture

abreka CLI
  └── propose / implement / test / evaluate
        └── pi-rpc client (spawns pi --mode rpc subprocess)
              └── pi agent uses tools: bash, read, write, edit
                    └── agent calls abreka exp ... to manage experiments

ABREKA communicates with pi via JSON lines over stdin/stdout. Every agent session is saved as an artifact for full auditability.
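
For readers unfamiliar with JSON lines over stdin/stdout, the pattern looks roughly like this. The child process here is a stand-in that echoes each message back; it is not the pi agent, and the message fields (`type`, `text`) are invented for illustration, since pi's actual schema is not described in this README.

```python
import json
import subprocess
import sys

# Stand-in child: reads one JSON object per line and replies with one JSON line.
CHILD = (
    "import json, sys\n"
    "for line in sys.stdin:\n"
    "    msg = json.loads(line)\n"
    "    print(json.dumps({'echo': msg}), flush=True)\n"
)


def exchange(messages: list[dict]) -> list[dict]:
    """Send each message as one JSON line and read one JSON line in reply."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    replies = []
    try:
        for message in messages:
            proc.stdin.write(json.dumps(message) + "\n")
            proc.stdin.flush()  # the child only sees complete, flushed lines
            replies.append(json.loads(proc.stdout.readline()))
    finally:
        proc.stdin.close()  # end of input lets the child exit
        proc.stdout.close()
        proc.wait()
    return replies
```

Keeping every request and reply as a single line is also what makes the transcript easy to save verbatim as an artifact.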

Development

git clone https://github.com/your-org/abreka.git
cd abreka
uv sync
uv run pytest           # 55 tests
uv run abreka --help

License

MIT


This version

0.0.1

May 5, 2026

