Autonomous research agent CLI — automates the experiment loop: propose → implement → test → run → evaluate
ABREKA — Autonomous Research Agent
ABREKA is a CLI tool that automates the research experiment loop: propose → implement → test → run → evaluate. It runs autonomously for hours or days, building on its own results, while maintaining full auditability of every decision.
Inspired by autoresearch, ABREKA is agent-driven end-to-end and uses pi-rpc as the coding agent backend.
Core Principles
- Experiments are the unit of work. Every action produces an experiment with a hypothesis, method, metrics, and findings.
- Fully automated by default, human-in-the-loop optional. abreka run 24h runs autonomously. abreka run --step does one iteration and stops.
- TDD enforced. Code must pass tests before experiments run.
- Full auditability. Every agent session transcript is saved. Every experiment has structured metrics and narrative findings.
- Experiments form a DAG, not a list. Ideas branch and converge through parent references.
Installation
ABREKA is a standalone tool, installed separately from your project:
uv tool install abreka
Or run directly:
uvx abreka --help
The web UI user guide is maintained in src/abreka/docs/user_guide.md and is available at /guide.
Prerequisites
Quick Start
# 1. Inside any uv-managed Python project:
cd my-research-project/
abreka init .
# 2. Edit the research goal:
$EDITOR abreka/goal.md
# 3. Run a single experiment cycle:
abreka run --step
# 4. Or let it run autonomously:
abreka run 24h
What abreka init . Creates
ABREKA reads your pyproject.toml to discover the project name and src/ layout, then creates:
my-research-project/
pyproject.toml # yours — abreka reads but doesn't modify
src/my_lib/ # your package
abreka/ # created by abreka init
config.toml # project settings, metrics, agent config
goal.md # your research objective (edit this!)
index.json # experiment metadata cache
prompts/ # agent prompt templates (editable)
propose_exploit.md
propose_explore.md
implement.md
test.md
repair_run.md
evaluate.md
experiments/
0001/
experiment.toml # structured metadata
findings.md # narrative findings
run.py # experiment entry point
code/ # experiment-specific modules
artifacts/
checkpoints/
plots/
pi_session.json # agent transcript
Configuration
abreka/config.toml
[project]
name = "mnist-exploration" # discovered from pyproject.toml
lib = "mnist_exp" # discovered from src/ layout
test_command = "uv run pytest"
[experiment]
primary_metric = "val:accuracy" # default sorting and "best" tracking
metric_direction = "maximize" # or "minimize"
run_template = "./run.py"
[run]
mode = "local" # or "remote" to offload only the run step
timeout_minutes = 30
require_tests = true
max_repair_attempts = 3 # LLM repairs after non-timeout run failures
# [run.remote]
# host = "gpu-box"
# project_dir = "/workspace/mnist-exploration"
# bootstrap_command = "uv sync --frozen"
# poll_seconds = 5
# startup_command = "./scripts/start-gpu-vm.sh"
# teardown_command = "./scripts/stop-gpu-vm.sh"
[agent]
provider = "anthropic"
model = "claude-sonnet-4-20250514"
# Per-agent overrides (optional):
# propose_model = "claude-opus-4-20250514"
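As a rough illustration of how primary_metric and metric_direction could drive "best" tracking, here is a minimal sketch. The function name best_experiment and the in-memory dictionary layout are assumptions made for this example, not ABREKA's actual internals; the metric values are the ones shown in the example experiment tree in this README.

```python
from typing import Optional


def best_experiment(
    metrics_by_id: dict[str, dict[str, float]],
    primary_metric: str = "val:accuracy",
    metric_direction: str = "maximize",
) -> Optional[str]:
    """Return the ID of the best experiment, or None if none reported the metric.

    Hypothetical helper: the name and data layout are illustrative only.
    """
    if metric_direction not in ("maximize", "minimize"):
        raise ValueError(f"unknown metric_direction: {metric_direction!r}")
    # Only experiments that actually reported the primary metric can compete.
    scored = {
        exp_id: metrics[primary_metric]
        for exp_id, metrics in metrics_by_id.items()
        if primary_metric in metrics
    }
    if not scored:
        return None
    pick = max if metric_direction == "maximize" else min
    return pick(scored, key=scored.__getitem__)


# Values taken from the example experiment tree in this README.
results = {
    "0001": {"val:accuracy": 0.9912},
    "0002": {"val:accuracy": 0.9923},
    "0003": {"val:accuracy": 0.9934},
    "0005": {"val:accuracy": 0.9961},
    "0006": {},  # still running, no metrics yet
}
print(best_experiment(results))  # 0005
```

Experiments without the primary metric (for example, ones still running) are skipped rather than treated as zero, so they cannot win under "minimize" by accident.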
abreka/goal.md
Free-form markdown describing the research objective. Injected into all agent prompts:
# Research Goal
Achieve the highest possible test accuracy on MNIST using models
that train in under 60 seconds on a single GPU.
## Success Criteria
- val:accuracy > 0.995
- Training time < 60 seconds
CLI Reference
Project Commands
abreka init . # Initialize in current directory
abreka goal # View the research goal
abreka status # Summary: experiment counts, best result
The Autonomous Loop
abreka run 24h # Run autonomously for 24 hours
abreka run 2h # Run for 2 hours
abreka run --step # Single iteration (human-in-the-loop)
abreka run --recover # Recover exactly one interrupted remote run
Each iteration: propose → implement → test → run → evaluate. If run.py exits non-zero, ABREKA preserves the failure output, asks an LLM to repair the existing experiment, and retries the run (default 3 repair attempts). Run timeouts are treated as failures without repair. If a step ultimately fails, the experiment is marked failed and the loop continues with a new proposal.
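The run-and-repair behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not ABREKA's implementation: RunResult, run_with_repair, and the run and repair callables are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    """Hypothetical outcome of one run.py invocation."""
    exit_code: int
    output: str
    timed_out: bool = False


def run_with_repair(
    run: Callable[[], RunResult],
    repair: Callable[[str], None],
    max_repair_attempts: int = 3,
) -> bool:
    """Return True if the run step eventually succeeds, False if it fails."""
    attempts = 0
    result = run()
    while True:
        if result.timed_out:
            return False  # timeouts count as failures and are never repaired
        if result.exit_code == 0:
            return True
        if attempts >= max_repair_attempts:
            return False  # repair budget exhausted
        repair(result.output)  # pass the preserved failure output to the repair step
        attempts += 1
        result = run()
```

With the default of 3 repair attempts, run.py is executed at most 4 times per experiment: the initial run plus one retry after each repair.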
Agent-Driven Commands
Run individual steps manually:
abreka propose # Agent proposes the next experiment
abreka implement 0003 # Agent writes code for experiment 0003
abreka test 0003 # Agent validates code, runs test suite
abreka evaluate 0003 # Agent interprets results, writes findings
Experiment Management
# Create and update
abreka exp new --hypothesis "Add dropout" --parent 0001 --tag regularization
abreka exp update 0003 --metric val:accuracy=0.9934 --metric train:loss=0.005
abreka exp finish 0003
abreka exp fail 0003 --reason "OOM on batch size 256"
# Remote-run recovery
abreka exp remote list
abreka exp remote status 0003
abreka exp remote sync 0003
abreka exp remote resume 0003
abreka exp remote stop 0003
# Query
abreka exp list # all experiments
abreka exp list --status completed --sort metric:val:accuracy
abreka exp list --where "metric:val:accuracy > 0.95"
abreka exp list --parent 0001 # children of 0001
abreka exp show 0003 # full detail view
abreka exp search "augmentation" # keyword search
abreka exp diff 0003 0005 # side-by-side comparison
abreka exp tree # DAG visualization
Experiment Data Model
Status Lifecycle
proposed → implementing → testing → running → completed
↘ failed
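One way to read the lifecycle diagram is as a small transition table. The sketch below assumes that any in-progress status may move to failed, which matches the statement that an experiment is marked failed when a step ultimately fails; the diagram itself does not spell this out, and the helper names allowed_next and can_transition are hypothetical.

```python
# Happy-path order taken from the diagram above.
HAPPY_PATH = ["proposed", "implementing", "testing", "running", "completed"]
TERMINAL = {"completed", "failed"}


def allowed_next(status: str) -> set[str]:
    """Statuses an experiment may move to from `status`."""
    if status in TERMINAL:
        return set()
    position = HAPPY_PATH.index(status)  # raises ValueError for unknown statuses
    # Assumption: any in-progress status may also transition to "failed".
    return {HAPPY_PATH[position + 1], "failed"}


def can_transition(current: str, new: str) -> bool:
    """True if moving from `current` to `new` follows the lifecycle."""
    return new in allowed_next(current)
```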
Metrics Protocol
Experiments print metrics to stdout during training:
METRIC:train:loss=0.005
METRIC:val:accuracy=0.9934
METRIC:accuracy=0.85 # split defaults to "val"
ABREKA parses these lines and writes them into experiment.toml. Metrics are namespaced by split (train, val, test).
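A parser for this line format might look like the sketch below. It is not ABREKA's own parser: the regular expression, the function name parse_metrics, and the choices to ignore non-numeric values and let later lines overwrite earlier ones are assumptions made for illustration.

```python
import re

# Matches "METRIC:<split>:<name>=<value>" or "METRIC:<name>=<value>".
_METRIC_RE = re.compile(
    r"^METRIC:(?:(?P<split>train|val|test):)?(?P<name>[^:=\s]+)=(?P<value>\S+)"
)


def parse_metrics(stdout: str) -> dict[str, float]:
    """Collect metrics keyed as "<split>:<name>"; later lines overwrite earlier ones."""
    metrics: dict[str, float] = {}
    for line in stdout.splitlines():
        match = _METRIC_RE.match(line.strip())
        if match is None:
            continue  # ordinary log output, not a metric line
        try:
            value = float(match["value"])
        except ValueError:
            continue  # assumption: silently skip non-numeric values
        split = match["split"] or "val"  # split defaults to "val"
        metrics[f"{split}:{match['name']}"] = value
    return metrics
```

For example, a run that prints METRIC:train:loss=0.005 followed by METRIC:accuracy=0.85 would yield the keys train:loss and val:accuracy.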
Experiment DAG
Experiments reference parents, forming a directed acyclic graph:
Experiments
├── 0001 completed Vanilla CNN baseline (val:accuracy=0.9912)
│ ├── 0002 completed Add dropout (val:accuracy=0.9923)
│ │ └── 0005 completed Dropout + CutMix (val:accuracy=0.9961)
│ └── 0003 completed Random augmentation (val:accuracy=0.9934)
└── 0006 running MLP baseline
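The parent references behind a tree like this can be modelled as a simple mapping from experiment ID to parent IDs. The sketch below is illustrative only: children_index and ancestors are hypothetical helpers, and experiment 0007 is an invented entry added to show two branches converging.

```python
from collections import defaultdict


def children_index(parents: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert child -> parents into parent -> sorted children."""
    index: dict[str, list[str]] = defaultdict(list)
    for exp_id, parent_ids in parents.items():
        for parent_id in parent_ids:
            index[parent_id].append(exp_id)
    return {parent_id: sorted(kids) for parent_id, kids in index.items()}


def ancestors(exp_id: str, parents: dict[str, list[str]]) -> set[str]:
    """All experiments reachable by following parent references upward."""
    seen: set[str] = set()
    stack = list(parents.get(exp_id, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


# Parent references from the tree above.
parents = {
    "0001": [],
    "0002": ["0001"],
    "0003": ["0001"],
    "0005": ["0002"],
    "0006": [],
    "0007": ["0003", "0005"],  # hypothetical: merges two branches
}
print(children_index(parents)["0001"])       # ['0002', '0003']
print(sorted(ancestors("0007", parents)))    # ['0001', '0002', '0003', '0005']
```

Because an experiment can list more than one parent, the structure is a DAG rather than a tree; the tree view simply shows each experiment under one of its parents.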
Shared Library & Copy-on-Write
Your code in src/<lib_name>/ is shared across all experiments:
- The agent can add new modules to the shared library.
- The agent can modify existing code if all tests still pass.
- If modifications break tests, the agent copies to experiments/<id>/code/ instead.
- The test suite enforces this automatically.
Agent Prompts
Default prompts in abreka/prompts/ are fully editable. They use {{variable}} placeholders:
| Variable | Description |
|---|---|
| {{goal}} | Contents of goal.md |
| {{exp_id}} | Current experiment ID |
| {{hypothesis}} | Experiment hypothesis |
| {{method}} | Experiment method |
| {{parents}} | Parent experiment IDs |
| {{experiment_context}} | Summary of all prior experiments |
| {{lib_name}} | Your package name |
| {{run_output}} | Captured stdout/stderr from the run |
| {{artifacts_list}} | List of artifact files |
| {{primary_metric}} | Primary metric from config |
| {{metric_direction}} | maximize or minimize |
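Placeholder substitution of this kind can be sketched in a few lines. The snippet below is an assumption about how such templates might be rendered, not ABREKA's actual renderer; render_prompt is a hypothetical name, and leaving unknown placeholders untouched is a design choice made for this example.

```python
import re

# Matches {{name}} where name is letters, digits, or underscores.
_PLACEHOLDER_RE = re.compile(r"\{\{(\w+)\}\}")


def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Replace each {{name}} with variables[name]; unknown names are left as-is."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return variables.get(name, match.group(0))

    # A function replacement avoids backslash-escape surprises in the values.
    return _PLACEHOLDER_RE.sub(substitute, template)


template = "Goal: {{goal}}\nExperiment {{exp_id}} should {{metric_direction}} {{primary_metric}}."
print(render_prompt(template, {
    "goal": "Beat the baseline",
    "exp_id": "0003",
    "metric_direction": "maximize",
    "primary_metric": "val:accuracy",
}))
```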
Architecture
abreka CLI
└── propose / implement / test / evaluate
└── pi-rpc client (spawns pi --mode rpc subprocess)
└── pi agent uses tools: bash, read, write, edit
└── agent calls abreka exp ... to manage experiments
ABREKA communicates with pi via JSON lines over stdin/stdout. Every agent session is saved as an artifact for full auditability.
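For readers unfamiliar with JSON lines framing, the sketch below shows the general idea: each message is one JSON object on its own line. It does not reproduce the pi RPC protocol; the message fields ("type", "text") and the helper names write_message and read_messages are hypothetical, and an in-memory buffer stands in for the subprocess pipes.

```python
import io
import json
from typing import Iterator, TextIO


def write_message(stream: TextIO, message: dict) -> None:
    """Serialize one message as a single line of JSON and flush it."""
    # json.dumps escapes newlines inside strings, so each message stays on one line.
    stream.write(json.dumps(message) + "\n")
    stream.flush()


def read_messages(stream: TextIO) -> Iterator[dict]:
    """Yield one decoded message per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for a subprocess pipe; the message fields are hypothetical.
pipe = io.StringIO()
write_message(pipe, {"type": "prompt", "text": "line one\nline two"})
write_message(pipe, {"type": "done"})
pipe.seek(0)
for message in read_messages(pipe):
    print(message["type"])
```

Because every message occupies exactly one line, a saved transcript can be replayed or inspected line by line with ordinary text tools.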
Development
git clone https://github.com/your-org/abreka.git
cd abreka
uv sync
uv run pytest # 55 tests
uv run abreka --help
License
MIT