Continuous autoresearch RL runner: LLM-driven hyperparameter and code search for fine-tuning on cloud GPUs.

AutoResearch-RL

An autonomous ML experiment loop: an LLM proposes hyperparameters or code changes, the runner trains on a local or cloud GPU (Basilica), evaluates the result, keeps or discards it, and repeats.

prepare.py  -->  [data]  -->  train.py  -->  [metrics]  -->  keep/discard  -->  repeat
 (frozen)                     (mutable)       eval_score       |
                                  ^                            |
                                  |     LLM proposes next      |
                                  +------- params or diff -----+

Quickstart

uv sync --extra dev
uv run autoresearch-rl run examples/minimal-trainable-target/config.yaml

Common workflows are wrapped in a Makefile:

make help       # list targets
make check      # lint + typecheck + full tests (~95 s)
make test-fast  # tests excluding the slow integration suite (~30 s)
make showcase   # run examples/parallel-cancel-showcase end-to-end

The Two Scripts

Every experiment has two scripts connected by the filesystem, never by imports:

prepare.py (frozen) -- runs once via prepare_cmd. Produces data files, defines the evaluation protocol (answer extraction, reward computation). The LLM cannot modify this file. It is the trust boundary: evaluation integrity is guaranteed by freezing it.

train.py (mutable) -- runs each iteration. Reads the prepared data, trains the model, prints metrics to stdout. The LLM can modify this file in llm_diff or hybrid mode. This is where the training algorithm, reward function, optimizer, and generation strategy live. When hyperparameter tuning stalls, the LLM autonomously proposes code diffs to train.py, for example improving the reward function, adding gradient accumulation, or changing the sampling strategy.

The boundary is deliberate: prepare.py owns "what is correct" (data, evaluation), train.py owns "how to get there" (training algorithm, reward shaping). The LLM can evolve the "how" but never redefine the "what".
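
The filesystem-only contract can be illustrated with a small sketch. Everything below is hypothetical and not taken from the project's examples: the file name examples.json, the toy regression task, and the metric dictionary are assumptions chosen to keep the example self-contained, and the scoring is inlined in train for brevity.

```python
import json
from pathlib import Path


def prepare(data_dir: Path) -> None:
    """Frozen side: write the dataset to disk once (hypothetical file name)."""
    data_dir.mkdir(parents=True, exist_ok=True)
    examples = [{"x": 1.0, "y": 2.0}, {"x": 2.0, "y": 4.0}, {"x": 3.0, "y": 6.0}]
    (data_dir / "examples.json").write_text(json.dumps(examples))


def train(data_dir: Path, learning_rate: float, steps: int = 200) -> dict:
    """Mutable side: read the prepared file, fit y = w * x, return metrics."""
    examples = json.loads((data_dir / "examples.json").read_text())
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * e["x"] - e["y"]) * e["x"] for e in examples) / len(examples)
        w -= learning_rate * grad
    mse = sum((w * e["x"] - e["y"]) ** 2 for e in examples) / len(examples)
    # Higher is better, so the score is the negated error.
    return {"eval_score": -mse}
```

The two functions share no imports or in-memory state; the only thing that crosses the boundary is the file written by prepare. A real train.py would print its metrics to stdout in whatever format the runner expects.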

How it works

Targets. Where training runs: locally (command), against a remote API (http), or on Basilica GPU cloud (basilica). Same config, different target.type.

Policies. How the next experiment is chosen:

| Policy | What it proposes | When to use |
| --- | --- | --- |
| `grid` | Exhaustive param combinations | Small spaces, baselines |
| `random` | Uniform random params | Large spaces, baselines |
| `llm` | LLM-guided params from history | Medium spaces, fast convergence |
| `llm_diff` | Code diffs to `train.py` | Algorithmic improvements |
| `hybrid` | Params first, code diffs when stalled | Best of both worlds |
| `learned` | PPO-based policy with trajectory feedback | Long campaigns |
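
As a rough illustration of the two baseline policies, the sketch below enumerates a discrete search space exhaustively and samples from it uniformly. The function names are hypothetical and are not the package's API; the search-space shape (a mapping from param name to a list of candidate values) mirrors the policy.params block in the config.

```python
import itertools
import random


def grid_proposals(space: dict[str, list]) -> list[dict]:
    """Exhaustive search: one proposal per combination of candidate values."""
    names = list(space)
    combos = itertools.product(*(space[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def random_proposal(space: dict[str, list], rng: random.Random) -> dict:
    """Uniform random search: pick each param independently from its candidates."""
    return {name: rng.choice(values) for name, values in space.items()}
```

The grid grows multiplicatively with every added param, which is why it suits small spaces, while random sampling keeps the per-iteration cost constant.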

Hybrid mode is the most powerful policy. It starts with param exploration (for example, finding the right learning rate and batch size); when the no-improvement streak reaches stall_threshold, it switches to code diffs. In diff mode the LLM reads train.py, program.md (task guidance), and the full experiment history, then proposes a unified diff. If the diff fails validation, the error is sent back to the LLM for correction (up to 2 retries). If diff proposals keep failing on consecutive iterations, the policy falls back to param mode.
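
The mode switching described above can be sketched as a small state machine. This is an illustration under assumptions, not the package's implementation: the class name, the max_diff_failures limit, and the choice to reset the streak on fallback are all hypothetical; only stall_threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass
class HybridState:
    stall_threshold: int        # streak length that triggers diff mode
    max_diff_failures: int = 3  # hypothetical limit before falling back
    no_improve_streak: int = 0
    diff_failures: int = 0
    mode: str = "param"

    def record(self, improved: bool, diff_failed: bool = False) -> str:
        """Update counters after one iteration and return the next mode."""
        self.no_improve_streak = 0 if improved else self.no_improve_streak + 1
        if self.mode == "diff":
            # Count only consecutive diff failures; a success resets the count.
            self.diff_failures = self.diff_failures + 1 if diff_failed else 0
            if self.diff_failures >= self.max_diff_failures:
                # Assumption: reset the streak so param mode gets a fresh window.
                self.mode = "param"
                self.diff_failures = 0
                self.no_improve_streak = 0
        elif self.no_improve_streak >= self.stall_threshold:
            self.mode = "diff"
        return self.mode
```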

Stop guards. Wall time, max iterations, no-improvement streak, failure rate (cancelled iters do not count as failures).
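
A minimal sketch of how the four guards might be combined follows. The names RunStats, stop_reason, and every parameter except no_improve_limit (which appears in the config) are assumptions for illustration. The one behavior taken directly from the text is that cancelled iterations are not counted as failures.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunStats:
    elapsed_s: float = 0.0
    no_improve_streak: int = 0
    # One status per finished iteration: "ok", "failed", or "cancelled".
    statuses: list = field(default_factory=list)


def stop_reason(
    stats: RunStats,
    max_wall_s: float,
    max_iterations: int,
    no_improve_limit: int,
    max_failure_rate: float,
) -> Optional[str]:
    """Return the name of the first guard that trips, or None to keep going."""
    total = len(stats.statuses)
    if stats.elapsed_s >= max_wall_s:
        return "wall_time"
    if total >= max_iterations:
        return "max_iterations"
    if stats.no_improve_streak >= no_improve_limit:
        return "no_improvement"
    # Cancelled iterations are graceful early-outs, so only "failed" counts.
    failures = sum(1 for status in stats.statuses if status == "failed")
    if total and failures / total > max_failure_rate:
        return "failure_rate"
    return None
```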

Checkpoint/resume. State persisted after every iteration. Survives crashes and restarts.
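
One common way to make a per-iteration checkpoint survive a crash mid-write is to write a temporary file and rename it over the old one. The sketch below shows that technique; it is not necessarily how the package persists its state, and the state fields (iteration, best_score) are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write the state to a temp file, then atomically swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes reach the disk
    os.replace(tmp_name, path)     # atomic rename on the same filesystem


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh state when no checkpoint exists."""
    if not path.exists():
        return {"iteration": 0, "best_score": None}
    return json.loads(path.read_text())
```

A reader of the checkpoint sees either the previous complete state or the new complete state, never a half-written file.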

Cooperative cancellation (controller.intra_iteration_cancel.enabled). The trial imports emit_progress (from autoresearch_rl.target.progress import emit_progress) and calls it once per step; the engine drains these progress reports and runs them through the power-law forecaster. When the forecast shows that a trial cannot beat the current best, the engine writes a control file and the trial's next emit_progress call exits with code 42. The iteration's status becomes cancelled (a graceful early-out, distinct from failed).
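
To make the forecasting idea concrete, here is a minimal sketch of a power-law extrapolation. The functional form loss = a * step^(-b), the log-log least-squares fit, and the function names are assumptions; the package's forecaster may differ. The sketch also assumes a lower-is-better metric with strictly positive values and at least two distinct steps.

```python
import math


def fit_power_law(steps: list[int], losses: list[float]) -> tuple[float, float]:
    """Least-squares fit of loss = a * step ** (-b) in log-log space."""
    xs = [math.log(step) for step in steps]
    ys = [math.log(loss) for loss in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


def should_cancel(
    steps: list[int],
    losses: list[float],
    step_target: int,
    best_loss: float,
    min_reports: int = 5,
) -> bool:
    """Cancel when the extrapolated final loss is worse than the current best."""
    if len(steps) < min_reports:
        return False  # too few progress reports to trust the fit
    a, b = fit_power_law(steps, losses)
    forecast = a * step_target ** (-b)
    return forecast > best_loss
```

The min_reports argument plays the role that min_reports_before_decide plays in the config: the guard stays silent until enough reports have arrived.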

Parallel iterations (controller.parallel.enabled). k trials run concurrently inside a ThreadPoolExecutor, admitted by a resource pool. Diff and hybrid policies stay serial, since k concurrent diffs would fight the contract. LLMParamPolicy.propose_batch issues a single chat call asking for k diverse proposals (instead of k independent calls). Reward feedback to learnable policies is buffered and drained in submission order so PPO sees a stable trial-time sequence.
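
The combination of a thread pool, resource-based admission, and submission-order draining can be sketched as follows. The ResourcePool class here is a hypothetical stand-in built on a condition variable, and run_batch is an illustrative helper, not the package's API. The sketch assumes each trial's cost never exceeds the pool's capacity; otherwise acquire would block forever.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ResourcePool:
    """Counting pool: a trial is admitted only while enough units are free."""

    def __init__(self, capacity: int) -> None:
        self._free = capacity
        self._cond = threading.Condition()

    def acquire(self, cost: int) -> None:
        with self._cond:
            self._cond.wait_for(lambda: self._free >= cost)
            self._free -= cost

    def release(self, cost: int) -> None:
        with self._cond:
            self._free += cost
            self._cond.notify_all()


def run_batch(trial_fn, proposals, pool: ResourcePool, max_concurrency: int, cost: int = 1):
    """Run one trial per proposal and return results in submission order."""

    def guarded(params):
        pool.acquire(cost)
        try:
            return trial_fn(params)
        finally:
            pool.release(cost)

    with ThreadPoolExecutor(max_workers=max_concurrency) as executor:
        futures = [executor.submit(guarded, params) for params in proposals]
        # Iterating the list (rather than as_completed) keeps submission order.
        return [future.result() for future in futures]
```

Draining the futures in list order is what gives a downstream learner a stable sequence, even when later trials finish first.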

Timeline export (telemetry.timeline_path). Writes a Chrome-trace JSON file openable directly in chrome://tracing or ui.perfetto.dev. Spans: policy.propose_batch, executor.execute, llm.chat_completion, all basilica.* phases.
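
For readers unfamiliar with the format, the sketch below writes a minimal Chrome-trace file using "complete" events (phase "X"), with timestamps and durations in microseconds. The helper names are hypothetical, and the fields the package actually emits may differ; only the span names are taken from the text.

```python
import json
from pathlib import Path
from typing import Optional


def complete_event(name: str, start_s: float, duration_s: float, tid: int = 0, args: Optional[dict] = None) -> dict:
    """Build one complete ("X") event; the trace format expects microseconds."""
    return {
        "name": name,
        "ph": "X",
        "ts": round(start_s * 1_000_000),
        "dur": round(duration_s * 1_000_000),
        "pid": 1,
        "tid": tid,
        "args": args or {},
    }


def write_timeline(path: Path, events: list) -> None:
    """Write the events in the JSON object form accepted by trace viewers."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"traceEvents": events}))
```

Giving each concurrent trial its own tid places its spans on a separate row in the viewer.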

Diff guardrails (policy.required_calls, default ["emit_progress"]). The diff validator AST-walks the post-patch source and rejects any diff that strips a required call. Used to keep load-bearing instrumentation intact across LLM-proposed code changes.
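
A check of this kind can be sketched with the standard ast module. The function names below are illustrative and the package's validator may be stricter; this version matches both bare calls such as emit_progress(...) and attribute calls such as progress.emit_progress(...).

```python
import ast


def called_names(source: str) -> set[str]:
    """Collect the names of every function or method called in the source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                names.add(func.id)
            elif isinstance(func, ast.Attribute):
                names.add(func.attr)
    return names


def missing_required_calls(patched_source: str, required_calls: list[str]) -> list[str]:
    """Return the required calls that no longer appear after the patch."""
    present = called_names(patched_source)
    return [name for name in required_calls if name not in present]
```

A diff would be rejected whenever missing_required_calls returns a non-empty list. Because the check walks the syntax tree, an import of emit_progress or a mention in a comment does not count as a call.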

Runtime config validation runs on every validate and run command. Eight checks cover reserved env-var prefixes; missing files, API keys, and GPU models; unwritable directories; budget alignment; and the presence of an emit_progress call when intra-iteration cancel is enabled. Blocking errors exit with code 2 before any trial starts.
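
Two of these checks might look roughly like the sketch below. It is an illustration only: the reserved prefix "AR_" is a guess based on the AR_MODEL_DIR variable, the target.env key is hypothetical, and the function names are not the package's API. Only policy.llm_api_key_env and the exit code 2 come from the text.

```python
RESERVED_PREFIX = "AR_"  # assumption: inferred from AR_MODEL_DIR, not confirmed


def validate(config: dict, environ: dict) -> list[str]:
    """Return a list of blocking error messages (empty when the config is fine)."""
    errors = []
    # Hypothetical key: user-supplied env vars must not shadow reserved names.
    for name in config.get("target", {}).get("env", {}):
        if name.startswith(RESERVED_PREFIX):
            errors.append(f"env var {name} uses reserved prefix {RESERVED_PREFIX}")
    # The API key is read from the env var named by policy.llm_api_key_env.
    key_env = config.get("policy", {}).get("llm_api_key_env")
    if key_env and not environ.get(key_env):
        errors.append(f"missing API key: environment variable {key_env} is not set")
    return errors


def exit_code(errors: list[str]) -> int:
    """Blocking errors map to exit code 2; a clean config maps to 0."""
    return 2 if errors else 0
```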

Examples

| Example | Policy | Task |
| --- | --- | --- |
| minimal-trainable-target | `llm_diff` | Deterministic toy (no GPU) |
| parallel-cancel-showcase | `random` | End-to-end demo: parallel + cancel + timeline + config validation (no GPU, ~13 s) |
| autoresearch-like | `llm_diff` | Synthetic training loop |
| basilica-grpo | `hybrid` | GRPO post-training: Qwen2.5-0.5B on GSM8K |
| deberta-prompt-injection | `hybrid` | DeBERTa security classifier |

Each example contains config.yaml, prepare.py, train.py, program.md, deploy.py, Dockerfile, run.sh, and README.md.

Config

target:
  prepare_cmd: ["python3", "prepare.py"]   # frozen: runs once, produces data
  train_cmd: ["python3", "train.py"]       # mutable: runs each iteration
  type: basilica                           # or: command, http

policy:
  type: hybrid                             # param search -> code diffs on stall
  params:                                  # search space for param mode
    learning_rate: [3e-6, 5e-6, 1e-5]
  mutable_file: train.py                   # LLM can modify this in diff mode
  frozen_file: prepare.py                  # LLM cannot modify this
  program_file: program.md                 # task guidance for the LLM
  llm_api_url: "https://llm.chutes.ai/v1"
  llm_model: "deepseek-ai/DeepSeek-V3-0324"
  llm_api_key_env: "CHUTES_API_KEY"

objective:
  metric: eval_score
  direction: max

controller:
  checkpoint_path: artifacts/checkpoint.json
  no_improve_limit: 10

  # Optional: cancel doomed trials mid-flight via the power-law forecaster.
  # Trial must call emit_progress(step=, step_target=, metrics=...) per step.
  intra_iteration_cancel:
    enabled: false               # opt-in
    min_steps: 5                 # don't cancel before this many trial steps
    poll_interval_s: 5.0         # how often the guard re-evaluates
    min_reports_before_decide: 5 # need at least this many progress reports

  # Optional: run K iterations concurrently. Diff/hybrid policies stay serial.
  parallel:
    enabled: false               # opt-in
    max_concurrency: 4
    resources: {gpu: 4}          # ResourcePool admits trials by their resource_cost
    submit_poll_interval_s: 0.5

telemetry:
  trace_path: traces/events.jsonl
  ledger_path: artifacts/results.tsv
  artifacts_dir: artifacts/runs
  versions_dir: artifacts/versions
  timeline_path: traces/timeline.json   # null disables; openable in chrome://tracing

CLI

uv run autoresearch-rl run config.yaml                     # run the loop
uv run autoresearch-rl validate config.yaml                # validate config
uv run autoresearch-rl status config.yaml --last 5         # check state (JSON)
uv run autoresearch-rl run-one config.yaml \
  --params '{"learning_rate": 5e-6}'                       # single iteration
uv run autoresearch-rl run-one config.yaml \
  --diff reward_improvement.patch                          # apply a code diff
uv run autoresearch-rl upload config.yaml \
  --repo user/my-security-judge                            # push best model to HF

Output

artifacts/results.tsv          # per-iteration scores + comparability metadata
artifacts/versions/v0001/      # kept iterations (versioned artifacts)
  version.json                 # params, metrics, model_dir path
artifacts/checkpoint.json      # resumable state
artifacts/runs/run-XXXX/
  progress.jsonl               # per-step emit_progress(...) reports
  control.json                 # cancel signal (only when guard fired)
  manifest-*.json              # per-iter snapshot
traces/events.jsonl            # structured event trace (proposals, progress, iterations, summary)
traces/timeline.json           # Chrome trace JSON (when telemetry.timeline_path set)
/data/models/v0001/            # trained model checkpoint (if model_output_dir set)

Reading the timeline. Open traces/timeline.json in chrome://tracing or ui.perfetto.dev to see per-iteration spans (policy.propose_batch, executor.execute), Basilica deployment phases (create_deployment, wait_ready, poll_for_metrics, download_model, cleanup), and LLM call latencies (llm.chat_completion with attempt counts and terminal status as args).

Model persistence. When model_output_dir is set in config, the framework injects AR_MODEL_DIR into each iteration. The training script saves the model there. On Basilica, the bootstrap HTTP server exposes /model/files (listing) and /model/download/<path> (file download). The controller downloads the model from the running container before cleanup. The best model's path is recorded in version.json.
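
On the training-script side, honoring AR_MODEL_DIR could look like the following sketch. The helper name, the weights.json file name, and the JSON serialization are hypothetical; a real script would call its framework's own save routine on the same directory.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def save_model(weights: dict, environ: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Save the model under AR_MODEL_DIR, or skip saving when it is unset."""
    model_dir = environ.get("AR_MODEL_DIR")
    if not model_dir:
        return None  # model_output_dir was not configured for this run
    out_dir = Path(model_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / "weights.json"  # hypothetical file name
    target.write_text(json.dumps(weights))
    return target
```

Returning None when the variable is absent lets the same train.py run unchanged in campaigns that do not persist models.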

After a campaign, push the best model to HuggingFace Hub:

uv run autoresearch-rl upload config.yaml --repo user/my-model

Progress chart

The chart script needs matplotlib, an optional dependency installed through the chart extra:

uv sync --extra dev --extra chart
uv run python scripts/progress_chart.py artifacts/results.tsv -o progress.png --direction min

Generates a Karpathy-style scatter plot: gray (discarded), green (kept), step function (running best). See examples/parallel-cancel-showcase/progress.png for an example.
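
The step function in such a chart is a running best over the per-iteration scores. The sketch below computes it without plotting; the function name is hypothetical, and it assumes an iteration counts as kept only when it strictly improves on the best score so far, which may not match the runner's exact keep rule.

```python
def running_best(scores: list[float], direction: str = "max") -> list[tuple[float, bool]]:
    """Return (best_so_far, kept) for each score, in iteration order."""
    if direction not in ("max", "min"):
        raise ValueError("direction must be 'max' or 'min'")
    best = None
    result = []
    for score in scores:
        if best is None:
            improved = True
        elif direction == "max":
            improved = score > best
        else:
            improved = score < best
        if improved:
            best = score
        result.append((best, improved))
    return result
```

Plotting the first element of each pair as a step line and coloring points by the second reproduces the gray/green layout described above.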

Architecture and design notes

docs/ARCHITECTURE.md — module-by-module walkthrough.
docs/research/ — RLix-adoption arc: comparison, plan, remediation, deferral notes, velocity log, end-to-end reports.
CHANGELOG.md — phase-by-phase change log.

License

This project is released under the MIT License — see LICENSE.

Release history

0.4.0

May 30, 2026

This version

0.3.0

May 29, 2026

Download files

Source Distribution

autoresearch_rl-0.3.0.tar.gz (173.3 kB view details)

Uploaded May 29, 2026 Source

Built Distribution

autoresearch_rl-0.3.0-py3-none-any.whl (127.0 kB view details)

Uploaded May 29, 2026 Python 3

File details

Details for the file autoresearch_rl-0.3.0.tar.gz.

File metadata

Download URL: autoresearch_rl-0.3.0.tar.gz
Upload date: May 29, 2026
Size: 173.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.2 {"installer":{"name":"uv","version":"0.11.2","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for autoresearch_rl-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`81e37be4c67e96b2f2bbb91327a42e283f08e2dfc7f36d18f75f3642a426179c`
MD5	`a4e403d7befcd7364bec2c2d1932c7d5`
BLAKE2b-256	`10ac095c498237b4892f033532afbe7d40ce8871fa5b036660ee8abce76c2793`

File details

Details for the file autoresearch_rl-0.3.0-py3-none-any.whl.

File metadata

Download URL: autoresearch_rl-0.3.0-py3-none-any.whl
Upload date: May 29, 2026
Size: 127.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.2 {"installer":{"name":"uv","version":"0.11.2","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for autoresearch_rl-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`70fd5d7e34862b81cfda3c8400b558542ec0ef16b112a734fd7c0b1ccd1da1aa`
MD5	`c5d39a5427794f0d12226ae66fd90f4e`
BLAKE2b-256	`d08c9213268a65f4e31db72b7a2a594e6c2dd32152d5614865ad2afa012cf1de`

autoresearch-rl 0.3.0
