Continuous autoresearch RL runner: LLM-driven hyperparameter and code search for fine-tuning on cloud GPUs.

AutoResearch-RL

Autonomous ML experiment loop. An LLM proposes hyperparameters or code changes; the loop trains on a local or cloud GPU (Basilica), evaluates, keeps or discards the result, and repeats.

prepare.py  -->  [data]  -->  train.py  -->  [metrics]  -->  keep/discard  -->  repeat
 (frozen)                     (mutable)       eval_score       |
                                  ^                            |
                                  |     LLM proposes next      |
                                  +------- params or diff -----+

Quickstart

uv sync --extra dev
uv run autoresearch-rl run examples/minimal-trainable-target/config.yaml

Common workflows are wrapped in a Makefile:

make help       # list targets
make check      # lint + typecheck + full tests (~95 s)
make test-fast  # tests excluding the slow integration suite (~30 s)
make showcase   # run examples/parallel-cancel-showcase end-to-end

The Two Scripts

Every experiment has two scripts connected by the filesystem, never by imports:

prepare.py (frozen) -- runs once via prepare_cmd. Produces data files, defines the evaluation protocol (answer extraction, reward computation). The LLM cannot modify this file. It is the trust boundary: evaluation integrity is guaranteed by freezing it.

train.py (mutable) -- runs each iteration. Reads the prepared data, trains the model, prints metrics to stdout. The LLM can modify this file in llm_diff or hybrid mode. This is where the training algorithm, reward function, optimizer, and generation strategy live. When hyperparameter tuning stalls, the LLM proposes code diffs to train.py -- improving the reward function, adding gradient accumulation, changing the sampling strategy -- autonomously.

The boundary is deliberate: prepare.py owns "what is correct" (data, evaluation), train.py owns "how to get there" (training algorithm, reward shaping). The LLM can evolve the "how" but never redefine the "what".
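
A minimal sketch of the filesystem contract between the two scripts, written as two functions so it runs in one file. The file name eval.json, the toy arithmetic task, and the `eval_score=...` output line are assumptions for illustration, not the project's actual format.

```python
import json
from pathlib import Path


def prepare(data_dir: Path) -> None:
    """Frozen side: write the data and the expected answers to disk once."""
    examples = [
        {"question": "2+3", "answer": 5},
        {"question": "4+4", "answer": 8},
    ]
    (data_dir / "eval.json").write_text(json.dumps(examples))


def train(data_dir: Path) -> float:
    """Mutable side: read the prepared data, 'train', and print metrics to stdout."""
    examples = json.loads((data_dir / "eval.json").read_text())
    # Stand-in for a real model: add the operands of each question.
    correct = sum(
        1
        for ex in examples
        if sum(map(int, ex["question"].split("+"))) == ex["answer"]
    )
    eval_score = correct / len(examples)
    print(f"eval_score={eval_score:.4f}")  # assumed metric line format
    return eval_score
```

The two functions share nothing but the directory path, which mirrors the "connected by the filesystem, never by imports" rule: changing `train` cannot alter what `prepare` wrote.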

How it works

Targets. Where training runs: locally (command), against a remote API (http), or on Basilica GPU cloud (basilica). Same config, different target.type.

Policies. How the next experiment is chosen:

| Policy | What it proposes | When to use |
| --- | --- | --- |
| grid | Exhaustive param combinations | Small spaces, baselines |
| random | Uniform random params | Large spaces, baselines |
| llm | LLM-guided params from history | Medium spaces, fast convergence |
| llm_diff | Code diffs to train.py | Algorithmic improvements |
| hybrid | Params first, code diffs when stalled | Best of both worlds |
| learned | PPO-based policy with trajectory feedback | Long campaigns |
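
To make the two baseline policies concrete, here is a rough sketch of grid and random proposals over a discrete search space. The function names `grid_proposals` and `random_proposal` are hypothetical and are not taken from the package.

```python
import itertools
import random
from typing import Any, Dict, Iterator, List


def grid_proposals(space: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield every combination of values in the search space."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def random_proposal(space: Dict[str, List[Any]], rng: random.Random) -> Dict[str, Any]:
    """Pick each parameter uniformly at random from its candidate list."""
    return {key: rng.choice(values) for key, values in space.items()}


space = {"learning_rate": [3e-6, 5e-6, 1e-5], "batch_size": [8, 16]}
all_grid = list(grid_proposals(space))  # 3 * 2 = 6 combinations
one_random = random_proposal(space, random.Random(0))
```

Grid cost grows multiplicatively with each added parameter, which is why the table recommends it only for small spaces.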

Hybrid mode is the most powerful: it starts with param exploration (find the right learning rate and batch size), then when the no-improvement streak hits stall_threshold, it switches to code diffs. The LLM reads train.py, program.md (task guidance), and the full experiment history, then proposes a unified diff. If the diff fails validation, the error is sent back for correction (up to 2 retries). If diff proposals fail consecutively, it falls back to param mode.
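
The switching rule described above can be sketched as a small state machine. This is an illustration under assumptions: `max_diff_failures` is a hypothetical name for the consecutive-failure limit (the text does not give its value), and resetting the streak on fallback is one plausible choice, not a confirmed behaviour.

```python
from dataclasses import dataclass


@dataclass
class HybridState:
    stall_threshold: int = 3
    max_diff_failures: int = 2  # hypothetical knob, not from the docs
    no_improve_streak: int = 0
    diff_failures: int = 0

    def next_mode(self) -> str:
        """Return 'diff' once the no-improvement streak reaches the threshold."""
        if self.no_improve_streak >= self.stall_threshold:
            return "diff"
        return "param"

    def record(self, improved: bool, diff_failed: bool = False) -> None:
        """Update counters after one iteration."""
        if improved:
            self.no_improve_streak = 0
            self.diff_failures = 0
            return
        self.no_improve_streak += 1
        if not diff_failed:
            self.diff_failures = 0
            return
        self.diff_failures += 1
        if self.diff_failures >= self.max_diff_failures:
            # Fall back to param mode: give parameter search a fresh window.
            self.no_improve_streak = 0
            self.diff_failures = 0
```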

Stop guards. Wall time, max iterations, no-improvement streak, and failure rate (cancelled iterations do not count as failures).
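
A sketch of how the four guards might be evaluated after each iteration. The field names other than no_improve_limit, and the choice to keep cancelled iterations in the denominator as non-failures, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StopGuards:
    max_wall_time_s: float
    max_iterations: int
    no_improve_limit: int
    max_failure_rate: float


def stop_reason(
    guards: StopGuards,
    elapsed_s: float,
    statuses: List[str],
    no_improve_streak: int,
) -> Optional[str]:
    """Return the name of the first guard that fires, or None to continue."""
    if elapsed_s >= guards.max_wall_time_s:
        return "wall_time"
    if len(statuses) >= guards.max_iterations:
        return "max_iterations"
    if no_improve_streak >= guards.no_improve_limit:
        return "no_improvement"
    # Only 'failed' counts as a failure; 'cancelled' is a graceful early-out.
    if statuses and statuses.count("failed") / len(statuses) > guards.max_failure_rate:
        return "failure_rate"
    return None
```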

Checkpoint/resume. State persisted after every iteration. Survives crashes and restarts.
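
One common way to make per-iteration state survive a crash is to write to a temporary file and rename it over the checkpoint. The sketch below shows that pattern; the state fields are hypothetical, and the package's own checkpoint format may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state atomically so a crash never leaves a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, path)  # atomic rename on the same filesystem


def load_checkpoint(path: Path) -> dict:
    """Return saved state, or a fresh state when no checkpoint exists."""
    if not path.exists():
        return {"iteration": 0, "best_score": None}  # hypothetical fields
    return json.loads(path.read_text())
```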

Cooperative cancellation (controller.intra_iteration_cancel.enabled). The trial imports emit_progress from autoresearch_rl.target.progress and calls it once per step; the engine drains these progress reports and runs them through the power-law forecaster. When the forecast shows a trial cannot beat the current best, the engine writes a control file, and the trial's next emit_progress call exits with code 42. The iteration's status becomes cancelled (a graceful early-out, distinct from failed).
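
The trial side of this protocol can be illustrated with a stand-in function. `emit_progress_sketch` is hypothetical and is not the library's emit_progress; the record layout and the "control file exists means cancel" rule are assumptions, while the file names follow the Output section and the exit code follows the text above.

```python
import json
import sys
from pathlib import Path

CANCEL_EXIT_CODE = 42  # exit code named in the text above


def emit_progress_sketch(run_dir: Path, step: int, step_target: int, metrics: dict) -> None:
    """Append one progress report, then exit if the engine asked to cancel."""
    record = {"step": step, "step_target": step_target, "metrics": metrics}
    with (run_dir / "progress.jsonl").open("a") as f:
        f.write(json.dumps(record) + "\n")
    # Assumed protocol: the mere presence of control.json means "cancel".
    if (run_dir / "control.json").exists():
        sys.exit(CANCEL_EXIT_CODE)
```

Because the check happens inside the progress call, the trial stops only at a step boundary, which is what makes the cancellation cooperative rather than a forced kill.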

Parallel iterations (controller.parallel.enabled). K trials run concurrently inside a ThreadPoolExecutor, admitted by a resource pool. Diff and hybrid policies stay serial — k concurrent diffs would fight the contract. LLMParamPolicy.propose_batch issues ONE chat call asking for k diverse proposals (vs k independent calls). Reward feedback to learnable policies is buffered and drained in submission order so PPO sees a stable trial-time sequence.
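
A simplified sketch of the same idea: trials run in a ThreadPoolExecutor, a resource pool gates admission, and results are read back in submission order. The `ResourcePool` class here is a hypothetical stand-in written for this example, and the trial body is a toy score function.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class ResourcePool:
    """Blocks a trial until enough units (e.g. GPUs) are free."""

    def __init__(self, capacity: int) -> None:
        self._available = capacity
        self._cond = threading.Condition()

    def acquire(self, cost: int) -> None:
        # Note: a cost larger than the capacity would wait forever.
        with self._cond:
            self._cond.wait_for(lambda: self._available >= cost)
            self._available -= cost

    def release(self, cost: int) -> None:
        with self._cond:
            self._available += cost
            self._cond.notify_all()


def run_trial(pool: ResourcePool, params: Dict[str, float], cost: int = 1) -> float:
    pool.acquire(cost)
    try:
        return -abs(params["learning_rate"] - 5e-6)  # toy score, not real training
    finally:
        pool.release(cost)


def run_batch(proposals: List[Dict[str, float]], max_concurrency: int = 4, gpus: int = 4) -> List[float]:
    pool = ResourcePool(gpus)
    with ThreadPoolExecutor(max_workers=max_concurrency) as executor:
        futures = [executor.submit(run_trial, pool, p) for p in proposals]
        # Iterating the futures list (not as_completed) preserves submission order.
        return [f.result() for f in futures]
```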

Timeline export (telemetry.timeline_path). Writes a Chrome-trace JSON file openable directly in chrome://tracing or ui.perfetto.dev. Spans: policy.propose_batch, executor.execute, llm.chat_completion, all basilica.* phases.
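
For readers unfamiliar with the Chrome trace format, the sketch below writes "complete" events (`"ph": "X"`, timestamps and durations in microseconds) that both viewers accept. The `Timeline` class is a hypothetical illustration; only the span names come from the list above.

```python
import json
import threading
import time
from contextlib import contextmanager


class Timeline:
    """Collects spans and dumps them as Chrome trace JSON."""

    def __init__(self) -> None:
        self.events = []

    @contextmanager
    def span(self, name: str, **args):
        start = time.perf_counter()
        try:
            yield
        finally:
            end = time.perf_counter()
            self.events.append({
                "name": name,
                "ph": "X",                      # complete event: start + duration
                "ts": start * 1e6,              # microseconds
                "dur": (end - start) * 1e6,     # microseconds
                "pid": 1,
                "tid": threading.get_ident(),
                "args": args,
            })

    def dump(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump({"traceEvents": self.events}, f)
```

Usage would look like `with timeline.span("executor.execute", iteration=3): ...` around the code being timed, followed by one `dump` call at the end of the run.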

Diff guardrails (policy.required_calls, default ["emit_progress"]). The diff validator AST-walks the post-patch source and rejects any diff that strips a required call. Used to keep load-bearing instrumentation intact across LLM-proposed code changes.
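
The check can be approximated with the standard ast module. This is a sketch of the technique, not the project's validator: the function names are hypothetical, and it only matches calls by bare name or attribute name.

```python
import ast
from typing import Iterable, List, Set


def called_names(source: str) -> Set[str]:
    """Collect the names of all functions called anywhere in the source."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):         # emit_progress(...)
                names.add(func.id)
            elif isinstance(func, ast.Attribute):  # progress.emit_progress(...)
                names.add(func.attr)
    return names


def missing_required_calls(
    post_patch_source: str,
    required_calls: Iterable[str] = ("emit_progress",),
) -> List[str]:
    """Return required calls that no longer appear after the patch."""
    present = called_names(post_patch_source)
    return [name for name in required_calls if name not in present]
```

Walking the AST rather than searching the text means a call that survives only inside a comment or string does not satisfy the guardrail.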

Runtime config validation runs on every validate and run command. Eight checks cover reserved env-var prefixes, missing files / API keys / GPU models, unwritable dirs, budget alignment, and the presence of an emit_progress call when intra-iteration cancel is enabled. Blocking errors exit with code 2 before any trial starts.
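
As an illustration, three of these checks might look like the sketch below. The function names, the flat `policy` mapping, and the plain-text search for emit_progress are assumptions; the real checks are not shown in this README.

```python
from pathlib import Path
from typing import List, Mapping

BLOCKING_EXIT_CODE = 2  # exit code named in the text above


def validate_runtime(
    policy: Mapping[str, str],
    cancel_enabled: bool,
    env: Mapping[str, str],
) -> List[str]:
    """Return blocking error messages; an empty list means the config passes."""
    errors: List[str] = []
    key_env = policy.get("llm_api_key_env")
    if key_env and not env.get(key_env):
        errors.append(f"environment variable {key_env} is not set")
    mutable = Path(policy.get("mutable_file", "train.py"))
    if not mutable.is_file():
        errors.append(f"missing file: {mutable}")
    elif cancel_enabled and "emit_progress" not in mutable.read_text():
        errors.append(f"{mutable} never calls emit_progress but cancel is enabled")
    return errors


def exit_code(errors: List[str]) -> int:
    """Map validation results to a process exit code."""
    return BLOCKING_EXIT_CODE if errors else 0
```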

Examples

| Example | Policy | Task |
| --- | --- | --- |
| minimal-trainable-target | llm_diff | Deterministic toy (no GPU) |
| parallel-cancel-showcase | random | End-to-end demo: parallel + cancel + timeline + config validation (no GPU, ~13 s) |
| autoresearch-like | llm_diff | Synthetic training loop |
| basilica-grpo | hybrid | GRPO post-training: Qwen2.5-0.5B on GSM8K |
| deberta-prompt-injection | hybrid | DeBERTa security classifier |

Each example contains: config.yaml, prepare.py, train.py, program.md, deploy.py, Dockerfile, run.sh, README.md.

Config

target:
  prepare_cmd: ["python3", "prepare.py"]   # frozen: runs once, produces data
  train_cmd: ["python3", "train.py"]       # mutable: runs each iteration
  type: basilica                           # or: command, http

policy:
  type: hybrid                             # param search -> code diffs on stall
  params:                                  # search space for param mode
    learning_rate: [3e-6, 5e-6, 1e-5]
  mutable_file: train.py                   # LLM can modify this in diff mode
  frozen_file: prepare.py                  # LLM cannot modify this
  program_file: program.md                 # task guidance for the LLM
  llm_api_url: "https://llm.chutes.ai/v1"
  llm_model: "deepseek-ai/DeepSeek-V3-0324"
  llm_api_key_env: "CHUTES_API_KEY"

objective:
  metric: eval_score
  direction: max

controller:
  checkpoint_path: artifacts/checkpoint.json
  no_improve_limit: 10

  # Optional: cancel doomed trials mid-flight via the power-law forecaster.
  # Trial must call emit_progress(step=, step_target=, metrics=...) per step.
  intra_iteration_cancel:
    enabled: false               # opt-in
    min_steps: 5                 # don't cancel before this many trial steps
    poll_interval_s: 5.0         # how often the guard re-evaluates
    min_reports_before_decide: 5 # need at least this many progress reports

  # Optional: run K iterations concurrently. Diff/hybrid policies stay serial.
  parallel:
    enabled: false               # opt-in
    max_concurrency: 4
    resources: {gpu: 4}          # ResourcePool admits trials by their resource_cost
    submit_poll_interval_s: 0.5

telemetry:
  trace_path: traces/events.jsonl
  ledger_path: artifacts/results.tsv
  artifacts_dir: artifacts/runs
  versions_dir: artifacts/versions
  timeline_path: traces/timeline.json   # null disables; openable in chrome://tracing
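
To show how the intra_iteration_cancel settings above could interact with a power-law forecast, here is a rough sketch. It assumes a minimised, strictly positive metric modelled as value = a * step^(-b) and fitted by least squares in log-log space; the README does not specify the forecaster's actual form, so treat every name here as hypothetical.

```python
import math
from typing import List, Tuple


def fit_power_law(steps: List[int], values: List[float]) -> Tuple[float, float]:
    """Fit value = a * step**(-b) via linear regression on (log step, log value)."""
    xs = [math.log(s) for s in steps]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct steps
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def should_cancel(
    reports: List[Tuple[int, float]],
    step_target: int,
    best_so_far: float,
    min_steps: int = 5,
    min_reports_before_decide: int = 5,
) -> bool:
    """Cancel when the forecast at step_target cannot beat the current best."""
    if len(reports) < min_reports_before_decide or reports[-1][0] < min_steps:
        return False  # too early to judge
    steps = [s for s, _ in reports]
    values = [v for _, v in reports]
    a, b = fit_power_law(steps, values)
    forecast = a * step_target ** (-b)
    return forecast > best_so_far  # lower is better in this sketch
```

The two minimum thresholds exist because a fit on very few early points is noisy; waiting avoids cancelling a trial that merely started slowly.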

CLI

uv run autoresearch-rl run config.yaml                     # run the loop
uv run autoresearch-rl validate config.yaml                # validate config
uv run autoresearch-rl status config.yaml --last 5         # check state (JSON)
uv run autoresearch-rl run-one config.yaml \
  --params '{"learning_rate": 5e-6}'                       # single iteration
uv run autoresearch-rl run-one config.yaml \
  --diff reward_improvement.patch                          # apply a code diff
uv run autoresearch-rl upload config.yaml \
  --repo user/my-security-judge                            # push best model to HF

Output

artifacts/results.tsv          # per-iteration scores + comparability metadata
artifacts/versions/v0001/      # kept iterations (versioned artifacts)
  version.json                 # params, metrics, model_dir path
artifacts/checkpoint.json      # resumable state
artifacts/runs/run-XXXX/
  progress.jsonl               # per-step emit_progress(...) reports
  control.json                 # cancel signal (only when guard fired)
  manifest-*.json              # per-iter snapshot
traces/events.jsonl            # structured event trace (proposals, progress, iterations, summary)
traces/timeline.json           # Chrome trace JSON (when telemetry.timeline_path set)
/data/models/v0001/            # trained model checkpoint (if model_output_dir set)

Reading the timeline. Open traces/timeline.json in chrome://tracing or ui.perfetto.dev to see per-iteration spans (policy.propose_batch, executor.execute), Basilica deployment phases (create_deployment, wait_ready, poll_for_metrics, download_model, cleanup), and LLM call latencies (llm.chat_completion with attempt counts and terminal status as args).

Model persistence. When model_output_dir is set in config, the framework injects AR_MODEL_DIR into each iteration. The training script saves the model there. On Basilica, the bootstrap HTTP server exposes /model/files (listing) and /model/download/<path> (file download). The controller downloads the model from the running container before cleanup. The best model's path is recorded in version.json.
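
On the training-script side, this amounts to reading AR_MODEL_DIR and writing the model there. The sketch below uses a JSON file as a stand-in for real weights; the helper name and the weights.json file name are hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def save_model_if_requested(
    weights: dict,
    env: Mapping[str, str] = os.environ,
) -> Optional[Path]:
    """Save to AR_MODEL_DIR when the framework injected it; otherwise skip."""
    model_dir = env.get("AR_MODEL_DIR")
    if not model_dir:
        return None  # model_output_dir not set in config
    out_dir = Path(model_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "weights.json"  # hypothetical file name
    path.write_text(json.dumps(weights))
    return path
```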

After a campaign, push the best model to HuggingFace Hub:

uv run autoresearch-rl upload config.yaml --repo user/my-model

Progress chart

Optional dependency (matplotlib) under the chart extra:

uv sync --extra dev --extra chart
uv run python scripts/progress_chart.py artifacts/results.tsv -o progress.png --direction min

Generates a Karpathy-style scatter plot: gray (discarded), green (kept), step function (running best). See examples/parallel-cancel-showcase/progress.png for an example.
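
The step function in that plot is just the running best over iterations. A minimal sketch of the computation, independent of matplotlib and of the actual results.tsv columns (the function name is hypothetical):

```python
from typing import List


def running_best(scores: List[float], direction: str = "max") -> List[float]:
    """Return the best score seen so far at each iteration."""
    if direction not in ("max", "min"):
        raise ValueError("direction must be 'max' or 'min'")
    pick = max if direction == "max" else min
    best_so_far: List[float] = []
    for score in scores:
        best = score if not best_so_far else pick(best_so_far[-1], score)
        best_so_far.append(best)
    return best_so_far
```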

Architecture and design notes

License

This project is released under the MIT License — see LICENSE.
