Domain-agnostic autonomous optimization framework

anneal

Autonomous optimization for any measurable artifact.

Define what to improve, how to measure improvement, and what's allowed to change. An LLM agent handles the rest — generating hypotheses, running experiments, keeping winners, discarding losers, compounding learnings. Overnight. Unattended.

(Artifact, Eval, Agent) → continuous improvement

How It Works

Register a target: artifact files, evaluation command or criteria, scope constraints
Run the optimization loop: agent mutates → eval scores → keep or revert → learn → repeat
Review results: experiment history, score trajectories, cost tracking

Every experiment is git-committed. Every mutation is scope-enforced. Every decision is logged.

anneal init
anneal register \
  --name api-perf \
  --artifact src/api/handler.py \
  --eval-mode deterministic \
  --run-cmd "wrk -t4 -c100 -d10s http://localhost:8080/api | grep Latency" \
  --parse-cmd "awk '{print \$2}'" \
  --direction minimize \
  --scope scope.yaml

anneal run --target api-perf --experiments 50
anneal dashboard
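
The loop above can be sketched in a few lines of Python. This is a minimal illustration of a greedy keep-or-revert cycle, not anneal's actual runner: `mutate` and `evaluate` are hypothetical callables standing in for the LLM agent and the eval command, and git commits, scope checks, and statistical tests are left out.

```python
from typing import Callable


def optimize(
    artifact: str,
    mutate: Callable[[str], str],      # hypothetical: proposes a changed artifact
    evaluate: Callable[[str], float],  # hypothetical: returns a scalar score
    experiments: int,
    minimize: bool = True,
) -> tuple[str, float, list[dict]]:
    """Greedy loop: keep a candidate only if it strictly beats the best score."""
    best, best_score = artifact, evaluate(artifact)
    history: list[dict] = []
    for i in range(experiments):
        candidate = mutate(best)
        score = evaluate(candidate)
        improved = score < best_score if minimize else score > best_score
        if improved:
            best, best_score = candidate, score  # keep the winner
        # Otherwise the candidate is dropped, which plays the role of "revert".
        history.append({"experiment": i, "score": score, "kept": improved})
    return best, best_score, history
```

With a toy mutation that deletes one character and `len` as the eval, the loop shrinks the artifact until no further improvement is possible, and the history records which experiments were kept.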

Two Evaluation Modes

Deterministic — a shell command produces a number. Run code, parse output, compare.

eval: pytest --cov=src | grep TOTAL | awk '{print $4}' → 72.3

Stochastic — an LLM judges N samples against K binary criteria. Supports two comparison modes:

Majority voting (default) — 3 votes per judgment, binary YES/NO, Wilcoxon signed-rank test for comparison (with Cohen's d fallback for n < 6 paired samples)
Bradley-Terry — Bayesian Beta estimation with Laplace smoothing, calibrated uncertainty, early stopping when the 95% CI clears the 0.5 threshold (saves 1–8 API calls per criterion per sample)

Bootstrap CI provides variance estimates. Position debiasing splits votes between forward and reverse criterion orderings when votes ≥ 2.
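
The early-stopping idea in the Bradley-Terry mode can be illustrated with a Beta posterior over "the challenger wins a pairwise vote". The sketch below assumes a Laplace-smoothed Beta(wins + 1, losses + 1) posterior and estimates the 95% interval by Monte Carlo with the standard library. How anneal actually computes the interval is not stated in this README, and all names here are hypothetical.

```python
import random
from typing import Iterable


def beta_interval(
    wins: int, losses: int, level: float = 0.95, draws: int = 4000, seed: int = 0
) -> tuple[float, float]:
    """Monte Carlo credible interval for a Beta(wins + 1, losses + 1) posterior."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(wins + 1, losses + 1) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    return samples[int(tail * draws)], samples[int((1.0 - tail) * draws) - 1]


def compare_with_early_stop(
    votes: Iterable[bool], max_votes: int = 10
) -> tuple[str, int]:
    """Consume votes (True = challenger preferred) until the interval clears 0.5."""
    wins = losses = 0
    for used, challenger_won in enumerate(votes, start=1):
        wins += challenger_won
        losses += not challenger_won
        lo, hi = beta_interval(wins, losses)
        if lo > 0.5:
            return "challenger", used  # confident win, stop spending API calls
        if hi < 0.5:
            return "baseline", used    # confident loss
        if used >= max_votes:
            break
    return "undecided", wins + losses
```

A unanimous run of votes stops after a handful of calls, while an evenly split run uses the full budget and stays undecided. That difference is where the saved API calls come from.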

criteria:
  - "Is the text scannable?" (YES/NO)
  - "Are all claims cited?" (YES/NO)
samples: 10 test prompts × 4 criteria → score with confidence interval
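
To make the "score with confidence interval" line concrete, here is a small sketch of majority voting followed by a percentile bootstrap over samples. The data layout (`judgments[sample][criterion]` as a list of YES/NO votes) and the function names are assumptions for illustration, and the sketch does not include position debiasing.

```python
import random
from statistics import mean


def majority(votes: list[bool]) -> bool:
    """True when strictly more than half of the votes are YES."""
    return sum(votes) * 2 > len(votes)


def score_with_ci(
    judgments: list[list[list[bool]]],
    n_boot: int = 2000,
    seed: int = 0,  # fixed seed so repeated runs give the same interval
    level: float = 0.95,
) -> tuple[float, float, float]:
    """Return (score, ci_low, ci_high) for judgments[sample][criterion] votes."""
    # Per-sample score: fraction of criteria whose majority vote is YES.
    per_sample = [
        mean(1.0 if majority(votes) else 0.0 for votes in sample)
        for sample in judgments
    ]
    rng = random.Random(seed)
    # Percentile bootstrap: resample the samples with replacement.
    boots = sorted(
        mean(rng.choices(per_sample, k=len(per_sample))) for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    low = boots[int(tail * n_boot)]
    high = boots[int((1.0 - tail) * n_boot) - 1]
    return mean(per_sample), low, high
```

With 10 prompts and 4 criteria, `judgments` would hold 10 entries of 4 vote lists each, with 3 votes per list in the default majority mode.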

Where Anneal Works

Anneal works when (1) the artifact is a text file in a git repo, (2) quality is measurable as a scalar (or a per-criterion vector for Pareto optimization), and (3) the feedback loop completes in roughly 10 minutes. The slowest use cases in the table below run up to about 15 minutes per evaluation.

Use Case	Eval Mode	Feedback Speed	Verdict
Prompt / SKILL.md optimization	stochastic	1–3 min	Works perfectly
API response time	deterministic	2–5 min	Works well
Bundle size reduction	deterministic	1–3 min	Works well
ML training config	deterministic	5–15 min (proxy)	Works well
RAG retrieval prompts	deterministic	1–3 min	Works well
Documentation quality	stochastic	2–5 min	Works well
Multi-agent system prompts	stochastic	2–5 min	Works well
Guardrail / safety filter tuning	deterministic	1–2 min	Works well
Config tuning (build, infra, db)	deterministic	1–10 min	Works well
Eval rubric calibration	deterministic	1–2 min	Works well
Test coverage improvement	deterministic	2–5 min	Works well
Data preprocessing pipeline	deterministic	5–15 min	Works well

Where Anneal Does NOT Work

Use Case	Reason
Non-git projects / binary artifacts	Git worktrees + text diffs are the mutation mechanism.
Live system tuning (real-time metrics)	Evaluates at experiment end, not continuously.
Database schema migrations	Multi-step stateful operations, not file edits.
Cross-service distributed optimization	Targets are scoped to one repo, one worktree.
Embedding model selection	Re-embedding a corpus isn't a file edit. One-shot comparison, not iterative.
Inter-agent protocol changes	Requires coordinated edits across multiple files simultaneously.

Architecture

anneal/engine/
  runner.py            # Experiment state machine (mutate → eval → decide → log)
  eval.py              # Deterministic + stochastic eval, Bradley-Terry, position debiasing
  eval_cache.py        # Content-hash LRU cache for eval results
  search.py            # Greedy, simulated annealing (adaptive), population (crossover), Pareto
  bayesian.py          # GP surrogate model for mutation ranking (optional scikit-learn)
  strategy_selector.py # Thompson Sampling meta-strategy over search algorithms
  archive.py           # MAP-Elites quality-diversity archive
  agent.py             # LLM mutation (Claude Code subprocess or API mode)
  scope.py             # Editable/immutable enforcement with path traversal protection
  knowledge.py         # JSONL experiment store, TF-IDF/embedding retrieval, sliding window drift
  learning_pool.py     # Cross-condition/target/project knowledge transfer with domain filtering
  context.py           # Token budget assembly with per-criterion feedback formatting
  environment.py       # Git worktree management with fsck integrity checks
  safety.py            # Budget caps, failure limits, disk checks, process time-boxing
  client.py            # Multi-provider LLM client with configurable pricing (TOML overlay)
  scheduler.py         # Sequential target scheduler with stale lock recovery
  taxonomy.py          # Failure classification: LLM-based categorization, distribution, blind spots
  tree_search.py       # UCB tree search: backtracking, pruning, persistence, history bootstrap
  policy_agent.py      # Policy agent: continuous instruction rewriting, reward tracking
  registry.py          # Target configuration (config.toml persistence)
  dashboard.py         # File-based SSE live dashboard

Key Features

Core

Scope enforcement — declare what the agent can and cannot modify. Path traversal attempts and absolute paths are rejected. Violations are reverted automatically.
Knowledge compounding — experiment history + consolidated learnings + cross-condition insights available for agent context. Per-criterion feedback (PASS/FAIL per criterion) helps the agent target specific weaknesses.
Cost control — per-experiment and daily budget caps. Pricing loaded from ~/.anneal/pricing.toml with hardcoded defaults. Local models tracked at $0.
Safety — process group time-boxing (SIGKILL), consecutive failure halting, disk space checks, JSONL corruption recovery, git fsck integrity checks after kill recovery.
Graceful shutdown — anneal stop --target <id> writes a stop file; the runner exits cleanly after the current experiment.
Verification gates — binary pass/fail commands that run after scope enforcement, before eval. Discard mutations that fail structural checks without spending eval budget. Stderr captured for diagnosis.
Failure taxonomy — LLM-based classification of failed experiments into structured categories (output_format, logic_error, regression, etc.) with blind spot detection for unattributed failure modes.
Multi-draft mutation — generate N candidate mutations per cycle with temperature variation. Per-draft verifier pruning selects the best survivor. Budget split evenly across drafts.
Random restart — probabilistic fresh-start experiments that escape local optima. SA temperature-linked decay reduces restart probability as search converges.
Policy agent — continuous meta-optimizer that rewrites mutation instructions between experiments based on failure patterns. Complements the plateau-triggered program.md rewriting at a faster cadence (~$0.001/call).
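
Scope enforcement, the first item in the list above, comes down to a path check before any mutation is accepted. The sketch below assumes the scope is expressed as glob patterns for editable and immutable files. The format of `scope.yaml` is not shown in this README, so the pattern style and the function name are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_in_scope(path: str, editable: list[str], immutable: list[str]) -> bool:
    """Allow a repo-relative path only if it is editable and not immutable."""
    p = PurePosixPath(path)
    # Reject absolute paths and any attempt to climb out with "..".
    if p.is_absolute() or ".." in p.parts:
        return False
    rel = p.as_posix()
    # Immutable patterns win over editable ones.
    if any(fnmatchcase(rel, pattern) for pattern in immutable):
        return False
    return any(fnmatchcase(rel, pattern) for pattern in editable)
```

A mutation touching any path for which this returns `False` would be treated as a violation and reverted.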

Statistical Rigor

Wilcoxon signed-rank test for stochastic comparison with minimum sample size guard (n ≥ 6); falls back to Cohen's d effect-size threshold for small samples.
Holm-Bonferroni correction adjusts acceptance threshold across the consolidation window, reducing false positive rate by ~86% on null distributions.
Bootstrap confidence intervals with deterministic seeding (float precision normalized for cross-platform reproducibility).
Held-out evaluation with two-tier divergence detection: 10% warning, 25% critical (possible evaluator compromise).
Sliding window drift detection compares first-half vs second-half variance within consolidation windows to catch temporal drift.
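
For readers unfamiliar with the Holm-Bonferroni correction mentioned above, this is the standard step-down procedure in plain Python. How anneal applies it across a consolidation window is not detailed here, so treat the function as a generic illustration rather than the project's code.

```python
def holm_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return, for each p-value, whether its null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject
```

With p-values `[0.01, 0.04, 0.03, 0.005]` and `alpha=0.05`, only the first and last are rejected: 0.005 passes 0.0125, 0.01 passes 0.0167, and 0.03 fails 0.025, which stops the procedure.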

Search Strategies

Greedy (default) — accept only strict improvements, verified by statistical test.
Simulated annealing — adaptive cooling with reheat when acceptance drops below target rate. Escapes local optima early, converges to greedy behavior over time.
Population-based — tournament selection with LLM-guided crossover. Top candidates' hypotheses are combined into crossover prompts.
Pareto — multi-objective search over per-criterion score vectors. Maintains a Pareto front; non-dominated trade-off solutions are preserved.
Thompson Sampling — contextual bandit meta-strategy that adaptively selects between search algorithms based on observed reward.
Bayesian surrogate — Gaussian Process model predicts mutation quality from experiment history. Expected Improvement acquisition balances exploration and exploitation. Requires optional scikit-learn dependency.
UCB tree search — maps experiment history to a tree of git commits. Selects the most promising ancestor to branch from via UCB1 (balancing exploitation and exploration). Supports subtree pruning and crash recovery via JSON persistence.
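
The simulated annealing strategy above can be illustrated with the classic Metropolis acceptance rule plus a simple adaptive schedule. The target rate, cooling factor, and reheat factor below are made-up values chosen for the example, and the function names are hypothetical.

```python
import math
import random


def accept(delta: float, temperature: float, rng: random.Random) -> bool:
    """Metropolis rule; delta > 0 means the candidate improved the score."""
    if delta > 0:
        return True
    if temperature <= 0:
        return False  # at zero temperature this reduces to greedy search
    return rng.random() < math.exp(delta / temperature)


def next_temperature(
    temperature: float,
    acceptance_rate: float,
    target_rate: float = 0.2,  # assumed value
    cooling: float = 0.95,     # assumed value
    reheat: float = 1.5,       # assumed value
) -> float:
    """Cool by default; reheat when too few candidates are being accepted."""
    if acceptance_rate < target_rate:
        return temperature * reheat
    return temperature * cooling
```

At high temperature, regressions are often accepted, which lets the search leave a local optimum. As the temperature falls, the rule accepts fewer regressions and approaches greedy behavior.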

Evaluation Intelligence

Per-criterion structured feedback — agents see which criteria passed/failed, not just the aggregate score.
Position debiasing — when votes ≥ 2, splits between forward and reverse criterion orderings to cancel LLM judge position bias at zero additional API cost.
Bradley-Terry comparison — calibrated Bayesian strength estimation with early stopping, replacing binary majority voting when configured.
Eval result caching — content-hash LRU cache avoids re-evaluating identical artifact content + criteria combinations.
Multi-fidelity pipeline — cheap deterministic stages filter out bad mutations before expensive stochastic evaluation. Constraint pre-checks also run before eval.

Knowledge System

TF-IDF similarity retrieval — IDF-weighted cosine similarity replaces word-level Jaccard for hypothesis matching.
Embedding-based retrieval (optional) — sentence-transformer embeddings with lazy loading and TF-IDF fallback when sentence-transformers is not installed.
Domain-aware learning transfer — cross-domain learnings are penalized by a configurable factor, preventing negative transfer between unrelated optimization targets.
Criterion delta exposure — learning summaries show per-criterion improvements/regressions, not just aggregate score deltas.
MAP-Elites archive — quality-diversity archive maintaining best solutions per behavioral region, enabling warm-starting and trade-off exploration.
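
As an illustration of the TF-IDF retrieval item above, the following sketch ranks past hypotheses against a new one using IDF-weighted cosine similarity. It uses whitespace tokenization and a smoothed IDF formula chosen for simplicity; neither is confirmed to match what `knowledge.py` does, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build sparse TF-IDF vectors with a smoothed IDF (an assumed formula)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query: str, history: list[str]) -> int:
    """Index of the past hypothesis closest to the query."""
    vectors = tfidf_vectors(history + [query])
    scores = [cosine(vectors[-1], v) for v in vectors[:-1]]
    return max(range(len(history)), key=scores.__getitem__)
```

Compared with word-level Jaccard, the IDF weighting gives rare terms more influence than words that appear in almost every hypothesis.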

Operations

Meta-optimization — two complementary timescales: (1) policy agent rewrites mutation instructions every N experiments (continuous, ~$0.001/call), (2) plateau-triggered program.md rewriting when M consecutive experiments fail (episodic).
Stale lock recovery — scheduler detects and removes lock files older than 1 hour from crashed runners.
Concurrent consolidation safety — check-and-act consolidation is atomic under FileLock.
Live dashboard — anneal dashboard reads from .anneal/ directory. No coupling to the runner process.

Installation

# From PyPI
uv tool install anneal-cli

# With ML extras (Bayesian surrogate, optional)
uv tool install anneal-cli --with scikit-learn

# Or with pip
pip install anneal-cli

Requires Python 3.12+. The anneal command is available globally after installation.

Quick Start

# Initialize in a git repo
anneal init

# Register a deterministic target
anneal register \
  --name my-target \
  --artifact path/to/file.py \
  --eval-mode deterministic \
  --run-cmd "python benchmark.py" \
  --parse-cmd "grep 'score' | awk '{print \$2}'" \
  --direction maximize \
  --scope scope.yaml

# Alternatively, register the target with verification gates, random restart, multi-draft mutation, and a policy agent
anneal register \
  --name my-target \
  --artifact path/to/file.py \
  --eval-mode deterministic \
  --run-cmd "python benchmark.py" \
  --parse-cmd "grep 'score' | awk '{print \$2}'" \
  --direction maximize \
  --scope scope.yaml \
  --verifier "typecheck:python -m mypy path/to/file.py" \
  --verifier "lint:ruff check path/to/file.py" \
  --restart-probability 0.05 \
  --n-drafts 3 \
  --policy-model gpt-4.1-mini

# Run 20 experiments
anneal run --target my-target --experiments 20

# Stop gracefully
anneal stop --target my-target

# Monitor
anneal status --target my-target
anneal history --target my-target
anneal dashboard --open

Testing

# Run all tests (492 tests); -x stops at the first failure, -q keeps output quiet
uv run pytest tests/ -x -q

# Run with coverage
uv run pytest tests/ --cov=anneal --cov-report=term-missing

# Run e2e tests only
uv run pytest tests/test_e2e.py -v

# Run validation benchmarks
uv run python benchmarks/bench_false_positives.py
uv run python benchmarks/bench_sa_convergence.py
uv run python benchmarks/bench_retrieval_precision.py

Project Status

492 tests passing. 3 validation benchmarks passing.

Complete

Core engine (git worktrees, scope enforcement, registry, agent invoker, eval engine, runner state machine)
Production hardening (safety layer, knowledge store, learning pool, notifications, JSONL recovery)
Multi-target orchestration, context budget assembly, rate limiting, background daemon
Search strategies: greedy, simulated annealing (adaptive), population (crossover), Pareto, Thompson Sampling, Bayesian surrogate
Statistical rigor: Wilcoxon guard, Holm-Bonferroni correction, criterion name tracking, divergence thresholds
Evaluation intelligence: per-criterion feedback, position debiasing, Bradley-Terry comparison, eval caching, multi-fidelity pipeline
Knowledge upgrades: TF-IDF retrieval, embedding-based retrieval (optional), domain-aware transfer, criterion delta summaries
Operational hardening: anneal stop, git fsck, constraint pre-check ordering, pricing externalization
Quality-diversity archive (MAP-Elites), stale lock recovery, concurrent consolidation safety
File-based live dashboard, deployment-tier approval gates, meta-optimization
End-to-end test suite, validation benchmark suite
Research-driven enhancements: verification gates, failure taxonomy, multi-draft mutation, random restart, UCB tree search, policy agent

Planned

Adaptive draft count — auto-adjust n_drafts based on per-draft survival rate
Population immigration — restart mutations enter population search via tournament selection
Cross-enhancement runner integration tests for multi-draft + tree search + policy agent running simultaneously

Project details

Release history

0.4.0: Apr 1, 2026
0.3.0: Apr 1, 2026
0.2.0 (this version): Mar 24, 2026
0.1.1: Mar 22, 2026
0.1.0: Mar 22, 2026

Download files

Source Distribution

anneal_cli-0.2.0.tar.gz (169.2 kB)

Uploaded Mar 24, 2026 Source

Built Distribution

anneal_cli-0.2.0-py3-none-any.whl (124.7 kB)

Uploaded Mar 24, 2026 Python 3

File details

Details for the file anneal_cli-0.2.0.tar.gz.

File metadata

Download URL: anneal_cli-0.2.0.tar.gz
Upload date: Mar 24, 2026
Size: 169.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.14

File hashes

Hashes for anneal_cli-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`a0de2e432365657d30fd9f36486412712ca5434c3ae2c0c60b79b8c2ce85f9fc`
MD5	`9c992896d6aa01288896311d76256593`
BLAKE2b-256	`81cb98485a977157294f2c284266d83de5d9b38a8b9a01cc92ddf89b3b0722db`

File details

Details for the file anneal_cli-0.2.0-py3-none-any.whl.

File metadata

Download URL: anneal_cli-0.2.0-py3-none-any.whl
Upload date: Mar 24, 2026
Size: 124.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.14

File hashes

Hashes for anneal_cli-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a78961e18b2e82225e4dbbaa140017c29324349c09315e6320b8d2c26894bbcd`
MD5	`7ababefdd1a04eaa26df1b3002bdc782`
BLAKE2b-256	`10bf61bde535491c04cf9e8805ea720fc92183cf5abbeabe7fe012661a992178`
