crucible
Traditional Chinese | English
A general-purpose autonomous experiment platform. Define what to edit, how to run, and what to measure — then let an LLM agent iterate indefinitely to optimize your metric.
Prerequisites
- Python 3.10+
- uv — Python package manager
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# or via Homebrew
brew install uv
- Git — the platform uses git for version control of experiments
- Claude Code — the claude CLI must be installed and authenticated:

# Install
npm install -g @anthropic-ai/claude-code

# Authenticate (follow the prompts)
claude
Install
# Install as a global CLI tool
uv tool install autocrucible
# Or install from a local clone
git clone https://github.com/suzuke/crucible.git
uv tool install ./crucible
Verify:
crucible --help
Updating
# From PyPI
uv tool install autocrucible --force
# From local source (after pulling changes)
uv tool install ./crucible --force
For development
git clone https://github.com/suzuke/crucible.git
cd crucible
uv sync # install in local .venv
uv run crucible --help # run from source
uv run pytest # run tests
Quick Start
1. Create a project
From an example:
# List available examples
crucible new . --list
# Create from example
crucible new ~/my-experiment -e optimize-sorting
cd ~/my-experiment
crucible init --tag run1 # auto git-init if needed
Using the wizard (AI-generated scaffold):
crucible wizard ~/my-experiment --describe "Train an AlphaZero Gomoku agent using NN and MCTS"
cd ~/my-experiment
crucible init --tag run1
The wizard analyzes your description, asks clarifying questions, and generates a complete project with architecture guards baked into evaluate.py — preventing the agent from bypassing your intended approach.
From scratch:
crucible new ~/my-experiment
cd ~/my-experiment
# Edit .crucible/config.yaml and program.md
crucible init --tag run1 # auto git-init if needed
If your experiment needs third-party packages (numpy, torch, etc.), they are listed in the generated pyproject.toml. Install them:
uv sync
Or manually — in your project repo, create .crucible/config.yaml:
name: "optimize-sorting"
description: "Find the fastest sorting implementation"
files:
  editable:
    - "sort.py"
  readonly:
    - "benchmark.py"
commands:
  run: "python benchmark.py > run.log 2>&1"
  eval: "grep '^ops_per_sec:' run.log"
metric:
  name: "ops_per_sec"
  direction: "maximize"
And .crucible/program.md with instructions for the agent:
You are optimizing a sorting algorithm.
Edit sort.py to improve throughput measured by ops_per_sec.
Try different algorithms, data structures, and optimizations.
2. Initialize
crucible init --tag run1
This creates a git branch crucible/run1 and initializes results.tsv. If the project isn't a git repo yet, init automatically runs git init, stages all files, and creates an initial commit.
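In git terms, initialization is roughly the following. This is a hypothetical sketch, not crucible's actual implementation:

import subprocess

def init_run(tag: str, prefix: str = "crucible") -> None:
    """Create the experiment branch; auto git-init on first use."""
    inside = subprocess.run(["git", "rev-parse", "--is-inside-work-tree"],
                            capture_output=True)
    if inside.returncode != 0:
        subprocess.run(["git", "init"], check=True)
        subprocess.run(["git", "add", "-A"], check=True)
        subprocess.run(["git", "commit", "-m", "initial commit"], check=True)
    subprocess.run(["git", "checkout", "-b", f"{prefix}/{tag}"], check=True)
    # crucible also initializes results.tsv at this point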
3. Run
crucible run --tag run1
The platform will loop indefinitely:
- Ask the agent to propose and implement one change
- Validate the edit (only allowed files modified)
- Commit and run the experiment
- Parse the metric
- Keep if improved, discard if not
- Repeat
Press Ctrl+C to stop gracefully (waits for current experiment to finish).
If interrupted, simply re-run the same command — crucible automatically detects the existing branch and resumes where it left off:
crucible run --tag run1 # resumes from previous state
4. Check results
crucible status
# Experiment: optimize-sorting
# Total: 15 Kept: 8 Discarded: 5 Crashed: 2
# Best ops_per_sec: 142000.0 (commit b2c3d4e)
crucible history --last 5
# Commit Metric Status Description
# ------------------------------------------------------------
# b2c3d4e 142000.0 keep switch to radix sort for large arrays
# a1b2c3d 138000.0 keep add insertion sort for small partitions
# ...
# JSON output for programmatic use
crucible status --json
crucible history --json --last 20
# Compare two experiment runs
crucible compare run1 run2
crucible compare run1 run2 --json
How It Works
crucible run --tag run1
│
▼
┌─────────────────────────────────┐
│ 1. Assemble prompt │ instructions + history + state
│ 2. Claude Agent SDK │ agent reads/edits files
│ 3. Guard rails │ validate edits
│ 4. Git commit │ snapshot the change
│ 5. Run experiment │ python evaluate.py > run.log
│ 6. Parse metric │ grep '^metric:' run.log
│ 7. Keep or discard │ improved? keep : reset
│ 8. Loop │
└─────────────────────────────────┘
- Agent: Uses the Claude Agent SDK with a tool allowlist (Read, Edit, Write, Glob, Grep). The agent can read files, make targeted edits, and search the codebase — but cannot execute arbitrary commands.
- Environment: If your project has a .venv/, crucible automatically activates it when running experiment commands, so python3 evaluate.py uses the correct interpreter and packages.
- Git: Every attempt is committed. Improvements advance the branch; failures are tagged and reset, preserving the diff for analysis.
Postmortem analysis
After a run completes (or is interrupted), analyze what happened:
crucible postmortem # text report with trend chart
crucible postmortem --json # machine-readable output
crucible postmortem --ai # include AI-generated insights
The postmortem shows metric trends, failure streaks, and the best result. With --ai, Claude analyzes the iteration history and provides actionable insights about turning points, plateaus, and suggested next directions.
Validate before running
crucible validate
# [PASS] Config: config.yaml is valid
# [PASS] Instructions: .crucible/program.md exists
# [PASS] Editable files: All files exist
# [PASS] Run command: Executed successfully
# [PASS] Eval/metric: ops_per_sec: 42000.0
Verbose logging
crucible -v run --tag run1 # debug-level output
Config Reference
.crucible/config.yaml
# Required
name: "experiment-name"                      # Experiment identifier
files:
  editable: ["train.py"]                     # Files the agent can modify
  readonly: ["eval.py"]                      # Files the agent must not touch (optional)
commands:
  run: "python train.py > run.log 2>&1"      # How to run one experiment
  eval: "grep '^metric:' run.log"            # How to extract the metric
metric:
  name: "metric"                             # Metric key (matches eval output)
  direction: "minimize"                      # "minimize" or "maximize"

# Optional (defaults shown)
description: ""                              # Human-readable description
commands:
  setup: "pip install -r requirements.txt"   # One-time setup (run on init)
constraints:
  timeout_seconds: 600                       # Kill the experiment after this
  max_retries: 3                             # Max consecutive failures before stopping
agent:
  type: "claude-code"                        # Agent backend
  instructions: "program.md"                 # Static instructions file
  system_prompt: "system.md"                 # Custom system prompt (optional; default: built-in)
context_window:
  include_history: true                      # Inject past experiment results
  history_limit: 20                          # Max history entries in prompt
  include_best: true                         # Show current best metric
git:
  branch_prefix: "crucible"                  # Branch: <prefix>/<tag>
  tag_failed: true                           # Tag failed experiments before reset
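crucible validate already checks all of this for you. If you also want to inspect a config programmatically, a minimal sanity check looks like the following (a sketch assuming PyYAML is installed, with field names as in the reference above):

import yaml

with open(".crucible/config.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["metric"]["direction"] in ("minimize", "maximize")
assert cfg["files"]["editable"], "need at least one editable file"
assert "run" in cfg["commands"] and "eval" in cfg["commands"]
print(f"{cfg['name']}: {cfg['metric']['direction']} {cfg['metric']['name']}")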
Eval Command Convention
The eval command must output lines in key: value format:
metric_name: 0.12345
The platform extracts the value matching metric.name. This is compatible with common patterns like grep '^loss:' run.log.
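Extraction boils down to a one-line regex over the eval output. A minimal sketch (illustrative only; crucible's actual parser may differ):

import re

def parse_metric(output: str, name: str) -> float:
    """Find 'name: value' in eval output and return the value as a float."""
    m = re.search(rf"^{re.escape(name)}:\s*(\S+)\s*$", output, re.MULTILINE)
    if m is None:
        raise ValueError(f"metric '{name}' not found in eval output")
    return float(m.group(1))

print(parse_metric("warmup done\nops_per_sec: 42000.0\n", "ops_per_sec"))  # 42000.0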
Single Metric by Design
Crucible uses a single scalar metric — this is a deliberate design choice, not a limitation. A single number makes the keep/discard decision unambiguous, keeps the loop simple and reliable, and forces you to define "better" clearly in your evaluation harness.
Multi-objective optimization is handled in evaluate.py, not the platform:
# measure_latency(), measure_throughput(), and correctness are placeholders
# for your own measurement code; pick ONE of the strategies below.
latency = measure_latency()
throughput = measure_throughput()

# Weighted combination
metric = throughput / latency

# Constraint-based (zero the metric if a constraint is violated)
metric = throughput if latency < 100 else 0

# Staged (correctness first, then optimize)
metric = throughput if correctness == 1.0 else -1000

print(f"metric: {metric}")
This keeps complexity in your domain logic (where it belongs) rather than in the platform.
Git Strategy
- Each session runs on a branch: <branch_prefix>/<tag>
- Successful experiments advance the branch (the commit stays)
- Failed experiments are tagged failed/<tag>/<n>, then reset, preserving the diff for analysis
- results.tsv records every experiment regardless of outcome
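The failure path, in plain git terms, amounts to a tag plus a reset. A hypothetical helper, not crucible's source:

import subprocess

def discard_failed(tag: str, n: int) -> None:
    """Tag the failed commit so its diff survives, then move the branch back."""
    subprocess.run(["git", "tag", f"failed/{tag}/{n}", "HEAD"], check=True)
    subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)

The tagged diff is then recoverable later with git show failed/<tag>/<n>.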
Guard Rails
Pre-commit: readonly files not modified, only listed files changed, at least one file edited.
Post-execution: timeout enforced (SIGTERM → SIGKILL), metric must be a valid number (not NaN/inf), consecutive failures capped at max_retries.
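Both layers amount to simple checks. Minimal sketches with hypothetical helper names (crucible's implementation details may differ):

import math
import subprocess

def validate_edits(editable: set[str], readonly: set[str]) -> None:
    """Pre-commit: at least one file edited, none outside the allowlist."""
    out = subprocess.run(["git", "diff", "--name-only"],
                         capture_output=True, text=True, check=True)
    changed = set(out.stdout.split())
    if not changed:
        raise RuntimeError("no files edited")
    if changed & readonly:
        raise RuntimeError(f"readonly files modified: {sorted(changed & readonly)}")
    if not changed <= editable:
        raise RuntimeError(f"files outside allowlist: {sorted(changed - editable)}")

def validate_metric(value: float) -> float:
    """Post-execution: the parsed metric must be a finite number."""
    if not math.isfinite(value):
        raise ValueError(f"metric is not a finite number: {value!r}")
    return value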
Context Assembly
Each iteration, the agent receives a dynamically assembled prompt:
- Static instructions from program.md
- Current state — branch, best metric, experiment counts
- Experiment history — recent results table + observed patterns
- Action directive — "propose and implement ONE experiment"
- Error/crash context — if the previous iteration failed, the error is included
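Conceptually, the prompt is just these pieces concatenated. A sketch with a hypothetical structure (crucible's real template may differ):

def assemble_prompt(instructions: str, state: dict, history: list[dict],
                    error: str | None = None) -> str:
    """Concatenate instructions, state, history, and the action directive."""
    lines = [instructions, "",
             f"Branch: {state['branch']}  Best: {state['best']}  "
             f"Experiments: {state['total']}",
             "", "Recent results:"]
    lines += [f"  {h['commit']}  {h['metric']}  {h['status']}  {h['desc']}"
              for h in history]
    if error is not None:
        lines += ["", "Previous iteration failed:", error]
    lines += ["", "Propose and implement ONE experiment."]
    return "\n".join(lines)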
Examples
Bundled examples to get started quickly. Create a project from any example:
crucible new ~/my-project -e <example-name>
| Example | Metric | Direction | Description |
|---|---|---|---|
| optimize-sorting | ops_per_sec | maximize | Pure Python sorting throughput optimization |
| optimize-regression | val_mse | minimize | Synthetic regression with nonlinear interactions |
| optimize-classifier | val_accuracy | maximize | NumPy-only neural network on an 8-class dataset |
| optimize-compress | compression_ratio | maximize | Lossless text compression (no zlib/gzip allowed) |
| optimize-gomoku | win_rate | maximize | AlphaZero-style Gomoku agent training |
Demo: optimize-compress
A showcase example where the agent builds a lossless text compressor from scratch:
crucible new ~/compress -e optimize-compress
cd ~/compress
crucible init --tag run1
crucible run --tag run1
Starting from a baseline RLE compressor (0.51x — worse than no compression), the agent typically:
- Iter 1: Implements LZ77 + Huffman → ~2.63x
- Iter 2: Adds optimal parsing DP + symbol remapping → ~2.81x (beats zlib's 2.65x)
- Iter 3+: Context modeling, arithmetic coding → 3.0x+
Project Structure
my-experiment/
├── .crucible/
│ ├── config.yaml # What to optimize, how to run, what to measure
│ └── program.md # Instructions for the LLM agent
├── solution.py # Code the agent modifies (editable)
├── evaluate.py # Fixed harness that measures the metric (readonly)
├── pyproject.toml # Experiment dependencies (NOT crucible itself)
├── results.tsv # Auto-generated experiment log
└── run.log # Latest experiment output
Crucible is installed as a global CLI tool — it is NOT a dependency of your experiment project. Your project's pyproject.toml only lists experiment-specific packages (numpy, torch, etc.).
Claude Code Skill: Interactive Setup
Crucible ships with a Claude Code skill that provides an interactive, guided workflow for creating experiment projects from scratch.
Installing the skill
# Copy the skill to your Claude Code skills directory
cp -r /path/to/crucible/.claude/skills/crucible-setup ~/.claude/skills/
Or, if you cloned the crucible repo, add it to your project's .claude/ directory:
mkdir -p .claude/skills
cp -r /path/to/crucible/.claude/skills/crucible-setup .claude/skills/
Using the skill
Once installed, simply tell Claude Code what you want to optimize:
> I want to optimize a matrix multiplication algorithm
> Set up a new experiment to maximize inference throughput
> Create a benchmark for my sorting implementation
Claude Code will automatically activate the crucible-setup skill and walk you through a 7-step workflow:
1. Define the metric — what to measure, direction (min/max), dependencies
2. Architecture constraints — if you require a specific approach, the skill enforces it in evaluate.py (not just in prompts) to prevent Goodhart's Law violations
3. Create the evaluation harness — a readonly evaluate.py with correctness gating and method verification
4. Create a baseline — a simple, correct starting implementation
5. Write agent instructions — program.md with hard rules (code-enforced) vs soft rules (guidelines)
6. Write config.yaml — metric, commands, timeout, guard rails
7. Verify the baseline — run the experiment to confirm everything works
Why use the skill instead of examples?
| Approach | Best for |
|---|---|
| crucible new -e <example> | Standard problems similar to bundled examples |
| Claude Code skill | Custom problems, unique metrics, architecture constraints |
The skill is especially valuable when you have architecture constraints (e.g., "must use neural network", "implement with MCTS"). It generates verify_method() checks in the evaluation harness that zero the metric if the agent abandons the required approach — something you'd have to write manually otherwise.
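What such a guard can look like, as a hypothetical sketch (the skill generates checks tailored to your constraint; this one just inspects the solution's source for a required MCTS routine):

import inspect

def verify_method(module) -> bool:
    """Pass only if the solution still defines and calls mcts_search."""
    src = inspect.getsource(module)
    return "def mcts_search" in src and src.count("mcts_search") >= 2

# In the readonly evaluate.py, the metric is zeroed when the check fails:
#   import solution
#   score = benchmark(solution) if verify_method(solution) else 0.0
#   print(f"win_rate: {score}")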
FAQ
Won't the greedy strategy get stuck in local optima?
Crucible uses a greedy keep/discard loop — improvements are kept, regressions are discarded. This sounds like it could get stuck, but an LLM agent is fundamentally different from traditional optimization:
- The agent sees full history including discarded and crashed attempts, so it knows what didn't work and why
- It can reason about failures and deliberately try different architectural approaches, not just parameter tweaks
- It reads the actual code each iteration, so it can make structural changes that a blind search never would
That said, getting stuck in a local optimum is a real risk on long runs. The built-in escape hatch is multiple tags — essentially manual beam search:
# Explore different directions from the same baseline
crucible init --tag approach-a
crucible init --tag approach-b
crucible run --tag approach-a # e.g. "focus on algorithmic improvements"
crucible run --tag approach-b # e.g. "focus on low-level optimizations"
crucible compare approach-a approach-b
You can also backtrack to an earlier commit and branch from there:
git log crucible/run1 # find a promising commit
git checkout <commit>
crucible init --tag run1-variant # new branch from that point
crucible run --tag run1-variant
Why only one metric? What about multi-objective optimization?
See Single Metric by Design above. The single scalar metric is a deliberate design choice that keeps the keep/discard decision unambiguous. Multi-objective trade-offs belong in your evaluate.py, where you have full domain knowledge to define what "better" means.
Why not run multiple agents in parallel?
Crucible runs one agent per tag, serially. This is deliberate:
- Cost efficiency: Parallel agents multiply API costs, but serial agents learn from history — iteration N+1 is smarter than N because it sees what worked and what didn't. Blind parallel exploration doesn't have this advantage.
- Simplicity: Parallel agents editing the same files in the same repo cause git conflicts. Solving this requires worktree isolation, result synchronization, and merge strategies — significant complexity for marginal gain.
The manual approach covers most needs. Run multiple tags in separate terminals:
# Terminal 1
crucible run --tag algo-focus

# Terminal 2
crucible run --tag lowlevel-focus
Each tag is an independent experiment branch. Compare results when done:
crucible compare algo-focus lowlevel-focus
This gives you full control over which directions to explore in parallel, with zero additional complexity.
Is it safe to let the agent modify code that gets executed?
The agent cannot run arbitrary commands — it only has access to Read, Edit, Write, Glob, and Grep tools. However, the code it writes into editable files is executed by commands.run. If the editable file can make network requests, delete files, or perform other dangerous operations, guard rails won't catch that.
Mitigations:
- Scope the editable files narrowly. If sort.py only contains a sort function, the blast radius is limited even if the agent writes bad code.
- Make the evaluation harness (readonly) import and call the editable code in a controlled way (see the sketch after this list). The agent can't modify evaluate.py.
- Use constraints.timeout_seconds to kill runaway experiments.
- Run in a container or VM for untrusted workloads. Crucible doesn't require root or network access.
- Review the git log. Every change is committed — you can audit exactly what the agent did.
This is the same trust model as CI/CD: you review the code, the system runs it. Crucible just automates the iteration loop.
Where's the web dashboard?
There isn't one — by design. results.tsv is a plain TSV file that any tool can read, and experiments typically run tens of iterations, not thousands. A full web UI would be a separate project-sized effort for marginal benefit.
Live monitoring (in a separate terminal):
watch -n 5 crucible status
watch -n 5 crucible history --last 10
Quick trend chart:
# ASCII chart with gnuplot
tail -n +2 results.tsv | cut -f2 | gnuplot -e "set terminal dumb; plot '-' with lines"
# Or Python
python3 -c "
import csv
with open('results.tsv') as f:
for i, x in enumerate(csv.DictReader(f, delimiter='\t')):
bar = '#' * int(float(x['metric_value']) / 10)
print(f'{i+1:3d} {float(x[\"metric_value\"]):8.2f} {bar}')
"
Programmatic access:
crucible status --json | jq .
crucible history --json --last 50 | jq '.[].metric'