crucible
Traditional Chinese | English
A general-purpose autonomous experiment platform. Define what to edit, how to run, and what to measure — then let an LLM agent iterate indefinitely to optimize your metric.
Prerequisites
- Python 3.10+
- uv — Python package manager
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# or via Homebrew
brew install uv
- Git — the platform uses git for version control of experiments
- Claude Code — the claude CLI must be installed and authenticated:

# Install
npm install -g @anthropic-ai/claude-code

# Authenticate (follow the prompts)
claude
Install
# Install as a global CLI tool
uv tool install autocrucible
# Or install from a local clone
git clone https://github.com/suzuke/crucible.git
uv tool install ./crucible
Verify:
crucible --help
Updating
# From PyPI
uv tool install autocrucible --force
# From local source (after pulling changes)
uv tool install ./crucible --force
For development
git clone https://github.com/suzuke/crucible.git
cd crucible
uv sync # install in local .venv
uv run crucible --help # run from source
uv run pytest # run tests
Quick Start
1. Create a project
From an example:
# List available examples
crucible new . --list
# Create from example
crucible new ~/my-experiment -e optimize-sorting
cd ~/my-experiment
crucible init --tag run1 # auto git-init if needed
Using the wizard (AI-generated scaffold):
crucible wizard ~/my-experiment --describe "Train an AlphaZero Gomoku agent using NN and MCTS"
cd ~/my-experiment
crucible init --tag run1
The wizard analyzes your description, asks clarifying questions, and generates a complete project with architecture guards baked into evaluate.py — preventing the agent from bypassing your intended approach.
From scratch:
crucible new ~/my-experiment
cd ~/my-experiment
# Edit .crucible/config.yaml and program.md
crucible init --tag run1 # auto git-init if needed
If your experiment needs third-party packages (numpy, torch, etc.), they are listed in the generated pyproject.toml. Install them:
uv sync
Or manually — in your project repo, create .crucible/config.yaml:
name: "optimize-sorting"
description: "Find the fastest sorting implementation"
files:
  editable:
    - "sort.py"
  readonly:
    - "benchmark.py"
commands:
  run: "python benchmark.py > run.log 2>&1"
  eval: "grep '^ops_per_sec:' run.log"
metric:
  name: "ops_per_sec"
  direction: "maximize"
And .crucible/program.md with instructions for the agent:
You are optimizing a sorting algorithm.
Edit sort.py to improve throughput measured by ops_per_sec.
Try different algorithms, data structures, and optimizations.
2. Initialize
crucible init --tag run1
This creates a git branch crucible/run1 and initializes results.tsv. If the project isn't a git repo yet, init automatically runs git init, stages all files, and creates an initial commit.
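In git terms, initialization is roughly the following. This is a hypothetical sketch, not crucible's actual implementation:

import subprocess

def init_run(tag: str, prefix: str = "crucible") -> None:
    """Create the experiment branch; auto git-init on first use."""
    inside = subprocess.run(["git", "rev-parse", "--is-inside-work-tree"],
                            capture_output=True)
    if inside.returncode != 0:
        subprocess.run(["git", "init"], check=True)
        subprocess.run(["git", "add", "-A"], check=True)
        subprocess.run(["git", "commit", "-m", "initial commit"], check=True)
    subprocess.run(["git", "checkout", "-b", f"{prefix}/{tag}"], check=True)
    # crucible also initializes results.tsv at this point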
3. Run
crucible run --tag run1
The platform will loop indefinitely:
- Ask the agent to propose and implement one change
- Validate the edit (only allowed files modified)
- Commit and run the experiment
- Parse the metric
- Keep if improved, discard if not
- Repeat
Press Ctrl+C to stop gracefully (waits for current experiment to finish).
If interrupted, simply re-run the same command — crucible automatically detects the existing branch and resumes where it left off:
crucible run --tag run1 # resumes from previous state
4. Check results
crucible status
# Experiment: optimize-sorting
# Total: 15 Kept: 8 Discarded: 5 Crashed: 2
# Best ops_per_sec: 142000.0 (commit b2c3d4e)
crucible history --last 5
# Commit Metric Status Description
# ------------------------------------------------------------
# b2c3d4e 142000.0 keep switch to radix sort for large arrays
# a1b2c3d 138000.0 keep add insertion sort for small partitions
# ...
# JSON output for programmatic use
crucible status --json
crucible history --json --last 20
# Compare two experiment runs
crucible compare run1 run2
crucible compare run1 run2 --json
How It Works
crucible run --tag run1
│
▼
┌─────────────────────────────────┐
│ 1. Assemble prompt │ instructions + history + state
│ 2. Claude Agent SDK │ agent reads/edits files
│ 3. Guard rails │ validate edits
│ 4. Git commit │ snapshot the change
│ 5. Run experiment │ python evaluate.py > run.log
│ 6. Parse metric │ grep '^metric:' run.log
│ 7. Keep or discard │ improved? keep : reset
│ 8. Loop │
└─────────────────────────────────┘
- Agent: Uses the Claude Agent SDK with a tool allowlist (Read, Edit, Write, Glob, Grep). The agent can read files, make targeted edits, and search the codebase — but cannot execute arbitrary commands.
- Environment: If your project has a .venv/, crucible automatically activates it when running experiment commands, so python3 evaluate.py uses the correct interpreter and packages.
- Git: Every attempt is committed. Improvements advance the branch; failures are tagged and reset, preserving the diff for analysis.
Postmortem analysis
After a run completes (or is interrupted), analyze what happened:
crucible postmortem # text report with trend chart
crucible postmortem --json # machine-readable output
crucible postmortem --ai # include AI-generated insights
The postmortem shows metric trends, failure streaks, and the best result. With --ai, Claude analyzes the iteration history and provides actionable insights about turning points, plateaus, and suggested next directions.
Validate before running
crucible validate
# [PASS] Config: config.yaml is valid
# [PASS] Instructions: .crucible/program.md exists
# [PASS] Editable files: All files exist
# [PASS] Run command: Executed successfully
# [PASS] Eval/metric: ops_per_sec: 42000.0
Verbose logging
crucible -v run --tag run1 # debug-level output
Config Reference
.crucible/config.yaml
# Required
name: "experiment-name"                      # Experiment identifier
files:
  editable: ["train.py"]                     # Files the agent can modify
  readonly: ["eval.py"]                      # Files the agent must not touch (optional)
commands:
  run: "python train.py > run.log 2>&1"      # How to run one experiment
  eval: "grep '^metric:' run.log"            # How to extract the metric
metric:
  name: "metric"                             # Metric key (matches eval output)
  direction: "minimize"                      # "minimize" or "maximize"

# Optional (defaults shown)
description: ""                              # Human-readable description
commands:
  setup: "pip install -r requirements.txt"   # One-time setup (run on init)
constraints:
  timeout_seconds: 600                       # Kill the experiment after this
  max_retries: 3                             # Max consecutive failures before stopping
agent:
  type: "claude-code"                        # Agent backend
  instructions: "program.md"                 # Static instructions file
  system_prompt: "system.md"                 # Custom system prompt (optional; default: built-in)
context_window:
  include_history: true                      # Inject past experiment results
  history_limit: 20                          # Max history entries in prompt
  include_best: true                         # Show current best metric
git:
  branch_prefix: "crucible"                  # Branch: <prefix>/<tag>
  tag_failed: true                           # Tag failed experiments before reset
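crucible validate already checks all of this for you. If you also want to inspect a config programmatically, a minimal sanity check looks like the following (a sketch assuming PyYAML is installed, with field names as in the reference above):

import yaml

with open(".crucible/config.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["metric"]["direction"] in ("minimize", "maximize")
assert cfg["files"]["editable"], "need at least one editable file"
assert "run" in cfg["commands"] and "eval" in cfg["commands"]
print(f"{cfg['name']}: {cfg['metric']['direction']} {cfg['metric']['name']}")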
Eval Command Convention
The eval command must output lines in key: value format:
metric_name: 0.12345
The platform extracts the value matching metric.name. This is compatible with common patterns like grep '^loss:' run.log.
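Extraction boils down to a one-line regex over the eval output. A minimal sketch (illustrative only; crucible's actual parser may differ):

import re

def parse_metric(output: str, name: str) -> float:
    """Find 'name: value' in eval output and return the value as a float."""
    m = re.search(rf"^{re.escape(name)}:\s*(\S+)\s*$", output, re.MULTILINE)
    if m is None:
        raise ValueError(f"metric '{name}' not found in eval output")
    return float(m.group(1))

print(parse_metric("warmup done\nops_per_sec: 42000.0\n", "ops_per_sec"))  # 42000.0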
Single Metric by Design
Crucible uses a single scalar metric — this is a deliberate design choice, not a limitation. A single number makes the keep/discard decision unambiguous, keeps the loop simple and reliable, and forces you to define "better" clearly in your evaluation harness.
Multi-objective optimization is handled in evaluate.py, not the platform:
# measure_latency(), measure_throughput(), and correctness are placeholders
# for your own measurement code; pick ONE of the strategies below.
latency = measure_latency()
throughput = measure_throughput()

# Weighted combination
metric = throughput / latency

# Constraint-based (zero the metric if a constraint is violated)
metric = throughput if latency < 100 else 0

# Staged (correctness first, then optimize)
metric = throughput if correctness == 1.0 else -1000

print(f"metric: {metric}")
This keeps complexity in your domain logic (where it belongs) rather than in the platform.
Git Strategy
- Each session runs on a branch: <branch_prefix>/<tag>
- Successful experiments advance the branch (the commit stays)
- Failed experiments are tagged failed/<tag>/<n>, then reset, preserving the diff for analysis
- results.tsv records every experiment regardless of outcome
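The failure path, in plain git terms, amounts to a tag plus a reset. A hypothetical helper, not crucible's source:

import subprocess

def discard_failed(tag: str, n: int) -> None:
    """Tag the failed commit so its diff survives, then move the branch back."""
    subprocess.run(["git", "tag", f"failed/{tag}/{n}", "HEAD"], check=True)
    subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)

The tagged diff is then recoverable later with git show failed/<tag>/<n>.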
Guard Rails
Pre-commit: readonly files not modified, only listed files changed, at least one file edited.
Post-execution: timeout enforced (SIGTERM → SIGKILL), metric must be a valid number (not NaN/inf), consecutive failures capped at max_retries.
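Both layers amount to simple checks. Minimal sketches with hypothetical helper names (crucible's implementation details may differ):

import math
import subprocess

def validate_edits(editable: set[str], readonly: set[str]) -> None:
    """Pre-commit: at least one file edited, none outside the allowlist."""
    out = subprocess.run(["git", "diff", "--name-only"],
                         capture_output=True, text=True, check=True)
    changed = set(out.stdout.split())
    if not changed:
        raise RuntimeError("no files edited")
    if changed & readonly:
        raise RuntimeError(f"readonly files modified: {sorted(changed & readonly)}")
    if not changed <= editable:
        raise RuntimeError(f"files outside allowlist: {sorted(changed - editable)}")

def validate_metric(value: float) -> float:
    """Post-execution: the parsed metric must be a finite number."""
    if not math.isfinite(value):
        raise ValueError(f"metric is not a finite number: {value!r}")
    return value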
Context Assembly
Each iteration, the agent receives a dynamically assembled prompt:
- Static instructions from program.md
- Current state — branch, best metric, experiment counts
- Experiment history — recent results table + observed patterns
- Action directive — "propose and implement ONE experiment"
- Error/crash context — if the previous iteration failed, the error is included
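Conceptually, the prompt is just these pieces concatenated. A sketch with a hypothetical structure (crucible's real template may differ):

def assemble_prompt(instructions: str, state: dict, history: list[dict],
                    error: str | None = None) -> str:
    """Concatenate instructions, state, history, and the action directive."""
    lines = [instructions, "",
             f"Branch: {state['branch']}  Best: {state['best']}  "
             f"Experiments: {state['total']}",
             "", "Recent results:"]
    lines += [f"  {h['commit']}  {h['metric']}  {h['status']}  {h['desc']}"
              for h in history]
    if error is not None:
        lines += ["", "Previous iteration failed:", error]
    lines += ["", "Propose and implement ONE experiment."]
    return "\n".join(lines)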
Examples
Bundled examples to get started quickly. Create a project from any example:
crucible new ~/my-project -e <example-name>
| Example | Metric | Direction | Description |
|---|---|---|---|
| optimize-sorting | ops_per_sec | maximize | Pure Python sorting throughput optimization |
| optimize-regression | val_mse | minimize | Synthetic regression with nonlinear interactions |
| optimize-classifier | val_accuracy | maximize | NumPy-only neural network on an 8-class dataset |
| optimize-compress | compression_ratio | maximize | Lossless text compression (no zlib/gzip allowed) |
| optimize-gomoku | win_rate | maximize | AlphaZero-style Gomoku agent training |
Demo: optimize-compress
A showcase example where the agent builds a lossless text compressor from scratch:
crucible new ~/compress -e optimize-compress
cd ~/compress
crucible init --tag run1
crucible run --tag run1
Starting from a baseline RLE compressor (0.51x — worse than no compression), the agent typically:
- Iter 1: Implements LZ77 + Huffman → ~2.63x
- Iter 2: Adds optimal parsing DP + symbol remapping → ~2.81x (beats zlib's 2.65x)
- Iter 3+: Context modeling, arithmetic coding → 3.0x+
Project Structure
my-experiment/
├── .crucible/
│ ├── config.yaml # What to optimize, how to run, what to measure
│ └── program.md # Instructions for the LLM agent
├── solution.py # Code the agent modifies (editable)
├── evaluate.py # Fixed harness that measures the metric (readonly)
├── pyproject.toml # Experiment dependencies (NOT crucible itself)
├── results.tsv # Auto-generated experiment log
└── run.log # Latest experiment output
Crucible is installed as a global CLI tool — it is NOT a dependency of your experiment project. Your project's pyproject.toml only lists experiment-specific packages (numpy, torch, etc.).
Claude Code Skill: Interactive Setup
Crucible ships with a Claude Code skill that provides an interactive, guided workflow for creating experiment projects from scratch.
Installing the skill
# Copy the skill to your Claude Code skills directory
cp -r /path/to/crucible/.claude/skills/crucible-setup ~/.claude/skills/
Or, if you cloned the crucible repo, add it to your project's .claude/ directory:
mkdir -p .claude/skills
cp -r /path/to/crucible/.claude/skills/crucible-setup .claude/skills/
Using the skill
Once installed, simply tell Claude Code what you want to optimize:
> I want to optimize a matrix multiplication algorithm
> Set up a new experiment to maximize inference throughput
> Create a benchmark for my sorting implementation
Claude Code will automatically activate the crucible-setup skill and walk you through a 7-step workflow:
1. Define the metric — what to measure, direction (min/max), dependencies
2. Architecture constraints — if you require a specific approach, the skill enforces it in evaluate.py (not just in prompts) to prevent Goodhart's Law violations
3. Create the evaluation harness — a readonly evaluate.py with correctness gating and method verification
4. Create a baseline — a simple, correct starting implementation
5. Write agent instructions — program.md with hard rules (code-enforced) vs soft rules (guidelines)
6. Write config.yaml — metric, commands, timeout, guard rails
7. Verify the baseline — run the experiment to confirm everything works
Why use the skill instead of examples?
| Approach | Best for |
|---|---|
| crucible new -e <example> | Standard problems similar to bundled examples |
| Claude Code skill | Custom problems, unique metrics, architecture constraints |
The skill is especially valuable when you have architecture constraints (e.g., "must use neural network", "implement with MCTS"). It generates verify_method() checks in the evaluation harness that zero the metric if the agent abandons the required approach — something you'd have to write manually otherwise.
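What such a guard can look like, as a hypothetical sketch (the skill generates checks tailored to your constraint; this one just inspects the solution's source for a required MCTS routine):

import inspect

def verify_method(module) -> bool:
    """Pass only if the solution still defines and calls mcts_search."""
    src = inspect.getsource(module)
    return "def mcts_search" in src and src.count("mcts_search") >= 2

# In the readonly evaluate.py, the metric is zeroed when the check fails:
#   import solution
#   score = benchmark(solution) if verify_method(solution) else 0.0
#   print(f"win_rate: {score}")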
FAQ
Won't the greedy strategy get stuck in local optima?
Crucible uses a greedy keep/discard loop — improvements are kept, regressions are discarded. This sounds like it could get stuck, but an LLM agent is fundamentally different from traditional optimization:
- The agent sees full history including discarded and crashed attempts, so it knows what didn't work and why
- It can reason about failures and deliberately try different architectural approaches, not just parameter tweaks
- It reads the actual code each iteration, so it can make structural changes that a blind search never would
That said, getting stuck in a local optimum is a real risk on long runs. The built-in escape hatch is multiple tags — essentially manual beam search:
# Explore different directions from the same baseline
crucible init --tag approach-a
crucible init --tag approach-b
crucible run --tag approach-a # e.g. "focus on algorithmic improvements"
crucible run --tag approach-b # e.g. "focus on low-level optimizations"
crucible compare approach-a approach-b
You can also backtrack to an earlier commit and branch from there:
git log crucible/run1 # find a promising commit
git checkout <commit>
crucible init --tag run1-variant # new branch from that point
crucible run --tag run1-variant
Why only one metric? What about multi-objective optimization?
See Single Metric by Design above. The single scalar metric is a deliberate design choice that keeps the keep/discard decision unambiguous. Multi-objective trade-offs belong in your evaluate.py, where you have full domain knowledge to define what "better" means.
Why not run multiple agents in parallel?
Crucible runs one agent per tag, serially. This is deliberate:
- Cost efficiency: Parallel agents multiply API costs, but serial agents learn from history — iteration N+1 is smarter than N because it sees what worked and what didn't. Blind parallel exploration doesn't have this advantage.
- Simplicity: Parallel agents editing the same files in the same repo cause git conflicts. Solving this requires worktree isolation, result synchronization, and merge strategies — significant complexity for marginal gain.
The manual approach covers most needs. Run multiple tags in separate terminals:
# Terminal 1
crucible run --tag algo-focus

# Terminal 2
crucible run --tag lowlevel-focus
Each tag is an independent experiment branch. Compare results when done:
crucible compare algo-focus lowlevel-focus
This gives you full control over which directions to explore in parallel, with zero additional complexity.
Is it safe to let the agent modify code that gets executed?
The agent cannot run arbitrary commands — it only has access to Read, Edit, Write, Glob, and Grep tools. However, the code it writes into editable files is executed by commands.run. If the editable file can make network requests, delete files, or perform other dangerous operations, guard rails won't catch that.
Mitigations:
- Scope the editable files narrowly. If sort.py only contains a sort function, the blast radius is limited even if the agent writes bad code.
- Make the evaluation harness (readonly) import and call the editable code in a controlled way (see the sketch after this list). The agent can't modify evaluate.py.
- Use constraints.timeout_seconds to kill runaway experiments.
- Run in a container or VM for untrusted workloads. Crucible doesn't require root or network access.
- Review the git log. Every change is committed — you can audit exactly what the agent did.
This is the same trust model as CI/CD: you review the code, the system runs it. Crucible just automates the iteration loop.
Where's the web dashboard?
There isn't one — by design. results.tsv is a plain TSV file that any tool can read, and experiments typically run tens of iterations, not thousands. A full web UI would be a separate project-sized effort for marginal benefit.
Live monitoring (in a separate terminal):
watch -n 5 crucible status
watch -n 5 crucible history --last 10
Quick trend chart:
# ASCII chart with gnuplot
tail -n +2 results.tsv | cut -f2 | gnuplot -e "set terminal dumb; plot '-' with lines"
# Or Python
python3 -c "
import csv
with open('results.tsv') as f:
for i, x in enumerate(csv.DictReader(f, delimiter='\t')):
bar = '#' * int(float(x['metric_value']) / 10)
print(f'{i+1:3d} {float(x[\"metric_value\"]):8.2f} {bar}')
"
Programmatic access:
crucible status --json | jq .
crucible history --json --last 50 | jq '.[].metric'