omegaprompt
Calibration discipline for Claude API prompts. A prompt that scores 5.0 on the training slice and collapses on held-out data is a textbook case of overfitting. omegaprompt ports omega-lock's sensitivity-driven coordinate descent and walk-forward validation to the prompt-engineering setting — so prompts ship only after they generalize, with hard gates that zero out fitness on refusal or malformed output.
```
pip install omegaprompt
```
Korean README: README_KR.md
Table of Contents
- Why this exists
- 30-second demo
- The calibratable axes
- Architecture
- Design decisions worth defending
- Cost & performance
- Validation
- The 3-layer stack
- Relation to adjacent tools
- Status
- Roadmap
- Contributing
- Citing
- License
- Colophon
Why this exists
Prompt engineering has a known failure mode: you iterate on a handful of hand-picked examples, the prompt starts looking great, you ship it, and it collapses on the first input outside your head. The failure mode is not new — it is overfitting, the same thing machine learning practitioners have been fighting since the 1990s — but the prompt-engineering toolchain has mostly reinvented evaluation without reinventing the defense.
omega-lock, a sibling project, solved this for numeric parameter calibration. The insight is simple and transfers:
- Measure sensitivity. Which parameters actually matter? Perturb each one around a neutral baseline, rank by Gini coefficient of the fitness delta.
- Unlock top-K, lock the rest. Search only in the subspace that moves the fitness. The rest stay at neutral.
- Walk forward. After the search finishes on the training slice, re-evaluate on a held-out test slice the searcher never saw. Require a Pearson correlation above a pre-declared threshold (KC-4) before the result ships. No tuning the threshold after the fact.
This repository is the port. PromptTarget implements omega-lock's CalibrableTarget protocol, so every omega-lock pipeline (run_p1, run_p1_iterative, run_p2_tpe, run_benchmark) works on prompt calibration unchanged. The prompt-specific pieces — the LLM-as-judge scorer, the composite hard_gate × soft_score fitness, the five calibratable axes — sit on top.
Two guardrails distinguish this from "ask the judge and pick the highest score":
- Hard gates collapse fitness to zero. `no_refusal`, `format_valid`, `no_safety_violation` — if any gate fails on an item, that item contributes zero. A prompt that scores 5.0 on ten tasks but refuses on the eleventh does not rank above one that consistently scores 4.2 with no refusals.
- Walk-forward is the ship gate. omega-lock's KC-4 (Pearson correlation between train and test rankings) is pre-declared and enforced. If the training-best prompt does not rank highly on the test slice, the calibration fails with `status = "FAIL:KC-4"` and no candidate ships. You cannot retroactively lower the threshold.
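The hard-zero aggregation is easy to state concretely. The sketch below is illustrative, not the package's `CompositeFitness` (which may weight dimensions and gate more aggressively); items are modeled as `(gates_passed, soft_score)` pairs with scores normalized to [0, 1].

```python
def composite_fitness(items):
    """Mean per-item fitness where any failed hard gate zeroes the item.

    items: list of (gates_passed: bool, soft_score: float in [0, 1]).
    Illustrative stand-in for the package's CompositeFitness.
    """
    if not items:
        return 0.0
    return sum(score if gates else 0.0 for gates, score in items) / len(items)

# Prompt A: perfect soft scores but one refusal out of five items.
prompt_a = [(True, 1.0)] * 4 + [(False, 1.0)]
# Prompt B: steady 0.84 (a 4.2 on a 5-point scale) with no refusals.
prompt_b = [(True, 0.84)] * 5

assert composite_fitness(prompt_a) < composite_fitness(prompt_b)
```

Because the refusal contributes exactly zero rather than a discounted score, the searcher sees no partial credit anywhere inside the refusal region.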
30-second demo
```
$ omegaprompt calibrate examples/sample_dataset.jsonl \
    --rubric examples/rubric_example.json \
    --variants examples/variants_example.json \
    --test examples/sample_test.jsonl \
    --target-model claude-haiku-4-5 \
    --judge-model claude-sonnet-4-6 \
    --method p1 \
    --unlock-k 3 \
    --output outcome.json
Loading dataset from examples/sample_dataset.jsonl ...
  5 items
Loading test set from examples/sample_test.jsonl ...
  5 items
Starting omega-lock run_p1 calibration (unlock_k=3, method=p1) ...
This issues Claude API calls. Budget accordingly.
Calibration complete. best_fitness=0.8240, test_fitness=0.7820, gen_gap=5.10%
Artifact: outcome.json
```
The outcome.json artifact contains the winning parameter set, the per-slice fitness, the generalization gap, the hard-gate pass rate, and aggregate token usage. It is machine-readable by design — omegaprompt report (v0.2) will render it; your own CI gate can diff outcomes across prompt revisions.
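A CI gate over the artifact can be as small as the sketch below. The field names (`status`, `best_fitness`, `test_fitness`) follow this README's description of `CalibrationOutcome`; treat the exact schema as an assumption and check it against your own `outcome.json` before wiring this into a pipeline.

```python
import json

def gate(outcome, max_gen_gap=0.10):
    """Return a block reason for an unshippable outcome, or None.

    `outcome` is a parsed outcome.json dict. Field names are assumed
    from the README's CalibrationOutcome description.
    """
    if str(outcome.get("status", "")).startswith("FAIL"):
        return f"blocked: {outcome['status']}"
    gap = outcome["best_fitness"] - outcome["test_fitness"]
    if gap > max_gen_gap:
        return f"blocked: generalization gap {gap:.1%} exceeds {max_gen_gap:.0%}"
    return None  # shippable

# In CI: outcome = json.load(open("outcome.json")); exit nonzero if gate(outcome).
verdict = gate({"status": "OK", "best_fitness": 0.824, "test_fitness": 0.782})
```

With the demo numbers above, the 4.2% gap clears the 10% ceiling and `verdict` is `None`.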
The calibratable axes
PromptTarget exposes five axes to the searcher:
| Axis | Type | Meaning |
|---|---|---|
| `system_prompt_idx` | int | Index into your pool of candidate system prompts (`ParamVariants.system_prompts`). |
| `few_shot_count` | int | How many examples to include from `ParamVariants.few_shot_examples`. 0 = zero-shot. |
| `effort_idx` | int (0-2) | Maps to `effort`: low / medium / high. Only meaningful when thinking is enabled. |
| `thinking_enabled` | bool | Whether to enable adaptive thinking on the target call. |
| `max_tokens_bucket` | int (0-2) | Maps to `max_tokens`: 1024 / 4096 / 16000. Surfaces length-bound bias. |
The PromptSpace dataclass lets you lock individual axes (e.g. effort_min == effort_max) when some dimensions are pre-decided. Setting all axes to a single value effectively runs a fixed-prompt benchmark through the judge; a typical calibration leaves three axes open and unlocks the top-K by sensitivity.
The axis set is deliberately conservative in v0.1. Adding temperature, top_p, or reasoning-budget knobs would couple omegaprompt to model-specific behavior that evolves between releases. The five axes above transfer across any chat-style Claude model string.
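Axis locking can be pictured with a toy range-based space. This is not the real `PromptSpace` dataclass — its field layout is an assumption modeled on the `effort_min == effort_max` convention mentioned above — but it shows the lock-by-collapsing-the-range idea.

```python
from dataclasses import dataclass

@dataclass
class Space:
    """Toy stand-in for PromptSpace: each axis is an inclusive (lo, hi) range.

    Field names and the tuple encoding are illustrative; the README only
    guarantees that a collapsed range (lo == hi) locks an axis.
    """
    system_prompt_idx: tuple = (0, 2)
    few_shot_count: tuple = (0, 4)
    effort_idx: tuple = (0, 2)
    thinking_enabled: tuple = (0, 1)
    max_tokens_bucket: tuple = (0, 2)

    def locked_axes(self):
        return [name for name, (lo, hi) in vars(self).items() if lo == hi]

# Pre-decide effort and thinking; leave three axes open for the searcher.
space = Space(effort_idx=(1, 1), thinking_enabled=(1, 1))
```

A locked axis never enters the grid, so the searcher's candidate count shrinks multiplicatively with every lock.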
Architecture
```
┌──────────────────────────────────────────────────────────────┐
│ Dataset (.jsonl)   ParamVariants (.json)   JudgeRubric (.json)│
└────────────────┬─────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────┐
│ PromptTarget                                                 │
│   implements omega-lock CalibrableTarget                     │
│   param_space() ---> 5 axes                                  │
│   evaluate(params):                                          │
│     for each dataset item:                                   │
│       1. build prompt from params                            │
│       2. call_target(target_client, ...) -> response         │
│       3. call_judge(judge_client, rubric, ...) -> JudgeResult│
│     CompositeFitness(judge_results) -> fitness               │
│   returns EvalResult(fitness, n_trials, metadata)            │
└────────────────┬─────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────┐
│ omega-lock run_p1                                            │
│   measure_stress + select_unlock_top_k                       │
│   GridSearch over unlocked subspace                          │
│   WalkForward on --test slice (KC-4)                         │
│   emits grid_best, test_fitness, status                      │
└────────────────┬─────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────┐
│ CalibrationOutcome (JSON artifact)                           │
│   best_params, best_fitness, test_fitness,                   │
│   generalization_gap, hard_gate_pass_rate,                   │
│   n_candidates_evaluated, total_api_calls, usage_summary     │
└──────────────────────────────────────────────────────────────┘
```
Every piece above the omega-lock line is new; every piece at or below it is reused unchanged. The composition boundary is the CalibrableTarget protocol — one function (param_space()) plus one method (evaluate(params)).
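That composition boundary can be written down structurally. The sketch below is an assumption about the protocol's shape (the real `CalibrableTarget` returns an `EvalResult`, not a bare float, and its signatures may differ); the toy target and exhaustive searcher exist only to show that any object with the two members plugs in without inheriting anything.

```python
from typing import Any, Protocol

class CalibrableTarget(Protocol):
    """Structural sketch of the boundary named above (shape assumed)."""
    def param_space(self) -> dict[str, Any]: ...
    def evaluate(self, params: dict[str, Any]) -> float: ...

class FixedPromptTarget:
    """Toy target: fitness peaks when few_shot_count is 2."""
    def param_space(self):
        return {"few_shot_count": range(0, 5)}
    def evaluate(self, params):
        return 1.0 - abs(params["few_shot_count"] - 2) / 4

def exhaustive_best(target: CalibrableTarget):
    """Trivial searcher standing in for omega-lock's grid search."""
    space = target.param_space()
    candidates = [{"few_shot_count": v} for v in space["few_shot_count"]]
    return max(candidates, key=target.evaluate)
```

Because `Protocol` is structural, `FixedPromptTarget` satisfies `CalibrableTarget` with no base class — which is also why the real pipelines work on `PromptTarget` unchanged.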
Design decisions worth defending
Single-responsibility adapter, reuses an existing calibration engine. omega-lock already handles stress measurement, top-K unlock, grid search, walk-forward, KC gates, benchmark scorecards, and iterative lock-in. Reimplementing any of that here would be wrong. The PromptTarget is ~200 lines; the value is in the fact that every omega-lock pipeline works on it.
LLM-as-judge with a Pydantic-validated response. The judge call uses messages.parse(output_format=JudgeResult). A malformed judge response raises ValidationError at the SDK boundary — it never pollutes the fitness. Without this, a single misbehaving judge call could tank an entire calibration run, and you would not notice until the final report.
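The validate-at-the-boundary idea looks like this. The real code gets it for free from the SDK's `messages.parse` with a Pydantic model; the sketch below is a stdlib-only stand-in with a simplified schema, so the class and its fields are illustrative, not the package's `JudgeResult`.

```python
from dataclasses import dataclass

class JudgeValidationError(ValueError):
    """Raised when a judge response does not match the schema."""

@dataclass(frozen=True)
class JudgeResult:
    """Simplified stand-in for the real, Pydantic-validated JudgeResult."""
    score: float
    refusal: bool

    def __post_init__(self):
        if not isinstance(self.score, (int, float)) or not 0.0 <= self.score <= 5.0:
            raise JudgeValidationError(f"score out of range: {self.score!r}")
        if not isinstance(self.refusal, bool):
            raise JudgeValidationError("refusal must be bool")

def parse_judge(payload: dict) -> JudgeResult:
    try:
        return JudgeResult(**payload)
    except TypeError as exc:  # missing or unexpected keys
        raise JudgeValidationError(str(exc)) from exc

ok = parse_judge({"score": 4.2, "refusal": False})
try:
    parse_judge({"score": "high", "refusal": False})
    rejected = False
except JudgeValidationError:
    rejected = True
```

The malformed response raises instead of flowing into the fitness as garbage, which is exactly the failure mode the SDK-level validation prevents.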
Hard gates collapse fitness to zero, no gradient. A soft penalty on refusal (e.g. "refusals lose 20%") rewards prompts that almost refuse. Hard-zero punishes refusal absolutely. The searcher sees no reward signal from inside the refusal region, so it does not approach the boundary. This matches how real deployments evaluate prompts: a prompt that refuses 1-in-10 is not "90% as good," it is unshippable.
Prompt caching on the judge system prompt. The judge's ~170-line system prompt is sized past the cacheable-prefix minimum. A typical calibration run issues hundreds of judge calls; cache hits dominate cost. Every run surfaces cache_read_input_tokens so silent invalidators (e.g. a non-deterministic byte sneaking into the system prompt) fail loud. The full judge prompt is worth reading as a case study in prompt-cache-aware design.
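The cache-aware request shape puts the long, byte-stable judge prompt in a `system` block carrying a `cache_control` breakpoint, with everything variable in `messages`. The dict below mirrors the Anthropic Messages API request shape as a sketch; treat the exact fields as an assumption and verify against the SDK documentation before relying on it.

```python
def build_judge_payload(judge_system: str, user_text: str, model: str) -> dict:
    """Cache-aware judge request: stable prefix first, variable content last.

    Mirrors the Messages API request shape (assumed here, not imported
    from the SDK). The cache breakpoint sits at the end of the stable
    system prompt so every call shares the same cacheable prefix.
    """
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": judge_system,  # long, byte-stable judge prompt
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_text}],
    }

# Byte-identical system prompts across calls -> identical cacheable prefix.
a = build_judge_payload("RUBRIC v1 ...", "grade this response", "claude-haiku-4-5")
b = build_judge_payload("RUBRIC v1 ...", "grade another response", "claude-haiku-4-5")
```

Any nondeterministic byte in `judge_system` (a timestamp, an unnormalized path) would make `a["system"] != b["system"]` and silently zero the cache hits, which is why the CLI surfaces `cache_read_input_tokens`.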
effort and thinking_enabled are first-class axes, not globals. Some prompts only need low effort; forcing high across the board wastes tokens without improving fitness. Letting the calibration surface this is the entire point — if effort_idx has high stress, it matters for your task; if it has low stress, lock it at the neutral and cut tokens.
Target and judge clients are separate parameters. Passing the same Anthropic() instance to both is the common case. Splitting them means you can swap judge models (the judge can be stronger than the target — often a better quality/cost tradeoff), mock each side independently in tests, or route judge calls through a different workspace.
Every path is normalized to forward slashes before going into a prompt. src\foo.py and src/foo.py are different bytes in the API payload, and the prompt cache only hits when the bytes match exactly. Windows users would silently lose cache hits otherwise.
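One stdlib way to do this normalization (illustrative; the package's actual helper may differ) is to route every path through `PureWindowsPath`, which accepts both separators and emits POSIX form:

```python
from pathlib import PureWindowsPath

def to_forward_slashes(path: str) -> str:
    """Normalize path separators so prompt bytes match across platforms.

    PureWindowsPath accepts both "\\" and "/" as separators, so this is
    a no-op for paths that are already POSIX-style.
    """
    return PureWindowsPath(path).as_posix()

# Both spellings collapse to the same bytes in the prompt.
same_bytes = to_forward_slashes(r"src\foo.py") == to_forward_slashes("src/foo.py")
```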
No temperature / top_p axis. Modern Claude pinned models remove these. Rather than pretend to support them and fail on the server, omegaprompt excludes them from the default space. If you need them for an older model, override the axes in PromptSpace.
Cost & performance
Per evaluate() call: 2 × (dataset_size) API calls — one target, one judge per item. Typical 10-item dataset = 20 API calls per candidate.
| Scenario | Target calls | Judge calls | Est. cost (cached judge) |
|---|---|---|---|
| Single candidate, 10-item dataset | 10 | 10 | ~$0.05-0.10 |
| Grid search, 5^3 = 125 candidates, 10-item dataset | 1250 | 1250 | ~$6-12 |
| run_p1 with walk-forward (train + test) | 2x the above | 2x the above | ~$12-24 |
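The call counts in the table reduce to one line of arithmetic. The helper below is illustrative, and it adopts the table's rough assumption that walk-forward doubles the total (i.e. a test slice the same size as the train slice):

```python
def calibration_calls(dataset_size: int, n_candidates: int, walk_forward: bool = True) -> int:
    """Total API calls: one target + one judge call per item per candidate,
    doubled for walk-forward per the table's rough train+test estimate."""
    per_candidate = 2 * dataset_size  # target call + judge call per item
    total = per_candidate * n_candidates
    return total * 2 if walk_forward else total
```

For the table's grid-search row, `calibration_calls(10, 125, walk_forward=False)` reproduces the 1250 + 1250 = 2500 calls.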
Actual cost depends on target and judge model tiers. Use claude-haiku-4-5 as the judge during prompt iteration to bring this down by 4-5×; promote to a stronger judge only for the final shipped-calibration run.
Every CLI invocation prints aggregate token usage at the end. If cache_read_input_tokens is zero across consecutive runs, something in the judge prompt drifted — the CLI surfaces this explicitly rather than silently absorbing the cost.
Validation
50 tests, 0 network calls. Both the target client and judge client are accepted via a structural Protocol, so every API test mocks with SimpleNamespace or MagicMock. The test surface asserts the exact shape of each request payload (model, thinking config, cache_control placement, few-shot ordering) without negotiating with a real server.
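Structural typing makes the mocks nearly free. The sketch below shows the pattern described above (a `SimpleNamespace` standing in for a Messages-style client and capturing the exact request payload); the client surface it fakes is an assumption about what the code under test touches, not the package's real Protocol definition.

```python
from types import SimpleNamespace

def make_mock_client(reply_text: str):
    """Fake a Messages-style client structurally: no network, no SDK.

    Only the attributes the code under test actually touches need to
    exist, which is the payoff of accepting clients via a Protocol.
    Returns (client, captured) where `captured` records the last payload.
    """
    captured = {}

    def create(**kwargs):
        captured.update(kwargs)  # record the exact request for assertions
        return SimpleNamespace(content=[SimpleNamespace(type="text", text=reply_text)])

    client = SimpleNamespace(messages=SimpleNamespace(create=create))
    return client, captured

client, captured = make_mock_client("4.2")
resp = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=64,
    messages=[{"role": "user", "content": "hi"}],
)
```

A test can then assert on `captured["model"]`, thinking config, or few-shot ordering without negotiating with a real server.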
| Module | Coverage |
|---|---|
| `schema.py` | `ParamVariants` / `PromptSpace` / `CalibrationOutcome` — required fields, range validation, JSON roundtrip. |
| `dataset.py` | JSONL loader — schema validation, duplicate id detection, blank-line tolerance, missing-file error. |
| `judge.py` | `Dimension` / `HardGate` / `JudgeRubric` / `JudgeResult` — scale validation, normalized weights, clamping out-of-scale scores, gate aggregation. |
| `fitness.py` | `CompositeFitness` — empty batch, all-pass, partial-fail, all-fail, per-item preservation. |
| `api.py` | `call_target` / `call_judge` — payload shape, thinking on/off, few-shot ordering, refusal branch, dict coercion, reference handling. |
| `target.py` | `PromptTarget` — end-to-end with mocked clients, default resolution, parameter clamping, usage accumulation. |
| `cli.py` | Help / version / subcommand wiring. |
Run with uv run pytest -q. Typical wall time: under one second.
The 3-layer stack
omegaprompt does not stand alone. It is the applied layer in a three-project system:
```
        ┌─────────────────────────────────────────────┐
LAYER   │ omegaprompt (this repo)                     │  "Apply the discipline to prompts"
APPLY   │ v0.1.0 — Claude API prompt calibration      │
        └────────────────────┬────────────────────────┘
                             │ depends on
                             ▼
        ┌─────────────────────────────────────────────┐
LAYER   │ omega-lock                                  │  "The calibration framework"
CORE    │ v0.1.4 — stress + grid + walk-forward + KC  │
        └────────────────────┬────────────────────────┘
                             │ validated by
                             ▼
        ┌─────────────────────────────────────────────┐
LAYER   │ Antemortem + antemortem-cli                 │  "The discipline around the build"
META    │ methodology + tooling for pre-impl recon    │
        └─────────────────────────────────────────────┘
```
- omega-lock supplies the calibration engine: stress measurement, top-K unlock, grid search, walk-forward, kill criteria, benchmark scorecards. omegaprompt uses it as a library.
- Antemortem + antemortem-cli — the pre-implementation reconnaissance discipline under which both omega-lock and omegaprompt were built. Antemortem catches ghost traps before code is written; omega-lock catches overfit parameters before they ship; omegaprompt catches overfit prompts before they deploy. The pattern repeats at three scales: spec, parameters, prompts.
The layering matters for credibility. The calibration engine was shipped and validated (omega-lock 0.1.4 on PyPI, 176 tests) before this prompt adapter was written. The adapter is ~200 lines because everything it needs already exists.
Relation to adjacent tools
| Tool | What it does | What omegaprompt adds |
|---|---|---|
| promptfoo | Run prompts against test cases, compare outputs, assertion-based grading | Pre-declared walk-forward gate (KC-4) so training ≠ ship criterion. Hard gates that collapse fitness, not softly penalize. Stress-based axis selection. |
| DSPy | Prompt optimization via program abstraction + bootstrapped few-shot | Domain-agnostic adapter (any CalibrableTarget works). Calibration-first framing (stress + grid + walk-forward), not program synthesis. Composable with DSPy — DSPy output is just another system_prompt_variant. |
| Optuna / Ray Tune (on prompts) | General HPO over prompt knobs | Walk-forward validation + pre-declared kill criteria out of the box. LLM-as-judge with schema-enforced responses. Composite hard_gate × soft_score fitness built in, not reinvented per project. |
| Hand-rolled "eval suite" | Custom per-project scripts that call the model, score, rank | Structured data contract (Dataset, Rubric, Outcome), machine-readable artifact, reproducibility, and plug-and-play into an existing calibration engine that already has a 30-run reference benchmark. |
The USP is discipline, not search. The search part is handled by omega-lock (which handles it for any CalibrableTarget). omegaprompt contributes the prompt-specific adapter and the hard-gates-first fitness shape.
Status
v0.1.0 is alpha. The data contract (Dataset, JudgeRubric, ParamVariants, PromptSpace, CalibrationOutcome) is stable. The CLI contract (omegaprompt calibrate, its flags, exit codes) is stable. The judge prompt will iterate as we accumulate scoring-quality data from real runs — expect v0.1.x bumps for judge prompt revisions, tracked in CHANGELOG under "Judge prompt revisions".
Semver applies strictly from v1.0.
Full changelog: CHANGELOG.md.
Roadmap
v0.1.x (judge prompt iteration track)
- Dogfood against diverse task types (code generation, reasoning, extraction, classification). Record scoring drift.
- Reference scoring-quality benchmark so judge prompt revisions are measured, not guessed.
- Additional hard gate evaluators (format predicates, safety classifiers) callable without a judge round-trip.
v0.2 (tooling depth)
- `omegaprompt report <outcome.json>` — human-readable debrief renderer.
- Multi-judge validation pattern: `judge_v1` + `judge_v2` over top-K, disagreement = trust signal.
- `--dry-run` with cost estimate before launching a calibration run.
- Second `cache_control` breakpoint on the rubric for iterative same-rubric runs.
v0.3 (ecosystem)
- Benchmark harness: multiple (task × rubric × seed) combinations, RAGAS-style scorecard like omega-lock's.
- GitHub Action for CI gating — runs a calibration on PR, blocks merge on KC-4 fail.
Explicitly out of scope: web dashboard, proprietary hosting, multi-user tenancy. omegaprompt is a local developer tool; keep it local.
Contributing
The most valuable contributions are published calibration outcomes — a dataset, a rubric, and the resulting CalibrationOutcome.json across methods. They make the judge prompt evidence-based.
Issues and PRs welcome. For non-trivial changes, run an antemortem first with antemortem-cli — we dogfood the discipline that built this framework.
Citing
omegaprompt v0.1.0 — calibration discipline for Claude API prompts.
https://github.com/hibou04-ops/omegaprompt, 2026.
Parent framework:
omega-lock v0.1.4 — sensitivity-driven coordinate descent calibration framework.
https://github.com/hibou04-ops/omega-lock, 2026.
Methodology (how this and its siblings were built):
Antemortem v0.1.1 — AI-assisted pre-implementation reconnaissance for software changes.
https://github.com/hibou04-ops/Antemortem, 2026.
License
MIT. See LICENSE.
Colophon
Designed, implemented, and shipped solo. Adapter layer over omega-lock; zero calibration-engine reimplementation. 50 tests, 0 live API calls in CI. The tool is built with the pre-implementation reconnaissance discipline it supports for its callers.