
omegaprompt


Calibration discipline for Claude API prompts. A prompt that scores 5.0 on the training slice and collapses on held-out data is a textbook case of overfitting. omegaprompt ports omega-lock's sensitivity-driven coordinate descent and walk-forward validation to the prompt-engineering setting — so prompts ship only after they generalize, with hard gates that zero out fitness on refusal or malformed output.

pip install omegaprompt

Korean README: README_KR.md




Why this exists

Prompt engineering has a known failure mode: you iterate on a handful of hand-picked examples, the prompt starts looking great, you ship it, and it collapses on the first input outside your head. The failure mode is not new — it is overfitting, the same thing machine learning practitioners have been fighting since the 1990s — but the prompt-engineering toolchain has mostly reinvented evaluation without reinventing the defense.

omega-lock, a sibling project, solved this for numeric parameter calibration. The insight is simple and transfers:

  1. Measure sensitivity. Which parameters actually matter? Perturb each one around a neutral baseline, rank by Gini coefficient of the fitness delta.
  2. Unlock top-K, lock the rest. Search only in the subspace that moves the fitness. The rest stay at neutral.
  3. Walk forward. After the search finishes on the training slice, re-evaluate on a held-out test slice the searcher never saw. Require a Pearson correlation above a pre-declared threshold (KC-4) before the result ships. No tuning the threshold after the fact.
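For orientation, here is a conceptual sketch of those three steps. It is not omega-lock's implementation — the function names, signatures, and the mean-of-deltas stress statistic are illustrative stand-ins for the real Gini-based machinery:

```python
import numpy as np

def stress(evaluate, neutral: dict, axis: str, values: list) -> float:
    """Perturb one axis around the neutral baseline and measure how far fitness moves."""
    base = evaluate(neutral)
    deltas = [abs(evaluate({**neutral, axis: v}) - base) for v in values]
    return float(np.mean(deltas))      # omega-lock ranks axes with a Gini-based statistic

def unlock_top_k(stresses: dict, k: int) -> list:
    """Search only the k most sensitive axes; everything else stays locked at neutral."""
    return sorted(stresses, key=stresses.get, reverse=True)[:k]

def walk_forward_ok(train_scores: list, test_scores: list, threshold: float) -> bool:
    """KC-4: candidate rankings on the train and held-out test slices must correlate."""
    r = float(np.corrcoef(train_scores, test_scores)[0, 1])
    return r >= threshold              # threshold is pre-declared, never tuned afterwards
```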

This repository is the port. PromptTarget implements omega-lock's CalibrableTarget protocol, so every omega-lock pipeline (run_p1, run_p1_iterative, run_p2_tpe, run_benchmark) works on prompt calibration unchanged. The prompt-specific pieces — the LLM-as-judge scorer, the composite hard_gate × soft_score fitness, the five calibratable axes — sit on top.

Two guardrails distinguish this from "ask the judge and pick the highest score":

  1. Hard gates collapse fitness to zero. no_refusal, format_valid, no_safety_violation — if any gate fails on an item, that item contributes zero. A prompt that scores 5.0 on ten tasks but refuses on the eleventh does not rank above one that consistently scores 4.2 with no refusals.
  2. Walk-forward is the ship gate. omega-lock's KC-4 (Pearson correlation between train and test rankings) is pre-declared and enforced. If the training-best prompt does not rank highly on the test slice, the calibration fails with status = "FAIL:KC-4" and no candidate ships. You cannot retroactively lower the threshold.
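A minimal sketch of that fitness shape. The gate names come from the list above; the JudgeResult fields and the CompositeFitness internals shown here are assumptions, not the package's actual schema:

```python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    gates_passed: dict[str, bool]   # e.g. {"no_refusal": True, "format_valid": True, ...}
    weighted_score: float           # soft rubric score, normalized to [0, 1]

def composite_fitness(results: list[JudgeResult]) -> float:
    """Any failed hard gate zeroes that item's contribution; no partial credit."""
    if not results:
        return 0.0
    per_item = [
        r.weighted_score if all(r.gates_passed.values()) else 0.0
        for r in results
    ]
    return sum(per_item) / len(per_item)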

30-second demo

$ omegaprompt calibrate examples/sample_dataset.jsonl \
    --rubric examples/rubric_example.json \
    --variants examples/variants_example.json \
    --test examples/sample_test.jsonl \
    --target-model claude-haiku-4-5 \
    --judge-model claude-sonnet-4-6 \
    --method p1 \
    --unlock-k 3 \
    --output outcome.json

Loading dataset from examples/sample_dataset.jsonl ...
  5 items
Loading test set from examples/sample_test.jsonl ...
  5 items
Starting omega-lock run_p1 calibration (unlock_k=3, method=p1) ...
This issues Claude API calls. Budget accordingly.
Calibration complete. best_fitness=0.8240, test_fitness=0.7820, gen_gap=5.10%
Artifact: outcome.json

The outcome.json artifact contains the winning parameter set, the per-slice fitness, the generalization gap, the hard-gate pass rate, and aggregate token usage. It is machine-readable by design — omegaprompt report (v0.2) will render it; your own CI gate can diff outcomes across prompt revisions.
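A hedged sketch of such a CI gate. Field access assumes the documented artifact keys (best_params, generalization_gap, hard_gate_pass_rate) and a gap stored as a fraction; the actual layout may differ:

```python
import json
import sys

with open("outcome.json") as f:
    outcome = json.load(f)

# Block the merge if the winning prompt did not generalize or tripped a hard gate.
if outcome["hard_gate_pass_rate"] < 1.0:
    sys.exit("FAIL: some dataset items tripped a hard gate")
if outcome["generalization_gap"] > 0.10:        # your own pre-declared ceiling
    sys.exit("FAIL: train/test gap too large -- prompt looks overfit")
print("calibration OK:", outcome["best_params"])
```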


The calibratable axes

PromptTarget exposes five axes to the searcher:

| Axis | Type | Meaning |
|---|---|---|
| system_prompt_idx | int | Index into your pool of candidate system prompts (ParamVariants.system_prompts). |
| few_shot_count | int | How many examples to include from ParamVariants.few_shot_examples. 0 = zero-shot. |
| effort_idx | int (0-2) | Maps to effort: low / medium / high. Only meaningful when thinking is enabled. |
| thinking_enabled | bool | Whether to enable adaptive thinking on the target call. |
| max_tokens_bucket | int (0-2) | Maps to max_tokens: 1024 / 4096 / 16000. Surfaces length-bound bias. |

The PromptSpace dataclass lets you lock individual axes (effort_min == effort_max) when some dimensions are pre-decided. Setting all axes to a single value effectively runs a fixed-prompt benchmark through the judge; a typical calibration leaves three axes open and unlocks the top-K by sensitivity.
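For illustration, a sketch of locking axes. Only effort_min / effort_max are named in this README; the other field names and the import path are assumptions about the schema, not its real surface:

```python
from omegaprompt import PromptSpace   # assumed import path

space = PromptSpace(
    system_prompt_idx_max=3,        # assumed field: 4 candidate system prompts to search
    few_shot_count_max=5,           # assumed field: 0..5 few-shot examples open
    effort_min=1, effort_max=1,     # locked: effort is always "medium", never searched
    max_tokens_bucket_max=2,        # assumed field: all three max_tokens buckets open
)
```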

The axis set is deliberately conservative in v0.1. Adding temperature, top_p, or reasoning-budget knobs would require coupling to model-specific behavior that evolves between releases. The five axes above transfer across any chat-style Claude model string.


Architecture

┌──────────────────────────────────────────────────────────────┐
│  Dataset (.jsonl)   ParamVariants (.json)   JudgeRubric (.json) │
└────────────────┬─────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────┐
│  PromptTarget                                                 │
│    implements omega-lock CalibrableTarget                    │
│    param_space() ---> 5 axes                                  │
│    evaluate(params):                                          │
│       for each dataset item:                                  │
│         1. build prompt from params                           │
│         2. call_target(target_client, ...)  -> response       │
│         3. call_judge(judge_client, rubric, ...) -> JudgeResult │
│       CompositeFitness(judge_results) -> fitness              │
│    returns EvalResult(fitness, n_trials, metadata)            │
└────────────────┬─────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────┐
│  omega-lock run_p1                                            │
│    measure_stress + select_unlock_top_k                       │
│    GridSearch over unlocked subspace                          │
│    WalkForward on --test slice (KC-4)                         │
│    emits grid_best, test_fitness, status                      │
└────────────────┬─────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────┐
│  CalibrationOutcome (JSON artifact)                           │
│    best_params, best_fitness, test_fitness,                   │
│    generalization_gap, hard_gate_pass_rate,                   │
│    n_candidates_evaluated, total_api_calls, usage_summary     │
└──────────────────────────────────────────────────────────────┘

Every piece above the omega-lock line is new; every piece at or below it is reused unchanged. The composition boundary is the CalibrableTarget protocol — two methods, param_space() and evaluate(params).
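The boundary, sketched as a structural Protocol. EvalResult's fields follow the diagram above (fitness, n_trials, metadata); everything else is illustrative:

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

@dataclass
class EvalResult:
    fitness: float
    n_trials: int
    metadata: dict[str, Any] = field(default_factory=dict)

class CalibrableTarget(Protocol):
    def param_space(self) -> dict[str, Any]: ...                  # the axes the searcher may move
    def evaluate(self, params: dict[str, Any]) -> EvalResult: ... # one fitness per candidate

# Any object with these two methods -- PromptTarget included -- can be handed to
# run_p1, run_p1_iterative, run_p2_tpe, or run_benchmark.
```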


Design decisions worth defending

Single-responsibility adapter over an existing calibration engine. omega-lock already handles stress measurement, top-K unlock, grid search, walk-forward, KC gates, benchmark scorecards, and iterative lock-in. Reimplementing any of that here would be wrong. PromptTarget is ~200 lines; the value is that every omega-lock pipeline works on it unchanged.

LLM-as-judge with a Pydantic-validated response. The judge call uses messages.parse(output_format=JudgeResult). A malformed judge response raises ValidationError at the SDK boundary — it never pollutes the fitness. Without this, a single misbehaving judge call could tank an entire calibration run, and you would not notice until the final report.
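Conceptually, the boundary behaves like plain Pydantic validation. The package uses messages.parse as described above; this standalone sketch (with assumed JudgeResult fields) shows why a malformed response cannot reach the fitness:

```python
from pydantic import BaseModel, ValidationError

class JudgeResult(BaseModel):
    scores: dict[str, float]
    gates: dict[str, bool]
    rationale: str

raw = '{"scores": {"accuracy": 4.5}, "gates": {"no_refusal": true}}'  # missing "rationale"
try:
    result = JudgeResult.model_validate_json(raw)
except ValidationError as e:
    # The malformed judge response fails here, before it can pollute any fitness value.
    print("judge response rejected:", e.error_count(), "error(s)")
```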

Hard gates collapse fitness to zero, no gradient. A soft penalty on refusal (e.g. "refusals lose 20%") rewards prompts that almost refuse. Hard-zero punishes refusal absolutely. The searcher sees no reward signal from inside the refusal region, so it does not approach the boundary. This matches how real deployments evaluate prompts: a prompt that refuses 1-in-10 is not "90% as good," it is unshippable.

Prompt caching on the judge system prompt. The judge's ~170-line system prompt is sized past the cacheable-prefix minimum. A typical calibration run issues hundreds of judge calls; cache hits dominate cost. Every run surfaces cache_read_input_tokens so silent invalidators (e.g. a non-deterministic byte sneaking into the system prompt) fail loud. The full judge prompt is worth reading as a case study in prompt-cache-aware design.
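The rough shape of a cache-aware judge call, assuming an Anthropic client and hypothetical JUDGE_SYSTEM_PROMPT / judge_payload strings; see the Anthropic prompt-caching docs for the authoritative form:

```python
from anthropic import Anthropic

judge_client = Anthropic()
JUDGE_SYSTEM_PROMPT = "<long, byte-stable judge rubric prompt goes here>"
judge_payload = "<target response plus rubric context goes here>"

response = judge_client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": JUDGE_SYSTEM_PROMPT,                 # the cacheable prefix
            "cache_control": {"type": "ephemeral"},      # cache breakpoint placed here
        }
    ],
    messages=[{"role": "user", "content": judge_payload}],
)
# Watch this across runs: zero means the prefix drifted and the cache was invalidated.
print(response.usage.cache_read_input_tokens)
```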

effort and thinking_enabled are first-class axes, not globals. Some prompts only need low effort; forcing high across the board wastes tokens without improving fitness. Letting the calibration surface this is the entire point — if effort_idx has high stress, it matters for your task; if it has low stress, lock it at the neutral and cut tokens.

Target and judge clients are separate parameters. Passing the same Anthropic() instance to both is the common case. Splitting them means you can swap judge models (the judge can be stronger than the target — often a better quality/cost tradeoff), mock each side independently in tests, or route judge calls through a different workspace.

Every path is normalized to forward slashes before going into a prompt. src\foo.py and src/foo.py are different bytes in the API payload, and the prompt cache only hits when the bytes match exactly. Windows users would otherwise silently lose cache hits.
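One way to do this (illustrative, not necessarily the package's helper):

```python
from pathlib import PureWindowsPath

def normalize(path: str) -> str:
    """src\\foo.py and src/foo.py both become src/foo.py -- identical bytes either way."""
    return PureWindowsPath(path).as_posix()

assert normalize(r"src\foo.py") == "src/foo.py"
assert normalize("src/foo.py") == "src/foo.py"
```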

No temperature / top_p axis. Modern Claude pinned models remove these. Rather than pretend to support them and fail on the server, omegaprompt excludes them from the default space. If you need them for an older model, override the axes in PromptSpace.


Cost & performance

Per evaluate() call: 2 × (dataset_size) API calls — one target, one judge per item. Typical 10-item dataset = 20 API calls per candidate.

| Scenario | Target calls | Judge calls | Est. cost (cached judge) |
|---|---|---|---|
| Single candidate, 10-item dataset | 10 | 10 | ~$0.05-0.10 |
| Grid search, 5^3 = 125 candidates, 10-item dataset | 1250 | 1250 | ~$6-12 |
| run_p1 with walk-forward (train + test) | 2× the above | 2× the above | ~$12-24 |

Actual cost depends on target and judge model tiers. Use claude-haiku-4-5 as the judge during prompt iteration to bring this down by 4-5×; promote to a stronger judge only for the final shipped-calibration run.
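The same arithmetic as a tiny estimator (not a function the package exports; costs are omitted because they depend on model tier and cache hit rate):

```python
def api_calls(dataset_size: int, n_candidates: int, walk_forward: bool = True) -> int:
    per_candidate = 2 * dataset_size             # one target + one judge call per item
    slices = 2 if walk_forward else 1            # train slice, then held-out test slice
    return per_candidate * n_candidates * slices

print(api_calls(10, 1, walk_forward=False))      # 20   -- single candidate
print(api_calls(10, 5 ** 3, walk_forward=False)) # 2500 -- full 5^3 grid, train only
print(api_calls(10, 5 ** 3))                     # 5000 -- train + test
```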

Every CLI invocation prints aggregate token usage at the end. If cache_read_input_tokens is zero across consecutive runs, something in the judge prompt drifted — the CLI surfaces this explicitly rather than silently absorbing the cost.


Validation

50 tests, 0 network calls. Both the target client and judge client are accepted via a structural Protocol, so every API test mocks with SimpleNamespace or MagicMock. The test surface asserts the exact shape of each request payload (model, thinking config, cache_control placement, few-shot ordering) without negotiating with a real server.
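A sketch of that test style: a SimpleNamespace stand-in that satisfies the structural Protocol and captures the exact payload sent. In the real suite an object like this is handed to call_target / PromptTarget in place of a live Anthropic client; the payload fields asserted here are illustrative:

```python
from types import SimpleNamespace

def make_fake_client(captured: dict) -> SimpleNamespace:
    def create(**kwargs):
        captured.update(kwargs)                  # record the exact request payload
        return SimpleNamespace(
            content=[SimpleNamespace(type="text", text="ok")],
            usage=SimpleNamespace(input_tokens=10, output_tokens=5,
                                  cache_read_input_tokens=0),
        )
    return SimpleNamespace(messages=SimpleNamespace(create=create))

def test_payload_shape():
    captured: dict = {}
    client = make_fake_client(captured)
    client.messages.create(model="claude-haiku-4-5", max_tokens=1024, messages=[])
    assert captured["model"] == "claude-haiku-4-5"   # assert on the request, no network
```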

| Module | Coverage |
|---|---|
| schema.py | ParamVariants / PromptSpace / CalibrationOutcome — required fields, range validation, JSON roundtrip. |
| dataset.py | JSONL loader — schema validation, duplicate id detection, blank-line tolerance, missing-file error. |
| judge.py | Dimension / HardGate / JudgeRubric / JudgeResult — scale validation, normalized weights, clamping out-of-scale scores, gate aggregation. |
| fitness.py | CompositeFitness — empty batch, all-pass, partial-fail, all-fail, per-item preservation. |
| api.py | call_target / call_judge — payload shape, thinking on/off, few-shot ordering, refusal branch, dict coercion, reference handling. |
| target.py | PromptTarget — end-to-end with mocked clients, default resolution, parameter clamping, usage accumulation. |
| cli.py | Help / version / subcommand wiring. |

Run with uv run pytest -q. Typical wall time: under one second.


The 3-layer stack

omegaprompt does not stand alone. It is the applied layer in a three-project system:

       ┌─────────────────────────────────────────────┐
LAYER  │  omegaprompt  (this repo)                   │  "Apply the discipline to prompts"
APPLY  │  v0.1.0 — Claude API prompt calibration     │
       └────────────────────┬────────────────────────┘
                            │ depends on
                            ▼
       ┌─────────────────────────────────────────────┐
LAYER  │  omega-lock                                 │  "The calibration framework"
CORE   │  v0.1.4 — stress + grid + walk-forward + KC │
       └────────────────────┬────────────────────────┘
                            │ validated by
                            ▼
       ┌─────────────────────────────────────────────┐
LAYER  │  Antemortem + antemortem-cli                │  "The discipline around the build"
META   │  methodology + tooling for pre-impl recon   │
       └─────────────────────────────────────────────┘
  • omega-lock supplies the calibration engine: stress measurement, top-K unlock, grid search, walk-forward, kill criteria, benchmark scorecards. omegaprompt uses it as a library.
  • Antemortem + antemortem-cli — the pre-implementation reconnaissance discipline under which both omega-lock and omegaprompt were built. Antemortem catches ghost traps before code is written; omega-lock catches overfit parameters before they ship; omegaprompt catches overfit prompts before they deploy. The pattern repeats at three scales: spec, parameters, prompts.

The layering matters for credibility. The calibration engine was shipped and validated (omega-lock 0.1.4 on PyPI, 176 tests) before this prompt adapter was written. The adapter is ~200 lines because everything it needs already exists.


Relation to adjacent tools

| Tool | What it does | What omegaprompt adds |
|---|---|---|
| promptfoo | Run prompts against test cases, compare outputs, assertion-based grading | Pre-declared walk-forward gate (KC-4) so training ≠ ship criterion. Hard gates that collapse fitness, not softly penalize. Stress-based axis selection. |
| DSPy | Prompt optimization via program abstraction + bootstrapped few-shot | Domain-agnostic adapter (any CalibrableTarget works). Calibration-first framing (stress + grid + walk-forward), not program synthesis. Composable with DSPy — DSPy output is just another system_prompt_variant. |
| Optuna / Ray Tune (on prompts) | General HPO over prompt knobs | Walk-forward validation + pre-declared kill criteria out of the box. LLM-as-judge with schema-enforced responses. Composite hard_gate × soft_score fitness built in, not reinvented per project. |
| Hand-rolled "eval suite" | Custom per-project scripts that call the model, score, rank | Structured data contract (Dataset, Rubric, Outcome), machine-readable artifact, reproducibility, and plug-and-play into an existing calibration engine that already has a 30-run reference benchmark. |

The USP is discipline, not search. omega-lock handles the search (and handles it for any CalibrableTarget); omegaprompt contributes the prompt-specific adapter and the hard-gates-first fitness shape.


Status

v0.1.0 is alpha. The data contract (Dataset, JudgeRubric, ParamVariants, PromptSpace, CalibrationOutcome) is stable. The CLI contract (omegaprompt calibrate, its flags, exit codes) is stable. The judge prompt will iterate as we accumulate scoring-quality data from real runs — expect v0.1.x bumps for judge prompt revisions, tracked in CHANGELOG under "Judge prompt revisions".

Semver applies strictly from v1.0.

Full changelog: CHANGELOG.md.


Roadmap

v0.1.x (judge prompt iteration track)

  • Dogfood against diverse task types (code generation, reasoning, extraction, classification). Record scoring drift.
  • Reference scoring-quality benchmark so judge prompt revisions are measured, not guessed.
  • Additional hard gate evaluators (format predicates, safety classifiers) callable without a judge round-trip.

v0.2 (tooling depth)

  • omegaprompt report <outcome.json> — human-readable debrief renderer.
  • Multi-judge validation pattern: judge_v1 + judge_v2 over top-K, disagreement = trust signal.
  • --dry-run with cost estimate before launching a calibration run.
  • Second cache_control breakpoint on the rubric for iterative same-rubric runs.

v0.3 (ecosystem)

  • Benchmark harness: multiple (task × rubric × seed) combinations, RAGAS-style scorecard like omega-lock's.
  • GitHub Action for CI gating — runs a calibration on PR, blocks merge on KC-4 fail.

Explicitly out of scope: web dashboard, proprietary hosting, multi-user tenancy. omegaprompt is a local developer tool; keep it local.


Contributing

The most valuable contributions are published calibration outcomes — a dataset, a rubric, and the resulting CalibrationOutcome.json across methods. They make the judge prompt evidence-based.

Issues and PRs welcome. For non-trivial changes, run an antemortem first with antemortem-cli — we dogfood the discipline that built this framework.


Citing

omegaprompt v0.1.0 — calibration discipline for Claude API prompts.
https://github.com/hibou04-ops/omegaprompt, 2026.

Parent framework:

omega-lock v0.1.4 — sensitivity-driven coordinate descent calibration framework.
https://github.com/hibou04-ops/omega-lock, 2026.

Methodology (how this and its siblings were built):

Antemortem v0.1.1 — AI-assisted pre-implementation reconnaissance for software changes.
https://github.com/hibou04-ops/Antemortem, 2026.

License

MIT. See LICENSE.

Colophon

Designed, implemented, and shipped solo. Adapter layer over omega-lock; zero calibration-engine reimplementation. 50 tests, 0 live API calls in CI. The tool is built with the pre-implementation reconnaissance discipline it supports for its callers.

