
omegaprompt


Calibration discipline for Claude API prompts. A prompt that scores 5.0 on the training slice and collapses on held-out data is a textbook case of overfitting. omegaprompt ports omega-lock's sensitivity-driven coordinate descent and walk-forward validation to the prompt-engineering setting — so prompts ship only after they generalize, with hard gates that zero out fitness on refusal or malformed output.

pip install omegaprompt

Korean README: README_KR.md




Why this exists

Prompt engineering has a known failure mode: you iterate on a handful of hand-picked examples, the prompt starts looking great, you ship it, and it collapses on the first input outside your head. The failure mode is not new — it is overfitting, the same thing machine learning practitioners have been fighting since the 1990s — but the prompt-engineering toolchain has mostly reinvented evaluation without reinventing the defense.

omega-lock, a sibling project, solved this for numeric parameter calibration. The insight is simple and transfers:

  1. Measure sensitivity. Which parameters actually matter? Perturb each one around a neutral baseline, rank by Gini coefficient of the fitness delta.
  2. Unlock top-K, lock the rest. Search only in the subspace that moves the fitness. The rest stay at neutral.
  3. Walk forward. After the search finishes on the training slice, re-evaluate on a held-out test slice the searcher never saw. Require a Pearson correlation above a pre-declared threshold (KC-4) before the result ships. No tuning the threshold after the fact.
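For orientation, here is a conceptual sketch of those three steps. It is not omega-lock's implementation — the function names, signatures, and the mean-of-deltas stress statistic are illustrative stand-ins for the real Gini-based machinery:

```python
import numpy as np

def stress(evaluate, neutral: dict, axis: str, values: list) -> float:
    """Perturb one axis around the neutral baseline and measure how far fitness moves."""
    base = evaluate(neutral)
    deltas = [abs(evaluate({**neutral, axis: v}) - base) for v in values]
    return float(np.mean(deltas))      # omega-lock ranks axes with a Gini-based statistic

def unlock_top_k(stresses: dict, k: int) -> list:
    """Search only the k most sensitive axes; everything else stays locked at neutral."""
    return sorted(stresses, key=stresses.get, reverse=True)[:k]

def walk_forward_ok(train_scores: list, test_scores: list, threshold: float) -> bool:
    """KC-4: candidate rankings on the train and held-out test slices must correlate."""
    r = float(np.corrcoef(train_scores, test_scores)[0, 1])
    return r >= threshold              # threshold is pre-declared, never tuned afterwards
```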

This repository is the port. PromptTarget implements omega-lock's CalibrableTarget protocol, so every omega-lock pipeline (run_p1, run_p1_iterative, run_p2_tpe, run_benchmark) works on prompt calibration unchanged. The prompt-specific pieces — the LLM-as-judge scorer, the composite hard_gate × soft_score fitness, the five calibratable axes — sit on top.

Two guardrails distinguish this from "ask the judge and pick the highest score":

  1. Hard gates collapse fitness to zero. no_refusal, format_valid, no_safety_violation — if any gate fails on an item, that item contributes zero. A prompt that scores 5.0 on ten tasks but refuses on the eleventh does not rank above one that consistently scores 4.2 with no refusals.
  2. Walk-forward is the ship gate. omega-lock's KC-4 (Pearson correlation between train and test rankings) is pre-declared and enforced. If the training-best prompt does not rank highly on the test slice, the calibration fails with status = "FAIL:KC-4" and no candidate ships. You cannot retroactively lower the threshold.
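A minimal sketch of that fitness shape. The gate names come from the list above; the JudgeResult fields and the CompositeFitness internals shown here are assumptions, not the package's actual schema:

```python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    gates_passed: dict[str, bool]   # e.g. {"no_refusal": True, "format_valid": True, ...}
    weighted_score: float           # soft rubric score, normalized to [0, 1]

def composite_fitness(results: list[JudgeResult]) -> float:
    """Any failed hard gate zeroes that item's contribution; no partial credit."""
    if not results:
        return 0.0
    per_item = [
        r.weighted_score if all(r.gates_passed.values()) else 0.0
        for r in results
    ]
    return sum(per_item) / len(per_item)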

30-second demo

$ omegaprompt calibrate examples/sample_dataset.jsonl \
    --rubric examples/rubric_example.json \
    --variants examples/variants_example.json \
    --test examples/sample_test.jsonl \
    --target-model claude-haiku-4-5 \
    --judge-model claude-sonnet-4-6 \
    --method p1 \
    --unlock-k 3 \
    --output outcome.json

Loading dataset from examples/sample_dataset.jsonl ...
  5 items
Loading test set from examples/sample_test.jsonl ...
  5 items
Starting omega-lock run_p1 calibration (unlock_k=3, method=p1) ...
This issues Claude API calls. Budget accordingly.
Calibration complete. best_fitness=0.8240, test_fitness=0.7820, gen_gap=5.10%
Artifact: outcome.json

The outcome.json artifact contains the winning parameter set, the per-slice fitness, the generalization gap, the hard-gate pass rate, and aggregate token usage. It is machine-readable by design — omegaprompt report (v0.2) will render it; your own CI gate can diff outcomes across prompt revisions.
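A hedged sketch of such a CI gate. Field access assumes the documented artifact keys (best_params, generalization_gap, hard_gate_pass_rate) and a gap stored as a fraction; the actual layout may differ:

```python
import json
import sys

with open("outcome.json") as f:
    outcome = json.load(f)

# Block the merge if the winning prompt did not generalize or tripped a hard gate.
if outcome["hard_gate_pass_rate"] < 1.0:
    sys.exit("FAIL: some dataset items tripped a hard gate")
if outcome["generalization_gap"] > 0.10:        # your own pre-declared ceiling
    sys.exit("FAIL: train/test gap too large -- prompt looks overfit")
print("calibration OK:", outcome["best_params"])
```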


The calibratable axes

PromptTarget exposes five axes to the searcher:

| Axis | Type | Meaning |
|---|---|---|
| system_prompt_idx | int | Index into your pool of candidate system prompts (ParamVariants.system_prompts). |
| few_shot_count | int | How many examples to include from ParamVariants.few_shot_examples. 0 = zero-shot. |
| effort_idx | int (0-2) | Maps to effort: low / medium / high. Only meaningful when thinking is enabled. |
| thinking_enabled | bool | Whether to enable adaptive thinking on the target call. |
| max_tokens_bucket | int (0-2) | Maps to max_tokens: 1024 / 4096 / 16000. Surfaces length-bound bias. |

The PromptSpace dataclass lets you lock individual axes (effort_min == effort_max) when some dimensions are pre-decided. Setting all axes to a single value effectively runs a fixed-prompt benchmark through the judge; a typical calibration leaves three axes open and unlocks the top-K by sensitivity.
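For illustration, a sketch of locking axes. Only effort_min / effort_max are named in this README; the other field names and the import path are assumptions about the schema, not its real surface:

```python
from omegaprompt import PromptSpace   # assumed import path

space = PromptSpace(
    system_prompt_idx_max=3,        # assumed field: 4 candidate system prompts to search
    few_shot_count_max=5,           # assumed field: 0..5 few-shot examples open
    effort_min=1, effort_max=1,     # locked: effort is always "medium", never searched
    max_tokens_bucket_max=2,        # assumed field: all three max_tokens buckets open
)
```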

The axis set is deliberately conservative in v0.1. Adding temperature, top_p, or reasoning-budget knobs would require coupling to model-specific behavior that evolves between releases. The five axes above transfer across any chat-style Claude model string.


Architecture

┌──────────────────────────────────────────────────────────────┐
│  Dataset (.jsonl)   ParamVariants (.json)   JudgeRubric (.json) │
└────────────────┬─────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────┐
│  PromptTarget                                                 │
│    implements omega-lock CalibrableTarget                    │
│    param_space() ---> 5 axes                                  │
│    evaluate(params):                                          │
│       for each dataset item:                                  │
│         1. build prompt from params                           │
│         2. call_target(target_client, ...)  -> response       │
│         3. call_judge(judge_client, rubric, ...) -> JudgeResult │
│       CompositeFitness(judge_results) -> fitness              │
│    returns EvalResult(fitness, n_trials, metadata)            │
└────────────────┬─────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────┐
│  omega-lock run_p1                                            │
│    measure_stress + select_unlock_top_k                       │
│    GridSearch over unlocked subspace                          │
│    WalkForward on --test slice (KC-4)                         │
│    emits grid_best, test_fitness, status                      │
└────────────────┬─────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────┐
│  CalibrationOutcome (JSON artifact)                           │
│    best_params, best_fitness, test_fitness,                   │
│    generalization_gap, hard_gate_pass_rate,                   │
│    n_candidates_evaluated, total_api_calls, usage_summary     │
└──────────────────────────────────────────────────────────────┘

Every piece above the omega-lock line is new; every piece at or below it is reused unchanged. The composition boundary is the CalibrableTarget protocol — two methods, param_space() and evaluate(params).
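The boundary, sketched as a structural Protocol. EvalResult's fields follow the diagram above (fitness, n_trials, metadata); everything else is illustrative:

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

@dataclass
class EvalResult:
    fitness: float
    n_trials: int
    metadata: dict[str, Any] = field(default_factory=dict)

class CalibrableTarget(Protocol):
    def param_space(self) -> dict[str, Any]: ...                  # the axes the searcher may move
    def evaluate(self, params: dict[str, Any]) -> EvalResult: ... # one fitness per candidate

# Any object with these two methods -- PromptTarget included -- can be handed to
# run_p1, run_p1_iterative, run_p2_tpe, or run_benchmark.
```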


Design decisions worth defending

Single-responsibility adapter over an existing calibration engine. omega-lock already handles stress measurement, top-K unlock, grid search, walk-forward, KC gates, benchmark scorecards, and iterative lock-in. Reimplementing any of that here would be wrong. PromptTarget is ~200 lines; the value is that every omega-lock pipeline works on it unchanged.

LLM-as-judge with a Pydantic-validated response. The judge call uses messages.parse(output_format=JudgeResult). A malformed judge response raises ValidationError at the SDK boundary — it never pollutes the fitness. Without this, a single misbehaving judge call could tank an entire calibration run, and you would not notice until the final report.
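Conceptually, the boundary behaves like plain Pydantic validation. The package uses messages.parse as described above; this standalone sketch (with assumed JudgeResult fields) shows why a malformed response cannot reach the fitness:

```python
from pydantic import BaseModel, ValidationError

class JudgeResult(BaseModel):
    scores: dict[str, float]
    gates: dict[str, bool]
    rationale: str

raw = '{"scores": {"accuracy": 4.5}, "gates": {"no_refusal": true}}'  # missing "rationale"
try:
    result = JudgeResult.model_validate_json(raw)
except ValidationError as e:
    # The malformed judge response fails here, before it can pollute any fitness value.
    print("judge response rejected:", e.error_count(), "error(s)")
```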

Hard gates collapse fitness to zero, no gradient. A soft penalty on refusal (e.g. "refusals lose 20%") rewards prompts that almost refuse. Hard-zero punishes refusal absolutely. The searcher sees no reward signal from inside the refusal region, so it does not approach the boundary. This matches how real deployments evaluate prompts: a prompt that refuses 1-in-10 is not "90% as good," it is unshippable.

Prompt caching on the judge system prompt. The judge's ~170-line system prompt is sized past the cacheable-prefix minimum. A typical calibration run issues hundreds of judge calls; cache hits dominate cost. Every run surfaces cache_read_input_tokens so silent invalidators (e.g. a non-deterministic byte sneaking into the system prompt) fail loud. The full judge prompt is worth reading as a case study in prompt-cache-aware design.
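The rough shape of a cache-aware judge call, assuming an Anthropic client and hypothetical JUDGE_SYSTEM_PROMPT / judge_payload strings; see the Anthropic prompt-caching docs for the authoritative form:

```python
from anthropic import Anthropic

judge_client = Anthropic()
JUDGE_SYSTEM_PROMPT = "<long, byte-stable judge rubric prompt goes here>"
judge_payload = "<target response plus rubric context goes here>"

response = judge_client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": JUDGE_SYSTEM_PROMPT,                 # the cacheable prefix
            "cache_control": {"type": "ephemeral"},      # cache breakpoint placed here
        }
    ],
    messages=[{"role": "user", "content": judge_payload}],
)
# Watch this across runs: zero means the prefix drifted and the cache was invalidated.
print(response.usage.cache_read_input_tokens)
```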

effort and thinking_enabled are first-class axes, not globals. Some prompts only need low effort; forcing high across the board wastes tokens without improving fitness. Letting the calibration surface this is the entire point — if effort_idx has high stress, it matters for your task; if it has low stress, lock it at the neutral and cut tokens.

Target and judge clients are separate parameters. Passing the same Anthropic() instance to both is the common case. Splitting them means you can swap judge models (the judge can be stronger than the target — often a better quality/cost tradeoff), mock each side independently in tests, or route judge calls through a different workspace.

Every path is normalized to forward slashes before going into a prompt. src\foo.py and src/foo.py are different bytes in the API payload, and the prompt cache only hits when the bytes match exactly. Windows users would otherwise silently lose cache hits.
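One way to do this (illustrative, not necessarily the package's helper):

```python
from pathlib import PureWindowsPath

def normalize(path: str) -> str:
    """src\\foo.py and src/foo.py both become src/foo.py -- identical bytes either way."""
    return PureWindowsPath(path).as_posix()

assert normalize(r"src\foo.py") == "src/foo.py"
assert normalize("src/foo.py") == "src/foo.py"
```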

No temperature / top_p axis. Modern Claude pinned models remove these. Rather than pretend to support them and fail on the server, omegaprompt excludes them from the default space. If you need them for an older model, override the axes in PromptSpace.


Cost & performance

Per evaluate() call: 2 × (dataset_size) API calls — one target, one judge per item. Typical 10-item dataset = 20 API calls per candidate.

| Scenario | Target calls | Judge calls | Est. cost (cached judge) |
|---|---|---|---|
| Single candidate, 10-item dataset | 10 | 10 | ~$0.05-0.10 |
| Grid search, 5^3 = 125 candidates, 10-item dataset | 1250 | 1250 | ~$6-12 |
| run_p1 with walk-forward (train + test) | 2× the above | 2× the above | ~$12-24 |

Actual cost depends on target and judge model tiers. Use claude-haiku-4-5 as the judge during prompt iteration to bring this down by 4-5×; promote to a stronger judge only for the final shipped-calibration run.
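The same arithmetic as a tiny estimator (not a function the package exports; costs are omitted because they depend on model tier and cache hit rate):

```python
def api_calls(dataset_size: int, n_candidates: int, walk_forward: bool = True) -> int:
    per_candidate = 2 * dataset_size             # one target + one judge call per item
    slices = 2 if walk_forward else 1            # train slice, then held-out test slice
    return per_candidate * n_candidates * slices

print(api_calls(10, 1, walk_forward=False))      # 20   -- single candidate
print(api_calls(10, 5 ** 3, walk_forward=False)) # 2500 -- full 5^3 grid, train only
print(api_calls(10, 5 ** 3))                     # 5000 -- train + test
```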

Every CLI invocation prints aggregate token usage at the end. If cache_read_input_tokens is zero across consecutive runs, something in the judge prompt drifted — the CLI surfaces this explicitly rather than silently absorbing the cost.


Validation

50 tests, 0 network calls. Both the target client and judge client are accepted via a structural Protocol, so every API test mocks with SimpleNamespace or MagicMock. The test surface asserts the exact shape of each request payload (model, thinking config, cache_control placement, few-shot ordering) without negotiating with a real server.
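A sketch of that test style: a SimpleNamespace stand-in that satisfies the structural Protocol and captures the exact payload sent. In the real suite an object like this is handed to call_target / PromptTarget in place of a live Anthropic client; the payload fields asserted here are illustrative:

```python
from types import SimpleNamespace

def make_fake_client(captured: dict) -> SimpleNamespace:
    def create(**kwargs):
        captured.update(kwargs)                  # record the exact request payload
        return SimpleNamespace(
            content=[SimpleNamespace(type="text", text="ok")],
            usage=SimpleNamespace(input_tokens=10, output_tokens=5,
                                  cache_read_input_tokens=0),
        )
    return SimpleNamespace(messages=SimpleNamespace(create=create))

def test_payload_shape():
    captured: dict = {}
    client = make_fake_client(captured)
    client.messages.create(model="claude-haiku-4-5", max_tokens=1024, messages=[])
    assert captured["model"] == "claude-haiku-4-5"   # assert on the request, no network
```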

| Module | Coverage |
|---|---|
| schema.py | ParamVariants / PromptSpace / CalibrationOutcome — required fields, range validation, JSON roundtrip. |
| dataset.py | JSONL loader — schema validation, duplicate id detection, blank-line tolerance, missing-file error. |
| judge.py | Dimension / HardGate / JudgeRubric / JudgeResult — scale validation, normalized weights, clamping out-of-scale scores, gate aggregation. |
| fitness.py | CompositeFitness — empty batch, all-pass, partial-fail, all-fail, per-item preservation. |
| api.py | call_target / call_judge — payload shape, thinking on/off, few-shot ordering, refusal branch, dict coercion, reference handling. |
| target.py | PromptTarget — end-to-end with mocked clients, default resolution, parameter clamping, usage accumulation. |
| cli.py | Help / version / subcommand wiring. |

Run with uv run pytest -q. Typical wall time: under one second.


The 3-layer stack

omegaprompt does not stand alone. It is the applied layer in a three-project system:

       ┌─────────────────────────────────────────────┐
LAYER  │  omegaprompt  (this repo)                   │  "Apply the discipline to prompts"
APPLY  │  v0.1.0 — Claude API prompt calibration     │
       └────────────────────┬────────────────────────┘
                            │ depends on
                            ▼
       ┌─────────────────────────────────────────────┐
LAYER  │  omega-lock                                 │  "The calibration framework"
CORE   │  v0.1.4 — stress + grid + walk-forward + KC │
       └────────────────────┬────────────────────────┘
                            │ validated by
                            ▼
       ┌─────────────────────────────────────────────┐
LAYER  │  Antemortem + antemortem-cli                │  "The discipline around the build"
META   │  methodology + tooling for pre-impl recon   │
       └─────────────────────────────────────────────┘
  • omega-lock supplies the calibration engine: stress measurement, top-K unlock, grid search, walk-forward, kill criteria, benchmark scorecards. omegaprompt uses it as a library.
  • Antemortem + antemortem-cli — the pre-implementation reconnaissance discipline under which both omega-lock and omegaprompt were built. Antemortem catches ghost traps before code is written; omega-lock catches overfit parameters before they ship; omegaprompt catches overfit prompts before they deploy. The pattern repeats at three scales: spec, parameters, prompts.

The layering matters for credibility. The calibration engine was shipped and validated (omega-lock 0.1.4 on PyPI, 176 tests) before this prompt adapter was written. The adapter is ~200 lines because everything it needs already exists.


Relation to adjacent tools

| Tool | What it does | What omegaprompt adds |
|---|---|---|
| promptfoo | Run prompts against test cases, compare outputs, assertion-based grading | Pre-declared walk-forward gate (KC-4) so training ≠ ship criterion. Hard gates that collapse fitness, not softly penalize. Stress-based axis selection. |
| DSPy | Prompt optimization via program abstraction + bootstrapped few-shot | Domain-agnostic adapter (any CalibrableTarget works). Calibration-first framing (stress + grid + walk-forward), not program synthesis. Composable with DSPy — DSPy output is just another system_prompt_variant. |
| Optuna / Ray Tune (on prompts) | General HPO over prompt knobs | Walk-forward validation + pre-declared kill criteria out of the box. LLM-as-judge with schema-enforced responses. Composite hard_gate × soft_score fitness built in, not reinvented per project. |
| Hand-rolled "eval suite" | Custom per-project scripts that call the model, score, rank | Structured data contract (Dataset, Rubric, Outcome), machine-readable artifact, reproducibility, and plug-and-play into an existing calibration engine that already has a 30-run reference benchmark. |

The USP is discipline, not search. omega-lock handles the search (and handles it for any CalibrableTarget); omegaprompt contributes the prompt-specific adapter and the hard-gates-first fitness shape.


Status

v0.1.0 is alpha. The data contract (Dataset, JudgeRubric, ParamVariants, PromptSpace, CalibrationOutcome) is stable. The CLI contract (omegaprompt calibrate, its flags, exit codes) is stable. The judge prompt will iterate as we accumulate scoring-quality data from real runs — expect v0.1.x bumps for judge prompt revisions, tracked in CHANGELOG under "Judge prompt revisions".

Semver applies strictly from v1.0.

Full changelog: CHANGELOG.md.


Roadmap

v0.1.x (judge prompt iteration track)

  • Dogfood against diverse task types (code generation, reasoning, extraction, classification). Record scoring drift.
  • Reference scoring-quality benchmark so judge prompt revisions are measured, not guessed.
  • Additional hard gate evaluators (format predicates, safety classifiers) callable without a judge round-trip.

v0.2 (tooling depth)

  • omegaprompt report <outcome.json> — human-readable debrief renderer.
  • Multi-judge validation pattern: judge_v1 + judge_v2 over top-K, disagreement = trust signal.
  • --dry-run with cost estimate before launching a calibration run.
  • Second cache_control breakpoint on the rubric for iterative same-rubric runs.

v0.3 (ecosystem)

  • Benchmark harness: multiple (task × rubric × seed) combinations, RAGAS-style scorecard like omega-lock's.
  • GitHub Action for CI gating — runs a calibration on PR, blocks merge on KC-4 fail.

Explicitly out of scope: web dashboard, proprietary hosting, multi-user tenancy. omegaprompt is a local developer tool; keep it local.


Contributing

The most valuable contributions are published calibration outcomes — a dataset, a rubric, and the resulting CalibrationOutcome.json across methods. They make the judge prompt evidence-based.

Issues and PRs welcome. For non-trivial changes, run an antemortem first with antemortem-cli — we dogfood the discipline that built this framework.


Citing

omegaprompt v0.1.0 — calibration discipline for Claude API prompts.
https://github.com/hibou04-ops/omegaprompt, 2026.

Parent framework:

omega-lock v0.1.4 — sensitivity-driven coordinate descent calibration framework.
https://github.com/hibou04-ops/omega-lock, 2026.

Methodology (how this and its siblings were built):

Antemortem v0.1.1 — AI-assisted pre-implementation reconnaissance for software changes.
https://github.com/hibou04-ops/Antemortem, 2026.

License

MIT. See LICENSE.

Colophon

Designed, implemented, and shipped solo. Adapter layer over omega-lock; zero calibration-engine reimplementation. 50 tests, 0 live API calls in CI. The tool is built with the pre-implementation reconnaissance discipline it supports for its callers.

