Measure chain-determinism (per-trace structural consistency across replays) for any OpenAI-compatible LLM endpoint.
chain-determinism-harness
Measure chain-determinism for any OpenAI-compatible LLM endpoint in under 5 minutes for under $2.
Companion harness for the chain-determinism-bench-v1 dataset.
Chain-determinism = the property that N replays of the same query against the same vendor at T=0 produce identical tool-call sequences. Across 9 frontier vendors at T=0 we observe chain-divergence rates from 17.5% (Anthropic Claude Sonnet 4.5) to 97.1% (Mistral Large 2411). This harness lets you measure your own.
Install
pip install chain-determinism-harness
Quickstart (5 minutes, ~$0.50–$2)
export OPENROUTER_API_KEY=sk-or-... # or OPENAI_API_KEY=sk-...
python -m chain_determinism_harness eval \
--model qwen/qwen-2.5-72b-instruct \
--n-queries 5 --n-replays 10
Sample output:
==> chain-determinism-harness eval
model: qwen/qwen-2.5-72b-instruct
queries: 5 (out of 10 embedded)
replays: 10 per query
total calls: 50
...
[1/5] heldout::heldout_xenobiology_taxonomy_00::author_productivity::endolithic_autotroph_order
ok=10/10 unique_sequences=4
...
Chain-divergence rate for qwen/qwen-2.5-72b-instruct:
60.0% [Wilson 95% CI 23.1%, 88.2%]
Diverged queries: 3 / 5
Total replays: 50
Error replays: 0
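The interval in the sample output can be reproduced by hand. The sketch below is a standard Wilson score interval, not code taken from the harness; the function name `wilson_interval` and the choice of z = 1.96 are assumptions made for illustration. It treats each query as one Bernoulli trial (diverged or not), which matches the 3 / 5 figure shown above.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# 3 diverged queries out of 5, as in the sample output.
low, high = wilson_interval(3, 5)
print(f"{3 / 5:.1%} [Wilson 95% CI {low:.1%}, {high:.1%}]")
# prints: 60.0% [Wilson 95% CI 23.1%, 88.2%]
```

With only 5 queries the interval spans roughly 65 percentage points, so the default run is best read as a coarse screen; raising `--n-queries` to 10 narrows it.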
What it measures
For each (query, vendor) cell, the harness runs N replays of the same prompt. The agent makes tool calls against a stub Σ-registry of 21 primitives, and the harness records the tool-call sequence for each replay. A query counts as chain-divergent if its sequences are not byte-identical across all N replays. The reported chain-divergence rate is the fraction of divergent queries, with a Wilson 95% CI computed over queries (not over individual replays).
Hashes are byte-identical to those in the chain-determinism-bench-v1 dataset (see chain_determinism_harness/metrics.py::seq_full).
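As a rough illustration of the comparison described above, the following sketch hashes each replay's tool-call sequence and flags a query as divergent when more than one distinct hash appears. It is not the harness's `seq_full` implementation and will not reproduce the dataset's hashes. The helper names, the tool name `lookup`, the record shape (`name` / `arguments`), and the SHA-256 over compact JSON are all assumptions.

```python
import hashlib
import json


def sequence_hash(tool_calls: list) -> str:
    """Hash one replay's ordered tool-call sequence (hypothetical helper)."""
    # sort_keys=True makes dict key order irrelevant; compact separators
    # remove incidental whitespace from the serialized form.
    canonical = json.dumps(tool_calls, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def query_diverged(replays: list) -> bool:
    """True if the replays of one query do not all share the same hash."""
    return len({sequence_hash(r) for r in replays}) > 1


def divergence_rate(per_query_replays: list) -> float:
    """Fraction of queries whose replays diverged."""
    flags = [query_diverged(replays) for replays in per_query_replays]
    return sum(flags) / len(flags)


# Two replays that differ only in key order hash identically ...
replay_a = [{"name": "lookup", "arguments": {"id": 1, "field": "order"}}]
replay_b = [{"arguments": {"field": "order", "id": 1}, "name": "lookup"}]
# ... while a changed argument value counts as a different chain.
replay_c = [{"name": "lookup", "arguments": {"id": 2, "field": "order"}}]

print(query_diverged([replay_a, replay_b]))  # False
print(query_diverged([replay_a, replay_c]))  # True
```

Comparing hashes rather than raw responses is also what lets the companion dataset stay hash-only.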
Stub-environment scope
This harness uses stub tool responses — every non-final_answer tool call returns a canned response. This isolates exploration-strategy non-determinism from data-path non-determinism.
The full-execution scope (tool calls executed against a real Σ-registry KG) is not equivalent. Stub-environment chain-divergence will typically be higher than full-execution chain-divergence, because the agent gets less semantically informative feedback and its exploration strategy varies more across replays. Treat the harness as a conservative test, a sufficient rather than necessary condition: if a vendor passes under stub conditions, it likely passes under full execution; if a vendor fails under stub conditions, full execution may yet improve it.
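To make the stub behaviour concrete, here is a minimal sketch of a stub dispatcher. The canned payload, the function names, and the tool name `lookup` are hypothetical; only the rule that every tool except `final_answer` gets a canned response comes from the description above.

```python
from typing import Optional

# Assumed canned payload; the harness's actual stub responses may differ.
CANNED_RESPONSE = {"status": "ok", "results": []}


def stub_tool_call(name: str, arguments: dict) -> Optional[dict]:
    """Return a canned response for any tool except final_answer."""
    if name == "final_answer":
        return None  # no tool response: the chain ends here
    return dict(CANNED_RESPONSE)  # copy so callers cannot mutate the template


def run_stubbed_chain(planned_calls: list) -> list:
    """Record a scripted list of (name, arguments) calls until final_answer."""
    recorded = []
    for name, arguments in planned_calls:
        recorded.append({"name": name, "arguments": arguments})
        if stub_tool_call(name, arguments) is None:
            break
    return recorded


chain = run_stubbed_chain([
    ("lookup", {"id": 1}),
    ("final_answer", {"answer": "done"}),
    ("lookup", {"id": 2}),  # never reached: final_answer ends the chain
])
print([call["name"] for call in chain])  # ['lookup', 'final_answer']
```

Because the stub returns the same payload regardless of arguments, any variation in the recorded chain comes from the model's own choices rather than from the data it reads.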
Endpoints supported
Anything OpenAI-compatible:
| Provider | Setup |
|---|---|
| OpenAI direct | OPENAI_API_KEY=sk-... |
| OpenRouter | OPENROUTER_API_KEY=sk-or-... (default base_url auto-detected) |
| Together AI | --base-url https://api.together.xyz/v1 + Together API key in OPENAI_API_KEY |
| vLLM (local) | --base-url http://localhost:8000/v1 |
| Modal vLLM | --base-url https://your-app.modal.run/v1 + your Modal token |
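One plausible way to implement the auto-detection implied by the table is sketched below. The precedence order (explicit `--base-url`, then `OPENROUTER_API_KEY`, then `OPENAI_API_KEY`) and the function name `resolve_endpoint` are assumptions; the harness's actual logic may differ. The two default URLs are the public OpenRouter and OpenAI API bases.

```python
import os
from typing import Mapping, Optional

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"
OPENAI_BASE_URL = "https://api.openai.com/v1"


def resolve_endpoint(base_url: Optional[str] = None,
                     env: Optional[Mapping] = None) -> tuple:
    """Return (base_url, api_key) under an assumed precedence order."""
    env = os.environ if env is None else env
    if base_url:
        # Explicit URL (Together, vLLM, Modal): key comes from OPENAI_API_KEY,
        # and may be empty for a local server that does not check it.
        return base_url, env.get("OPENAI_API_KEY", "")
    if env.get("OPENROUTER_API_KEY"):
        return OPENROUTER_BASE_URL, env["OPENROUTER_API_KEY"]
    if env.get("OPENAI_API_KEY"):
        return OPENAI_BASE_URL, env["OPENAI_API_KEY"]
    raise RuntimeError("Set OPENROUTER_API_KEY or OPENAI_API_KEY, or pass a base URL")


# Demonstration with an explicit environment mapping instead of os.environ.
print(resolve_endpoint(env={"OPENROUTER_API_KEY": "sk-or-example"})[0])
print(resolve_endpoint("http://localhost:8000/v1", env={})[0])
```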
CLI flags
chain-determinism-harness eval --model X [options]
--model MODEL Model identifier (required)
--n-queries N Held-out queries to use (max 10; default 5)
--n-replays K Replays per query (default 10)
--temperature T Sampling temperature (default 0.0)
--concurrency C Max concurrent replays per query (default 5)
--base-url URL OpenAI-compatible API base URL (auto-detected if unset)
--timeout SEC Per-request timeout (default 60)
--out PATH Write summary JSON
--runs-out PATH Write per-replay JSONL (downstream analysis)
Cost guidance (rough)
| Configuration | Total calls | Approx. cost |
|---|---|---|
| --n-queries 5 --n-replays 10 | 50 | $0.10–$2 (depends on vendor) |
| --n-queries 10 --n-replays 20 | 200 | $0.50–$10 |
| --n-queries 10 --n-replays 50 (extended per-cell N) | 500 | $1–$25 |
Most frontier vendors on OpenRouter or direct API run well under a dollar for the default 5×10 = 50 calls. Mistral Large 2411 on OpenRouter (the most expensive measurable vendor we observed) is roughly $2 for 50 calls.
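For budgeting other configurations, the arithmetic is simply queries times replays times a per-call price. The sketch below is illustrative only: the per-call bounds of $0.002 and $0.04 are back-derived from the 50-call row above and are assumptions, not published prices, and `estimate_run` is a hypothetical helper.

```python
def estimate_run(n_queries: int, n_replays: int,
                 per_call_low_usd: float, per_call_high_usd: float) -> tuple:
    """Return (total_calls, low_cost_usd, high_cost_usd) for one eval run."""
    total_calls = n_queries * n_replays
    return (total_calls,
            total_calls * per_call_low_usd,
            total_calls * per_call_high_usd)


# Default configuration with per-call bounds assumed from the 50-call row.
calls, low, high = estimate_run(5, 10, 0.002, 0.04)
print(f"{calls} calls, roughly ${low:.2f} to ${high:.2f}")
# prints: 50 calls, roughly $0.10 to $2.00
```

Actual spend depends on how many turns each replay takes and on the vendor's token pricing, so treat the result as an order-of-magnitude guide.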
License
MIT (this harness). The companion dataset is CC-BY-4.0; vendor responses are not redistributed (hash-only).
Caveats
- Vendor model versions are mutable. Today's `gpt-5.4` is not 2027's `gpt-5.4`. Pin model snapshots where the API supports it (OpenAI dated IDs; Anthropic `anthropic-version` headers).
- OpenRouter routing variance. OpenRouter load-balances across backend providers. Different runs may route to different providers; this is a confound for chain-divergence measurement.
- Gameability. Chain-divergence rewards superficial template-locking (whitespace, key ordering; already enforced by `sort_keys=True`). It is not adversarially robust as a certification metric in its present form.
File details
Details for the file chain_determinism_harness-0.1.5.tar.gz.
File metadata
- Download URL: chain_determinism_harness-0.1.5.tar.gz
- Upload date:
- Size: 37.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1c296eceed60a84003dc10ca98f4d5847aacc829a73a524cbe8a618673ad7648 |
| MD5 | 61c9372a273951ddab3c81ab318a1228 |
| BLAKE2b-256 | 882913a7f31d76de6f3259d0a8b6006abae86e936f5ba4e68b019ee90c1ecdef |
File details
Details for the file chain_determinism_harness-0.1.5-py3-none-any.whl.
File metadata
- Download URL: chain_determinism_harness-0.1.5-py3-none-any.whl
- Upload date:
- Size: 33.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6688f8b688b0abebff128f24d089c32456af8d8763fd720e80ccc1dcbe251a85 |
| MD5 | f375f95972919c4160a80decb0656686 |
| BLAKE2b-256 | b941f27f2e761027c6414dc7b877dc2c7b4f05ed2743ccc5a1f61d2ca66269ba |