Measure chain-determinism (per-trace structural consistency across replays) for any OpenAI-compatible LLM endpoint.

chain-determinism-harness

Measure chain-determinism for any OpenAI-compatible LLM endpoint in under 5 minutes for under $2.

Companion harness for the chain-determinism-bench-v1 dataset.

Chain-determinism = the property that N replays of the same query against the same vendor at T=0 produce identical tool-call sequences. Across 9 frontier vendors at T=0 we observe chain-divergence rates from 17.5% (Anthropic Claude Sonnet 4.5) to 97.1% (Mistral Large 2411). This harness lets you measure your own.

Install

pip install chain-determinism-harness

Quickstart (5 minutes, ~$0.50–$2)

export OPENROUTER_API_KEY=sk-or-...        # or OPENAI_API_KEY=sk-...

python -m chain_determinism_harness eval \
    --model qwen/qwen-2.5-72b-instruct \
    --n-queries 5 --n-replays 10

Sample output:

==> chain-determinism-harness eval
    model:       qwen/qwen-2.5-72b-instruct
    queries:     5  (out of 10 embedded)
    replays:     10 per query
    total calls: 50
    ...

  [1/5] heldout::heldout_xenobiology_taxonomy_00::author_productivity::endolithic_autotroph_order
      ok=10/10  unique_sequences=4

  ...

Chain-divergence rate for qwen/qwen-2.5-72b-instruct:
  60.0%  [Wilson 95% CI 23.1%, 88.2%]

  Diverged queries:    3 / 5
  Total replays:       50
  Error replays:       0

What it measures

For each (query, vendor) cell, the harness runs N replays of the same prompt. The agent makes tool calls against a stub Σ-registry of 21 primitives, the same registry the paper measures. The harness records the tool-call sequence for each replay, and chain-divergence is True for that query if the sequences are not byte-identical across all N replays. The reported chain-divergence rate is the fraction of queries that diverged, with a Wilson 95% CI computed at the per-query level.
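The interval in the sample output can be reproduced with the standard Wilson score formula. The following is a minimal sketch, not the harness's own code; the function name `wilson_interval` is an assumption made for illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.959964 for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# 3 diverged queries out of 5, as in the sample output above.
low, high = wilson_interval(3, 5)
print(f"{3 / 5:.1%}  [Wilson 95% CI {low:.1%}, {high:.1%}]")
# prints: 60.0%  [Wilson 95% CI 23.1%, 88.2%]
```

With only 5 queries the interval is wide, which is why the larger configurations listed under cost guidance are worth running before drawing conclusions.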

Same operationalization as the paper: hashes are byte-identical to those in the chain-determinism-bench-v1 dataset (see chain_determinism_harness/metrics.py::seq_full).
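To make the byte-identical comparison concrete, here is a minimal sketch of one way to canonicalize and hash a tool-call sequence. It is not the harness's `seq_full` implementation: the use of SHA-256, the compact separators, the helper names, and the tool names in the example are all assumptions for illustration. Only `sort_keys=True` is taken from this README.

```python
import hashlib
import json


def sequence_hash(tool_calls: list) -> str:
    """Hash a tool-call sequence after canonical JSON serialization.

    sort_keys=True removes key-ordering differences; SHA-256 and the
    compact separators are illustrative assumptions.
    """
    canonical = json.dumps(tool_calls, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def unique_sequences(replays: list) -> int:
    """Number of distinct tool-call sequences across replays."""
    return len({sequence_hash(r) for r in replays})


def query_diverged(replays: list) -> bool:
    """True if the replays do not all share a single sequence hash."""
    return unique_sequences(replays) > 1


# Hypothetical tool names and arguments.
replay_a = [{"name": "lookup", "arguments": {"id": 1, "field": "order"}}]
replay_b = [{"name": "lookup", "arguments": {"field": "order", "id": 1}}]  # keys reordered only
replay_c = [{"name": "search", "arguments": {"q": "order"}}]

print(query_diverged([replay_a, replay_b]))  # same calls, different key order
print(query_diverged([replay_a, replay_c]))  # different tool call
```

Comparing hashes rather than raw sequences is what allows results to be checked against a hash-only dataset without redistributing vendor responses.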

Stub-environment scope

This harness uses stub tool responses — every non-final_answer tool call returns a canned response (matching the paper's §3.5 SWE-bench Lite scope). This isolates exploration-strategy non-determinism from data-path non-determinism. The full Phase 1a methodology in the paper (where tool calls are executed against a real Σ-registry KG) is implemented at scripts/paper5_multivendor_replay.py in the companion repository; this harness is the lighter-weight self-contained alternative for spot-checking new vendors.

The two scopes are not equivalent. Stub-environment chain-divergence will typically be higher than full-execution chain-divergence, because the agent gets less semantically informative feedback and its exploration strategies vary more across replays. Treat the harness as a conservative screen (a sufficient-condition test, not a necessary one): if a vendor passes under stub conditions, it likely passes under full execution; if a vendor fails under stub conditions, full execution may yet improve it.
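The stub behavior described above can be pictured with a small sketch. This is an assumption-laden illustration, not the harness's code: the function name, the payload fields, and the example tool names are hypothetical. The only property taken from this README is that every tool call other than `final_answer` receives a canned response.

```python
def stub_tool_response(tool_name: str, arguments: dict) -> dict:
    """Return a canned response for any tool call except final_answer.

    The payload shape here is hypothetical; the point is that the
    response does not depend on the arguments, so any divergence
    between replays comes from the model, not from the data path.
    """
    if tool_name == "final_answer":
        return {"terminal": True, "answer": arguments.get("answer")}
    return {"terminal": False, "content": "stub-ok"}


# Two different calls to a hypothetical tool get the same canned reply.
first = stub_tool_response("lookup", {"id": 1})
second = stub_tool_response("lookup", {"id": 2})
print(first == second)
```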

Endpoints supported

Anything OpenAI-compatible:

Provider	Setup
OpenAI direct	`OPENAI_API_KEY=sk-...`
OpenRouter	`OPENROUTER_API_KEY=sk-or-...` (default base_url auto-detected)
Together AI	`--base-url https://api.together.xyz/v1` + Together API key in `OPENAI_API_KEY`
vLLM (local)	`--base-url http://localhost:8000/v1`
Modal vLLM	`--base-url https://your-app.modal.run/v1` + your Modal token

CLI flags

chain-determinism-harness eval --model X [options]

  --model MODEL              Model identifier (required)
  --n-queries N              Held-out queries to use (max 10; default 5)
  --n-replays K              Replays per query (default 10)
  --temperature T            Sampling temperature (default 0.0)
  --concurrency C            Max concurrent replays per query (default 5)
  --base-url URL             OpenAI-compatible API base URL (auto-detected if unset)
  --timeout SEC              Per-request timeout (default 60)
  --out PATH                 Write summary JSON
  --runs-out PATH            Write per-replay JSONL (downstream analysis)

Cost guidance (rough)

Configuration	Total calls	Approx. cost
`--n-queries 5 --n-replays 10`	50	$0.10–$2 (depends on vendor)
`--n-queries 10 --n-replays 20`	200	$0.50–$10
`--n-queries 10 --n-replays 50` (paper's Phase 1a per-cell N)	500	$1–$25

Most frontier vendors on OpenRouter or direct API run well under a dollar for the default 5×10 = 50 calls. Mistral Large 2411 on OpenRouter (the paper's most expensive measurable vendor) is roughly $2 for 50 calls.
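The call counts in the table are simply queries times replays. A minimal sketch for budgeting a run follows. The per-call prices are placeholders you supply yourself; they are assumptions, not figures published by any vendor, and the helper names are illustrative.

```python
def total_calls(n_queries: int, n_replays: int) -> int:
    """Total calls for one eval run, as reported in the harness output."""
    return n_queries * n_replays


def cost_range(n_queries: int, n_replays: int,
               per_call_low: float, per_call_high: float) -> tuple[float, float]:
    """Linear cost estimate from an assumed per-call price range (USD)."""
    calls = total_calls(n_queries, n_replays)
    return calls * per_call_low, calls * per_call_high


# The three configurations from the table above.
for n_queries, n_replays in [(5, 10), (10, 20), (10, 50)]:
    print(n_queries, n_replays, total_calls(n_queries, n_replays))

# Placeholder per-call prices; substitute your vendor's actual pricing.
low, high = cost_range(5, 10, per_call_low=0.002, per_call_high=0.04)
print(f"${low:.2f} to ${high:.2f}")
```

The estimate is linear in the number of calls, so treat it as a rough bound in the same spirit as the table.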

License

MIT (this harness). The companion dataset is CC-BY-4.0; vendor responses are not redistributed (hash-only).

Caveats

Vendor model versions are mutable. Today's gpt-5.4 is not 2027's gpt-5.4. Pin model snapshots where the API supports it (dated model IDs on OpenAI and Anthropic; note that the anthropic-version header pins the API version, not the model snapshot).
OpenRouter routing variance. OpenRouter load-balances across backend providers. Different runs may route to different providers; this is a confound for chain-divergence measurement (see paper §3.4).
Gameability. A low chain-divergence rate can be achieved through superficial template-locking, such as fixed whitespace and key ordering (key ordering is already normalized by sort_keys=True). The metric is not adversarially robust as a certification metric in its present form.

Release history

0.1.5 (May 11, 2026)
0.1.2 (May 11, 2026)
0.1.1 (May 11, 2026), this version

Download files

Source Distribution

chain_determinism_harness-0.1.1.tar.gz (36.7 kB)

Uploaded May 11, 2026 Source

Built Distribution

chain_determinism_harness-0.1.1-py3-none-any.whl (32.2 kB)

Uploaded May 11, 2026 Python 3

File details

Details for the file chain_determinism_harness-0.1.1.tar.gz.

File metadata

Download URL: chain_determinism_harness-0.1.1.tar.gz
Upload date: May 11, 2026
Size: 36.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.15

File hashes

Hashes for chain_determinism_harness-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`aedbf253a551373f72040ce347686774f03eb12de4362e7146baec1a394b02a9`
MD5	`69e81a390aea566b15f2dbebdcb8e980`
BLAKE2b-256	`bf0222c32c757b315332877028a122a03d6896600c006bcdbadac063c4ad77c0`

File details

Details for the file chain_determinism_harness-0.1.1-py3-none-any.whl.

File metadata

Download URL: chain_determinism_harness-0.1.1-py3-none-any.whl
Upload date: May 11, 2026
Size: 32.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.15

File hashes

Hashes for chain_determinism_harness-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cc67ab1ada5cd3f16967d1ac01355482ae834a3db5648606944cbf0e060c8e2c`
MD5	`f3525ebf227accb6a3a7e3ce8a947045`
BLAKE2b-256	`770866a7358b0674c015eef0589e5f010b0c80eca64aee8a8dc443f41084f2ff`
