Measure chain-determinism (per-trace structural consistency across replays) for any OpenAI-compatible LLM endpoint.

chain-determinism-harness

Measure chain-determinism for any OpenAI-compatible LLM endpoint in under 5 minutes for under $2.

Companion harness for the chain-determinism-bench-v1 dataset.

Chain-determinism = the property that N replays of the same query against the same vendor at T=0 produce identical tool-call sequences. Across 9 frontier vendors at T=0 we observe chain-divergence rates from 17.5% (Anthropic Claude Sonnet 4.5) to 97.1% (Mistral Large 2411). This harness lets you measure your own.

Install

pip install chain-determinism-harness

Quickstart (5 minutes, ~$0.50–$2)

export OPENROUTER_API_KEY=sk-or-...        # or OPENAI_API_KEY=sk-...

python -m chain_determinism_harness eval \
    --model qwen/qwen-2.5-72b-instruct \
    --n-queries 5 --n-replays 10

Sample output:

==> chain-determinism-harness eval
    model:       qwen/qwen-2.5-72b-instruct
    queries:     5  (out of 10 embedded)
    replays:     10 per query
    total calls: 50
    ...

  [1/5] heldout::heldout_xenobiology_taxonomy_00::author_productivity::endolithic_autotroph_order
      ok=10/10  unique_sequences=4

  ...

Chain-divergence rate for qwen/qwen-2.5-72b-instruct:
  60.0%  [Wilson 95% CI 23.1%, 88.2%]

  Diverged queries:    3 / 5
  Total replays:       50
  Error replays:       0

What it measures

For each (query, vendor) cell, the harness runs N replays of the same prompt. The agent makes tool calls against a stub Σ-registry of 21 primitives, and the harness records the tool-call sequence for each replay. A query counts as diverged if its sequences are not byte-identical across all N replays. The chain-divergence rate is the fraction of diverged queries, reported with a Wilson 95% CI computed at the per-query level (the sample size is the number of queries, not the number of replays).
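The measurement described above can be sketched in a few lines. This is a minimal illustration, not the harness's own code: the function names `query_diverged` and `wilson_interval`, the toy tool names, and the tuple representation of a sequence are all assumptions made for the example.

```python
import math


def query_diverged(sequences):
    """Return True if the replays of one query did not all yield the same sequence."""
    return len(set(sequences)) > 1


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k 'diverged' outcomes out of n queries."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Toy data: 5 queries with 3 replays each (tool names are made up).
stable = ("search", "final_answer")
replays_by_query = {
    "q1": [stable, stable, stable],
    "q2": [stable, ("lookup", "final_answer"), stable],
    "q3": [stable, stable, ("search", "search", "final_answer")],
    "q4": [stable, stable, stable],
    "q5": [("lookup", "final_answer"), stable, stable],
}

n_queries = len(replays_by_query)
n_diverged = sum(query_diverged(seqs) for seqs in replays_by_query.values())
low, high = wilson_interval(n_diverged, n_queries)
print(f"{n_diverged / n_queries:.1%}  [Wilson 95% CI {low:.1%}, {high:.1%}]")
```

With 3 of 5 queries diverged, this gives the same 60.0% rate and the same interval (23.1%, 88.2%) as the sample output above, which is consistent with the interval being computed over queries rather than over individual replays.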

Hashes are byte-identical to those in the chain-determinism-bench-v1 dataset (see chain_determinism_harness/metrics.py::seq_full).
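To give a feel for what a byte-identical comparison over hashes involves, here is a hedged sketch of canonicalizing and hashing a tool-call sequence. It is not the implementation of `seq_full`: the function name `canonical_sequence_hash`, the `name`/`arguments` field layout, and the choice of SHA-256 are assumptions for illustration. Only the use of `sort_keys=True` is taken from this README.

```python
import hashlib
import json


def canonical_sequence_hash(tool_calls):
    """Hash a tool-call sequence after canonical JSON serialization.

    Hypothetical stand-in for the harness's seq_full; the real function
    may select different fields or use a different digest.
    """
    canonical = json.dumps(
        [{"name": c["name"], "arguments": c["arguments"]} for c in tool_calls],
        sort_keys=True,          # key order never affects the hash
        separators=(",", ":"),   # no incidental whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same call with keys reordered: should hash identically.
a = [{"name": "search", "arguments": {"q": "x", "limit": 5}}]
b = [{"name": "search", "arguments": {"limit": 5, "q": "x"}}]
# A genuinely different argument: should hash differently.
c = [{"name": "search", "arguments": {"q": "y", "limit": 5}}]

print(canonical_sequence_hash(a) == canonical_sequence_hash(b))
print(canonical_sequence_hash(a) == canonical_sequence_hash(c))
```

Canonicalizing before hashing means that only substantive differences (which tools were called, in what order, with what arguments) register as divergence.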

Stub-environment scope

This harness uses stub tool responses — every non-final_answer tool call returns a canned response. This isolates exploration-strategy non-determinism from data-path non-determinism.

The full-execution scope (tool calls executed against a real Σ-registry KG) is not equivalent. Stub-environment chain-divergence will typically be higher than full-execution chain-divergence: the agent gets less semantically informative feedback, so exploration strategies vary more across replays. Treat the harness as a conservative screening test: if a vendor passes under stub conditions, it likely passes under full execution; if a vendor fails under stub conditions, full execution may yet improve it.
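The stub behaviour can be pictured with a small sketch. Everything here is illustrative: `CANNED_RESPONSE`, `stub_tool`, and `run_episode` are hypothetical names, and the canned payload is invented. The harness's actual stub responses may look different.

```python
CANNED_RESPONSE = {"status": "ok", "results": []}  # hypothetical payload


def stub_tool(name, arguments):
    """Return the same canned response for every tool except final_answer."""
    if name == "final_answer":
        return None  # the episode ends; nothing is fed back to the agent
    return dict(CANNED_RESPONSE)  # arguments are deliberately ignored


def run_episode(planned_calls):
    """Replay a list of (tool_name, arguments) pairs and record the tool-name trace."""
    trace = []
    for name, arguments in planned_calls:
        trace.append(name)
        if stub_tool(name, arguments) is None:
            break  # final_answer terminates the episode
    return tuple(trace)


trace = run_episode([
    ("search", {"q": "endolithic"}),
    ("final_answer", {"text": "done"}),
    ("search", {"q": "never reached"}),
])
print(trace)
```

Because the response does not depend on the arguments, any variation in the recorded trace comes from the model's own choices rather than from differences in the data it was shown.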

Endpoints supported

Anything OpenAI-compatible:

| Provider | Setup |
| --- | --- |
| OpenAI direct | OPENAI_API_KEY=sk-... |
| OpenRouter | OPENROUTER_API_KEY=sk-or-... (default base_url auto-detected) |
| Together AI | --base-url https://api.together.xyz/v1 + Together API key in OPENAI_API_KEY |
| vLLM (local) | --base-url http://localhost:8000/v1 |
| Modal vLLM | --base-url https://your-app.modal.run/v1 + your Modal token |
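One plausible way the base-URL auto-detection could work is sketched below. This is a guess at the general shape, not the harness's actual logic: `resolve_endpoint` is a hypothetical helper, and the precedence between the two environment variables is an assumption. The two default URLs are the public OpenAI and OpenRouter API bases.

```python
import os

OPENAI_BASE_URL = "https://api.openai.com/v1"
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def resolve_endpoint(base_url=None, env=None):
    """Pick (base_url, api_key) from an explicit URL or from environment variables.

    Assumed precedence: explicit --base-url, then OPENROUTER_API_KEY,
    then OPENAI_API_KEY. The real harness may order these differently.
    """
    env = os.environ if env is None else env
    if base_url:
        # Explicit URL (Together, vLLM, Modal, ...): the key may be absent locally.
        return base_url, env.get("OPENAI_API_KEY") or env.get("OPENROUTER_API_KEY")
    if env.get("OPENROUTER_API_KEY"):
        return OPENROUTER_BASE_URL, env["OPENROUTER_API_KEY"]
    if env.get("OPENAI_API_KEY"):
        return OPENAI_BASE_URL, env["OPENAI_API_KEY"]
    raise RuntimeError("Set OPENROUTER_API_KEY or OPENAI_API_KEY, or pass a base URL")


# Passing env explicitly keeps the example independent of the real environment.
print(resolve_endpoint(env={"OPENROUTER_API_KEY": "sk-or-example"}))
print(resolve_endpoint("http://localhost:8000/v1", env={}))
```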

CLI flags

chain-determinism-harness eval --model X [options]

  --model MODEL              Model identifier (required)
  --n-queries N              Held-out queries to use (max 10; default 5)
  --n-replays K              Replays per query (default 10)
  --temperature T            Sampling temperature (default 0.0)
  --concurrency C            Max concurrent replays per query (default 5)
  --base-url URL             OpenAI-compatible API base URL (auto-detected if unset)
  --timeout SEC              Per-request timeout (default 60)
  --out PATH                 Write summary JSON
  --runs-out PATH            Write per-replay JSONL (downstream analysis)

Cost guidance (rough)

| Configuration | Total calls | Approx. cost |
| --- | --- | --- |
| --n-queries 5 --n-replays 10 | 50 | $0.10–$2 (depends on vendor) |
| --n-queries 10 --n-replays 20 | 200 | $0.50–$10 |
| --n-queries 10 --n-replays 50 (extended per-cell N) | 500 | $1–$25 |

Most frontier vendors on OpenRouter or direct API run well under a dollar for the default 5×10 = 50 calls. Mistral Large 2411 on OpenRouter (the most expensive measurable vendor we observed) is roughly $2 for 50 calls.
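For budgeting a custom configuration, the arithmetic is simply queries times replays, scaled by a per-call price. The sketch below is a rough linear extrapolation: `total_calls` and `cost_range` are hypothetical helpers, and the per-call bounds are derived from the $0.10–$2 figure for 50 calls. The published ranges for larger runs are rounded estimates, so this will not reproduce them exactly.

```python
def total_calls(n_queries, n_replays):
    """Each query is replayed n_replays times, one API conversation per replay."""
    return n_queries * n_replays


def cost_range(n_queries, n_replays, low_per_call, high_per_call):
    """Linear cost estimate in dollars; real cost depends on vendor and token counts."""
    calls = total_calls(n_queries, n_replays)
    return calls * low_per_call, calls * high_per_call


# Per-call bounds implied by the default run: $0.10–$2 for 50 calls (assumption).
LOW_PER_CALL = 0.10 / 50
HIGH_PER_CALL = 2.00 / 50

for n_queries, n_replays in [(5, 10), (10, 20), (10, 50)]:
    low, high = cost_range(n_queries, n_replays, LOW_PER_CALL, HIGH_PER_CALL)
    calls = total_calls(n_queries, n_replays)
    print(f"{n_queries} x {n_replays} = {calls} calls: ~${low:.2f} to ~${high:.2f}")
```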

License

MIT (this harness). The companion dataset is CC-BY-4.0; vendor responses are not redistributed (hash-only).

Caveats

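  • Vendor model versions are mutable. Today's gpt-5.4 is not 2027's gpt-5.4. Pin model snapshots where the API supports it, for example by using dated model IDs (offered by both OpenAI and Anthropic) instead of floating aliases. Note that Anthropic's anthropic-version header pins the API version, not the model snapshot.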
  • OpenRouter routing variance. OpenRouter load-balances across backend providers. Different runs may route to different providers; this is a confound for chain-divergence measurement.
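  • Gameability. A model can lower its chain-divergence through superficial template-locking, so a low score alone does not demonstrate robust behavior. Purely cosmetic differences such as whitespace and key ordering are already normalized away (serialization uses sort_keys=True). The metric is not adversarially robust as a certification metric in its present form.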
