Skip to main content

Measure chain-determinism (per-trace structural consistency across replays) for any OpenAI-compatible LLM endpoint.

Project description

chain-determinism-harness

Measure chain-determinism for any OpenAI-compatible LLM endpoint in under 5 minutes for under $2.

Companion harness for the chain-determinism-bench-v1 dataset.

Chain-determinism = the property that N replays of the same query against the same vendor at T=0 produce identical tool-call sequences. Across 9 frontier vendors at T=0 we observe chain-divergence rates from 17.5% (Anthropic Claude Sonnet 4.5) to 97.1% (Mistral Large 2411). This harness lets you measure your own.

Install

pip install chain-determinism-harness

Quickstart (5 minutes, ~$0.50–$2)

export OPENROUTER_API_KEY=sk-or-...        # or OPENAI_API_KEY=sk-...

python -m chain_determinism_harness eval \
    --model qwen/qwen-2.5-72b-instruct \
    --n-queries 5 --n-replays 10

Sample output:

==> chain-determinism-harness eval
    model:       qwen/qwen-2.5-72b-instruct
    queries:     5  (out of 10 embedded)
    replays:     10 per query
    total calls: 50
    ...

  [1/5] heldout::heldout_xenobiology_taxonomy_00::author_productivity::endolithic_autotroph_order
      ok=10/10  unique_sequences=4

  ...

Chain-divergence rate for qwen/qwen-2.5-72b-instruct:
  60.0%  [Wilson 95% CI 23.1%, 88.2%]

  Diverged queries:    3 / 5
  Total replays:       50
  Error replays:       0

What it measures

For each (query, vendor) cell, the harness runs N replays of the same prompt. The agent makes tool calls (against a stub Σ-registry of 21 primitives); the harness records the tool-call sequence for each replay; chain-divergence is True for that query if the sequences are not byte-identical across all N replays. Wilson 95% CIs at the per-query level.

Hashes are byte-identical to those in the chain-determinism-bench-v1 dataset (see chain_determinism_harness/metrics.py::seq_full).

Stub-environment scope

This harness uses stub tool responses — every non-final_answer tool call returns a canned response. This isolates exploration-strategy non-determinism from data-path non-determinism.

The full-execution scope (tool calls executed against a real Σ-registry KG) is not equivalent. Stub-environment chain-divergence will typically be higher than full-execution chain-divergence (the agent gets less semantically informative feedback; exploration strategies vary more across replays). Treat the harness as a necessary condition test: if a vendor passes under stub conditions, it likely passes under full execution; if a vendor fails under stub conditions, full execution may yet improve it.

Endpoints supported

Anything OpenAI-compatible:

Provider Setup
OpenAI direct OPENAI_API_KEY=sk-...
OpenRouter OPENROUTER_API_KEY=sk-or-... (default base_url auto-detected)
Together AI --base-url https://api.together.xyz/v1 + Together API key in OPENAI_API_KEY
vLLM (local) --base-url http://localhost:8000/v1
Modal vLLM --base-url https://your-app.modal.run/v1 + your Modal token

CLI flags

chain-determinism-harness eval --model X [options]

  --model MODEL              Model identifier (required)
  --n-queries N              Held-out queries to use (max 10; default 5)
  --n-replays K              Replays per query (default 10)
  --temperature T            Sampling temperature (default 0.0)
  --concurrency C            Max concurrent replays per query (default 5)
  --base-url URL             OpenAI-compatible API base URL (auto-detected if unset)
  --timeout SEC              Per-request timeout (default 60)
  --out PATH                 Write summary JSON
  --runs-out PATH            Write per-replay JSONL (downstream analysis)

Cost guidance (rough)

Configuration Total calls Approx. cost
--n-queries 5 --n-replays 10 50 $0.10–$2 (depends on vendor)
--n-queries 10 --n-replays 20 200 $0.50–$10
--n-queries 10 --n-replays 50 (extended per-cell N) 500 $1–$25

Most frontier vendors on OpenRouter or direct API run well under a dollar for the default 5×10 = 50 calls. Mistral Large 2411 on OpenRouter (the most expensive measurable vendor we observed) is roughly $2 for 50 calls.

License

MIT (this harness). The companion dataset is CC-BY-4.0; vendor responses are not redistributed (hash-only).

Caveats

  • Vendor model versions are mutable. Today's gpt-5.4 is not 2027's gpt-5.4. Pin model snapshots where the API supports it (OpenAI dated IDs; Anthropic anthropic-version headers).
  • OpenRouter routing variance. OpenRouter load-balances across backend providers. Different runs may route to different providers; this is a confound for chain-divergence measurement.
  • Gameability. Chain-divergence rewards superficial template-locking (whitespace, key ordering — already enforced by sort_keys=True). It is not adversarially robust as a certification metric in its present form.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chain_determinism_harness-0.1.5.tar.gz (37.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

chain_determinism_harness-0.1.5-py3-none-any.whl (33.1 kB view details)

Uploaded Python 3

File details

Details for the file chain_determinism_harness-0.1.5.tar.gz.

File metadata

File hashes

Hashes for chain_determinism_harness-0.1.5.tar.gz
Algorithm Hash digest
SHA256 1c296eceed60a84003dc10ca98f4d5847aacc829a73a524cbe8a618673ad7648
MD5 61c9372a273951ddab3c81ab318a1228
BLAKE2b-256 882913a7f31d76de6f3259d0a8b6006abae86e936f5ba4e68b019ee90c1ecdef

See more details on using hashes here.

File details

Details for the file chain_determinism_harness-0.1.5-py3-none-any.whl.

File metadata

File hashes

Hashes for chain_determinism_harness-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 6688f8b688b0abebff128f24d089c32456af8d8763fd720e80ccc1dcbe251a85
MD5 f375f95972919c4160a80decb0656686
BLAKE2b-256 b941f27f2e761027c6414dc7b877dc2c7b4f05ed2743ccc5a1f61d2ca66269ba

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page