Skip to main content

Measure chain-determinism (per-trace structural consistency across replays) for any OpenAI-compatible LLM endpoint.

Project description

chain-determinism-harness

Measure chain-determinism for any OpenAI-compatible LLM endpoint in under 5 minutes for under $2.

Companion harness for the chain-determinism-bench-v1 dataset.

Chain-determinism = the property that N replays of the same query against the same vendor at T=0 produce identical tool-call sequences. Across 9 frontier vendors at T=0 we observe chain-divergence rates from 17.5% (Anthropic Claude Sonnet 4.5) to 97.1% (Mistral Large 2411). This harness lets you measure your own.

Install

pip install chain-determinism-harness

Quickstart (5 minutes, ~$0.50–$2)

export OPENROUTER_API_KEY=sk-or-...        # or OPENAI_API_KEY=sk-...

python -m chain_determinism_harness eval \
    --model qwen/qwen-2.5-72b-instruct \
    --n-queries 5 --n-replays 10

Sample output:

==> chain-determinism-harness eval
    model:       qwen/qwen-2.5-72b-instruct
    queries:     5  (out of 10 embedded)
    replays:     10 per query
    total calls: 50
    ...

  [1/5] heldout::heldout_xenobiology_taxonomy_00::author_productivity::endolithic_autotroph_order
      ok=10/10  unique_sequences=4

  ...

Chain-divergence rate for qwen/qwen-2.5-72b-instruct:
  60.0%  [Wilson 95% CI 23.1%, 88.2%]

  Diverged queries:    3 / 5
  Total replays:       50
  Error replays:       0

What it measures

For each (query, vendor) cell, the harness runs N replays of the same prompt. The agent makes tool calls (against a stub Σ-registry of 21 primitives — same registry the paper measures); the harness records the tool-call sequence for each replay; chain-divergence is True for that query if the sequences are not byte-identical across all N replays. Wilson 95% CIs at the per-query level.

Same operationalization as the paper: hashes are byte-identical to those in the chain-determinism-bench-v1 dataset (see chain_determinism_harness/metrics.py::seq_full).

Stub-environment scope

This harness uses stub tool responses — every non-final_answer tool call returns a canned response (matching the paper's §3.5 SWE-bench Lite scope). This isolates exploration-strategy non-determinism from data-path non-determinism. The full Phase 1a methodology in the paper (where tool calls are executed against a real Σ-registry KG) is implemented at scripts/paper5_multivendor_replay.py in the companion repository; this harness is the lighter-weight self-contained alternative for spot-checking new vendors.

The two scopes are not equivalent. Stub-environment chain-divergence will typically be higher than full-execution chain-divergence (the agent gets less semantically informative feedback; exploration strategies vary more across replays). Treat the harness as a necessary condition test: if a vendor passes under stub conditions, it likely passes under full execution; if a vendor fails under stub conditions, full execution may yet improve it.

Endpoints supported

Anything OpenAI-compatible:

Provider Setup
OpenAI direct OPENAI_API_KEY=sk-...
OpenRouter OPENROUTER_API_KEY=sk-or-... (default base_url auto-detected)
Together AI --base-url https://api.together.xyz/v1 + Together API key in OPENAI_API_KEY
vLLM (local) --base-url http://localhost:8000/v1
Modal vLLM --base-url https://your-app.modal.run/v1 + your Modal token

CLI flags

chain-determinism-harness eval --model X [options]

  --model MODEL              Model identifier (required)
  --n-queries N              Held-out queries to use (max 10; default 5)
  --n-replays K              Replays per query (default 10)
  --temperature T            Sampling temperature (default 0.0)
  --concurrency C            Max concurrent replays per query (default 5)
  --base-url URL             OpenAI-compatible API base URL (auto-detected if unset)
  --timeout SEC              Per-request timeout (default 60)
  --out PATH                 Write summary JSON
  --runs-out PATH            Write per-replay JSONL (downstream analysis)

Cost guidance (rough)

Configuration Total calls Approx. cost
--n-queries 5 --n-replays 10 50 $0.10–$2 (depends on vendor)
--n-queries 10 --n-replays 20 200 $0.50–$10
--n-queries 10 --n-replays 50 (paper's Phase 1a per-cell N) 500 $1–$25

Most frontier vendors on OpenRouter or direct API run well under a dollar for the default 5×10 = 50 calls. Mistral Large 2411 on OpenRouter (the paper's most expensive measurable vendor) is roughly $2 for 50 calls.

License

MIT (this harness). The companion dataset is CC-BY-4.0; vendor responses are not redistributed (hash-only).

Caveats

  • Vendor model versions are mutable. Today's gpt-5.4 is not 2027's gpt-5.4. Pin model snapshots where the API supports it (OpenAI dated IDs; Anthropic anthropic-version headers).
  • OpenRouter routing variance. OpenRouter load-balances across backend providers. Different runs may route to different providers; this is a confound for chain-divergence measurement (see paper §3.4).
  • Gameability. Chain-divergence rewards superficial template-locking (whitespace, key ordering — already enforced by sort_keys=True). It is not adversarially robust as a certification metric in its present form.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chain_determinism_harness-0.1.1.tar.gz (36.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

chain_determinism_harness-0.1.1-py3-none-any.whl (32.2 kB view details)

Uploaded Python 3

File details

Details for the file chain_determinism_harness-0.1.1.tar.gz.

File metadata

File hashes

Hashes for chain_determinism_harness-0.1.1.tar.gz
Algorithm Hash digest
SHA256 aedbf253a551373f72040ce347686774f03eb12de4362e7146baec1a394b02a9
MD5 69e81a390aea566b15f2dbebdcb8e980
BLAKE2b-256 bf0222c32c757b315332877028a122a03d6896600c006bcdbadac063c4ad77c0

See more details on using hashes here.

File details

Details for the file chain_determinism_harness-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for chain_determinism_harness-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 cc67ab1ada5cd3f16967d1ac01355482ae834a3db5648606944cbf0e060c8e2c
MD5 f3525ebf227accb6a3a7e3ce8a947045
BLAKE2b-256 770866a7358b0674c015eef0589e5f010b0c80eca64aee8a8dc443f41084f2ff

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page