Inferentialist evaluation of LLMs: derive implication frames from a model's endorsement verdicts and measure model–analyst agreement on labeled inference benchmarks. Evidence bearing on inferential-mastery attribution.

Maintainers

bradleypallen

Project description

infereval

Inferentialist evaluation of LLMs: derive an implication frame from a language model's endorsement verdicts, then measure the model's agreement with an analyst-labeled benchmark via coverage and Cohen's / Fleiss' kappa. The agreement is evidence bearing on an inferential-mastery attribution — not a measurement of mastery itself (per the paper's Remark 8).

infereval is the executable companion to Note on Simonelli's Stop Sign Dialogue: An Implication-Space Instrument for Probing LLM Endorsement of Material Inferential Rules (Allen, 2026), which is maintained as a separate paper. The framework formalizes the procedure β → η → (cov, κ_C, κ_F, κ_F*) for any analyst-supplied benchmark.

Status

Beta (0.x, pre-1.0). The public Python API and CLI surface may shift between minor releases until 1.0. Methodology defaults are locked, and the JSON schemas are versioned independently (schema_version: "1.0") and promised stable from 1.0 onward regardless of the framework version. See the CHANGELOG for the current release.

Documentation

The docs site at https://www.bradleypallen.org/infereval/ covers Concepts (methodology mental model); Authoring benchmarks; Interpreting metrics (κ_C / κ_F / κ_F*, test-retest κ, decompositions, subsampling CIs, sensitivity sweeps); Providers (Anthropic seed handling, DeepSeek reasoning-token budgets, OpenRouter attribution); and Construct validity of the instrument, which collects the R1–R22 requirements catalogue and the end-to-end workflow in one place. It also includes four executable tutorial notebooks (quickstart, authoring, paraphrase-axis triangulation, pulmonology visualization), an auto-generated API reference, an Architecture dataflow diagram, a Glossary of paper symbols, and a JSON-schema reference.

Findings

A cross-family sweep of 13 models × 3 paraphrase variants against the stop-sign benchmark (2026-06-09) is committed at experiments/results/stop_sign_2026-06-09.md. Headline: under the paper-aligned δ(ra), 12 of 13 frontier LLMs across six families reproduce Simonelli's analyst row exactly (κ_C = +1.000). The sweep is a thirteen-model independent replication of the paper's empirical anchor under fresh v0.15.2 captures. The perceptual variant is the cleavage axis.

A companion 6-model pulmonary edema sweep (n=30 items) at experiments/results/pulmonology_2026-06-09.md characterizes within-day and day-out R22 reliability. In that sweep the deepseek-v4-pro cell exhibits monotone κ decay across three time-scales (0.867 → 0.792 → 0.729), the clearest published example of detectable across-update model drift the framework has produced.

Install

pip install infereval

Provider SDKs are optional extras (the framework runs without them — use the mock or replay providers):

pip install 'infereval[anthropic]'   # Anthropic Claude
pip install 'infereval[openai]'      # OpenAI + OpenRouter (OpenAI-API-compatible)
pip install 'infereval[all]'

From source (editable):

git clone https://github.com/bradleypallen/infereval
cd infereval
python -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'

60-second quickstart

Inspect the bundled stop-sign benchmark (Example 1 of the paper), then run an evaluation against the deterministic replay fixture — no API key needed:

# 1. Look at the benchmark.
infereval describe examples/stop_sign/benchmark.json

# 2. Validate it against the JSON schema.
infereval validate examples/stop_sign/benchmark.json

# 3. Run a deterministic evaluation against the committed replay fixture.
infereval evaluate examples/stop_sign/benchmark.json \
    --replay-from tests/fixtures/stop_sign_replay.jsonl \
    --output /tmp/eta.json \
    --n-samples 5 \
    --log /tmp/run.jsonl

# 4. Compute metrics.
infereval metrics /tmp/eta.json --benchmark examples/stop_sign/benchmark.json

To run against a real model, replace step 3 with:

export ANTHROPIC_API_KEY=...
infereval evaluate examples/stop_sign/benchmark.json \
    --provider anthropic --model claude-haiku-4-5-20251001 \
    --output /tmp/eta.json --n-samples 5 --log /tmp/run.jsonl

The JSONL run log under /tmp/run.jsonl records one event per provider call (prompt hash, raw response, parsed verdict, usage, timing) so the evaluation is auditable end to end.
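
As a rough illustration of auditing such a log, the sketch below reads a JSON Lines file and tallies the values under one key. The field names `prompt_hash` and `verdict` are assumptions made for the example, not the documented log schema, and the sample events are invented; inspect a real run log for the actual keys.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def tally(path, key):
    """Count the values found under `key`; events lacking the key count as None."""
    return Counter(event.get(key) for event in read_jsonl(path))


# Hypothetical events standing in for a real run log (field names are assumed).
sample = [
    {"prompt_hash": "a1", "verdict": "good"},
    {"prompt_hash": "a1", "verdict": "good"},
    {"prompt_hash": "b2", "verdict": "abstain"},
]
log_path = Path(tempfile.mkdtemp()) / "run.jsonl"
log_path.write_text(
    "\n".join(json.dumps(event) for event in sample) + "\n", encoding="utf-8"
)

verdict_counts = tally(log_path, "verdict")
print(verdict_counts)
```

Pointing `tally` at /tmp/run.jsonl with the real key names would give a quick per-verdict count to compare against the reported metrics.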

For the R22 test-retest discipline, infereval retest --auto --benchmark <b> --provider X --model Y [--interval-s N] [--save-etas DIR] (v0.11.0+) collapses the four-step manual workflow (evaluate, evaluate again, retest, optionally --claims) into one CLI call: it runs the same evaluation twice with an optional inter-capture sleep and emits the standard RetestResult directly. Later releases extend the command:

v0.12.0+ accepts --interval-s multiple times for cumulative drift-since-baseline analysis in one orchestrated call.
v0.13.0+ surfaces test-retest κ as a co-equal ### Reliability (R22) subhead in infereval report §2 alongside ### Agreement, and auto-detects single vs multi-interval retest shape (multi-interval renders a per-interval table with a worst-case overall verdict).
v0.14.0+ adds the staged-composition pattern: --baseline-from <eta-path> runs one fresh capture against a saved baseline, and --append-to <multi.json> appends a new pair to an existing multi-result. Both compute interval_s from actual elapsed wall clock, so day-out / week-out R22 evidence can ship as incremental commits without the CLI process needing to stay alive for the elapsed window.

See the stop-sign R22 capture for the worked end-to-end demo.

What this is and isn't

This is: a research tool that formalizes Simonelli's stop-sign dialogue into a repeatable evaluation procedure. Given (i) a bearer set, (ii) expression and context-construction functions, and (iii) a benchmark of implications labeled by one or more analysts, the framework drives an LLM through endorsement-probing for each implication and reports the resulting agreement with analyst practice along three axes:

Coverage — how often the model takes a substantive position (cov(η)).
Cohen's kappa — agreement against a chosen reference (analyst consensus c_i or a single analyst v_{:,j}).
Fleiss' kappa — agreement with the model treated as the (m+1)th annotator, alongside the inter-analyst baseline κ_F*(β) (Remark 4 of the paper).

Each metric can be decomposed by tag or by RSR target.
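
The three quantities above can be sketched in plain Python. This is an illustration using the textbook definitions of Cohen's and Fleiss' kappa, not infereval's implementation. It assumes verdicts are the strings "good", "bad", and "abstain", that abstentions count as an ordinary category when computing kappa, and that the consensus row is supplied directly. All names and data are hypothetical, and the paper's definitions take precedence where they differ.

```python
from collections import Counter

ABSTAIN = "abstain"


def coverage(verdicts):
    """Fraction of items on which the model takes a non-abstain position."""
    return sum(v != ABSTAIN for v in verdicts) / len(verdicts)


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)  # chance agreement
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] holds the labels every rater gave item i."""
    n_items = len(ratings)
    n_raters = len(ratings[0])  # assumes the same rater count per item
    totals = Counter()
    p_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        agreeing_pairs = sum(c * c for c in counts.values()) - n_raters
        p_bar += agreeing_pairs / (n_raters * (n_raters - 1))
    p_bar /= n_items
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical six-item benchmark with two analysts and one model.
analyst_1 = ["good", "good", "bad", "bad", "bad", "good"]
analyst_2 = ["good", "good", "bad", "good", "bad", "good"]
consensus = ["good", "good", "bad", "bad", "bad", "good"]  # assumed given
model = ["good", "good", "bad", "bad", "abstain", "good"]

cov = coverage(model)
kappa_c = cohen_kappa(model, consensus)
kappa_f_star = fleiss_kappa(list(zip(analyst_1, analyst_2)))  # analysts only
kappa_f = fleiss_kappa(list(zip(analyst_1, analyst_2, model)))  # model added
print(f"cov={cov:.3f} kappa_C={kappa_c:.3f} "
      f"kappa_F*={kappa_f_star:.3f} kappa_F={kappa_f:.3f}")
```

Comparing kappa_F against the analyst-only baseline kappa_F* shows whether adding the model as an extra annotator raises or lowers overall agreement.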

This is not: a factuality benchmark, a leaderboard, or an answer to whether LLMs are sapient. The methodology is carving-relative: results depend on the analyst-supplied bearer carving, context construction, and benchmark. The framework provides the machinery; the analyst supplies the practice the machinery is comparing against. See the Discussion in the paper for what carving-relativity buys and costs.

API surface

from infereval import (
    Verdict, Bearer, Implication,           # core data types
    DerivedFrame,                            # ⟨B, I_M⟩ per Definition 3
)
from infereval.benchmark import Benchmark
from infereval.evaluation import Evaluation, evaluate, EndorsementConfig, ProviderParams
from infereval.providers import get_provider
from infereval.metrics import MetricsReport

bench = Benchmark.load("examples/stop_sign/benchmark.json")
provider = get_provider("anthropic", "claude-haiku-4-5-20251001")
eta = evaluate(bench, provider,
               config=EndorsementConfig(n_samples=5),
               params=ProviderParams(temperature=1.0),
               log_path="/tmp/run.jsonl")
report = MetricsReport(eta=eta, benchmark=bench)
print(report.to_dict())

Locked methodology defaults

These are framework defaults, overridable per evaluation:

Setting	Default
`n_samples`	5 (odd, clean 3-way majority)
Tie-break	`abstain` (configurable: `good`, `bad`, `first`)
Verification prompt	`default-v1` (GOOD/BAD/ABSTAIN tokens with brief glosses)
TeX in expressions	Stripped at prompt time; LaTeX-source-friendly in benchmark JSON
Cohen's kappa reference	Analyst consensus `c_i` (override with `--reference analyst:<id>`)
Provider seed	Honored by OpenAI; ignored (with one-time warning) by Anthropic

See CLAUDE.md and the paper for the full list and the rationale behind each choice.
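
To make the interaction between `n_samples` and the tie-break setting concrete, here is a minimal sketch of collapsing repeated verdict samples into one verdict. It assumes a plurality vote and reads `first` as "the earliest sampled verdict among the tied leaders"; both are guesses for illustration, and the function name `aggregate` is hypothetical rather than part of the infereval API.

```python
from collections import Counter

TIE_BREAKS = {"abstain", "good", "bad", "first"}


def aggregate(samples, tie_break="abstain"):
    """Collapse verdict samples into one verdict by plurality vote."""
    if tie_break not in TIE_BREAKS:
        raise ValueError(f"unknown tie_break: {tie_break!r}")
    counts = Counter(samples)
    top = max(counts.values())
    leaders = [v for v in counts if counts[v] == top]
    if len(leaders) == 1:
        return leaders[0]  # a unique most frequent verdict
    if tie_break == "first":
        # Assumed meaning: earliest sampled verdict among the tied leaders.
        return next(v for v in samples if v in leaders)
    return tie_break  # fixed fallback verdict: abstain, good, or bad


# Five samples with a clear winner, then a 2-2-1 split that needs a tie-break.
print(aggregate(["good", "good", "bad", "good", "abstain"]))
print(aggregate(["good", "good", "bad", "bad", "abstain"]))
print(aggregate(["bad", "good", "good", "bad", "abstain"], tie_break="first"))
```

With three possible verdicts, an odd sample count rules out a two-way even split but not a 2-2-1 split, so a tie-break policy is still needed.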

Development

pip install -e '.[dev]'
pytest                                # all unit + replay tests
pytest -m live                        # opt-in live provider tests (requires API keys)
mypy src/infereval
ruff check src tests

Live provider tests require RUN_LIVE_PROVIDER_TESTS=1 and the relevant API key in the environment. They are skipped by default.

Citation

@unpublished{allen2026inferential,
  author = {Allen, Bradley P.},
  title  = {Note on {S}imonelli's Stop Sign Dialogue: An Implication-Space Instrument for Probing {LLM} Endorsement of Material Inferential Rules},
  year   = {2026},
  note   = {University of Amsterdam}
}

License

MIT — see LICENSE.

Project details

Release history

This version

0.17.7

Jul 7, 2026

0.17.6

Jul 4, 2026

0.17.5

Jul 3, 2026

0.17.4

Jul 3, 2026

0.17.3

Jul 3, 2026

0.17.2

Jul 3, 2026

0.17.1

Jul 3, 2026

0.17.0

Jul 3, 2026

0.16.0

Jun 9, 2026

0.15.0

Jun 8, 2026

0.14.0 yanked

Jun 7, 2026

Reason this release was yanked:

Ships three framework bugs (silent empty-response → ABSTAIN, cross-thread logger contamination, no rate-limit retry on burst-parallel OpenRouter calls). Fixed in v0.15.0+. Bundled experimental captures from this release were artifacts and have been retracted; see KNOWN_ISSUES_v0.14.0.md at the repo root. Upgrade to v0.16.0+ for clean re-captures under the fixed framework.

0.13.0

Jun 6, 2026

0.12.0

Jun 6, 2026

0.11.0

Jun 6, 2026

0.10.0

Jun 6, 2026

0.9.2

Jun 4, 2026

0.9.1

Jun 4, 2026

0.9.0

Jun 4, 2026

0.8.0

Jun 3, 2026

0.7.0

May 29, 2026

0.6.3

May 28, 2026

0.6.2

May 28, 2026

0.6.1

May 28, 2026

0.6.0

May 28, 2026

0.5.10

May 23, 2026

Download files

Source Distribution

infereval-0.17.7.tar.gz (443.2 kB)

Uploaded Jul 7, 2026 Source

Built Distribution

infereval-0.17.7-py3-none-any.whl (237.4 kB)

Uploaded Jul 7, 2026 Python 3

File details

Details for the file infereval-0.17.7.tar.gz.

File metadata

Download URL: infereval-0.17.7.tar.gz
Upload date: Jul 7, 2026
Size: 443.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for infereval-0.17.7.tar.gz
Algorithm	Hash digest
SHA256	`1fc124d3433e5fb623568e06cf817ddfba4dd866f0b2f70ba2e2191915eb621a`
MD5	`9ed3a62abde49862d8c505db7b4e5260`
BLAKE2b-256	`012e61e831e22b5a53c6620556b995cdbd814d206e49ab07ad0294ac2bea7628`
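
A downloaded file can be checked against the SHA256 digest above with the standard library. The sketch below assumes the sdist was saved to the current directory under its published name; the helper names `sha256_of` and `verify` are illustrative.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's SHA256 matches the expected hex digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Assumed local path; adjust to wherever the sdist was downloaded.
sdist = Path("infereval-0.17.7.tar.gz")
expected = "1fc124d3433e5fb623568e06cf817ddfba4dd866f0b2f70ba2e2191915eb621a"
if sdist.exists():
    print("match" if verify(sdist, expected) else "MISMATCH")
```

The same helper works for the wheel by swapping in its file name and the SHA256 digest listed for it.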

Provenance

The following attestation bundles were made for infereval-0.17.7.tar.gz:

Publisher: publish.yml on bradleypallen/infereval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: infereval-0.17.7.tar.gz
- Subject digest: 1fc124d3433e5fb623568e06cf817ddfba4dd866f0b2f70ba2e2191915eb621a
- Sigstore transparency entry: 2102815436
- Sigstore integration time: Jul 7, 2026
Source repository:
- Permalink: bradleypallen/infereval@2089901dc292b42bf559d3ec31453647f1a6df0a
- Branch / Tag: refs/tags/v0.17.7
- Owner: https://github.com/bradleypallen
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2089901dc292b42bf559d3ec31453647f1a6df0a
- Trigger Event: release

File details

Details for the file infereval-0.17.7-py3-none-any.whl.

File metadata

Download URL: infereval-0.17.7-py3-none-any.whl
Upload date: Jul 7, 2026
Size: 237.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for infereval-0.17.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`12a6d71fb699f786faae5c184cde0c24d7440812c32be17ea602d50621bca0e6`
MD5	`140d5acee240df8150e60877bd7a5f48`
BLAKE2b-256	`bf5ef448503171f7ad6841e330b4612b890cbd199760579ff5ddafbc31ed1e9d`

Provenance

The following attestation bundles were made for infereval-0.17.7-py3-none-any.whl:

Publisher: publish.yml on bradleypallen/infereval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: infereval-0.17.7-py3-none-any.whl
- Subject digest: 12a6d71fb699f786faae5c184cde0c24d7440812c32be17ea602d50621bca0e6
- Sigstore transparency entry: 2102815518
- Sigstore integration time: Jul 7, 2026
Source repository:
- Permalink: bradleypallen/infereval@2089901dc292b42bf559d3ec31453647f1a6df0a
- Branch / Tag: refs/tags/v0.17.7
- Owner: https://github.com/bradleypallen
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2089901dc292b42bf559d3ec31453647f1a6df0a
- Trigger Event: release
