invarlock

Paired model release-regression evaluation with independently verifiable evidence

Maintainers (verified by PyPI): invarlock.dev

Project description

InvarLock

Run paired release-regression checks. Verify the evidence independently. Render one clear report.

InvarLock is an open-source assurance engine for one paired baseline-versus-subject release-regression decision. A closed request pins the two model artifacts, a local JSONL evaluation source, the runtime settings, one built-in metric or scorer binding, and a policy. The evaluate command runs both sides on the same deterministic schedule, publishes a signed evidence bundle, and records whether the selected paired interval satisfies the policy. A separate verifier replays the bundle against independently supplied trust anchors.

invarlock evaluate request.yaml
invarlock verify evidence/
invarlock report evidence/

[Figure: a pinned paired evaluation request runs the baseline and subject providers, publishes signed evidence, is independently verified, and is rendered as a report.]

The release-regression decision

Both sides score the same authenticated records in the same order. InvarLock derives one of two built-in paired comparisons:

Metric	Point comparison	Policy verdict
`exact_match`	Subject accuracy minus baseline accuracy, with paired regression/improvement counts and an exact McNemar test	Lower bound of the paired Newcombe 95% interval is at least `delta_min_pp`
`normalized_nll_per_utf8_byte`	Ratio of arithmetic means of per-record byte-normalized expected-continuation NLL	Upper bound of the paired schedule-resampling interval is at most `ratio_max`

For exact match, InvarLock reports baseline-pass to subject-fail regressions, baseline-fail to subject-pass improvements, the exact two-sided McNemar probability, and a continuity-corrected paired Newcombe 95% effect-size interval. For normalized NLL, it uses 2,048 paired percentile-bootstrap replicates whose index draws are derived from the authenticated schedule digest. In both cases the policy reads the conservative interval bound, not the point value alone.
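
To make the exact-match statistics above concrete, here is a minimal sketch of an exact two-sided McNemar probability computed from the two discordant counts. It is an illustration of the textbook test, not InvarLock's implementation; the function name and argument names are assumptions chosen for this example.

```python
from math import comb


def exact_mcnemar_two_sided(regressions: int, improvements: int) -> float:
    """Exact two-sided McNemar probability from discordant pair counts.

    Under the null hypothesis each discordant pair is equally likely to be a
    regression (baseline pass, subject fail) or an improvement (baseline fail,
    subject pass), so the smaller count follows Binomial(n, 0.5) with
    n = regressions + improvements.
    """
    if regressions < 0 or improvements < 0:
        raise ValueError("counts must be non-negative")
    n = regressions + improvements
    if n == 0:
        # No discordant pairs: the data cannot distinguish the two sides.
        return 1.0
    k = min(regressions, improvements)
    # Probability of observing k or fewer in the smaller cell, then doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)
```

For example, 2 regressions against 8 improvements gives `exact_mcnemar_two_sided(2, 8) == 0.109375`. Concordant pairs do not enter the test, which is why only the two discordant counts are needed.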

A policy may also require a minimum paired-record count and a maximum interval width. Those controls are supplied together. When present, the report passes only when the metric bound, record-count minimum, and precision ceiling all pass. Preflight can prove the schedule count before execution; it reports the interval-width check as pending until paired results exist.
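
The combined gate described above can be sketched as a single conjunction for the exact-match case. This is a hedged illustration: `delta_min_pp` comes from the table earlier, while `Qualification`, `min_paired_records`, and `max_interval_width_pp` are hypothetical names that do not reflect the real policy file schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Qualification:
    """Hypothetical container for the two controls that are supplied together."""

    min_paired_records: int
    max_interval_width_pp: float


def exact_match_passes(
    lower_pp: float,
    upper_pp: float,
    paired_records: int,
    delta_min_pp: float,
    qualification: Optional[Qualification] = None,
) -> bool:
    """Return True only when every configured check passes."""
    # The policy reads the conservative bound, not the point estimate.
    metric_ok = lower_pp >= delta_min_pp
    if qualification is None:
        return metric_ok
    count_ok = paired_records >= qualification.min_paired_records
    width_ok = (upper_pp - lower_pp) <= qualification.max_interval_width_pp
    return metric_ok and count_ok and width_ok
```

Before execution only `count_ok` can be evaluated, which matches the text: the interval width stays pending until paired results exist.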

Normalized NLL is teacher-forced expected-continuation likelihood regression. It does not measure general model quality. When both artifacts bind the same authenticated tokenizer and every pair has the same positive target-token count, the verifier also renders a token-weighted perplexity ratio as a derived likelihood interpretation. That derived value has no threshold, interval, or verdict authority.
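
As a rough sketch of the normalized-NLL comparison, the code below normalizes a per-record NLL by UTF-8 byte length and computes a paired percentile-bootstrap upper bound on the ratio of means. Several details are assumptions made for illustration only: the ratio is taken as subject over baseline, the digest is turned into a seed by reading it as an integer, the interval is 95% two-sided, and all function names are hypothetical. InvarLock's actual derivation of index draws from the schedule digest is not specified here.

```python
import math
import random


def per_byte_nll(total_nll: float, expected_text: str) -> float:
    """Normalize one record's expected-continuation NLL by its UTF-8 length."""
    n_bytes = len(expected_text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("expected continuation must not be empty")
    return total_nll / n_bytes


def bootstrap_ratio_upper_bound(
    baseline: list[float],
    subject: list[float],
    schedule_digest_hex: str,
    replicates: int = 2048,
    confidence: float = 0.95,  # assumption: not stated for this metric
) -> float:
    """Upper percentile bound of mean(subject) / mean(baseline)."""
    if len(baseline) != len(subject) or not baseline:
        raise ValueError("need equally sized, non-empty paired samples")
    # Assumption: seed a PRNG from the digest so every replay draws the
    # same indexes. The same index list is applied to both sides (paired).
    rng = random.Random(int(schedule_digest_hex, 16))
    n = len(baseline)
    ratios = []
    for _ in range(replicates):
        idx = [rng.randrange(n) for _ in range(n)]
        base_mean = sum(baseline[i] for i in idx) / n
        subj_mean = sum(subject[i] for i in idx) / n
        ratios.append(subj_mean / base_mean)
    ratios.sort()
    upper_q = 1.0 - (1.0 - confidence) / 2.0
    pos = min(replicates - 1, math.ceil(upper_q * replicates) - 1)
    return ratios[pos]
```

Under these assumptions a policy check would compare the returned bound against `ratio_max`. Because the draws depend only on the digest, two runs over the same paired values return the same bound.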

For a task-specific deterministic text scorer, the request's comparison section can select one fully bound scorer_extension instead of a built-in metric. The runtime still collects authenticated expected output, output text, and output digest facts. An explicitly authorized scorer replays those facts into one [0,1] higher-is-better value per record; core owns the arithmetic means, paired percentage-point delta, deterministic interval, and policy decision. Separately installed scorer packages can implement deterministic F1, structured extraction, or VQA answer normalization; each requires explicit authorization through this extension contract. The public CLI loads the exact installed scorer bound by the request only when --allow-installed-scorers is supplied to both evaluate and verify.
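
For a sense of what a deterministic per-record scorer computes, here is a minimal token-level F1 sketch that maps an expected output and an output text to one value in [0,1]. The function name, its signature, and the lowercase-and-split tokenization are assumptions for illustration; the real scorer_extension contract and its registration mechanism are defined by InvarLock's own documentation and are not shown here.

```python
from collections import Counter


def token_f1(expected_output: str, output_text: str) -> float:
    """Deterministic token-overlap F1 in [0, 1]; higher is better."""
    # Assumption: whitespace tokenization after lowercasing. A real scorer
    # would pin its normalization rules as part of its bound identity.
    expected = expected_output.lower().split()
    produced = output_text.lower().split()
    if not expected and not produced:
        return 1.0
    if not expected or not produced:
        return 0.0
    # Multiset intersection counts each shared token at most as often as
    # it appears on both sides.
    overlap = sum((Counter(expected) & Counter(produced)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(produced)
    recall = overlap / len(expected)
    return 2.0 * precision * recall / (precision + recall)
```

The scorer only produces per-record values. Averaging, the paired percentage-point delta, the interval, and the verdict stay outside it, as the paragraph above describes.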

Executable SQL or code scoring, model-based semantic similarity, network or human scoring, external models, and LLM judges are outside this acceptance contract. Judge results can be attached as authenticated observations until a separate deterministic replay contract and calibration justify more.

Run a comparison

Install the built-in Hugging Face provider and prepare:

local baseline and subject snapshots;
a digest-pinned JSONL file with prompt, expected-output, and optional stable ID fields;
a one-metric policy file;
a digest-addressed InvarLock runtime image available to Docker or Podman; and
an Ed25519 evidence-signing key available only to the host transaction.

python -m pip install "invarlock[hf]"

Run requests bind a dataset object. evaluate authenticates the JSONL bytes and deterministically prepares the canonical paired schedule inside the transaction:

format_version: invarlock/evaluation-request-v1
comparison:
  baseline:
    artifact:
      path: artifacts/baseline
      model_id: acme/baseline
      locator: hf://acme/baseline@0123456789abcdef0123456789abcdef01234567
    runtime:
      provider: hf_transformers
      settings:
        batch_size: 1
        checkpoint_tree_sha256: "1111111111111111111111111111111111111111111111111111111111111111"
        context_length: 2048
        immutable_revision: 0123456789abcdef0123456789abcdef01234567
        max_output_tokens: 64
        offline: true
        seed: 7
        timeout_seconds: 300
        tokenizer_metadata_sha256: "3333333333333333333333333333333333333333333333333333333333333333"
  subject:
    artifact:
      path: artifacts/subject
      model_id: acme/subject
      locator: hf://acme/subject@fedcba9876543210fedcba9876543210fedcba98
    runtime:
      provider: hf_transformers
      settings:
        batch_size: 1
        checkpoint_tree_sha256: "2222222222222222222222222222222222222222222222222222222222222222"
        context_length: 2048
        immutable_revision: fedcba9876543210fedcba9876543210fedcba98
        max_output_tokens: 64
        offline: true
        seed: 7
        timeout_seconds: 300
        tokenizer_metadata_sha256: "3333333333333333333333333333333333333333333333333333333333333333"
  dataset:
    path: inputs/release-regression.jsonl
    sha256: "4444444444444444444444444444444444444444444444444444444444444444"
    format: jsonl
    name: release-regression
    split: validation
    input_field: prompt
    expected_output_field: expected
    id_field: case_id
  policy: policy/acceptance.json
  task: text_causal
  metric: normalized_nll_per_utf8_byte
execution:
  mode: run
output:
  evidence: artifacts/evidence-001

evaluate always performs the complete execution-free validation before it starts model runtimes. Use --preflight to stop after that validation and inspect its machine-readable result without creating output:

invarlock evaluate request.yaml --signing-key evidence-signer.pem \
  --runtime-image registry.example/invarlock-runtime@sha256:... \
  --preflight --json

Preflight emits invarlock/evaluation-preflight-v2 and checks configuration and local availability. When the policy includes sample qualification, it also reports the observed record count and leaves interval width explicitly pending_execution. Continuing with the real evaluation is still required to establish runtime execution, interval precision, and the policy result.

Replace the illustrative digests with values derived from the exact inputs. Then invoke the host CLI. In run mode, the host prepares the authenticated schedule and launches a separately pinned worker for each side; Docker is the default engine and Podman is supported.
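
One way to obtain the dataset digest is to hash the exact bytes of the JSONL file. The sketch below writes a small file with the field names used in the request above (prompt, expected, case_id) and computes its SHA-256. The helper names are hypothetical, and this covers only the dataset sha256 field; how checkpoint_tree_sha256 and tokenizer_metadata_sha256 are derived is not described here, so they are left out.

```python
import hashlib
import json
from pathlib import Path


def write_jsonl(path: Path, records: list[dict]) -> None:
    """Write one JSON object per line with stable key order and LF endings."""
    with path.open("w", encoding="utf-8", newline="\n") as handle:
        for record in records:
            line = json.dumps(record, ensure_ascii=False, sort_keys=True)
            handle.write(line + "\n")


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the lowercase hex SHA-256 of the file's exact bytes."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large evaluation files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Any later change to the file, including whitespace or line endings, changes the digest, so the value should be computed from the final file that the request points at.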

Build the authenticated Git source bundle as shown in the runtime-provider guide, then build and smoke-test the image that matches the intended device:

mkdir -p artifacts

# CPU, including Apple Silicon through the matching multi-architecture lock
make runtime-image \
  RUNTIME_SOURCE_COMMIT="$SOURCE_COMMIT" \
  RUNTIME_SOURCE_BUNDLE="$SOURCE_BUNDLE" \
  RUNTIME_SOURCE_BUNDLE_SHA256="$SOURCE_BUNDLE_SHA256" \
  RUNTIME_BUILD_STATEMENT="$PWD/artifacts/runtime-build-cpu.json"
make runtime-smoke

# x86_64 NVIDIA CUDA 12.6
make runtime-image-cuda \
  RUNTIME_SOURCE_COMMIT="$SOURCE_COMMIT" \
  RUNTIME_SOURCE_BUNDLE="$SOURCE_BUNDLE" \
  RUNTIME_SOURCE_BUNDLE_SHA256="$SOURCE_BUNDLE_SHA256" \
  RUNTIME_BUILD_STATEMENT="$PWD/artifacts/runtime-build-cuda.json"
make runtime-smoke-cuda

invarlock evaluate request.yaml \
  --signing-key evidence-signer.pem \
  --baseline-runtime-image registry.example/invarlock-runtime@sha256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa \
  --baseline-runtime-image-digest sha256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa \
  --subject-runtime-image registry.example/invarlock-runtime@sha256:dddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd \
  --subject-runtime-image-digest sha256:dddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd \
  --container-engine docker \
  --baseline-runtime-device cuda:0 \
  --subject-runtime-device cuda:1

Shared --runtime-image, --runtime-image-digest, --runtime-device, and --runtime-entrypoint options remain convenient defaults when both sides use the same settings. --runtime-device cuda exposes a GPU only to an image that already contains the CUDA runtime; it does not turn the CPU image into a CUDA image.

Each worker receives its own artifact and support resources read-only plus an isolated writable output directory. The host validates both outputs, publishes the no-clobber evidence bundle, and signs it without exposing the private key to either worker. Workers sharing a generic or identical CUDA device run sequentially; explicitly different CUDA indexes can run in parallel.
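
The scheduling rule in the last sentence can be expressed as a small predicate. This is a sketch of the stated rule only, with a hypothetical function name; treating non-CUDA devices such as cpu as sequential is an assumption, since the text does not cover them.

```python
from typing import Optional


def _cuda_index(device: str) -> Optional[int]:
    """Return the explicit CUDA index, or None for generic or non-CUDA devices."""
    prefix = "cuda:"
    suffix = device[len(prefix):]
    if device.startswith(prefix) and suffix.isdigit():
        return int(suffix)
    return None


def may_run_in_parallel(baseline_device: str, subject_device: str) -> bool:
    """True only when both sides name explicit, different CUDA indexes."""
    baseline_index = _cuda_index(baseline_device)
    subject_index = _cuda_index(subject_device)
    if baseline_index is None or subject_index is None:
        # Generic "cuda" (or, by assumption, any non-CUDA device) is sequential.
        return False
    return baseline_index != subject_index
```

With the devices from the command above, `may_run_in_parallel("cuda:0", "cuda:1")` is `True`, while a shared `cuda` or an identical `cuda:0` on both sides is `False`.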

Verify and report

Verification supplies the expected artifact identities, canonical schedule, policy, runtime identities, and evidence signer independently of the bundle. Keep those inputs in one closed verifier-owned profile:

invarlock verify artifacts/evidence-001/ \
  --trust-profile trust/trust-inputs.json \
  --receipt verification.receipt.json

invarlock report artifacts/evidence-001/ --html evidence.html --explain

The evidence signer authenticates the comparison bytes. The verifier decides whether those bytes satisfy the independently maintained anchors and signs a separate receipt that binds the profile digest. report renders the signature-authenticated comparison; independent verification remains the acceptance record. The CLI reference defines the closed profile and the equivalent explicit options.
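
The separation between what the bundle claims and what the verifier expects can be illustrated with a plain digest comparison. This is a conceptual sketch only: the layout of the trust profile is not shown in this document, so the dictionary keys and the function name below are hypothetical, and signature checking is omitted entirely.

```python
import hmac


def mismatched_anchors(
    expected: dict[str, str],
    observed: dict[str, str],
) -> list[str]:
    """Return the anchor names whose observed digest is missing or different.

    `expected` comes from the verifier-owned profile; `observed` is what the
    bundle claims. Only names listed in `expected` are checked.
    """
    mismatches = []
    for name, expected_digest in sorted(expected.items()):
        observed_digest = observed.get(name)
        if observed_digest is None or not hmac.compare_digest(
            expected_digest, observed_digest
        ):
            mismatches.append(name)
    return mismatches
```

An empty result means every independently supplied digest matched. The point of the design is that `expected` never originates from the bundle under test.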

Import existing measurements

The repository's examples/integrations/ directory contains maintained Hugging Face, PEFT, TorchAO, GGUF/llama.cpp, TensorRT-LLM, Hugging Face vision-text, and LM Evaluation Harness journeys. They create or obtain real artifacts and complete the source-bound evaluate, verify, and report transaction. The TensorRT-LLM journey builds BF16 and calibrated FP8 Qwen3-0.6B engines concurrently on the target H100 GPUs before authenticating their resulting identities. The vision-text journey compares two pinned Qwen2-VL checkpoints on authenticated image content. The repository also includes an offline evidence-handoff journey for complete per-record results produced elsewhere. Import mode requires the canonical schedule, typed observations, runtime bindings, and paired records; InvarLock re-derives the comparison before publication.

The model-change workflow guide maps fine-tuning, pruning, quantization, GGUF, TensorRT-LLM, multimodal, harness, and endpoint outputs to the appropriate built-in, optional-runtime, or import boundary.

Providers and diagnostics

Hugging Face Transformers is the built-in reference provider and supports both built-in metrics. First-party optional GGUF/llama.cpp, TensorRT-LLM, and Hugging Face vision-text packages are independently installable runtime integrations with their own dependency sets. The vision-text add-in supports exact-match comparisons over authenticated prompt and image parts. See runtime providers.

Spectral, random-matrix, and variance summaries live in the optional invarlock-diagnostics package. They are observation-only diagnostics; the selected paired comparison and policy exclusively determine acceptance. Their canonical JSON can be attached to the signed bundle and appears in a separate report section without changing the verdict. See diagnostics.

Documentation

Run and review: getting started, evaluation requests, schedule and policy, and evidence and verification.
Understand the claim: assurance case, decision semantics, and trust model.
Integrate: CLI, contracts, runtime providers, and Python API.

InvarLock is pre-1.0. Canonical artifact formats carry explicit format versions; the Python embedding facade may evolve between minor releases.

Questions and design discussions belong in GitHub Discussions. Report bugs through GitHub Issues and security concerns through SECURITY.md.

Apache-2.0 — see the license.

Release history

Version	Released
0.13.0 (this version)	Jul 22, 2026
0.12.1	Jul 5, 2026
0.12.0	Jun 30, 2026
0.11.0	Jun 16, 2026
0.10.0	Jun 3, 2026
0.9.0	May 25, 2026
0.8.0	Apr 24, 2026
0.7.2	Apr 15, 2026
0.7.1	Apr 13, 2026
0.7.0	Apr 9, 2026
0.6.0	Apr 4, 2026
0.5.1	Apr 2, 2026
0.5.0	Mar 25, 2026
0.4.0	Mar 14, 2026
0.3.12	Feb 27, 2026
0.3.11	Feb 13, 2026
0.3.10	Feb 8, 2026
0.3.9	Feb 3, 2026
0.3.8	Feb 2, 2026
0.3.7	Jan 22, 2026
0.3.6	Jan 13, 2026
0.3.5	Jan 3, 2026
0.3.4	Dec 28, 2025
0.3.3	Dec 22, 2025
0.3.2	Dec 14, 2025
0.3.1	Dec 10, 2025
0.3.0	Dec 5, 2025
0.2.0	Dec 2, 2025

Download files

Source Distribution

invarlock-0.13.0.tar.gz (216.9 kB)

Uploaded Jul 22, 2026 Source

Built Distribution

invarlock-0.13.0-py3-none-any.whl (246.7 kB)

Uploaded Jul 22, 2026 Python 3

File details

Details for the file invarlock-0.13.0.tar.gz.

File metadata

Download URL: invarlock-0.13.0.tar.gz
Upload date: Jul 22, 2026
Size: 216.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.14

File hashes

Hashes for invarlock-0.13.0.tar.gz
Algorithm	Hash digest
SHA256	`f84fe7c1b9a8ecc232ba65d0642135666be9f9fcdfe88f976d5b21ea06f333de`
MD5	`873becf54fb10d396d0334e6aa781146`
BLAKE2b-256	`451d6942dfaeb3874175b9fe9e3226f40b040de9bb1edbddd7d8b154dcc8b137`

Provenance

The following attestation bundles were made for invarlock-0.13.0.tar.gz:

Publisher: release.yml on invarlock/invarlock

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: invarlock-0.13.0.tar.gz
- Subject digest: f84fe7c1b9a8ecc232ba65d0642135666be9f9fcdfe88f976d5b21ea06f333de
- Sigstore transparency entry: 2215839311
- Sigstore integration time: Jul 22, 2026
Source repository:
- Permalink: invarlock/invarlock@2785f3be765e025651b8f1e92bd21a2e2cb14cca
- Branch / Tag: refs/tags/v0.13.0
- Owner: https://github.com/invarlock
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2785f3be765e025651b8f1e92bd21a2e2cb14cca
- Trigger Event: workflow_dispatch

File details

Details for the file invarlock-0.13.0-py3-none-any.whl.

File metadata

Download URL: invarlock-0.13.0-py3-none-any.whl
Upload date: Jul 22, 2026
Size: 246.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.14

File hashes

Hashes for invarlock-0.13.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`eb15cfad270e771a847a94bfeb6bf6b3f324d66514e55fbd5114d0e70e389749`
MD5	`22b581cd85e0f585eb058934aded49d8`
BLAKE2b-256	`1ec5b881ffe5736839600d2af223abe04eb73ac3e08299dad77ae715cabbf76d`

Provenance

The following attestation bundles were made for invarlock-0.13.0-py3-none-any.whl:

Publisher: release.yml on invarlock/invarlock

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: invarlock-0.13.0-py3-none-any.whl
- Subject digest: eb15cfad270e771a847a94bfeb6bf6b3f324d66514e55fbd5114d0e70e389749
- Sigstore transparency entry: 2215839337
- Sigstore integration time: Jul 22, 2026
Source repository:
- Permalink: invarlock/invarlock@2785f3be765e025651b8f1e92bd21a2e2cb14cca
- Branch / Tag: refs/tags/v0.13.0
- Owner: https://github.com/invarlock
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2785f3be765e025651b8f1e92bd21a2e2cb14cca
- Trigger Event: workflow_dispatch
