Correctness and reliability testing for LLM inference engines
infer-check
Catches the correctness bugs that benchmarks miss in LLM inference engines.
Quantization silently breaks arithmetic. Serving layers silently alter output. KV caches silently corrupt under load. Benchmarks like lm-evaluation-harness test whether models are smart — infer-check tests whether engines are correct.
The problem
Every LLM inference engine has correctness bugs that benchmarks don't catch:
- KV cache NaN pollution in vLLM-Ascend permanently corrupts all subsequent requests
- FP8 KV quantization in vLLM causes repeated garbage output
- FP8 DeepGEMM kernels in SGLang produce 32.5% element mismatches on Blackwell GPUs
- Batch-size-dependent output, where generated tokens change with the number of concurrent requests
These aren't model quality problems — they're engine correctness failures. infer-check is a CLI tool that runs differential tests across backends, quantization levels, and concurrency conditions to surface them automatically.
Installation
pip install infer-check
# With MLX backend support (Apple Silicon)
pip install "infer-check[mlx]"
Quick start
Compare two quantizations head-to-head:
infer-check compare \
mlx-community/Llama-3.1-8B-Instruct-4bit \
mlx-community/Llama-3.1-8B-Instruct-8bit \
--prompts adversarial-numerics
Run a full quantization sweep:
infer-check sweep \
--models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\
8bit=mlx-community/Meta-Llama-3.1-8B-Instruct-8bit,\
4bit=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" \
--prompts reasoning
Commands
| Command | Purpose | Docs |
|---|---|---|
| sweep | Compare pre-quantized models against a baseline | docs |
| compare | Head-to-head comparison of two models or quantizations | docs |
| diff | Compare outputs across different backends for the same model | docs |
| determinism | Test output reproducibility at temperature=0 | docs |
| stress | Test correctness under concurrent load | docs |
| report | Generate HTML/JSON reports from saved results | docs |
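Because every command is a plain CLI entry point, infer-check can also be driven from scripts. Below is a minimal sketch that shells out to the documented compare invocation from the quick start; treating a nonzero exit code as a failure signal is an assumption about the CLI's behavior (a dedicated ci mode with pass/fail exit codes is still on the roadmap).

```python
# Hedged sketch: drive infer-check from a script. Only the `compare` arguments
# documented in the quick start are used; interpreting a nonzero exit code as
# failure is an assumption, not documented behavior.
import subprocess

result = subprocess.run(
    [
        "infer-check", "compare",
        "mlx-community/Llama-3.1-8B-Instruct-4bit",
        "mlx-community/Llama-3.1-8B-Instruct-8bit",
        "--prompts", "adversarial-numerics",
    ],
    capture_output=True,
    text=True,
)
print(result.stdout)
if result.returncode != 0:
    raise SystemExit("infer-check compare reported a problem")
```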
Example results
Results from running infer-check on Llama-3.1-8B-Instruct on Apple Silicon using mlx-lm.
Quantization sweep
Sweep Summary
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ quant_level ┃ identical ┃ minor ┃ moderate ┃ severe ┃ mean_similarity ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ bf16 (self-check) │ 50/50 │ 0/50 │ 0/50 │ 0/50 │ 1.0000 │
│ 8bit │ 20/50 │ 9/50 │ 12/50 │ 9/50 │ 0.8067 │
│ 4bit │ 1/50 │ 3/50 │ 11/50 │ 35/50 │ 0.3837 │
└─────────────────────┴───────────┴───────┴──────────┴────────┴─────────────────┘
A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly.
Cross-backend diff
mlx-lm vs vllm-mlx at temperature=0: 50/50 identical (reasoning) and 30/30 identical (numerics). Zero serving-layer divergence detected.
Determinism & stress
100% determinism across 20 runs per prompt at temperature=0. 100% output consistency at concurrency levels 1/2/4/8.
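Conceptually, the determinism check replays the same prompt at temperature=0 and verifies the completions are identical. A minimal sketch of that loop against any OpenAI-compatible server follows; the endpoint URL, model name, and request parameters are placeholders, and this is not infer-check's internal code.

```python
# Hedged sketch: replay one prompt 20 times at temperature=0 against an
# OpenAI-compatible /v1/completions endpoint and check for identical outputs.
# The URL, model name, and sampling parameters are placeholder assumptions.
import requests

URL = "http://localhost:8000/v1/completions"  # any OpenAI-compatible server
PROMPT = "What is 17 * 23? Answer with the number only."

completions = set()
for _ in range(20):
    resp = requests.post(URL, json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": PROMPT,
        "temperature": 0,
        "max_tokens": 32,
    }, timeout=60)
    resp.raise_for_status()
    completions.add(resp.json()["choices"][0]["text"])

print("deterministic" if len(completions) == 1 else f"{len(completions)} distinct outputs")
```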
Supported backends
| Backend | Type | Use case |
|---|---|---|
| mlx-lm | In-process | Local Apple Silicon inference with logprobs |
| llama-cpp | HTTP | llama-server via /completion endpoint |
| vllm-mlx | HTTP | Continuous batching on Apple Silicon |
| openai-compat | HTTP | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |
See the backends documentation for setup and configuration details.
Prompt suites
Six curated suites ship with the package — no need to clone the repo:
| Suite | Count | Purpose |
|---|---|---|
| reasoning | 50 | Multi-step math and logic |
| code | 49 | Python, JSON, SQL generation |
| adversarial-numerics | 30 | IEEE 754 edge cases, overflow, precision |
| long-context | 10 | Tables and transcripts with recall questions |
| quant-sensitive | 20 | Multi-digit arithmetic, long CoT, precise syntax |
| determinism | 50 | High-entropy continuations for determinism testing |
Custom suites are JSONL files with one object per line:
{"id": "custom-001", "text": "Your prompt here", "category": "math", "max_tokens": 512}
Roadmap
- GGUF backend (direct llama.cpp integration without HTTP)
- CUDA vLLM backend for GPU-based differential testing
- Logprobs-based divergence scoring where backends support it
- Automated regression CI mode (infer-check ci with pass/fail exit codes)
- Expanded prompt suites for tool use and multi-turn conversations
Requirements
- Python >= 3.11
- macOS with Apple Silicon (for mlx-lm backend) or Linux
- At least one backend installed
License
Apache 2.0