Correctness and reliability testing for LLM inference engines
infer-check
Catches the correctness bugs that benchmarks miss in LLM inference engines.
Quantization silently breaks arithmetic. Serving layers silently alter output. KV caches silently corrupt under load. Benchmarks like lm-evaluation-harness test whether models are smart — infer-check tests whether engines are correct.
The problem
Every LLM inference engine has correctness bugs that benchmarks don't catch:
- KV cache NaN pollution in vLLM-Ascend permanently corrupts all subsequent requests
- FP8 KV quantization in vLLM causes repeated garbage output
- FP8 DeepGEMM kernels in SGLang produce 32.5% element mismatches on Blackwell GPUs
- Batch-size-dependent output, where generated tokens change with the number of concurrent requests
These aren't model quality problems — they're engine correctness failures. infer-check is a CLI tool that runs differential tests across backends, quantization levels, and concurrency conditions to surface them automatically.
Installation
pip install infer-check
# With MLX backend support (Apple Silicon)
pip install "infer-check[mlx]"
Quick start
Compare two quantizations head-to-head:
infer-check compare \
mlx-community/Llama-3.1-8B-Instruct-4bit \
mlx-community/Llama-3.1-8B-Instruct-8bit \
--prompts adversarial-numerics
Run a full quantization sweep:
infer-check sweep \
--models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\
8bit=mlx-community/Meta-Llama-3.1-8B-Instruct-8bit,\
4bit=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" \
--prompts reasoning
Commands
| Command | Purpose | Docs |
|---|---|---|
| sweep | Compare pre-quantized models against a baseline | docs |
| compare | Head-to-head comparison of two models or quantizations | docs |
| diff | Compare outputs across different backends for the same model | docs |
| determinism | Test output reproducibility at temperature=0 | docs |
| stress | Test correctness under concurrent load | docs |
| report | Generate HTML/JSON reports from saved results | docs |
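Because every command is a plain CLI entry point, infer-check can also be driven from scripts. Below is a minimal sketch that shells out to the documented compare invocation from the quick start; treating a nonzero exit code as a failure signal is an assumption about the CLI's behavior (a dedicated ci mode with pass/fail exit codes is still on the roadmap).

```python
# Hedged sketch: drive infer-check from a script. Only the `compare` arguments
# documented in the quick start are used; interpreting a nonzero exit code as
# failure is an assumption, not documented behavior.
import subprocess

result = subprocess.run(
    [
        "infer-check", "compare",
        "mlx-community/Llama-3.1-8B-Instruct-4bit",
        "mlx-community/Llama-3.1-8B-Instruct-8bit",
        "--prompts", "adversarial-numerics",
    ],
    capture_output=True,
    text=True,
)
print(result.stdout)
if result.returncode != 0:
    raise SystemExit("infer-check compare reported a problem")
```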
Example results
Results from running infer-check on Llama-3.1-8B-Instruct on Apple Silicon using mlx-lm.
Quantization sweep
Sweep Summary
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ quant_level ┃ identical ┃ minor ┃ moderate ┃ severe ┃ mean_similarity ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ bf16 (self-check) │ 50/50 │ 0/50 │ 0/50 │ 0/50 │ 1.0000 │
│ 8bit │ 20/50 │ 9/50 │ 12/50 │ 9/50 │ 0.8067 │
│ 4bit │ 1/50 │ 3/50 │ 11/50 │ 35/50 │ 0.3837 │
└─────────────────────┴───────────┴───────┴──────────┴────────┴─────────────────┘
A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly.
Cross-backend diff
mlx-lm vs vllm-mlx at temperature=0: 50/50 identical (reasoning) and 30/30 identical (numerics). Zero serving-layer divergence detected.
Determinism & stress
100% determinism across 20 runs per prompt at temperature=0. 100% output consistency at concurrency levels 1/2/4/8.
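Conceptually, the determinism check replays the same prompt at temperature=0 and verifies the completions are identical. A minimal sketch of that loop against any OpenAI-compatible server follows; the endpoint URL, model name, and request parameters are placeholders, and this is not infer-check's internal code.

```python
# Hedged sketch: replay one prompt 20 times at temperature=0 against an
# OpenAI-compatible /v1/completions endpoint and check for identical outputs.
# The URL, model name, and sampling parameters are placeholder assumptions.
import requests

URL = "http://localhost:8000/v1/completions"  # any OpenAI-compatible server
PROMPT = "What is 17 * 23? Answer with the number only."

completions = set()
for _ in range(20):
    resp = requests.post(URL, json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": PROMPT,
        "temperature": 0,
        "max_tokens": 32,
    }, timeout=60)
    resp.raise_for_status()
    completions.add(resp.json()["choices"][0]["text"])

print("deterministic" if len(completions) == 1 else f"{len(completions)} distinct outputs")
```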
Supported backends
| Backend | Type | Use case |
|---|---|---|
| mlx-lm | In-process | Local Apple Silicon inference with logprobs |
| llama-cpp | HTTP | llama-server via /completion endpoint |
| vllm-mlx | HTTP | Continuous batching on Apple Silicon |
| openai-compat | HTTP | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |
See the backends documentation for setup and configuration details.
Prompt suites
Six curated suites ship with the package — no need to clone the repo:
| Suite | Count | Purpose |
|---|---|---|
| reasoning | 50 | Multi-step math and logic |
| code | 49 | Python, JSON, SQL generation |
| adversarial-numerics | 30 | IEEE 754 edge cases, overflow, precision |
| long-context | 10 | Tables and transcripts with recall questions |
| quant-sensitive | 20 | Multi-digit arithmetic, long CoT, precise syntax |
| determinism | 50 | High-entropy continuations for determinism testing |
Custom suites are JSONL files with one object per line:
{"id": "custom-001", "text": "Your prompt here", "category": "math", "max_tokens": 512}
Roadmap
- GGUF backend (direct llama.cpp integration without HTTP)
- CUDA vLLM backend for GPU-based differential testing
- Logprobs-based divergence scoring where backends support it
- Automated regression CI mode (infer-check ci with pass/fail exit codes)
- Expanded prompt suites for tool use and multi-turn conversations
Requirements
- Python >= 3.11
- macOS with Apple Silicon (for mlx-lm backend) or Linux
- At least one backend installed
License
Apache 2.0