infer-check
Correctness and reliability testing for LLM inference engines.
infer-check is a CLI tool that tests whether LLM inference backends produce correct, stable, and deterministic output. It catches the bugs that benchmarks miss — quantization-induced failures, cross-backend divergence, KV cache corruption under load, and non-determinism at temperature=0.
Key findings
Tested on Apple Silicon with Llama-3.1-8B-Instruct and Qwen3.5-4B (MoE), using the mlx-lm and vllm-mlx backends.
4-bit quantization degrades output in a task-dependent way; numerical tasks break worst:
Llama-3.1-8B: bf16 vs 4bit
| prompt suite | identical | severe | mean_similarity |
|---|---|---|---|
| adversarial-numerics | 0/30 | 23/30 | 0.311 |
| reasoning | 1/50 | 35/50 | 0.384 |
| code | 0/49 | 30/49 | 0.452 |
Dense and MoE architectures degrade similarly at 4-bit. Qwen3.5-4B (Gated Delta Networks + sparse MoE) shows 35/50 severe on reasoning — the same rate as dense Llama-3.1-8B.
vllm-mlx's serving layer is perfectly faithful. mlx-lm vs vllm-mlx at temperature=0 on Llama-3.1-8B-4bit: 50/50 identical (reasoning) and 30/30 identical (numerics). The serving layer introduces zero divergence.
Both engines are deterministic at temperature=0. Llama-3.1-8B-4bit and Qwen3.5-4B both produced bit-identical output on 50/50 prompts across 20 runs per prompt.
vllm-mlx handles concurrent load without corruption. Stress test at concurrency 1/2/4/8: zero errors, 100% output consistency at all levels.
Installation
pip install infer-check
# With MLX backend support (Apple Silicon)
pip install "infer-check[mlx]"
Usage
Quantization sweep
Compare pre-quantized models against a baseline. Each model is a separate HuggingFace repo.
infer-check sweep \
--models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\
8bit=mlx-community/Meta-Llama-3.1-8B-Instruct-8bit,\
4bit=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" \
--backend mlx-lm \
--prompts reasoning \
--output ./results/sweep/
--prompts accepts either a bundled suite name (reasoning, code, adversarial-numerics, determinism, long-context) or a path to any .jsonl file.
The baseline is automatically run twice as a self-check — if it's not 50/50 identical, your comparison data is unreliable.
Sweep Summary
| quant_level | identical | minor | moderate | severe | mean_similarity |
|---|---|---|---|---|---|
| bf16 (self-check) | 50/50 | 0/50 | 0/50 | 0/50 | 1.0000 |
| 8bit | 20/50 | 9/50 | 12/50 | 9/50 | 0.8067 |
| 4bit | 1/50 | 3/50 | 11/50 | 35/50 | 0.3837 |
Cross-backend diff
Same model, same quant, different inference paths. Catches serving-layer bugs.
# Start vllm-mlx in another terminal:
# vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000
infer-check diff \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--backends "mlx-lm,openai-compat" \
--base-urls ",http://localhost:8000" \
--prompts reasoning \
--output ./results/diff/
By default, diff uses /v1/chat/completions (--chat) so the server-side chat template matches the one applied by the local backend. Pass --no-chat to use the raw /v1/completions endpoint instead.
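To exercise the raw completions path, the same diff should work with --no-chat added (a sketch reusing the flags above; the output directory name is just illustrative):
infer-check diff \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--backends "mlx-lm,openai-compat" \
--base-urls ",http://localhost:8000" \
--prompts reasoning \
--no-chat \
--output ./results/diff-raw/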
Determinism
Same prompt N times at temperature=0. Output should be bit-identical every run.
infer-check determinism \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--backend mlx-lm \
--prompts determinism \
--runs 20 \
--output ./results/determinism/
Stress test
Concurrent requests through a serving backend. Tests KV cache correctness under load.
infer-check stress \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--backend openai-compat \
--base-url http://localhost:8000 \
--prompts reasoning \
--concurrency 1,2,4,8 \
--output ./results/stress/
Report
Generate an HTML report from all saved results.
infer-check report ./results/ --format html
Prompt suites
Curated prompts targeting known quantization failure modes:
| Suite | Count | Purpose |
|---|---|---|
| reasoning.jsonl | 50 | Multi-step math and logic |
| code.jsonl | 49 | Python, JSON, SQL generation |
| adversarial-numerics.jsonl | 30 | IEEE 754 edge cases, overflow, precision |
| long-context.jsonl | 10 | Tables and transcripts with recall questions |
| determinism.jsonl | 50 | High-entropy continuations for determinism testing |
All suites ship with the package — no need to clone the repo. Custom suites are JSONL files: {"id": "...", "text": "...", "category": "...", "max_tokens": N} per line.
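For example, a minimal custom suite might look like this (the ids, prompts, and category names are illustrative; the fields are exactly those listed above):
{"id": "arith-001", "text": "What is 17 * 23? Answer with the number only.", "category": "arithmetic", "max_tokens": 32}
{"id": "arith-002", "text": "Compute 0.1 + 0.2 to 20 decimal places.", "category": "arithmetic", "max_tokens": 64}
Save it as, say, my-suite.jsonl and pass the path directly: --prompts ./my-suite.jsonl.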
Supported backends
| Backend | Type | Use case |
|---|---|---|
| mlx-lm | In-process | Local Apple Silicon inference with logprobs |
| llama.cpp | HTTP | llama-server via /completion endpoint |
| vllm-mlx | HTTP | Continuous batching on Apple Silicon |
| openai-compat | HTTP | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |
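As a sketch of the llama.cpp path: the llama-server invocation below is standard llama.cpp usage, the GGUF filename and output directory are illustrative, and --base-url is assumed to apply to determinism the same way it does to the diff and stress commands above.
# Start llama-server in another terminal:
# llama-server -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --port 8080
infer-check determinism \
--model Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--backend llama.cpp \
--base-url http://localhost:8080 \
--prompts determinism \
--runs 20 \
--output ./results/determinism-llamacpp/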
Why this exists
Every LLM inference engine has correctness bugs that benchmarks don't catch:
- KV cache NaN pollution in vLLM-Ascend permanently corrupts all subsequent requests
- FP8 KV quantization in vLLM causes repeated garbage output
- 32.5% element mismatches in SGLang's FP8 DeepGEMM kernels on Blackwell GPUs
- Batch-size-dependent output where tokens change depending on concurrent request count
These aren't model quality problems — they're engine correctness failures. Benchmarks like lm-evaluation-harness test whether models are smart. infer-check tests whether engines are correct.
Requirements
- Python >= 3.11
- macOS with Apple Silicon (for mlx-lm backend) or Linux
- At least one backend installed
License
Apache 2.0