
infer-check

Correctness and reliability testing for LLM inference engines


Catches the correctness bugs that benchmarks miss in LLM inference engines.

Quantization silently breaks arithmetic. Serving layers silently alter output. KV caches silently corrupt under load. Benchmarks like lm-evaluation-harness test whether models are smart — infer-check tests whether engines are correct.

Read the full documentation

The problem

Every LLM inference engine has correctness bugs that benchmarks don't catch:

  • KV cache NaN pollution in vLLM-Ascend permanently corrupts all subsequent requests
  • FP8 KV quantization in vLLM causes repeated garbage output
  • 32.5% element mismatches in SGLang's FP8 DeepGEMM kernels on Blackwell GPUs
  • Batch-size-dependent output where tokens change depending on concurrent request count

These aren't model quality problems — they're engine correctness failures. infer-check is a CLI tool that runs differential tests across backends, quantization levels, and concurrency conditions to surface them automatically.
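The core idea, differential testing at temperature=0, can be sketched in a few lines of Python. This is a minimal illustration, not infer-check's implementation: `generate_stub` stands in for real backend calls, and the canned outputs simulate a quantization error.

```python
import difflib

def generate_stub(backend: str, prompt: str) -> str:
    # Stand-in for a real backend call (e.g. an OpenAI-compatible
    # /v1/completions request at temperature=0).
    canned = {
        "bf16": "2 + 2 = 4",
        "4bit": "2 + 2 = 5",  # simulated quantization error
    }
    return canned[backend]

def diff_outputs(prompt: str, backend_a: str, backend_b: str) -> float:
    """Return a 0..1 similarity ratio between two backends' outputs."""
    out_a = generate_stub(backend_a, prompt)
    out_b = generate_stub(backend_b, prompt)
    return difflib.SequenceMatcher(None, out_a, out_b).ratio()

sim = diff_outputs("What is 2 + 2?", "bf16", "4bit")
print(f"similarity={sim:.2f}")  # any value below 1.0 flags a divergence
```

Because decoding is greedy at temperature=0, any similarity below 1.0 points at the engine stack rather than at sampling noise.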

Installation

pip install infer-check

# With MLX backend support (Apple Silicon)
pip install "infer-check[mlx]"

Quick start

Compare two quantizations head-to-head:

infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompts adversarial-numerics

Run a full quantization sweep:

infer-check sweep \
  --models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\
            8bit=mlx-community/Meta-Llama-3.1-8B-Instruct-8bit,\
            4bit=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" \
  --prompts reasoning

Commands

Command      Purpose
sweep        Compare pre-quantized models against a baseline
compare      Head-to-head comparison of two models or quantizations
diff         Compare outputs across different backends for the same model
determinism  Test output reproducibility at temperature=0
stress       Test correctness under concurrent load
report       Generate HTML/JSON reports from saved results

Example results

Results from running infer-check on Llama-3.1-8B-Instruct on Apple Silicon using mlx-lm.

Quantization sweep

                                 Sweep Summary
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ quant_level         ┃ identical ┃ minor ┃ moderate ┃ severe ┃ mean_similarity ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ bf16 (self-check)   │     50/50 │  0/50 │     0/50 │   0/50 │          1.0000 │
│ 8bit                │     20/50 │  9/50 │    12/50 │   9/50 │          0.8067 │
│ 4bit                │      1/50 │  3/50 │    11/50 │  35/50 │          0.3837 │
└─────────────────────┴───────────┴───────┴──────────┴────────┴─────────────────┘

A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly.
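Bucketing a 0..1 similarity score into the table's four severity classes can be sketched as follows. The thresholds here are illustrative assumptions, not infer-check's actual cutoffs:

```python
def classify_divergence(similarity: float) -> str:
    """Map a 0..1 similarity score to a severity bucket.

    Thresholds are illustrative, not infer-check's real cutoffs.
    """
    if similarity == 1.0:
        return "identical"
    if similarity >= 0.9:
        return "minor"
    if similarity >= 0.6:
        return "moderate"
    return "severe"

# The sweep's per-level mean similarities land in different buckets:
for score in (1.0000, 0.8067, 0.3837):
    print(score, classify_divergence(score))
```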

Cross-backend diff

mlx-lm vs vllm-mlx at temperature=0: 50/50 identical (reasoning) and 30/30 identical (numerics). Zero serving-layer divergence detected.

Determinism & stress

100% determinism across 20 runs per prompt at temperature=0. 100% output consistency at concurrency levels 1/2/4/8.
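A determinism check of this kind reduces to generating the same prompt repeatedly and asserting a single distinct output. A minimal sketch (the `generate` callable is a placeholder for any backend):

```python
def check_determinism(generate, prompt: str, runs: int = 20) -> bool:
    """True if a greedy (temperature=0) backend yields identical text
    on every run; a set with more than one element means nondeterminism."""
    outputs = {generate(prompt) for _ in range(runs)}
    return len(outputs) == 1

# With a deterministic stub, the check passes:
print(check_determinism(lambda p: p.upper(), "hello"))  # True
```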

Supported backends

Backend        Type        Use case
mlx-lm         In-process  Local Apple Silicon inference with logprobs
llama-cpp      HTTP        llama-server via /completion endpoint
vllm-mlx       HTTP        Continuous batching on Apple Silicon
openai-compat  HTTP        Any OpenAI-compatible server (vLLM, SGLang, Ollama)

See the backends documentation for setup and configuration details.
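For the openai-compat case, a greedy completion request needs nothing beyond the standard library. A sketch under stated assumptions: the base URL and model name are placeholders, and this is not infer-check's backend code, just the shape of an OpenAI-compatible /v1/completions call:

```python
import json
import urllib.request

def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 128) -> urllib.request.Request:
    """Build a greedy /v1/completions request for any OpenAI-compatible
    server (vLLM, SGLang, Ollama)."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "temperature": 0,  # greedy decoding for reproducible diffs
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def complete(req: urllib.request.Request) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```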

Prompt suites

Six curated suites ship with the package — no need to clone the repo:

Suite                 Count  Purpose
reasoning             50     Multi-step math and logic
code                  49     Python, JSON, SQL generation
adversarial-numerics  30     IEEE 754 edge cases, overflow, precision
long-context          10     Tables and transcripts with recall questions
quant-sensitive       20     Multi-digit arithmetic, long CoT, precise syntax
determinism           50     High-entropy continuations for determinism testing

Custom suites are JSONL files with one object per line:

{"id": "custom-001", "text": "Your prompt here", "category": "math", "max_tokens": 512}
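Loading and validating such a suite is straightforward. A minimal sketch: the required-key set (`id`, `text`) is inferred from the example line above, not taken from infer-check's actual schema, so check the docs before relying on it:

```python
import json

REQUIRED_KEYS = {"id", "text"}  # assumed from the example; see the docs

def load_suite(path: str) -> list[dict]:
    """Load a JSONL prompt suite: one JSON object per non-blank line."""
    prompts = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            missing = REQUIRED_KEYS - obj.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {missing}")
            prompts.append(obj)
    return prompts
```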

Roadmap

  • GGUF backend (direct llama.cpp integration without HTTP)
  • CUDA vLLM backend for GPU-based differential testing
  • Logprobs-based divergence scoring where backends support it
  • Automated regression CI mode (infer-check ci with pass/fail exit codes)
  • Expanded prompt suites for tool use and multi-turn conversations

Requirements

  • Python >= 3.11
  • macOS with Apple Silicon (for mlx-lm backend) or Linux
  • At least one backend installed

License

Apache 2.0
