
Correctness and reliability testing for LLM inference engines

Project description

infer-check


Catches the correctness bugs that benchmarks miss in LLM inference engines.

Quantization silently breaks arithmetic. Serving layers silently alter output. KV caches silently corrupt under load. Benchmarks like lm-evaluation-harness test whether models are smart — infer-check tests whether engines are correct.

The problem

Every LLM inference engine has correctness bugs that benchmarks don't catch:

  • KV cache NaN pollution in vLLM-Ascend permanently corrupts all subsequent requests
  • FP8 KV quantization in vLLM causes repeated garbage output
  • 32.5% element mismatches in SGLang's FP8 DeepGEMM kernels on Blackwell GPUs
  • Batch-size-dependent output where tokens change depending on concurrent request count

These aren't model quality problems — they're engine correctness failures. infer-check is a CLI tool that runs differential tests across backends, quantization levels, and concurrency conditions to surface them automatically.

Example results

Results from running infer-check on Llama-3.1-8B-Instruct and Qwen3.5-4B (MoE) on Apple Silicon using mlx-lm and vllm-mlx. These demonstrate what the tool catches — not a comprehensive benchmark.

Quantization sweep

4-bit quantization on Llama-3.1-8B showed clear task-dependent degradation. Numerical tasks broke worst:

                       Llama-3.1-8B: bf16 vs 4bit
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ prompt suite          ┃ identical ┃ severe   ┃ mean_similarity ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ adversarial-numerics  │     0/30  │   23/30  │          0.311  │
│ reasoning             │     1/50  │   35/50  │          0.384  │
│ code                  │     0/49  │   30/49  │          0.452  │
└───────────────────────┴───────────┴──────────┴─────────────────┘

A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly. This pattern is consistent with published research on quantization-induced degradation, reproduced here on MLX's native quantization scheme.
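The divergence bucketing can be sketched as a similarity score with thresholds. This is an illustrative reconstruction, not infer-check's actual scoring code: the `SequenceMatcher` metric, the cutoff values, and the function name are all assumptions for the sake of the example.

```python
from difflib import SequenceMatcher

def classify_divergence(baseline: str, candidate: str) -> tuple[str, float]:
    """Bucket a baseline/candidate output pair by textual similarity.

    Thresholds and metric are illustrative; infer-check's real scoring
    may differ (e.g. it could use token-level or logprob comparison).
    """
    if baseline == candidate:
        return "identical", 1.0
    similarity = SequenceMatcher(None, baseline, candidate).ratio()
    if similarity >= 0.9:
        return "minor", similarity
    if similarity >= 0.6:
        return "moderate", similarity
    return "severe", similarity

assert classify_divergence("x", "x") == ("identical", 1.0)
label, sim = classify_divergence(
    "The answer is 408.", "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
)
assert label == "severe" and sim < 0.6
```

A purely textual score is a proxy: a numerically wrong answer can still be textually close to the baseline, which is why the severe bucket matters more than the raw similarity number.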

Dense vs. MoE comparison

Qwen3.5-4B (Gated Delta Networks + sparse MoE) showed similar degradation rates to dense Llama-3.1-8B in our testing — 35/50 severe on reasoning at 4-bit. Small sample, but the tool picks up the signal clearly on both architectures.

Cross-backend diff

mlx-lm vs vllm-mlx at temperature=0 on Llama-3.1-8B-4bit: 50/50 identical (reasoning) and 30/30 identical (numerics). In this test, the vllm-mlx serving layer introduced zero divergence — output differences in production would come from quantization, not from the serving layer itself.

Determinism

Llama-3.1-8B-4bit and Qwen3.5-4B both scored 50/50 identical across 20 runs per prompt on single-request mlx-lm inference at temperature=0.

Stress test

vllm-mlx at concurrency 1/2/4/8: zero errors, 100% output consistency at all levels. No KV cache corruption or batch-dependent divergence detected.

Installation

pip install infer-check

# With MLX backend support (Apple Silicon)
pip install "infer-check[mlx]"

Usage

Quantization sweep

Compare pre-quantized models against a baseline. Each model is a separate HuggingFace repo. Use --max-tokens to control generation length (defaults to 1024) and --num-prompts to limit the number of prompts used.

infer-check sweep \
  --models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\
            8bit=mlx-community/Meta-Llama-3.1-8B-Instruct-8bit,\
            4bit=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" \
  --backend mlx-lm \
  --prompts reasoning \
  --max-tokens 512 \
  --num-prompts 10 \
  --output ./results/sweep/

--prompts accepts either a bundled suite name (reasoning, code, adversarial-numerics, determinism, long-context, quant-sensitive) or a path to any .jsonl file.

The baseline is automatically run twice as a self-check — if it's not 50/50 identical, your comparison data is unreliable.

                                 Sweep Summary
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ quant_level         ┃ identical ┃ minor ┃ moderate ┃ severe ┃ mean_similarity ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ bf16 (self-check)   │     50/50 │  0/50 │     0/50 │   0/50 │          1.0000 │
│ 8bit                │     20/50 │  9/50 │    12/50 │   9/50 │          0.8067 │
│ 4bit                │      1/50 │  3/50 │    11/50 │  35/50 │          0.3837 │
└─────────────────────┴───────────┴───────┴──────────┴────────┴─────────────────┘

Cross-backend diff

Same model, same quant, different inference paths. Catches serving-layer bugs.

# Start vllm-mlx in another terminal:
# vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000

infer-check diff \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backends "mlx-lm,openai-compat" \
  --base-urls ",http://127.0.0.1:8000" \
  --prompts reasoning \
  --output ./results/diff/

Uses /v1/chat/completions by default (--chat) so server-side chat templates match the local backend. Pass --no-chat for raw /v1/completions.
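The two modes send differently shaped requests. A sketch of the payloads, following the standard OpenAI-compatible API; the payload construction here is illustrative, not infer-check's internals:

```python
import json

model = "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"
prompt = "What is 17 * 24?"

# --chat (default): POST /v1/chat/completions. The server applies its
# own chat template, matching what an in-process backend does locally.
chat_payload = {
    "model": model,
    "messages": [{"role": "user", "content": prompt}],
    "temperature": 0,
}

# --no-chat: POST /v1/completions with the raw prompt, no template.
raw_payload = {"model": model, "prompt": prompt, "temperature": 0}

print(json.dumps(chat_payload, indent=2))
```

If the server's chat template differs from the tokenizer's local template, the chat mode will surface that as divergence even when the model weights are identical.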

Determinism

Same prompt N times at temperature=0. Output should be bit-identical every run.

infer-check determinism \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backend mlx-lm \
  --prompts determinism \
  --runs 20 \
  --output ./results/determinism/
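The pass/fail criterion behind this check is simple bit-identity across runs. A minimal sketch (the function name and report shape are hypothetical, not infer-check's actual code):

```python
def check_determinism(outputs: list[str]) -> dict:
    """Summarize N runs of one prompt: deterministic only if every
    output is byte-for-byte identical."""
    unique = set(outputs)
    return {
        "runs": len(outputs),
        "unique_outputs": len(unique),
        "deterministic": len(unique) == 1,
    }

report = check_determinism(["The answer is 408."] * 20)
assert report == {"runs": 20, "unique_outputs": 1, "deterministic": True}
```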

Stress test

Concurrent requests through a serving backend. Tests KV cache correctness under load.

infer-check stress \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backend openai-compat \
  --base-url http://127.0.0.1:8000 \
  --prompts reasoning \
  --concurrency 1,2,4,8 \
  --output ./results/stress/
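The shape of such a stress run can be sketched with asyncio. Here `query` is a stand-in for the real HTTP call against `--base-url`, and the comparison logic is illustrative rather than infer-check's implementation; a real run would diff the concurrent outputs against a concurrency-1 baseline per prompt:

```python
import asyncio

async def query(prompt: str) -> str:
    """Placeholder for an HTTP request to the serving backend; swap in
    a real call to /v1/chat/completions for an actual test."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"echo: {prompt}"

async def stress(prompts: list[str], concurrency: int) -> list[str]:
    # Cap in-flight requests at the target concurrency level.
    sem = asyncio.Semaphore(concurrency)

    async def one(p: str) -> str:
        async with sem:
            return await query(p)

    # gather preserves input order, so results align with prompts.
    return await asyncio.gather(*(one(p) for p in prompts))

prompts = [f"prompt {i}" for i in range(8)]
baseline = asyncio.run(stress(prompts, concurrency=1))
loaded = asyncio.run(stress(prompts, concurrency=8))

# Batch-dependent divergence shows up as a mismatch between the runs:
assert baseline == loaded
```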

Report

Generate an HTML report from all saved results.

infer-check report ./results/ --format html

Prompt suites

Curated prompts targeting known quantization failure modes:

| Suite                      | Count | Purpose                                            |
|----------------------------|-------|----------------------------------------------------|
| reasoning.jsonl            | 50    | Multi-step math and logic                          |
| code.jsonl                 | 49    | Python, JSON, SQL generation                       |
| adversarial-numerics.jsonl | 30    | IEEE 754 edge cases, overflow, precision           |
| long-context.jsonl         | 10    | Tables and transcripts with recall questions       |
| quant-sensitive.jsonl      | 20    | Multi-digit arithmetic, long CoT, precise syntax   |
| determinism.jsonl          | 50    | High-entropy continuations for determinism testing |

All suites ship with the package — no need to clone the repo. Custom suites are JSONL files with one object per line (default max_tokens is 1024):

{"id": "custom-001", "text": "Your prompt here", "category": "math", "max_tokens": 512}
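A small custom suite can be built programmatically under the schema above. The file name and prompt contents here are examples, not bundled suites:

```python
import json
from pathlib import Path

suite = [
    {"id": "custom-001", "text": "Compute 355 / 113 to 6 decimal places.",
     "category": "math", "max_tokens": 128},
    {"id": "custom-002", "text": "Sum the integers from 1 to 1000.",
     "category": "math"},  # max_tokens falls back to the 1024 default
]

path = Path("my-suite.jsonl")
# JSONL: exactly one JSON object per line.
path.write_text("\n".join(json.dumps(p) for p in suite) + "\n")

# Round-trip check: every entry parses and has the required text field.
loaded = [json.loads(line) for line in path.read_text().splitlines()]
assert all("text" in p for p in loaded)
```

The resulting file is passed directly, e.g. `--prompts my-suite.jsonl`.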

Supported backends

| Backend       | Type       | Use case                                            |
|---------------|------------|-----------------------------------------------------|
| mlx-lm        | In-process | Local Apple Silicon inference with logprobs         |
| llama.cpp     | HTTP       | llama-server via /completion endpoint               |
| vllm-mlx      | HTTP       | Continuous batching on Apple Silicon                |
| openai-compat | HTTP       | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |

Roadmap

  • GGUF backend (direct llama.cpp integration without HTTP)
  • CUDA vLLM backend for GPU-based differential testing
  • Logprobs-based divergence scoring where backends support it
  • Automated regression CI mode (infer-check ci with pass/fail exit codes)
  • Expanded prompt suites for tool use and multi-turn conversations

Requirements

  • Python >= 3.11
  • macOS with Apple Silicon (for mlx-lm backend) or Linux
  • At least one backend installed

License

Apache 2.0

Download files

Download the file for your platform.

Source Distribution

infer_check-0.2.4.tar.gz (148.6 kB)


Built Distribution


infer_check-0.2.4-py3-none-any.whl (148.8 kB)


File details

Details for the file infer_check-0.2.4.tar.gz.

File metadata

  • Download URL: infer_check-0.2.4.tar.gz
  • Upload date:
  • Size: 148.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for infer_check-0.2.4.tar.gz
| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 01e2cd9268aac644bfca7cfc742730802b00620283f6d53209ff54670b3495a1 |
| MD5         | 14eedc28117fd51321ccf65e0ea3db57                                 |
| BLAKE2b-256 | fd22056369c3ab8c716d9ef5a18a15b674a82a43ad48b98ad695348884dd9df4 |


Provenance

The following attestation bundles were made for infer_check-0.2.4.tar.gz:

Publisher: release.yml on NullPointerDepressiveDisorder/infer-check

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file infer_check-0.2.4-py3-none-any.whl.

File metadata

  • Download URL: infer_check-0.2.4-py3-none-any.whl
  • Upload date:
  • Size: 148.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for infer_check-0.2.4-py3-none-any.whl
| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 3133d862ee7d1aeaf77c87978d3c93048f00c4f87c8e9ee5a0277b5d4a16d697 |
| MD5         | a40f5b51a47e933c8a7febf5e5718fb0                                 |
| BLAKE2b-256 | edc90b1d7fe1eea59fdb18748678ac9f04e7ccc3360238c73c8a9c5d16d9267c |


Provenance

The following attestation bundles were made for infer_check-0.2.4-py3-none-any.whl:

Publisher: release.yml on NullPointerDepressiveDisorder/infer-check

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
