Measure MLX quantization quality loss — KL divergence, perplexity, top-token agreement for KV cache and weights

Project description

mlx-quant-fidelity

Measure how much quality a quantization costs on Apple Silicon. mlx-quant-fidelity scores a quantized KV cache against full precision on the same model and reports the drift as numbers you can act on: KL divergence, top-token flip rate, perplexity delta. No more choosing a bit-width by file size.

The CUDA/GGUF world has had this for years: llama.cpp's --kl-divergence-base, EleutherAI's lm-evaluation-harness. MLX had nothing. This is the MLX version, and it covers the KV-cache and attention angle those tools skip.

Version 0.1.0 measures KV-cache quantization. Weight-quantization fidelity is next; see the roadmap.

Install

pip install mlx-quant-fidelity

Requires Apple Silicon (MLX) and Python 3.11+.

Use it

mlx-quant-fidelity kv mlx-community/Llama-3.2-3B-Instruct-4bit --kv-bits 8

Prints a Markdown report. Add --format json for JSON output, --kv-bits 4 or --kv-group-size 64 to change the quantizer settings, or --max-chunks N to bound the corpus.

from mlx_quant_fidelity import measure_kv_fidelity

report = measure_kv_fidelity("mlx-community/Llama-3.2-3B-Instruct-4bit", kv_bits=8)
print(report.kl.mean, report.flip_rate, report.verdict)
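
If the report feeds an automated check, the fields printed above can be turned into a pass/fail gate. The following is a minimal sketch: `within_budget` and its thresholds are hypothetical and not part of mlx-quant-fidelity. It assumes only the `kl.mean` and `flip_rate` attributes used above, and it runs on a stand-in object so it does not need MLX.

```python
from types import SimpleNamespace


def within_budget(report, max_kl_mean=0.001, max_flip_rate=0.01):
    """Return True when both drift metrics are inside the caller's budget.

    The thresholds are illustrative defaults, not the tool's verdict rules.
    """
    return report.kl.mean <= max_kl_mean and report.flip_rate <= max_flip_rate


# Stand-in for a real report, filled with the 3B / 8-bit numbers
# from the sample report in this README.
fake_report = SimpleNamespace(kl=SimpleNamespace(mean=0.0002), flip_rate=0.0065)
print(within_budget(fake_report))  # True
```

In real use, the object returned by `measure_kv_fidelity` would take the place of `fake_report`.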

What a report looks like

# KV-fidelity: `mlx-community/Llama-3.2-3B-Instruct-4bit` @ 8-bit (group 64)

**Verdict:** good · **mode:** stress (quantize_start=0)

| metric | value |
|---|---|
| KL mean | 0.0002 nats |
| KL median | 0.0001 nats |
| KL p99 | 0.0015 nats |
| KL max | 0.1129 nats |
| flip rate | 0.0065 |
| perplexity Δ | +0.0054 (17.722 → 17.728) |

Measured on **wikitext-2-raw/test**, 51100 positions across 100 chunks of length 512 ...
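
The position count in the sample follows from the chunking, assuming each chunk of length T yields T - 1 next-token predictions because the last token has no successor inside the chunk. That reading is an assumption, and `scored_positions` is a hypothetical helper, but the arithmetic matches the figure in the report:

```python
def scored_positions(num_chunks: int, chunk_length: int) -> int:
    # Each chunk contributes one prediction per token except the last.
    return num_chunks * (chunk_length - 1)


print(scored_positions(100, 512))  # 51100
```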

How much does KV quantization cost?

M1 Max, WikiText-2 test (100 chunks of 512 tokens), stress mode (quantize from token 0). Reproduce any row with mlx-quant-fidelity kv <model> --kv-bits <bits> --max-chunks 100; the full committed reports are under _artifacts/samples/.

| Model | KV bits | KL mean (nats) | flip rate | verdict |
|---|---|---|---|---|
| Llama-3.2-1B | 4 | 0.148 | 0.20 | bad |
| Llama-3.2-1B | 8 | 0.0004 | 0.013 | marginal |
| Llama-3.2-3B | 4 | 0.051 | 0.11 | bad |
| Llama-3.2-3B | 8 | 0.0002 | 0.007 | good |
| Qwen2.5-7B | 4 | 9.36 | 0.99 | bad |
| Qwen2.5-7B | 8 | 0.009 | 0.032 | marginal |

8-bit KV is near-lossless on all three models: mean KL stays at or below 0.009 nats, although Llama-3.2-1B and Qwen2.5-7B still earn a marginal verdict, with flip rates of 1.3% and 3.2%. 4-bit is another matter, and Qwen2.5-7B at 4-bit in stress mode falls apart: nearly every token flips (flip rate 0.99). That is the attention sink at work: stress mode quantizes the cache from token 0, including the first tokens attention leans on most, and Qwen2.5 does not tolerate it. mlx-lm's own default keeps the first 5000 tokens full-precision for exactly this reason. The point of the tool is that you can see this for your model before you pick a bit-width.
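
One way to build intuition for the gap between 8 and 4 bits is to look at the quantizer's step size. The sketch below is a simplified per-group affine quantizer in plain Python. It illustrates the general scheme under our own assumptions and is not MLX's kernel; `quantize_group` and `max_abs_error` are hypothetical helpers. With b bits a group has 2^b - 1 steps across its range, so the worst-case rounding error is 17 times larger at 4 bits than at 8 (255 / 15).

```python
import random


def quantize_group(values, bits):
    """Affine round-trip of one group: map to integers in [0, 2**bits - 1] and back."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / scale) * scale + lo for v in values]


def max_abs_error(values, bits):
    """Largest absolute difference between a value and its round-tripped copy."""
    restored = quantize_group(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


random.seed(0)
group = [random.gauss(0.0, 1.0) for _ in range(64)]  # one group of 64 entries
err4 = max_abs_error(group, 4)
err8 = max_abs_error(group, 8)
print(f"4-bit max error: {err4:.4f}")
print(f"8-bit max error: {err8:.4f}")
```

Because the step is proportional to the group's range, a single large entry widens every step in its group. That is one plausible reason the first tokens are fragile, though this sketch does not test it.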

How it works

Teacher-forced scoring, not generation. For each fixed-length corpus chunk the model runs twice on the same tokens — once with a full-precision KV cache, once with a quantized one — and the two next-token distributions are compared position by position. Generation would let the runs diverge in their own inputs the moment quantization changed a sampled token, turning the measurement into trajectory drift instead of cache cost. Logits collapse to per-position scalars inside the chunk loop and are released before the next chunk, so a long corpus never holds full distributions in memory.
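
The per-position comparison can be sketched in a few lines of plain Python. This is a minimal illustration, not the package's implementation: the function names are hypothetical, the logits are toy values, and taking the full-precision run as the reference distribution in KL(reference || quantized) is an assumption.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_nats(ref_logits, test_logits):
    """KL(ref || test) in nats for one position."""
    ref_lp = log_softmax(ref_logits)
    test_lp = log_softmax(test_logits)
    return sum(math.exp(r) * (r - t) for r, t in zip(ref_lp, test_lp))


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


def score_chunk(ref_rows, test_rows):
    """Collapse two runs over the same tokens into per-position scalars."""
    kls = [kl_nats(r, t) for r, t in zip(ref_rows, test_rows)]
    flips = [argmax(r) != argmax(t) for r, t in zip(ref_rows, test_rows)]
    return kls, flips


# Toy logits over a 3-token vocabulary at two positions.
full_precision = [[2.0, 1.0, 0.0], [0.1, 0.0, -1.0]]
quantized = [[2.0, 1.0, 0.0], [0.0, 0.1, -1.0]]
kls, flips = score_chunk(full_precision, quantized)
print(kls, flips)
```

Only `kls` and `flips` survive the chunk. The full logit rows can be dropped as soon as `score_chunk` returns, which is the memory property described above.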

Two modes:

  • stress (--quantize-start 0, the default): quantize from token 0. The harsh, apples-to-apples quantizer test.
  • deployment (quantize_start > 0): what mlx-lm users actually run, with the first N tokens kept full-precision. Planned for 0.2.0.

A run that returns exactly zero drift raises an error instead of reporting a silent "perfect fidelity." Zero drift almost always means quantization never engaged, not that it was free.
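
A guard of this kind might look like the sketch below. `ZeroDriftError` and `check_drift` are hypothetical names chosen for illustration; the exception type the package actually raises is not documented here.

```python
class ZeroDriftError(RuntimeError):
    """Hypothetical error type; the library's actual exception may differ."""


def check_drift(kl_values):
    # Exactly zero at every position means both runs produced identical
    # distributions, which suggests the quantized path was never taken.
    if all(v == 0.0 for v in kl_values):
        raise ZeroDriftError("zero drift: was the KV cache actually quantized?")
    return kl_values
```

Failing loudly here keeps a misconfigured run from being recorded as a perfect score.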

What the numbers don't say

  • A fidelity number is corpus- and context-length-specific. WikiText-2 at temperature 0 measures short-prose distributional drift; the paper this builds on, Accuracy Is Not All You Need, shows that such a measurement under-predicts task-specific and long-context degradation. Every report records the corpus and the token count so the number is never read as a bare score.
  • Perplexity delta is reported for continuity with llama.cpp. It correlates with mean KLD by construction, so treat it as a familiar restatement, not independent corroboration.
  • The measured drift bundles the quantizer's error with the quantized-attention kernel's numerics. That is the real end-to-end cost; a quantizer-only control is on the roadmap.
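
For reference, perplexity is the exponential of the mean negative log-likelihood of the true next tokens, and the delta is the quantized value minus the full-precision one. The sketch below uses made-up log-probabilities, and `perplexity` is a hypothetical helper rather than the package's API.

```python
import math


def perplexity(target_log_probs):
    """exp of the mean negative log-likelihood of the true next tokens."""
    return math.exp(-sum(target_log_probs) / len(target_log_probs))


# Hypothetical log-probabilities assigned to the true next token.
full_precision_lp = [-2.0, -3.0, -4.0]
quantized_lp = [-2.1, -3.0, -4.2]
delta = perplexity(quantized_lp) - perplexity(full_precision_lp)
print(f"perplexity delta: {delta:+.4f}")
```

Perplexity reads only the probability of the observed token at each position, while KL compares the whole distribution. Both come from the same pair of runs, so they tend to move together.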

Status

0.1.0, released on PyPI as mlx-quant-fidelity — the KV-cache fidelity probe (CLI + Python API, JSON and Markdown reports). Weight-quantization fidelity, downstream-task accuracy, and memory-normalized method ranking are on the roadmap.

License

Apache-2.0.

Download files

Source Distribution

mlx_quant_fidelity-0.1.0.tar.gz (114.5 kB view details)

Uploaded Source

Built Distribution

mlx_quant_fidelity-0.1.0-py3-none-any.whl (23.6 kB view details)

Uploaded Python 3

File details

Details for the file mlx_quant_fidelity-0.1.0.tar.gz.

File metadata

  • Download URL: mlx_quant_fidelity-0.1.0.tar.gz
  • Size: 114.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mlx_quant_fidelity-0.1.0.tar.gz
Algorithm Hash digest
SHA256 64462b1995692714d6af0237668768b716424ecca3377491608524fc8f34f2ba
MD5 a01fc63ae8c9035be683fe65f6de72ed
BLAKE2b-256 53927f5c52c83088c7fddd8c411a9dcb28702e4d3491c75158b72c2868584097

Provenance

The following attestation bundles were made for mlx_quant_fidelity-0.1.0.tar.gz:

Publisher: release.yml on IonDen/mlx-quant-fidelity

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mlx_quant_fidelity-0.1.0-py3-none-any.whl.

File hashes

Hashes for mlx_quant_fidelity-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 cf4df8231da7e392413733b8ac56481825f82909cd83d7bac6c93d06c15790ab
MD5 b60f7e2998087609638a724ca7c8edf2
BLAKE2b-256 a2a3f3e89f64227745c946c032bba415a8567e6b0b0ed5491451e0606d86e6a3

Provenance

The following attestation bundles were made for mlx_quant_fidelity-0.1.0-py3-none-any.whl:

Publisher: release.yml on IonDen/mlx-quant-fidelity
