Measure MLX quantization quality loss — KL divergence, perplexity, top-token agreement for KV cache and weights

mlx-quant-fidelity

Measure how much quality a quantization costs on Apple Silicon. mlx-quant-fidelity scores a quantized model against a higher-precision reference on the same corpus and reports the drift as numbers you can act on: KL divergence, top-token flip rate, perplexity delta. It measures both KV-cache quantization and weight quantization. No more choosing a bit-width by file size.

The CUDA/GGUF world has had this for years: llama.cpp's --kl-divergence-base, EleutherAI's lm-evaluation-harness. MLX had nothing. This is the MLX version, and it covers the KV-cache and attention angle those tools skip.

Install

pip install mlx-quant-fidelity

Apple Silicon (MLX), Python 3.11+.

Use it

mlx-quant-fidelity kv mlx-community/Llama-3.2-3B-Instruct-4bit --kv-bits 8

Prints a Markdown report. Use --format json for JSON or --format badge for a shields.io badge line. --kv-bits and --kv-group-size set the cache quantization (for example --kv-bits 4 --kv-group-size 64), and --max-chunks N bounds the corpus.

from mlx_quant_fidelity import measure_kv_fidelity

report = measure_kv_fidelity("mlx-community/Llama-3.2-3B-Instruct-4bit", kv_bits=8)
print(report.kl.mean, report.flip_rate, report.verdict)

Or measure weight quantization — a quantized repo against a higher-precision reference:

mlx-quant-fidelity weights mlx-community/Llama-3.2-3B-Instruct-4bit --reference mlx-community/Llama-3.2-3B-Instruct-bf16

from mlx_quant_fidelity import measure_weight_fidelity

# measure_weight_fidelity(quantized_repo, reference_repo)
report = measure_weight_fidelity(
    "mlx-community/Llama-3.2-3B-Instruct-4bit",  # quantized
    "mlx-community/Llama-3.2-3B-Instruct-bf16",  # reference
)
print(report.kl.mean, report.flip_rate, report.verdict)
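
The three printed fields are enough to gate a script on your own drift budget. The sketch below is not part of the package: `within_budget` and the limits passed to it are hypothetical, and a `SimpleNamespace` stands in for a real report so the snippet runs without MLX. The stand-in values are copied from the Llama-3.2-3B 8-bit KV row in the results table.

```python
from types import SimpleNamespace


def within_budget(report, max_kl_mean, max_flip_rate):
    """Return True when both drift metrics stay at or under the caller's limits."""
    return report.kl.mean <= max_kl_mean and report.flip_rate <= max_flip_rate


# Stand-in exposing the same attributes the examples above print.
fake_report = SimpleNamespace(
    kl=SimpleNamespace(mean=0.0002),
    flip_rate=0.007,
    verdict="good",
)

# The limits here are arbitrary illustration values, not the tool's thresholds.
print(within_budget(fake_report, max_kl_mean=0.001, max_flip_rate=0.01))
print(fake_report.verdict)
```

With a real run, the object returned by measure_kv_fidelity or measure_weight_fidelity would take the place of `fake_report`.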

What a report looks like

# KV-fidelity: `mlx-community/Llama-3.2-3B-Instruct-4bit` @ 8-bit (group 64)

**Verdict:** good · **mode:** stress (quantize_start=0)

| metric | value |
|---|---|
| KL mean | 0.0002 nats |
| KL median | 0.0001 nats |
| KL p99 | 0.0015 nats |
| KL max | 0.1129 nats |
| flip rate | 0.0065 |
| perplexity Δ | +0.0054 (17.722 → 17.728) |

Measured on **wikitext-2-raw/test**, 51100 positions across 100 chunks of length 512 ...

Badge output

--format badge prints a single shields.io Markdown line instead of the full report:

mlx-quant-fidelity kv mlx-community/Llama-3.2-3B-Instruct-4bit --kv-bits 8 --format badge

Output:

![KV fidelity](https://img.shields.io/badge/KV_fidelity-good_%C2%B7_8--bit_%C2%B7_wikitext--2--raw%2F512_%C2%B7_stress-brightgreen)

Green for good, yellow for marginal, red for bad. The badge message includes the bit width, corpus, chunk length, and mode so badges from different configurations are distinguishable. Threshold values and the color map are in docs/threshold-policy.md.
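
For readers decoding the badge URL: shields.io static badges use `_` for a space and `--` for a literal dash, and the rest is percent-encoded. The sketch below rebuilds the sample line from its parts. It is a reconstruction from the output shown above, not the package's code; `escape_segment` and `badge_markdown` are hypothetical names, and only the `brightgreen` color is confirmed by the sample (the other two color names are assumptions).

```python
from urllib.parse import quote

# Only "brightgreen" appears in the sample output; the other two are assumed.
COLORS = {"good": "brightgreen", "marginal": "yellow", "bad": "red"}


def escape_segment(text):
    """Apply shields.io static-badge escaping, then percent-encode the rest."""
    text = text.replace("_", "__").replace("-", "--").replace(" ", "_")
    return quote(text, safe="")


def badge_markdown(verdict, bits, corpus, chunk_len, mode):
    """Build a Markdown image line with verdict, bit width, corpus, chunk length, mode."""
    message = f"{verdict} · {bits}-bit · {corpus}/{chunk_len} · {mode}"
    label = escape_segment("KV fidelity")
    url = (
        "https://img.shields.io/badge/"
        f"{label}-{escape_segment(message)}-{COLORS[verdict]}"
    )
    return f"![KV fidelity]({url})"


print(badge_markdown("good", 8, "wikitext-2-raw", 512, "stress"))
```

Called with the values from the sample run, this prints the same line as the Output above.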

How much does KV quantization cost?

M1 Max, WikiText-2 test (100 chunks of 512 tokens), stress mode (quantize from token 0). Reproduce any row with mlx-quant-fidelity kv <model> --kv-bits <bits> --max-chunks 100; the full committed reports are under _artifacts/samples/.

Model	KV bits	KL mean (nats)	flip rate	verdict
Llama-3.2-1B	4	0.148	0.20	bad
Llama-3.2-1B	8	0.0004	0.013	marginal
Llama-3.2-3B	4	0.051	0.11	bad
Llama-3.2-3B	8	0.0002	0.007	good
Qwen2.5-7B	4	9.36	0.99	bad
Qwen2.5-7B	8	0.009	0.032	marginal

8-bit KV is near-lossless on all three models. 4-bit is another matter, and Qwen2.5-7B at 4-bit in stress mode falls apart: nearly every token flips. That is the attention sink at work: stress mode quantizes the cache from token 0, including the first tokens attention leans on most, and Qwen2.5 does not tolerate it. mlx-lm's own default keeps the first 5000 tokens full-precision for exactly this reason. Run the tool first and you see it coming.

How much does weight quantization cost?

Same corpus and recipe, but the comparison is now a quantized model repo against a higher-precision reference repo. Reproduce any row with mlx-quant-fidelity weights <quant> --reference <reference> --max-chunks 100; the committed reports are under _artifacts/samples/weights/.

Model	quant	reference	KL mean (nats)	flip rate	perplexity Δ	verdict
Llama-3.2-1B	4-bit	bf16	0.158	0.21	+3.5	marginal
Llama-3.2-1B	8-bit	bf16	0.001	0.023	−0.01	good
Llama-3.2-3B	4-bit	bf16	0.085	0.15	+1.4	marginal
Llama-3.2-3B	8-bit	bf16	0.0009	0.021	0.00	good
Qwen2.5-7B	4-bit	8-bit	0.109	0.16	+0.9	marginal

8-bit weights are near-lossless: about 2% of top tokens flip and perplexity barely moves. 4-bit is a real trade: 15 to 21% of top tokens flip and perplexity climbs by roughly 0.9 to 3.5, worst on the small 1B model. The Qwen row compares 4-bit against 8-bit rather than bf16, so its drift is relative to an already-quantized reference, not full precision; the report records that the reference is 8-bit and says so in plain text. The verdict tiers are provisional, anchored to these q8 and q4 reference points on short prose rather than to downstream task accuracy.

Unlike the KV probe, both runs use standard attention, so the drift is the deployed quantized model's weight-quant cost with no quantized-attention kernel folded in. It does still include the quantized-matmul kernel's numerics, which is exactly what you run when you load the model.

Comparing quantizations

compare ranks a set of quantizations on a memory-normalized Pareto frontier: quality (mean KL divergence) on one axis, memory cost on the other. It identifies any configuration that is both worse quality and more expensive than another option on the list — those are dominated and you would never choose them.

# rank weight quantizations against an fp16 reference
mlx-quant-fidelity compare weights q4 q6 q8 --reference fp16

# rank KV configs on a single model
mlx-quant-fidelity compare kv <model> --configs 4:32,4:64,8:64

Add --max-kld 0.05 to get the cheapest configuration whose mean KLD stays under a threshold, or --min-tier good to get the cheapest one that passes the good-tier verdict. docs/ranking-principles.md explains how each axis is computed, what Pareto domination means in practice, and where the ranking has limits.
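
To make domination concrete, here is a minimal sketch of a two-axis Pareto filter and a threshold pick. It is not the tool's ranking code (docs/ranking-principles.md describes that). The dictionary keys, function names, and the four configurations are hypothetical, and the numbers are made up for illustration. It uses the usual weak form of domination: no worse on both axes and strictly better on at least one.

```python
def dominated(a, b):
    """True when b is no worse than a on both axes and strictly better on one."""
    no_worse = b["kld"] <= a["kld"] and b["mem"] <= a["mem"]
    better = b["kld"] < a["kld"] or b["mem"] < a["mem"]
    return no_worse and better


def pareto_front(configs):
    """Keep every configuration that no other configuration dominates."""
    return [a for a in configs if not any(dominated(a, b) for b in configs)]


def cheapest_under(configs, max_kld):
    """Lowest-memory configuration whose mean KLD is below max_kld, else None."""
    ok = [c for c in configs if c["kld"] < max_kld]
    return min(ok, key=lambda c: c["mem"]) if ok else None


# Hypothetical numbers for illustration only, not measurements.
configs = [
    {"name": "A", "kld": 0.120, "mem": 1.0},
    {"name": "B", "kld": 0.060, "mem": 1.5},
    {"name": "C", "kld": 0.080, "mem": 1.8},  # worse and costlier than B
    {"name": "D", "kld": 0.001, "mem": 2.0},
]

print([c["name"] for c in pareto_front(configs)])
print(cheapest_under(configs, max_kld=0.05)["name"])
```

C drops out because B beats it on both axes. A, B, and D remain as real trade-offs, and the 0.05 threshold then selects D because B's KLD is just over the limit.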

How it works

Teacher-forced scoring, not generation. For each fixed-length corpus chunk the model runs twice on the same tokens — once with a full-precision KV cache, once with a quantized one — and the two next-token distributions are compared position by position. Generation would let the runs diverge in their own inputs the moment quantization changed a sampled token, turning the measurement into trajectory drift instead of cache cost. Logits collapse to per-position scalars inside the chunk loop and are released before the next chunk, so a long corpus never holds full distributions in memory.
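
The per-position comparison can be sketched in plain Python. This is an illustration of the idea, not the package's implementation: lists of floats stand in for MLX arrays, the function names are hypothetical, and the KL direction (reference to quantized) is an assumption.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def position_drift(ref_logits, quant_logits):
    """Return (KL(ref || quant) in nats, whether the top token flipped)."""
    ref_lp = log_softmax(ref_logits)
    q_lp = log_softmax(quant_logits)
    kl = sum(math.exp(r) * (r - q) for r, q in zip(ref_lp, q_lp))
    flipped = ref_lp.index(max(ref_lp)) != q_lp.index(max(q_lp))
    return kl, flipped


def chunk_drift(ref_chunk, quant_chunk):
    """Collapse one chunk to two scalars: mean KL and flip rate."""
    pairs = [position_drift(r, q) for r, q in zip(ref_chunk, quant_chunk)]
    mean_kl = sum(kl for kl, _ in pairs) / len(pairs)
    flip_rate = sum(flipped for _, flipped in pairs) / len(pairs)
    return mean_kl, flip_rate


# Two positions, three-token vocabulary. Both runs see the same input tokens;
# only the logits differ. The second position swaps the top two tokens.
ref_chunk = [[2.0, 1.0, 0.0], [0.5, 0.4, 0.0]]
quant_chunk = [[2.0, 1.0, 0.0], [0.4, 0.5, 0.0]]
mean_kl, flip_rate = chunk_drift(ref_chunk, quant_chunk)
print(mean_kl > 0, flip_rate)
```

Because only scalars leave `position_drift`, nothing of vocabulary size needs to survive past the chunk that produced it.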

Two modes:

stress (--quantize-start 0, the default): quantize from token 0. The harsh, apples-to-apples quantizer test.
deployment (--quantize-start N): keeps the first N positions full-precision, matching mlx-lm's --quantized-kv-start behavior. Metrics cover only the post-boundary region; per-token drift there is close to stress mode because those positions attend through an already-quantized cache. docs/measurement-principles.md has the details and why deployment numbers are not a long-context real-deployment average.

A run that returns exactly zero drift raises instead of reporting a silent "perfect fidelity." That almost always means quantization never engaged, not that it was free.
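
Both behaviours, scoring only past the boundary and refusing an all-zero result, fit in a few lines. This is a sketch under assumptions: `summarize` and `ZeroDriftError` are hypothetical names (the package's real exception type is not stated here), and whether position N itself counts as post-boundary is assumed.

```python
class ZeroDriftError(RuntimeError):
    """Hypothetical error type used only for this illustration."""


def summarize(per_position_kl, quantize_start=0):
    """Mean KL over positions at or after quantize_start, with an exact-zero guard."""
    scored = per_position_kl[quantize_start:]
    if not scored:
        raise ValueError("quantize_start leaves no positions to score")
    if all(kl == 0.0 for kl in scored):
        # Exactly zero everywhere most likely means the quantized path never ran.
        raise ZeroDriftError("exactly zero drift: quantization probably never engaged")
    return sum(scored) / len(scored)


# Deployment-style run: the first two positions are full-precision and excluded.
print(summarize([0.0, 0.0, 0.2, 0.4], quantize_start=2))

# An all-zero run raises instead of reporting perfect fidelity.
try:
    summarize([0.0, 0.0, 0.0])
except ZeroDriftError as err:
    print("refused:", err)
```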

The weight probe works the same way with two models instead of two caches: a quantized repo and a reference repo, scored on the same corpus tokens. A compatibility gate refuses a mismatched pair before loading, and a memory pre-flight refuses a pair too large for the device rather than risking a kernel panic.

See docs/measurement-principles.md for the zero-probability policy, the exact-zero guard, and how perplexity delta relates to mean KLD.

What the numbers don't say

A fidelity number is corpus- and context-length-specific. WikiText-2 at temperature 0 measures short-prose distributional drift; the paper this builds on, Accuracy Is Not All You Need, shows that this kind of measurement under-predicts task-specific and long-context degradation. Every report records the corpus and the token count so the number is never read as a bare score.
Perplexity delta is reported for continuity with llama.cpp. It is related to but distinct from mean KLD — it scores the realized next token and can diverge from full-vocabulary drift — so it is not independent corroboration.
The measured drift bundles the quantizer's error with the quantized-attention kernel's numerics. That is the real end-to-end cost; a quantizer-only control is on the roadmap.
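
A toy case shows why perplexity delta and mean KLD can disagree. In this sketch the quantized distribution happens to put more mass on the realized token while reshuffling the rest of the vocabulary, so perplexity improves even though the full distribution has drifted. The probabilities are invented for illustration, the helper names are hypothetical, and the KL direction (reference to quantized) is an assumption.

```python
import math


def perplexity(probs_of_realized):
    """exp of the mean negative log-probability assigned to the realized tokens."""
    nll = -sum(math.log(p) for p in probs_of_realized) / len(probs_of_realized)
    return math.exp(nll)


def kl(p, q):
    """KL(p || q) in nats; terms with p == 0 contribute nothing, q == 0 is not handled."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# One position, three-token vocabulary, realized next token is index 0.
ref = [0.50, 0.30, 0.20]
quant = [0.55, 0.15, 0.30]  # more mass on the realized token, reshuffled elsewhere

ppl_delta = perplexity([quant[0]]) - perplexity([ref[0]])
print(ppl_delta < 0, kl(ref, quant) > 0)
```

That is one way a row can show a small negative perplexity Δ next to a positive mean KL.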

Status

0.4.0, released on PyPI as mlx-quant-fidelity — adds deployment mode (--quantize-start), a shareable fidelity badge (--format badge), and isolation of a malformed cached comparison partial. 0.3.x added the compare command for memory-normalized Pareto ranking of KV-cache and weight quantizations and hardened its error handling. Downstream-task accuracy and more are on the roadmap.

License

Apache-2.0.

Sister projects

Other MLX libraries for Apple Silicon:

mlx-taef — tiny autoencoders for fast diffusion-latent previews and low-memory decode (FLUX / SD).
mlx-teacache — TeaCache residual caching to skip redundant FLUX denoising steps.
mlx-model-doctor — validate an MLX / Hugging Face model repo before you load it (config, tokenizer, safetensors, memory).

Project details

Maintainers

IonDen

Release history

0.4.0 (this version): Jul 4, 2026
0.3.1: Jun 23, 2026
0.3.0: Jun 18, 2026
0.2.0: Jun 15, 2026
0.1.0: Jun 14, 2026

Download files

Source Distribution

mlx_quant_fidelity-0.4.0.tar.gz (822.1 kB)

Uploaded Jul 4, 2026 Source

Built Distribution

mlx_quant_fidelity-0.4.0-py3-none-any.whl (47.7 kB)

Uploaded Jul 4, 2026 Python 3

File details

Details for the file mlx_quant_fidelity-0.4.0.tar.gz.

File metadata

Download URL: mlx_quant_fidelity-0.4.0.tar.gz
Upload date: Jul 4, 2026
Size: 822.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mlx_quant_fidelity-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`e2750feb9c9f72d458a5d3e4f8ee621b8b88e5f384593297e8073b186dc43606`
MD5	`5693fb0c09dfba072f15c864763f6e5a`
BLAKE2b-256	`b946ec763808dd576a1d047e173d44ca54837809588ea706ab09757cb1d8e996`

Provenance

The following attestation bundles were made for mlx_quant_fidelity-0.4.0.tar.gz:

Publisher: release.yml on IonDen/mlx-quant-fidelity

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mlx_quant_fidelity-0.4.0.tar.gz
- Subject digest: e2750feb9c9f72d458a5d3e4f8ee621b8b88e5f384593297e8073b186dc43606
- Sigstore transparency entry: 2071409780
- Sigstore integration time: Jul 4, 2026
Source repository:
- Permalink: IonDen/mlx-quant-fidelity@fbd59c04ad65ba3d02d5828a50dcb3a96fd2adb9
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/IonDen
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@fbd59c04ad65ba3d02d5828a50dcb3a96fd2adb9
- Trigger Event: push

File details

Details for the file mlx_quant_fidelity-0.4.0-py3-none-any.whl.

File metadata

Download URL: mlx_quant_fidelity-0.4.0-py3-none-any.whl
Upload date: Jul 4, 2026
Size: 47.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mlx_quant_fidelity-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`51cc17c652decb78c6bf737e467fe37400c5504d3609c59a47dd052f9b311ec7`
MD5	`71ddb8a0ad52130d0f108178d1887e42`
BLAKE2b-256	`e94c4f70a5f21c5048d0962b7b141474dafd77f911fadb95e2620e2a03611c58`

Provenance

The following attestation bundles were made for mlx_quant_fidelity-0.4.0-py3-none-any.whl:

Publisher: release.yml on IonDen/mlx-quant-fidelity

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mlx_quant_fidelity-0.4.0-py3-none-any.whl
- Subject digest: 51cc17c652decb78c6bf737e467fe37400c5504d3609c59a47dd052f9b311ec7
- Sigstore transparency entry: 2071409800
- Sigstore integration time: Jul 4, 2026
Source repository:
- Permalink: IonDen/mlx-quant-fidelity@fbd59c04ad65ba3d02d5828a50dcb3a96fd2adb9
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/IonDen
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@fbd59c04ad65ba3d02d5828a50dcb3a96fd2adb9
- Trigger Event: push
