mlx-optiq

Mixed-precision quantization optimizer for MLX models on Apple Silicon.

OptiQ produces more accurate quantized models that mlx-lm loads natively. It also provides a TurboQuant KV cache — rotation-based vector quantization that preserves attention inner products better than standard affine quantization.

Install

pip install mlx-optiq

What It Does

1. Mixed-Precision Weight Quantization

Instead of uniform quantization (all layers at 4-bit), OptiQ measures each layer's sensitivity via KL divergence and assigns optimal per-layer bit-widths.
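
As a sketch of the measurement, sensitivity can be computed as the KL divergence between the full-precision model's output distribution and the output with a single layer quantized. The helper below is illustrative (kl_sensitivity is not part of the package's API) and assumes you already have both sets of logits on calibration data:

import mlx.core as mx

def kl_sensitivity(logits_fp16, logits_quant):
    """Mean KL(P_fp16 || P_quant) over calibration tokens.

    A layer whose quantization shifts the output distribution more
    (higher KL) is more sensitive and should keep more bits.
    """
    log_p = logits_fp16 - mx.logsumexp(logits_fp16, axis=-1, keepdims=True)
    log_q = logits_quant - mx.logsumexp(logits_quant, axis=-1, keepdims=True)
    return mx.mean(mx.sum(mx.exp(log_p) * (log_p - log_q), axis=-1)).item()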

Qwen3.5-0.8B (GSM8K, 200 samples):

Model                   GSM8K   Size
OptiQ mixed (4.5 BPW)   27.0%   570 MB
Uniform 4-bit           11.5%   404 MB

2.3x the accuracy for a modest size increase. Models work with standard mlx-lm — no special code needed:

from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)

2. TurboQuant KV Cache

Implements rotation-based vector quantization from TurboQuant (ICLR 2026) for KV cache compression. Random orthogonal rotation before scalar quantization preserves the inner products that attention's Q·K^T computation needs.

Rotated-space attention eliminates per-key rotation overhead — only the query and output are rotated once each (O(d²)), while all stored keys/values are accessed via cheap centroid lookups (O(d)). Result: near-zero speed overhead.
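
The key property: for an orthogonal rotation R, (Rq)·(Rk) = qᵀRᵀRk = q·k, so the rotation itself changes nothing exactly — what it changes is the quantization error, because outlier coordinates get spread across all dimensions and scalar quantization wastes less of its range. A toy check of both effects (the 4-bit round trip and dimensions here are illustrative, not the package's internals):

import mlx.core as mx

def affine_4bit(v):
    # Toy per-vector affine quantize/dequantize round trip (16 levels)
    lo, hi = v.min().item(), v.max().item()
    scale = (hi - lo) / 15 or 1.0
    return mx.round((v - lo) / scale) * scale + lo

d = 256
mx.random.seed(0)
q = mx.random.normal((d,))
# A key with one outlier coordinate, as seen in real K activations
k = mx.concatenate([mx.array([8.0]), 0.1 * mx.random.normal((d - 1,))])

# Random orthogonal rotation via QR (MLX linalg runs on the CPU stream)
R = mx.linalg.qr(mx.random.normal((d, d)), stream=mx.cpu)[0]

exact   = mx.sum(q * k).item()                          # true q.k
plain   = mx.sum(q * affine_4bit(k)).item()             # quantize raw k
rotated = mx.sum((R @ q) * affine_4bit(R @ k)).item()   # quantize rotated k
print(exact, plain, rotated)  # the rotated error is typically much smaller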

Qwen3.5-0.8B (6 self-attention layers):

Method                PPL     Needle Retrieval   Speed vs FP16
FP16 KV (reference)   22.50   73%                baseline
Affine 4-bit KV       22.98   80%                -0%
TurboQuant 4-bit KV   22.87   100%               -2%
TurboQuant 3-bit KV   23.66   100%               +4%
  • TurboQuant 4-bit beats affine on PPL (+0.37 vs +0.48 over FP16) and needle retrieval (100% vs 80%)
  • TurboQuant enables a 3-bit KV cache where affine can't (3-bit affine packing is incompatible with head_dim=256)
  • GSM8K reasoning is preserved: TurboQuant 4-bit scores 32% vs FP16's 30% (50-sample test)

Usage:

from mlx_lm import load, generate
from optiq.core.turbo_kv_cache import TurboQuantKVCache, patch_attention

model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
patch_attention()  # Install rotated-space attention (once)

# Replace self-attention KV caches with TurboQuant
cache = model.make_cache()
for i, layer in enumerate(model.layers):
    if hasattr(layer, "self_attn"):
        cache[i] = TurboQuantKVCache(
            head_dim=layer.self_attn.head_dim, bits=4, seed=42+i
        )

# Use as normal
response = generate(model, tokenizer, prompt="Hello", max_tokens=100, prompt_cache=cache)

3. YOLO26 Object Detection

Mixed-precision quantized YOLO26 models for real-time object detection on Apple Silicon. Roughly 4x compression with near-zero detection loss for most sizes (see the table below).

pip install mlx-optiq[yolo]

from optiq.models.yolo import load_quantized_yolo

model = load_quantized_yolo("mlx-community/YOLO26n-OptiQ-6bit")
results = model.predict("image.jpg")

COCO128 Benchmark:

Model     Original   OptiQ     Compression   Detection Delta
YOLO26n   9.9 MB     2.5 MB    3.9x          -1.6%
YOLO26s   38.4 MB    8.9 MB    4.3x          -7.0%
YOLO26m   83.8 MB    18.9 MB   4.4x          +0.1%
YOLO26l   100.7 MB   22.9 MB   4.4x          0.0%
YOLO26x   225.5 MB   50.6 MB   4.5x          -1.1%

Pre-built Models

Available on HuggingFace:

LLMs (work with standard mlx-lm):

YOLO26 models (require mlx-optiq[yolo]):

Convert Your Own Models

pip install mlx-optiq[convert]

# Mixed-precision quantization
optiq convert Qwen/Qwen3-0.6B-base --target-bpw 4.5 --candidate-bits 4,8

# Evaluate
optiq eval ./optiq_model --task gsm8k --baseline ./uniform_4bit

How It Works

Weight quantization pipeline:

  1. Load PyTorch model from HuggingFace
  2. Per-layer KL divergence sensitivity analysis on calibration data
  3. Greedy knapsack optimization to assign bit-widths within the BPW budget (sketched after this list)
  4. MLX conversion via mlx-lm with custom per-layer quantization config
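
The knapsack step trades sensitivity against size: every layer starts at the lowest candidate bit-width, and the extra-bit budget implied by the BPW target is spent on the most sensitive layers first. A minimal sketch of that greedy formulation (assign_bits and its arguments are illustrative, not the package's API):

def assign_bits(sensitivity, sizes, candidate_bits=(4, 8), target_bpw=4.5):
    """Greedy knapsack over layers.

    sensitivity: per-layer KL divergence measured at the low bit-width
    sizes:       per-layer parameter counts
    Returns per-layer bit-widths whose weighted average stays <= target_bpw.
    """
    lo, hi = min(candidate_bits), max(candidate_bits)
    bits = {name: lo for name in sensitivity}
    budget = (target_bpw - lo) * sum(sizes.values())  # extra bits to spend
    # Upgrade layers with the highest sensitivity per extra bit first
    for name in sorted(sensitivity, key=lambda n: sensitivity[n] / sizes[n],
                       reverse=True):
        cost = (hi - lo) * sizes[name]
        if cost <= budget:
            bits[name] = hi
            budget -= cost
    return bits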

TurboQuant KV cache:

  1. Random orthogonal rotation makes vector coordinates near-independent
  2. Optimal Lloyd-Max scalar quantization per coordinate (sketched after this list)
  3. Rotated-space attention: pre-rotate Q, compute SDPA in centroid space, post-rotate output
  4. Incremental quantization: only new tokens are processed each step
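
Lloyd-Max on scalars is just 1-D k-means: alternate nearest-centroid assignment with centroid re-estimation. A toy version for one coordinate (illustrative only; since rotated coordinates are near-Gaussian, in practice a codebook fit once to a standard Gaussian can be reused, which is what makes decode a cheap centroid lookup):

import mlx.core as mx

def lloyd_max(x, bits=4, iters=20):
    """1-D Lloyd-Max quantizer (k-means on scalars)."""
    k = 2 ** bits
    centroids = mx.linspace(x.min().item(), x.max().item(), k)
    for _ in range(iters):
        # Assign each sample to its nearest centroid...
        idx = mx.argmin(mx.abs(x[:, None] - centroids[None, :]), axis=1)
        # ...then move each centroid to the mean of its assigned samples
        one_hot = (idx[:, None] == mx.arange(k)[None, :]).astype(mx.float32)
        counts = mx.maximum(one_hot.sum(axis=0), 1)
        centroids = (one_hot * x[:, None]).sum(axis=0) / counts
    return centroids, idx  # codebook and per-sample codes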

Architecture Note (Hybrid Models)

Qwen3.5 uses 18 GatedDeltaNet layers (recurrent state) + 6 standard self-attention layers (KV cache). TurboQuant is applied to the KV cache layers only. The recurrent state uses a read-modify-write pattern where quantization errors accumulate — keeping it at FP16 is recommended for generation tasks.
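
The distinction matters because a KV entry is quantized once and then only read, while recurrent state is re-quantized on every step, so errors compound roughly like a random walk. A toy illustration of that accumulation (the 4-bit round trip below is illustrative):

import mlx.core as mx

def q4(v):
    # Toy 4-bit affine quantize/dequantize round trip
    lo, hi = v.min().item(), v.max().item()
    scale = (hi - lo) / 15 or 1.0
    return mx.round((v - lo) / scale) * scale + lo

mx.random.seed(0)
exact = state = mx.random.normal((64,))
for _ in range(100):
    delta = 0.01 * mx.random.normal((64,))
    exact = exact + delta
    state = q4(state + delta)  # read-modify-write: re-quantized each step
# The accumulated error grows roughly like sqrt(steps); a KV entry
# quantized once would carry only a single step's worth of error.
print(mx.sqrt(mx.sum((state - exact) ** 2)).item())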

Article

Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon

Requirements

  • Python >= 3.11
  • Apple Silicon Mac (for MLX)
  • mlx-lm >= 0.20
