mlx-optiq

Mixed-precision quantization optimizer for MLX models on Apple Silicon.

Website: https://mlx-optiq.pages.dev/  |  PyPI: https://pypi.org/project/mlx-optiq/

OptIQ turns "uniform 4-bit" into a data-driven, per-layer budget. Sensitive layers stay at 8-bit; the rest get 4-bit. The same per-layer sensitivity signal runs across the full deployment stack:

  • weight quantization (optiq convert)
  • KV-cache quantization at serving time (optiq kv-cache, optiq serve)
  • TurboQuant rotated-space state compression (optiq.core.turbo_kv_cache)
  • unused-component stripping — for multi-modal base models, drop vision / audio metadata when the target is text-only
  • sensitivity-aware LoRA fine-tuning (new in v0.0.7) — layers OptIQ identified as sensitive get higher adapter rank than robust ones, since they benefit more from adaptation capacity
  • OpenAI-compatible server with optional LoRA adapter loading directly from a HuggingFace repo id

Everything ships behind optiq * subcommands and drops into stock mlx-lm at serve time.

Install

pip install mlx-optiq

What you get

  • optiq convert: Per-layer sensitivity analysis + mixed-precision weight quantization. For multi-modal base models, auto-strips unused vision/audio metadata to produce a clean text-only OptIQ variant. (Experiments →)
  • optiq kv-cache: Per-layer KV-cache sensitivity. Writes kv_config.json with per-layer bit-widths. (Experiments →)
  • optiq serve: OpenAI-compatible HTTP server with mixed-precision KV + --adapter flag that accepts a HuggingFace LoRA repo id directly. Drop-in mlx_lm.server replacement. (Results →)
  • optiq lora train (new in v0.0.7): Sensitivity-aware LoRA fine-tuning on OptIQ models. Reads per-layer bit assignments from optiq_metadata.json and scales LoRA rank by layer sensitivity. PEFT-compatible adapter output. (README ↓)
  • optiq.core.turbo_kv_cache: TurboQuant rotated-space KV (library). Research path for attention-inner-product-preserving quantization. (Experiments →)

Pre-built OptIQ-quantized models on HuggingFace: Models →

Quickstart

Use a pre-built model (stock mlx-lm, no OptIQ code required):

from mlx_lm import load, generate
model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok, prompt="Hello", max_tokens=100)

Serve with mixed-precision KV (new in v0.0.5):

# One-time sensitivity analysis
optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --target-bits 4.5 -o optiq_output/kv_cache/qwen35_9b

# OpenAI-compatible server on :8080
optiq serve \
  --kv-config optiq_output/kv_cache/qwen35_9b/kv_config.json \
  --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --max-tokens 32768 --temp 0.6 --top-p 0.95 --top-k 20

Convert a new model:

pip install mlx-optiq[convert]
optiq convert Qwen/Qwen3-0.6B-base --target-bpw 4.5 --candidate-bits 4,8
optiq eval ./optiq_model --task gsm8k --baseline ./uniform_4bit

Sensitivity-aware LoRA fine-tuning (new in v0.0.7):

# Train a LoRA adapter with per-layer rank derived from OptIQ's
# sensitivity measurements (by_bits scaling: 8-bit layers get 2× rank)
optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --data ./my_training_data \
    --rank 8 --rank-scaling by_bits \
    --iters 1000 -o ./my_adapter

# Inspect what was adapted
optiq lora info ./my_adapter

# Serve with the adapter applied
optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
            --adapter ./my_adapter \
            --kv-config optiq_output/kv_cache/qwen35_9b/kv_config.json

# Or serve with a community adapter direct from HF (auto-downloads)
optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
            --adapter codelion/my-agent-lora

The adapter is saved in PEFT-compatible format (adapter_config.json + adapters.safetensors) plus an OptIQ sidecar (optiq_lora_config.json) that records the per-layer rank distribution. You can load it with any tool that speaks PEFT.
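
To see what that looks like on disk, here is a minimal sketch that reads the per-layer ranks straight out of adapters.safetensors. It assumes PEFT's usual lora_A/lora_B tensor naming; the exact key names OptIQ writes may differ, and optiq lora info remains the supported way to inspect an adapter.

# Sketch: recover per-layer LoRA ranks from the PEFT-format weights.
# Assumes PEFT-style key names ("...lora_A.weight"); adjust if OptIQ names differ.
from safetensors.numpy import load_file

tensors = load_file("./my_adapter/adapters.safetensors")
for name, tensor in sorted(tensors.items()):
    if "lora_A" in name:                      # lora_A weight shape is (rank, in_features)
        layer = name.rsplit(".lora_A", 1)[0]
        print(f"{layer}: rank {tensor.shape[0]}")

With by_bits scaling and --rank 8, you would expect the 8-bit (sensitive) layers to report rank 16 and the 4-bit layers rank 8.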

Headline numbers

Weight quantization — GSM8K vs uniform 4-bit:

  • Qwen3.5-0.8B: 27.0% vs 11.5% (+15.5pp)
  • gemma-4-e4b-it: 55.5% vs 23.5% (+32.0pp)

KV-cache serving — decode tok/s at 64k context (Apple M3 Max 36GB):

  • Qwen3.5-2B: 41.8 vs fp16 27.9 (+50%)
  • Qwen3.5-4B: 13.1 vs fp16 8.1 (+62%)
  • Qwen3.5-9B: 27.1 vs fp16 20.7 (+31%)

Full tables, methodology, and per-layer configs on the Results page.

How it works

Weight quantization pipeline:

  1. Load PyTorch model from HuggingFace.
  2. Per-layer KL-divergence sensitivity on calibration data (WikiText-2 for LLMs).
  3. Greedy knapsack: start all layers at min-bit, upgrade by KL-reduction-per-bit until the BPW budget is spent (a sketch follows this list). Sensitive layers like lm_head, embed_tokens, and the first/last blocks are protected at the max bit-width.
  4. MLX conversion via mlx-lm with per-layer quant_predicate.
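
The following is a minimal sketch of the greedy knapsack in step 3, assuming the calibration pass has already produced kl[layer][bits] (KL divergence against the fp16 model when only that layer is quantized to bits) and params[layer] (that layer's parameter count). Function and variable names are illustrative, not OptIQ's internal API.

# Illustrative greedy bit assignment (not OptIQ's actual implementation).
def assign_bits(kl, params, candidate_bits=(4, 8), target_bpw=4.5, protected=()):
    lo, hi = min(candidate_bits), max(candidate_bits)
    bits = {l: (hi if l in protected else lo) for l in kl}        # start at min-bit
    total_params = sum(params.values())
    budget = target_bpw * total_params - sum(bits[l] * params[l] for l in kl)

    # Repeatedly upgrade the layer with the best KL reduction per extra bit
    # until the bits-per-weight budget is exhausted.
    while budget > 0:
        best, best_gain = None, 0.0
        for l in kl:
            if bits[l] >= hi:
                continue
            extra_bits = (hi - lo) * params[l]
            if extra_bits > budget:
                continue
            gain = (kl[l][lo] - kl[l][hi]) / extra_bits
            if gain > best_gain:
                best, best_gain = l, gain
        if best is None:
            break
        bits[best] = hi
        budget -= (hi - lo) * params[best]
    return bits

With --candidate-bits 4,8 and --target-bpw 4.5, the budget allows roughly one eighth of the parameter mass ((4.5 - 4) / (8 - 4) = 12.5%) to sit at 8-bit, concentrated on the layers whose outputs degrade most at 4-bit.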

KV-cache serving pipeline:

  1. optiq kv-cache runs the same sensitivity analysis but on KV quantization — per full-attention layer, measures KL divergence when that layer's KV is quantized to each candidate bit-width.
  2. optiq serve monkey-patches mlx_lm.server's generation loop to use mlx_lm.models.cache.QuantizedKVCache at per-layer bit-widths (via maybe_quantize_kv_cache). The patched hook replaces mlx-lm's uniform kv_bits with a per-layer dict; the sketch after this list shows the same per-layer idea with stock mlx-lm primitives.
  3. At SDPA time, mx.quantized_matmul reads packed KV directly — no fp16 materialization. On Apple Silicon, the 8-bit kernel path is faster than the 4-bit one, so protecting a single sensitive layer at 8-bit gives both quality preservation and a decode speedup.
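
To make "per-layer bit-widths" concrete, here is an illustrative library-level sketch that builds a mixed cache with stock mlx-lm primitives (make_prompt_cache and QuantizedKVCache from mlx_lm.models.cache, assuming a recent mlx-lm). The layer indices and bit assignment are the hypothetical Qwen3.5 layout from the hybrid-attention note below; optiq serve itself applies the mapping by patching the server loop rather than building the cache up front.

# Illustrative only -- not optiq serve's actual patch.
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache, QuantizedKVCache

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
cache = make_prompt_cache(model)          # one cache object per transformer layer

# 7 full-attention layers at 4-bit, the sensitive first one (layer 3) at 8-bit;
# linear-attention layers keep whatever cache make_prompt_cache gave them.
per_layer_bits = {3: 8, 7: 4, 11: 4, 15: 4, 19: 4, 23: 4, 27: 4, 31: 4}
for layer, bits in per_layer_bits.items():
    cache[layer] = QuantizedKVCache(group_size=64, bits=bits)

out = generate(model, tok, prompt="Hello", max_tokens=100, prompt_cache=cache)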

TurboQuant KV (research path — not default in optiq serve):

  1. Random orthogonal rotation makes coordinates near-independent.
  2. Optimal Lloyd-Max scalar quantization per coordinate (a simplified sketch of steps 1 and 2 follows this list).
  3. Rotated-space attention: rotate Q once and output once, work in centroid space in between. O(d²) fixed cost vs O(seq_len × d²) for naïve rotated-then-dequant.
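
A simplified numpy sketch of steps 1 and 2: a random orthogonal rotation followed by per-coordinate scalar quantization. Plain uniform quantization stands in for the Lloyd-Max centroids, so this shows the structure of the method rather than optiq.core.turbo_kv_cache's implementation.

# Rotate-then-quantize sketch (uniform quantizer in place of Lloyd-Max).
import numpy as np

def random_rotation(d, seed=0):
    # QR decomposition of a Gaussian matrix gives a random orthogonal matrix.
    q, r = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, d)))
    return q * np.sign(np.diag(r))            # sign fix for a proper Haar sample

def quantize(x, bits=4):
    # Per-coordinate scalar quantization in the rotated space.
    levels = 2 ** bits - 1
    lo, hi = x.min(axis=0, keepdims=True), x.max(axis=0, keepdims=True)
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, lo, scale

d = 128
R = random_rotation(d)
keys = np.random.default_rng(1).standard_normal((1024, d))    # toy KV states
codes, lo, scale = quantize(keys @ R, bits=4)                  # rotate, then quantize
keys_hat = (codes * scale + lo) @ R.T                          # dequantize, rotate back
print("max abs reconstruction error:", np.abs(keys - keys_hat).max())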

Hybrid-attention note

Qwen3.5 interleaves linear-attention (GatedDeltaNet) and full-attention layers in a repeating pattern of three linear layers followed by one full-attention layer — only 1 in 4 layers has a KV cache. optiq kv-cache automatically skips the linear layers. On Qwen3.5-4B/9B, this means 8 of 32 layers get per-layer KV bit assignments; the typical output is 7 @ 4-bit + 1 @ 8-bit, protecting layer 3 (the first full-attention layer).
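
For reference, the full-attention layer indices implied by that layout (hypothetical; the real mapping comes from the model config) are the same ones used in the serving sketch above:

# Every fourth layer (0-indexed 3, 7, ..., 31) is full attention and has a KV cache.
full_attention_layers = [i for i in range(32) if i % 4 == 3]
print(full_attention_layers)   # [3, 7, 11, 15, 19, 23, 27, 31]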

Status / roadmap

  • ✅ Weight quantization: production
  • ✅ KV cache serving (Qwen3.5): production in v0.0.5
  • 🚧 Gemma-4 KV serving: blocked on upstream mlx-lm shared-KV attention not supporting QuantizedKVCache
  • 🔬 TurboQuant serving with a fused Metal kernel: research

Article

Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon

Requirements

  • Python ≥ 3.11
  • Apple Silicon Mac (for MLX)
  • mlx-lm ≥ 0.20

License

MIT
