
Mixed-precision quantization optimizer for MLX models on Apple Silicon

Project description

mlx-optiq

Optimized deployment of LLMs, VLMs, and vision models on Apple Silicon.

Website: https://mlx-optiq.pages.dev/  |  PyPI: https://pypi.org/project/mlx-optiq/  |  Models: https://huggingface.co/mlx-community?search_models=OptiQ

OptIQ is an optimizing compiler and runtime for MLX. It takes a full-precision model and turns it into the best version for a given memory/latency budget on your Mac, using per-layer sensitivity measurements instead of "uniform 4-bit everywhere". The same sensitivity signal drives every layer of the stack — weights, KV cache, LoRA fine-tuning, runtime adapter swapping.

pip install mlx-optiq

Why mlx-optiq

Stock mlx-lm treats every layer of a quantized model the same. OptIQ doesn't:

  • Some layers are 50× more sensitive to quantization than others. OptIQ measures this once per model and assigns bits per layer, holding the same average bits-per-weight while cutting quality loss. On GSM8K, this recovers +15–32 percentage points over uniform 4-bit on the same model at the same quant budget (see the Results page).
  • The same is true of the KV cache. Quantizing some attention layers' KV is catastrophic (layer 0's KV is ~56× more sensitive than the median), while others are essentially lossless at 4-bit. optiq serve runs a per-layer KV quant pipeline that preserves quality while cutting decode memory — up to +62% decode tok/s at 64k context vs fp16 KV on M3 Max.
  • LoRA fine-tuning should reuse that sensitivity signal too. optiq lora train assigns higher adapter rank to layers OptIQ identified as sensitive, and lower rank to robust ones — so your adapter budget goes where it helps most.
  • Multi-adapter serving shouldn't reload the base model every time. optiq serve implements reversible mounted LoRA: mount multiple adapters on one base, switch per-request via a ContextVar-isolated activation gate, all without touching the frozen base weights.

Plus everything a deployment framework actually needs: vision-stripping for pure-text variants of VLMs, TurboQuant rotated-space KV compression (research path), YOLO26 quantization for object detection, and a roofline latency model calibrated to Apple Silicon bandwidth.
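
For context on the latency predictor: decode on Apple Silicon is memory-bandwidth-bound, so a roofline estimate is roughly bandwidth divided by bytes moved per generated token. The sketch below is a simplified illustration of that idea, not optiq latency's calibrated model; the function name and numbers are made up for the example.

# Simplified roofline illustration (not optiq latency's actual model):
# decode tok/s is bounded by memory bandwidth / bytes read per generated token.
def rough_decode_tok_s(n_params, avg_bpw, kv_bytes_per_token, bandwidth_bytes_s=300e9):
    weight_bytes = n_params * avg_bpw / 8        # packed weights streamed once per token
    return bandwidth_bytes_s / (weight_bytes + kv_bytes_per_token)

# e.g. a 9B-parameter model at 4.5 bpw with ~50 MB of KV read per token
print(round(rough_decode_tok_s(9e9, 4.5, 50e6), 1))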

The full stack at a glance

| Feature | CLI | Description |
| --- | --- | --- |
| Weight quantization | optiq convert | Per-layer sensitivity + greedy knapsack. Auto-strips vision/audio metadata when quantizing multi-modal base models for text-only use. |
| KV cache quantization | optiq kv-cache | Writes per-layer kv_config.json. Same sensitivity method applied to the attention cache. |
| TurboQuant compression | optiq.core.turbo_quant | Rotation + optimal Lloyd-Max scalar quantization. Library API for research/custom pipelines. |
| OpenAI-compatible server | optiq serve | Drop-in mlx_lm.server replacement with mixed-precision KV, mounted LoRA adapters, and --adapter <HF repo id> auto-download. |
| Sensitivity-aware LoRA | optiq lora train | Per-layer rank scaling from OptIQ's bit assignments. PEFT-compatible output + OptIQ sidecar metadata. |
| Mounted hot-swap adapters | optiq.adapters.mount | Reversible per-request adapter activation via ContextVar. N adapters co-resident with one base. |
| VLM → text-only | optiq convert --strip-unused-modalities | Drops vision/audio weights + config cleanup. Output routes through gemma4_text / qwen3_5_text instead of the VLM wrapper. |
| YOLO26 quantization | optiq convert --model-type yolo | Full pipeline including per-layer detection-output KL sensitivity. Outputs a yolo-mlx compatible model. |
| Latency prediction | optiq latency | Roofline model calibrated to Apple Silicon memory bandwidth. Predicts decode tok/s for a given model + bit layout before running it. |
| Benchmarking | optiq benchmark / optiq eval | GSM8K, AI2D, WER, perplexity. Side-by-side vs baselines. |

Quickstart

Running a pre-built OptIQ model

Every mlx-community OptiQ model works out of the box with stock mlx-lm:

from mlx_lm import load, generate
model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
print(generate(model, tok, prompt="Hello", max_tokens=50))

Installing mlx-optiq unlocks the rest — mixed-precision KV serving, LoRA fine-tuning, runtime hot-swap adapters. Bit-identical inference quality either way.

Serving with mixed-precision KV

# One-time per-layer KV sensitivity analysis
optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --target-bits 4.5 -o optiq_output/kv_cache/qwen35_9b

# OpenAI-compatible server on :8080
optiq serve \
  --kv-config optiq_output/kv_cache/qwen35_9b/kv_config.json \
  --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --max-tokens 32768 --temp 0.6 --top-p 0.95 --top-k 20

Fine-tuning with sensitivity-aware LoRA

# Train a LoRA adapter. --rank-scaling by_bits assigns rank proportional
# to OptIQ's per-layer bit assignments: 8-bit layers get 2× the rank of
# 4-bit layers, at the same average.
optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --data ./my_training_data \
    --rank 8 --rank-scaling by_bits \
    --iters 1000 -o ./my_adapter

# Inspect what was adapted (shows per-layer rank distribution)
optiq lora info ./my_adapter

Adapter output is PEFT-compatible (adapter_config.json + adapters.safetensors) plus an OptIQ sidecar (optiq_lora_config.json) that records per-layer rank. Loads with any PEFT tool.

Serving with hot-swap adapters

# Preload an adapter at startup (HF repo id or local path)
optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
            --adapter ./my_adapter

# ...or use the mount API directly for multi-adapter serving:
#   mount N adapters on one base, switch per request in the same Python
#   process without reloading the model. See optiq/adapters/mount.py.

Converting a fresh model

pip install mlx-optiq[convert]
optiq convert Qwen/Qwen3-0.6B-base --target-bpw 4.5 --candidate-bits 4,8
optiq eval ./optiq_model --task gsm8k --baseline ./uniform_4bit

For multi-modal base models quantized for text-only deployment, OptIQ auto-strips vision/audio metadata by default. Pass --keep-unused-modalities to disable.

YOLO26

from optiq.models.yolo import run_yolo_pipeline
results = run_yolo_pipeline(
    "optiq_output/yolo26n.safetensors",
    "optiq_output/yolo26n_optiq",
)

Full pipeline: per-layer sensitivity on detection outputs → greedy knapsack → yolo-mlx-compatible output.

Headline numbers

Weight quantization on GSM8K — OptIQ vs uniform 4-bit (same avg BPW):

| Model | Uniform-4b | OptIQ-4b | Δ |
| --- | --- | --- | --- |
| Qwen3.5-0.8B | 11.5% | 27.0% | +15.5pp |
| gemma-4-e4b-it | 23.5% | 55.5% | +32.0pp |

Decode throughput at 64k context — optiq serve mixed-precision KV vs fp16 KV on M3 Max 36GB:

| Model | fp16 tok/s | OptIQ tok/s | Δ |
| --- | --- | --- | --- |
| Qwen3.5-2B | 27.9 | 41.8 | +50% |
| Qwen3.5-4B | 8.1 | 13.1 | +62% |
| Qwen3.5-9B | 20.7 | 27.1 | +31% |

Full tables + methodology: Results page.

How it works

Weight quantization pipeline (optiq convert):

  1. Load the base model from HuggingFace via PyTorch.
  2. For each linear layer × each candidate bit-width, simulate quantization and measure KL divergence between full-precision and quantized logits on a calibration set (WikiText-2 for LLMs, COCO-captions for VLMs).
  3. Greedy knapsack: start every layer at the minimum bits, upgrade the layer with the best "KL-reduction-per-bit" ratio each step until the target BPW budget is spent (sketched below). Protected layers (lm_head, embed_tokens, first/last transformer blocks) always get the max bit-width.
  4. MLX conversion via mlx_lm.convert() with the per-layer quant_predicate from step 3.
  5. For multi-modal base models, strip vision/audio metadata from config.json (auto; opt out with --keep-unused-modalities).
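
A minimal sketch of step 3's greedy knapsack, assuming a sensitivity table of per-layer KL divergence per candidate bit-width. The data structures here are illustrative, not OptIQ's internals, and protected-layer pinning is omitted.

def assign_bits(kl, sizes, target_bpw, candidate_bits=(4, 8)):
    # kl[layer][bits] -> KL divergence measured at that bit-width
    # sizes[layer]    -> parameter count of that layer
    bits = {layer: min(candidate_bits) for layer in kl}
    total = sum(sizes.values())
    while sum(bits[l] * sizes[l] for l in bits) / total < target_bpw:
        best, best_gain, best_next = None, 0.0, None
        for layer, b in bits.items():
            higher = [c for c in candidate_bits if c > b]
            if not higher:
                continue  # already at the top bit-width
            nb = min(higher)
            gain = (kl[layer][b] - kl[layer][nb]) / ((nb - b) * sizes[layer])
            if gain > best_gain:
                best, best_gain, best_next = layer, gain, nb
        if best is None:
            break  # no remaining upgrade reduces KL
        bits[best] = best_next  # spend budget on the most KL-per-bit-efficient upgrade
    return bits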

KV cache pipeline (optiq kv-cache + optiq serve):

  1. Same sensitivity measurement but applied to the KV cache: for each full-attention layer, replace that layer's KV with a quantized copy and measure KL on held-out prompts.
  2. optiq serve monkey-patches mlx_lm.server.stream_generate to use mlx_lm.models.cache.QuantizedKVCache at per-layer bit-widths (via a patched maybe_quantize_kv_cache hook).
  3. At attention time, mx.quantized_matmul reads packed KV directly — no fp16 materialization. On Apple Silicon, the 8-bit kernel is faster than the 4-bit kernel, so protecting one sensitive layer at 8-bit gives both quality AND a throughput bump.
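
The per-layer assignment ultimately boils down to picking a cache object per layer. Here is a minimal sketch of that idea using the stock mlx-lm cache classes; the kv_config.json shape shown is assumed for illustration (OptIQ's real schema may differ), and hybrid-attention models additionally need non-KV caches for their linear-attention layers.

import json
from mlx_lm.models.cache import KVCache, QuantizedKVCache

# Assumed config shape for illustration: one entry per layer, None = leave in fp16.
per_layer_bits = json.load(open("kv_config.json"))  # e.g. [None, 4, 4, 8, 4, ...]
prompt_cache = [
    QuantizedKVCache(group_size=64, bits=b) if b else KVCache()
    for b in per_layer_bits
]
# The cache list is then handed to mlx-lm generation as its prompt cache;
# optiq serve wires the same idea in by patching maybe_quantize_kv_cache.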

TurboQuant KV (optiq.core.turbo_kv_cache — research path):

  1. Random orthogonal rotation makes coordinates near-independent → better-conditioned quantization.
  2. Optimal Lloyd-Max scalar quantization per coordinate (1/2/3/4-bit centroid tables; see the toy sketch after this list).
  3. Rotated-space attention: rotate Q once and output once, work in centroid space in between. Attention cost stays O(seq × d); rotation is O(d²) fixed.
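
A toy numpy sketch of the rotate-then-scalar-quantize idea (not the optiq.core.turbo_quant API). The 2-bit centroid values are the approximate Lloyd-Max levels for a unit Gaussian; per-coordinate scaling is omitted for brevity.

import numpy as np

d = 128
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal rotation, fixed once
centroids = np.array([-1.51, -0.45, 0.45, 1.51])   # ~2-bit Lloyd-Max levels, unit Gaussian

def turbo_quantize(x):
    z = x @ Q                                       # rotated coordinates are near-independent
    return np.abs(z[..., None] - centroids).argmin(axis=-1)   # 2-bit codes per coordinate

def turbo_dequantize(codes):
    return centroids[codes] @ Q.T                   # decode centroids, rotate back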

Sensitivity-aware LoRA (optiq lora train):

  1. Read optiq_metadata.json.per_layer — OptIQ's per-layer bit assignment.
  2. Per target linear, derive rank: rank_scaling=by_bits gives r = base_rank × (bits / 4), so 8-bit layers get 2× the rank of 4-bit at the same base (sketched below).
  3. Apply mounted LoRA across all target blocks with the per-layer rank.
  4. Train via mlx_lm.tuner.trainer.train, with the mx.compile decorator monkey-patched out to avoid a known Metal OOM on 9B-class models.
  5. Save in PEFT-compatible format + OptIQ sidecar recording the per-layer rank distribution.
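
The by_bits rule in step 2 is simple enough to sketch directly; the layer names and metadata shape here are illustrative, not OptIQ's actual files.

def lora_ranks(per_layer_bits, base_rank=8):
    # r = base_rank * (bits / 4): an 8-bit (sensitive) layer gets 2x the rank
    # of a 4-bit (robust) one, so the adapter budget follows the sensitivity signal.
    return {layer: max(1, round(base_rank * bits / 4))
            for layer, bits in per_layer_bits.items()}

print(lora_ranks({"layers.0.self_attn.q_proj": 8, "layers.1.self_attn.q_proj": 4}))
# -> {'layers.0.self_attn.q_proj': 16, 'layers.1.self_attn.q_proj': 8}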

Mounted LoRA hot-swap (optiq.adapters.mount):

  1. prepare_model_for_mounted_lora walks every transformer block and wraps each target linear (q_proj, v_proj by default) in a MountedLoRALinear that holds a dict of {adapter_id: (A, B, scale)} plus the frozen base linear.
  2. mount_adapter_on_model(model, adapter_id, adapter_dir) loads adapter weights off disk and registers them on every MountedLoRALinear.
  3. At inference time, a ContextVar decides which adapter is active. None → base only. with AdapterActivation("A"): → forward pass adds adapter A's residual (see the sketch after this list).
  4. ContextVar semantics mean concurrent asyncio tasks / threads with different active adapters don't step on each other.
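
A condensed sketch of that pattern, reusing the names from this README (MountedLoRALinear, AdapterActivation); the real implementation in optiq/adapters/mount.py differs in detail.

from contextvars import ContextVar
import mlx.nn as nn

_active_adapter: ContextVar = ContextVar("active_adapter", default=None)

class AdapterActivation:
    """Context manager choosing which mounted adapter is live for this task/thread."""
    def __init__(self, adapter_id):
        self.adapter_id = adapter_id
    def __enter__(self):
        self._token = _active_adapter.set(self.adapter_id)
    def __exit__(self, *exc):
        _active_adapter.reset(self._token)

class MountedLoRALinear(nn.Module):
    """Frozen base linear plus a dict of co-resident (A, B, scale) adapters."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.adapters = {}                      # adapter_id -> (A, B, scale)
    def mount(self, adapter_id, A, B, scale=1.0):
        self.adapters[adapter_id] = (A, B, scale)
    def __call__(self, x):
        y = self.base(x)
        active = _active_adapter.get()          # None -> base weights only
        if active in self.adapters:
            A, B, scale = self.adapters[active]
            y = y + scale * ((x @ A.T) @ B.T)   # add only the active adapter's residual
        return y

# Usage: with AdapterActivation("adapter_a"): logits = model(tokens)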

When you need mlx-optiq vs bare mlx-lm

| Scenario | Bare mlx-lm | mlx-optiq |
| --- | --- | --- |
| Load + generate from an OptIQ HF model | ✓ | ✓ |
| Mixed-precision KV cache at serve time | – | ✓ |
| LoRA fine-tuning that uses OptIQ's sensitivity data | – | ✓ |
| Hot-swappable adapters in one serving process | – | ✓ |
| Fresh conversion of a new base model with OptIQ | – | ✓ |
| TurboQuant research pipelines | – | ✓ |
| YOLO26 quantization | – | ✓ |

For pure inference on published OptIQ models: bare mlx-lm is enough and gets bit-identical output. For everything else, install mlx-optiq.

Hybrid-attention note

Qwen3.5 interleaves linear-attention (GatedDeltaNet) and full-attention layers so that only 1 in 4 layers has a KV cache. optiq kv-cache skips the linear-attention layers automatically. On Qwen3.5-4B/9B, you end up with 8 of 32 layers getting per-layer KV bit assignments, typically 7 @ 4-bit + 1 @ 8-bit protecting layer 3 (the first full-attention layer).

Status / roadmap

  • ✅ Weight quantization: production
  • ✅ KV-cache serving (Qwen3.5): production since v0.0.5
  • ✅ Sensitivity-aware LoRA + mounted hot-swap: production since v0.0.8
  • ✅ VLM-to-text metadata stripping: production since v0.0.8
  • ✅ YOLO26 pipeline: production
  • 🚧 Gemma-4 KV serving: blocked on upstream mlx-lm shared-KV attention not supporting QuantizedKVCache
  • 🚧 Per-request adapter routing in the HTTP layer: mount/swap API is production; HTTP X-OptIQ-Adapter header plumbing is next
  • 🔬 TurboQuant serving path with a fused Metal kernel: research

Article

Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon

Requirements

  • Python ≥ 3.11
  • Apple Silicon Mac (for MLX)
  • mlx-lm ≥ 0.30

License

MIT

