
Mixed-precision quantization optimizer for MLX models on Apple Silicon

Project description

mlx-optiq

Optimized deployment of LLMs, VLMs, and vision models on Apple Silicon.

Website: https://mlx-optiq.pages.dev/  |  PyPI: https://pypi.org/project/mlx-optiq/  |  Models: https://huggingface.co/mlx-community?search_models=OptiQ

OptIQ is an optimizing compiler and runtime for MLX. It takes a full-precision model and turns it into the best version for a given memory/latency budget on your Mac, using per-layer sensitivity measurements instead of "uniform 4-bit everywhere". The same sensitivity signal drives every layer of the stack — weights, KV cache, LoRA fine-tuning, runtime adapter swapping.

pip install mlx-optiq

Why mlx-optiq

Stock mlx-lm treats every layer of a quantized model the same. OptIQ doesn't:

  • Some layers are 50× more sensitive to quantization than others. OptIQ measures this once per model and assigns bits per layer, holding the same average bits-per-weight while cutting quality loss. On GSM8K, this recovers +15–32 percentage points over uniform 4-bit on the same model at the same quant budget (see the Results page).
  • The same is true of the KV cache. Quantizing some attention layers' KV is catastrophic (layer 0's KV is ~56× more sensitive than the median), while others are essentially lossless at 4-bit. optiq serve runs a per-layer KV quant pipeline that preserves quality while cutting decode memory — up to +62% decode tok/s at 64k context vs fp16 KV on M3 Max.
  • LoRA fine-tuning should reuse that sensitivity signal too. optiq lora train assigns higher adapter rank to layers OptIQ identified as sensitive, and lower rank to robust ones — so your adapter budget goes where it helps most.
  • Multi-adapter serving shouldn't reload the base model every time. optiq serve implements reversible mounted LoRA: mount multiple adapters on one base, switch per-request via a ContextVar-isolated activation gate, all without touching the frozen base weights.

Plus everything a deployment framework actually needs: vision-stripping for pure-text variants of VLMs, TurboQuant rotated-space KV compression (research path), YOLO26 quantization for object detection, and a roofline latency model calibrated to Apple Silicon bandwidth.
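
For context on the latency predictor: decode on Apple Silicon is memory-bandwidth-bound, so a roofline estimate is roughly bandwidth divided by bytes moved per generated token. The sketch below is a simplified illustration of that idea, not optiq latency's calibrated model; the function name and numbers are made up for the example.

# Simplified roofline illustration (not optiq latency's actual model):
# decode tok/s is bounded by memory bandwidth / bytes read per generated token.
def rough_decode_tok_s(n_params, avg_bpw, kv_bytes_per_token, bandwidth_bytes_s=300e9):
    weight_bytes = n_params * avg_bpw / 8        # packed weights streamed once per token
    return bandwidth_bytes_s / (weight_bytes + kv_bytes_per_token)

# e.g. a 9B-parameter model at 4.5 bpw with ~50 MB of KV read per token
print(round(rough_decode_tok_s(9e9, 4.5, 50e6), 1))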

The full stack at a glance

| Feature | CLI | Description |
| --- | --- | --- |
| Weight quantization | optiq convert | Per-layer sensitivity + greedy knapsack. Auto-strips vision/audio metadata when quantizing multi-modal base models for text-only use. |
| KV cache quantization | optiq kv-cache | Writes per-layer kv_config.json. Same sensitivity method applied to the attention cache. |
| TurboQuant compression | optiq.core.turbo_quant | Rotation + optimal Lloyd-Max scalar quantization. Library API for research/custom pipelines. |
| OpenAI-compatible server | optiq serve | Drop-in mlx_lm.server replacement with mixed-precision KV, mounted LoRA adapters, and --adapter <HF repo id> auto-download. |
| Sensitivity-aware LoRA | optiq lora train | Per-layer rank scaling from OptIQ's bit assignments. PEFT-compatible output + OptIQ sidecar metadata. |
| Mounted hot-swap adapters | optiq.adapters.mount | Reversible per-request adapter activation via ContextVar. N adapters co-resident with one base. |
| VLM → text-only | optiq convert --strip-unused-modalities | Drops vision/audio weights + config cleanup. Output routes through gemma4_text / qwen3_5_text instead of the VLM wrapper. |
| YOLO26 quantization | optiq convert --model-type yolo | Full pipeline including per-layer detection-output KL sensitivity. Outputs a yolo-mlx compatible model. |
| Latency prediction | optiq latency | Roofline model calibrated to Apple Silicon memory bandwidth. Predicts decode tok/s for a given model + bit layout before running it. |
| Benchmarking | optiq benchmark / optiq eval | GSM8K, AI2D, WER, perplexity. Side-by-side vs baselines. |

Quickstart

Running a pre-built OptIQ model

Every mlx-community OptiQ model works out of the box with stock mlx-lm:

from mlx_lm import load, generate
model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
print(generate(model, tok, prompt="Hello", max_tokens=50))

Installing mlx-optiq unlocks the rest — mixed-precision KV serving, LoRA fine-tuning, runtime hot-swap adapters. Bit-identical inference quality either way.

Serving with mixed-precision KV

# One-time per-layer KV sensitivity analysis
optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --target-bits 4.5 -o optiq_output/kv_cache/qwen35_9b

# OpenAI-compatible server on :8080
optiq serve \
  --kv-config optiq_output/kv_cache/qwen35_9b/kv_config.json \
  --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --max-tokens 32768 --temp 0.6 --top-p 0.95 --top-k 20

Fine-tuning with sensitivity-aware LoRA

# Train a LoRA adapter. --rank-scaling by_bits assigns rank proportional
# to OptIQ's per-layer bit assignments: 8-bit layers get 2× the rank of
# 4-bit layers, at the same average.
optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --data ./my_training_data \
    --rank 8 --rank-scaling by_bits \
    --iters 1000 -o ./my_adapter

# Inspect what was adapted (shows per-layer rank distribution)
optiq lora info ./my_adapter

Adapter output is PEFT-compatible (adapter_config.json + adapters.safetensors) plus an OptIQ sidecar (optiq_lora_config.json) that records per-layer rank. Loads with any PEFT tool.

Serving with hot-swap adapters

# Preload an adapter at startup (HF repo id or local path)
optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
            --adapter ./my_adapter

# ...or use the mount API directly for multi-adapter serving:
#   mount N adapters on one base, switch per request in the same Python
#   process without reloading the model. See optiq/adapters/mount.py.

Converting a fresh model

pip install mlx-optiq[convert]
optiq convert Qwen/Qwen3-0.6B-base --target-bpw 4.5 --candidate-bits 4,8
optiq eval ./optiq_model --task gsm8k --baseline ./uniform_4bit

For multi-modal base models quantized for text-only deployment, OptIQ auto-strips vision/audio metadata by default. Pass --keep-unused-modalities to disable.

YOLO26

from optiq.models.yolo import run_yolo_pipeline
results = run_yolo_pipeline(
    "optiq_output/yolo26n.safetensors",
    "optiq_output/yolo26n_optiq",
)

Full pipeline: per-layer sensitivity on detection outputs → greedy knapsack → yolo-mlx-compatible output.

Headline numbers

Weight quantization on GSM8K — OptIQ vs uniform 4-bit (same avg BPW):

| Model | Uniform-4b | OptIQ-4b | Δ |
| --- | --- | --- | --- |
| Qwen3.5-0.8B | 11.5% | 27.0% | +15.5pp |
| gemma-4-e4b-it | 23.5% | 55.5% | +32.0pp |

Decode throughput at 64k context — optiq serve mixed-precision KV vs fp16 KV on M3 Max 36GB:

| Model | fp16 tok/s | OptIQ tok/s | Δ |
| --- | --- | --- | --- |
| Qwen3.5-2B | 27.9 | 41.8 | +50% |
| Qwen3.5-4B | 8.1 | 13.1 | +62% |
| Qwen3.5-9B | 20.7 | 27.1 | +31% |

Full tables + methodology: Results page.

How it works

Weight quantization pipeline (optiq convert):

  1. Load the base model from HuggingFace via PyTorch.
  2. For each linear layer × each candidate bit-width, simulate quantization and measure KL divergence between full-precision and quantized logits on a calibration set (WikiText-2 for LLMs, COCO-captions for VLMs).
  3. Greedy knapsack: start every layer at the minimum bits, upgrade the layer with the best "KL-reduction-per-bit" ratio each step until the target BPW budget is spent (sketched below). Protected layers (lm_head, embed_tokens, first/last transformer blocks) always get the max bit-width.
  4. MLX conversion via mlx_lm.convert() with the per-layer quant_predicate from step 3.
  5. For multi-modal base models, strip vision/audio metadata from config.json (auto; opt out with --keep-unused-modalities).
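
A minimal sketch of step 3's greedy knapsack, assuming a sensitivity table of per-layer KL divergence per candidate bit-width. The data structures here are illustrative, not OptIQ's internals, and protected-layer pinning is omitted.

def assign_bits(kl, sizes, target_bpw, candidate_bits=(4, 8)):
    # kl[layer][bits] -> KL divergence measured at that bit-width
    # sizes[layer]    -> parameter count of that layer
    bits = {layer: min(candidate_bits) for layer in kl}
    total = sum(sizes.values())
    while sum(bits[l] * sizes[l] for l in bits) / total < target_bpw:
        best, best_gain, best_next = None, 0.0, None
        for layer, b in bits.items():
            higher = [c for c in candidate_bits if c > b]
            if not higher:
                continue  # already at the top bit-width
            nb = min(higher)
            gain = (kl[layer][b] - kl[layer][nb]) / ((nb - b) * sizes[layer])
            if gain > best_gain:
                best, best_gain, best_next = layer, gain, nb
        if best is None:
            break  # no remaining upgrade reduces KL
        bits[best] = best_next  # spend budget on the most KL-per-bit-efficient upgrade
    return bits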

KV cache pipeline (optiq kv-cache + optiq serve):

  1. Same sensitivity measurement but applied to the KV cache: for each full-attention layer, replace that layer's KV with a quantized copy and measure KL on held-out prompts.
  2. optiq serve monkey-patches mlx_lm.server.stream_generate to use mlx_lm.models.cache.QuantizedKVCache at per-layer bit-widths (via a patched maybe_quantize_kv_cache hook).
  3. At attention time, mx.quantized_matmul reads packed KV directly — no fp16 materialization. On Apple Silicon, the 8-bit kernel is faster than the 4-bit kernel, so protecting one sensitive layer at 8-bit gives both quality AND a throughput bump.
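
The per-layer assignment ultimately boils down to picking a cache object per layer. Here is a minimal sketch of that idea using the stock mlx-lm cache classes; the kv_config.json shape shown is assumed for illustration (OptIQ's real schema may differ), and hybrid-attention models additionally need non-KV caches for their linear-attention layers.

import json
from mlx_lm.models.cache import KVCache, QuantizedKVCache

# Assumed config shape for illustration: one entry per layer, None = leave in fp16.
per_layer_bits = json.load(open("kv_config.json"))  # e.g. [None, 4, 4, 8, 4, ...]
prompt_cache = [
    QuantizedKVCache(group_size=64, bits=b) if b else KVCache()
    for b in per_layer_bits
]
# The cache list is then handed to mlx-lm generation as its prompt cache;
# optiq serve wires the same idea in by patching maybe_quantize_kv_cache.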

TurboQuant KV (optiq.core.turbo_kv_cache — research path):

  1. Random orthogonal rotation makes coordinates near-independent → better-conditioned quantization.
  2. Optimal Lloyd-Max scalar quantization per coordinate (1/2/3/4-bit centroid tables; see the toy sketch after this list).
  3. Rotated-space attention: rotate Q once and output once, work in centroid space in between. Attention cost stays O(seq × d); rotation is O(d²) fixed.
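
A toy numpy sketch of the rotate-then-scalar-quantize idea (not the optiq.core.turbo_quant API). The 2-bit centroid values are the approximate Lloyd-Max levels for a unit Gaussian; per-coordinate scaling is omitted for brevity.

import numpy as np

d = 128
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal rotation, fixed once
centroids = np.array([-1.51, -0.45, 0.45, 1.51])   # ~2-bit Lloyd-Max levels, unit Gaussian

def turbo_quantize(x):
    z = x @ Q                                       # rotated coordinates are near-independent
    return np.abs(z[..., None] - centroids).argmin(axis=-1)   # 2-bit codes per coordinate

def turbo_dequantize(codes):
    return centroids[codes] @ Q.T                   # decode centroids, rotate back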

Sensitivity-aware LoRA (optiq lora train):

  1. Read optiq_metadata.json.per_layer — OptIQ's per-layer bit assignment.
  2. Per target linear, derive rank: rank_scaling=by_bits gives r = base_rank × (bits / 4), so 8-bit layers get 2× the rank of 4-bit at the same base (sketched below).
  3. Apply mounted LoRA across all target blocks with the per-layer rank.
  4. Train via mlx_lm.tuner.trainer.train, with the mx.compile decorator monkey-patched out to avoid a known Metal OOM on 9B-class models.
  5. Save in PEFT-compatible format + OptIQ sidecar recording the per-layer rank distribution.
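
The by_bits rule in step 2 is simple enough to sketch directly; the layer names and metadata shape here are illustrative, not OptIQ's actual files.

def lora_ranks(per_layer_bits, base_rank=8):
    # r = base_rank * (bits / 4): an 8-bit (sensitive) layer gets 2x the rank
    # of a 4-bit (robust) one, so the adapter budget follows the sensitivity signal.
    return {layer: max(1, round(base_rank * bits / 4))
            for layer, bits in per_layer_bits.items()}

print(lora_ranks({"layers.0.self_attn.q_proj": 8, "layers.1.self_attn.q_proj": 4}))
# -> {'layers.0.self_attn.q_proj': 16, 'layers.1.self_attn.q_proj': 8}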

Mounted LoRA hot-swap (optiq.adapters.mount):

  1. prepare_model_for_mounted_lora walks every transformer block and wraps each target linear (q_proj, v_proj by default) in a MountedLoRALinear that holds a dict of {adapter_id: (A, B, scale)} plus the frozen base linear.
  2. mount_adapter_on_model(model, adapter_id, adapter_dir) loads adapter weights off disk and registers them on every MountedLoRALinear.
  3. At inference time, a ContextVar decides which adapter is active. None → base only. with AdapterActivation("A"): → forward pass adds adapter A's residual (see the sketch after this list).
  4. ContextVar semantics mean concurrent asyncio tasks / threads with different active adapters don't step on each other.
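
A condensed sketch of that pattern, reusing the names from this README (MountedLoRALinear, AdapterActivation); the real implementation in optiq/adapters/mount.py differs in detail.

from contextvars import ContextVar
import mlx.nn as nn

_active_adapter: ContextVar = ContextVar("active_adapter", default=None)

class AdapterActivation:
    """Context manager choosing which mounted adapter is live for this task/thread."""
    def __init__(self, adapter_id):
        self.adapter_id = adapter_id
    def __enter__(self):
        self._token = _active_adapter.set(self.adapter_id)
    def __exit__(self, *exc):
        _active_adapter.reset(self._token)

class MountedLoRALinear(nn.Module):
    """Frozen base linear plus a dict of co-resident (A, B, scale) adapters."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.adapters = {}                      # adapter_id -> (A, B, scale)
    def mount(self, adapter_id, A, B, scale=1.0):
        self.adapters[adapter_id] = (A, B, scale)
    def __call__(self, x):
        y = self.base(x)
        active = _active_adapter.get()          # None -> base weights only
        if active in self.adapters:
            A, B, scale = self.adapters[active]
            y = y + scale * ((x @ A.T) @ B.T)   # add only the active adapter's residual
        return y

# Usage: with AdapterActivation("adapter_a"): logits = model(tokens)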

When you need mlx-optiq vs bare mlx-lm

| Scenario | Bare mlx-lm | mlx-optiq |
| --- | --- | --- |
| Load + generate from an OptIQ HF model | ✓ | ✓ |
| Mixed-precision KV cache at serve time | – | ✓ |
| LoRA fine-tuning that uses OptIQ's sensitivity data | – | ✓ |
| Hot-swappable adapters in one serving process | – | ✓ |
| Fresh conversion of a new base model with OptIQ | – | ✓ |
| TurboQuant research pipelines | – | ✓ |
| YOLO26 quantization | – | ✓ |

For pure inference on published OptIQ models: bare mlx-lm is enough and gets bit-identical output. For everything else, install mlx-optiq.

Hybrid-attention note

Qwen3.5 interleaves linear-attention (GatedDeltaNet) and full-attention layers so that only 1 in 4 layers has a KV cache. optiq kv-cache skips the linear-attention layers automatically. On Qwen3.5-4B/9B, you end up with 8 of 32 layers getting per-layer KV bit assignments, typically 7 @ 4-bit + 1 @ 8-bit protecting layer 3 (the first full-attention layer).

Status / roadmap

  • ✅ Weight quantization: production
  • ✅ KV-cache serving (Qwen3.5): production since v0.0.5
  • ✅ Sensitivity-aware LoRA + mounted hot-swap: production since v0.0.8
  • ✅ VLM-to-text metadata stripping: production since v0.0.8
  • ✅ YOLO26 pipeline: production
  • 🚧 Gemma-4 KV serving: blocked on upstream mlx-lm shared-KV attention not supporting QuantizedKVCache
  • 🚧 Per-request adapter routing in the HTTP layer: mount/swap API is production; HTTP X-OptIQ-Adapter header plumbing is next
  • 🔬 TurboQuant serving path with a fused Metal kernel: research

Article

Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon

Requirements

  • Python ≥ 3.11
  • Apple Silicon Mac (for MLX)
  • mlx-lm ≥ 0.30

License

MIT

