Skip to main content

Fast KV cache quantization for Apple Silicon โ€” TurboQuant, RVQ, VecInfer (with Metal kernels), RateQuant, PolarQuant, and QJL in MLX

Project description

VeloxQuant-MLX

๐ŸŒ veloxquant-mlx.netlify.app โ€” interactive landing page with benchmarks, algorithm explainers, and copy-paste code snippets.

Fast KV-cache quantization for Apple Silicon โ€” TurboQuant, RVQ, VecInfer, RateQuant, PolarQuant, and QJL in MLX.

Landing page PyPI version Python Platform License: MIT

A drop-in KV-cache replacement for mlx_lm that compresses the Key tensor by 3โ€“16ร— with near-lossless quality at 4-bit, functional 2-bit and 1-bit via Residual Vector Quantization, up to 16ร— via VecInfer product VQ, and per-layer mixed-precision allocation via RateQuant. Validated end-to-end on 10 production models (Mistral, Falcon, Phi-4, Qwen3, Qwen2.5, Llama 3.1/3.2, Gemma3, SmolLM2).

Uniform 1-bit RVQ โ€” 7.5ร— key compression, 95โ€“104% of fp16 throughput

import mlx_lm
from veloxquant_mlx import KVCacheBuilder, KVCacheConfig

model, tokenizer = mlx_lm.load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

config = KVCacheConfig(method="turboquant_rvq", bit_width_inlier=1, seed=42)
caches = KVCacheBuilder.for_model(model, config)
model.make_cache = lambda *_a, **_k: caches

response = mlx_lm.generate(model, tokenizer,
    prompt="Explain the theory of relativity in simple terms.",
    max_tokens=200,
)

Per-layer RateQuant โ€” match fp16 throughput at fractional average bits

from veloxquant_mlx import (
    KVCacheBuilder, KVCacheConfig,
    calibrate_layer_sensitivities, allocate_bits_ratequant,
)

# 1.6s one-time calibration on real model activations
weights = calibrate_layer_sensitivities(model, tokenizer)

# Theorem 2 closed-form: high-sensitivity layers get more bits
alloc = allocate_bits_ratequant(weights, target_avg_bits=1.5, beta=3.5)

# Pass the list directly to KVCacheConfig โ€” for_model() consumes per layer
config = KVCacheConfig(method="turboquant_rvq", bit_width_inlier=alloc, seed=42)
caches = KVCacheBuilder.for_model(model, config)

Table of contents

  1. Highlights
  2. Installation
  3. Quick start
  4. RateQuant โ€” per-layer mixed precision
  5. What's inside
  6. Algorithm guide
  7. Per-model benchmark results
  8. Throughput optimization journey
  9. Architecture
  10. CLI
  11. Development
  12. References

Highlights

  • VecInfer product VQ (new in 0.5.0) โ€” smooth scaling + Walsh-Hadamard dual transform + K-means codebook delivers 16ร— key compression at 1 bit/elem. On Qwen2.5-7B (strong GQA) VecInfer-1bit exceeds fp16 throughput at 16ร— compression. Benchmarked across 10 models โ€” see v6 results.
  • RVQ 1-bit (new in 0.3.4) โ€” sign-quantizer stage + Laplacian residual delivers 7.5ร— key compression with cosine 0.92, generates full 200-token output on every tested model, and matches or beats fp16 throughput on most 7โ€“8B models.
  • RateQuant per-layer allocation (new in 0.3.5) โ€” Theorem 2 reverse-waterfilling on real activation sensitivities. Pass bit_width_inlier=alloc_list to KVCacheConfig, let KVCacheBuilder.for_model() consume per layer. 1.6s one-time calibration, zero inference overhead.
  • RVQ 2-bit โ€” two-pass residual quantization brings 2-bit cosine from 0.69 โ†’ 0.98.
  • End-to-end fp16 throughput parity on Mistral, Falcon, Phi-4, Qwen3, Gemma3 after the throughput optimizations.
  • Four quantizers, one interface โ€” turboquant_rvq, turboquant_prod, turboquant_mse, plus polar and qjl.
  • Native MLX integration โ€” no Metal kernel writing required; uses mx.hadamard_transform for O(d log d) rotation.
  • Production patterns โ€” Factory + Strategy + Registry + Builder. Drop-in for mlx_lm.cache.KVCache.
  • Apple Silicon first โ€” designed and tested on M-series unified-memory.

Installation

pip install VeloxQuant-MLX

From source:

git clone https://github.com/rajveer43/VeloxQuant-MLX
cd VeloxQuant-MLX
pip install -e ".[dev]"

Requires Python โ‰ฅ 3.11 and an Apple Silicon Mac with MLX โ‰ฅ 0.18.


Quick start

Standalone KV cache (synthetic streaming)

from veloxquant_mlx import KVCacheBuilder
import mlx.core as mx, numpy as np

cache = (
    KVCacheBuilder()
    .with_method("turboquant_rvq")     # try also: "turboquant_prod", "polar", "qjl"
    .with_head_dim(128)
    .with_bit_width(inlier=2)          # 2-bit RVQ uses 2*b = 4 bits/dim total
    .with_seed(42)
    .build()
)

rng = np.random.default_rng(0)
for _ in range(1000):
    cache.append(
        mx.array(rng.standard_normal(128).astype(np.float16)),
        mx.array(rng.standard_normal(128).astype(np.float16)),
    )

q = mx.array(rng.standard_normal(128).astype(np.float16))
out = cache.attend(q)
print(f"Memory: {cache.memory_bytes()/1024:.1f} KB for {len(cache)} tokens")

Drop-in replacement for mlx_lm generation (recommended)

KVCacheBuilder.for_model() handles per-layer construction, dtype detection, and VLM wrappers automatically:

import mlx_lm
from veloxquant_mlx import KVCacheBuilder, KVCacheConfig

model, tokenizer = mlx_lm.load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

config = KVCacheConfig(method="turboquant_rvq", bit_width_inlier=1, seed=42)
caches = KVCacheBuilder.for_model(model, config)
model.make_cache = lambda *_a, **_k: caches

response = mlx_lm.generate(model, tokenizer, prompt="...", max_tokens=200)

Per-cache byte accounting is available via cache.fp16_key_bytes / cache.compressed_key_bytes for benchmark reporting.


RateQuant โ€” per-layer mixed precision

The default is uniform bit-width across layers. RateQuant (arxiv:2605.06675) allocates more bits to high-sensitivity layers and fewer to low-sensitivity ones, with the average held at a user-chosen target. The library exposes both the sensitivity probe and the closed-form allocator:

from veloxquant_mlx import (
    KVCacheBuilder, KVCacheConfig,
    calibrate_layer_sensitivities,   # 1.6s, real-activation probe
    allocate_bits_ratequant,         # Theorem 2 reverse-waterfilling
)

# Step 1 โ€” one-time calibration on 8 default prompts (overridable)
weights = calibrate_layer_sensitivities(model, tokenizer)
# weights[i] is the mean-squared key L2 norm at layer i

# Step 2 โ€” closed-form allocation. Average is exact; per-layer bits are integer.
alloc = allocate_bits_ratequant(
    weights,
    target_avg_bits=1.5,   # fractional โ€” integer alloc straddles it
    beta=3.5,              # RVQ decay constant (paper-reported)
    bit_choices=(1, 2, 3), # RVQ supports any positive integer b
)

# Step 3 โ€” pass directly to KVCacheConfig. for_model() consumes per-layer.
config = KVCacheConfig(method="turboquant_rvq", bit_width_inlier=alloc, seed=42)
caches = KVCacheBuilder.for_model(model, config)

When does it help? When per-layer sensitivity is heterogeneous. The calibration printout reports the min/max range; a ratio above ~2ร— indicates RateQuant will give measurable gains. Empirically:

Model Sensitivity ratio Notes
Falcon3 7B (28 layers, head_dim=256) 6.48ร— Mixed alloc: 14 layers at b=2, 14 at b=1
Gemma3 4B (34 layers, head_dim=256) 14.39ร— Mixed alloc: 3 at b=3, 11 at b=2, 20 at b=1

Distortion model. The paper notes that the decay rate ฮฒ varies across quantizers (3.5 for TurboQuant, โ‰ˆ5.0 for KIVI/QuaRot). The library default of beta=3.5 is correct for RVQ; if you adapt the allocator to another quantizer, call fit_distortion_curve() first to estimate ฮฒ.

What's NOT (yet) implemented from the paper: per-head allocation (paper: Lร—H groups, ours: L), gradient-based sensitivity (paper notes activation is ~1 PPL worse but both beat uniform), and K/V separation (paper's biggest single fix on KIVI). These remain open extensions โ€” the per-layer subset already gives most of the benefit on RVQ at โ‰ฅ1.5 bits.


VecInfer โ€” vector quantization with outlier-suppressing dual transform

New in 0.5.0. VecInfer (Yao et al. 2025) compresses the KV cache via product vector quantization against a pre-trained K-means codebook. To handle outlier channels that wreck codebook utilization at low bit-widths, VecInfer applies a dual transform before quantization: per-channel smooth scaling (lambda_i = sqrt(max|K_i|)) followed by a Walshโ€“Hadamard rotation. The inverse transform is absorbed into the queries, so q_tilde @ K_tilde.T == q @ K.T exactly.

import mlx.core as mx
from mlx_lm import load, generate
from veloxquant_mlx import KVCacheConfig, KVCacheBuilder

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

config = KVCacheConfig(
    method="vecinfer",
    key_codebook_bits=8, key_sub_dim=4,       # 2 bits/elem on keys
    value_codebook_bits=8, value_sub_dim=4,   # 2 bits/elem on values
    # smooth_factors / key_codebook / value_codebook can be passed pre-calibrated;
    # omitted = random init (useful only for wiring tests)
)
caches = KVCacheBuilder.for_model(model, config)
out = generate(model, tokenizer, prompt="Explain Hadamard rotation", max_tokens=80,
               prompt_cache=caches)

Realized compression (Llama-3.2-3B-Instruct-4bit, this repo): 8ร— on keys at 2 bits/elem, 16ร— at 1 bit/elem.

Tradeoff to know up front: the paper's CUDA kernel fuses dequantization into attention, eliminating the dequant overhead. That fusion is not portable to Metal โ€” on Apple Silicon you should expect throughput to drop vs fp16 (the win is memory, not speed). See benchmark_scripts/benchmark_vecinfer.py and figures/vecinfer/ for measured numbers.


What's inside

Module Purpose
veloxquant_mlx.quantizers.turboquant_prod Rotation + Lloyd-Max + QJL residual (b-1 + 1 bits)
veloxquant_mlx.quantizers.turboquant_mse Rotation + Lloyd-Max only (no residual correction)
veloxquant_mlx.quantizers.turboquant_rvq Two-pass scalar RVQ (Gaussian + Laplacian codebooks), b=1/2/3+
veloxquant_mlx.quantizers.polarquant Recursive polar coordinate decomposition
veloxquant_mlx.quantizers.qjl Pure 1-bit JL sign sketch
veloxquant_mlx.cache.turboquant_rvq_cache TurboQuantRVQKVCache mlx_lm-compatible cache wrapper
veloxquant_mlx.cache.vecinfer_cache NEW (0.5.0) โ€” VecInferKVCache smooth + Hadamard + product VQ
veloxquant_mlx.allocators.vecinfer NEW (0.5.0) โ€” calibrate_smooth_factors, walsh_hadamard_matrix, train_codebook, quantize_vq
veloxquant_mlx.allocators allocate_bits_ratequant, calibrate_layer_sensitivities
veloxquant_mlx.observers DistortionObserver, LatencyObserver, MemoryObserver, KeyNormObserver (new)
veloxquant_mlx.codebooks ScalarCodebook, Lloyd-Max strategies, AdaptiveScalarCodebook
veloxquant_mlx.preconditioners RotationPreconditioner (QR), HadamardPreconditioner (Metal)
veloxquant_mlx.cache TurboQuantKVCache standalone, mlx_lm KVCache subclasses
veloxquant_mlx.weight QuantizedLinear for model weight quantization
veloxquant_mlx.dsa.bit_pack Sub-byte index packing
veloxquant_mlx.outlier Two-stream cache for high-variance channels

Algorithm guide โ€” which method to pick

Method Bits/dim Per-vector storage (d=128) Quality (cosine) Best for
turboquant_mse b bยทd/8 + 4 B norm 0.86 @ 3b, 0.95 @ 4b Lowest overhead at 3โ€“4 bit
turboquant_prod b-1 + 1 (b-1)ยทd/8 + JL signs + 2 norms 0.86 @ 3b, 0.95 @ 4b Unbiased IP estimator at 3โ€“4 bit
turboquant_rvq @ b=2 2ยทb = 4 64 B 0.98 Functional 2-bit, 3.9ร— compression
turboquant_rvq @ b=1 2ยทb = 2 34 B 0.92 Aggressive 7.5ร— compression โ€” full output on every tested model
polar bยทlevels varies medium Geometric structure, very low bits
qjl 1 d/8 + 2 B norm 0.62 Topology-only retrieval, extreme compression

Rule of thumb:

  • 3โ€“4 bit, maximum compression at uniform precision โ†’ turboquant_mse
  • 3โ€“4 bit, best uniform-precision quality โ†’ turboquant_prod
  • 2 bit (3.9ร— key compression, full coherent output) โ†’ turboquant_rvq with b=2
  • 1 bit (7.5ร— key compression, full coherent output on 7/7 tested models) โ†’ turboquant_rvq with b=1
  • Fractional average bits (mixed-precision) โ†’ turboquant_rvq + allocate_bits_ratequant
  • Ranking-only retrieval, extreme compression โ†’ qjl

Per-model benchmark results

All measurements on Apple M4 MacBook ยท 16/24 GB unified memory ยท Python 3.12. Prompt: structured 200-token explanation of relativity.

v4 results โ€” RVQ 1-bit at 7.5ร— compression (8-model sweep, 0.3.4)

Model fp16 tok/s RVQ 1-bit tok/s tokens vs fp16
Mistral 7B v0.3 23.3 22.2 201/201 95%
Falcon3 7B 24.0 23.1 200/200 96%
Phi-4 11.9 11.8 200/200 99%
Qwen3 4B 40.2 34.3 187/200 85%
Qwen3 8B 20.5 21.1 200/200 103%
Llama 3.1 8B 22.0 21.5 201/201 98%
Gemma3 4B 32.5 30.5 201/201 94%
Qwen2.5 32B 3.7 โ€” โ€” memory-constrained on 24 GB, see docs

Generated by benchmark_scripts/run_outlier_ratequant.py. Source figures: figures/outlier_token_ratequant/<model>/.

v5 results โ€” RateQuant V2 mixed-precision (2-model trial, 0.3.5)

Per-layer allocation via allocate_bits_ratequant at target bฬ„=1.5, measured on Apple M4 24 GB. Source figures: figures/2026-05-16/.

Model fp16 RVQ 1-bit RVQ 1-bit + Outlier RVQ + RateQuant V2 sens. ratio
Falcon3 7B 22.9 23.1 (101%) 22.0 (96%) 22.8 (100%) at 5.22ร— compression 6.48ร—
Gemma3 4B 39.8 37.8 (95%) 34.7 (87%) 36.3 (91%) at 5.22ร— compression 14.39ร—

Per-layer allocations were computed from a 1.6s real-activation calibration: Falcon3 split 14/14 (b=2/b=1); Gemma3 split 3/11/20 (b=3/b=2/b=1).

v6 results โ€” VecInfer 10-model comparative study (0.5.0)

8-config head-to-head across 10 models. Full data and per-model plots in figures/vecinfer/. Cross-model chart: figures/vecinfer/_summary/cross_model_comparison.png.

Key compression ratio:

Model head_dim TQ-2bit RVQ-1bit VecInfer-2bit VecInfer-1bit
SmolLM2-135M 64 6.4ร— 7.1ร— 8.0ร— 16.0ร—
Llama-3.2-1B 64 6.4ร— 7.1ร— 8.0ร— 16.0ร—
Llama-3.2-3B 128 9.1ร— 7.5ร— 8.0ร— 16.0ร—
Llama-3.1-8B 128 9.1ร— 7.5ร— 8.0ร— 16.0ร—
Mistral-7B 128 9.1ร— 7.5ร— 8.0ร— 16.0ร—
Qwen2.5-7B 128 9.1ร— 7.5ร— 8.0ร— 16.0ร—
Qwen3-8B 128 9.1ร— 7.5ร— 8.0ร— 16.0ร—
Phi-4 128 9.1ร— 7.5ร— 8.0ร— 16.0ร—
Falcon3-7B 256 11.6ร— 7.8ร— โ€” (OOM) 16.0ร—
gemma-3-4b 256 11.6ร— 7.8ร— 8.0ร— 16.0ร—

Throughput (tok/s):

Model fp16 TQ-2bit RVQ-1bit VecInfer-2bit VecInfer-1bit
SmolLM2-135M 250.4 70.4 188.5 163.0 175.8
Llama-3.2-1B 105.4 75.5 104.3 60.4 91.2
Llama-3.2-3B 47.6 20.6 46.2 39.7 40.2
Llama-3.1-8B 20.5 19.8 20.6 10.7 19.6
Mistral-7B 23.6 22.5 22.8 21.2 9.8
Qwen2.5-7B 21.0 12.0 20.7 21.3 21.5 โ† exceeds fp16 at 16ร—
Qwen3-8B 20.3 19.4 19.6 17.9 2.4
Phi-4 10.4 9.6 8.1 7.2 4.0
Falcon3-7B 17.3 15.9 21.7 โ€” 17.0
gemma-3-4b 26.0 22.7 24.2 22.6 22.6

VecInfer wins on raw compression (16ร— on every model). RVQ-1bit wins on throughput/memory balance โ€” within 5% of fp16 on most 7โ€“8B models with zero calibration overhead. Qwen2.5-7B is the standout: VecInfer-1bit exceeds fp16 at 16ร— compression, likely due to its strong GQA ratio (28q/4kv heads). See figures/vecinfer/_summary/SUMMARY.md for the full analysis.

Cross-model summary (single-pass quality at 3-bit and 4-bit)

Model Architecture head_dim fp16 tok/s 3-bit quality 4-bit quality
Llama 3.2 3B dense 128 47.2 Repetition Near-lossless
Mistral 7B v0.3 dense 128 22.1 Near-lossless Near-lossless
Falcon3 7B dense 128 22.1 Near-lossless Near-lossless
Qwen3 4B dense 128 38.7 Near-lossless Early stop
Qwen3 8B dense 128 20.6 Partial Partial
Llama 3.1 8B dense 128 21.5 Stops @ 62 Near-lossless
Phi-4 dense 128 โ€“ Near-lossless Near-lossless
Gemma-4 hybrid (35 sliding + 7 full) 512 19.3 Near-lossless Near-lossless
Qwen2.5 32B dense 128 7.1 Near-lossless Near-lossless

Source: per-model benchmark scripts under benchmark_*.py producing 6 figures each in figures/<model>/.

v2 results โ€” with RVQ 2-bit (0.3.0 throughput optimizations active)

Both runs below use the optimized fast path (Hadamard rotation + boundary-sum quantize + cast cleanup + head batching).

Mistral 7B v0.3 โ€” 4-bit weights ยท head_dim=128 ยท 32 layers ยท 8 KV heads

Config Key compression Throughput Tokens Quality
fp16 baseline 1.00ร— 22.1 tok/s 201/201 reference
TQ 2-bit (single-pass) 9.14ร— 22.4 tok/s 201/201 coherent
TQ 3-bit 5.82ร— 22.4 tok/s 201/201 coherent
TQ 4-bit 4.27ร— 21.8 tok/s 201/201 near-lossless
TQ RVQ 2-bit 3.88ร— 22.3 tok/s 201/201 near-lossless

Mistral 7B is memory-bandwidth bound at ~22 tok/s. Every quantized config now matches fp16. Figures: figures/updated_tests/mistral7b/.

Qwen3 4B โ€” 4-bit weights ยท head_dim=128 ยท <think> mode (most quantization-sensitive)

Config Key compression Throughput Tokens Quality
fp16 baseline 1.00ร— 39.2 tok/s 200/200 reference
TQ 2-bit (single-pass) 9.14ร— 31.2 tok/s 174/200 early stop
TQ 3-bit 5.82ร— 30.7 tok/s 172/200 partial
TQ 4-bit 4.27ร— 8.6 tok/s 50/200 <think>-loop
TQ RVQ 2-bit 3.88ร— 36.0 tok/s 199/200 coherent

RVQ 2-bit is the only quantized config that produces near-full coherent output on Qwen3's <think> mode while reaching 92% of fp16 throughput. Figures: figures/updated_tests/qwen3_4b/.

Llama 3.1 8B Instruct (4-bit) โ€” head_dim=128 ยท 32 layers ยท 8 KV heads

Config Key compression Throughput Tokens Quality
fp16 baseline 1.00ร— 21.5 tok/s 201/201 reference
TQ 2-bit (single-pass) 9.14ร— 16.3 tok/s 187/201 broken
TQ 3-bit 5.82ร— 13.9 tok/s 62/201 repetition
TQ 4-bit 4.27ร— 14.8 tok/s 201/201 near-lossless

v2 (RVQ 2-bit) not yet benchmarked for this model. Figures: figures/llama31_8b/.


Throughput optimization journey

The 0.3.0 release lifts quantized throughput to fp16 parity. Four sequential changes, each independently benchmarked:

Stage Mistral 7B RVQ 2-bit Qwen3 4B RVQ 2-bit
0. Original (per-head Python loop) 17.7 tok/s 24.8 tok/s
1. Batch heads (B,H,S,D) โ†’ (BยทHยทS,D) 21.5 tok/s 34.0 tok/s
2. Hadamard rotation by default 20.0 tok/s โ€“
3. Boundary-sum quantize (replaces argmin) 22.4 tok/s โ€“
4. Drop redundant fp32โ†”fp16 casts 22.3 tok/s 36.0 tok/s

Quality verified at every step โ€” RVQ cosine 0.9766 unchanged, 100% index match on boundary-sum vs argmin, full token completion preserved on real models.

Full writeup: OPTIMIZATION_FINDINGS.md. Stage-by-stage figure: figures/updated_tests/optimization_journey.png.


Architecture

The pipeline uses a Chain of Responsibility pattern. Each handler mutates a QuantizationContext and passes it downstream:

TurboQuantProd pipeline
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
  x (fp16, batch ร— d)
       โ”‚
  Normalize โ†’ Rotate (ฮ ) โ†’ Scalar quantize โ†’ QJL residual sketch โ†’ BitPack
       โ”‚
  EncodedVector(indices, signs, residual_norm)

TurboQuantRVQ pipeline (NEW)
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
  x (fp16, batch ร— d)
       โ”‚
  Rotate (ฮ ) โ†’ Stage-1 quantize (Gaussian Lloyd-Max, b bits)
            โ†’ Compute residual rโ‚ = y โˆ’ ลทโ‚
            โ†’ Stage-2 quantize (Laplacian Lloyd-Max, b bits) โ†’ idxโ‚‚
       โ”‚
  EncodedVector(idxโ‚, idxโ‚‚)
       โ”‚
  Decode: ลท = ลทโ‚ + ลทโ‚‚ โ†’ unrotate

Design patterns used (10): Abstract Base Classes, Factory, Chain of Responsibility, Builder, Strategy, Registry + Plugin, Composite, Observer, DAO, Custom DSA (RingBuffer, MaxHeap, BitPackBuffer, VoronoiTree).


CLI

Precompute artifacts (rotation matrices, JL matrices, codebooks)

python -m veloxquant_mlx precompute \
    --head_dim 128 --bits 1 2 3 4 --jl_dim 128 --seed 42 \
    --output_dir ./artifacts/

Then pass an NpyArtifactStore to the builder to load instead of recompute:

from veloxquant_mlx.artifacts import NpyArtifactStore
cache = (KVCacheBuilder()
    .with_method("turboquant_rvq")
    .with_head_dim(128).with_bit_width(inlier=2)
    .with_artifact_store(NpyArtifactStore("./artifacts/"))
    .build())

Benchmark a single configuration

python -m veloxquant_mlx benchmark \
    --method turboquant_rvq --head_dim 128 --bits 2 --seq_len 1000

Benchmark a real model end-to-end

python benchmark_mistral7b_v2.py            # 5 configs incl. RVQ 2-bit
python benchmark_qwen3_4b_v2.py             # โ†ณ outputs to figures/updated_tests/<model>/
python benchmark_<model>.py                 # original 4-config script (figures/<model>/)

Development

# Tests
pytest veloxquant_mlx/tests/ -v

# 2-bit improvement validation (synthetic, fast)
python test_2bit_improvements.py

# Generate optimization-journey figure
python scripts/plot_optimization_journey.py

References

Implemented in this library

  • TurboQuant (ICLR 2026) โ€” Zandieh et al., "Online Vector Quantization with Near-optimal Distortion Rate"
  • RateQuant (2025) โ€” "RateQuant: Mixed-Precision KV Cache Quantization via Rate-Distortion Theory"
  • PolarQuant (AISTATS 2026) โ€” "PolarQuant: Quantizing KV Caches with Polar Transformation"
  • QJL (2024) โ€” Zandieh et al., "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization"

Related work โ€” quantization

  • KIVI (ICML 2024) โ€” Liu et al., "A Tuning-Free Asymmetric 2-Bit Quantization for KV Cache"
  • KVQuant (NeurIPS 2024) โ€” Hooper et al., "Towards 10 Million Context Length LLM Inference with KV Cache Quantization"
  • Coupled Quantization (NeurIPS 2024) โ€” Zhang et al., "KV Cache is 1 Bit Per Channel: Efficient LLM Inference with Coupled Quantization"
  • KVTuner (ICML 2025) โ€” Li et al., "Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization"
  • MixKVQ (2024) โ€” Zhang et al., "Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning"
  • VecInfer (2024) โ€” Yao et al., "Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization"
  • FibQuant (2025) โ€” "Universal Vector Quantization for Random-Access KV-Cache Compression"

Related work โ€” token eviction & sparse attention

  • SnapKV (2024) โ€” Li et al., "LLM Knows What You are Looking for Before Generation"
  • PyramidKV (2024) โ€” Cai et al., "Dynamic KV Cache Compression based on Pyramidal Information Funneling"
  • RocketKV (ICML 2025) โ€” Behnam et al., "Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression"
  • MagicPIG (ICLR 2025 Spotlight) โ€” Chen et al., "LSH Sampling for Efficient LLM Generation"

Related work โ€” low-rank & cross-layer compression

Survey

Framework


License

MIT โ€” see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

veloxquant_mlx-0.5.1.tar.gz (119.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

veloxquant_mlx-0.5.1-py3-none-any.whl (157.6 kB view details)

Uploaded Python 3

File details

Details for the file veloxquant_mlx-0.5.1.tar.gz.

File metadata

  • Download URL: veloxquant_mlx-0.5.1.tar.gz
  • Upload date:
  • Size: 119.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for veloxquant_mlx-0.5.1.tar.gz
Algorithm Hash digest
SHA256 788c3fad569b5759c4ec182190830022128b709fca4eb3466ae76948efa68799
MD5 cb9496d6c190bdff1377e7c407d9c250
BLAKE2b-256 edf739c2d51f9be82fab6bbf7e20bc3d7b8666f26dec3cf42d5aead29c3789de

See more details on using hashes here.

File details

Details for the file veloxquant_mlx-0.5.1-py3-none-any.whl.

File metadata

  • Download URL: veloxquant_mlx-0.5.1-py3-none-any.whl
  • Upload date:
  • Size: 157.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for veloxquant_mlx-0.5.1-py3-none-any.whl
Algorithm Hash digest
SHA256 e4211c9042c0a00d54e86aa49b7f1c8261a817b73228ea22f98cc2bb34034ad5
MD5 a3f77f349a44d74715d6a3fb85b5aae8
BLAKE2b-256 44106e9fd479e4858e021f2fcfe2c73b9d380c1f464a0fce306d52fcdd46d3ab

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page