Fast KV cache quantization for Apple Silicon — TurboQuant, RVQ, VecInfer (with Metal kernels + Phase-2 fused SDPA), RateQuant, PolarQuant, SpectralQuant, and QJL in MLX

VeloxQuant-MLX

Fast KV Cache Quantization for Apple Silicon
TurboQuant · RVQ · VecInfer · RateQuant · PolarQuant · QJL · SpectralQuant — in MLX

A KV-cache compression library for mlx_lm that compresses the key tensor by up to 16× with near-lossless quality on Apple M-series chips. It ships seven quantization strategies, from zero-calibration 1-bit RVQ to the new SpectralQuant, which exploits the low-dimensional structure of key vectors to reach 5.95× compression at higher quality than TurboQuant. Hand-written Metal compute kernels make the quantize hot path 13× faster and cut peak memory by 98% at long context lengths. Plug it in with three lines; mlx_lm.generate runs unchanged.

Numbers that matter

Metric	Value	Notes
Max key cache compression	16×	VecInfer-1bit, head_dim=128
Metal kernel speedup	13×	`quantize_vq` at S=2048+
Peak memory reduction	98%	729 MB → 12 MB, Falcon3-7B shape
RVQ-1bit compression	7.5×	Near-zero throughput cost
FP16 throughput retained	100%	Qwen2.5-7B at 16× compression
SpectralQuant compression	5.95×	vs TurboQuant 5.02× — same bit-width
SpectralQuant cosine sim	+3pp	over TurboQuant on Qwen2.5-0.5B
Production models validated	12	Llama, Mistral, Qwen, Phi, Gemma 3/4

Table of contents

Installation
Quickstart
SpectralQuant — new in 0.6.0
RateQuant — per-layer mixed precision
VecInfer — 16× product VQ
Metal kernels
Benchmark results
Algorithm guide
What's inside
Architecture
CLI
Development
References

Installation

pip install VeloxQuant-MLX

Requirements: Apple Silicon M1+, Python ≥ 3.11, MLX ≥ 0.18, NumPy ≥ 1.26.

Install from source

git clone https://github.com/rajveer43/VeloxQuant-MLX
cd VeloxQuant-MLX
pip install -e ".[dev]"

Quickstart

RVQ 1-bit — 7.5× compression, no calibration (recommended)

import mlx_lm
from veloxquant_mlx import KVCacheBuilder, KVCacheConfig

model, tokenizer = mlx_lm.load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

config = KVCacheConfig(method="turboquant_rvq", bit_width_inlier=1, seed=42)
caches = KVCacheBuilder.for_model(model, config)
model.make_cache = lambda *_a, **_k: caches

response = mlx_lm.generate(model, tokenizer,
    prompt="Explain the theory of relativity in simple terms.",
    max_tokens=200,
)

VecInfer 1-bit — 16× compression, Metal kernels auto-detected

import mlx_lm
from veloxquant_mlx import KVCacheConfig, KVCacheFactory
from veloxquant_mlx.allocators.vecinfer import calibrate_smooth_factors, train_codebook

model, tokenizer = mlx_lm.load("mlx-community/Qwen2.5-7B-Instruct-4bit")

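# One-time offline calibration; save the results and reuse them.
# sample_keys / sample_keys_flat are placeholders for key activations you
# collected from a calibration run (collection code not shown here).
smooth   = calibrate_smooth_factors(sample_keys)
codebook = train_codebook(sample_keys_flat, n_centroids=256, sub_dim=8)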

config = KVCacheConfig(
    method="vecinfer",
    head_dim=128,
    key_codebook_bits=8,
    key_sub_dim=8,
    smooth_factors=smooth,
    key_codebook=codebook,
    use_metal_kernels=None,   # None=auto-detect, True=require, False=forbid
)
caches = KVCacheFactory.create_for_model(model, config)

response = mlx_lm.generate(model, tokenizer,
    prompt="Write a 5,000-word analysis of the RLHF literature.",
    max_tokens=5000,
    prompt_cache=caches,
)

RateQuant — mixed precision per layer

from veloxquant_mlx import (
    KVCacheBuilder, KVCacheConfig,
    calibrate_layer_sensitivities,
    allocate_bits_ratequant,
)

# Step 1: one-time probe (~1.6 s) on real activations; model and tokenizer are loaded as in the examples above
weights = calibrate_layer_sensitivities(model, tokenizer)

# Step 2 — closed-form reverse-waterfilling allocation
alloc = allocate_bits_ratequant(weights, target_avg_bits=1.5, beta=3.5)
# e.g. [1, 2, 1, 1, 3, 1, 2, ...]  — one int per layer

# Step 3 — build per-layer caches
config = KVCacheConfig(method="turboquant_rvq", bit_width_inlier=alloc, seed=42)
caches = KVCacheBuilder.for_model(model, config)

SpectralQuant — new in 0.6.0

SpectralQuant implements "3% Is All You Need: Breaking TurboQuant's Compression Limit via Spectral Structure". The paper's key observation is that KV cache keys concentrate about 96% of their variance in just 3–4% of their dimensions, a pattern it reports across transformer architectures. SpectralQuant exploits this by rotating keys into their eigenvector basis before quantization, so bits are no longer spent on noise dimensions.

Three changes over TurboQuant:

Eigenvector rotation instead of random Hadamard — aligns signal dimensions first
Separate codebooks for signal dims (d_s ≈ 4) and noise dims (d − d_s)
No QJL on noise dims — applying QJL there injects variance without reducing bias, hurting quality
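The rotation idea can be illustrated on synthetic data. The sketch below is a NumPy illustration under stated assumptions, not the library's `SpectralQuantizer`: the keys are synthetic (a strong 4-dimensional signal plus weak isotropic noise), every variable name is made up for the example, and the participation ratio uses the standard definition (Σλ)² / Σλ². It rotates the keys into the eigenvector basis of their covariance and measures how much variance lands in the first d_s dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, d_s = 128, 4096, 4  # head_dim, sample count, assumed signal dims

# Synthetic keys (assumption): strong signal in a random 4-dim subspace
# plus weak isotropic noise in all 128 dims.
basis, _ = np.linalg.qr(rng.normal(size=(d, d)))
signal = rng.normal(size=(n, d_s)) * 5.0
noise = rng.normal(size=(n, d)) * 0.2
keys = signal @ basis[:, :d_s].T + noise

# Eigendecomposition of the key covariance; eigh returns ascending order,
# so re-sort to put the highest-variance directions first.
cov = np.cov(keys, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rotate into the eigenvector basis: signal dims now come first.
rotated = (keys - keys.mean(axis=0)) @ eigvecs
var = rotated.var(axis=0)
signal_fraction = var[:d_s].sum() / var.sum()

# Participation ratio: an effective count of dimensions carrying variance.
participation_ratio = eigvals.sum() ** 2 / (eigvals ** 2).sum()

print(f"variance in first {d_s} dims: {signal_fraction:.1%}")
print(f"participation ratio: {participation_ratio:.2f} of {d} dims")
```

After the rotation, a quantizer can give the first d_s coordinates their own codebook and treat the remaining d − d_s coordinates as noise, which is the split the list above describes.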

import mlx.core as mx
import mlx_lm
from veloxquant_mlx.spectral import calibrate_spectral_rotation
from veloxquant_mlx.cache.spectral_cache import SpectralQuantKVCache
from veloxquant_mlx.cache.base import KVCacheConfig

model, tokenizer = mlx_lm.load("mlx-community/Llama-3.1-8B-Instruct-4bit")

# One-time calibration (~5s on 512 tokens).
# calibration_text is a placeholder: supply representative text of your own.
tokens = mx.array(tokenizer.encode(calibration_text)[:512])[None]
rotations = calibrate_spectral_rotation(model, tokens, model_name="llama31_8b")

# Build one calibrated cache per layer
cfg = KVCacheConfig(method="spectral", head_dim=128, bit_width_inlier=3)
caches = [SpectralQuantKVCache(cfg) for _ in range(model.args.num_hidden_layers)]
for i, cache in enumerate(caches):
    if i in rotations:
        cache.calibrate(rotations[i])

# Pass the caches to generate; otherwise the calibrated caches are never used
response = mlx_lm.generate(model, tokenizer,
    prompt="Explain the transformer architecture.",
    max_tokens=500,
    prompt_cache=caches,
)

Results on real models (3-bit, d_s=auto-calibrated):

Model	SpectralQuant noQJL	TurboQuant 3-bit	Δ cosim	SQ ratio
Qwen2.5-0.5B	0.9072	0.8329	+7.4pp	5.33×
Gemma 4 4B	0.8625	0.7581	+10.4pp	5.33×

Calibration required — a one-time ~5–30s pass over 512 representative tokens. Save and reuse with save_rotations / load_cached_rotations. Run python scripts/run_spectral_quant_eval.py --model <name> to generate all benchmark figures.

RateQuant — per-layer mixed precision

RateQuant (arxiv:2605.06675) allocates more bits to high-sensitivity layers and fewer to low-sensitivity ones via Theorem 2 reverse-waterfilling, with the average held at a user-chosen target. A sensitivity ratio above ~2× indicates measurable gains over uniform allocation.

Model	Sensitivity ratio	Allocation	Result
Falcon3-7B (28 layers, head_dim=256)	6.48×	14 × b=2, 14 × b=1	100% fp16 at 5.22× compression
Gemma3-4B (34 layers, head_dim=256)	14.39×	3 × b=3, 11 × b=2, 20 × b=1	91% fp16 at 5.22× compression

What's not yet implemented from the paper: per-head allocation, gradient-based sensitivity, K/V separation. Per-layer already captures most of the benefit at ≥1.5 bits.
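The budgeted allocation idea can be sketched with a generic greedy allocator. This is not the paper's Theorem 2 closed form and not the library's `allocate_bits_ratequant`: it assumes, for illustration only, that a layer's distortion falls as weight × 4^(−bits), and the function name and parameters are hypothetical. The last lines check that the allocations in the table above average to the 1.5-bit target.

```python
import heapq

def greedy_bit_allocation(weights, target_avg_bits, min_bits=1, max_bits=4):
    """Hypothetical helper: spend a fixed bit budget where it helps most."""
    n = len(weights)
    bits = [min_bits] * n
    budget = round(target_avg_bits * n) - min_bits * n  # extra bits to hand out

    def gain(i):
        # Assumed distortion model: weight * 4**(-bits). The gain is the
        # distortion removed by giving layer i one more bit.
        b = bits[i]
        return weights[i] * (4.0 ** -b - 4.0 ** -(b + 1))

    heap = [(-gain(i), i) for i in range(n)]  # max-heap via negated gains
    heapq.heapify(heap)
    while budget > 0 and heap:
        _, i = heapq.heappop(heap)
        bits[i] += 1
        budget -= 1
        if bits[i] < max_bits:
            heapq.heappush(heap, (-gain(i), i))
    return bits

# Toy sensitivities (made up): two sensitive layers, two insensitive ones.
alloc = greedy_bit_allocation([6.0, 1.0, 5.0, 1.0], target_avg_bits=1.5)
print(alloc)

# Sanity check on the table above: both allocations average 1.5 bits.
falcon = [2] * 14 + [1] * 14
gemma = [3] * 3 + [2] * 11 + [1] * 20
print(sum(falcon) / len(falcon), sum(gemma) / len(gemma))
```

The greedy loop always respects the budget exactly, which mirrors the "average held at a user-chosen target" property; how sensitivities map to bits in the real allocator may differ.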

VecInfer — 16× product VQ

VecInfer (Yao et al. 2025, arxiv:2510.06175) applies a dual transform to keys before product VQ encoding:

Smooth scaling — per-channel λ = √(max|K|) suppresses outlier magnitudes
Walsh-Hadamard rotation — spreads energy uniformly across all dims
K-means product VQ — encode sub-vectors against a calibrated codebook

The inverse transform is absorbed into the queries, so the transform itself leaves q @ K.T unchanged and only the VQ step introduces error. At 1 bit/elem, a 128-dim key takes 16 bytes instead of the 256 bytes of fp16: 16× compression.
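The score-preservation claim and the byte count can both be checked numerically. The NumPy sketch below is an illustration under assumptions, not the library's code: it assumes keys are divided by λ and queries multiplied by λ, it builds a normalized Sylvester Hadamard matrix explicitly instead of using a fast transform, and it skips the VQ step so that only the transform is tested. All names are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Normalized Sylvester Hadamard matrix (n must be a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

rng = np.random.default_rng(0)
head_dim, seq_len = 128, 64
K = rng.normal(size=(seq_len, head_dim))
q = rng.normal(size=(head_dim,))

# Smooth scaling: per-channel lambda = sqrt(max |K|) over the sequence.
lam = np.sqrt(np.abs(K).max(axis=0))
H = hadamard(head_dim)

K_t = (K / lam) @ H  # transformed keys (this is what would be quantized)
q_t = (q * lam) @ H  # inverse scaling absorbed into the query

scores_ref = K @ q
scores_t = K_t @ q_t
print(np.allclose(scores_ref, scores_t))

# Storage at 1 bit/elem: 8-dim sub-vectors, one 8-bit codebook index each.
sub_dim, codebook_bits = 8, 8
packed_bytes = (head_dim // sub_dim) * codebook_bits // 8
fp16_bytes = head_dim * 2
print(packed_bytes, fp16_bytes, fp16_bytes // packed_bytes)
```

The scores match because the Hadamard matrix is orthogonal and the per-channel scale cancels between key and query; any remaining error in practice comes from replacing each sub-vector with its nearest centroid.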

Standout result: Qwen2.5-7B VecInfer-1bit exceeds fp16 throughput at 16× compression, likely due to its strong GQA ratio (28q/4kv heads).

Metal kernels — new in 0.5.1

The VecInfer quantize_vq hot path is now a 30-line Metal Shading Language shader, JIT-compiled by mx.fast.metal_kernel on first use. Same Python API — no changes required.

Figure: Metal kernel benchmark. Left: quantize latency. Center: speedup factor. Right: peak memory. Benchmarked on Apple Silicon GPU.

Metric	Pure MLX	Metal kernel	Delta
Quantize latency (S=8192)	228 ms	15.6 ms	14.7× faster
Peak memory (Falcon3-7B shape)	729 MB	12 MB	98% reduction
API change required	—	None	`use_metal_kernels=None` auto-detects

Why the memory win: the [N, n_centroids, sub_dim] diff tensor is never materialised — the argmin accumulator lives entirely in thread-local registers.

Honest caveat: the kernel pays a ~50–200 µs launch overhead per call. On tiny models (SmolLM2-135M, ~60 launches/token) that overhead can exceed the savings. Built for the regime that needs it: 7B+ models at realistic context lengths.
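A back-of-envelope check makes the caveat concrete. It uses only numbers quoted in this README (about 60 launches per token, 50–200 µs per launch, and the 250.4 tok/s fp16 figure for SmolLM2-135M from the benchmark table) and assumes, as a simplification, that launch overheads add up linearly.

```python
launches_per_token = 60          # SmolLM2-135M, from the caveat above
overhead_us = (50, 200)          # per-launch overhead range
fp16_tok_per_s = 250.4           # SmolLM2-135M fp16 throughput from the benchmark table

# Total launch overhead per generated token, in milliseconds.
overhead_ms = tuple(launches_per_token * us / 1000 for us in overhead_us)

# Time the fp16 baseline spends on one token, in milliseconds.
token_budget_ms = 1000 / fp16_tok_per_s

print(f"launch overhead per token: {overhead_ms[0]:.1f}-{overhead_ms[1]:.1f} ms")
print(f"fp16 time per token:       {token_budget_ms:.2f} ms")
```

Under this simplification the launch overhead alone is comparable to, or larger than, the whole fp16 per-token time on such a small model, which is why the kernels target larger models.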

The full 30-line Metal kernel

// One thread per sub-vector. Argmin lives in registers — no diff tensor.
uint vec_idx  = thread_position_in_grid.x;
uint N_total  = x_shape[0];
if (vec_idx >= N_total) { return; }

uint n_centroids = codebook_shape[0];
uint sub_dim     = codebook_shape[1];
uint x_base      = vec_idx * sub_dim;

float best_dist = INFINITY;
uint  best_idx  = 0;

for (uint c = 0; c < n_centroids; ++c) {
    uint  cb_base = c * sub_dim;
    float dist    = 0.0f;
    for (uint i = 0; i < sub_dim; ++i) {
        float d = float(x[x_base + i]) - float(codebook[cb_base + i]);
        dist += d * d;
    }
    if (dist < best_dist) { best_dist = dist; best_idx = c; }
}

out[vec_idx] = best_idx;
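For parity testing, the kernel's logic can be mirrored in NumPy. The sketch below is a reference under assumptions, not the library's `quantize_vq` (whose signature is not shown here); both function names are hypothetical. The naive version materialises the [N, n_centroids, sub_dim] diff tensor, while the streaming version keeps only a running best distance and index per sub-vector, as the shader does in registers.

```python
import numpy as np

def quantize_vq_naive(x, codebook):
    # Materialises the full [N, n_centroids, sub_dim] difference tensor.
    diff = x[:, None, :] - codebook[None, :, :]
    return np.argmin((diff * diff).sum(axis=-1), axis=1).astype(np.uint32)

def quantize_vq_streaming(x, codebook):
    # Running argmin: O(N) extra memory, one centroid at a time.
    n = x.shape[0]
    best_dist = np.full(n, np.inf)
    best_idx = np.zeros(n, dtype=np.uint32)
    for c in range(codebook.shape[0]):
        d = x - codebook[c]
        dist = (d * d).sum(axis=1)
        better = dist < best_dist  # strict '<' keeps the first minimum
        best_dist[better] = dist[better]
        best_idx[better] = c
    return best_idx

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 8))         # 1024 sub-vectors, sub_dim=8
codebook = rng.normal(size=(256, 8))   # 256 centroids

idx_naive = quantize_vq_naive(x, codebook)
idx_stream = quantize_vq_streaming(x, codebook)
print(np.array_equal(idx_naive, idx_stream))
```

The naive version needs N × n_centroids × sub_dim temporaries, which is the tensor the Metal kernel avoids; the streaming version needs only two length-N arrays.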

Read the full writeup: MEDIUM_BLOG_METAL_KERNELS.md

Benchmark results

10-model comparative study — VecInfer vs RVQ (v0.5.0)

Figure: Cross-model comparison of VecInfer vs RVQ-1bit across 10 models. End-to-end mlx_lm.generate, 200-token prompt, 120-token generation, Apple M-series unified memory.

Compression ratio:

Model	RVQ-1bit	VecInfer-1bit
SmolLM2-135M	7.1×	16×
Llama-3.2-1B	7.1×	16×
Llama-3.2-3B	7.5×	16×
Llama-3.1-8B	7.5×	16×
Mistral-7B	7.5×	16×
Qwen2.5-7B	7.5×	16×
Qwen3-8B	7.5×	16×
Phi-4	7.5×	16×
Falcon3-7B	7.8×	16×
gemma-3-4b	7.8×	16×

Throughput (tok/s):

Model	fp16	RVQ-1bit	VecInfer-1bit
SmolLM2-135M	250.4	188.5	175.8
Llama-3.2-1B	105.4	104.3	91.2
Llama-3.2-3B	47.6	46.2	40.2
Llama-3.1-8B	20.5	20.6	19.6
Mistral-7B	23.6	22.8	9.8
Qwen2.5-7B	21.0	20.7	21.5 ⬆ exceeds fp16 at 16×
Qwen3-8B	20.3	19.6	2.4
Phi-4	10.4	8.1	4.0
Falcon3-7B	17.3	21.7	17.0
gemma-3-4b	26.0	24.2	22.6

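RVQ-1bit is the safe default: within 5% of fp16 on most 7–8B models with zero calibration. VecInfer-1bit always wins on memory (16×), but its throughput is model-dependent: it edges past fp16 on Qwen2.5-7B, stays within about 13% of fp16 on gemma-3-4b, and drops sharply on Mistral-7B, Qwen3-8B, and Phi-4.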

Throughput optimisation journey (v0.3.0)

Four sequential changes to lift quantized throughput to fp16 parity:

Stage	Mistral-7B RVQ-2bit	Qwen3-4B RVQ-2bit
0. Original (per-head Python loop)	17.7 tok/s	24.8 tok/s
1. Batch heads `(B,H,S,D) → (B·H·S,D)`	21.5 tok/s	34.0 tok/s
2. Hadamard rotation by default	20.0 tok/s	—
3. Boundary-sum quantize (replaces argmin)	22.4 tok/s	—
4. Drop redundant fp32↔fp16 casts	22.3 tok/s	36.0 tok/s

Full writeup: OPTIMIZATION_FINDINGS.md
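Stages 1 and 3 of the table can be illustrated in a few lines of NumPy. This is a sketch with assumed values, not the library's code: the four centroids are illustrative rather than the actual Lloyd-Max tables, and the variable names are made up. It flattens a (B, H, S, D) tensor into (B·H·S, D) in one reshape, then shows that counting how many decision boundaries a value exceeds gives the same index as an argmin over sorted centroids.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 16, 8))   # (B, H, S, D)

# Stage 1: batch all heads and positions instead of looping per head.
flat = x.reshape(-1, x.shape[-1])    # (B*H*S, D)

# Illustrative sorted scalar codebook (2 bits = 4 levels).
centroids = np.array([-1.5, -0.5, 0.5, 1.5])
# Decision boundaries are the midpoints between neighbouring centroids.
boundaries = (centroids[:-1] + centroids[1:]) / 2

# Argmin quantize: compare against every centroid, then reduce.
idx_argmin = np.abs(flat[..., None] - centroids).argmin(axis=-1)

# Stage 3: boundary-sum quantize: the index is the number of boundaries exceeded.
idx_boundary = (flat[..., None] > boundaries).sum(axis=-1)

print(flat.shape, np.array_equal(idx_argmin, idx_boundary))
```

The boundary-sum form needs one fewer comparison per level and no distance computation, which is one plausible reason it is cheaper than argmin; it relies on the codebook being sorted.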

RateQuant V2 mixed-precision results (v0.3.5)

Per-layer allocation at target b̄=1.5, measured on Apple M4 24 GB.

Model	fp16	RVQ-1bit	RVQ + RateQuant V2	Sens. ratio
Falcon3-7B	22.9	23.1 (101%)	22.8 (100%) at 5.22×	6.48×
Gemma3-4B	39.8	37.8 (95%)	36.3 (91%) at 5.22×	14.39×

Source figures: figures/2026-05-16/

RVQ 1-bit 8-model sweep (v0.3.4)

All on Apple M4 MacBook 16/24 GB. Prompt: 200-token explanation of relativity.

Model	fp16 tok/s	RVQ-1bit tok/s	vs fp16
Mistral-7B v0.3	23.3	22.2	95%
Falcon3-7B	24.0	23.1	96%
Phi-4	11.9	11.8	99%
Qwen3-4B	40.2	34.3	85%
Qwen3-8B	20.5	21.1	103%
Llama-3.1-8B	22.0	21.5	98%
Gemma3-4B	32.5	30.5	94%

Source figures: figures/outlier_token_ratequant/

Algorithm guide

Method	Bits/dim	Compression	Quality (cosine)	Calibration	Best for
`turboquant_mse`	b	~9× @ 2b	0.86 @ 3b	None	Lowest overhead at 3–4 bit
`turboquant_prod`	b	~9× @ 2b	0.95 @ 4b	None	Unbiased IP estimator at 3–4 bit
`turboquant_rvq` @ b=1	2	7.5×	0.92	None	Default — full output on all 12 tested models
`turboquant_rvq` @ b=2	4	3.9×	0.98	None	2-bit with near-lossless quality
`turboquant_rvq` + RateQuant	1.5 avg	5.2×	≈0.96	1.6s	Heterogeneous layer sensitivity
`vecinfer` @ 1-bit	1	16×	model-dependent	Codebook	Max compression, strong-GQA models
`spectral` @ b=3	3	5.33×	0.91 (Qwen2.5)	~5s once	Best quality-per-bit, any model
`polar`	b×levels	varies	medium	None	Geometric key distributions
`qjl`	1	~16×	0.62	None	Ranking-only retrieval, extreme compression

Quick decision:

No calibration, best default → turboquant_rvq b=1
Max compression, Qwen2.5/Gemma → vecinfer 1-bit
Best quality at moderate compression → spectral b=3 (requires ~5s calibration)
Heterogeneous layers (sens. ratio >2×) → RateQuant on top of RVQ
2-bit, near-lossless → turboquant_rvq b=2

What's inside

Module	Purpose
`veloxquant_mlx/spectral/spectral_quant`	`SpectralQuantizer` — eigenvector rotation + signal/noise codebooks, b=3
`veloxquant_mlx/spectral/calibrate`	`calibrate_spectral_rotation`, `calibrate_from_vectors`, on-disk rotation cache
`veloxquant_mlx/spectral/bit_allocator`	`water_fill_bits` — water-filling bit allocation per eigenvalue
`veloxquant_mlx/spectral/participation_ratio`	`compute_participation_ratio`, `compute_spectral_gap`
`veloxquant_mlx/quantizers/turboquant_rvq`	Two-pass scalar RVQ — Gaussian + Laplacian codebooks, b=1/2/3+
`veloxquant_mlx/quantizers/turboquant_prod`	Rotation + Lloyd-Max + QJL residual (b-1 + 1 bits)
`veloxquant_mlx/quantizers/turboquant_mse`	Rotation + Lloyd-Max, no residual correction
`veloxquant_mlx/quantizers/polarquant`	Recursive polar coordinate decomposition
`veloxquant_mlx/quantizers/qjl`	Pure 1-bit Johnson-Lindenstrauss sign sketch
`veloxquant_mlx/cache/vecinfer_cache`	`VecInferKVCache` — smooth + Hadamard + product VQ
`veloxquant_mlx/cache/turboquant_rvq_cache`	`TurboQuantRVQKVCache` — mlx_lm-compatible wrapper
`veloxquant_mlx/allocators/vecinfer`	`calibrate_smooth_factors`, `train_codebook`, `quantize_vq`
`veloxquant_mlx/allocators`	`allocate_bits_ratequant`, `calibrate_layer_sensitivities`
`veloxquant_mlx/metal`	Hand-written Metal MSL kernels, JIT via `mx.fast.metal_kernel`
`veloxquant_mlx/preconditioners`	`RotationPreconditioner` (QR), `HadamardPreconditioner`
`veloxquant_mlx/observers`	`DistortionObserver`, `LatencyObserver`, `MemoryObserver`, `KeyNormObserver`
`veloxquant_mlx/codebooks`	`ScalarCodebook`, Lloyd-Max strategies, `AdaptiveScalarCodebook`
`veloxquant_mlx/dsa/bit_pack`	Sub-byte index packing
`veloxquant_mlx/outlier`	Two-stream cache for high-variance channels
`veloxquant_mlx/weight`	`QuantizedLinear` for model weight quantization

Architecture

Pipeline diagrams & design patterns

TurboQuantRVQ pipeline:

x (fp16, batch × d)
     │
Rotate (Π)
     │
Stage-1 quantize  (Gaussian Lloyd-Max, b bits)  →  idx₁
     │
Compute residual  r₁ = y − ŷ₁
     │
Stage-2 quantize  (Laplacian Lloyd-Max, b bits) →  idx₂
     │
EncodedVector(idx₁, idx₂)
     │
Decode: ŷ = ŷ₁ + ŷ₂  →  unrotate
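The two-stage pipeline above can be sketched end to end in NumPy. This is a simplified illustration, not `TurboQuantRVQKVCache`: it uses 1 bit per stage, a random QR rotation, and a data-fitted ±mean(|v|) codebook at each stage in place of the Gaussian and Laplacian Lloyd-Max tables. The helper name and variables are hypothetical.

```python
import numpy as np

def one_bit_stage(v):
    """1-bit scalar quantizer: sign bit plus a shared magnitude."""
    scale = np.abs(v).mean()              # best magnitude for a sign codebook
    idx = (v > 0).astype(np.uint8)
    recon = np.where(idx == 1, scale, -scale)
    return idx, recon

def mean_cosine(a, b):
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float((num / den).mean())

rng = np.random.default_rng(0)
d = 128
x = rng.normal(size=(256, d))                  # stand-in for key vectors
rot, _ = np.linalg.qr(rng.normal(size=(d, d))) # random orthogonal rotation

y = x @ rot                     # Rotate
idx1, y1 = one_bit_stage(y)     # Stage-1 quantize
r1 = y - y1                     # Compute residual
idx2, y2 = one_bit_stage(r1)    # Stage-2 quantize on the residual

x_hat1 = y1 @ rot.T             # decode using stage 1 only
x_hat2 = (y1 + y2) @ rot.T      # decode: sum both stages, then unrotate

mse1 = float(((x - x_hat1) ** 2).mean())
mse2 = float(((x - x_hat2) ** 2).mean())
cos1, cos2 = mean_cosine(x, x_hat1), mean_cosine(x, x_hat2)
print(f"stage 1 only: mse={mse1:.3f} cos={cos1:.3f}")
print(f"both stages:  mse={mse2:.3f} cos={cos2:.3f}")
```

Each stage stores one index bit per dimension, so this configuration costs 2 bits per dimension, matching the `turboquant_rvq` @ b=1 row of the algorithm guide.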

VecInfer pipeline:

x (fp16, B × H × S × D)
     │
Smooth scale  (λᵢ = √max|Kᵢ|, per channel)
     │
Walsh-Hadamard rotation  O(d log d)
     │
K-means product VQ  (sub-vectors against codebook)
     │
Packed indices  →  16× smaller than fp16 keys

Design patterns used (10): Abstract Base Classes, Factory, Chain of Responsibility, Builder, Strategy, Registry + Plugin, Composite, Observer, DAO, Custom DSA (RingBuffer, MaxHeap, BitPackBuffer, VoronoiTree).

CLI

# Precompute rotation matrices, JL matrices, codebooks
python -m veloxquant_mlx precompute \
    --head_dim 128 --bits 1 2 3 4 --jl_dim 128 --seed 42 \
    --output_dir ./artifacts/

# Synthetic benchmark — single config
python -m veloxquant_mlx benchmark \
    --method turboquant_rvq --head_dim 128 --bits 2 --seq_len 1000

# End-to-end model benchmarks
python benchmark_scripts/benchmark_vecinfer.py   # VecInfer 10-model sweep
python benchmark_scripts/run_outlier_ratequant.py # RateQuant mixed-precision

Load precomputed artifacts to skip re-computation at runtime:

from veloxquant_mlx import KVCacheBuilder
from veloxquant_mlx.artifacts import NpyArtifactStore

cache = (KVCacheBuilder()
    .with_method("turboquant_rvq")
    .with_head_dim(128).with_bit_width(inlier=2)
    .with_artifact_store(NpyArtifactStore("./artifacts/"))
    .build())

Development

# Full test suite (212 tests, includes 7 Metal parity tests)
pytest veloxquant_mlx/tests/ -v

# 2-bit improvement validation — fast synthetic run
python test_2bit_improvements.py

# Generate optimization-journey figure
python scripts/plot_optimization_journey.py

Contributions welcome — please open an issue first for anything beyond a small bugfix. See CHANGELOG.md for release history.

References

Papers implemented in this library

SpectralQuant (2026) — "3% Is All You Need: Breaking TurboQuant's Compression Limit via Spectral Structure" — eigenvector PCA rotation + signal/noise codebooks, 5.95× at higher quality than TurboQuant
TurboQuant (ICLR 2026) — Zandieh et al., "Online Vector Quantization with Near-optimal Distortion Rate"
RateQuant (2026) — "RateQuant: Mixed-Precision KV Cache Quantization via Rate-Distortion Theory"
VecInfer (2025) — Yao et al., "Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization"
PolarQuant (AISTATS 2026) — "PolarQuant: Quantizing KV Caches with Polar Transformation"
QJL (2024) — Zandieh et al., "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization"

Related work

Quantization:

KIVI (ICML 2024) — Liu et al., "A Tuning-Free Asymmetric 2-Bit Quantization for KV Cache"
KVQuant (NeurIPS 2024) — Hooper et al., "Towards 10 Million Context Length LLM Inference with KV Cache Quantization"
Coupled Quantization (NeurIPS 2024) — Zhang et al., "KV Cache is 1 Bit Per Channel"
KVTuner (ICML 2025) — Li et al., "Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization"
MixKVQ (2024) — Zhang et al., "Query-Aware Mixed-Precision KV Cache Quantization"
FibQuant (2025) — "Universal Vector Quantization for Random-Access KV-Cache Compression"

Token eviction & sparse attention:

SnapKV (2024) — Li et al., "LLM Knows What You are Looking for Before Generation"
PyramidKV (2024) — Cai et al., "Dynamic KV Cache Compression based on Pyramidal Information Funneling"
RocketKV (ICML 2025) — Behnam et al., "Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression"
MagicPIG (ICLR 2025 Spotlight) — Chen et al., "LSH Sampling for Efficient LLM Generation"

Low-rank & cross-layer:

xKV (2025) — Chang et al., "Cross-Layer SVD for KV-Cache Compression"
KVPress (2024) — "KV Cache Compression by Estimating Attention from Future Queries Distribution"

Survey:

KV Cache Management Survey (2024) — "A Survey on LLM Acceleration based on KV Cache Management"

Framework: Apple MLX

License

MIT — see LICENSE.

Built for Apple Silicon · Engineered for speed · MIT License
Landing page · Issues · Blog: 10-model study · Blog: Metal kernels

Release history

Version	Date
0.8.0	Jun 10, 2026
0.7.0	May 31, 2026
0.6.0 (this version)	May 28, 2026
0.5.1	May 25, 2026
0.5.0	May 23, 2026
0.4.0	May 23, 2026
0.3.6	May 17, 2026
0.3.5	May 16, 2026
0.3.1	May 10, 2026
0.3.0	May 10, 2026
0.2.0	May 7, 2026

Download files

Distribution	File	Size	Uploaded
Source	veloxquant_mlx-0.6.0.tar.gz	143.5 kB	May 28, 2026
Built (Python 3)	veloxquant_mlx-0.6.0-py3-none-any.whl	197.2 kB	May 28, 2026

File details

Details for the file veloxquant_mlx-0.6.0.tar.gz.

File metadata

Download URL: veloxquant_mlx-0.6.0.tar.gz
Upload date: May 28, 2026
Size: 143.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for veloxquant_mlx-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`efee9e89b259c0da51b36f1f659334d8dcfe9bdc76b5aceb428ca49c1bc36cd9`
MD5	`9a4c4b4fecc83b0671d43ec86fe6e70b`
BLAKE2b-256	`957ef7135da189a142dd545743ec4ff0893b16004b6ffcb27b273c40ce0dace0`

File details

Details for the file veloxquant_mlx-0.6.0-py3-none-any.whl.

File metadata

Download URL: veloxquant_mlx-0.6.0-py3-none-any.whl
Upload date: May 28, 2026
Size: 197.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for veloxquant_mlx-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`467b29ff37d1bbe9352589f7d8ed9b8e9f8d42d0ff61bfefbf16666444c0b1c4`
MD5	`72ebe2e8f21becdacdb9accf45d7e9e6`
BLAKE2b-256	`53e8524950fabf83418d0ea84b54c18a1a6211b1a1175c163839ea8e28a4b58e`
