2-bit quantization with fused Metal dequant kernels for Apple Silicon: up to ~7× faster local LLM inference
Opacc1ty
Local LLMs on Apple Silicon are slow. Not because the GPU is weak — the M3 Max has 400 GB/s of memory bandwidth and ~14 teraflops of compute. The problem is that every single token has to drag 14 GB of fp16 weights through the bus. The GPU spends most of its time waiting on memory.
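The bandwidth argument is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes decode is purely memory-bound and that every weight byte is read once per token; it ignores the KV cache, activations, and compute, so treat the result as a rough ceiling. The function name is made up for illustration.

```python
def bandwidth_bound_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough ceiling on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


# fp16 case using the figures quoted above: 400 GB/s bus, 14 GB of weights
fp16_ceiling = bandwidth_bound_tok_s(400.0, 14.0)
print(round(fp16_ceiling, 1))  # 28.6
```

That ceiling lines up with the ~28 tok/s fp16 baseline, which is what you'd expect if the GPU is waiting on memory rather than on math.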
Opacc1ty fixes this by crushing the weights down to 2 bits — not with naive rounding, but with learned per-channel codebooks via k-means. Then instead of decompressing to fp16 and doing a separate matmul, the dequant and matmul are fused into one Metal kernel. The weights never expand in memory. They stay 2-bit all the way from RAM to register.
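To make "learned per-channel codebooks via k-means" concrete, here is a minimal pure-Python sketch of the idea: run scalar k-means (Lloyd's algorithm) with 4 centroids on each output channel, then store a 2-bit index per weight. This is an illustration under assumptions, not the package's code: `kmeans_1d` and `quantize_rows` are hypothetical names, the quantile-based init is my choice, and the real quantizer may cluster vectors rather than scalars.

```python
def kmeans_1d(values, k=4, iters=25):
    """Scalar k-means; returns (centroids, index of nearest centroid per value)."""
    s = sorted(values)
    # Deterministic init: spread the starting centroids across the sorted values.
    centroids = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]

    def nearest(v):
        return min(range(k), key=lambda j: (v - centroids[j]) ** 2)

    for _ in range(iters):
        assign = [nearest(v) for v in values]
        for j in range(k):
            members = [v for v, a in zip(values, assign) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = sum(members) / len(members)
    return centroids, [nearest(v) for v in values]


def quantize_rows(weight_rows):
    """One 4-entry codebook per output channel (row), plus 2-bit indices."""
    return [kmeans_1d(row) for row in weight_rows]


row = [-1.0, -1.1, -0.9, -0.3, -0.31, 0.3, 0.29, 1.0, 1.05]
codebook, indices = kmeans_1d(row)
print(codebook, indices)
```

Each weight is then represented by an index in 0..3 into its channel's codebook, which is what makes 2 bits per weight possible without forcing the levels onto a uniform grid.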
On my M3 Max, Llama-3.1-8B goes from ~28 tok/s to ~195 tok/s. That's about 7× faster just by changing how the weights are stored and computed. No model surgery. No distillation. Same architecture.
The trick
Normal quantization pipelines do this:
2-bit weights → expand to fp16 in memory → run matmul → discard expanded weights
The expansion step blows 2-bit data back up to 16-bit before the GPU ever sees it. You save disk space but you don't save bandwidth — and bandwidth is what limits generation speed.
Opacc1ty does this instead:
2-bit weights + tiny codebook → feed straight into GPU → lookup + matmul in registers
The Metal kernel loads the packed 2-bit indices, looks up the corresponding fp16 values from a codebook that lives in registers (64 bytes per output channel — nothing), and accumulates the dot product immediately. The codebook lookup happens inside the matmul's inner loop. At no point does an expanded fp16 weight matrix touch unified memory.
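As a CPU reference for what the fused kernel computes, the sketch below packs 2-bit codes four per byte and does the codebook lookup inside the dot-product loop, so the expanded row is never materialized. It is only meant to show the data flow. The bit layout (lowest bits first), the scalar 4-entry codebook, and the function names are assumptions for illustration; the actual Metal shaders may differ.

```python
def pack_2bit(indices):
    """Pack 2-bit codes four per byte, lowest bits first."""
    assert len(indices) % 4 == 0
    out = bytearray()
    for i in range(0, len(indices), 4):
        a, b, c, d = indices[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)


def fused_dot(packed, codebook, x):
    """Dot product of one quantized row with x, decoding codes on the fly."""
    acc = 0.0
    for byte_idx, byte in enumerate(packed):
        for lane in range(4):
            code = (byte >> (2 * lane)) & 0b11      # extract one 2-bit index
            acc += codebook[code] * x[4 * byte_idx + lane]  # lookup + multiply-add
    return acc


codebook = [-1.0, -0.25, 0.25, 1.0]
packed = pack_2bit([0, 3, 2, 1, 1, 1, 2, 3])
print(fused_dot(packed, codebook, [1, 2, 3, 4, 5, 6, 7, 8]))  # 7.75
```

A GEMV is this loop run once per output channel, each with its own codebook; the only weight data read from memory is the packed bytes.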
I wrote a longer explanation of how it works in HOW.md if you care about the details.
Does it actually work?
Yeah, mostly. Here's what I get on an M3 Max with 64 GB:
| Setup | Model size | tok/s | Wiki perplexity |
|---|---|---|---|
| fp16 (MLX) | 14.0 GB | 28 | 6.14 |
| Q4_K_M (llama.cpp) | 4.9 GB | 68 | 6.21 |
| Q3_K_M (llama.cpp) | 3.8 GB | 85 | 6.35 |
| Opacc1ty 2-bit, 1% outliers | 2.5 GB | 180 | 6.32 |
| Opacc1ty 2-bit, 2% outliers | 2.8 GB | 165 | 6.25 |
The 2-bit quant loses about 0.18 perplexity vs fp16. That's roughly on par with a good 3-bit uniform quant — except it's 2× faster because less data moves through the bus. If you push outlier fraction to 2% it drops to +0.11 perplexity at the cost of some speed.
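One common way to implement an "outlier fraction" is to pull the largest-magnitude weights out of each row, keep them at full precision in a sparse side table, and quantize only what's left. The sketch below shows that split. It is an assumption about the general technique, not a description of how Opacc1ty's outlier detection actually works, and `split_outliers` is a hypothetical name.

```python
import math


def split_outliers(row, fraction=0.01):
    """Return (dense, outliers); outliers maps index -> original value."""
    n_out = math.ceil(len(row) * fraction)
    by_magnitude = sorted(range(len(row)), key=lambda i: abs(row[i]), reverse=True)
    outliers = {i: row[i] for i in by_magnitude[:n_out]}
    # Zero the outlier slots so they don't stretch the codebook's range.
    dense = [0.0 if i in outliers else v for i, v in enumerate(row)]
    return dense, outliers


dense, outliers = split_outliers(
    [0.1, -5.0, 0.2, 0.0, 3.0, -0.1, 0.05, 0.3], fraction=0.25
)
print(outliers)  # {1: -5.0, 4: 3.0}
```

Under this scheme a higher fraction means more full-precision values to store and read per token, which fits the size and speed trade-off in the table.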
Is it perfect? No. Very small models (<3B params) lose more quality because there's less redundancy to exploit. Creative writing tasks can feel slightly less "sharp." But for coding, summarization, RAG, and most everyday use, I can't tell the difference.
Install
pip install opacc1ty
You need:
- A Mac with Apple Silicon (M1 or newer — the GPU needs to support Metal 3)
- macOS 14+ (maybe works on 13, haven't tested)
- Xcode CLI tools if you want the Metal backend (xcode-select --install)
- PyTorch 2+ for quantization
Right now this only works on Apple Silicon. If someone wants to port the fused kernel trick to CUDA, be my guest — the concept is the same, just swap the shader language.
Usage
Quantize a HuggingFace model:
opacc1ty quantize ~/models/llama-3.1-8b/ --output llama.bf2
This takes about 15 minutes on CPU for an 8B model, or ~5 minutes if you use MPS (--device mps). It'll spit out a .bf2 file.
See what you got:
opacc1ty info llama.bf2 --layers
Benchmark it:
opacc1ty benchmark llama.bf2 --prompt "Write a quicksort in Rust" --max-tokens 256
Or from Python:
from opacc1ty import VectorQuantizer, QuantizeConfig
from opacc1ty.format.bf2 import BF2Writer
from safetensors import safe_open

# Load every tensor from the checkpoint into a plain dict
state_dict = {}
with safe_open("model.safetensors", framework="pt") as f:
    for key in f.keys():
        state_dict[key] = f.get_tensor(key)

# Model metadata, passed to both the quantizer and the writer
model_config = {"architecture": "llama"}
config = QuantizeConfig(bits=2, outlier_fraction=0.01, device="mps")
results = VectorQuantizer(config).quantize_model(state_dict, model_config)
BF2Writer("model.bf2").write(results, model_config, results["_quantize_config"])
There's also a C API if you want to embed this in something — check runtime/.
Files
opacc1ty/
├── opacc1ty/ # python package
│ ├── quantize/ # vq, k-means codebook learner, outlier detection
│ ├── format/ # .bf2 binary format reader/writer
│ ├── cli/ # quantize, info, benchmark, serve commands
│ └── utils/ # metal kernel manager
├── kernels/ # metal shaders (dequant_gemv, dequant_gemm)
├── runtime/ # C inference runtime + objc metal backend
└── tests/ # 17 tests, all passing
What's next
Things I'm working on or thinking about:
- GGUF export so these models work in llama.cpp without my runtime
- 1.5-bit using ternary codebooks (3 entries instead of 4) — should squeeze another 25% bandwidth reduction
- Speculative decoding on top of this — run a 0.1B draft model on the ANE, verify with the 2-bit target on GPU
- Training-aware quant — finetune with straight-through gradients so the model learns to be quantization-friendly
If any of that sounds fun to hack on, the code is pretty readable and I'm happy to walk people through it.
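For anyone looking at the ternary idea: three-valued codes don't fit evenly into bits, so the packing matters. One well-known option is to store five base-3 digits per byte, since 3**5 = 243 fits in 0..255. The sketch below is just that packing, with hypothetical function names; it says nothing about how a ternary codebook would be learned.

```python
def pack_trits(trits):
    """Pack base-3 digits five per byte, first digit least significant."""
    assert len(trits) % 5 == 0
    out = bytearray()
    for i in range(0, len(trits), 5):
        value = 0
        for t in reversed(trits[i:i + 5]):
            value = value * 3 + t
        out.append(value)  # at most 242, so it always fits in one byte
    return bytes(out)


def unpack_trits(packed):
    """Inverse of pack_trits."""
    trits = []
    for byte in packed:
        for _ in range(5):
            trits.append(byte % 3)
            byte //= 3
    return trits


codes = [0, 1, 2, 2, 1, 2, 2, 2, 2, 2]
assert unpack_trits(pack_trits(codes)) == codes
print(8 / 5)  # bits per weight with this packing: 1.6
```

With this layout the cost is 1.6 bits per weight, so the saving over 2-bit would be about 20%; hitting the full 25% needs exactly 1.5 bits per weight, which takes a different scheme.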
Bugs & help
This is early. Stuff will break. If you hit something, open an issue with the model you're using and the error. If you want to contribute, just pick something from the issues tab or suggest your own thing. I'm not precious about the code.
MIT. Do whatever.