2-bit quantization with fused Metal dequant kernels for Apple Silicon — up to 8× faster local LLM inference

Opacc1ty

Local LLMs on Apple Silicon are slow. Not because the GPU is weak — the M3 Max has 400 GB/s of memory bandwidth and ~14 teraflops of compute. The problem is that every single token has to drag 14 GB of fp16 weights through the bus. The GPU spends most of its time waiting on memory.
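
The bandwidth argument can be checked with back-of-envelope arithmetic. This is a minimal sketch, assuming decoding is purely bandwidth-bound and every weight is read exactly once per token. It ignores the KV cache, activations, and compute, and `max_tokens_per_second` is a hypothetical helper, not part of opacc1ty:

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb

# 400 GB/s of bandwidth, 14 GB of fp16 weights (figures from the paragraph above).
fp16_ceiling = max_tokens_per_second(400.0, 14.0)
print(f"fp16 ceiling: {fp16_ceiling:.1f} tok/s")
```

The ceiling scales inversely with the number of bytes read per token, which is why shrinking the weights is the lever that matters here.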

Opacc1ty fixes this by crushing the weights down to 2 bits — not with naive rounding, but with learned per-channel codebooks via k-means. Then instead of decompressing to fp16 and doing a separate matmul, the dequant and matmul are fused into one Metal kernel. The weights never expand in memory. They stay 2-bit all the way from RAM to register.
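
To make "learned per-channel codebooks via k-means" concrete, here is a rough NumPy sketch. It assumes scalar k-means (Lloyd's algorithm) run independently on each output channel with 4 centroids. The function name, the quantile initialisation, and the iteration count are all assumptions for illustration, and the actual quantizer may differ (for example in how it handles outliers):

```python
import numpy as np

def learn_codebook(row: np.ndarray, n_entries: int = 4, iters: int = 20):
    """1-D k-means over one output channel's weights."""
    # Initialise centroids at evenly spaced quantiles of the row.
    codebook = np.quantile(row, (np.arange(n_entries) + 0.5) / n_entries)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(row[:, None] - codebook[None, :]).argmin(axis=1)
        # Move each centroid to the mean of the weights assigned to it.
        for k in range(n_entries):
            members = row[idx == k]
            if members.size:
                codebook[k] = members.mean()
    # Final assignment against the updated centroids.
    idx = np.abs(row[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook.astype(np.float16), idx.astype(np.uint8)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 256)).astype(np.float32)  # 8 output channels, toy data
codebooks, indices = zip(*(learn_codebook(r) for r in W))
codebooks, indices = np.stack(codebooks), np.stack(indices)

# Reconstruct the weights from codebook + indices to measure the error.
W_hat = np.take_along_axis(codebooks.astype(np.float32), indices.astype(np.int64), axis=1)
mse = float(np.mean((W - W_hat) ** 2))
```

Each channel ends up with four fp16 centroids plus one 2-bit index per weight, which is all that needs to be stored.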

On my M3 Max, Llama-3.1-8B goes from ~28 tok/s to ~195 tok/s. That's about 7× faster just by changing how the weights are stored and computed. No model surgery. No distillation. Same architecture.

The trick

Normal quantization pipelines do this:

2-bit weights → expand to fp16 in memory → run matmul → discard expanded weights

The expansion step blows the 2-bit data back up to 16-bit before the matmul reads it, so the matmul still streams a full fp16 weight matrix through memory. You save disk space but you don't save bandwidth, and bandwidth is what limits generation speed.

Opacc1ty does this instead:

2-bit weights + tiny codebook → feed straight into GPU → lookup + matmul in registers

The Metal kernel loads the packed 2-bit indices, looks up the corresponding fp16 values from a codebook that lives in registers (64 bytes per output channel — nothing), and accumulates the dot product immediately. The codebook lookup happens inside the matmul's inner loop. At no point does an expanded fp16 weight matrix touch unified memory.
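
The semantics of that inner loop can be written out in plain Python. This is a reference sketch only, not the Metal kernel: it assumes a scalar 4-entry codebook per output channel and four indices packed per byte with the lowest bits first. Both the layout and the function names are assumptions for illustration:

```python
import numpy as np

def pack_2bit(indices: np.ndarray) -> np.ndarray:
    """Pack 2-bit indices (values 0..3) four per byte, lowest bits first."""
    idx = indices.reshape(indices.shape[0], -1, 4).astype(np.uint8)
    return idx[..., 0] | (idx[..., 1] << 2) | (idx[..., 2] << 4) | (idx[..., 3] << 6)

def fused_dequant_gemv(packed: np.ndarray, codebooks: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y = W @ x computed straight from packed indices, one channel at a time."""
    y = np.zeros(packed.shape[0], dtype=np.float32)
    for row in range(packed.shape[0]):
        cb = codebooks[row].astype(np.float32)  # small table a kernel would keep in registers
        acc = 0.0
        for j, byte in enumerate(packed[row]):
            for slot in range(4):
                code = (int(byte) >> (2 * slot)) & 0b11  # unpack one 2-bit index
                acc += float(cb[code]) * float(x[4 * j + slot])  # lookup + multiply-accumulate
        y[row] = acc
    return y

rng = np.random.default_rng(0)
indices = rng.integers(0, 4, size=(4, 16))
codebooks = rng.normal(size=(4, 4)).astype(np.float16)
x = rng.normal(size=16).astype(np.float32)

y = fused_dequant_gemv(pack_2bit(indices), codebooks, x)

# Unfused reference: expand the full weight matrix first, then multiply.
W = np.take_along_axis(codebooks.astype(np.float32), indices, axis=1)
y_ref = W @ x
```

Both paths give the same result; the difference is that the fused version never materialises `W`.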

I wrote a longer explanation of how it works in HOW.md if you care about the details.

Does it actually work?

Yeah, mostly. Here's what I get on an M3 Max with 64 GB:

Setup	Model size	tok/s	Wiki perplexity
fp16 (MLX)	14.0 GB	28	6.14
Q4_K_M (llama.cpp)	4.9 GB	68	6.21
Q3_K_M (llama.cpp)	3.8 GB	85	6.35
Opacc1ty 2-bit, 1% outliers	2.5 GB	180	6.32
Opacc1ty 2-bit, 2% outliers	2.8 GB	165	6.25

The 2-bit quant loses about 0.18 perplexity vs fp16. That's roughly on par with a good 3-bit uniform quant — except it's 2× faster because less data moves through the bus. If you push outlier fraction to 2% it drops to +0.11 perplexity at the cost of some speed.
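
Those comparisons come straight from the table. A short script to reproduce them, using only the tok/s and perplexity figures listed above:

```python
# (tok/s, perplexity) copied from the benchmark table.
results = {
    "fp16 (MLX)": (28, 6.14),
    "Q3_K_M (llama.cpp)": (85, 6.35),
    "Opacc1ty 2-bit, 1% outliers": (180, 6.32),
    "Opacc1ty 2-bit, 2% outliers": (165, 6.25),
}

base_speed, base_ppl = results["fp16 (MLX)"]
for name, (speed, ppl) in results.items():
    # Speed relative to fp16 and perplexity change relative to fp16.
    print(f"{name}: {speed / base_speed:.1f}x fp16 speed, {ppl - base_ppl:+.2f} perplexity")

# Speed of the 1%-outlier config relative to Q3_K_M.
vs_q3 = results["Opacc1ty 2-bit, 1% outliers"][0] / results["Q3_K_M (llama.cpp)"][0]
```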

Is it perfect? No. Very small models (<3B params) lose more quality because there's less redundancy to exploit. Creative writing tasks can feel slightly less "sharp." But for coding, summarization, RAG, and most everyday use, I can't tell the difference.

Install

pip install opacc1ty

You need:

A Mac with Apple Silicon (M1 or newer — the GPU needs to support Metal 3)
macOS 14+ (maybe works on 13, haven't tested)
Xcode CLI tools if you want the Metal backend (xcode-select --install)
PyTorch 2+ for quantization

Right now this only works on Apple Silicon. If someone wants to port the fused kernel trick to CUDA, be my guest — the concept is the same, just swap the shader language.

Usage

Quantize a HuggingFace model:

opacc1ty quantize ~/models/llama-3.1-8b/ --output llama.bf2

This takes about 15 minutes on CPU for an 8B model, or ~5 minutes if you use MPS (--device mps). It'll spit out a .bf2 file.

See what you got:

opacc1ty info llama.bf2 --layers

Benchmark it:

opacc1ty benchmark llama.bf2 --prompt "Write a quicksort in Rust" --max-tokens 256

Or from Python:

from opacc1ty import VectorQuantizer, QuantizeConfig
from opacc1ty.format.bf2 import BF2Writer
from safetensors import safe_open

# Load every tensor from the checkpoint into a plain dict.
state_dict = {}
with safe_open("model.safetensors", framework="pt") as f:
    for key in f.keys():
        state_dict[key] = f.get_tensor(key)

# Model metadata, shared by the quantizer and the writer.
model_config = {"architecture": "llama"}

config = QuantizeConfig(bits=2, outlier_fraction=0.01, device="mps")
results = VectorQuantizer(config).quantize_model(state_dict, model_config)

BF2Writer("model.bf2").write(results, model_config, results["_quantize_config"])

There's also a C API if you want to embed this in something — check runtime/.

Files

opacc1ty/
├── opacc1ty/          # python package
│   ├── quantize/     #   vq, k-means codebook learner, outlier detection
│   ├── format/       #   .bf2 binary format reader/writer
│   ├── cli/          #   quantize, info, benchmark, serve commands
│   └── utils/        #   metal kernel manager
├── kernels/          # metal shaders (dequant_gemv, dequant_gemm)
├── runtime/          # C inference runtime + objc metal backend
└── tests/            # 17 tests, all passing

What's next

Things I'm working on or thinking about:

GGUF export so these models work in llama.cpp without my runtime
Roughly 1.5-bit using ternary codebooks (3 entries instead of 4). A ternary index carries log2(3) ≈ 1.58 bits, so this should squeeze out roughly another 20% bandwidth reduction
Speculative decoding on top of this — run a 0.1B draft model on the ANE, verify with the 2-bit target on GPU
Training-aware quant — finetune with straight-through gradients so the model learns to be quantization-friendly
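
For the ternary idea, the storage cost per weight can be estimated directly. This is a quick sketch; the 5-indices-per-byte packing is one possible scheme assumed here for illustration, not something decided for the project:

```python
import math

ideal_bits = math.log2(3)  # information content of one ternary index
packed_bits = 8 / 5        # 5 ternary indices fit in one byte because 3**5 = 243 <= 256

for label, bits in [("ideal", ideal_bits), ("5 per byte", packed_bits)]:
    saving = 1 - bits / 2  # relative to the current 2-bit indices
    print(f"{label}: {bits:.3f} bits/weight, {saving:.0%} less index data than 2-bit")
```

This counts index data only; codebooks and any outlier storage come on top.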

If any of that sounds fun to hack on, the code is pretty readable and I'm happy to walk people through it.

Bugs & help

This is early. Stuff will break. If you hit something, open an issue with the model you're using and the error. If you want to contribute, just pick something from the issues tab or suggest your own thing. I'm not precious about the code.

MIT. Do whatever.

Release history

This version

1.0.2

Jun 2, 2026

0.1.0

Jun 2, 2026

Download files

Source Distribution

opacc1ty-1.0.2.tar.gz (40.0 kB)

Uploaded Jun 2, 2026 Source

Built Distribution

opacc1ty-1.0.2-py3-none-any.whl (37.6 kB)

Uploaded Jun 2, 2026 Python 3

File details

Details for the file opacc1ty-1.0.2.tar.gz.

File metadata

Download URL: opacc1ty-1.0.2.tar.gz
Upload date: Jun 2, 2026
Size: 40.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for opacc1ty-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`b48bafe2808c576a53baaff036c68aa9e28d0f66f444c260d9239c174f8bffd5`
MD5	`d71179a11c57ea0849d468875a2c9c06`
BLAKE2b-256	`d28b7941a6d376d70d0aa2198ad36f254163cbed968c7ffecc673cb7d5998604`

File details

Details for the file opacc1ty-1.0.2-py3-none-any.whl.

File metadata

Download URL: opacc1ty-1.0.2-py3-none-any.whl
Upload date: Jun 2, 2026
Size: 37.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for opacc1ty-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9b51968ffbf8ffefb53ef9367d95f834105534f8c993b31bc51b1dd985b0d7a4`
MD5	`c86f019b288eb6baf0c1eb358e3eb52a`
BLAKE2b-256	`ab75b058e3f0f60d2a8198d9f8eb7229c9333a4b461c3e4756d07064c56bb507`
