8-bit and 4-bit quantization for PyTorch on Apple Silicon (M1/M2/M3/M4)
MPS BitsAndBytes
8-bit quantization for PyTorch on Apple Silicon (M1/M2/M3/M4).
50% memory savings for storing model weights, with no steady-state speed penalty thanks to smart caching.
Features
- Linear8bit: Drop-in replacement for `nn.Linear` with int8 weights
- Smart caching: Dequantize once, then run fast fp16 matmuls (AMX-accelerated)
- QLoRA ready: Perfect for fine-tuning large models on a Mac
- Pure PyTorch: No custom kernels needed, works out of the box
Installation
```bash
pip install mps-bitsandbytes
```
Or from source:
```bash
git clone https://github.com/mpsops/mps-bitsandbytes
cd mps-bitsandbytes
pip install -e .
```
Quick Start
```python
import torch
from mps_bitsandbytes import Linear8bit, quantize_model

# Convert an existing model to 8-bit
model = YourModel().to('mps')
model = quantize_model(model, device='mps')

# Or convert individual layers
linear_8bit = Linear8bit.from_linear(some_linear_layer)

# Use normally - same API, 50% less memory for weights
output = model(inputs)
```
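To sanity-check the savings on your own model, you can total the bytes held by its tensors before and after quantization. The helper below is our own sketch, not a package API; it counts both parameters and buffers, since where Linear8bit stores its int8 weights is an implementation detail:

```python
import torch

def weight_bytes(model: torch.nn.Module) -> int:
    # Sum the storage of all parameters and buffers. Counting both covers
    # int8 weights regardless of whether they are registered as parameters
    # or as buffers (an implementation detail we don't assume here).
    tensors = list(model.parameters()) + list(model.buffers())
    return sum(t.numel() * t.element_size() for t in tensors)

print(f"{weight_bytes(model) / 2**30:.2f} GiB of weights")
```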
How It Works
- Storage: Weights stored as int8 (1 byte per param vs 2 bytes for fp16)
- First forward: Dequantize int8 → fp16, cache the result
- Subsequent forwards: Use cached fp16 weights, fast AMX matmul
This gives you:
- 50% memory savings on disk and when loading weights
- Same inference speed as fp16 (once cached)
- Compatible with QLoRA training
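For intuition, the whole pattern fits in a few lines of PyTorch. The sketch below is ours, not the package's actual Linear8bit code, and it assumes per-row absmax quantization and fp16 activations; it shows the store-int8, dequantize-once, cache-fp16 flow:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CachedInt8Linear(nn.Module):
    """Minimal sketch of the store-int8 / cache-fp16 pattern described above.
    Not the package's actual Linear8bit implementation."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.detach()
        # Per-output-row absmax quantization to int8 (an assumption; the
        # real layer's quantization scheme may differ).
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        self.register_buffer("weight_int8", torch.round(w / scale).to(torch.int8))
        self.register_buffer("scale", scale.to(torch.float16))
        self.bias = linear.bias  # sketch assumes the model runs in fp16
        self._cached_fp16 = None  # filled on first forward

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self._cached_fp16 is None:
            # First forward: dequantize once, keep the fp16 copy around.
            self._cached_fp16 = self.weight_int8.half() * self.scale
        # Later forwards: plain fp16 matmul on the cached weight (AMX path).
        return F.linear(x, self._cached_fp16, self.bias)
```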
Memory Savings
| Model Size | FP16 | INT8 | Savings |
|---|---|---|---|
| 7B params | 14 GB | 7 GB | 7 GB |
| 13B params | 26 GB | 13 GB | 13 GB |
| 70B params | 140 GB | 70 GB | 70 GB |
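These figures follow directly from bytes per parameter (weights only; the small per-channel scale factors add an overhead not counted here):

```python
# Where the table comes from: bytes = params * bytes_per_param.
for params_billion in (7, 13, 70):
    fp16_gb = params_billion * 2  # 2 bytes per fp16 weight
    int8_gb = params_billion * 1  # 1 byte per int8 weight
    print(f"{params_billion}B: fp16 {fp16_gb} GB, int8 {int8_gb} GB, "
          f"saved {fp16_gb - int8_gb} GB")
```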
Configuration
```python
# Default: cache enabled (fast, uses memory during inference)
layer = Linear8bit.from_linear(linear, use_cache=True)

# Memory-constrained: no cache (slower, minimum memory)
layer = Linear8bit.from_linear(linear, use_cache=False)

# Clear cache to free memory
layer.clear_cache()
```
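In a memory-constrained service, one possible pattern is to keep caching on for speed and drop all caches when idle. `clear_cache()` is the per-layer API shown above; the module-tree walk below is our own convenience, not a package function:

```python
from mps_bitsandbytes import Linear8bit

def clear_all_caches(model):
    # Walk the module tree and clear every quantized layer's fp16 cache.
    for module in model.modules():
        if isinstance(module, Linear8bit):
            module.clear_cache()

clear_all_caches(model)  # e.g. between requests, to release the fp16 copies
```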
QLoRA Training
```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model, LoraConfig
from mps_bitsandbytes import quantize_model

# Load model in 8-bit
model = AutoModelForCausalLM.from_pretrained("model_name")
model = quantize_model(model.to('mps'))

# Add LoRA adapters (these stay in fp16 for gradients)
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)

# Train - base weights frozen in int8, LoRA in fp16
trainer.train()  # set up `trainer` (e.g. transformers.Trainer) beforehand
```
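Before training, it is worth confirming that the quantized base model contributes no trainable parameters; peft's built-in report makes this easy:

```python
# Sanity check: only the LoRA adapters should be trainable.
model.print_trainable_parameters()
# prints something like:
#   trainable params: ... || all params: ... || trainable%: ...
```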
Benchmarks
Tested on M1 Max, batch_size=32, hidden_dim=4096:
| Method | Forward Time | Memory |
|---|---|---|
| FP16 | 1.08 ms | 100 MB |
| INT8 (cached) | 0.98 ms | 50 MB + cache |
| INT8 (no cache) | 9.65 ms | 50 MB |
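Figures like these can be reproduced with a small harness (ours, not shipped with the package). Two details matter on MPS: kernels launch asynchronously, so synchronize before reading the clock, and the warm-up must absorb the one-time dequantization:

```python
import time
import torch

def ms_per_forward(layer, x, iters=100, warmup=10):
    # Warm-up also fills the dequantization cache on the first call.
    for _ in range(warmup):
        layer(x)
    torch.mps.synchronize()  # MPS dispatch is asynchronous
    t0 = time.perf_counter()
    for _ in range(iters):
        layer(x)
    torch.mps.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3  # ms per forward

x = torch.randn(32, 4096, device='mps', dtype=torch.float16)
```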
Limitations
- First forward is slower: Need to dequantize weights once
- Cache uses memory: During inference, cached fp16 weights use extra memory
- No int8 matmul acceleration: Apple Silicon AMX only supports fp16/fp32
For maximum memory savings during inference, set `use_cache=False`, but expect roughly 10x slower forward passes (see the benchmark table above).
Credits
- bitsandbytes - Original CUDA implementation
- LLM.int8() - Paper by Tim Dettmers et al.
License
MIT
File details
Details for the file mps_bitsandbytes-0.1.1.tar.gz.
File metadata
- Download URL: mps_bitsandbytes-0.1.1.tar.gz
- Upload date:
- Size: 11.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `0ff1a655ba726b3b99ac1a8527e8539e4161abfb65704a59ec95c66da7ae863e` |
| MD5 | `875544b1e33c73f4a9a51e06a321974e` |
| BLAKE2b-256 | `6f9c239af82e0e48ca23cd118f43aa5a0175153e65c597078c3f84dc0104a1ad` |
File details
Details for the file mps_bitsandbytes-0.1.1-cp314-cp314-macosx_15_0_arm64.whl.
File metadata
- Download URL: mps_bitsandbytes-0.1.1-cp314-cp314-macosx_15_0_arm64.whl
- Upload date:
- Size: 82.6 kB
- Tags: CPython 3.14, macOS 15.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `4bb05cca51a236032ecbfb8bb10e6cde2adc8240d683648644aab018ab159b9c` |
| MD5 | `03fd1369643ec0959f2090ff012ab7bc` |
| BLAKE2b-256 | `81527f11217048642722bcde1ee6d80ba4764fc4bf98a76093faa08fa3f7699b` |