BigSmall — Lossless AI Model Compression
Make any AI model ~34% smaller. Bit-identical weights. Drop-in replacement for from_pretrained.
pip install bigsmall # CLI + compression/decompression
pip install bigsmall[torch] # add this for model loading (from_pretrained)
A 14 GB Mistral-7B becomes 9.3 GB. A fine-tuned model becomes a 5 GB patch on top of its 14 GB base. After decompression, every weight is bit-for-bit identical to the original, and each tensor's md5 is verified on decompress. (Verification is tensor-level, not file-level: safetensors re-serializes the container wrapper, so the file's md5 changes while the weight values do not.)
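The tensor-level versus file-level distinction can be illustrated with a small sketch. It uses a toy container (a JSON header followed by raw bytes), not the real safetensors format or BigSmall's own verifier; `serialize` and `tensor_md5s` are hypothetical helper names introduced only for this example.

```python
import hashlib
import json

def serialize(tensors, key_order):
    """Toy container: length-prefixed JSON header, then raw tensor bytes."""
    header = json.dumps({name: len(tensors[name]) for name in key_order}).encode()
    body = b"".join(tensors[name] for name in key_order)
    return len(header).to_bytes(8, "little") + header + body

def tensor_md5s(tensors):
    """md5 of each tensor's raw bytes, independent of the container layout."""
    return {name: hashlib.md5(data).hexdigest() for name, data in tensors.items()}

original = {"embed.weight": b"\x3f\x80" * 8, "lm_head.weight": b"\xbf\x00" * 8}
restored = dict(original)  # same bytes per tensor, as after a lossless round-trip

# Writing the same tensors in a different order changes the file bytes...
file_a = serialize(original, ["embed.weight", "lm_head.weight"])
file_b = serialize(restored, ["lm_head.weight", "embed.weight"])
file_md5_differs = hashlib.md5(file_a).digest() != hashlib.md5(file_b).digest()

# ...but the per-tensor digests still match.
tensors_match = tensor_md5s(original) == tensor_md5s(restored)
```

This is why a changed file md5 after a round-trip does not indicate a changed model, while a changed tensor md5 would.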
| ~34% smaller | ~65% smaller as a delta patch | 25+ ready-to-use models |
|---|---|---|
| any BF16 LLM | fine-tunes vs their base | on HuggingFace |
What BigSmall does
Three use cases. Pick the one that fits.
1. Make any model smaller
bigsmall compress mistral-7b/ -o mistral-7b.bs
bigsmall decompress mistral-7b.bs -o mistral-7b-restored/
Before: 14.2 GB of safetensors. After: 9.3 GB .bs file. Saved: 4.9 GB (34%).
Every weight is bit-for-bit identical. Every calculation the model does is identical to the original. Works on any safetensors model — LLMs, diffusion, audio, vision, anything.
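The "Saved" figures are plain size arithmetic. A minimal sketch, using the rounded GB values quoted in this README (the helper name `savings` is made up for illustration):

```python
def savings(original_gb, compressed_gb):
    """Return (GB saved, percent saved) for a before/after size pair."""
    saved = original_gb - compressed_gb
    return saved, 100.0 * saved / original_gb

# Standalone compression: 14.2 GB -> 9.3 GB
full_saved_gb, full_saved_pct = savings(14.2, 9.3)

# Delta patch: 14.2 GB -> roughly 5 GB
delta_saved_gb, delta_saved_pct = savings(14.2, 5.0)
```

Because the inputs are rounded to one decimal place, the computed percentages land within about a point of the quoted 34% and 65%.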
2. Store fine-tunes as tiny patches
bigsmall compress qwen-instruct/ --delta-from qwen-base/ -o instruct.bs
bigsmall apply qwen-base/ instruct.bs -o qwen-instruct-restored/
Before: 14.2 GB Qwen2.5-7B-Instruct. After: ~5 GB patch. Saved: 9 GB (65%).
If your users already have the public base model, they only need to download what changed. This is the biggest win in BigSmall. Use it for any fine-tune: instruction tuning, DPO, RLHF, domain adaptation, LoRA-merged checkpoints.
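Why a patch can be so much smaller than the full model: a fine-tune usually differs from its base by small changes, so the difference is far more compressible than the weights themselves. The sketch below uses a byte-wise XOR plus zlib as a stand-in; the source does not describe BigSmall's actual delta codec, and `make_patch` and `apply_patch` are hypothetical names.

```python
import random
import zlib

def make_patch(base, finetune):
    """XOR the fine-tune against the base, then compress the (mostly zero) result."""
    delta = bytes(a ^ b for a, b in zip(base, finetune))
    return zlib.compress(delta, 9)

def apply_patch(base, patch):
    """Reverse of make_patch: XOR the decompressed delta back onto the base."""
    delta = zlib.decompress(patch)
    return bytes(a ^ b for a, b in zip(base, delta))

rng = random.Random(0)
n_weights = 4096
base = bytes(rng.getrandbits(8) for _ in range(2 * n_weights))  # toy 16-bit weights

# Toy fine-tune: only the two lowest bits of one byte per weight change.
finetune = bytearray(base)
for i in range(0, len(finetune), 2):
    finetune[i] ^= rng.getrandbits(2)
finetune = bytes(finetune)

patch = make_patch(base, finetune)
restored = apply_patch(base, patch)
standalone = zlib.compress(finetune, 9)
```

The round-trip is exact because XOR is its own inverse, which is the same property any lossless delta scheme has to preserve.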
3. Download smaller, use instantly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"wpferrell/phi-3.5-mini-instruct-bigsmall"
)
Works exactly like a normal HuggingFace model — BigSmall decompresses transparently on load. 25+ pre-compressed models ready to use (browse them all).
Prefer the CLI? bigsmall decompress works on local .bs files — download first, then decompress:
hf download wpferrell/phi-3.5-mini-instruct-bigsmall --local-dir phi-3.5-mini-bs
bigsmall decompress phi-3.5-mini-bs/model-00001-of-00002.bs -o model.safetensors
(On older huggingface_hub the equivalent command is huggingface-cli download …; the huggingface-cli entrypoint is deprecated in huggingface_hub >= 1.0 in favour of hf.)
Compression numbers (every published model)
Every row is a real measurement. Click a model to download it.
| Model | Original | BigSmall | Saved |
|---|---|---|---|
| Qwen2.5-14B-Instruct | 29.5 GB | 19.5 GB | 34% |
| Gemma-3-12B-it | 22.7 GB | 14.8 GB | 35% |
| Gemma-2-9B-it | 17.2 GB | 11.3 GB | 34% |
| Llama-3.1-8B-Instruct | 15.0 GB | 9.7 GB | 35% |
| Llama-3-8B-Instruct | 15.0 GB | 9.8 GB | 34% |
| Qwen3-8B | 15.3 GB | 10.1 GB | 34% |
| Mistral-7B-Instruct v0.3 | 14.2 GB | 8.9 GB | 37% |
| Mistral-7B-Instruct v0.2 | 14.2 GB | 8.9 GB | 37% |
| Qwen2.5-7B-Instruct | 14.2 GB | 9.4 GB | 34% |
| Phi-3.5-mini-instruct | 7.1 GB | 4.7 GB | 34% |
| Gemma-3-4B-it | 8.0 GB | 5.2 GB | 35% |
| Qwen3-4B-Instruct | 7.5 GB | 5.0 GB | 34% |
| Llama-3.2-3B-Instruct | 6.4 GB | 3.9 GB | 39% |
| Gemma-2-2B-it | 4.9 GB | 3.2 GB | 34% |
| Qwen2.5-3B-Instruct | 5.7 GB | 3.8 GB | 34% |
| Qwen2.5-1.5B-Instruct | 2.9 GB | 1.9 GB | 34% |
| Llama-3.2-1B-Instruct | 2.3 GB | 1.5 GB | 34% |
| Gemma-3-1B-it | 1.9 GB | 1.2 GB | 35% |
| Qwen2.5-0.5B-Instruct | 920 MB | 610 MB | 34% |
| GPT-2 (117M) | 548 MB | 414 MB | 24% |
| Gemma-3-270M-it | 500 MB | 330 MB | 34% |
| Gemma-3-270M | 500 MB | 330 MB | 34% |
| Gemma-2-2B | 9.7 GB | 8.1 GB | 17% |
Browse all 25+ models on HuggingFace →
What "lossless" actually means
Every weight in the model is mathematically identical to the original — same bit pattern, same floating-point value, same gradient, same output.
- Not quantization. Quantization rounds weights to fewer bits and the model's behaviour changes.
- Not pruning. Pruning deletes weights.
- Not approximation. No tricks, no calibration data, no quality drop.
BigSmall finds redundancy in the bit pattern of neural weights and stores it more compactly — the same idea as ZIP for text, but tuned for BF16 floating-point distributions. md5 is verified on every tensor at decompression. If a single bit differs, verify fails.
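The redundancy mentioned above can be seen with a small entropy measurement. This is an illustrative sketch, not BigSmall's codec: it draws toy Gaussian weights (an assumption about the weight distribution), truncates them to BF16 by keeping the top 16 bits of the float32 encoding, and measures the Shannon entropy of each byte separately.

```python
import math
import random
import struct
from collections import Counter

def to_bf16_bytes(x):
    """Truncate a float32 to BF16 by keeping its top 16 bits (big-endian)."""
    return struct.pack(">f", x)[:2]

def entropy_bits(symbols):
    """Empirical Shannon entropy, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(20000)]  # toy weights, assumed Gaussian
encoded = [to_bf16_bytes(w) for w in weights]

# High byte: sign bit plus the top 7 exponent bits. Few distinct values occur.
high_entropy = entropy_bits(b[0] for b in encoded)
# Low byte: lowest exponent bit plus the 7 mantissa bits. Close to uniform.
low_entropy = entropy_bits(b[1] for b in encoded)
```

In this toy setting the sign-and-exponent byte carries far fewer than 8 bits of information while the mantissa byte is nearly incompressible, which is the kind of structure a BF16-aware entropy coder can exploit and a generic byte-stream compressor largely misses.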
How it compares
| Approach | Lossless? | Typical reduction | Behaviour change |
|---|---|---|---|
| BigSmall | Yes — bit-identical | ~34% (65% as a delta) | None |
| Quantization (GPTQ / AWQ / bitsandbytes) | No | 50–75% | Yes — weights are rounded |
| DFloat11 (BF16 → ~11 bits, entropy-coded exponents) | Yes | ~31% | None |
| ZipNN | Yes | ~20–30% | None |
| ZIP / gzip on safetensors | Yes | ~1–3% | None (but not model-aware) |
Among the lossless options here, BigSmall gives the largest reduction, and its delta mode goes further for fine-tunes. Every weight, gradient, and output is identical to the original, so it's a drop-in for any workflow without re-evaluation. DFloat11 and ZipNN are also lossless but land slightly lower in this table. Quantization compresses further but changes the model; generic ZIP keeps fidelity but barely shrinks BF16 weights. See docs/comparison.md for the full breakdown.
CLI reference
bigsmall compress SRC [-o OUT] [--delta-from BASE] [--auto-delta] [--resume] [--ecc]
bigsmall decompress SRC [-o OUT] [--base BASE]
bigsmall info SRC.bs size, ratio, codecs used
bigsmall scan SRC analyse before compressing
bigsmall verify SRC.bs [--fast|--sample N] integrity check
bigsmall diff A.bs B.bs [--patch P.bs] compare or write a delta
bigsmall apply BASE PATCH.bs -o OUT reconstruct from base + patch
bigsmall repair SRC.bs [-o OUT] recover via Reed-Solomon ECC sidecar
bigsmall benchmark SRC encode/decode throughput
bigsmall migrate SRC.bs re-encode with current codecs
bigsmall status list your BigSmall HF repos
bigsmall pipeline run SRC DST resumable download → compress → upload
bigsmall reshard SRC --output-dir DIR [--size-gb N|--shards N|--join] reshard .bs by layer
Every command has --help. See docs/cli-reference.md for full examples.
Python API
import bigsmall
# Round-trip a model
bigsmall.compress("model/", "model.bs")
bigsmall.decompress("model.bs", "model_back/")
# Fine-tune as a delta patch
bigsmall.compress("finetune/", "patch.bs", delta_from="base/")
bigsmall.apply("base/", "patch.bs", "finetune_back/")
# Inspect before compressing
bigsmall.detect_bf16_native("model/")
bigsmall.scan_model("model/")
# Low-VRAM streaming inference (~12× less VRAM than from_pretrained)
from bigsmall import BigSmallStreamingModel
model = BigSmallStreamingModel.from_pretrained(
"wpferrell/phi-3.5-mini-instruct-bigsmall",
device="cuda",
lru_max_vram_gb=2.0,
)
# Stream-decompress straight from the HF CDN — no .bs written to disk (V10)
state_dict = bigsmall.stream_from_hub("wpferrell/gpt2-bigsmall", device="cpu")
# Reshard .bs files along layer boundaries, no re-encoding (V11)
bigsmall.reshard(["model.bs"], "resharded/", target_shard_size_gb=2.0)
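Resharding along layer boundaries can be pictured as a planning step over tensor names and sizes. The sketch below is an assumption about how such a plan could be built, not BigSmall's implementation: it assumes names contain `.layers.<index>.` (as in many HuggingFace checkpoints), and `group_by_layer` and `plan_shards` are hypothetical helpers.

```python
import re

LAYER_RE = re.compile(r"\.layers\.(\d+)\.")

def group_by_layer(tensor_sizes):
    """Group (name, size) pairs by transformer layer index, keeping insertion order."""
    groups = {}
    for name, size in tensor_sizes.items():
        match = LAYER_RE.search(name)
        key = f"layer_{int(match.group(1))}" if match else "non_layer"
        groups.setdefault(key, []).append((name, size))
    return groups

def plan_shards(tensor_sizes, target_bytes):
    """Greedily pack whole layers into shards; a layer is never split across shards."""
    shards, current, current_size = [], [], 0
    for tensors in group_by_layer(tensor_sizes).values():
        group_size = sum(size for _, size in tensors)
        if current and current_size + group_size > target_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.extend(name for name, _ in tensors)
        current_size += group_size
    if current:
        shards.append(current)
    return shards

sizes = {
    "model.embed_tokens.weight": 40,
    "model.layers.0.attn.weight": 30,
    "model.layers.0.mlp.weight": 30,
    "model.layers.1.attn.weight": 30,
    "model.layers.1.mlp.weight": 30,
    "lm_head.weight": 40,
}
plan = plan_shards(sizes, target_bytes=100)
```

Since only the assignment of tensors to shards changes, the compressed tensor payloads can be moved as-is, which is consistent with resharding needing no re-encoding.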
What's new in v3.14
- GPU-resident KV cache (V9): GPUCompressedKVCache keeps the compressed cache and the encode/decode passes entirely on the CUDA device, with no CPU round-trip. ~47× faster than the CPU KV codec on the reference shape, bit-identical round-trip. get_kv_cache(device, mode) auto-picks the GPU backend when CUDA is available. V9B adds fused Triton pack/unpack kernels.
- Progressive HTTP streaming (V10): stream_from_hub(repo_id) decompresses a model directly from the HuggingFace CDN over HTTP byte-range requests. With the default cache=False, zero .bs bytes are written to disk.
- Reshard (V11): bigsmall reshard splits, joins, or rebalances .bs shards along transformer-layer boundaries with no re-encoding. Every output tensor is md5-verified.
- numba is now a hard dependency (numba>=0.61): guarantees the JIT codec path runs everywhere instead of a silent slow fallback.
- CI green across the full matrix: Ubuntu / Windows / macOS × Python 3.10 / 3.11 / 3.12.
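For readers unfamiliar with HTTP byte-range requests, which the streaming feature in the list above relies on, here is a minimal standard-library sketch. It only builds the request and the range plan; nothing is sent. The URL is a placeholder, and `range_request` and `chunk_ranges` are hypothetical helpers, not part of the bigsmall API.

```python
import urllib.request

def range_request(url, start, end):
    """Build a GET request for bytes start..end (inclusive) of url."""
    request = urllib.request.Request(url)
    request.add_header("Range", f"bytes={start}-{end}")
    return request

def chunk_ranges(total_size, chunk_size):
    """Yield inclusive (start, end) byte ranges that cover total_size bytes."""
    for start in range(0, total_size, chunk_size):
        yield start, min(start + chunk_size, total_size) - 1

# Placeholder URL, used only to show the header that would be sent.
req = range_request("https://example.com/model.bs", 0, 1023)
ranges = list(chunk_ranges(2500, 1024))
```

A server that supports ranges answers each such request with status 206 and only the requested bytes, so a client can decode one chunk at a time without holding or saving the whole file.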
Earlier highlights still current: delta compression (fine-tunes as ~34%-size patches), --auto-delta base detection, BF16-native F32 auto-routing (Whisper-class), --resume, verify --fast/--sample, mmap decode, Reed-Solomon --ecc + repair, and BigSmallStreamingModel(lru_max_vram_gb=…).
Research
The lossless compression ceiling for BF16 neural weights has been measured. The smallest achievable size is ~62% of the raw BF16 size for a standalone model, and ~34% for fine-tunes stored with delta compression. We ran 300+ experiments across every known mathematical approach (entropy coding, cross-tensor prediction, learned translators, persistent homology, optimal transport, quantum-inspired methods, and more) and proved that there is no further compression available within the strict bit-identity contract.
Full findings, all experiments, all dead-ends: 10.5281/zenodo.20279247. Plain-English summary: docs/research.md.
Install
pip install bigsmall # core
pip install "bigsmall[hf]" # + HuggingFace integration
pip install "bigsmall[ecc]" # + Reed-Solomon error recovery
pip install "bigsmall[all]" # everything
Requires Python 3.10+ (numba>=0.61, now a hard dependency, does not support Python 3.9). Works on Linux, macOS, and Windows. CPU, NVIDIA, AMD, and Apple Silicon.
License
Code: Elastic License 2.0. Free for personal, research, and commercial use. SaaS providers should see LICENSING.md.
Model weights distributed in .bs format keep the license of the original model.
Links
- PyPI — https://pypi.org/project/bigsmall/
- GitHub — https://github.com/wpferrell/Bigsmall
- HuggingFace — https://huggingface.co/wpferrell
- Paper / DOI — https://doi.org/10.5281/zenodo.20279247 (always resolves to the latest version)
- Paper (PDF) — https://github.com/wpferrell/Bigsmall/blob/main/paper.pdf
- Docs — docs/
- Changelog — CHANGELOG.md
- Contact — wpferrell@gmail.com