Lossless neural network weight compression - run any model, no compromises
BigSmall — Make AI Models Smaller, Instantly
Lossless compression for neural network weights. Same model, smaller files. Bit-identical weights, md5-verified.
pip install bigsmall
A 14 GB Mistral 7B becomes 9 GB. A fine-tuned model becomes a small "patch" on top of its base — often less than 35% of the full size. Drop-in compatible with HuggingFace from_pretrained.
What it does
Three things, in plain English:
1. Compress any model
bigsmall compress model.safetensors -o model.bs
bigsmall decompress model.bs -o reconstructed.safetensors
Before: 15 GB safetensors. After: 10 GB .bs file. Quality: every weight bit-for-bit identical to the original.
2. Compress a fine-tuned model as a "patch"
If you have the base model already, store only what changed:
bigsmall compress fine_tuned.safetensors --delta-from base.safetensors -o patch.bs
bigsmall apply base.safetensors patch.bs -o reconstructed.safetensors
Before: 15 GB fine-tuned model. After: ~5 GB patch (depends on how much was fine-tuned). Quality: every weight bit-for-bit identical to the original.
This is the biggest user win. If you're publishing a fine-tune of a public base, your users can store the base once and download patches.
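To make the patch idea concrete, here is a minimal sketch of one way a lossless delta can work: XOR the raw bit patterns of the base and fine-tuned tensors, then compress the mostly-zero result. This is not BigSmall's actual patch format or codec. The helper names `make_patch` and `apply_patch` are hypothetical, `zlib` stands in for the real entropy coder, and float16 stands in for BF16 because NumPy has no native BF16 type.

```python
import zlib

import numpy as np


def make_patch(base: np.ndarray, tuned: np.ndarray) -> bytes:
    """XOR the raw 16-bit patterns of two same-shaped tensors, then compress."""
    delta = base.view(np.uint16) ^ tuned.view(np.uint16)
    return zlib.compress(delta.tobytes(), level=6)


def apply_patch(base: np.ndarray, patch: bytes) -> np.ndarray:
    """Rebuild the fine-tuned tensor from the base tensor and the patch."""
    delta = np.frombuffer(zlib.decompress(patch), dtype=np.uint16)
    return (base.view(np.uint16) ^ delta.reshape(base.shape)).view(base.dtype)


rng = np.random.default_rng(0)
base = (rng.standard_normal((256, 256)) * 0.02).astype(np.float16)
tuned = base.copy()
tuned[:8] += np.float16(0.01)  # toy "fine-tune": only 8 of 256 rows change

patch = make_patch(base, tuned)
restored = apply_patch(base, patch)

# Compare against compressing the fine-tuned tensor on its own.
full = zlib.compress(tuned.tobytes(), level=6)
print(f"full: {len(full)} bytes, patch: {len(patch)} bytes")
print("bit-identical:", restored.tobytes() == tuned.tobytes())
```

Because XOR is its own inverse, applying the patch to the same base always returns the exact original bits. In this toy case only a few rows change, so the saving is larger than a real fine-tune would typically give.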
3. Use a pre-compressed model from HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "wpferrell/mistral-7b-instruct-bigsmall"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
Works exactly like any other HuggingFace model. BigSmall transparently decompresses in the background.
How much smaller?
| Model | Original | BigSmall | Saved |
|---|---|---|---|
| GPT-2 (117M, FP32) | 548 MB | 414 MB | 24% |
| Llama 3.2-1B-Instruct | 2.5 GB | 1.5 GB | 40% |
| Llama 3.2-3B-Instruct | 6.4 GB | 3.9 GB | 39% |
| Mistral 7B Instruct v0.3 | 14.2 GB | 9.3 GB | 34% |
| Qwen 2.5-7B-Instruct | 15.2 GB | 10.0 GB | 34% |
| Llama 3-8B-Instruct | 16.1 GB | 9.8 GB | 39% |
| Qwen 2.5-14B-Instruct | 29.5 GB | 19.5 GB | 34% |
| Gemma 2-9B-it | 18.5 GB | 11.3 GB | 39% |
| Gemma 3-1B-it | 2.0 GB | 1.2 GB | 40% |
| Stable Diffusion 1.5 UNet (FP16) | 1.72 GB | 1.48 GB | 14% |
| Fine-tune patch (Instruct vs base) | 14 GB | ~5 GB | ~65% |
Browse all pre-compressed models →
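The "Saved" column is one minus the ratio of compressed size to original size. A small sketch that recomputes it for three rows of the table (the function name `saved_percent` is only illustrative):

```python
def saved_percent(original: float, compressed: float) -> float:
    """Percentage of space saved; both sizes must use the same unit."""
    return 100.0 * (1.0 - compressed / original)


# (original, compressed) pairs copied from the table above.
rows = {
    "GPT-2 (117M, FP32)": (548, 414),                # MB
    "Gemma 3-1B-it": (2.0, 1.2),                     # GB
    "Stable Diffusion 1.5 UNet (FP16)": (1.72, 1.48),  # GB
}

for name, (original, compressed) in rows.items():
    print(f"{name}: {saved_percent(original, compressed):.0f}% saved")
```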
Why lossless matters
- Exact same model. Bit-identical weights. Every floating-point value is mathematically identical to the original.
- Not quantization. Quantization (INT8, INT4) changes weight values — model behaviour changes too, even if just slightly.
- Not pruning. Pruning removes parts of the model.
- Not approximation. No tricks, no calibration data, no quality loss.
BigSmall compresses neural network weights the same way ZIP compresses text files: it finds redundancy in the bit pattern and stores it more compactly. The output decodes back to the exact same bits. md5 verified on every tensor.
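The round trip described above can be illustrated with standard-library tools. The sketch below is not BigSmall's codec. It groups the high and low bytes of each float16 value into separate planes (the high byte holds the sign and exponent, which are far more repetitive than the mantissa), compresses with `zlib` as a stand-in, and checks the md5 of the decoded tensor against the original. The names `encode` and `decode` are hypothetical.

```python
import hashlib
import zlib

import numpy as np


def encode(weights: np.ndarray) -> bytes:
    """Split values into byte planes, then compress the planes losslessly."""
    raw = np.ascontiguousarray(weights).view(np.uint8).reshape(-1, weights.itemsize)
    planes = np.ascontiguousarray(raw.T)  # plane k holds byte k of every value
    return zlib.compress(planes.tobytes(), level=9)


def decode(blob: bytes, dtype, shape) -> np.ndarray:
    """Invert encode(): decompress, re-interleave the planes, restore dtype."""
    itemsize = np.dtype(dtype).itemsize
    planes = np.frombuffer(zlib.decompress(blob), dtype=np.uint8).reshape(itemsize, -1)
    return np.ascontiguousarray(planes.T).view(dtype).reshape(shape)


rng = np.random.default_rng(0)
weights = (rng.standard_normal((512, 512)) * 0.02).astype(np.float16)

blob = encode(weights)
restored = decode(blob, weights.dtype, weights.shape)

original_md5 = hashlib.md5(weights.tobytes()).hexdigest()
restored_md5 = hashlib.md5(restored.tobytes()).hexdigest()
print(f"{weights.nbytes} -> {len(blob)} bytes, md5 match: {original_md5 == restored_md5}")
```

A general-purpose compressor recovers only part of the redundancy on synthetic weights like these. The point of the sketch is the verification pattern: hash before, hash after, compare.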
Install
pip install bigsmall # core
pip install "bigsmall[hf]" # + HuggingFace Hub integration
pip install "bigsmall[ecc]" # + Reed-Solomon error recovery
pip install "bigsmall[all]" # everything
Requirements: Python 3.9+, NumPy, safetensors. PyTorch is required for HuggingFace round-trips and for using compressed models in inference.
Works on Linux, macOS, and Windows. CPU + NVIDIA + AMD + Apple Silicon.
What's new in v3.13.0
- Delta compression (the big one). Compress a fine-tune as a patch on its base model: `bigsmall compress fine_tuned/ --delta-from base/ -o patch.bs`. Often <35% of the full model size, fully lossless.
- Auto-detect the base model. `bigsmall compress --auto-delta` scans known-base fingerprints and suggests the right base. The header embeds a fingerprint of the base used, so decompression warns on mismatch.
- Resumable compression. `bigsmall compress --resume` picks up exactly where it left off if the run was interrupted. Tensor-level checkpointing.
- mmap-backed decode. Large `.bs` files (>256 MB) are now mmap'd instead of fully read into RAM. Lower peak memory, faster start.
- GPU INT8 KV cache. `LossyKVCacheGPU` is an opt-in lossy compression for the runtime KV cache. ~50% VRAM saving for streaming inference, max error ~0.04 in BF16.
- Streaming LRU layer cache. `BigSmallStreamingModel(lru_max_vram_gb=2.0)` keeps the most-recently-used decoded layers in VRAM.
- Reed-Solomon ECC. `bigsmall compress --ecc` writes a parity sidecar that can recover from ~16 corrupted bytes per 223-byte block. `bigsmall repair` uses it.
- Fast probabilistic verify. `bigsmall verify --sample 0.001` decodes 0.1% of weights and verifies their md5, which catches in-blob corruption without the cost of a full verify.
- Three new CLI commands. `bigsmall scan` (analyse before compressing), `bigsmall apply` (delta + base → original), `bigsmall repair` (ECC recovery).
- V8 codec opt-in. Layer-type-aware codec for attention / embedding tensors. Negligible average gain (~0.07%), available via `--use-v8-codec` for users who want the option.
- `bigsmall.detect_bf16_native` detects F32 models that are really BF16 upcast and compresses them as BF16 (44% of raw F32 instead of 83%).
- `bigsmall.download_delta(repo_id, base_dir, output_dir)` pulls a delta repo from HuggingFace and reconstructs the fine-tune.
See CHANGELOG.md for full details.
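The BF16-native detection mentioned in the list above rests on a simple property: a float32 value that was upcast from BF16 has all-zero low 16 bits, so those bits carry no information and can be dropped without loss. The sketch below shows that check in NumPy. It is an assumption about how such detection could work, not the implementation behind `bigsmall.detect_bf16_native`, and all function names here are hypothetical.

```python
import numpy as np


def looks_like_bf16_upcast(weights: np.ndarray) -> bool:
    """True if every float32 value has all-zero low 16 bits."""
    if weights.dtype != np.float32:
        return False
    bits = np.ascontiguousarray(weights).view(np.uint32)
    return not np.any(bits & np.uint32(0xFFFF))


def to_bf16_bits(weights: np.ndarray) -> np.ndarray:
    """Keep only the top 16 bits of each float32 (sign, exponent, 7 mantissa bits)."""
    return (weights.view(np.uint32) >> np.uint32(16)).astype(np.uint16)


def from_bf16_bits(bits: np.ndarray) -> np.ndarray:
    """Upcast 16-bit BF16 patterns back to float32 by zero-filling the low bits."""
    return (bits.astype(np.uint32) << np.uint32(16)).view(np.float32)


rng = np.random.default_rng(0)
native_f32 = rng.standard_normal(10_000).astype(np.float32)
upcast_f32 = from_bf16_bits(to_bf16_bits(native_f32))  # simulates a BF16 model saved as F32

print("native F32 detected as BF16:", looks_like_bf16_upcast(native_f32))
print("upcast F32 detected as BF16:", looks_like_bf16_upcast(upcast_f32))

# Storing only the top 16 bits halves the raw size and is still lossless.
round_trip = from_bf16_bits(to_bf16_bits(upcast_f32))
print("lossless:", round_trip.tobytes() == upcast_f32.tobytes())
```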
CLI reference
bigsmall compress SRC [-o OUTPUT] [--delta-from BASE] [--auto-delta]
[--resume] [--ecc] [--storage|--balanced|--inference]
bigsmall decompress SRC [-o OUTPUT] [--base BASE]
bigsmall info SRC # size, ratio, codecs used
bigsmall scan SRC # analyse before compressing
bigsmall stat SRC [--tensor X] # per-tensor table
bigsmall verify SRC [--fast|--sample N] # integrity check
bigsmall diff A.bs B.bs [--patch P.bs] # compare or write a delta
bigsmall apply BASE PATCH.bs -o OUT # reconstruct from base + patch
bigsmall repair SRC.bs [-o OUT] # recover using .ecc sidecar
bigsmall benchmark SRC # encode/decode speed
bigsmall migrate SRC # re-encode with current codecs
bigsmall status # list your BigSmall HF repos
bigsmall pipeline run SRC DST # resumable download → compress → upload
Each command has --help for details. See docs/cli-reference.md for examples.
Common workflows
Compress and upload a model to HuggingFace
python -c "
import bigsmall
bigsmall.compress_for_hub('mistralai/Mistral-7B-Instruct-v0.3', './mistral_bs/')
bigsmall.upload_to_hub('./mistral_bs/', repo_id='wpferrell/mistral-7b-bigsmall')
"
Use a compressed model on a low-VRAM GPU
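from transformers import AutoTokenizer
from bigsmall import BigSmallStreamingModel
model = BigSmallStreamingModel.from_pretrained(
    "wpferrell/mistral-7b-instruct-bigsmall",
    device="cuda",
    lru_max_vram_gb=2.0,  # cache 2 GB of decoded layers
)
# Tokenize a prompt with the original model's tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
input_ids = tokenizer("Hello, world", return_tensors="pt").input_ids.to("cuda")
out = model.generate(input_ids, max_new_tokens=100)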
Uses ~12× less VRAM than standard loading by streaming layers on demand.
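The `lru_max_vram_gb` budget suggests a least-recently-used cache over decoded layers. The sketch below shows that general technique with a byte budget. It is not BigSmall's implementation: `LayerLRUCache` and `load_layer` are hypothetical names, and plain `bytes` objects stand in for decoded GPU tensors.

```python
from collections import OrderedDict
from typing import Callable, List


class LayerLRUCache:
    """Keep recently used decoded layers, evicting the oldest past a byte budget."""

    def __init__(self, max_bytes: int, load_layer: Callable[[int], bytes]):
        self.max_bytes = max_bytes
        self.load_layer = load_layer  # decodes one layer on demand
        self._cache: "OrderedDict[int, bytes]" = OrderedDict()
        self._used = 0
        self.misses = 0

    def get(self, index: int) -> bytes:
        if index in self._cache:
            self._cache.move_to_end(index)  # mark as most recently used
            return self._cache[index]
        self.misses += 1
        layer = self.load_layer(index)
        self._cache[index] = layer
        self._used += len(layer)
        # Evict least recently used layers until we are back under budget.
        while self._used > self.max_bytes and len(self._cache) > 1:
            _, evicted = self._cache.popitem(last=False)
            self._used -= len(evicted)
        return layer

    def cached_layers(self) -> List[int]:
        """Layer indices from least to most recently used."""
        return list(self._cache)


# Each fake layer is 100 bytes, so a 250-byte budget holds two layers.
cache = LayerLRUCache(max_bytes=250, load_layer=lambda i: bytes(100))
for layer_index in [0, 1, 0, 2, 1]:
    cache.get(layer_index)

print("misses:", cache.misses)
print("cached:", cache.cached_layers())
```

A transformer forward pass visits layers in a fixed order, so a small cache mostly helps when the same layers are revisited soon, for example across consecutive decoding steps.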
Distribute a fine-tune as a small patch
# As the publisher:
bigsmall compress fine_tuned.safetensors --delta-from base.safetensors -o patch.bs
# upload patch.bs to your HF repo
# As a user:
python -c "
import bigsmall
bigsmall.download_delta(
    'wpferrell/my-finetune-bigsmall-delta',
    base_dir='~/.cache/huggingface/.../Mistral-7B-Instruct-v0.3',
    output_dir='./reconstructed',
)
"
Research
BigSmall grew out of a multi-month research effort that established the per-tensor lossless compression limit for BF16 transformer weights. We measured every meaningful direction (column-major rescan, 2D context coding, head-cluster dedup, QKV split, delta encoding, BF16-native F32 detection) and report what works and what doesn't.
Bottom-line findings:
- The per-tensor lossless floor for BF16 transformer weights is ~65-66% of raw. Proven by V4-V8 experiments (300+ tested combinations, see `research/`).
- The biggest meaningful gain available today is delta compression for fine-tuned models: ~34% of raw BF16.
- All other intra-tensor angles have been falsified empirically.
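For intuition about why the floor sits near two thirds of raw size, the sketch below estimates the information content of each BF16 field on synthetic weights. It does not reproduce the project's experiments. It assumes Gaussian weights with standard deviation 0.02, truncates float32 to BF16 instead of rounding, and sums per-field Shannon entropies, which treats the fields as independent.

```python
import numpy as np


def entropy_bits(symbols: np.ndarray) -> float:
    """Empirical Shannon entropy, in bits per symbol, of non-negative integers."""
    counts = np.bincount(symbols)
    probs = counts[counts > 0] / symbols.size
    return float(-(probs * np.log2(probs)).sum())


rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=200_000).astype(np.float32)

# BF16 is the top 16 bits of float32: 1 sign bit, 8 exponent bits, 7 mantissa bits.
bf16 = (weights.view(np.uint32) >> np.uint32(16)).astype(np.uint16)

sign = (bf16 >> 15).astype(np.int64)
exponent = ((bf16 >> 7) & 0xFF).astype(np.int64)
mantissa = (bf16 & 0x7F).astype(np.int64)

sign_bits = entropy_bits(sign)
exponent_bits = entropy_bits(exponent)
mantissa_bits = entropy_bits(mantissa)

ratio = (sign_bits + exponent_bits + mantissa_bits) / 16.0
print(f"sign {sign_bits:.2f} + exponent {exponent_bits:.2f} + mantissa {mantissa_bits:.2f} bits")
print(f"estimated size: {ratio:.0%} of raw BF16")
```

On data like this, the sign and mantissa are close to incompressible and nearly all of the saving comes from the exponent, which uses only a small part of its 8-bit range.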
Cite the BigSmall paper: Zenodo DOI 10.5281/zenodo.20279248
See docs/research.md for a plain-English summary of what was learned.
License
Code: Elastic License 2.0. Free for personal, research, and commercial use under typical software-product terms. See LICENSING.md for commercial licensing.
Model weights distributed via BigSmall format keep the license of the original model.
Links
- PyPI: https://pypi.org/project/bigsmall/
- GitHub: https://github.com/wpferrell/Bigsmall
- HuggingFace: https://huggingface.co/wpferrell
- Paper: https://doi.org/10.5281/zenodo.20279248
- Docs: docs/
Download files
File details
Details for the file bigsmall-3.13.0.tar.gz.
File metadata
- Download URL: bigsmall-3.13.0.tar.gz
- Upload date:
- Size: 190.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8077027a05942b166c4e45dff58f901fa90743a9668a6149bf45c28c66de695c |
| MD5 | 49d318efe689390c3192ffb517f4f99b |
| BLAKE2b-256 | 09a2ef4326968c37156ada0a8d74742c210e7d2c011b0d9d97a902d49dbc5c07 |
File details
Details for the file bigsmall-3.13.0-py3-none-any.whl.
File metadata
- Download URL: bigsmall-3.13.0-py3-none-any.whl
- Upload date:
- Size: 166.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b29700aff0c4dba1fe0b5bb39498d60e8bfb31d385313beee753c35527786a1c |
| MD5 | 01c086c936dd4e1a44cbb96ba11ea273 |
| BLAKE2b-256 | 484b4456a5f017ba95c3f1af3bc3f303884bbfdb1f2ee6dc38f59f1218df1c24 |