Lossless neural network weight compression - run any model, no compromises

BigSmall — Make AI Models Smaller, Instantly

Lossless compression for neural network weights. Same model, smaller files. Bit-identical weights, md5-verified.

pip install bigsmall

A 14 GB Mistral 7B becomes 9 GB. A fine-tuned model becomes a small "patch" on top of its base — often less than 35% of the full size. Drop-in compatible with HuggingFace from_pretrained.


What it does

Three things, in plain English:

1. Compress any model

bigsmall compress model.safetensors -o model.bs
bigsmall decompress model.bs -o reconstructed.safetensors

Before: 15 GB safetensors. After: 10 GB .bs file. Quality: every weight bit-for-bit identical to the original.
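
The bit-for-bit claim can be checked independently of BigSmall. The sketch below is not part of the bigsmall package; it relies only on the published safetensors layout (an 8-byte little-endian header length, a JSON header, then the raw tensor bytes) and compares per-tensor MD5 digests, so a different tensor order in the file does not matter. The helper `make_safetensors` is a hypothetical stand-in that builds tiny demo files.

```python
import hashlib
import json
import struct


def tensor_md5s(blob: bytes) -> dict[str, str]:
    """Return {tensor_name: md5 hex digest} for a safetensors byte string."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8 : 8 + header_len])
    data = blob[8 + header_len :]
    digests = {}
    for name, info in header.items():
        if name == "__metadata__":          # optional free-form metadata, not a tensor
            continue
        begin, end = info["data_offsets"]   # offsets are relative to the data section
        digests[name] = hashlib.md5(data[begin:end]).hexdigest()
    return digests


def make_safetensors(tensors: dict[str, list[float]]) -> bytes:
    """Build a tiny F32 safetensors blob for the demo (hypothetical helper)."""
    header, chunks, offset = {}, [], 0
    for name, values in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        header[name] = {
            "dtype": "F32",
            "shape": [len(values)],
            "data_offsets": [offset, offset + len(raw)],
        }
        chunks.append(raw)
        offset += len(raw)
    head = json.dumps(header).encode()
    return struct.pack("<Q", len(head)) + head + b"".join(chunks)


original = make_safetensors({"w": [0.1, 0.2], "b": [1.0]})
roundtrip = make_safetensors({"b": [1.0], "w": [0.1, 0.2]})   # same tensors, other order
changed = make_safetensors({"w": [0.1, 0.2000001], "b": [1.0]})

same = tensor_md5s(original) == tensor_md5s(roundtrip)
differs = tensor_md5s(original) != tensor_md5s(changed)
print(same, differs)
```

For real files, read model.safetensors and reconstructed.safetensors with `open(path, "rb").read()` and compare the two digest dictionaries. That loads each file fully into memory, so for very large models a streaming variant would be needed.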

2. Compress a fine-tuned model as a "patch"

If you have the base model already, store only what changed:

bigsmall compress fine_tuned.safetensors --delta-from base.safetensors -o patch.bs
bigsmall apply base.safetensors patch.bs -o reconstructed.safetensors

Before: 15 GB fine-tuned model. After: ~5 GB patch (depends on how much was fine-tuned). Quality: every weight bit-for-bit identical to the original.

This is the biggest user win. If you're publishing a fine-tune of a public base, your users can store the base once and download patches.
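
Why a patch can be so much smaller is easiest to see with a toy example. The sketch below is a generic XOR-plus-zlib illustration, not BigSmall's actual delta codec (which is not described here); the function names are hypothetical, and the toy data assumes most weights are untouched by fine-tuning, which may not hold for a real fine-tune.

```python
import random
import struct
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    assert len(a) == len(b)
    return bytes(x ^ y for x, y in zip(a, b))


def make_patch(base: bytes, fine_tuned: bytes) -> bytes:
    # Unchanged weights XOR to zero, and long zero runs compress very well.
    return zlib.compress(xor_bytes(base, fine_tuned), 9)


def apply_patch(base: bytes, patch: bytes) -> bytes:
    # XOR is its own inverse, so base ^ delta gives back the fine-tune exactly.
    return xor_bytes(base, zlib.decompress(patch))


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]
tuned = list(weights)
for i in rng.sample(range(4096), 200):      # pretend ~5% of the weights changed
    tuned[i] += rng.gauss(0.0, 0.001)

base = struct.pack("<4096f", *weights)
fine_tuned = struct.pack("<4096f", *tuned)

patch = make_patch(base, fine_tuned)
restored = apply_patch(base, patch)
standalone = zlib.compress(fine_tuned, 9)   # compressing the fine-tune on its own
print(len(fine_tuned), len(standalone), len(patch), restored == fine_tuned)
```

The reconstruction is exact by construction, and the patch comes out far smaller than the standalone compressed file because the base already carries most of the information.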

3. Use a pre-compressed model from HuggingFace

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "wpferrell/mistral-7b-instruct-bigsmall"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

Works exactly like any other HuggingFace model. BigSmall transparently decompresses in the background.


How much smaller?

Model                                 Original   BigSmall   Saved
GPT-2 (117M, FP32)                    548 MB     414 MB     24%
Llama 3.2-1B-Instruct                 2.5 GB     1.5 GB     40%
Llama 3.2-3B-Instruct                 6.4 GB     3.9 GB     39%
Mistral 7B Instruct v0.3              14.2 GB    9.3 GB     34%
Qwen 2.5-7B-Instruct                  15.2 GB    10.0 GB    34%
Llama 3-8B-Instruct                   16.1 GB    9.8 GB     39%
Qwen 2.5-14B-Instruct                 29.5 GB    19.5 GB    34%
Gemma 2-9B-it                         18.5 GB    11.3 GB    39%
Gemma 3-1B-it                         2.0 GB     1.2 GB     40%
Stable Diffusion 1.5 UNet (FP16)      1.72 GB    1.48 GB    14%
Fine-tune patch (Instruct vs base)    14 GB      ~5 GB      ~65%
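
The "Saved" column is one minus the ratio of compressed to original size. A small sketch that recomputes it for a few rows follows; the function name is illustrative. Because the sizes in the table are rounded, recomputed percentages for other rows can differ from the table by about a point.

```python
def saved_percent(original: float, compressed: float) -> float:
    """Percentage of the original size removed by compression."""
    return 100.0 * (1.0 - compressed / original)


# (original, compressed) sizes copied from the table; units cancel out.
rows = {
    "GPT-2 (117M, FP32)": (548, 414),
    "Qwen 2.5-7B-Instruct": (15.2, 10.0),
    "Llama 3-8B-Instruct": (16.1, 9.8),
    "Stable Diffusion 1.5 UNet (FP16)": (1.72, 1.48),
}

saved = {name: round(saved_percent(o, c)) for name, (o, c) in rows.items()}
for name, pct in saved.items():
    print(f"{name}: {pct}% saved")
```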

Browse all pre-compressed models →


Why lossless matters

  • Exact same model. Bit-identical weights. Every floating-point value is mathematically identical to the original.
  • Not quantization. Quantization (INT8, INT4) changes weight values — model behaviour changes too, even if just slightly.
  • Not pruning. Pruning removes parts of the model.
  • Not approximation. No tricks, no calibration data, no quality loss.

BigSmall compresses neural network weights the same way ZIP compresses text files: it finds redundancy in the bit pattern and stores it more compactly. The output decodes back to the exact same bits. md5 verified on every tensor.
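
To see where that redundancy tends to sit, the sketch below measures the Shannon entropy of the two bytes of BF16 values drawn from a narrow Gaussian, a rough stand-in for trained weights. It is a generic illustration under those assumptions, not BigSmall's codec: the byte holding the sign and most of the exponent takes only a few distinct values, while the byte holding the mantissa looks close to random.

```python
import math
import random
import struct
from collections import Counter


def entropy_bits(symbols) -> float:
    """Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(50_000)]   # synthetic "weights"

# BF16 is the top 16 bits of an IEEE-754 float32 (plain truncation used here).
bf16 = [struct.pack(">f", w)[:2] for w in weights]
high = [b[0] for b in bf16]   # sign bit + upper 7 exponent bits
low = [b[1] for b in bf16]    # lowest exponent bit + 7 mantissa bits

h_high, h_low = entropy_bits(high), entropy_bits(low)
print(f"high byte: {h_high:.2f} bits, low byte: {h_low:.2f} bits")
```

An entropy coder can shrink the low-entropy byte substantially and the near-random byte hardly at all, which is why lossless savings on weights are real but bounded.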


Install

pip install bigsmall                  # core
pip install "bigsmall[hf]"            # + HuggingFace Hub integration
pip install "bigsmall[ecc]"           # + Reed-Solomon error recovery
pip install "bigsmall[all]"           # everything

Requirements: Python 3.9+, NumPy, safetensors. PyTorch is required for HuggingFace round-trips and for using compressed models in inference.

Works on Linux, macOS, and Windows. CPU + NVIDIA + AMD + Apple Silicon.


What's new in v3.13.0

  • Delta compression (the big one). Compress a fine-tune as a patch on its base model. bigsmall compress fine_tuned/ --delta-from base/ -o patch.bs. Often <35% of the full model size, fully lossless.
  • Auto-detect the base model. bigsmall compress --auto-delta scans known-base fingerprints and suggests the right base. Header embeds a fingerprint of the base used, so decompression warns on mismatch.
  • Resumable compression. bigsmall compress --resume picks up exactly where it left off if the run was interrupted. Tensor-level checkpointing.
  • mmap-backed decode. Large .bs files (>256 MB) are now mmap'd instead of fully read into RAM. Lower peak memory, faster start.
  • GPU INT8 KV cache. LossyKVCacheGPU — opt-in lossy compression for runtime KV cache. ~50% VRAM saving for streaming inference, max error ~0.04 in BF16.
  • Streaming LRU layer cache. BigSmallStreamingModel(lru_max_vram_gb=2.0) keeps the most-recently-used decoded layers in VRAM.
  • Reed-Solomon ECC. bigsmall compress --ecc writes a parity sidecar that can recover from ~16 corrupted bytes per 223-byte block. bigsmall repair uses it.
  • Fast probabilistic verify. bigsmall verify --sample 0.001 decodes 0.1% of weights and verifies their md5 — catches in-blob corruption without the cost of a full verify.
  • Three new CLI commands. bigsmall scan (analyse before compressing), bigsmall apply (delta + base → original), bigsmall repair (ECC recovery).
  • V8 codec opt-in. Layer-type-aware codec for attention / embedding tensors. Negligible average gain (~0.07%), available via --use-v8-codec for users who want the option.
  • bigsmall.detect_bf16_native — detects F32 models that are really BF16 upcast and compresses them as BF16 (44% of raw F32 instead of 83%).
  • bigsmall.download_delta(repo_id, base_dir, output_dir) — pull a delta repo from HuggingFace and reconstruct the fine-tune.
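
The idea behind BF16-native detection can be sketched in a few lines. An F32 value that was upcast from BF16 has its low 16 mantissa bits set to zero, so those two bytes can be dropped and restored without loss. The code below is an illustration under that assumption; the function names are hypothetical and this is not the implementation or signature of bigsmall.detect_bf16_native.

```python
import struct


def is_bf16_upcast(raw_f32: bytes) -> bool:
    """True if every little-endian F32 value has its low 16 bits equal to zero."""
    return all(
        raw_f32[i] == 0 and raw_f32[i + 1] == 0
        for i in range(0, len(raw_f32), 4)
    )


def f32_to_bf16(raw_f32: bytes) -> bytes:
    """Keep only the top two bytes of each little-endian F32 value."""
    return b"".join(raw_f32[i + 2 : i + 4] for i in range(0, len(raw_f32), 4))


def bf16_to_f32(raw_bf16: bytes) -> bytes:
    """Restore F32 by putting the two zero bytes back."""
    return b"".join(b"\x00\x00" + raw_bf16[i : i + 2] for i in range(0, len(raw_bf16), 2))


upcast = struct.pack("<3f", 1.0, -0.5, 0.15625)   # all exactly representable in BF16
native = struct.pack("<3f", 1.0, -0.5, 0.1)       # 0.1 needs the low mantissa bits

print(is_bf16_upcast(upcast), is_bf16_upcast(native))
halved = f32_to_bf16(upcast)
print(len(upcast), len(halved), bf16_to_f32(halved) == upcast)
```

Halving the input before entropy coding is lossless only when the check passes for the whole tensor, so a single value with non-zero low bits means the tensor has to stay F32.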

See CHANGELOG.md for full details.


CLI reference

bigsmall compress SRC [-o OUTPUT] [--delta-from BASE] [--auto-delta]
                       [--resume] [--ecc] [--storage|--balanced|--inference]
bigsmall decompress SRC [-o OUTPUT] [--base BASE]
bigsmall info SRC                       # size, ratio, codecs used
bigsmall scan SRC                       # analyse before compressing
bigsmall stat SRC [--tensor X]          # per-tensor table
bigsmall verify SRC [--fast|--sample N] # integrity check
bigsmall diff A.bs B.bs [--patch P.bs]  # compare or write a delta
bigsmall apply BASE PATCH.bs -o OUT     # reconstruct from base + patch
bigsmall repair SRC.bs [-o OUT]         # recover using .ecc sidecar
bigsmall benchmark SRC                  # encode/decode speed
bigsmall migrate SRC                    # re-encode with current codecs
bigsmall status                         # list your BigSmall HF repos
bigsmall pipeline run SRC DST           # resumable download → compress → upload

Each command has --help for details. See docs/cli-reference.md for examples.
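
As a rough picture of what a sampled integrity check trades off, the sketch below hashes only a random fraction of chunks and compares them against stored digests. It is a generic illustration with hypothetical names, not how bigsmall verify --sample is implemented. Sampling is probabilistic: it finds widespread corruption quickly but can miss damage confined to chunks it did not pick.

```python
import hashlib
import random


def sample_verify(chunks, expected_md5, fraction, seed=0):
    """Check a random subset of chunks; return sorted indices whose digest mismatches."""
    rng = random.Random(seed)
    k = max(1, round(len(chunks) * fraction))
    picked = rng.sample(range(len(chunks)), k)
    return sorted(
        i for i in picked
        if hashlib.md5(chunks[i]).hexdigest() != expected_md5[i]
    )


# Demo data: 1000 small chunks and their reference digests.
chunks = [i.to_bytes(4, "little") * 8 for i in range(1000)]
expected = [hashlib.md5(c).hexdigest() for c in chunks]

# Corrupt a contiguous run of 200 chunks by flipping one bit in each.
damaged = list(chunks)
bad = set(range(300, 500))
for i in bad:
    damaged[i] = bytes([damaged[i][0] ^ 0x01]) + damaged[i][1:]

clean_report = sample_verify(chunks, expected, fraction=0.05)
sampled_report = sample_verify(damaged, expected, fraction=0.05)
full_report = sample_verify(damaged, expected, fraction=1.0)
print(len(clean_report), len(sampled_report), len(full_report))
```

With a fraction of 1.0 the check degenerates into a full verify and reports every damaged chunk; smaller fractions report only the damaged chunks that happened to be sampled.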


Common workflows

Compress and upload a model to HuggingFace

python -c "
import bigsmall
bigsmall.compress_for_hub('mistralai/Mistral-7B-Instruct-v0.3', './mistral_bs/')
bigsmall.upload_to_hub('./mistral_bs/', repo_id='wpferrell/mistral-7b-bigsmall')
"

Use a compressed model on a low-VRAM GPU

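from transformers import AutoTokenizer
from bigsmall import BigSmallStreamingModel

model = BigSmallStreamingModel.from_pretrained(
    "wpferrell/mistral-7b-instruct-bigsmall",
    device="cuda",
    lru_max_vram_gb=2.0,     # cache 2 GB of decoded layers
)

# Tokenize a prompt; the tokenizer comes from the original model repo
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
input_ids = tokenizer("Hello, world", return_tensors="pt").input_ids.to("cuda")
out = model.generate(input_ids, max_new_tokens=100)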

Uses ~12× less VRAM than standard loading by streaming layers on demand.
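
The caching policy behind lru_max_vram_gb can be pictured as a byte-budgeted LRU map from layer index to decoded weights. The sketch below is a minimal illustration of that policy using plain bytes in place of GPU tensors; the class `LayerLRU` and the `decode` callback are hypothetical and do not reflect BigSmallStreamingModel's internals.

```python
from collections import OrderedDict
from typing import Callable


class LayerLRU:
    """Keep decoded layers under a byte budget, evicting the least recently used."""

    def __init__(self, max_bytes: int, decode: Callable[[int], bytes]):
        self.max_bytes = max_bytes
        self.decode = decode                      # hypothetical: layer index -> weights
        self.cache: "OrderedDict[int, bytes]" = OrderedDict()
        self.used = 0

    def get(self, layer: int) -> bytes:
        if layer in self.cache:
            self.cache.move_to_end(layer)         # mark as most recently used
            return self.cache[layer]
        weights = self.decode(layer)              # cache miss: decode on demand
        self.cache[layer] = weights
        self.used += len(weights)
        while self.used > self.max_bytes and len(self.cache) > 1:
            _, evicted = self.cache.popitem(last=False)   # drop the oldest entry
            self.used -= len(evicted)
        return weights


decoded = []                                      # records which layers were decoded


def fake_decode(layer: int) -> bytes:
    decoded.append(layer)
    return bytes(100)                             # every fake layer is 100 bytes


cache = LayerLRU(max_bytes=250, decode=fake_decode)
for layer in [0, 1, 0, 2, 0, 1]:
    cache.get(layer)

print(decoded, list(cache.cache), cache.used)
```

In the demo the budget holds two layers, so layer 0, which is touched most often, stays cached while layers 1 and 2 take turns being evicted and decoded again.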

Distribute a fine-tune as a small patch

# As the publisher:
bigsmall compress fine_tuned.safetensors --delta-from base.safetensors -o patch.bs
# upload patch.bs to your HF repo

# As a user:
python -c "
import bigsmall
bigsmall.download_delta(
    'wpferrell/my-finetune-bigsmall-delta',
    base_dir='~/.cache/huggingface/.../Mistral-7B-Instruct-v0.3',
    output_dir='./reconstructed',
)
"

Research

BigSmall grew out of a multi-month research effort to establish how far BF16 transformer weights can be compressed losslessly, one tensor at a time. We measured every direction we found promising (column-major rescan, 2D context coding, head-cluster dedup, QKV split, delta encoding, BF16-native F32 detection) and report what works and what doesn't.

Bottom-line findings:

  • The per-tensor lossless floor for BF16 transformer weights is ~65-66% of the raw size. Established empirically by the V4-V8 experiments (300+ tested combinations, see research/).
  • The biggest meaningful gain available today is delta compression for fine-tuned models: a patch is ~34% of the raw BF16 size.
  • Every other intra-tensor approach we tested failed to beat that floor.

Cite the BigSmall paper: Zenodo DOI 10.5281/zenodo.20279248

See docs/research.md for a plain-English summary of what was learned.


License

Code: Elastic License 2.0. Free for personal, research, and commercial use under typical software-product terms. See LICENSING.md for commercial licensing.

Model weights distributed via BigSmall format keep the license of the original model.



