
BigSmall

Lossless neural network weight compression. One package, every float format, every model. Compress big models so they fit (or load faster) on small hardware.

Why BigSmall

BigSmall is lossless, not quantization. After decompression the weights are bit-for-bit identical to the original (md5-verified on every shard). You get the same inference outputs as the uncompressed model — no quality degradation, no fine-tune drift, no surprise accuracy regression on long-tail prompts.

Existing tools force a tradeoff: ZipNN (~83% ratio, FP32 only), DFloat11 (~68%, BF16 only, ~2× slower inference at batch=1), ZipServ (~70%, BF16 only, H100-style GPUs only). BigSmall covers FP32/BF16/FP16/FP8/FP4 in a single package, hits the proven mathematical floor on each format, and adds no inference overhead — you decompress once at load time and run at native speed.

It is the right tool when reproducibility, fine-tuning, or production quality matters. Ollama-style 4-bit quantization gives you a smaller, worse version of the model. BigSmall gives you a smaller version of the same model.
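You can check the bit-for-bit claim on any checkpoint yourself using the documented compress/decompress API; the snippet below is only a minimal sketch (the built-in bigsmall.verify covers the same ground against the stored per-tensor md5s):

import hashlib

import bigsmall
from safetensors.numpy import load_file

bigsmall.compress("model.safetensors", "model.bs")

original = load_file("model.safetensors")      # dict[str, np.ndarray]
restored = bigsmall.decompress("model.bs")     # dict[str, np.ndarray]

for name, tensor in original.items():
    same = hashlib.md5(tensor.tobytes()).hexdigest() == hashlib.md5(restored[name].tobytes()).hexdigest()
    assert same, f"mismatch in {name}"
print("all tensors bit-exact")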

Install

pip install bigsmall

Optional extras: pip install "bigsmall[torch]" (likewise [hf], [diffusion], [vllm], or [all]).

Requirements: Python 3.9+, PyTorch 2.0+, safetensors, numpy, zstandard, constriction.

Quick start

import bigsmall

# 1. Compress a safetensors model
bigsmall.compress("model.safetensors", "model.bs")

# 2. Load it back as a torch state_dict (one line, works on local path or HF repo)
sd = bigsmall.from_pretrained("./model.bs")
my_model.load_state_dict(sd, strict=False)

# 3. Stream layer-by-layer for models bigger than RAM
with bigsmall.StreamingLoader("model.bs", device="cuda") as L:
    non_layer = L.load_non_layer_tensors()
    for i, layer in L.iter_layers():
        ...  # one layer's tensors in memory at a time

Benchmarks

All results are bit-exact md5-verified lossless. Source safetensors → .bs.

Model | Format | Source | Compressed | Ratio (compressed / source)
GPT-2 117M | FP32 | 548 MB | 414 MB | 75.53%
GPT-2 117M | BF16 | 274 MB | 165 MB | 60.11%
GPT-2 117M | FP16 | 274 MB | 215 MB | 78.50%
Mistral 7B Instruct v0.3 (shard) | BF16 | 4.55 GB | 2.98 GB | 65.56%
Llama 3.1 8B (shard) | BF16 | 1.17 GB | 768 MB | 65.73%
Qwen 2.5 14B (shard) | BF16 | 1.70 GB | 1.12 GB | 65.75%
Stable Diffusion 1.5 VAE | FP32 | 335 MB | 278 MB | 83.20%
Stable Diffusion 1.5 UNet | FP16 | 1.72 GB | 1.48 GB | 85.92%

Delta compression (fine-tune vs base) on GPT-2 with a simulated fine-tune: 6.95% of source — fine-tunes ship as tiny diffs against the base.

Streaming peak RAM on GPT-2 117M is 29.6% lower than a full load. The win grows with depth: a 70B BF16 model normally needs 132 GB just to load, while with the streaming loader the peak is the non-layer tensors plus one layer, roughly a few GB.
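A back-of-the-envelope version of that arithmetic (the layer count and non-layer size below are illustrative assumptions for a Llama-70B-style architecture, not BigSmall measurements):

params = 70e9
bytes_per_param = 2                          # BF16
num_layers = 80                              # assumption: typical 70B decoder depth
per_layer_gb = params * bytes_per_param / num_layers / 1e9   # ~1.75 GB per layer
non_layer_gb = 2.0                           # assumption: embeddings + lm_head + norms
peak_gb = non_layer_gb + per_layer_gb        # only this much is resident at once
print(f"streaming peak ~{peak_gb:.1f} GB instead of the full checkpoint")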

CLI

# Compress / decompress
bigsmall compress model.safetensors                  # balanced (default)
bigsmall compress model.safetensors --storage        # max ratio, slow decode
bigsmall compress model.safetensors --inference      # fastest decode
bigsmall decompress model.bs -o /path/to/output.safetensors

# Info / verify / benchmark
bigsmall info model.bs
bigsmall verify model.bs
bigsmall benchmark model.safetensors

# Delta compression
bigsmall compress finetune.safetensors --base base.safetensors -o delta.bs
bigsmall decompress delta.bs --base base.safetensors -o reconstructed.safetensors

Python API

import bigsmall

# Standard compress / decompress
bigsmall.compress("model.safetensors", "model.bs", mode="balanced")
tensors = bigsmall.decompress("model.bs")           # dict[str, np.ndarray]
torch_tensors = bigsmall.load("model.bs", device="cuda")

# Inspect a .bs file
info = bigsmall.info("model.bs")
print(info["ratio_pct"], info["format"], info["tensor_count"])

# Verify
ok = bigsmall.verify("model.bs")

# Delta compression
bigsmall.compress_delta("ft.safetensors", "base.safetensors", "delta.bs")
tensors = bigsmall.decompress_delta("delta.bs", "base.safetensors")

# Hub integration
bigsmall.compress_for_hub("gpt2", output_dir="./gpt2_bs")
bigsmall.upload_to_hub("./gpt2_bs", "user/gpt2-bigsmall")
sd = bigsmall.from_pretrained("user/gpt2-bigsmall")

HuggingFace integration

from bigsmall.integrations.huggingface import from_pretrained, install_hook

# Drop-in loader that returns a transformers model
from transformers import AutoModelForCausalLM
model = from_pretrained("model.bs", model_class=AutoModelForCausalLM,
                        config_dir="/path/to/hf/model_dir")

# Or patch safetensors globally so any from_pretrained call understands .bs
install_hook()
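Once the hook is installed, the intent is that a plain transformers load works on a directory whose shards are .bs files. A hedged sketch (the directory path is illustrative, and behaviour with remote Hub repos is an assumption here):

from transformers import AutoModelForCausalLM
from bigsmall.integrations.huggingface import install_hook

install_hook()                                  # patches safetensors loading globally
model = AutoModelForCausalLM.from_pretrained("/path/to/model_dir_with_bs_shards")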

vLLM integration

from bigsmall.integrations.vllm import decompress_to_temp, get_loader_class

# Portable: decompress to temp dir, then point vLLM at it
out_dir = decompress_to_temp("model.bs", config_dir="/path/to/hf_dir")

# Or use the BigSmallModelLoader subclass directly (vLLM 0.4+)
LoaderClass = get_loader_class()
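With the portable route, the returned directory can be handed to vLLM like any local HF model directory; a minimal sketch assuming a standard vLLM install:

from vllm import LLM, SamplingParams

llm = LLM(model=out_dir)                               # out_dir from decompress_to_temp above
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)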

Diffusion model support

from bigsmall.integrations.diffusion import (
    compress_diffusion, decompress_diffusion, load_pipeline, is_diffusion_model
)

compress_diffusion("unet.safetensors", "unet.bs")
pipe = load_pipeline("unet.bs", config_dir="/path/to/diffusers_dir")
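Assuming load_pipeline returns a standard diffusers pipeline (the return type is not spelled out above), generation then looks like any other diffusers call:

pipe = pipe.to("cuda")
image = pipe("a watercolor fox in the snow", num_inference_steps=25).images[0]
image.save("fox.png")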

Container format

.bs files are self-describing:

Bytes | Field
0..3 | Magic "BGSM"
4..5 | Version (uint16, currently 1)
6..9 | Header JSON length (uint32)
10.. | Header JSON (UTF-8)
... | Concatenated compressed blobs

Header JSON encodes per-tensor name, shape, dtype, codec, special, compressed_bytes, offset, md5, and any codec-specific extras.
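A minimal sketch of reading that header in Python, assuming the layout above with little-endian integers (endianness and the exact JSON schema are not pinned down here):

import json
import struct

with open("model.bs", "rb") as f:
    assert f.read(4) == b"BGSM", "not a .bs file"
    version, = struct.unpack("<H", f.read(2))       # uint16, currently 1
    header_len, = struct.unpack("<I", f.read(4))    # uint32 length of the header JSON
    header = json.loads(f.read(header_len).decode("utf-8"))

print(version)                  # per-tensor records follow the schema described above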

Codecs

Format | Codec | Ratio
FP32 | per-tensor (sign, exp) AC + zstd byte-plane mantissa | 75-83%
BF16 | per-tensor (sign, exp) AC + per-exponent mantissa AC | 60-66%
FP16 | per-tensor (sign, exp) AC + per-exponent mantissa AC | 77-86%
FP8 | per-tensor categorical AC on byte stream | 71-72%
FP4 | per-tensor categorical AC on 4-bit indices | 30% (huge savings)

AC = arithmetic coding.
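The (sign, exp) / mantissa split those codec rows refer to is easy to picture in numpy; this is only an illustration of the field split, not BigSmall's encoder:

import numpy as np

w = np.random.randn(1024).astype(np.float32)
bits = w.view(np.uint32)

sign = bits >> 31                         # 1 bit, heavily skewed for weight tensors
exponent = (bits >> 23) & 0xFF            # 8 bits, narrow peaked distribution, cheap to entropy-code
mantissa = bits & 0x7FFFFF                # 23 bits, near-uniform low bytes

# Byte planes of the mantissa: the part the FP32 codec hands to zstd
plane_lo = (mantissa & 0xFF).astype(np.uint8)
plane_mid = ((mantissa >> 8) & 0xFF).astype(np.uint8)
plane_hi = ((mantissa >> 16) & 0x7F).astype(np.uint8)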

Special tensors (auto-detected, architecture-agnostic):

  • lowcard: tensors with ≤16 unique values (e.g. attention masks) → tiny lookup table
  • wpe_delta: 2D embeddings with high row-row correlation → delta + blosc2
  • tied: tensors with identical bytes (embed_tokens / lm_head) → stored once
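As an illustration of the lowcard case (a sketch of the detection step only, not BigSmall's internal code):

import numpy as np

def try_lowcard(tensor: np.ndarray, max_values: int = 16):
    """Return (lookup_table, indices) if the tensor has few unique values, else None."""
    values = np.unique(tensor)
    if values.size > max_values:
        return None
    # Each element becomes a 4-bit index into the tiny lookup table,
    # which entropy coding then shrinks further.
    indices = np.searchsorted(values, tensor.ravel()).astype(np.uint8)
    return values, indices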

Paper

Technical paper with full research records and floor proofs across all five float formats: coming soon (arXiv preprint in preparation).

License

Apache 2.0. See LICENSE.
