
BigSmall

Lossless neural network weight compression. One package, every float format, every model. Compress big models so they fit (or load faster) on small hardware.

Why BigSmall

BigSmall is lossless, not quantization. After decompression the weights are bit-for-bit identical to the original (md5-verified on every shard). You get the same inference outputs as the uncompressed model — no quality degradation, no fine-tune drift, no surprise accuracy regression on long-tail prompts.

Existing tools force a tradeoff: ZipNN (~83% ratio, FP32 only), DFloat11 (~68%, BF16 only, ~2× slower inference at batch=1), ZipServ (~70%, BF16 only, H100-style GPUs only). BigSmall covers FP32/BF16/FP16/FP8/FP4 in a single package, hits the proven mathematical floor on each format, and adds no inference overhead — you decompress once at load time and run at native speed.

It is the right tool when reproducibility, fine-tuning, or production quality matters. Ollama-style 4-bit quantization gives you a smaller, worse version of the model. BigSmall gives you a smaller version of the same model.
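You can check the bit-for-bit claim on any checkpoint yourself using the documented compress/decompress API; the snippet below is only a minimal sketch (the built-in bigsmall.verify covers the same ground against the stored per-tensor md5s):

import hashlib

import bigsmall
from safetensors.numpy import load_file

bigsmall.compress("model.safetensors", "model.bs")

original = load_file("model.safetensors")      # dict[str, np.ndarray]
restored = bigsmall.decompress("model.bs")     # dict[str, np.ndarray]

for name, tensor in original.items():
    same = hashlib.md5(tensor.tobytes()).hexdigest() == hashlib.md5(restored[name].tobytes()).hexdigest()
    assert same, f"mismatch in {name}"
print("all tensors bit-exact")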

Install

pip install bigsmall

Optional extras: pip install "bigsmall[torch]" (likewise [hf], [diffusion], [vllm], or [all]).

Requirements: Python 3.9+, PyTorch 2.0+, safetensors, numpy, zstandard, constriction.

Quick start

import bigsmall

# 1. Compress a safetensors model
bigsmall.compress("model.safetensors", "model.bs")

# 2. Load it back as a torch state_dict (one line, works on local path or HF repo)
sd = bigsmall.from_pretrained("./model.bs")
my_model.load_state_dict(sd, strict=False)

# 3. Stream layer-by-layer for models bigger than RAM
with bigsmall.StreamingLoader("model.bs", device="cuda") as L:
    non_layer = L.load_non_layer_tensors()
    for i, layer in L.iter_layers():
        ...  # one layer's tensors in memory at a time

Benchmarks

All results are bit-exact md5-verified lossless. Source safetensors → .bs.

Model | Format | Source | Compressed | Ratio (compressed / source)
GPT-2 117M | FP32 | 548 MB | 414 MB | 75.53%
GPT-2 117M | BF16 | 274 MB | 165 MB | 60.11%
GPT-2 117M | FP16 | 274 MB | 215 MB | 78.50%
Mistral 7B Instruct v0.3 (shard) | BF16 | 4.55 GB | 2.98 GB | 65.56%
Llama 3.1 8B (shard) | BF16 | 1.17 GB | 768 MB | 65.73%
Qwen 2.5 14B (shard) | BF16 | 1.70 GB | 1.12 GB | 65.75%
Stable Diffusion 1.5 VAE | FP32 | 335 MB | 278 MB | 83.20%
Stable Diffusion 1.5 UNet | FP16 | 1.72 GB | 1.48 GB | 85.92%

Delta compression (fine-tune vs base) on GPT-2 with a simulated fine-tune: 6.95% of source — fine-tunes ship as tiny diffs against the base.

Streaming peak RAM on GPT-2 117M is 29.6% lower than a full load. The win grows with depth: a 70B BF16 model normally needs 132 GB just to load, while with the streaming loader the peak is the non-layer tensors plus one layer, roughly a few GB.
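A back-of-the-envelope version of that arithmetic (the layer count and non-layer size below are illustrative assumptions for a Llama-70B-style architecture, not BigSmall measurements):

params = 70e9
bytes_per_param = 2                          # BF16
num_layers = 80                              # assumption: typical 70B decoder depth
per_layer_gb = params * bytes_per_param / num_layers / 1e9   # ~1.75 GB per layer
non_layer_gb = 2.0                           # assumption: embeddings + lm_head + norms
peak_gb = non_layer_gb + per_layer_gb        # only this much is resident at once
print(f"streaming peak ~{peak_gb:.1f} GB instead of the full checkpoint")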

CLI

# Compress / decompress
bigsmall compress model.safetensors                  # balanced (default)
bigsmall compress model.safetensors --storage        # max ratio, slow decode
bigsmall compress model.safetensors --inference      # fastest decode
bigsmall decompress model.bs -o /path/to/output.safetensors

# Info / verify / benchmark
bigsmall info model.bs
bigsmall verify model.bs
bigsmall benchmark model.safetensors

# Delta compression
bigsmall compress finetune.safetensors --base base.safetensors -o delta.bs
bigsmall decompress delta.bs --base base.safetensors -o reconstructed.safetensors

Python API

import bigsmall

# Standard compress / decompress
bigsmall.compress("model.safetensors", "model.bs", mode="balanced")
tensors = bigsmall.decompress("model.bs")           # dict[str, np.ndarray]
torch_tensors = bigsmall.load("model.bs", device="cuda")

# Inspect a .bs file
info = bigsmall.info("model.bs")
print(info["ratio_pct"], info["format"], info["tensor_count"])

# Verify
ok = bigsmall.verify("model.bs")

# Delta compression
bigsmall.compress_delta("ft.safetensors", "base.safetensors", "delta.bs")
tensors = bigsmall.decompress_delta("delta.bs", "base.safetensors")

# Hub integration
bigsmall.compress_for_hub("gpt2", output_dir="./gpt2_bs")
bigsmall.upload_to_hub("./gpt2_bs", "user/gpt2-bigsmall")
sd = bigsmall.from_pretrained("user/gpt2-bigsmall")

HuggingFace integration

from bigsmall.integrations.huggingface import from_pretrained, install_hook

# Drop-in loader that returns a transformers model
from transformers import AutoModelForCausalLM
model = from_pretrained("model.bs", model_class=AutoModelForCausalLM,
                        config_dir="/path/to/hf/model_dir")

# Or patch safetensors globally so any from_pretrained call understands .bs
install_hook()
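Once the hook is installed, the intent is that a plain transformers load works on a directory whose shards are .bs files. A hedged sketch (the directory path is illustrative, and behaviour with remote Hub repos is an assumption here):

from transformers import AutoModelForCausalLM
from bigsmall.integrations.huggingface import install_hook

install_hook()                                  # patches safetensors loading globally
model = AutoModelForCausalLM.from_pretrained("/path/to/model_dir_with_bs_shards")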

vLLM integration

from bigsmall.integrations.vllm import decompress_to_temp, get_loader_class

# Portable: decompress to temp dir, then point vLLM at it
out_dir = decompress_to_temp("model.bs", config_dir="/path/to/hf_dir")

# Or use the BigSmallModelLoader subclass directly (vLLM 0.4+)
LoaderClass = get_loader_class()
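With the portable route, the returned directory can be handed to vLLM like any local HF model directory; a minimal sketch assuming a standard vLLM install:

from vllm import LLM, SamplingParams

llm = LLM(model=out_dir)                               # out_dir from decompress_to_temp above
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)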

Diffusion model support

from bigsmall.integrations.diffusion import (
    compress_diffusion, decompress_diffusion, load_pipeline, is_diffusion_model
)

compress_diffusion("unet.safetensors", "unet.bs")
pipe = load_pipeline("unet.bs", config_dir="/path/to/diffusers_dir")
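Assuming load_pipeline returns a standard diffusers pipeline (the return type is not spelled out above), generation then looks like any other diffusers call:

pipe = pipe.to("cuda")
image = pipe("a watercolor fox in the snow", num_inference_steps=25).images[0]
image.save("fox.png")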

Container format

.bs files are self-describing:

Bytes | Field
0..3 | Magic "BGSM"
4..5 | Version (uint16, currently 1)
6..9 | Header JSON length (uint32)
10.. | Header JSON (UTF-8)
... | Concatenated compressed blobs

Header JSON encodes per-tensor name, shape, dtype, codec, special, compressed_bytes, offset, md5, and any codec-specific extras.
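A minimal sketch of reading that header in Python, assuming the layout above with little-endian integers (endianness and the exact JSON schema are not pinned down here):

import json
import struct

with open("model.bs", "rb") as f:
    assert f.read(4) == b"BGSM", "not a .bs file"
    version, = struct.unpack("<H", f.read(2))       # uint16, currently 1
    header_len, = struct.unpack("<I", f.read(4))    # uint32 length of the header JSON
    header = json.loads(f.read(header_len).decode("utf-8"))

print(version)                  # per-tensor records follow the schema described above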

Codecs

Format | Codec | Ratio
FP32 | per-tensor (sign, exp) AC + zstd byte-plane mantissa | 75-83%
BF16 | per-tensor (sign, exp) AC + per-exponent mantissa AC | 60-66%
FP16 | per-tensor (sign, exp) AC + per-exponent mantissa AC | 77-86%
FP8 | per-tensor categorical AC on byte stream | 71-72%
FP4 | per-tensor categorical AC on 4-bit indices | 30% (huge savings)

AC = arithmetic coding.
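The (sign, exp) / mantissa split those codec rows refer to is easy to picture in numpy; this is only an illustration of the field split, not BigSmall's encoder:

import numpy as np

w = np.random.randn(1024).astype(np.float32)
bits = w.view(np.uint32)

sign = bits >> 31                         # 1 bit, heavily skewed for weight tensors
exponent = (bits >> 23) & 0xFF            # 8 bits, narrow peaked distribution, cheap to entropy-code
mantissa = bits & 0x7FFFFF                # 23 bits, near-uniform low bytes

# Byte planes of the mantissa: the part the FP32 codec hands to zstd
plane_lo = (mantissa & 0xFF).astype(np.uint8)
plane_mid = ((mantissa >> 8) & 0xFF).astype(np.uint8)
plane_hi = ((mantissa >> 16) & 0x7F).astype(np.uint8)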

Special tensors (auto-detected, architecture-agnostic):

  • lowcard: tensors with ≤16 unique values (e.g. attention masks) → tiny lookup table
  • wpe_delta: 2D embeddings with high row-row correlation → delta + blosc2
  • tied: tensors with identical bytes (embed_tokens / lm_head) → stored once
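As an illustration of the lowcard case (a sketch of the detection step only, not BigSmall's internal code):

import numpy as np

def try_lowcard(tensor: np.ndarray, max_values: int = 16):
    """Return (lookup_table, indices) if the tensor has few unique values, else None."""
    values = np.unique(tensor)
    if values.size > max_values:
        return None
    # Each element becomes a 4-bit index into the tiny lookup table,
    # which entropy coding then shrinks further.
    indices = np.searchsorted(values, tensor.ravel()).astype(np.uint8)
    return values, indices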

Paper

Technical paper with full research records and floor proofs across all five float formats: coming soon (arXiv preprint in preparation).

License

Apache 2.0. See LICENSE.
