Lossless neural network weight compression - run any model, no compromises

BigSmall

Run any model. No compromises.

Mistral 7B is 14 GB. Your machine has 8 GB. Today your only option is quantization -- a degraded version of the model. BigSmall changes that.

BigSmall compresses model weights losslessly. Mistral 7B goes from 14 GB to 9 GB. The streaming loader means you never need 9 GB free at once -- it decompresses one layer at a time, directly into VRAM, with a peak RAM footprint of under 2 GB. You run the exact same model. Bit-for-bit identical weights. No quality loss. No accuracy regression. No surprises.

pip install bigsmall

import bigsmall

# Load a compressed model -- same as the original, smaller footprint
# ("model" is your already-instantiated architecture; see the HuggingFace section below)
state_dict = bigsmall.from_pretrained("wpferrell/mistral-7b-bigsmall")
model.load_state_dict(state_dict)

# Or stream it layer-by-layer -- runs models bigger than your RAM
with bigsmall.StreamingLoader("mistral.bs", device="cuda") as loader:
    for layer_idx, tensors in loader.iter_layers():
        # one layer in memory at a time, previous layer already freed
        pass

The problem with quantization

When a model doesn't fit, the standard answer is quantization. Drop to 4-bit. Use Ollama. It fits now.

But it's not the same model anymore. 4-bit quantization degrades every weight. The outputs are different. Fine-tuning on a quantized model introduces drift. Reproducibility goes out the window. For research, production, or anything where the answer actually matters -- quantization is a compromise you shouldn't have to make.

BigSmall is not quantization. After decompression, every weight is bit-for-bit identical to the original, MD5-verified on every tensor. You get the full model. Always.
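
If you want to check the round trip yourself, `bigsmall verify` (see the CLI section below) does it for you. The sketch below is a manual version: it assumes only the CLI commands documented later plus standard safetensors/PyTorch calls, and uses placeholder file names.

# Manual round-trip check (a sketch, not part of the bigsmall API):
#
#   bigsmall compress model.safetensors                      # writes a .bs file (name assumed)
#   bigsmall decompress model.bs -o roundtrip.safetensors
#
# then compare every tensor byte-for-byte against the original.
import torch
from safetensors.torch import load_file

original = load_file("model.safetensors")     # placeholder paths
roundtrip = load_file("roundtrip.safetensors")

assert original.keys() == roundtrip.keys()
for name in original:
    a = original[name].flatten().view(torch.uint8)   # compare raw bytes,
    b = roundtrip[name].flatten().view(torch.uint8)  # not float values
    assert torch.equal(a, b), f"mismatch in {name}"
print("all tensors bit-identical")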


What it does

                      Quantization (4-bit)              BigSmall
Lossless?             No -- weights degraded            Yes -- bit-identical
Mistral 7B size       ~4 GB                             9 GB
Peak RAM to load      ~4 GB                             < 2 GB (streaming)
Inference speed       Slower on some hardware           Native (decompress once)
Fine-tuning safe?     No -- drift from quantized base   Yes -- clean base
Reproducible?         No                                Yes

Benchmarks

All results are lossless -- MD5-verified, bit-identical reconstruction on every tensor.

Model                        Format   Original   Compressed   Ratio
Mistral 7B Instruct v0.3     BF16     14.2 GB    9.3 GB       65.6%
Llama 3.1 8B                 BF16     15.0 GB    9.9 GB       65.7%
Qwen 2.5 14B                 BF16     28.6 GB    18.8 GB      65.8%
Stable Diffusion 1.5 UNet    FP16     1.72 GB    1.48 GB      85.9%
Stable Diffusion 1.5 VAE     FP32     335 MB     278 MB       83.2%
GPT-2 117M                   FP32     548 MB     414 MB       75.5%
GPT-2 117M                   BF16     274 MB     165 MB       60.1%

Fine-tune delta compression: 6.95% of source size -- ship fine-tunes as tiny diffs, not full model copies.

Streaming peak RAM: 29.6% lower than full load on GPT-2. On a 70B model the difference is tens of gigabytes.
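
A rough way to reproduce that kind of measurement yourself (this is not the benchmark harness behind the numbers above): sample the process RSS while streaming. The snippet assumes psutil is installed, that iter_layers() yields tensors as in the examples below, and uses a placeholder file name.

import os
import psutil
import bigsmall

proc = psutil.Process(os.getpid())
peak_rss = 0

with bigsmall.StreamingLoader("gpt2.bs", device="cpu") as loader:
    loader.load_non_layer_tensors()
    for _, layer_tensors in loader.iter_layers():
        # sample resident memory once per decompressed layer
        peak_rss = max(peak_rss, proc.memory_info().rss)

print(f"peak RSS while streaming: {peak_rss / 2**20:.0f} MiB")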


Install

pip install bigsmall

Requirements: Python 3.9+, PyTorch 2.0+

Optional extras:

pip install "bigsmall[hf]"        # HuggingFace Hub integration
pip install "bigsmall[diffusion]" # Stable Diffusion support
pip install "bigsmall[vllm]"      # vLLM integration
pip install "bigsmall[all]"       # everything

HuggingFace integration

import bigsmall

# Compress any HuggingFace model
bigsmall.compress_for_hub("mistralai/Mistral-7B-Instruct-v0.3", output_dir="./mistral_bs")

# Upload to the Hub
bigsmall.upload_to_hub("./mistral_bs", "you/mistral-7b-bigsmall")

# Anyone can load it with one line
state_dict = bigsmall.from_pretrained("you/mistral-7b-bigsmall")
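
To go from that state_dict to a runnable model, one option is to build the architecture from its config and load the weights into it. A sketch, assuming the compressed checkpoint keeps the original Mistral tensor names; the transformers calls are standard, and the repo ids are the ones from the examples above.

import torch
import bigsmall
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model skeleton from its config -- no original weights downloaded
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
model = AutoModelForCausalLM.from_config(config).to(torch.bfloat16)

# Fill it with the decompressed, bit-identical weights
state_dict = bigsmall.from_pretrained("you/mistral-7b-bigsmall")
model.load_state_dict(state_dict)
model.eval()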

Streaming loader

The streaming loader lets you run models that don't fit in RAM or VRAM. It decompresses one transformer layer at a time, directly into the target device, and frees the previous layer before loading the next. Peak memory is embeddings + one layer -- typically under 2 GB even for 7B models.

with bigsmall.StreamingLoader("mistral.bs", device="cuda") as loader:
    print(f"{loader.layer_count()} layers")

    # Load embeddings and non-layer tensors upfront (small)
    base = loader.load_non_layer_tensors()

    # Stream layers one at a time
    for layer_idx, layer_tensors in loader.iter_layers():
        # Previous layer already freed from memory
        # layer_tensors is on device, ready to use
        pass
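
What you do with each streamed layer is up to you. As one concrete, hypothetical use, the sketch below re-shards a compressed checkpoint into one safetensors file per layer without ever holding more than one layer in RAM; it assumes iter_layers() yields a dict mapping tensor names to tensors, as the comments above suggest.

import os
import bigsmall
from safetensors.torch import save_file

os.makedirs("shards", exist_ok=True)

with bigsmall.StreamingLoader("mistral.bs", device="cpu") as loader:
    # small non-layer tensors (embeddings etc.) go into one shard
    save_file(loader.load_non_layer_tensors(), "shards/non_layer.safetensors")

    # one layer in memory at a time; each shard is written, then the layer is freed
    for layer_idx, layer_tensors in loader.iter_layers():
        save_file(layer_tensors, f"shards/layer_{layer_idx:03d}.safetensors")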

CLI

bigsmall compress model.safetensors                   # balanced (default)
bigsmall compress model.safetensors --storage         # maximum compression
bigsmall compress model.safetensors --inference       # fastest load
bigsmall decompress model.bs -o model.safetensors
bigsmall info model.bs
bigsmall verify model.bs

# Fine-tune delta
bigsmall compress finetune.safetensors --base base.safetensors -o delta.bs
bigsmall decompress delta.bs --base base.safetensors -o reconstructed.safetensors

Format support

Format   Ratio (of original size)   Notes
BF16     60-66%                     LLMs (Mistral, Llama, Qwen)
FP32     75-83%                     GPT-2, SD VAE, research models
FP16     77-86%                     SD UNet, half-precision models
FP8      71-72%                     Quantization-aware models
FP4      ~30%                       Extreme compression
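
As a back-of-envelope sizing rule: a 70B-parameter model stored in BF16 is roughly 140 GB on disk (2 bytes per parameter), so at the 60-66% BF16 ratio above you would expect roughly 84-92 GB compressed. Actual ratios depend on the weights themselves.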

vs. other tools

Tool       Formats        Ratio (of original size)   Lossless   Inference overhead
ZipNN      FP32 only      ~83%                       Yes        None
DFloat11   BF16 only      ~68%                       Yes        ~2x at batch=1
ZipServ    BF16 only      ~70%                       Yes        None (H100 only)
BigSmall   All formats    60-86%                     Yes        None

Paper

Full technical paper, including compression-floor proofs for all five float formats: coming soon (arXiv preprint in preparation).


License

Apache 2.0
