LLM weight compression via Gaussian Mixture Models + Asymmetric Numeral Systems (ANS) — 4.9× ratio, CPU-only

Gauss — GMM+ANS LLM Checkpoint Compression

4.90× smaller. CPU-only. Lossless for small tensors, bounded error for the rest. Full dtype round-trip.

Gauss compresses large language model weight tensors using a two-stage pipeline: Gaussian Mixture Model (GMM) distribution fitting followed by Asymmetric Numeral Systems (ANS) entropy coding. Small or 1-D tensors (e.g. LayerNorm scale/bias) are stored losslessly in a dedicated raw section — nothing is silently dropped.

1,645 MB safetensors  →  335 MB .gauss   (4.90×)   max error ±5×10⁻⁴

Apache 2.0 Python ≥ 3.9


Highlights

| Feature | Detail |
|---|---|
| Compression ratio | ~4.9× on typical SFT checkpoints (K=16) |
| Max reconstruction error | ±5×10⁻⁴ for fp32/fp16; ±3.9×10⁻³ for bf16 (adaptive-S) |
| Small / 1-D tensors | Stored losslessly in raw section (not skipped) |
| Dtype preservation | Original float32 / float16 / bfloat16 restored exactly |
| Parallel | Compression and decompression via multiprocessing.Pool |
| Random access | Decompress one tensor without reading the whole file |
| CPU-only | No GPU required |
| Dependencies | NumPy, PyTorch, safetensors, constriction |

Installation

pip install gauss-compress

Or from source:

git clone https://github.com/skystarry-ai/gauss
cd gauss
pip install -e .

Requirements: Python ≥ 3.9, and the four packages above (installed automatically).


Quick Start

# Compress a safetensors checkpoint
gauss compress model.safetensors model.gauss

# Restore it — original dtypes are recovered automatically
gauss decompress model.gauss restored.safetensors

# Inspect a compressed file (no decompression)
gauss info model.gauss

What Gets Compressed vs. Stored Raw

Gauss automatically routes each tensor to the best storage path:

| Condition | Storage path | Lossless? |
|---|---|---|
| ndim == 1 (e.g. bias, LayerNorm scale) | Raw section | ✅ Yes |
| numel < 32 768 (small tensors) | Raw section | ✅ Yes |
| Everything else | GMM+ANS compressed | ❌ Bounded ±5×10⁻⁴ |

You can adjust the raw threshold with --raw-threshold (default: 32 768).
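The routing rule is simple enough to sketch directly. This is a minimal illustration of the table above, assuming NumPy arrays; `storage_path` is a hypothetical helper, not part of the gauss API:

```python
import numpy as np

RAW_THRESHOLD = 32_768  # matches the --raw-threshold default

def storage_path(tensor: np.ndarray, raw_threshold: int = RAW_THRESHOLD) -> str:
    # 1-D tensors (bias, LayerNorm scale) and small tensors stay lossless
    if tensor.ndim == 1 or tensor.size < raw_threshold:
        return "raw"
    # everything else goes through the GMM+ANS pipeline (bounded error)
    return "gmm"

print(storage_path(np.zeros(768)))         # raw  (1-D)
print(storage_path(np.zeros((128, 128))))  # raw  (16384 < 32768)
print(storage_path(np.zeros((768, 768))))  # gmm
```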


CLI Reference

gauss compress

gauss compress INPUT OUTPUT [options]
| Option | Default | Description |
|---|---|---|
| --K | 16 | GMM components per tensor |
| --max-iter | 100 | Max hard-EM iterations |
| --workers | all cores | Parallel worker processes |
| --shard-threshold | 5 000 000 | Split 2-D+ tensors larger than this into row-shards |
| --shard-rows | 4 096 | Rows per shard |
| --raw-threshold | 32 768 | Store tensors smaller than this losslessly |
| --layers | (all) | Comma-separated layer names to compress |

Example output

Loading sft_model.safetensors ...
Layers: 170 compressed (148 sharded), 42 raw | tasks: 892 | K=16 | workers=8
  [   1/170] model.layers.0.mlp.down_proj.weight        5.412x  iter= 23 [8 shards]  3.2s
  ...
  [raw] model.layers.0.self_attn.q_proj.bias
  ...
Saved: sft_model.gauss  (335.8 MB)
Overall ratio: 4.903x  (1645MB → 335MB)  total 355s

gauss decompress

gauss decompress INPUT OUTPUT [--workers N]

Restores the original .safetensors file with all tensors in their original dtype (float32, float16, bfloat16, …).


gauss info

gauss info FILE
File:    model.gauss
Size:    335.84 MB
Layers:  212  (170 compressed, 42 raw)

  model.layers.0.mlp.gate_proj.weight            18432KB →  3913KB  4.709x  K=16  torch.float32
  model.layers.0.mlp.down_proj.weight            18432KB →  3413KB  5.400x  K=16  torch.float32 [×8 shards]
  ...

  model.layers.0.self_attn.q_proj.bias               3KB   [raw / lossless]  torch.float32
  model.norm.weight                                   3KB   [raw / lossless]  torch.float32

Python API

from gauss import GaussWriter, GaussReader, compress_file, decompress_file

# --- High-level API (recommended) ---
compress_file("model.safetensors", "model.gauss", K=16, n_workers=8)
decompress_file("model.gauss", "restored.safetensors")

# --- Low-level writer ---
writer = GaussWriter("model.gauss")

# Add a GMM-compressed layer
writer.add(key, shape, K, pi, mu, sigma, idx_compressed, resid_compressed,
           dtype_id=0)   # 0 = float32

# Add a raw (lossless) layer
writer.add_raw(key, shape, dtype_id=0, raw_bytes=tensor.numpy().tobytes())

writer.save()
print(writer.summary())

# --- Low-level reader ---
reader = GaussReader("model.gauss")
print(reader.keys())                        # all tensor names

arr = reader.decompress("model.layers.0.mlp.gate_proj.weight")  # np.ndarray
state_dict = reader.decompress_all()        # {key: np.ndarray}

File Format (GAUSS004)

The .gauss format stores metadata and data in separate sections, enabling random access to any individual tensor with a single seek:

magic[8]           "GAUSS004"
n_compressed[4]    uint32 — GMM-compressed layer count
n_raw[4]           uint32 — raw lossless layer count

=== Compressed index section ===
  per layer: name, shape, dtype_id, n_shards, shard_rows,
             per shard: K, pi, mu, sigma, resid_scale, idx_words, resid_words

=== Raw index section ===
  per layer: name, shape, dtype_id, raw_n_bytes

=== Compressed data section ===
  per shard: ANS index stream + ANS residual stream

=== Raw data section ===
  per layer: little-endian tensor bytes

resid_scale (uint16, new in GAUSS004) records the effective quantization scale used per shard. For fp32 and fp16 tensors this is always 1000; for bf16 it is automatically capped at 128 via adaptive-S (see How It Works).

Legacy GAUSS001–GAUSS003 files are read-compatible (dtype defaults to float32, resid_scale defaults to 1000).
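For illustration, the fixed part of the header can be parsed with a few lines of struct unpacking. This is a sketch, not the GaussReader implementation; little-endian uint32 counts are assumed from the layout above:

```python
import struct

def read_gauss_header(buf: bytes):
    # magic[8] + n_compressed[4] + n_raw[4], little-endian assumed
    magic = buf[:8]
    if magic != b"GAUSS004":
        raise ValueError(f"unsupported magic: {magic!r}")
    n_compressed, n_raw = struct.unpack_from("<II", buf, 8)
    return n_compressed, n_raw

# Synthetic header matching the example file: 170 compressed, 42 raw layers
hdr = b"GAUSS004" + struct.pack("<II", 170, 42)
print(read_gauss_header(hdr))  # (170, 42)
```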


Module Layout

gauss/
├── __init__.py     # Public API: GaussWriter, GaussReader, compress_file, decompress_file
├── gmm.py          # Pure-NumPy hard EM: fit_gmm_fast, assign_all
├── codec.py        # ANS wrappers (constriction): encode/decode indices & residuals;
│                   #   adaptive_resid_scale() for fp16/bf16 precision capping
├── format.py       # File I/O: GaussWriter, GaussReader (GAUSS001–004)
└── compress.py     # CLI entry point, multiprocessing workers

How It Works

For each weight tensor w:

  1. Route — tensors with ndim == 1 or numel < 32 768 go to the raw section (lossless). All others enter the GMM pipeline.

  2. GMM Fitting — fit a K-component mixture of Gaussians using hard EM (pure NumPy). Large tensors are subsampled to 200k elements for speed.

  3. Assignment — assign every weight to its most likely cluster.

  4. Residual Quantization — compute r = round((w − μ_cluster) × S), clipped to INT16 range. Reconstruction error is bounded by ±1/(2×S). For reduced-precision dtypes, adaptive-S caps the scale at the dtype's precision floor so sub-epsilon noise is never encoded into the bitstream:

     | dtype | S (effective) | Max error |
     |---|---|---|
     | float32 | 1000 | ±5×10⁻⁴ |
     | float16 | 1000 | ±5×10⁻⁴ |
     | bfloat16 | 128 | ±3.9×10⁻³ |
  5. ANS Encoding — encode the assignment sequence with a Categorical model and the residuals with per-element QuantizedGaussian models, approaching the Shannon entropy bound.

Decompression reverses this exactly: decode assignments → decode residuals → ŵ = μ_cluster + r / S, then cast back to the original dtype.
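The error bound in step 4 is easy to verify numerically. The sketch below quantizes against a single mean (a stand-in for the per-cluster means of the K=16 GMM) and checks that reconstruction error stays within ±1/(2·S):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=10_000).astype(np.float32)

# Single-cluster stand-in for the per-cluster means of the real GMM
mu = np.float32(w.mean())
S = 1000.0  # effective fp32/fp16 scale

r = np.clip(np.round((w - mu) * S), -32768, 32767).astype(np.int16)
w_hat = (mu + r.astype(np.float32) / S).astype(np.float32)

max_err = float(np.max(np.abs(w_hat - w)))
print(max_err <= 0.5 / S + 1e-6)  # True: bounded by ±1/(2·S) = ±5×10⁻⁴

# For bf16, adaptive-S caps the scale at S = 128, so the bound becomes
# 1/(2·128) = 0.00390625 — the ±3.9×10⁻³ figure quoted above.
```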


Benchmark

Tested on a 24-layer SFT checkpoint (768 hidden dim, 6144 FFN, GQA 4 KV heads, vocab 32k):

| Method | Type | Ratio | Lossless? |
|---|---|---|---|
| gzip (level 6) | byte-level | ~1.02× | ✅ Yes |
| zstd (level 3) | byte-level | ~1.03× | ✅ Yes |
| INT8 quantization | precision reduction | 4.00× | ❌ No |
| Gauss (K=16) | distribution | 4.90× | ❌ (±5×10⁻⁴) |
| INT4 quantization | precision reduction | 8.00× | ❌ No |
  • Compression: 355 s / Decompression: 95 s, measured on a Google Colab CPU (server-class Xeon). Consumer CPUs may be faster or slower depending on core count and single-core performance; use --workers to tune parallelism for your host.
  • Max per-element reconstruction error: 5.00×10⁻⁴ (tighter than INT8)
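The measured ratio implies a per-weight bit budget worth sanity-checking. This is an illustrative decomposition under the K=16 setting, not a measured breakdown:

```python
# Back-of-envelope bit budget implied by the measured 4.90x ratio on fp32
ratio = 4.90
bits_per_weight = 32 / ratio            # ≈ 6.53 bits per weight
index_bits_max = 4                      # log2(K) for K=16, before entropy coding
residual_bits = bits_per_weight - index_bits_max
print(round(bits_per_weight, 2), round(residual_bits, 2))  # 6.53 2.53
```

ANS typically spends fewer than log2(K) bits per index (the cluster distribution is not uniform), so in practice the residual stream gets a slightly larger share than this split suggests.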

Limitations

  • Lossy for large tensors — bounded error ±5×10⁻⁴; not suitable for bit-exact reproduction of large weight matrices.
  • No inter-layer compression — each tensor is processed independently.
  • ANS stack ordering — encoding must run in reverse order, imposing a single-pass sequential constraint within each tensor's ANS step.
  • Per-shard sequential decode — decompression is parallel across tensors, but the ANS decode within each shard is single-threaded.

Citation

@techreport{seok2026gauss,
  title       = {GAUSS: Distribution-Aware Compression for Neural Network Weights},
  author      = {Seok, Minju and {Claude Sonnet 4.6}},
  year        = {2026},
  month       = {4},
  institution = {Skystarry-AI},
  doi         = {10.5281/zenodo.19676854},
}

License

Apache 2.0 — see LICENSE.
