
Model weight compression and streaming decode library

Project description

helix-substrate

Python 3.10+ | PyTorch 2.0+ | 4x compression

Calibration-free VQ compression. Beats GPTQ quality at 7B (+6.3% vs +8.2% PPL) and AWQ at 14B by 15.4%. No training data. No fine-tuning. Works on transformers, SSMs, CNNs, vision encoders, and embedding models with the same command and no code changes.

pip install helix-substrate

from helix_substrate import CDNAv3Writer, CDNAv3Reader
import numpy as np

# Compress any 2D weight tensor (a random matrix stands in for a real weight here)
weight_matrix = np.random.randn(1024, 768).astype(np.float32)

writer = CDNAv3Writer("./compressed/")
writer.write_tensor(weight_matrix, "layer_name")

# Reconstruct
reader = CDNAv3Reader("./compressed/layer_name.cdnav3")
reconstructed = reader.reconstruct()  # cosine similarity >= 0.999

Model Zoo

Pre-compressed models on HuggingFace. One import, one line to load:

import helix_substrate.hf_quantizer  # registers cdna_v3
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-1.2b-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/zamba2-1.2b-helix")
Model | Architecture | Ratio | PPL Delta | Size | Link
Zamba2-1.2B | Hybrid (Mamba2+Transformer) | 4.0x | +2.90% | 1.35 GB | HF
Qwen2.5-Coder-3B | Transformer | 4.44x | +1.92% | 3.84 GB | HF
TinyLlama-1.1B | Transformer | 3.99x | +0.78% | 1.03 GB | HF
Mamba-130M | Pure SSM | 5.61x | +18.4% | 128 MB | HF

Three architectures, one codec. CDNA v3 compresses any nn.Linear — transformer attention, Mamba projections, hybrid layers. Same pip install, same API, same codebook format.

What it does

helix-substrate compresses neural network weights using scalar k-means vector quantization. Each weight value is assigned to the nearest entry in a learned 256-entry codebook. Outlier values (top 0.1% by magnitude) are preserved exactly in a sparse sidecar. The result is a codebook + uint8 indices + sidecar representation at ~4x compression with negligible quality loss.

No calibration data. No fine-tuning. No architecture-specific code.
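
To make the scheme concrete, here is a minimal NumPy/scikit-learn sketch of the idea described above (scalar codebook, uint8 indices, exact outlier sidecar). It is an illustration only, not the helix-substrate implementation; shapes and thresholds are placeholders.

import numpy as np
from sklearn.cluster import KMeans

W = np.random.randn(1024, 768).astype(np.float32)   # stand-in for a real weight matrix

# 1. Reserve the top 0.1% of values by magnitude for the exact sidecar.
thresh = np.quantile(np.abs(W), 0.999)
outliers = np.abs(W) >= thresh

# 2. Fit a 256-entry scalar codebook on the remaining values (no calibration data).
km = KMeans(n_clusters=256, n_init=1, max_iter=15, random_state=0)
km.fit(W[~outliers].reshape(-1, 1))
codebook = km.cluster_centers_.ravel().astype(np.float32)                # 256 floats
indices = km.predict(W.reshape(-1, 1)).astype(np.uint8).reshape(W.shape)

# 3. Reconstruct: codebook lookup for every weight, outliers patched back exactly.
W_hat = codebook[indices]
W_hat[outliers] = W[outliers]

cos = float((W * W_hat).sum() / (np.linalg.norm(W) * np.linalg.norm(W_hat)))
print(f"cosine similarity: {cos:.4f}")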

Benchmarks

Weight Compression Quality (RTX 4090, WikiText-2 PPL)

Model | Method | PPL | PPL Delta | Calibration
Qwen2.5-7B | FP16 Dense | 6.949 | baseline | --
Qwen2.5-7B | HelixLinear k=256 | 7.388 | +6.34% | None
Qwen2.5-7B | GPTQ Int4 | 7.518 | +8.2% | 128 sequences
Qwen2.5-7B | AWQ Int4 | 7.719 | +11.1% | Activation stats
Qwen2.5-14B | HelixLinear k=256 | 3.78 | -- | None
Qwen2.5-14B | AWQ Int4 | 4.47 | -- | Activation stats

At 7B, Helix shows 23% less PPL degradation than GPTQ; at 14B, it beats AWQ by 15.4%. All with zero calibration data.

The remaining +6.34% PPL delta comes primarily from early down_proj layers (layers 3-4) at 0.964 cosine. These are the highest-kurtosis FFN tensors in the model. Rank-32 SVD on those specific layers is expected to push this below +4%.

Quality vs ratio tradeoff: GPTQ/AWQ achieve 8x compression with worse quality. helix-substrate achieves 4x with the best quality of any post-training method tested, and requires zero calibration data. VQ degrades more gracefully than INT4 at scale — the quality gap widens as model size increases.

Architecture Coverage (all k=256, same compress.py)

Model | Architecture | Tensors | Ratio | Cosine (min)
TinyLlama 1.1B | Transformer (LLaMA) | 154 | 3.98x | 0.9946
Qwen2.5 1.5B | Transformer (Qwen) | 196 | 4.00x | 0.9943
Qwen2.5 7B | Transformer (Qwen) | 196 | 4.00x | 0.9955
Qwen2.5 14B | Transformer (Qwen) | 336 | 4.00x | --
Mamba-130m | SSM (Mamba) | 97 | 3.92x | 0.9990+
Mamba-2 1.3B | SSM (Mamba-2) | 98 | 3.99x | 0.9990+
MiniLM-L6 | Transformer (BERT) | 73 | 3.94x | 0.9997
CLIP ViT-B/32 | Vision Transformer | 49 | 3.98x | 0.9997
Zamba2-1.2B | Hybrid (Mamba2+Transformer) | 136 | 4.00x | 0.9973
ResNet-18 | CNN | 1 (fc) | 3.97x | 0.9998

All compressed with the same command. No architecture-specific flags or code paths.

Compression Quality Frontier (TinyLlama, PPL on WikiText-2)

Config | Ratio | PPL Delta | Status
k=256 + sidecar | 4.0x | +0.11% | Production baseline
k=64 + sidecar | 5.3x | +1.44% | Model-dependent (fails on Qwen at +2.78%)
k=32 + sidecar | 6.4x | +2.61% | Below quality threshold
k=16 + sidecar | 8.0x | +9.34% | Rejected

Dead ends tested for 8x (all falsified with receipts):

  • Group VQ k=16 (per-column codebooks): cos=0.991 vs 0.999 global
  • Off-the-shelf ResidualVQ (Lucidrain, full-row vector quantization): cos=0.26
  • SVD residual correction: hurts at 7B
  • Channel scaling / calibration: zero net benefit

Sub-vector product quantization (AQLM/VPTQ) could reach 8x but requires architecture-aware calibration, destroying the universality advantage. See receipts/group_vq/, receipts/rvq_benchmark/, receipts/scaling_analysis/.

Key findings

Outlier sidecar is non-negotiable. Without it, k=256 VQ produces PPL 274. With it, PPL 6.18. Cosine similarity is identical (0.999) in both cases. The top 0.1% of weights by magnitude carry outsized importance despite being statistically invisible. This means cosine alone is not a safe quality metric -- outlier preservation is mandatory.

SVD residual correction was tested and rejected. On TinyLlama (1.1B), kurtosis-routed SVD gave marginal per-tensor cosine improvement. Crossover test at 1.5B, 3B, 7B: SVD adds zero value at 1.5B/3B and actively hurts at 7B (+4% PPL). Plain k-means VQ-256 is optimal at all scales tested. The simplest approach wins.

Kurtosis routing beats Hessian routing on TinyLlama but the routing target (SVD) is dead at scale. The finding stands: calibration-free signals (kurtosis from weights alone) outperform calibration-dependent signals (Hessian). But the routing itself is disabled — plain VQ for everything.

Weighted k-means is harmful. Hessian-weighted centroid placement gives +2.93% PPL -- actively worse than unweighted. Distorting the codebook toward "important" columns degrades it for the majority of weights.

Embedding tables must stay dense. VQ on embed_tokens and lm_head inflated 7B PPL from +6.34% to +11%. Two lines of code (exclude both from VQ) eliminated the entire quality gap vs GPTQ. Lesson: embedding tables have uniform importance across all rows — VQ's "representative centroid" assumption fails catastrophically when every entry is equally important.
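
As an illustration of that rule, here is one way the exclusion could be expressed when walking a model's modules. The actual flag or code path inside tools/compress.py is not shown in this README, so the names below are hypothetical:

import torch.nn as nn

SKIP_SUBSTRINGS = ("embed_tokens", "lm_head")   # leave these dense (FP16/FP32)

def vq_targets(model: nn.Module):
    """Yield (name, module) pairs that are safe to VQ-compress."""
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if any(s in name for s in SKIP_SUBSTRINGS):
            continue   # every row of an embedding table matters equally
        yield name, module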

Quick start

Compress a model

python tools/compress.py \
  --model-dir /path/to/model \
  --out-dir /path/to/output \
  --k 256 --sidecar

Load compressed weights for inference

From HuggingFace (recommended):

import helix_substrate.hf_quantizer  # register quantizer
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-1.2b-helix")
output = model.generate(input_ids, max_new_tokens=128)

From local CDNA v3 directory:

from transformers import AutoModelForCausalLM
from helix_substrate.helix_linear import swap_to_helix

model = AutoModelForCausalLM.from_pretrained("path/to/model")
swap_to_helix(model, "path/to/cdnav3/")
# All nn.Linear modules replaced with HelixLinear
# Forward pass works normally -- codebook[indices] -> matmul
output = model.generate(input_ids, max_new_tokens=128)
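
For intuition, the forward pass described in the comments above amounts to a codebook gather followed by an ordinary matmul. A simplified toy module (ignoring the outlier sidecar, dtype, and device handling that the shipped HelixLinear performs):

import torch
import torch.nn as nn

class ToyHelixLinear(nn.Module):
    def __init__(self, codebook: torch.Tensor, indices: torch.Tensor):
        super().__init__()
        self.register_buffer("codebook", codebook.float())   # (256,)
        self.register_buffer("indices", indices)             # (out_features, in_features), uint8

    def forward(self, x):
        w = self.codebook[self.indices.long()]   # decode the weight matrix from the codebook
        return nn.functional.linear(x, w)        # then a normal matmul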

Compress any tensor

from helix_substrate import CDNAv3Writer, CDNAv3Reader
from helix_substrate.tensor_policy import TensorPolicy, TensorClass
import numpy as np

tensor = np.random.randn(1024, 768).astype(np.float32)

policy = TensorPolicy(
    tensor_class=TensorClass.UNKNOWN,
    storage_mode="codebook+sidecar",
    n_clusters=256,
    use_kmeans=True,
    sidecar_enabled=True,
    percentile=99.9,
    max_corrections=512,
)

writer = CDNAv3Writer("./output/")
stats = writer.write_tensor(tensor, "my_tensor", policy=policy)

reader = CDNAv3Reader("./output/my_tensor.cdnav3")
reconstructed = reader.reconstruct()
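
A quick way to sanity-check the round trip from the snippet above (flattened cosine similarity over the whole tensor):

import numpy as np

cos = float((tensor * reconstructed).sum()
            / (np.linalg.norm(tensor) * np.linalg.norm(reconstructed)))
assert cos >= 0.999, f"unexpectedly low cosine: {cos:.4f}"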

10-Domain Tensor Infrastructure Proofs

The same codec handles any 2D float32 tensor. No modifications needed across domains.

Domain | Data Source | Key Metric | Verdict
Gradient compression | TinyLlama backward pass | SGD step cos=1.000 | PASS
Embedding tables | TinyLlama embed_tokens | Row cos min=0.983 | WEAK
Activation checkpointing | TinyLlama activations | cos min=0.996, 3.90x | PASS
Federated learning deltas | SGD weight deltas | Weight cos=1.000, 4.0x | PASS
Neural codec weights | CLIP ViT + ResNet-18 | cos 0.9997+, 100% pred match | STRONG
RAG index | MiniLM embeddings | top-1 100%, top-5 4.9/5 | STRONG
LoRA adapters | PEFT LoRA matrices | All 88 matrices cos>=0.9997 | STRONG
MoE tiered compression | Simulated expert split | Fidelity tiers work | MIXED
Continual learning | Model snapshots | Full 4.0x, delta cos=1.0 | PASS
Sensor / scientific data | scRNA-seq + protein PDB | ARI 0.75-0.92 | MIXED

All receipts in receipts/tensor_infra/.

Companion projects

helix-online-kv

Online KV cache compression using the same VQ codec. Fits codebooks on the first 128 tokens, then VQ-assigns all subsequent KV entries in real time.

  • 2.81 ms/token encoding latency (gate: <5ms)
  • 1.9x more tokens fit in same VRAM
  • End-to-end with HelixLinear: +0.77% PPL at 1329 MB on Quadro T2000
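
A minimal sketch of the calibrate-then-stream pattern described above, assuming a scalar codebook like the weight codec; the real helix-online-kv API and per-layer details are not shown here:

import numpy as np
from sklearn.cluster import KMeans

def calibrate(kv_prefix: np.ndarray, k: int = 256) -> np.ndarray:
    """Fit a codebook on the first ~128 tokens' K/V values (shape: tokens x head_dim)."""
    km = KMeans(n_clusters=k, n_init=1).fit(kv_prefix.reshape(-1, 1))
    return km.cluster_centers_.ravel().astype(np.float32)

def stream_assign(kv_new: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Encode later K/V entries as uint8 indices into the frozen codebook."""
    idx = np.abs(kv_new.reshape(-1, 1) - codebook).argmin(axis=1)
    return idx.astype(np.uint8).reshape(kv_new.shape)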

Compressed-domain attention (CDC-03): product-quantization scores rank all cached tokens cheaply, the top 128 by approximate score are selected, and exact attention runs only on that subset. Proven at cosine 0.9997 on layers 1-21 with 12.5% token coverage. Projected 29x compute savings at 4K context, 900x at 128K.
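
A rough PyTorch sketch of that two-stage pattern. The cheap scoring stage here is a plain dot product on small stand-in representations; in CDC-03 it would come from product-quantization lookup tables:

import torch

def cdc03_style_attention(q, K, V, q_small, K_small, top_k=128):
    # q: (d,), K/V: (T, d); q_small/K_small are illustrative cheap representations
    approx = K_small @ q_small                          # (T,) approximate scores
    idx = approx.topk(min(top_k, K.shape[0])).indices   # keep the most promising tokens
    scores = (K[idx] @ q) / (K.shape[1] ** 0.5)         # exact scores on the subset only
    return torch.softmax(scores, dim=0) @ V[idx]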

echo_runtime

Unified inference runtime wiring HelixLinear + CompressedKVCache + CDC-03 attention into one forward pass. One config file, one command.

  • 155/155 weight modules compressed (3.98x)
  • 22-layer KV cache (layer 0 exact, 1-21 streaming VQ)
  • 21 CDC-03 hybrid attention layers
  • 1404 MB peak VRAM on Quadro T2000 (4 GB card)
  • 14.3 tok/s, coherent output

How it works

Input Tensor (2D float32)
        |
        v
+-------------------+
| K-Means VQ k=256  | <-- 256 centroids, 15 iterations, no calibration
+--------+----------+
         |
         v
+--------------------------+
| Outlier Sidecar          |
| Top 0.1% -> exact FP32   |
+-----------+--------------+
            |
            v
  codebook.npy (256 x 4B)
  indices.npy  (rows x cols x uint8)
  sidecar.npz  (sparse corrections)
  meta.json    (kurtosis, quality, config)

What's honest

We do not claim to have invented VQ for neural networks. VQ weight compression dates to the 1980s, with DNN applications since 2015. Our differentiators are: calibration-free operation, architecture-agnostic coverage including SSMs (no prior work compresses Mamba through the same pipeline as LLaMA), kurtosis-based statistical routing that outperforms Hessian-based approaches, and the intelligence layer (adaptive routing, symbolic governance, semantic memory indexing).

4x is the universal number. 5.3x is model-dependent. k=64 passes on TinyLlama (+1.44%) but fails on Qwen-1.5B (+2.78%). We do not claim universal 5.3x compression.

The GPU path does late materialization. The fused Triton kernel computes Y = X @ codebook[indices] directly from compressed VQ indices -- the full weight matrix W never hits global VRAM. Measured peak allocation is 0.4% of W size across all tensor shapes (attn, FFN gate, FFN down). This is not a roadmap item; it ships today. Receipt: receipts/late_materialization/late_materialization_20260326T131246.json.
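
The shipped kernel is written in Triton; the plain PyTorch loop below only illustrates why the full W never needs to exist at once: each block of rows is decoded from the codebook, used, and discarded. It is a sketch, not the fused kernel.

import torch

def vq_matmul_blockwise(x, indices, codebook, block=512):
    # x: (B, in_features), indices: (out_features, in_features) uint8, codebook: (256,)
    out = x.new_empty(x.shape[0], indices.shape[0])
    for start in range(0, indices.shape[0], block):
        w_block = codebook[indices[start:start + block].long()]   # decode a slice of W
        out[:, start:start + block] = x @ w_block.T               # use it, then let it go
    return out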

Speed comparison against GPTQ/AWQ is not yet fair. The decode speed gap reflects kernel maturity, not architecture. GPTQ/AWQ have years of optimization (Marlin, exllama2). Our Triton kernel is correct and memory-efficient but not yet throughput-optimized. Our advantage is on quality, universality, and the compressed runtime stack.

VQ is 4x, not 8x. GPTQ/AWQ achieve 8x compression (INT4). We achieve 4x (uint8 indices + codebook). Our compression ratio is lower, but our quality is better. The right comparison is quality at a given memory budget, not compression ratio alone.

Prior art and references

This work builds on and differentiates from:

  • Choi & El-Khamy (NIPS 2018) -- Universal DNN compression via lattice VQ. First "universal" VQ-for-DNN paper. We differ: calibration-free, no fine-tuning, architecture-agnostic including SSMs.
  • VQ4ALL (Dec 2024) -- Universal codebook shared across networks. We differ: per-tensor codebooks (better quality), purely post-training, no calibration.
  • AQLM (ICML 2024) -- Additive multi-codebook quantization for 2-bit LLM compression. We differ: calibration-free, architecture-agnostic, quality-first (4x not 16x).
  • GPTQ / AWQ -- INT4 with Hessian/activation-aware scaling. We differ: calibration-free, VQ (non-uniform), works on SSMs.
  • SpQR / SqueezeLLM -- Outlier-preserving mixed-precision. Our sidecar mechanism is related but calibration-free (magnitude percentile, not Hessian sensitivity).
  • KIVI / KVQuant -- KV cache quantization. Our helix-online-kv uses VQ codebooks with calibrate-then-stream, combined with weight compression from the same codec.

Project structure

helix-substrate/
+-- helix_substrate/
|   +-- cdnav3_writer.py       # Compress tensors to CDNA v3 format
|   +-- cdnav3_reader.py       # Reconstruct from CDNA v3
|   +-- tensor_policy.py       # Compression routing policy
|   +-- helix_linear.py        # Drop-in nn.Linear replacement
|   +-- hf_quantizer.py        # HuggingFace AutoModel integration
|   +-- generate_sidecars_v3.py # Outlier sidecar generation
|   +-- triton_vq_matmul.py    # Fused Triton kernel (late materialization)
+-- tools/
|   +-- compress.py            # Universal model compressor (one command)
|   +-- eval_ppl_cpu.py        # CPU perplexity evaluation
|   +-- cloud_ready_check.py   # Pre-cloud deployment validation
|   +-- scaling_analysis.py    # VQ scaling hypothesis analysis
|   +-- group_vq_test.py       # Group VQ falsification (k=16 dead end)
|   +-- rvq_benchmark.py       # RVQ falsification (Lucidrain dead end)
|   +-- tensor_infra/          # 10-domain proof suite
+-- receipts/                  # All experiment receipts (JSON, with cost blocks)
+-- tests/

License

Echo Labs LLC. See LICENSE for details.

Citation

If you use helix-substrate in research, please cite:

@software{helix_substrate,
  author = {Josh (voidstr3m33)},
  title = {helix-substrate: Calibration-free neural network compression},
  year = {2026},
  url = {https://github.com/echo313unfolding/helix-substrate}
}

Download files

Download the file for your platform.

Source Distribution

helix_substrate-0.2.3.tar.gz (353.7 kB)

Uploaded Source

Built Distribution


helix_substrate-0.2.3-py3-none-any.whl (293.1 kB)

Uploaded Python 3

File details

Details for the file helix_substrate-0.2.3.tar.gz.

File metadata

  • Download URL: helix_substrate-0.2.3.tar.gz
  • Upload date:
  • Size: 353.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for helix_substrate-0.2.3.tar.gz
Algorithm | Hash digest
SHA256 | 8a0ca70f3a5ffc23ba3bb296b0054b0f7633ac86886ec3ec82b901517b237a53
MD5 | 401f97244bfc44f574eb059191aed17a
BLAKE2b-256 | f61eefe09f0c1123f5d1bcb14b66cd0cf04f3fe0b0cf5bd73fde033edeaa6a87


File details

Details for the file helix_substrate-0.2.3-py3-none-any.whl.

File metadata

  • Download URL: helix_substrate-0.2.3-py3-none-any.whl
  • Upload date:
  • Size: 293.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for helix_substrate-0.2.3-py3-none-any.whl
Algorithm | Hash digest
SHA256 | 982e92b7a99958c87a62bf0f3bc1ae35b43f12b716f005acf41c63a08a937d0a
MD5 | de0995839b8e6c52aeeaaa32b2fd9598
BLAKE2b-256 | 655db9ab19e6843caf81457201295f1cadf119571be40741c1cf5d5ca089a41f

