helix-substrate
Calibration-free neural network compression. Beats GPTQ quality at 7B (+6.3% vs +8.2% PPL) and AWQ at 14B by 15.4%. No training data needed. Works on transformers, SSMs, CNNs, vision encoders, and embedding models without code changes.
```
pip install helix-substrate
```
```python
from helix_substrate import CDNAv3Writer, CDNAv3Reader

# Compress any 2D weight tensor
writer = CDNAv3Writer("./compressed/")
writer.write_tensor(weight_matrix, "layer_name")

# Reconstruct
reader = CDNAv3Reader("./compressed/layer_name.cdnav3")
reconstructed = reader.reconstruct()  # cosine similarity >= 0.999
```
Model Zoo
Pre-compressed models on HuggingFace. One import, one line to load:
```python
import helix_substrate.hf_quantizer  # registers cdna_v3
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-1.2b-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/zamba2-1.2b-helix")
```
| Model | Architecture | Ratio | PPL Delta | Size | Link |
|---|---|---|---|---|---|
| Zamba2-1.2B | Hybrid (Mamba2+Transformer) | 4.0x | +2.90% | 1.35 GB | HF |
| Qwen2.5-Coder-3B | Transformer | 4.44x | +1.92% | 3.84 GB | HF |
| TinyLlama-1.1B | Transformer | 3.99x | +0.78% | 1.03 GB | HF |
| Mamba-130M | Pure SSM | 5.61x | +18.4% | 128 MB | HF |
Three architectures, one codec. CDNA v3 compresses any nn.Linear — transformer attention, Mamba projections, hybrid layers. Same pip install, same API, same codebook format.
What it does
helix-substrate compresses neural network weights using scalar k-means vector quantization. Each weight value is assigned to the nearest entry in a learned 256-entry codebook. Outlier values (top 0.1% by magnitude) are preserved exactly in a sparse sidecar. The result is a codebook + uint8 indices + sidecar representation at ~4x compression with negligible quality loss.
No calibration data. No fine-tuning. No architecture-specific code.
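The scheme can be sketched in a few lines of NumPy. This is a toy illustration, not the library's implementation: the function names, Lloyd-iteration details, and parameter choices below are invented for clarity.

```python
import numpy as np

def compress(w, k=256, outlier_pct=99.9, iters=15):
    """Toy scalar k-means VQ with an exact-value outlier sidecar."""
    flat = w.ravel()
    # Outliers: top fraction by magnitude, preserved exactly in a sparse sidecar.
    cutoff = np.percentile(np.abs(flat), outlier_pct)
    out_idx = np.nonzero(np.abs(flat) > cutoff)[0]
    sidecar = (out_idx, flat[out_idx].copy())
    # 1-D k-means over all values (quantile init, then Lloyd iterations).
    centroids = np.quantile(flat, np.linspace(0, 1, k))
    for _ in range(iters):
        assign = np.abs(flat[:, None] - centroids[None, :]).argmin(1)
        for c in range(k):
            mask = assign == c
            if mask.any():
                centroids[c] = flat[mask].mean()
    assign = np.abs(flat[:, None] - centroids[None, :]).argmin(1).astype(np.uint8)
    return centroids.astype(np.float32), assign.reshape(w.shape), sidecar

def reconstruct(codebook, indices, sidecar):
    recon = codebook[indices].ravel()
    idx, vals = sidecar
    recon[idx] = vals  # patch the outliers back in exactly
    return recon.reshape(indices.shape)
```

With a uint8 index replacing each float32 weight, the payload shrinks ~4x; the 256-entry codebook adds only about 1 KB per tensor, and the sidecar holds ~0.1% of values.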
Benchmarks
Weight Compression Quality (RTX 4090, WikiText-2 PPL)
| Model | Method | PPL | PPL Delta | Calibration |
|---|---|---|---|---|
| Qwen2.5-7B | FP16 Dense | 6.949 | baseline | -- |
| Qwen2.5-7B | HelixLinear k=256 | 7.388 | +6.34% | None |
| Qwen2.5-7B | GPTQ Int4 | 7.518 | +8.2% | 128 sequences |
| Qwen2.5-7B | AWQ Int4 | 7.719 | +11.1% | Activation stats |
| Qwen2.5-14B | HelixLinear k=256 | 3.78 | -- | None |
| Qwen2.5-14B | AWQ Int4 | 4.47 | -- | Activation stats |
At 7B, Helix shows 23% less PPL degradation than GPTQ; at 14B, it beats AWQ by 15.4%. Zero calibration data in both cases.
The remaining +6.34% PPL delta comes primarily from early down_proj layers (layers 3-4) at 0.964 cosine. These are the highest-kurtosis FFN tensors in the model. Rank-32 SVD on those specific layers is expected to push this below +4%.
Quality vs ratio tradeoff: GPTQ/AWQ achieve 8x compression with worse quality. helix-substrate achieves 4x with the best quality of any post-training method tested, and requires zero calibration data. VQ degrades more gracefully than INT4 at scale — the quality gap widens as model size increases.
Architecture Coverage (all k=256, same compress.py)
| Model | Architecture | Tensors | Ratio | Cosine (min) |
|---|---|---|---|---|
| TinyLlama 1.1B | Transformer (LLaMA) | 154 | 3.98x | 0.9946 |
| Qwen2.5 1.5B | Transformer (Qwen) | 196 | 4.00x | 0.9943 |
| Qwen2.5 7B | Transformer (Qwen) | 196 | 4.00x | 0.9955 |
| Qwen2.5 14B | Transformer (Qwen) | 336 | 4.00x | -- |
| Mamba-130m | SSM (Mamba) | 97 | 3.92x | 0.9990+ |
| Mamba-2 1.3B | SSM (Mamba-2) | 98 | 3.99x | 0.9990+ |
| MiniLM-L6 | Transformer (BERT) | 73 | 3.94x | 0.9997 |
| CLIP ViT-B/32 | Vision Transformer | 49 | 3.98x | 0.9997 |
| Zamba2-1.2B | Hybrid (Mamba2+Transformer) | 136 | 4.00x | 0.9973 |
| ResNet-18 | CNN | 1 (fc) | 3.97x | 0.9998 |
All compressed with the same command. No architecture-specific flags or code paths.
Compression Quality Frontier (TinyLlama, PPL on WikiText-2)
| Config | Ratio | PPL Delta | Status |
|---|---|---|---|
| k=256 + sidecar | 4.0x | +0.11% | Production baseline |
| k=64 + sidecar | 5.3x | +1.44% | Model-dependent (fails on Qwen at +2.78%) |
| k=32 + sidecar | 6.4x | +2.61% | Below quality threshold |
| k=16 + sidecar | 8.0x | +9.34% | Rejected |
Dead ends tested for 8x (all falsified with receipts):

- Group VQ k=16 (per-column codebooks): cos=0.991 vs 0.999 global
- Off-the-shelf ResidualVQ (lucidrains, full-row vector quantization): cos=0.26
- SVD residual correction: hurts at 7B
- Channel scaling / calibration: zero net benefit

Sub-vector product quantization (AQLM/VPTQ) could reach 8x, but it requires architecture-aware calibration, destroying the universality advantage. See receipts/group_vq/, receipts/rvq_benchmark/, receipts/scaling_analysis/.
Key findings
Outlier sidecar is non-negotiable. Without it, k=256 VQ produces PPL 274. With it, PPL 6.18. Cosine similarity is identical (0.999) in both cases. The top 0.1% of weights by magnitude carry outsized importance despite being statistically invisible. This means cosine alone is not a safe quality metric -- outlier preservation is mandatory.
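A small NumPy illustration of why cosine is blind to outliers. Everything here is synthetic: a coarse 16-level codebook stands in for VQ-256, planted 6-sigma values stand in for heavy-tailed FFN weights, and the sidecar simply restores the out-of-range tail exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000).astype(np.float32)
w[:100] = 6.0 * np.sign(rng.standard_normal(100))  # plant ~0.1% outliers at 6 sigma

# Coarse uniform codebook covering only the bulk of the distribution.
codebook = np.linspace(-3.0, 3.0, 16, dtype=np.float32)
idx = np.abs(w[:, None] - codebook[None, :]).argmin(1)
recon = codebook[idx]

with_sidecar = recon.copy()
tail = np.abs(w) > 3.0        # the out-of-range tail, a fraction of a percent
with_sidecar[tail] = w[tail]  # sidecar restores those values exactly

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(w, recon), cos(w, with_sidecar))   # nearly identical cosines
print(np.abs(w - recon).max(), np.abs(w - with_sidecar).max())  # 3.0 vs ~0.2
```

The two cosines differ by well under 1%, yet the version without the sidecar carries a worst-case per-weight error 15x larger, which is exactly the kind of damage PPL sees and cosine does not.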
SVD residual correction was tested and rejected. On TinyLlama (1.1B), kurtosis-routed SVD gave marginal per-tensor cosine improvement. Crossover test at 1.5B, 3B, 7B: SVD adds zero value at 1.5B/3B and actively hurts at 7B (+4% PPL). Plain k-means VQ-256 is optimal at all scales tested. The simplest approach wins.
Kurtosis routing beats Hessian routing on TinyLlama but the routing target (SVD) is dead at scale. The finding stands: calibration-free signals (kurtosis from weights alone) outperform calibration-dependent signals (Hessian). But the routing itself is disabled — plain VQ for everything.
Weighted k-means is harmful. Hessian-weighted centroid placement gives +2.93% PPL -- actively worse than unweighted. Distorting the codebook toward "important" columns degrades it for the majority of weights.
Embedding tables must stay dense. VQ on embed_tokens and lm_head inflated 7B PPL from +6.34% to +11%. Two lines of code (exclude both from VQ) eliminated the entire quality gap vs GPTQ. Lesson: embedding tables have uniform importance across all rows — VQ's "representative centroid" assumption fails catastrophically when every entry is equally important.
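The fix amounts to a name-based filter when deciding which tensors to VQ. The helper below is hypothetical (the actual compress.py may spell it differently):

```python
# Tensors kept dense: embedding tables and the output head.
DENSE_KEEP = ("embed_tokens", "lm_head")

def should_vq(tensor_name: str) -> bool:
    """VQ every 2D weight except embedding tables and the output head."""
    return not any(part in tensor_name for part in DENSE_KEEP)
```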
Quick start
Compress a model
```bash
python tools/compress.py \
    --model-dir /path/to/model \
    --out-dir /path/to/output \
    --k 256 --sidecar
```
Load compressed weights for inference
From HuggingFace (recommended):
```python
import helix_substrate.hf_quantizer  # register quantizer
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-1.2b-helix")
output = model.generate(input_ids, max_new_tokens=128)
```
From local CDNA v3 directory:
```python
from transformers import AutoModelForCausalLM
from helix_substrate.helix_linear import swap_to_helix

model = AutoModelForCausalLM.from_pretrained("path/to/model")
swap_to_helix(model, "path/to/cdnav3/")
# All nn.Linear modules replaced with HelixLinear
# Forward pass works normally -- codebook[indices] -> matmul
output = model.generate(input_ids, max_new_tokens=128)
```
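The decode-then-matmul forward described in the comments can be sketched in NumPy. This is a reference eager path with illustrative shapes, ignoring the outlier sidecar:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal(256).astype(np.float32)           # 256 scalar centroids
indices = rng.integers(0, 256, size=(512, 768), dtype=np.uint8)  # compressed weight
x = rng.standard_normal((4, 768)).astype(np.float32)             # activations

w = codebook[indices]  # decode: gather centroids to rebuild the weight matrix
y = x @ w.T            # then an ordinary matmul, shape (4, 512)
```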
Compress any tensor
```python
from helix_substrate import CDNAv3Writer, CDNAv3Reader
from helix_substrate.tensor_policy import TensorPolicy, TensorClass
import numpy as np

tensor = np.random.randn(1024, 768).astype(np.float32)

policy = TensorPolicy(
    tensor_class=TensorClass.UNKNOWN,
    storage_mode="codebook+sidecar",
    n_clusters=256,
    use_kmeans=True,
    sidecar_enabled=True,
    percentile=99.9,
    max_corrections=512,
)

writer = CDNAv3Writer("./output/")
stats = writer.write_tensor(tensor, "my_tensor", policy=policy)

reader = CDNAv3Reader("./output/my_tensor.cdnav3")
reconstructed = reader.reconstruct()
```
10-Domain Tensor Infrastructure Proofs
The same codec handles any 2D float32 tensor. No modifications needed across domains.
| Domain | Data Source | Key Metric | Verdict |
|---|---|---|---|
| Gradient compression | TinyLlama backward pass | SGD step cos=1.000 | PASS |
| Embedding tables | TinyLlama embed_tokens | Row cos min=0.983 | WEAK |
| Activation checkpointing | TinyLlama activations | cos min=0.996, 3.90x | PASS |
| Federated learning deltas | SGD weight deltas | Weight cos=1.000, 4.0x | PASS |
| Neural codec weights | CLIP ViT + ResNet-18 | cos 0.9997+, 100% pred match | STRONG |
| RAG index | MiniLM embeddings | top-1 100%, top-5 4.9/5 | STRONG |
| LoRA adapters | PEFT LoRA matrices | All 88 matrices cos>=0.9997 | STRONG |
| MoE tiered compression | Simulated expert split | Fidelity tiers work | MIXED |
| Continual learning | Model snapshots | Full 4.0x, delta cos=1.0 | PASS |
| Sensor / scientific data | scRNA-seq + protein PDB | ARI 0.75-0.92 | MIXED |
All receipts in receipts/tensor_infra/.
Companion projects
helix-online-kv
Online KV cache compression using the same VQ codec. Fits codebooks on the first 128 tokens, then VQ-assigns all subsequent KV entries in real time.
- 2.81 ms/token encoding latency (gate: <5ms)
- 1.9x more tokens fit in same VRAM
- End-to-end with HelixLinear: +0.77% PPL at 1329 MB on Quadro T2000
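The calibrate-then-stream idea can be sketched in NumPy. This is illustrative only; the real codec's codebook size, fitting procedure, and per-layer handling may differ.

```python
import numpy as np

def fit_codebook(calib, k=64, iters=10):
    """Fit a scalar k-means codebook on a small calibration slice of KV values."""
    flat = calib.ravel()
    cb = np.quantile(flat, np.linspace(0, 1, k))  # spread initial centroids
    for _ in range(iters):
        assign = np.abs(flat[:, None] - cb[None, :]).argmin(1)
        for c in range(k):
            mask = assign == c
            if mask.any():
                cb[c] = flat[mask].mean()
    return cb.astype(np.float32)

rng = np.random.default_rng(0)
kv_prefix = rng.standard_normal((128, 64)).astype(np.float32)  # first 128 tokens' KV
cb = fit_codebook(kv_prefix)

# Streaming phase: each new token's KV entry is a cheap nearest-centroid assign,
# with no further codebook updates.
new_kv = rng.standard_normal(64).astype(np.float32)
idx = np.abs(new_kv[:, None] - cb[None, :]).argmin(1).astype(np.uint8)
approx = cb[idx]
```

The expensive k-means runs once on the prefix; every subsequent token pays only a nearest-centroid lookup, which is what keeps the per-token encoding latency in the low milliseconds.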
Compressed-domain attention (CDC-03): product-quantization scores rank all tokens cheaply; the top 128 by approximate score are selected, and exact attention runs only on that subset. Proven at cosine 0.9997 on layers 1-21 with 12.5% token coverage. Projected 29x compute savings at 4K context, 900x at 128K.
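A sketch of the hybrid scheme, with the product-quantization scorer replaced by a crude low-dimensional projection for brevity (all names and shapes here are illustrative, not CDC-03's actual code):

```python
import numpy as np

def hybrid_attention(q, keys, vals, top=4):
    """Rank all tokens with cheap approximate scores, then run exact
    softmax attention only over the top-scoring subset."""
    d = q.shape[-1]
    proj = np.eye(d, dtype=np.float32)[: d // 4]  # stand-in for PQ scoring
    approx = (keys @ proj.T) @ (proj @ q)         # approximate scores, shape (n,)
    keep = np.argsort(-approx)[:top]              # top tokens by approximate score
    s = keys[keep] @ q / np.sqrt(d)               # exact scores on the subset only
    w = np.exp(s - s.max())
    return (w / w.sum()) @ vals[keep]
```

With `top` equal to the full sequence length this reduces to exact attention; shrinking `top` trades a little fidelity for attention cost proportional to the subset size rather than the context length.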
echo_runtime
Unified inference runtime wiring HelixLinear + CompressedKVCache + CDC-03 attention into one forward pass. One config file, one command.
- 155/155 weight modules compressed (3.98x)
- 22-layer KV cache (layer 0 exact, 1-21 streaming VQ)
- 21 CDC-03 hybrid attention layers
- 1404 MB peak VRAM on Quadro T2000 (4 GB card)
- 14.3 tok/s, coherent output
How it works
```
Input Tensor (2D float32)
         |
         v
+-------------------+
| K-Means VQ k=256  |  <-- 256 centroids, 15 iterations, no calibration
+--------+----------+
         |
         v
+--------------------------+
|     Outlier Sidecar      |
|  Top 0.1% -> exact FP32  |
+-----------+--------------+
            |
            v
codebook.npy   (256 x 4B)
indices.npy    (rows x cols x uint8)
sidecar.npz    (sparse corrections)
meta.json      (kurtosis, quality, config)
```
What's honest
We do not claim to have invented VQ for neural networks. VQ weight compression dates to the 1980s, with DNN applications since 2015. Our differentiators are: calibration-free operation, architecture-agnostic coverage including SSMs (no prior work compresses Mamba through the same pipeline as LLaMA), kurtosis-based statistical routing that outperforms Hessian-based approaches, and the intelligence layer (adaptive routing, symbolic governance, semantic memory indexing).
4x is the universal number. 5.3x is model-dependent. k=64 passes on TinyLlama (+1.44%) but fails on Qwen-1.5B (+2.78%). We do not claim universal 5.3x compression.
The GPU path does late materialization. The fused Triton kernel computes Y = X @ codebook[indices] directly from compressed VQ indices -- the full weight matrix W never hits global VRAM. Measured peak allocation is 0.4% of W size across all tensor shapes (attn, FFN gate, FFN down). This is not a roadmap item; it ships today. Receipt: receipts/late_materialization/late_materialization_20260326T131246.json.
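A NumPy analogue of the idea, for intuition only: the real Triton kernel fuses the gather and the matmul per tile inside GPU registers, while this sketch merely shows why peak memory stays at tile size rather than full-W size.

```python
import numpy as np

def vq_matmul_tiled(x, codebook, indices, tile=128):
    """Late materialization: decode codebook[indices] one tile of output
    rows at a time, so the full weight matrix never exists at once."""
    n_out = indices.shape[0]
    y = np.empty((x.shape[0], n_out), dtype=np.float32)
    for r0 in range(0, n_out, tile):
        w_tile = codebook[indices[r0:r0 + tile]]  # decode only this tile of W
        y[:, r0:r0 + tile] = x @ w_tile.T
    return y
```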
Speed comparison against GPTQ/AWQ is not yet fair. The decode speed gap reflects kernel maturity, not architecture. GPTQ/AWQ have years of optimization (Marlin, exllama2). Our Triton kernel is correct and memory-efficient but not yet throughput-optimized. Our advantage is on quality, universality, and the compressed runtime stack.
VQ is 4x, not 8x. GPTQ/AWQ achieve 8x compression (INT4). We achieve 4x (uint8 indices + codebook). Our compression ratio is lower, but our quality is better. The right comparison is quality at a given memory budget, not compression ratio alone.
Prior art and references
This work builds on and differentiates from:
- Choi & El-Khamy (NIPS 2018) -- Universal DNN compression via lattice VQ. First "universal" VQ-for-DNN paper. We differ: calibration-free, no fine-tuning, architecture-agnostic including SSMs.
- VQ4ALL (Dec 2024) -- Universal codebook shared across networks. We differ: per-tensor codebooks (better quality), purely post-training, no calibration.
- AQLM (ICML 2024) -- Additive multi-codebook quantization for 2-bit LLM compression. We differ: calibration-free, architecture-agnostic, quality-first (4x not 16x).
- GPTQ / AWQ -- INT4 with Hessian/activation-aware scaling. We differ: calibration-free, VQ (non-uniform), works on SSMs.
- SpQR / SqueezeLLM -- Outlier-preserving mixed-precision. Our sidecar mechanism is related but calibration-free (magnitude percentile, not Hessian sensitivity).
- KIVI / KVQuant -- KV cache quantization. Our helix-online-kv uses VQ codebooks with calibrate-then-stream, combined with weight compression from the same codec.
Project structure
```
helix-substrate/
+-- helix_substrate/
|   +-- cdnav3_writer.py          # Compress tensors to CDNA v3 format
|   +-- cdnav3_reader.py          # Reconstruct from CDNA v3
|   +-- tensor_policy.py          # Compression routing policy
|   +-- helix_linear.py           # Drop-in nn.Linear replacement
|   +-- hf_quantizer.py           # HuggingFace AutoModel integration
|   +-- generate_sidecars_v3.py   # Outlier sidecar generation
|   +-- triton_vq_matmul.py       # Fused Triton kernel (late materialization)
+-- tools/
|   +-- compress.py               # Universal model compressor (one command)
|   +-- eval_ppl_cpu.py           # CPU perplexity evaluation
|   +-- cloud_ready_check.py      # Pre-cloud deployment validation
|   +-- scaling_analysis.py       # VQ scaling hypothesis analysis
|   +-- group_vq_test.py          # Group VQ falsification (k=16 dead end)
|   +-- rvq_benchmark.py          # RVQ falsification (lucidrains dead end)
|   +-- tensor_infra/             # 10-domain proof suite
+-- receipts/                     # All experiment receipts (JSON, with cost blocks)
+-- tests/
```
License
Echo Labs LLC. See LICENSE for details.
Citation
If you use helix-substrate in research, please cite:
```bibtex
@software{helix_substrate,
  author = {Josh (voidstr3m33)},
  title  = {helix-substrate: Calibration-free neural network compression},
  year   = {2026},
  url    = {https://github.com/echo313unfolding/helix-substrate}
}
```