
Model weight compression and streaming decode library

Project description

helix-substrate

Python 3.10+ | PyTorch 2.0+ | 4x compression

Calibration-free VQ compression. Beats GPTQ quality at 7B (+6.3% vs +8.2% PPL) and AWQ at 14B by 15.4%. No training data. No fine-tuning. Works on transformers, SSMs, CNNs, vision encoders, and embedding models with the same command and no code changes.

pip install helix-substrate

from helix_substrate import CDNAv3Writer, CDNAv3Reader
import numpy as np

# Compress any 2D weight tensor (a random matrix stands in for a real weight here)
weight_matrix = np.random.randn(1024, 768).astype(np.float32)

writer = CDNAv3Writer("./compressed/")
writer.write_tensor(weight_matrix, "layer_name")

# Reconstruct
reader = CDNAv3Reader("./compressed/layer_name.cdnav3")
reconstructed = reader.reconstruct()  # cosine similarity >= 0.999

Model Zoo

Pre-compressed models on HuggingFace. One import, one line to load:

import helix_substrate.hf_quantizer  # registers cdna_v3
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-1.2b-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/zamba2-1.2b-helix")
Model | Architecture | Ratio | PPL Delta | Size | Link
Zamba2-1.2B | Hybrid (Mamba2+Transformer) | 4.0x | +2.90% | 1.35 GB | HF
Qwen2.5-Coder-3B | Transformer | 4.44x | +1.92% | 3.84 GB | HF
TinyLlama-1.1B | Transformer | 3.99x | +0.78% | 1.03 GB | HF
Mamba-130M | Pure SSM | 5.61x | +18.4% | 128 MB | HF

Three architectures, one codec. CDNA v3 compresses any nn.Linear — transformer attention, Mamba projections, hybrid layers. Same pip install, same API, same codebook format.

What it does

helix-substrate compresses neural network weights using scalar k-means vector quantization. Each weight value is assigned to the nearest entry in a learned 256-entry codebook. Outlier values (top 0.1% by magnitude) are preserved exactly in a sparse sidecar. The result is a codebook + uint8 indices + sidecar representation at ~4x compression with negligible quality loss.

No calibration data. No fine-tuning. No architecture-specific code.
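
To make the scheme concrete, here is a minimal NumPy/scikit-learn sketch of the idea described above (scalar codebook, uint8 indices, exact outlier sidecar). It is an illustration only, not the helix-substrate implementation; shapes and thresholds are placeholders.

import numpy as np
from sklearn.cluster import KMeans

W = np.random.randn(1024, 768).astype(np.float32)   # stand-in for a real weight matrix

# 1. Reserve the top 0.1% of values by magnitude for the exact sidecar.
thresh = np.quantile(np.abs(W), 0.999)
outliers = np.abs(W) >= thresh

# 2. Fit a 256-entry scalar codebook on the remaining values (no calibration data).
km = KMeans(n_clusters=256, n_init=1, max_iter=15, random_state=0)
km.fit(W[~outliers].reshape(-1, 1))
codebook = km.cluster_centers_.ravel().astype(np.float32)                # 256 floats
indices = km.predict(W.reshape(-1, 1)).astype(np.uint8).reshape(W.shape)

# 3. Reconstruct: codebook lookup for every weight, outliers patched back exactly.
W_hat = codebook[indices]
W_hat[outliers] = W[outliers]

cos = float((W * W_hat).sum() / (np.linalg.norm(W) * np.linalg.norm(W_hat)))
print(f"cosine similarity: {cos:.4f}")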

Benchmarks

Weight Compression Quality (RTX 4090, WikiText-2 PPL)

Model | Method | PPL | PPL Delta | Calibration
Qwen2.5-7B | FP16 Dense | 6.949 | baseline | --
Qwen2.5-7B | HelixLinear k=256 | 7.388 | +6.34% | None
Qwen2.5-7B | GPTQ Int4 | 7.518 | +8.2% | 128 sequences
Qwen2.5-7B | AWQ Int4 | 7.719 | +11.1% | Activation stats
Qwen2.5-14B | HelixLinear k=256 | 3.78 | -- | None
Qwen2.5-14B | AWQ Int4 | 4.47 | -- | Activation stats

At 7B, Helix shows 23% less PPL degradation than GPTQ; at 14B, it beats AWQ by 15.4%. All with zero calibration data.

The remaining +6.34% PPL delta comes primarily from early down_proj layers (layers 3-4) at 0.964 cosine. These are the highest-kurtosis FFN tensors in the model. Rank-32 SVD on those specific layers is expected to push this below +4%.

Quality vs ratio tradeoff: GPTQ/AWQ achieve 8x compression with worse quality. helix-substrate achieves 4x with the best quality of any post-training method tested, and requires zero calibration data. VQ degrades more gracefully than INT4 at scale — the quality gap widens as model size increases.

Architecture Coverage (all k=256, same compress.py)

Model | Architecture | Tensors | Ratio | Cosine (min)
TinyLlama 1.1B | Transformer (LLaMA) | 154 | 3.98x | 0.9946
Qwen2.5 1.5B | Transformer (Qwen) | 196 | 4.00x | 0.9943
Qwen2.5 7B | Transformer (Qwen) | 196 | 4.00x | 0.9955
Qwen2.5 14B | Transformer (Qwen) | 336 | 4.00x | --
Mamba-130m | SSM (Mamba) | 97 | 3.92x | 0.9990+
Mamba-2 1.3B | SSM (Mamba-2) | 98 | 3.99x | 0.9990+
MiniLM-L6 | Transformer (BERT) | 73 | 3.94x | 0.9997
CLIP ViT-B/32 | Vision Transformer | 49 | 3.98x | 0.9997
Zamba2-1.2B | Hybrid (Mamba2+Transformer) | 136 | 4.00x | 0.9973
ResNet-18 | CNN | 1 (fc) | 3.97x | 0.9998

All compressed with the same command. No architecture-specific flags or code paths.

Compression Quality Frontier (TinyLlama, PPL on WikiText-2)

Config | Ratio | PPL Delta | Status
k=256 + sidecar | 4.0x | +0.11% | Production baseline
k=64 + sidecar | 5.3x | +1.44% | Model-dependent (fails on Qwen at +2.78%)
k=32 + sidecar | 6.4x | +2.61% | Below quality threshold
k=16 + sidecar | 8.0x | +9.34% | Rejected

Dead ends tested for 8x (all falsified with receipts):

  • Group VQ k=16 (per-column codebooks): cos=0.991 vs 0.999 global
  • Off-the-shelf ResidualVQ (Lucidrain, full-row vector quantization): cos=0.26
  • SVD residual correction: hurts at 7B
  • Channel scaling / calibration: zero net benefit

Sub-vector product quantization (AQLM/VPTQ) could reach 8x but requires architecture-aware calibration, destroying the universality advantage. See receipts/group_vq/, receipts/rvq_benchmark/, receipts/scaling_analysis/.

Key findings

Outlier sidecar is non-negotiable. Without it, k=256 VQ produces PPL 274. With it, PPL 6.18. Cosine similarity is identical (0.999) in both cases. The top 0.1% of weights by magnitude carry outsized importance despite being statistically invisible. This means cosine alone is not a safe quality metric -- outlier preservation is mandatory.

SVD residual correction was tested and rejected. On TinyLlama (1.1B), kurtosis-routed SVD gave marginal per-tensor cosine improvement. Crossover test at 1.5B, 3B, 7B: SVD adds zero value at 1.5B/3B and actively hurts at 7B (+4% PPL). Plain k-means VQ-256 is optimal at all scales tested. The simplest approach wins.

Kurtosis routing beats Hessian routing on TinyLlama but the routing target (SVD) is dead at scale. The finding stands: calibration-free signals (kurtosis from weights alone) outperform calibration-dependent signals (Hessian). But the routing itself is disabled — plain VQ for everything.

Weighted k-means is harmful. Hessian-weighted centroid placement gives +2.93% PPL -- actively worse than unweighted. Distorting the codebook toward "important" columns degrades it for the majority of weights.

Embedding tables must stay dense. VQ on embed_tokens and lm_head inflated 7B PPL from +6.34% to +11%. Two lines of code (exclude both from VQ) eliminated the entire quality gap vs GPTQ. Lesson: embedding tables have uniform importance across all rows — VQ's "representative centroid" assumption fails catastrophically when every entry is equally important.
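
As an illustration of that rule, here is one way the exclusion could be expressed when walking a model's modules. The actual flag or code path inside tools/compress.py is not shown in this README, so the names below are hypothetical:

import torch.nn as nn

SKIP_SUBSTRINGS = ("embed_tokens", "lm_head")   # leave these dense (FP16/FP32)

def vq_targets(model: nn.Module):
    """Yield (name, module) pairs that are safe to VQ-compress."""
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if any(s in name for s in SKIP_SUBSTRINGS):
            continue   # every row of an embedding table matters equally
        yield name, module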

Quick start

Compress a model

python tools/compress.py \
  --model-dir /path/to/model \
  --out-dir /path/to/output \
  --k 256 --sidecar

Load compressed weights for inference

From HuggingFace (recommended):

import helix_substrate.hf_quantizer  # register quantizer
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-1.2b-helix")
output = model.generate(input_ids, max_new_tokens=128)

From local CDNA v3 directory:

from transformers import AutoModelForCausalLM
from helix_substrate.helix_linear import swap_to_helix

model = AutoModelForCausalLM.from_pretrained("path/to/model")
swap_to_helix(model, "path/to/cdnav3/")
# All nn.Linear modules replaced with HelixLinear
# Forward pass works normally -- codebook[indices] -> matmul
output = model.generate(input_ids, max_new_tokens=128)
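
For intuition, the forward pass described in the comments above amounts to a codebook gather followed by an ordinary matmul. A simplified toy module (ignoring the outlier sidecar, dtype, and device handling that the shipped HelixLinear performs):

import torch
import torch.nn as nn

class ToyHelixLinear(nn.Module):
    def __init__(self, codebook: torch.Tensor, indices: torch.Tensor):
        super().__init__()
        self.register_buffer("codebook", codebook.float())   # (256,)
        self.register_buffer("indices", indices)             # (out_features, in_features), uint8

    def forward(self, x):
        w = self.codebook[self.indices.long()]   # decode the weight matrix from the codebook
        return nn.functional.linear(x, w)        # then a normal matmul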

Compress any tensor

from helix_substrate import CDNAv3Writer, CDNAv3Reader
from helix_substrate.tensor_policy import TensorPolicy, TensorClass
import numpy as np

tensor = np.random.randn(1024, 768).astype(np.float32)

policy = TensorPolicy(
    tensor_class=TensorClass.UNKNOWN,
    storage_mode="codebook+sidecar",
    n_clusters=256,
    use_kmeans=True,
    sidecar_enabled=True,
    percentile=99.9,
    max_corrections=512,
)

writer = CDNAv3Writer("./output/")
stats = writer.write_tensor(tensor, "my_tensor", policy=policy)

reader = CDNAv3Reader("./output/my_tensor.cdnav3")
reconstructed = reader.reconstruct()
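
A quick way to sanity-check the round trip from the snippet above (flattened cosine similarity over the whole tensor):

import numpy as np

cos = float((tensor * reconstructed).sum()
            / (np.linalg.norm(tensor) * np.linalg.norm(reconstructed)))
assert cos >= 0.999, f"unexpectedly low cosine: {cos:.4f}"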

10-Domain Tensor Infrastructure Proofs

The same codec handles any 2D float32 tensor. No modifications needed across domains.

Domain | Data Source | Key Metric | Verdict
Gradient compression | TinyLlama backward pass | SGD step cos=1.000 | PASS
Embedding tables | TinyLlama embed_tokens | Row cos min=0.983 | WEAK
Activation checkpointing | TinyLlama activations | cos min=0.996, 3.90x | PASS
Federated learning deltas | SGD weight deltas | Weight cos=1.000, 4.0x | PASS
Neural codec weights | CLIP ViT + ResNet-18 | cos 0.9997+, 100% pred match | STRONG
RAG index | MiniLM embeddings | top-1 100%, top-5 4.9/5 | STRONG
LoRA adapters | PEFT LoRA matrices | All 88 matrices cos>=0.9997 | STRONG
MoE tiered compression | Simulated expert split | Fidelity tiers work | MIXED
Continual learning | Model snapshots | Full 4.0x, delta cos=1.0 | PASS
Sensor / scientific data | scRNA-seq + protein PDB | ARI 0.75-0.92 | MIXED

All receipts in receipts/tensor_infra/.

Companion projects

helix-online-kv

Online KV cache compression using the same VQ codec. Fits codebooks on the first 128 tokens, then VQ-assigns all subsequent KV entries in real time.

  • 2.81 ms/token encoding latency (gate: <5ms)
  • 1.9x more tokens fit in same VRAM
  • End-to-end with HelixLinear: +0.77% PPL at 1329 MB on Quadro T2000
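
A minimal sketch of the calibrate-then-stream pattern described above, assuming a scalar codebook like the weight codec; the real helix-online-kv API and per-layer details are not shown here:

import numpy as np
from sklearn.cluster import KMeans

def calibrate(kv_prefix: np.ndarray, k: int = 256) -> np.ndarray:
    """Fit a codebook on the first ~128 tokens' K/V values (shape: tokens x head_dim)."""
    km = KMeans(n_clusters=k, n_init=1).fit(kv_prefix.reshape(-1, 1))
    return km.cluster_centers_.ravel().astype(np.float32)

def stream_assign(kv_new: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Encode later K/V entries as uint8 indices into the frozen codebook."""
    idx = np.abs(kv_new.reshape(-1, 1) - codebook).argmin(axis=1)
    return idx.astype(np.uint8).reshape(kv_new.shape)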

Compressed-domain attention (CDC-03): product-quantization scores rank all cached tokens cheaply, the top 128 by approximate score are selected, and exact attention runs only on that subset. Proven at cosine 0.9997 on layers 1-21 with 12.5% token coverage. Projected 29x compute savings at 4K context, 900x at 128K.
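
A rough PyTorch sketch of that two-stage pattern. The cheap scoring stage here is a plain dot product on small stand-in representations; in CDC-03 it would come from product-quantization lookup tables:

import torch

def cdc03_style_attention(q, K, V, q_small, K_small, top_k=128):
    # q: (d,), K/V: (T, d); q_small/K_small are illustrative cheap representations
    approx = K_small @ q_small                          # (T,) approximate scores
    idx = approx.topk(min(top_k, K.shape[0])).indices   # keep the most promising tokens
    scores = (K[idx] @ q) / (K.shape[1] ** 0.5)         # exact scores on the subset only
    return torch.softmax(scores, dim=0) @ V[idx]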

echo_runtime

Unified inference runtime wiring HelixLinear + CompressedKVCache + CDC-03 attention into one forward pass. One config file, one command.

  • 155/155 weight modules compressed (3.98x)
  • 22-layer KV cache (layer 0 exact, 1-21 streaming VQ)
  • 21 CDC-03 hybrid attention layers
  • 1404 MB peak VRAM on Quadro T2000 (4 GB card)
  • 14.3 tok/s, coherent output

How it works

Input Tensor (2D float32)
        |
        v
+-------------------+
| K-Means VQ k=256  | <-- 256 centroids, 15 iterations, no calibration
+--------+----------+
         |
         v
+--------------------------+
| Outlier Sidecar          |
| Top 0.1% -> exact FP32   |
+-----------+--------------+
            |
            v
  codebook.npy (256 x 4B)
  indices.npy  (rows x cols x uint8)
  sidecar.npz  (sparse corrections)
  meta.json    (kurtosis, quality, config)

What's honest

We do not claim to have invented VQ for neural networks. VQ weight compression dates to the 1980s, with DNN applications since 2015. Our differentiators are: calibration-free operation, architecture-agnostic coverage including SSMs (no prior work compresses Mamba through the same pipeline as LLaMA), kurtosis-based statistical routing that outperforms Hessian-based approaches, and the intelligence layer (adaptive routing, symbolic governance, semantic memory indexing).

4x is the universal number. 5.3x is model-dependent. k=64 passes on TinyLlama (+1.44%) but fails on Qwen-1.5B (+2.78%). We do not claim universal 5.3x compression.

The GPU path does late materialization. The fused Triton kernel computes Y = X @ codebook[indices] directly from compressed VQ indices -- the full weight matrix W never hits global VRAM. Measured peak allocation is 0.4% of W size across all tensor shapes (attn, FFN gate, FFN down). This is not a roadmap item; it ships today. Receipt: receipts/late_materialization/late_materialization_20260326T131246.json.
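
The shipped kernel is written in Triton; the plain PyTorch loop below only illustrates why the full W never needs to exist at once: each block of rows is decoded from the codebook, used, and discarded. It is a sketch, not the fused kernel.

import torch

def vq_matmul_blockwise(x, indices, codebook, block=512):
    # x: (B, in_features), indices: (out_features, in_features) uint8, codebook: (256,)
    out = x.new_empty(x.shape[0], indices.shape[0])
    for start in range(0, indices.shape[0], block):
        w_block = codebook[indices[start:start + block].long()]   # decode a slice of W
        out[:, start:start + block] = x @ w_block.T               # use it, then let it go
    return out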

Speed comparison against GPTQ/AWQ is not yet fair. The decode speed gap reflects kernel maturity, not architecture. GPTQ/AWQ have years of optimization (Marlin, exllama2). Our Triton kernel is correct and memory-efficient but not yet throughput-optimized. Our advantage is on quality, universality, and the compressed runtime stack.

VQ is 4x, not 8x. GPTQ/AWQ achieve 8x compression (INT4). We achieve 4x (uint8 indices + codebook). Our compression ratio is lower, but our quality is better. The right comparison is quality at a given memory budget, not compression ratio alone.

Prior art and references

This work builds on and differentiates from:

  • Choi & El-Khamy (NIPS 2018) -- Universal DNN compression via lattice VQ. First "universal" VQ-for-DNN paper. We differ: calibration-free, no fine-tuning, architecture-agnostic including SSMs.
  • VQ4ALL (Dec 2024) -- Universal codebook shared across networks. We differ: per-tensor codebooks (better quality), purely post-training, no calibration.
  • AQLM (ICML 2024) -- Additive multi-codebook quantization for 2-bit LLM compression. We differ: calibration-free, architecture-agnostic, quality-first (4x not 16x).
  • GPTQ / AWQ -- INT4 with Hessian/activation-aware scaling. We differ: calibration-free, VQ (non-uniform), works on SSMs.
  • SpQR / SqueezeLLM -- Outlier-preserving mixed-precision. Our sidecar mechanism is related but calibration-free (magnitude percentile, not Hessian sensitivity).
  • KIVI / KVQuant -- KV cache quantization. Our helix-online-kv uses VQ codebooks with calibrate-then-stream, combined with weight compression from the same codec.

Project structure

helix-substrate/
+-- helix_substrate/
|   +-- cdnav3_writer.py       # Compress tensors to CDNA v3 format
|   +-- cdnav3_reader.py       # Reconstruct from CDNA v3
|   +-- tensor_policy.py       # Compression routing policy
|   +-- helix_linear.py        # Drop-in nn.Linear replacement
|   +-- hf_quantizer.py        # HuggingFace AutoModel integration
|   +-- generate_sidecars_v3.py # Outlier sidecar generation
|   +-- triton_vq_matmul.py    # Fused Triton kernel (late materialization)
+-- tools/
|   +-- compress.py            # Universal model compressor (one command)
|   +-- eval_ppl_cpu.py        # CPU perplexity evaluation
|   +-- cloud_ready_check.py   # Pre-cloud deployment validation
|   +-- scaling_analysis.py    # VQ scaling hypothesis analysis
|   +-- group_vq_test.py       # Group VQ falsification (k=16 dead end)
|   +-- rvq_benchmark.py       # RVQ falsification (Lucidrain dead end)
|   +-- tensor_infra/          # 10-domain proof suite
+-- receipts/                  # All experiment receipts (JSON, with cost blocks)
+-- tests/

License

Echo Labs LLC. See LICENSE for details.

Citation

If you use helix-substrate in research, please cite:

@software{helix_substrate,
  author = {Josh (voidstr3m33)},
  title = {helix-substrate: Calibration-free neural network compression},
  year = {2026},
  url = {https://github.com/echo313unfolding/helix-substrate}
}

Download files

Download the file for your platform.

Source Distribution

helix_substrate-0.2.3.tar.gz (353.7 kB)

Uploaded Source

Built Distribution


helix_substrate-0.2.3-py3-none-any.whl (293.1 kB)

Uploaded Python 3

File details

Details for the file helix_substrate-0.2.3.tar.gz.

File metadata

  • Download URL: helix_substrate-0.2.3.tar.gz
  • Upload date:
  • Size: 353.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for helix_substrate-0.2.3.tar.gz
Algorithm | Hash digest
SHA256 | 8a0ca70f3a5ffc23ba3bb296b0054b0f7633ac86886ec3ec82b901517b237a53
MD5 | 401f97244bfc44f574eb059191aed17a
BLAKE2b-256 | f61eefe09f0c1123f5d1bcb14b66cd0cf04f3fe0b0cf5bd73fde033edeaa6a87


File details

Details for the file helix_substrate-0.2.3-py3-none-any.whl.

File metadata

  • Download URL: helix_substrate-0.2.3-py3-none-any.whl
  • Upload date:
  • Size: 293.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for helix_substrate-0.2.3-py3-none-any.whl
Algorithm | Hash digest
SHA256 | 982e92b7a99958c87a62bf0f3bc1ae35b43f12b716f005acf41c63a08a937d0a
MD5 | de0995839b8e6c52aeeaaa32b2fd9598
BLAKE2b-256 | 655db9ab19e6843caf81457201295f1cadf119571be40741c1cf5d5ca089a41f

