gzip for AI models — train 13B on 12GB, fine-tune 20B on 24GB. 55% smaller files, 2× longer context.
vsqz — VRAMSqueeze: VRAM & File Compression for AI Models
One file. Half the VRAM. Double the model.
pip install vsqz — the gzip for AI models. Train 13B on a 12 GB card. Fine-tune 20B on 24 GB. Double your context window. Save 55% of disk and hosting space. Works on everything from RTX cards to H100 — avoid unnecessary GPU upgrades.
v0.2.6 — production-tested. 8 training + 3 archival techniques verified in a 9B QLoRA run (RTX 3090). 41 tests, autonomous CI, PR review bot.
AutoModel.from_pretrained(".vsqz") works. Test on your setup before relying on it for critical workloads. PRs welcome.
# Compress any model: 18GB → 8GB
python -m vsqz convert model/ output.vsqz
# Info: peek without loading
python -m vsqz info model.vsqz
# Training: wrap your optimizer, save VRAM
from vsqz import VRAMSqueeze
squeezer = VRAMSqueeze(model, optimizer=opt, preset="13B_24GB")
What GPUs Can Do With vsqz
Training (QLoRA + GaLore + FP16 States)
| GPU | VRAM | 4B | 9B | 13B | 20B |
|---|---|---|---|---|---|
| RTX 3060 | 12 GB | ✅ b=4 | ✅ b=2 | ✅ b=1 | ❌ |
| RTX 4070 | 12 GB | ✅ b=4 | ✅ b=3 | ✅ b=1 | ❌ |
| RTX 4080 | 16 GB | ✅ b=4 | ✅ b=4 | ✅ b=2 | ⚠️ b=1 |
| RTX 3090 | 24 GB | ✅ b=4 | ✅ b=4 | ✅ b=3 | ✅ b=1 |
| RTX 4090 | 24 GB | ✅ b=4 | ✅ b=4 | ✅ b=4 | ✅ b=2 |
Without vsqz: 9B max, no 13B or 20B on any consumer GPU.
Inference (Context Window Doubling via KV-Cache Compression)
| GPU | 4B | 9B | 13B | 20B |
|---|---|---|---|---|
| 8 GB | 16k ✅ | 8k ✅ | ❌ | ❌ |
| 12 GB | 32k ✅ | 16k ✅ | 8k ✅ | ❌ |
| 16 GB | 64k ✅ | 32k ✅ | 16k ✅ | 8k ✅ |
| 24 GB | 128k ✅ | 64k ✅ | 32k ✅ | 16k ✅ |
Without vsqz: context halved on every tier.
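For intuition on why KV-cache compression doubles the usable context: cache size grows linearly with sequence length, so halving bytes-per-token doubles the tokens that fit in a fixed budget. A back-of-the-envelope sketch; the architecture numbers below are illustrative assumptions, not measurements of any particular model:

# Rough KV-cache budget for a hypothetical 9B Llama-style model in FP16.
# layers / kv_heads / head_dim are assumptions for illustration only.
layers, kv_heads, head_dim = 40, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K and V, 2 bytes each (FP16)

budget_bytes = 3e9                                        # VRAM left for KV after weights
max_ctx = budget_bytes / bytes_per_token
print(f"{bytes_per_token / 1e6:.2f} MB/token -> ~{max_ctx / 1e3:.0f}k tokens")
print(f"with 2x KV compression: ~{2 * max_ctx / 1e3:.0f}k tokens")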
File & VRAM Savings
| Format | Original | vsqz | Archive (-z) | Savings |
|---|---|---|---|---|
| safetensors (9B) | 18 GB | 8 GB | 7 GB | 61% |
| GGUF F16 (9B) | 18 GB | 8 GB | 7 GB | 61% |
| PyTorch Checkpoint | 20 GB | 15 MB | 12 MB | 99.4% |
| ALL THREE → single .vsqz.zst | 56 GB | 8 GB | 7 GB | 87% |
How It Works — The Stack
vsqz combines 8 training + 3 archival techniques. Each targets a different memory region:
Training Optimizations
| Technique | Origin | What It Saves | VRAM Freed |
|---|---|---|---|
| GaLore | ICML 2024 | Optimizer states (SVD projection r=128) | ~2 GB |
| LISA | 2024 | Activations (50% layer sampling) | ~4 GB |
| FP16 States | Native | Optimizer precision (32→16 bit) | ~1.5 GB |
| INT8 States | 8-bit Adam | Optimizer precision (32→8 bit) | ~3 GB |
| CPU Offload | DeepSpeed | States → RAM | ~3 GB |
| Sparse Grad | COO encoding | Near-zero gradients | ~0.5 GB |
| Gradient Delta | git/rsync | ΔG instead of G | ~1 GB |
| Adaptive Quant | H.264/AV1 | Per-layer bit allocation | ~0.5 GB |
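To make the GaLore row concrete: the trick is to run Adam on a rank-r projection of each 2-D gradient, so the moment buffers shrink from m×n to r×n. A conceptual PyTorch sketch of that projection (not vsqz internals; rank 128 matches the table, everything else is simplified):

import torch

def galore_project(grad: torch.Tensor, rank: int = 128):
    # SVD of the gradient; keep the top-`rank` left singular vectors.
    # (GaLore recomputes this basis every few hundred steps, not per step.)
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]                    # projection basis (m x r)
    return P.T @ grad, P               # (r x n) gradient — what Adam's states see

def galore_unproject(update: torch.Tensor, P: torch.Tensor):
    # Map the optimizer's low-rank update back to full weight space (m x n).
    return P @ update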
Archival & Integrity
| Feature | Origin | What It Does | Savings |
|---|---|---|---|
| FP16 Compression | IEEE 754 | FP32→FP16 weight storage | 50% |
| zstd Post-Compress | Zstandard | Extra compression on top of FP16 | 5-15% |
| AdamW Stripping | vsqz | Remove optimizer dead weight | 99% |
| SHA-256 | NIST | Cryptographic integrity | – |
| Recovery Record | RAR | Self-repairing header | – |
| KV-Cache H.264 | StreamingLLM | I/P/B-frame token eviction | 2× context |
Training: 8 techniques active simultaneously. Archival: FP16 + zstd + AdamW strip stack.
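The archival stack is easy to picture outside of vsqz: cast weights to FP16, drop optimizer tensors, then run a general-purpose compressor over the raw bytes. A minimal sketch with numpy and the zstandard package, illustrating the pipeline shape rather than the actual .vsqz writer:

import numpy as np
import zstandard

def archive_tensor(w_fp32: np.ndarray) -> bytes:
    w_fp16 = w_fp32.astype(np.float16)                 # FP32 -> FP16: 2x smaller
    return zstandard.ZstdCompressor(level=19).compress(w_fp16.tobytes())

state = {
    "layer.weight": np.random.randn(4096, 4096).astype(np.float32),
    "layer.weight.exp_avg": np.zeros((4096, 4096), np.float32),  # AdamW moment
}
# AdamW stripping: keep weights, drop optimizer state before compressing.
blobs = {k: archive_tensor(v) for k, v in state.items() if "exp_avg" not in k}
# (Real weight matrices compress better than the random data used here.)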
Quickstart
Install
pip install vsqz
CLI — same flags as gzip/zip
Works like gzip. Linux users already know the flags.
| Flag | What it does |
|---|---|
| -1 .. -9 | Compression level (1=fast/fp16, 9=best/int8+sparse) |
| -k | Keep original file |
| -d | Decompress to original format |
| -v / -q | Verbose / quiet |
| -f | Force overwrite |
| -t | SHA-256 integrity test |
| -l | List metadata (shows fingerprint) |
| -r | Recursive (all models in directory) |
| -s SIZE | Split into chunks (e.g. -s 8G for cloud) |
| -x KEY | Exclude tensors (e.g. -x adam strips optimizer) |
| -z | Post-compress with zstd (archive mode, 5-15% extra) |
Useful Combinations
# Archive model for long-term storage (max compression + zstd)
vsqz -kz9 model/                 # → model.vsqz.zst (smallest possible file)
# Convert ALL models in collection (archive, keep originals, max compression)
vsqz -kr9z ~/models/             # → every model gets a .vsqz.zst, raw files kept
# Compare: vsqz -lr ~/models/ → peek sizes, decide what to delete
# Convert GGUF collection to .vsqz for archiving
find ~/models -name "*.gguf" -o -name "*.safetensors" | while read f; do
vsqz -kz "$f" # compress each, keep original
done
# Free 50%+ disk space after verifying all .vsqz files
find ~/models -name "*.vsqz" | while read f; do
vsqz -t "$f" && rm "${f%.vsqz}" # delete original if .vsqz is valid
done
# Cloud upload with zstd
vsqz -kzs 8G large-model/        # → .001, .002, ... .zst (compressed chunks)
# Clean checkpoint (strip AdamW, compress, keep original)
vsqz -kx adam pytorch_model.bin  # → weights only, 99% smaller
# Download once, compress, delete original
vsqz model.safetensors           # → model.safetensors.vsqz (no raw left)
# Verify integrity before deleting original
vsqz -t model.vsqz && rm model.safetensors
# Recursively compress all models, keep originals, show stats
vsqz -krv ~/models/
# Decompress zstd archive, verbose
vsqz -dv model.vsqz.zst
Verify Compression (before deleting originals)
# Peek the .vsqz header — tensor count, size, technique stack
python -c "
from vsqz.vsqz_format import peek_vsqz
h = peek_vsqz('model.vsqz')
print(f'Tensors: {len(h[\"tensors\"])}, Size: {sum(t[\"size\"] for t in h[\"tensors\"].values())/1e9:.1f} GB')
print(f'Techniques: {h[\"technique_stack\"]}')
"
# For a real integrity check before deleting anything: vsqz -t model.vsqz
HuggingFace Integration (AutoModel)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("model.vsqz") # Just works
No conversion needed — .vsqz loads directly as a HuggingFace model.
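Since the file behaves like any other from_pretrained source, the usual loading knobs should carry over. A sketch (assumption: .vsqz paths accept the standard transformers arguments):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "model.vsqz",
    torch_dtype=torch.float16,   # match the FP16 storage, avoid an upcast
    device_map="auto",           # spread layers across available GPUs/CPU
)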
Training (HuggingFace / Axolotl)
import torch
from vsqz import VRAMSqueeze
from transformers import AutoModelForCausalLM, Trainer

model = AutoModelForCausalLM.from_pretrained("Qwen2.5-7B")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One line: activate all optimizations
squeezer = VRAMSqueeze(model, optimizer=optimizer, preset="13B_24GB")
# Presets: "9B_12GB", "13B_24GB", "20B_24GB", "safe_defaults"
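With the squeezer attached, the loop itself stays plain PyTorch; nothing below is vsqz-specific except the setup above. The dataloader is a placeholder for your own pipeline:

for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()                          # optimizer states are where vsqz saves VRAM
    optimizer.zero_grad(set_to_none=True)     # release gradient memory eagerly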
Inference (KV-Cache Compression)
from vsqz import VRAMSqueeze
squeezer = VRAMSqueeze(model, mode="inference", preset="balanced")
for step in generation_loop:
squeezer.evict_if_needed(current_seq_len) # Auto-evict old tokens
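In a manual decode loop, the eviction call slots in once per generated token. A sketch around standard transformers KV caching; tokenizer, model, and max_new_tokens are placeholders, and everything except evict_if_needed is stock HuggingFace (greedy decoding for brevity):

import torch

input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
past = None
for _ in range(max_new_tokens):
    out = model(input_ids if past is None else input_ids[:, -1:],
                past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_id], dim=-1)
    squeezer.evict_if_needed(input_ids.shape[1])   # drop stale KV entries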
File Format: .vsqz
[0..3] Magic: VSQZ (4 bytes)
[4..7] Version: uint32 (4 bytes)
[8..11] Header Len: uint32 (4 bytes)
[12..] JSON Header (config, SHA-256, tensor index)
[...] Tensor Blobs (FP16 + GaLore + INT8)
[...] Recovery JSON → Recovery Len: uint32 → RECO
- Self-describing: anyone who sees .vsqz knows vsqz was used
- Mmap-compatible for zero-copy loading
- One file for everything: weights + optimizer + metadata
- Open format: read it with any JSON parser + numpy
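The layout is small enough to parse by hand, which is what "open format" means in practice. A reader sketch using only the offsets documented above (byte order assumed little-endian; the JSON field names are whatever the header actually contains):

import json, struct

with open("model.vsqz", "rb") as f:
    assert f.read(4) == b"VSQZ", "not a vsqz file"
    version, header_len = struct.unpack("<II", f.read(8))
    header = json.loads(f.read(header_len))

print(version, sorted(header.keys()))   # config, SHA-256, tensor index, ...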
Requirements
- Python ≥ 3.10
- PyTorch ≥ 2.0
- Optional: optuna (Bayesian HPO), safetensors (converter)
Archive Mode (-z / --zstd)
Stacks FP16 + zstd + AdamW stripping. Optimal for long-term storage, cloud upload, and model distribution. Use -kz9 for maximum compression, -kzs 8G for chunked cloud upload.
| Step | What happens | Size reduction |
|---|---|---|
| 1. FP32→FP16 | Half-precision weights | 2× |
| 2. AdamW Strip | Remove optimizer states | 99%+ |
| 3. zstd | Post-compression | 5-15% extra |
| Combined | Archive grade | 87% vs all three formats |
Integrity & Security
Every .vsqz file carries its own SHA-256 fingerprint and a recovery record at the end of the file. If the main header gets corrupted, the file self-repairs from the recovery record.
vsqz -t model.vsqz # SHA-256 verified integrity check
vsqz -l model.vsqz # Shows SHA-256 fingerprint
# If header is corrupted: auto-restores from recovery record
No other ML format has self-repair. GGUF and safetensors have no checksums at all.
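vsqz -t runs this check for you; for illustration, here is the generic chunked SHA-256 pattern in plain Python. Which bytes the stored fingerprint covers (whole file vs. tensor payload) is an internal detail, so treat this as the shape of the check, not its exact input:

import hashlib

def sha256_file(path: str, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):      # stream in 1 MB chunks
            h.update(block)
    return h.hexdigest()

print(sha256_file("model.vsqz"))           # compare with: vsqz -l model.vsqz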
Developer Experience
vsqz -h # gzip-style help with all flags
Every PR gets an automated review (imports, stubs, extensions, tests, paths, README consistency). Results are posted as a PR comment — no human reviews broken code.
| CI Job | What it checks |
|---|---|
| Test (3.10/3.11/3.12) | 41 tests across all supported Python versions |
| Lint | Code style consistency (ruff) |
| Review Bot | 8 structural checks (diff-based test coverage), posted as PR comment |
| Auto-labels | 6 categories ("format", "training", "inference", "tests", ...) applied based on changed files |
Ecosystem Integration
llama.cpp PR in progress. Once merged, every llama.cpp-based client (Ollama, LM Studio, text-generation-webui) will load .vsqz files natively — no conversion, no Python bridge. See contrib/ for the llama.cpp reader patch and axolotl integration guide.
Why vsqz?
|  | GGUF | safetensors | vsqz |
|---|---|---|---|
| Training | ❌ | ✅ | ✅ |
| Inference | ✅ | ❌ | ✅ |
| Optimizer State | ❌ | ❌ | 15 MB |
| Context Expansion | ❌ | ❌ | 2× |
| File Size (9B) | 18 GB | 18 GB | 8 GB |
| zstd Archive | ❌ | ❌ | ✅ (-z, +15%) |
| SHA-256 + Recovery | ❌ | ❌ | ✅ |
| Universal | ❌ | ❌ | ✅ |
One file. Training and inference. SHA-256 verified. Self-repairing.
Academic References
- Zhao et al., "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection", ICML 2024
- Pan et al., "LISA: Layer-wise Importance Sampling for Memory-Efficient LLM Fine-Tuning", 2024
- Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs", NeurIPS 2023
- Xiao et al., "StreamingLLM: Efficient Streaming Language Models with Attention Sinks", 2023
Author: Christian Butterweck — github.com/butterwecksolutions
License: MIT
File details
Details for the file vsqz-0.2.7.tar.gz.
File metadata
- Download URL: vsqz-0.2.7.tar.gz
- Size: 15.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e9b78624832bd1525402f8a115ab03cb0090d3387b42db0da8f70711a6bd4edc |
| MD5 | 2cf6d5581aff6f78d5141f3aa8daa756 |
| BLAKE2b-256 | 5400b730052bbd2525539f6bc76cde0ce93d310fcfe1e5e47ae72db4e91cf3af |
File details
Details for the file vsqz-0.2.7-py3-none-any.whl.
File metadata
- Download URL: vsqz-0.2.7-py3-none-any.whl
- Size: 7.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 105264e31fbf11b65bf6e5d68c0dfbe37e604b8a2cfc77ddc1953ced2980199e |
| MD5 | 3a1e8b764859d9941c9f2644263854b2 |
| BLAKE2b-256 | f75bc5b2022b2cfedbff69007d33d88cd6618384411ecf98748f12fd87e9120a |