
gzip for AI models — train 13B on 12GB, fine-tune 20B on 24GB. 55% smaller files, 2× longer context.


vsqz — Memory-Efficient Training & Inference for Consumer GPUs

One file. Half the VRAM. Double the model.


pip install vsqz — the gzip for AI models. Train 13B on a 12GB card. Fine-tune 20B on 24GB. Double your context window. Save 55% disk & webspace. Works with any HuggingFace model, any training framework.

v0.1.0 — experimental release. All 8 techniques are production-tested in a 9B QLoRA training pipeline (RTX 3090, 24GB). Tests pass. Disk compression works. But: no CI/CD yet, no AutoModel.from_pretrained(".vsqz") yet, no published benchmarks. Test on your setup before relying on it. PRs welcome.

# Compress any model: 18GB → 8GB
python -m vsqz convert model/ output.vsqz

# Info: peek without loading
python -m vsqz info model.vsqz

# Training (Python): wrap your existing model and optimizer (here `model` and `opt`) to save VRAM
from vsqz import VRAMSqueeze
squeezer = VRAMSqueeze(model, optimizer=opt, preset="13B_24GB")

What GPUs Can Do With vsqz

Training (QLoRA + GaLore + FP16 States)

GPU       VRAM   4B      9B      13B     20B
RTX 3060  12 GB  ✅ b=4  ✅ b=2  ✅ b=1  -
RTX 4070  12 GB  ✅ b=4  ✅ b=3  ✅ b=1  -
RTX 4080  16 GB  ✅ b=4  ✅ b=4  ✅ b=2  ⚠️ b=1
RTX 3090  24 GB  ✅ b=4  ✅ b=4  ✅ b=3  ✅ b=1
RTX 4090  24 GB  ✅ b=4  ✅ b=4  ✅ b=4  ✅ b=2

Without vsqz: 9B max, no 13B or 20B on any consumer GPU.
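The training table depends largely on shrinking optimizer state. The following is a rough sketch of that arithmetic for a single weight matrix, not a measurement of vsqz itself. It assumes the state sizes described in the GaLore paper (one projector plus two low-rank moment tensors) and a hypothetical 4096 × 4096 layer; r=128 is the rank this page lists for GaLore, and the function names are made up for illustration.

```python
def adam_state_elems(m, n):
    """Plain Adam keeps two moment tensors with the same shape as the m x n weight."""
    return 2 * m * n


def galore_state_elems(m, n, r):
    """GaLore (for m <= n) keeps an m x r projector plus two r x n moment tensors."""
    return m * r + 2 * r * n


def mib(elems, bytes_per_elem):
    """Convert an element count to MiB at the given precision."""
    return elems * bytes_per_elem / 2**20


# Hypothetical layer shape; r=128 as listed for GaLore on this page.
m, n, r = 4096, 4096, 128

full_fp32 = mib(adam_state_elems(m, n), 4)         # Adam, 32-bit states
galore_fp32 = mib(galore_state_elems(m, n, r), 4)  # GaLore, 32-bit states
galore_fp16 = mib(galore_state_elems(m, n, r), 2)  # GaLore, 16-bit states

print(full_fp32, galore_fp32, galore_fp16)  # 128.0 6.0 3.0
```

Per-layer savings of this kind add up across all projected layers; the totals on a real model depend on which layers are projected and how many parameters are trainable.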

Inference (Context Window Doubling via KV-Cache Compression)

VRAM   4B       9B      13B     20B
8 GB   16k ✅   8k ✅   -       -
12 GB  32k ✅   16k ✅  8k ✅   -
16 GB  64k ✅   32k ✅  16k ✅  8k ✅
24 GB  128k ✅  64k ✅  32k ✅  16k ✅

Without vsqz: context halved on every tier.
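Why halving the cache doubles the context can be seen from the usual KV-cache size formula. The sketch below is back-of-the-envelope arithmetic under stated assumptions: the model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the 2 GiB cache budget are hypothetical, and the 2× compression ratio is the figure claimed on this page rather than something the code measures.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_context(budget_bytes, per_token_bytes, compression_ratio=1.0):
    """Tokens that fit in the budget if each token's cache shrinks by compression_ratio."""
    return int(budget_bytes * compression_ratio // per_token_bytes)


# Hypothetical model shape and cache budget (not taken from the vsqz docs).
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
budget = 2 * 2**30  # assume 2 GiB of VRAM is left over for the cache

plain = max_context(budget, per_token)
squeezed = max_context(budget, per_token, compression_ratio=2.0)

print(per_token, plain, squeezed)  # 131072 16384 32768
```

The cache grows linearly with sequence length, so any fixed compression ratio on the cache translates directly into the same ratio on the maximum context for a given budget.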


Disk Savings

Format                    Original  vsqz   Savings
safetensors (9B)          18 GB     8 GB   55%
GGUF F16 (9B)             18 GB     8 GB   55%
PyTorch Checkpoint        20 GB     15 MB  99.9%
ALL THREE → single .vsqz  56 GB     8 GB   86%
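The savings column is simply one minus the size ratio. A minimal sketch of that calculation, assuming decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and using the sizes from the table; the helper name is illustrative.

```python
GB = 10**9
MB = 10**6


def savings_percent(original_bytes, compressed_bytes):
    """Percentage of the original size that compression removes."""
    return 100.0 * (1 - compressed_bytes / original_bytes)


print(round(savings_percent(18 * GB, 8 * GB), 1))   # 55.6
print(round(savings_percent(20 * GB, 15 * MB), 1))  # 99.9
print(round(savings_percent(56 * GB, 8 * GB), 1))   # 85.7
```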

How It Works — The Stack

vsqz combines 8 orthogonal memory-saving techniques. Each targets a different VRAM region:

Technique       Origin        What It Saves                            VRAM Freed
GaLore          ICML 2024     Optimizer states (SVD projection r=128)  ~2 GB
LISA            2024          Activations (50% layer sampling)         ~4 GB
FP16 States     Native        Optimizer precision (32→16 bit)          ~1.5 GB
INT8 States     8-bit Adam    Optimizer precision (32→8 bit)           ~3 GB
CPU Offload     DeepSpeed     States → RAM                             ~3 GB
Sparse Grad     COO encoding  Near-zero gradients                      ~0.5 GB
Gradient Delta  git/rsync     ΔG instead of G                          ~1 GB
Adaptive Quant  H.264/AV1     Per-layer bit allocation                 ~0.5 GB
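To make one row of the table concrete, here is a minimal sketch of COO-style sparse gradient encoding: entries close to zero are dropped and only (index, value) pairs are kept. This illustrates the general idea, not vsqz's implementation. The threshold, the flat 1-D layout, and the function names are assumptions chosen for the example.

```python
def to_coo(grad, threshold=1e-3):
    """Keep only entries with |g| > threshold, as parallel index and value lists."""
    indices, values = [], []
    for i, g in enumerate(grad):
        if abs(g) > threshold:
            indices.append(i)
            values.append(g)
    return indices, values, len(grad)


def from_coo(indices, values, size):
    """Rebuild a dense gradient; dropped entries come back as exact zeros."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


grad = [0.0, 0.5, 1e-5, -0.2, 0.0, 3e-4]
indices, values, size = to_coo(grad)
restored = from_coo(indices, values, size)

print(indices, values)  # [1, 3] [0.5, -0.2]
print(restored)         # [0.0, 0.5, 0.0, -0.2, 0.0, 0.0]
```

The encoding is lossy (the 1e-5 and 3e-4 entries are discarded), and it only saves memory when most entries fall below the threshold, since every kept entry costs an index as well as a value.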

During training, all eight techniques run simultaneously. During inference, the KV cache is compressed with a scheme modeled on H.264 I/P/B frames.
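The video-codec analogy can be sketched as keyframe-plus-delta coding: every few positions a full vector is stored (an "I-frame"), and the positions in between store only the difference from the previous vector (a "P-frame"). The code below is a toy illustration under loud assumptions: it works on already quantized integer vectors, ignores B-frames entirely, and is not vsqz's actual codec. The group size `gop` and all function names are hypothetical.

```python
def encode(frames, gop=4):
    """Store every gop-th vector whole ("I") and the rest as deltas from the previous vector ("P")."""
    encoded = []
    prev = None
    for t, frame in enumerate(frames):
        if t % gop == 0:
            encoded.append(("I", list(frame)))
        else:
            encoded.append(("P", [c - p for c, p in zip(frame, prev)]))
        prev = frame
    return encoded


def decode(encoded):
    """Rebuild the original vectors by adding each delta to the previously decoded vector."""
    frames = []
    for kind, payload in encoded:
        if kind == "I":
            frames.append(list(payload))
        else:
            frames.append([p + d for p, d in zip(frames[-1], payload)])
    return frames


# Toy "KV vectors" for five token positions, already quantized to integers.
frames = [[10, 20, 30], [11, 20, 29], [11, 21, 29], [12, 21, 30], [40, 5, 7]]
encoded = encode(frames)
decoded = decode(encoded)

print(encoded[1])         # ('P', [1, 0, -1])
print(decoded == frames)  # True
```

Delta coding by itself does not shrink anything; the saving comes afterwards, when small deltas are stored with fewer bits than full values. Whether neighbouring KV vectors are similar enough for that to pay off is an assumption this sketch does not test.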


Quickstart

Install

pip install vsqz

Save Disk Space — same flags as gzip/zip

Works like gzip. Linux users already know the flags.

# Compress (just like gzip file)
vsqz model.safetensors               # writes model.safetensors.vsqz
vsqz -k model/ output.vsqz           # keep original after compression
vsqz -v model.gguf                   # verbose, show compression ratio
vsqz -1 model.gguf                   # fastest (fp16); compression levels run from -1 to -9
vsqz -9 model.safetensors            # best compression (int8 + sparse)

# Decompress (just like gzip -d)
vsqz -d model.vsqz                   # restore original format (safetensors/GGUF/pt)

# Info (just like gzip -l, unzip -l)
vsqz -l model.vsqz                   # metadata without loading tensors
vsqz -t model.vsqz                   # integrity test (all tensors readable)

# Recursive (just like gzip -r)
vsqz -r models/                      # compress all .safetensors/.gguf in dir tree

# Split for cloud upload (just like zip -s)
vsqz -s 8G large-20B.safetensors     # writes 20B.vsqz.001, 20B.vsqz.002 (8 GB each)

# Exclude (strip optimizer states, just like zip -x)
vsqz -x adam checkpoint.pt           # weights only, 99% smaller

Verify Compression (before deleting originals)

# Inspect the .vsqz header (reads metadata only; does not decompress or compare tensors)
python -c "
from vsqz.vsqz_format import peek_vsqz
h = peek_vsqz('model.vsqz')
print(f'Tensors: {len(h[\"tensors\"])}, Size: {sum(t[\"size\"] for t in h[\"tensors\"].values())/1e9:.1f} GB')
print(f'Techniques: {h[\"technique_stack\"]}')
"
# Then run the full integrity test before deleting the original
vsqz -t model.vsqz

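For scripted cleanups, the "verify before deleting" step can be wrapped in a small helper. This is a sketch, not part of vsqz: it assumes that `vsqz -t` exits with a non-zero status when the integrity test fails, which this page does not state, and the helper name and the injectable `run` parameter are hypothetical.

```python
import os
import subprocess


def delete_if_verified(original, archive, run=subprocess.run):
    """Delete `original` only if `vsqz -t archive` reports success.

    Assumes (not confirmed by the vsqz docs) that a failed test gives a
    non-zero exit status. `run` is injectable so the logic can be tested
    without vsqz installed.
    """
    result = run(["vsqz", "-t", archive])
    if result.returncode != 0:
        return False  # keep the original; the archive did not verify
    os.remove(original)
    return True


# Example call (requires vsqz on PATH):
# delete_if_verified("model.safetensors", "model.safetensors.vsqz")
```

Keeping the deletion behind an explicit check makes the destructive step depend on the test result rather than on a printed message.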
HuggingFace Integration (AutoModel)

import vsqz.hf_plugin  # One-line activation
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("model.vsqz")  # Just works

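Turns any .vsqz file into a HuggingFace model with no conversion step. The v0.1.0 status note above lists AutoModel.from_pretrained(".vsqz") as not yet available, so check your installed version before relying on this.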

Training (HuggingFace / Axolotl)

import torch
from vsqz import VRAMSqueeze
from transformers import AutoModelForCausalLM, Trainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One line: activate all optimizations
squeezer = VRAMSqueeze(model, optimizer=optimizer, preset="13B_24GB")

# Presets: "9B_12GB", "13B_24GB", "20B_24GB", "safe_defaults"

Inference (KV-Cache Compression)

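from vsqz import VRAMSqueeze

# model: an already loaded model; generation_loop and current_seq_len
# stand in for your own decoding loop and its running token count
squeezer = VRAMSqueeze(model, mode="inference", preset="balanced")
for step in generation_loop:
    squeezer.evict_if_needed(current_seq_len)  # Auto-evict old tokens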
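This page does not document what `evict_if_needed` keeps and what it drops. As a point of reference, the StreamingLLM paper cited in the references keeps the first few "attention sink" tokens plus a sliding window of the most recent ones. The sketch below implements that policy on plain position indices; the function name and the default of four sinks are assumptions for illustration, not vsqz behavior.

```python
def tokens_to_keep(seq_len, max_cache, num_sinks=4):
    """Cache positions to keep: the first num_sinks tokens plus the most recent ones."""
    if max_cache < num_sinks:
        raise ValueError("max_cache must be at least num_sinks")
    if seq_len <= max_cache:
        return list(range(seq_len))  # everything still fits, nothing to evict
    recent = max_cache - num_sinks
    return list(range(num_sinks)) + list(range(seq_len - recent, seq_len))


print(tokens_to_keep(5, 8))                # [0, 1, 2, 3, 4]
print(tokens_to_keep(10, 6, num_sinks=2))  # [0, 1, 6, 7, 8, 9]
```

The cache size stays bounded by `max_cache` regardless of how long generation runs, at the cost of forgetting everything between the sinks and the recent window.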

File Format: .vsqz

[0..3]   Magic:   VSQZ            (4 bytes)
[4..7]   Version: uint32          (4 bytes) 
[8..11]  Header:  JSON metadata   (model config, tensor index, technique stack)
[12..]   Tensors: FP16 weights + GaLore P/Q + INT8 states
  • Self-describing: anyone who sees .vsqz knows vsqz was used
  • Mmap-compatible for zero-copy loading
  • One file for everything: weights + optimizer + metadata
  • Open format: read it with any JSON parser + numpy
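The layout above gives byte ranges but not the field encodings, so the following reader is a sketch built on assumptions, not the format specification: it assumes little-endian integers, that bytes 8..11 hold the length of the JSON header as a uint32, and that the UTF-8 JSON follows immediately. `write_header` exists only to produce a test input; both function names are hypothetical. For real files, the package's own `peek_vsqz` is the safer choice.

```python
import io
import json
import struct

MAGIC = b"VSQZ"


def write_header(f, version, meta):
    """Write magic, version, JSON length, and JSON bytes (assumed layout)."""
    payload = json.dumps(meta).encode("utf-8")
    f.write(MAGIC)
    f.write(struct.pack("<II", version, len(payload)))
    f.write(payload)


def read_header(f):
    """Read the header back without touching any tensor data that follows."""
    if f.read(4) != MAGIC:
        raise ValueError("not a .vsqz file")
    version, length = struct.unpack("<II", f.read(8))
    meta = json.loads(f.read(length).decode("utf-8"))
    return version, meta


# Round trip through an in-memory buffer with made-up metadata.
buf = io.BytesIO()
write_header(buf, 1, {"tensors": {"w": {"size": 16}}, "technique_stack": ["fp16"]})
buf.write(b"\x00" * 16)  # stand-in for tensor bytes
buf.seek(0)

version, meta = read_header(buf)
print(version, meta["technique_stack"])  # 1 ['fp16']
```

A length-prefixed JSON header is what lets a tool list tensors without reading the tensor region, which is the behavior the `-l` flag describes.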

Requirements

  • Python ≥ 3.10
  • PyTorch ≥ 2.0
  • Optional: optuna (Bayesian HPO), safetensors (converter)

Integrity & Security

Every .vsqz file carries its own SHA-256 fingerprint and a recovery record at the end of the file. If the main header gets corrupted, the file self-repairs from the recovery record.

vsqz -t model.vsqz       # SHA-256 verified integrity check
vsqz -l model.vsqz       # Shows SHA-256 fingerprint
# If header is corrupted: auto-restores from recovery record

To the author's knowledge, no other ML format offers self-repair. GGUF and safetensors do not embed checksums in the file itself.
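A SHA-256 fingerprint can also be checked independently with the standard library. The sketch below hashes a whole file in chunks so large models do not need to fit in memory. Which byte range vsqz itself hashes (the whole file or only the tensor region) is not stated here, so this is suitable for comparing two copies of a file or checking a published digest, not for reproducing the embedded fingerprint.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in chunk_size pieces."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example: compare a local file against a digest you obtained elsewhere.
# assert sha256_file("model.vsqz") == expected_hex_digest
```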


Ecosystem Integration

llama.cpp PR in progress. Once merged, every llama.cpp-based client (Ollama, LM Studio, text-generation-webui) will load .vsqz files natively — no conversion, no Python bridge. See contrib/llama.cpp_vsqz.patch.


Why vsqz?

Feature            GGUF   safetensors  vsqz
Training
Inference
Optimizer State                        15 MB
Context Expansion
File Size (9B)     18 GB  18 GB        8 GB
Universal

One file. Training and inference. 86% smaller than keeping all three.


Academic References

  • Zhao et al., "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection", ICML 2024
  • Pan et al., "LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning", 2024
  • Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs", NeurIPS 2023
  • Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (StreamingLLM), 2023

Author: Christian Butterweck — github.com/butterwecksolutions
License: MIT
