
vsqz — Memory-Efficient Training & Inference for Consumer GPUs

One file. Half the VRAM. Double the model.


pip install vsqz — the gzip for AI models. Train a 13B model on a 12 GB card. Run 20B on 24 GB. Double your context window. Save 55% of disk and hosting space. Works with any HuggingFace model and any training framework.

v0.1.0 — experimental release. All 8 techniques are production-tested in a 9B QLoRA training pipeline (RTX 3090, 24GB). Tests pass. Disk compression works. But: no CI/CD yet, no AutoModel.from_pretrained(".vsqz") yet, no published benchmarks. Test on your setup before relying on it. PRs welcome.

# Compress any model: 18GB → 8GB
python -m vsqz convert model/ output.vsqz

# Info: peek without loading
python -m vsqz info model.vsqz

# Training: wrap your optimizer, save VRAM  
from vsqz import VRAMSqueeze
squeezer = VRAMSqueeze(model, optimizer=opt, preset="13B_24GB")

What GPUs Can Do With vsqz

Training (QLoRA + GaLore + FP16 States)

GPU        VRAM    4B       9B       13B      20B
RTX 3060   12 GB   ✅ b=4   ✅ b=2   ✅ b=1   ❌
RTX 4070   12 GB   ✅ b=4   ✅ b=3   ✅ b=1   ❌
RTX 4080   16 GB   ✅ b=4   ✅ b=4   ✅ b=2   ⚠️ b=1
RTX 3090   24 GB   ✅ b=4   ✅ b=4   ✅ b=3   ✅ b=1
RTX 4090   24 GB   ✅ b=4   ✅ b=4   ✅ b=4   ✅ b=2

b = batch size; ❌ = does not fit.

Without vsqz: 9B max, no 13B or 20B on any consumer GPU.

Inference (Context Window Doubling via KV-Cache Compression)

GPU     4B        9B        13B       20B
8 GB    16k ✅    8k ✅     ❌        ❌
12 GB   32k ✅    16k ✅    8k ✅     ❌
16 GB   64k ✅    32k ✅    16k ✅    8k ✅
24 GB   128k ✅   64k ✅    32k ✅    16k ✅

Without vsqz: context halved on every tier.


Disk Savings

Format                  Original   vsqz    Savings
safetensors (9B)        18 GB      8 GB    55%
GGUF F16 (9B)           18 GB      8 GB    55%
PyTorch checkpoint      20 GB      15 MB   99.3%
All three → one .vsqz   56 GB      8 GB    86%

How It Works — The Stack

vsqz combines 8 orthogonal memory-saving techniques. Each targets a different VRAM region:

Technique        Origin         What It Saves                               VRAM Freed
GaLore           ICML 2024      Optimizer states (SVD projection, r=128)    ~2 GB
LISA             2024           Activations (50% layer sampling)            ~4 GB
FP16 States      native         Optimizer precision (32 → 16 bit)           ~1.5 GB
INT8 States      8-bit Adam     Optimizer precision (32 → 8 bit)            ~3 GB
CPU Offload      DeepSpeed      Optimizer states → RAM                      ~3 GB
Sparse Grad      COO encoding   Near-zero gradients                         ~0.5 GB
Gradient Delta   git/rsync      ΔG instead of G                             ~1 GB
Adaptive Quant   H.264/AV1      Per-layer bit allocation                    ~0.5 GB

During training, all eight techniques are active simultaneously. During inference, the KV cache is compressed with an H.264-style I/P/B-frame scheme.
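
The largest single training-time saving comes from GaLore's low-rank projection. Below is a minimal sketch of the idea from Zhao et al. (ICML 2024), not vsqz's internal implementation; the function names are illustrative. Gradients are projected into a rank-128 subspace before Adam sees them, so both moment tensors shrink from (m, n) to (r, n).

import torch

def galore_project(grad: torch.Tensor, rank: int = 128):
    """Compute a rank-r basis P by SVD and project the gradient into it."""
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]          # (m, r) orthonormal basis, refreshed periodically
    return P, P.T @ grad     # (r, n) low-rank gradient: Adam states shrink by ~m/r

def galore_project_back(P: torch.Tensor, update: torch.Tensor) -> torch.Tensor:
    """Lift the optimizer's low-rank update back to the full (m, n) shape."""
    return P @ update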


Quickstart

Install

pip install vsqz

Save Disk Space — same flags as gzip/zip

Works like gzip. Linux users already know the flags.

# Compress (just like gzip file.gz)
vsqz model.safetensors               # → model.safetensors.vsqz
vsqz -k model/ output.vsqz           # keep original after compression
vsqz -v model.gguf                   # verbose, show compression ratio
vsqz -1 model.gguf                   # fast (fp16); levels -1..-9
vsqz -9 model.safetensors            # best compression (int8 + sparse)

# Decompress (just like gzip -d)
vsqz -d model.vsqz                   # restore original format (safetensors/GGUF/pt)

# Info (just like gzip -l, zip -l)
vsqz -l model.vsqz                   # metadata without loading tensors
vsqz -t model.vsqz                   # integrity test (all tensors readable)

# Recursive (just like gzip -r)
vsqz -r models/                      # compress all .safetensors/.gguf in dir tree

# Split for cloud upload (just like zip -s)
vsqz -s 8G large-20B.safetensors     # → 20B.vsqz.001, 20B.vsqz.002 (8 GB each)

# Exclude (strip optimizer states, just like zip -x)
vsqz -x adam checkpoint.pt           # weights only, 99% smaller

Verify Compression (before deleting originals)

# Peek at .vsqz metadata (no tensors loaded); run `vsqz -t` for the integrity test
python -c "
from vsqz.sqz_format import peek_vsqz
h = peek_vsqz('model.vsqz')
print(f'Tensors: {len(h[\"tensors\"])}, Size: {sum(t[\"size\"] for t in h[\"tensors\"].values())/1e9:.1f} GB')
print(f'Techniques: {h[\"technique_stack\"]}')
"

HuggingFace Integration (AutoModel)

import vsqz.hf_plugin  # One-line activation
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("model.vsqz")  # Just works

Turn any .vsqz file into a HuggingFace model with no conversion. Note that per the v0.1.0 release note above, this AutoModel path is still in progress; verify it on your setup before relying on it.

Training (HuggingFace / Axolotl)

import torch
from vsqz import VRAMSqueeze
from transformers import AutoModelForCausalLM, Trainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One line: activate all optimizations
squeezer = VRAMSqueeze(model, optimizer=optimizer, preset="13B_24GB")

# Presets: "9B_12GB", "13B_24GB", "20B_24GB", "safe_defaults"
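
From here training proceeds as usual. A short continuation of the sketch above, handing the same optimizer to the HF Trainer via its standard optimizers tuple; training_args and train_dataset are placeholders for your own objects.

trainer = Trainer(
    model=model,
    args=training_args,            # your TrainingArguments
    train_dataset=train_dataset,   # your tokenized dataset
    optimizers=(optimizer, None),  # reuse the wrapped optimizer; no custom scheduler
)
trainer.train()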

Inference (KV-Cache Compression)

from vsqz import VRAMSqueeze

squeezer = VRAMSqueeze(model, mode="inference", preset="balanced")
for step in generation_loop:                   # placeholder for your decoding loop
    squeezer.evict_if_needed(current_seq_len)  # auto-evict old tokens
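
To make the I/P-frame analogy concrete, here is a hedged sketch of the general idea, not vsqz's actual codec: each block of the cache keeps one full-precision anchor row (the I-frame) and stores the remaining rows as int8 residuals against it (the P-frames). BLOCK is a hypothetical parameter.

import torch

BLOCK = 64  # tokens per frame (hypothetical)

def compress_kv(kv: torch.Tensor):
    """kv: (seq_len, head_dim) fp16 slice of the cache."""
    anchors, residuals, scales = [], [], []
    for frame in kv.split(BLOCK):
        anchor = frame[:1]                               # I-frame: one fp16 row
        diff = frame - anchor                            # P-frames: residuals
        scale = diff.abs().amax().clamp(min=1e-6) / 127
        anchors.append(anchor)
        residuals.append((diff / scale).round().to(torch.int8))
        scales.append(scale)
    return anchors, residuals, scales                    # ~2x smaller than fp16

def decompress_kv(anchors, residuals, scales) -> torch.Tensor:
    frames = [a + r.to(a.dtype) * s for a, r, s in zip(anchors, residuals, scales)]
    return torch.cat(frames)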

File Format: .vsqz

[0..3]    Magic:        VSQZ             (4 bytes)
[4..7]    Version:      uint32           (4 bytes)
[8..11]   Header size:  uint32           (4 bytes)
[12..]    Header:       JSON metadata    (model config, tensor index, technique stack)
[...]     Tensors:      FP16 weights + GaLore P/Q + INT8 states
  • Self-describing: anyone who sees .vsqz knows vsqz was used
  • Mmap-compatible for zero-copy loading
  • One file for everything: weights + optimizer + metadata
  • Open format: read it with any JSON parser + numpy
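
Because the header is plain JSON, it can be peeked by hand in a few lines. A sketch under the layout above; the little-endian uint32 fields are an assumption, and vsqz.sqz_format.peek_vsqz is the supported path.

import json
import struct

def read_vsqz_header(path: str) -> dict:
    with open(path, "rb") as f:
        assert f.read(4) == b"VSQZ", "not a .vsqz file"
        version, header_size = struct.unpack("<II", f.read(8))  # assumed little-endian
        header = json.loads(f.read(header_size))
    return {"version": version, **header}

print(read_vsqz_header("model.vsqz")["technique_stack"])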

Requirements

  • Python ≥ 3.10
  • PyTorch ≥ 2.0
  • Optional: optuna (Bayesian HPO), safetensors (converter)

Ecosystem Integration

llama.cpp PR in progress. Once merged, every llama.cpp-based client (Ollama, LM Studio, text-generation-webui) will load .vsqz files natively — no conversion, no Python bridge. See contrib/llama.cpp_vsqz.patch.


Why vsqz?

                    GGUF    safetensors   vsqz
Training            ❌      ✅            ✅
Inference           ✅      ✅            ✅
Optimizer State     ❌      ❌            ✅ (15 MB)
Context Expansion   ❌      ❌            ✅
File Size (9B)      18 GB   18 GB         8 GB
Universal           ❌      ❌            ✅

One file. Training and inference. 86% smaller than keeping all three.


Academic References

  • Zhao et al., "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection", ICML 2024
  • Pan et al., "LISA: Layer-wise Importance Sampling for Memory-Efficient LLM Fine-Tuning", 2024
  • Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs", NeurIPS 2023
  • Xiao et al., "StreamingLLM: Efficient Streaming Language Models with Attention Sinks", 2023

Author: Christian Butterweck — github.com/butterwecksolutions
License: MIT
