Memory-efficient training for 24GB GPUs — bundle of optimizer-space compression techniques enabling 13B+ models on consumer hardware

vsqz — Memory-Efficient Training & Inference for Consumer GPUs

One file. Half the VRAM. Double the model.

pip install vsqz — the gzip for AI models. Train 13B on a 12GB card. Run 20B on 24GB. Double your context window. Works with any HuggingFace model, any training framework.

# Compress any model: 18GB → 8GB
python -m vsqz convert model/ output.vsqz

# Info: peek without loading
python -m vsqz info model.vsqz

# Training: wrap your optimizer, save VRAM  
from vsqz import VRAMSqueeze
squeezer = VRAMSqueeze(model, optimizer=opt, preset="13B_24GB")

What GPUs Can Do With vsqz

Training (QLoRA + GaLore + FP16 States)

GPU        VRAM    4B       9B       13B      20B
RTX 3060   12 GB   ✅ b=4   ✅ b=2   ✅ b=1   -
RTX 4070   12 GB   ✅ b=4   ✅ b=3   ✅ b=1   -
RTX 4080   16 GB   ✅ b=4   ✅ b=4   ✅ b=2   ⚠️ b=1
RTX 3090   24 GB   ✅ b=4   ✅ b=4   ✅ b=3   ✅ b=1
RTX 4090   24 GB   ✅ b=4   ✅ b=4   ✅ b=4   ✅ b=2

Without vsqz: 9B max, no 13B or 20B on any consumer GPU.

Inference (Context Window Doubling via KV-Cache Compression)

GPU VRAM   4B        9B       13B      20B
8 GB       16k ✅    8k ✅    -        -
12 GB      32k ✅    16k ✅   8k ✅    -
16 GB      64k ✅    32k ✅   16k ✅   8k ✅
24 GB      128k ✅   64k ✅   32k ✅   16k ✅

Without vsqz: context halved on every tier.
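
To see why shrinking the KV cache translates directly into longer context, here is a minimal sizing sketch. It is not part of vsqz; the model shape and the 2 GiB cache budget are made-up values for illustration, and 1 byte per element stands in for whatever compression ratio is actually achieved.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def max_context(budget_bytes, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Longest sequence whose KV cache fits into budget_bytes."""
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bytes_per_elem)
    return budget_bytes // per_token


# Hypothetical model shape and cache budget, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
budget = 2 * 1024**3  # assume 2 GiB of VRAM is left for the cache

fp16_ctx = max_context(budget, bytes_per_elem=2, **cfg)
int8_ctx = max_context(budget, bytes_per_elem=1, **cfg)
print(f"FP16 cache: {fp16_ctx} tokens, 1-byte cache: {int8_ctx} tokens")
```

Because the cache grows linearly with sequence length, halving the bytes per cached element doubles the number of tokens that fit in the same budget.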


VRAM Savings

Format                     Original   vsqz    Savings
safetensors (9B)           18 GB      8 GB    55%
GGUF F16 (9B)              18 GB      8 GB    55%
PyTorch Checkpoint         20 GB      15 MB   99.9%
ALL THREE → single .vsqz   56 GB      8 GB    86%
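
The Savings column is plain arithmetic on the two size columns. A small sketch, using decimal units (1 GB = 10^9 bytes, an assumption) and a helper name invented for this example:

```python
def savings_percent(original_bytes, compressed_bytes):
    """Share of the original size that compression removes, in percent."""
    return 100.0 * (1.0 - compressed_bytes / original_bytes)


GB = 10**9
MB = 10**6

rows = {
    "18 GB -> 8 GB": (18 * GB, 8 * GB),
    "20 GB -> 15 MB": (20 * GB, 15 * MB),
    "56 GB -> 8 GB": (56 * GB, 8 * GB),
}
for label, (before, after) in rows.items():
    print(f"{label}: {savings_percent(before, after):.1f}% smaller")
```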

How It Works — The Stack

vsqz combines 8 orthogonal memory-saving techniques. Each targets a different VRAM region:

Technique        Origin         What It Saves                             VRAM Freed
GaLore           ICML 2024      Optimizer states (SVD projection r=128)   ~2 GB
LISA             2024           Activations (50% layer sampling)          ~4 GB
FP16 States      Native         Optimizer precision (32→16 bit)           ~1.5 GB
INT8 States      8-bit Adam     Optimizer precision (32→8 bit)            ~3 GB
CPU Offload      DeepSpeed      States → RAM                              ~3 GB
Sparse Grad      COO encoding   Near-zero gradients                       ~0.5 GB
Gradient Delta   git/rsync      ΔG instead of G                           ~1 GB
Adaptive Quant   H.264/AV1      Per-layer bit allocation                  ~0.5 GB
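
As an illustration of the "INT8 States" row, the sketch below quantizes a list of optimizer-state values block by block with a linear absmax scale. This is a simplified stand-in, not vsqz's code and not the dynamic quantization used by 8-bit Adam; the function names and the tiny block size are assumptions made for the example.

```python
def quantize_int8(values, block_size=4):
    """Block-wise absmax quantization: floats -> (int8 codes, per-block scales)."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # An all-zero block gets scale 1.0 so the division below stays defined.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales


def dequantize_int8(codes, scales, block_size=4):
    """Map codes back to floats using the scale of the block each code sits in."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# Two blocks with very different magnitudes; each block keeps its own scale.
state = [0.5, -0.25, 0.125, 0.0, 0.001, 0.002, -0.004, 0.003]
codes, scales = quantize_int8(state)
restored = dequantize_int8(codes, scales)
worst = max(abs(a - b) for a, b in zip(state, restored))
print(f"codes={codes}, worst absolute error={worst:.2e}")
```

Each value shrinks from 4 bytes to 1 byte plus a shared scale per block, at the cost of a rounding error of at most half a quantization step.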

Training: all eight techniques are active at the same time. Inference: the KV cache is compressed with a scheme modeled on H.264 I/P/B frames.
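
Two of the simpler entries in the stack, sparse COO encoding and gradient deltas, can be sketched in a few lines of plain Python. This shows the general idea only; the helper names and the threshold are assumptions, not vsqz's API.

```python
def to_coo(grad, threshold=1e-6):
    """Keep only entries with |g| > threshold as (index, value) pairs."""
    return [(i, g) for i, g in enumerate(grad) if abs(g) > threshold]


def from_coo(pairs, length):
    """Rebuild a dense list; dropped entries come back as 0.0."""
    dense = [0.0] * length
    for i, g in pairs:
        dense[i] = g
    return dense


def delta_encode(prev_grad, grad, threshold=1e-6):
    """Store the change since the previous step, itself in COO form."""
    diff = [g - p for g, p in zip(grad, prev_grad)]
    return to_coo(diff, threshold)


def delta_decode(prev_grad, pairs):
    """Apply stored changes on top of the previous gradient."""
    out = list(prev_grad)
    for i, d in pairs:
        out[i] += d
    return out


prev = [0.0, 0.5, 0.0, -0.25]
cur = [0.0, 0.5, 0.125, -0.25]
sparse = to_coo(cur)             # three of four entries survive
delta = delta_encode(prev, cur)  # only one entry changed
print(sparse, delta)
```

Both encodings are lossy for values below the threshold, and they only pay off when most entries are near zero or unchanged between steps.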


Quickstart

Install

pip install vsqz

Save Disk Space — Compress Any Model (like gzip)

Convert HuggingFace models, GGUF files, or PyTorch checkpoints to the .vsqz format.
Conversion strips AdamW optimizer states and casts FP32 tensors to FP16.

# HuggingFace safetensors directory → .vsqz (saves 10 GB on 9B model)
python -m vsqz convert unsloth/Qwen2.5-7B-Instruct/ qwen-7b.vsqz
# Output: Stored 18 GB → 8 GB (55% smaller, 10 GB freed on disk)

# GGUF model → .vsqz (keep the compact version, delete the raw)
python -m vsqz convert llama-3-8b-F16.gguf llama-3-8b.vsqz
rm llama-3-8b-F16.gguf  # Safe to delete — .vsqz has everything

# PyTorch training checkpoint → .vsqz (99% smaller — strips AdamW bloat)
python -m vsqz convert pytorch_model.bin tiny.vsqz
# Output: 20 GB → 15 MB (optimizer states stripped, weights compressed)

# Peek metadata — no GPU, no loading, instant
python -m vsqz info model.vsqz
# Output: 760 tensors, 9B params, Qwen3_5 architecture, compressed from GGUF

# Batch compress all models in a directory
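find . \( -name "*.safetensors" -o -name "*.gguf" \) -print | while IFS= read -r f; do
  # ${f%.*} strips only the final extension; ${f%%.*} would cut at the first dot,
  # which for paths like ./model.gguf leaves an empty name
  python -m vsqz convert "$f" "${f%.*}.vsqz" && rm "$f"
done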
# Your model collection: 50%+ disk space freed

Verify Compression (before deleting originals)

# Inspect the .vsqz header: tensor count, total size, technique stack.
# Note: this reads metadata only. It does not decompress tensors or compare
# them with the original, so keep the original until you have checked that yourself.
python -c "
from vsqz.sqz_format import peek_vsqz
h = peek_vsqz('model.vsqz')
print(f'Tensors: {len(h[\"tensors\"])}, Size: {sum(t[\"size\"] for t in h[\"tensors\"].values())/1e9:.1f} GB')
print(f'Techniques: {h[\"technique_stack\"]}')
"

Training (HuggingFace / Axolotl)

import torch
from vsqz import VRAMSqueeze
from transformers import AutoModelForCausalLM, Trainer

model = AutoModelForCausalLM.from_pretrained("Qwen2.5-7B")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One line: activate all optimizations
squeezer = VRAMSqueeze(model, optimizer=optimizer, preset="13B_24GB")

# Presets: "9B_12GB", "13B_24GB", "20B_24GB", "safe_defaults"

Inference (KV-Cache Compression)

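from vsqz import VRAMSqueeze

# `model` is an already loaded model; `generation_loop` and `current_seq_len`
# stand for your own decoding loop and its current sequence length.
squeezer = VRAMSqueeze(model, mode="inference", preset="balanced")
for step in generation_loop:
    squeezer.evict_if_needed(current_seq_len)  # Auto-evict old tokens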
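
What "evicting old tokens" can mean is easiest to see with the attention-sink policy from the StreamingLLM paper listed in the references: keep the first few tokens plus a sliding window of recent ones. The sketch below assumes that policy. Whether evict_if_needed works this way is not stated here, and the function name and defaults are invented for the example.

```python
def positions_to_keep(seq_len, n_sink=4, window=1024):
    """Token positions retained in the KV cache.

    Keeps the first n_sink positions (attention sinks) and the most
    recent `window` positions; everything in between is evicted.
    """
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # cache not full yet, keep everything
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))


print(positions_to_keep(10, n_sink=2, window=3))
```

With this policy the cache size is bounded by n_sink + window regardless of how long generation runs.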

File Format: .vsqz

[0..3]   Magic:   VSQZ            (4 bytes)
[4..7]   Version: uint32          (4 bytes) 
[8..11]  Header:  JSON metadata   (model config, tensor index, technique stack)
[12..]   Tensors: FP16 weights + GaLore P/Q + INT8 states
  • Self-describing: anyone who sees .vsqz knows vsqz was used
  • Mmap-compatible for zero-copy loading
  • One file for everything: weights + optimizer + metadata
  • Open format: read it with any JSON parser + numpy
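
A toy reader and writer for a layout of this kind, using only the standard library. Two points are assumptions and not confirmed by the layout above: integers are little-endian, and bytes 8..11 hold the length of the JSON header, with the JSON itself and then the tensor bytes following. Treat it as a sketch of the idea, not as a parser for real .vsqz files.

```python
import io
import json
import struct

MAGIC = b"VSQZ"
# magic (4 bytes), version (uint32), header length (uint32).
# Little-endian byte order and the length field are assumptions.
PREFIX = struct.Struct("<4sII")


def write_toy(fobj, header, payload, version=1):
    """Write prefix, JSON header, then raw tensor bytes."""
    header_bytes = json.dumps(header).encode("utf-8")
    fobj.write(PREFIX.pack(MAGIC, version, len(header_bytes)))
    fobj.write(header_bytes)
    fobj.write(payload)


def read_header(fobj):
    """Read only the prefix and JSON header; tensor bytes stay untouched."""
    magic, version, header_len = PREFIX.unpack(fobj.read(PREFIX.size))
    if magic != MAGIC:
        raise ValueError("not a VSQZ-style file")
    header = json.loads(fobj.read(header_len).decode("utf-8"))
    return version, header


buf = io.BytesIO()
toy_header = {"tensors": {"w": {"dtype": "float16", "shape": [2, 2], "size": 8}}}
write_toy(buf, toy_header, payload=b"\x00" * 8)
buf.seek(0)
version, header = read_header(buf)
print(version, list(header["tensors"]))
```

Reading stops after the header, which is what makes a metadata peek cheap: the tensor payload is never touched.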

Requirements

  • Python ≥ 3.10
  • PyTorch ≥ 2.0
  • Optional: optuna (Bayesian HPO), safetensors (converter)

Why vsqz?

Feature             GGUF     safetensors   vsqz
Training
Inference
Optimizer State                            15 MB
Context Expansion
File Size (9B)      18 GB    18 GB         8 GB
Universal
One file. Training and inference. 86% smaller than keeping all three.


Academic References

  • Zhao et al., "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection", ICML 2024
  • Pan et al., "LISA: Layer-wise Importance Sampling for Memory-Efficient LLM Fine-Tuning", 2024
  • Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs", NeurIPS 2023
  • Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (StreamingLLM), 2023

Author: Christian Butterweck — github.com/butterwecksolutions
License: MIT

Download files

Source Distribution

vsqz-0.1.0.tar.gz (4.6 kB)

Uploaded Source

Built Distribution

vsqz-0.1.0-py3-none-any.whl (4.2 kB)

Uploaded Python 3

File details

Details for the file vsqz-0.1.0.tar.gz.

File metadata

  • Download URL: vsqz-0.1.0.tar.gz
  • Upload date:
  • Size: 4.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for vsqz-0.1.0.tar.gz
Algorithm Hash digest
SHA256 78a770eabf01bcf359185b66a95bbb5911892098ebcd0349932a7888b7639c06
MD5 645a57fafb12f95b8e461d56a9d08659
BLAKE2b-256 f06b0dc8c4e57d6d5b7e6df4599ec1ed5e82f969ad274f966408e3ce5ec95ef0
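
To check a downloaded archive against the SHA256 digest listed above, something like the following works with the standard library alone. The helper names are invented for this sketch; only the digest is taken from the table.

```python
import hashlib

# SHA256 digest published for vsqz-0.1.0.tar.gz in the table above.
PUBLISHED_SHA256 = "78a770eabf01bcf359185b66a95bbb5911892098ebcd0349932a7888b7639c06"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected=PUBLISHED_SHA256):
    """True if the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

Call matches_published("vsqz-0.1.0.tar.gz") after downloading; a False result means the file differs from the one the digest was computed for.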

File details

Details for the file vsqz-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: vsqz-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 4.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for vsqz-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 61e6bc0bb8fa8c12f54820a0e1b9d4c1c9b1917286811fc13295a1a8a21b076d
MD5 9fee64af53d9f66e1bea80862412ed80
BLAKE2b-256 da45bed3417fb7bcd7fed0556ea81ae787216fb4a3c5f7432a25f478ddba93b0
