
ZSE - Z Server Engine


Ultra memory-efficient LLM inference engine with native INT4 CUDA kernels.

Run 32B models on 24GB GPUs. Run 7B models on 8GB GPUs. Fast cold starts, single-file deployment.

🆕 v1.4.1: Model Hub + Pull Commands

Train 7B models on 8GB GPUs. Train 70B models on 48GB GPUs.

from zse.format import load_zse_model
from zse.training import LoRAConfig, add_lora_to_model, save_lora_adapter

# Load INT4 model (uses ~6GB for 7B)
model, tokenizer, info = load_zse_model("model.zse", device="cuda")

# Add LoRA adapters (~1% trainable params)
config = LoRAConfig(rank=16, alpha=32)
model = add_lora_to_model(model, config)

# Train as usual with PyTorch/HuggingFace
# ... your training loop ...

# Save adapter (tiny file, ~25MB for 7B)
save_lora_adapter(model, "my_adapter.safetensors")

Install training dependencies: pip install zllm-zse[training]

🚀 Benchmarks (Verified, March 2026)

ZSE Custom Kernel (Default)

| Model | File Size | VRAM | Speed | Cold Start | GPU |
|---|---|---|---|---|---|
| Qwen 7B | 5.57 GB | 5.67 GB | 37.2 tok/s | 5.7s | H200 |
| Qwen 14B | 9.95 GB | 10.08 GB | 20.8 tok/s | 10.5s | H200 |
| Qwen 32B | 19.23 GB | 19.47 GB | 10.9 tok/s | 20.4s | H200 |
| Qwen 72B | 41.21 GB | 41.54 GB | 6.3 tok/s | 51.8s | H200 |

ZSE bnb Backend (Alternate)

| Model | VRAM | Speed | Cold Start |
|---|---|---|---|
| Qwen 7B | 6.57 GB | 45.6 tok/s | 6.0s |
| Qwen 14B | 11.39 GB | 27.6 tok/s | 7.1s |
| Qwen 32B | 22.27 GB | 20.4 tok/s | 20.8s |
| Qwen 72B | 47.05 GB | 16.4 tok/s | 53.0s |

VRAM Comparison

| Model | Custom Kernel | bnb Backend | Savings |
|---|---|---|---|
| 7B | 5.67 GB | 6.57 GB | -0.90 GB (14%) |
| 14B | 10.08 GB | 11.39 GB | -1.31 GB (12%) |
| 32B | 19.47 GB | 22.27 GB | -2.80 GB (13%) |
| 72B | 41.54 GB | 47.05 GB | -5.51 GB (12%) |

GPU Compatibility

| GPU | VRAM | Max Model |
|---|---|---|
| RTX 3070/4070 | 8GB | 7B |
| RTX 3080 | 12GB | 14B |
| RTX 3090/4090 | 24GB | 32B |
| A100-40GB | 40GB | 32B |
| A100-80GB / H200 | 80-141GB | 72B |
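As a quick sanity check against the tables above, the sketch below picks the largest benchmarked model that fits a given GPU. The VRAM figures come from the custom-kernel benchmark table; the 1 GB headroom is an assumption for CUDA context and KV cache, not a ZSE requirement.

```python
# Benchmarked VRAM per model (GB) from the custom-kernel table above.
VRAM_NEEDED = {"7B": 5.67, "14B": 10.08, "32B": 19.47, "72B": 41.54}

def largest_model_that_fits(total_gb: float, headroom_gb: float = 1.0):
    """Return the largest model whose benchmarked VRAM (plus headroom) fits.

    On a live system, total_gb can be read with
    torch.cuda.get_device_properties(0).total_memory / 1e9.
    """
    fitting = [m for m, need in VRAM_NEEDED.items() if need + headroom_gb <= total_gb]
    return fitting[-1] if fitting else None
```

For example, a 24 GB RTX 4090 lands on 32B, matching the compatibility table.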

Key Features

  • 📦 Single .zse File: Model + tokenizer + config in one file
  • 🚫 No Network Calls: Everything embedded, works offline
  • ⚡ ZSE Custom Kernel: Native INT4 inference with maximum VRAM efficiency
  • 🧠 Memory Efficient: 72B in 41GB, 32B in 19GB, 7B in 5.7GB VRAM
  • 🏃 Fast Cold Start: 5.7s for 7B, 20s for 32B, 52s for 72B
  • 🎯 Dual Backend: Custom kernel (default) or bnb backend (alternate)
  • 🔥 QLoRA Training: Fine-tune INT4 models with LoRA adapters (NEW in v1.4.0)
  • 📄 Built-in RAG with .zpf: 25% fewer LLM tokens at 100% accuracy vs plain chunking

Installation

pip install zllm-zse

Requirements:

  • Python 3.11+
  • CUDA GPU (8GB+ VRAM recommended)
  • bitsandbytes (auto-installed)

Quick Start

1. Pull a Pre-Converted Model (Fastest)

# Download ready-to-use .zse model (no GPU needed for conversion)
zse pull qwen-7b          # 5.18 GB download
zse pull mistral-7b       # 3.86 GB download
zse pull qwen-0.5b        # 0.69 GB download

# Browse available models
zse list
zse list --cached

2. Or Convert Model to .zse Format (One-Time)

# Convert any HuggingFace model
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen7b.zse
zse convert Qwen/Qwen2.5-32B-Instruct -o qwen32b.zse

# Or in Python
from zse.format.writer import convert_model
convert_model("Qwen/Qwen2.5-7B-Instruct", "qwen7b.zse", quantization="int4")

3. Load and Run

from zse.format.reader_v2 import load_zse_model

# Load model (auto-detects optimal settings)
model, tokenizer, info = load_zse_model("qwen7b.zse")

# Generate
inputs = tokenizer("Write a poem about AI:", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

4. Start Server (OpenAI-Compatible)

zse serve qwen7b.zse --port 8000

Then query it from Python:

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="zse")
response = client.chat.completions.create(
    model="qwen7b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

.zse Format Benefits

| Feature | HuggingFace | .zse Format |
|---|---|---|
| Cold start (7B) | 45s | 9s |
| Cold start (32B) | 120s | 24s |
| Network calls on load | Yes | No |
| Files to manage | Many | One |
| Quantization time | Runtime | Pre-done |

.zpf: Built-in RAG with Token Cost Reduction

.zpf delivers 100% retrieval accuracy at 25% lower LLM token cost compared to plain chunking, tested on real noisy web content.

.zpf (Z Packed Format) is ZSE's built-in semantic document format for RAG. It compresses documents at write-time, stripping noise (cookie banners, nav, boilerplate, filler prose) while preserving what the LLM needs.

Cost Benchmark (CNN noisy web article)

| Metric | .zpf | Plain Chunking |
|---|---|---|
| Correct answers | 10/10 | 10/10 |
| Tokens sent to LLM (avg/query) | 943 | 1,257 |
| Token reduction | 25% | – |
| Cost per 1M queries (GPT-4o) | $2,357 | $3,143 |
| Annual savings at 1M queries | $786 | – |
| Break-even | 1 query per document | – |
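The cost figures above can be reproduced with a few lines of arithmetic. The $2.50 per 1M input tokens GPT-4o price is an assumption based on published pricing at the time; check current rates.

```python
# Reproduce the cost table from the benchmark's raw token counts.
zpf_tokens, plain_tokens = 943, 1257   # avg input tokens per query
price_per_m_tokens = 2.50              # assumed GPT-4o input price (USD / 1M tokens)
queries = 1_000_000

reduction = 1 - zpf_tokens / plain_tokens              # ~0.25
zpf_cost = zpf_tokens * queries / 1e6 * price_per_m_tokens      # ~$2,357
plain_cost = plain_tokens * queries / 1e6 * price_per_m_tokens  # ~$3,143

print(f"{reduction:.0%} fewer tokens, saving ${plain_cost - zpf_cost:.0f} per 1M queries")
```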

Quick Start

# Ingest a document into RAG store
zse rag add paper.pdf --title "ML Paper"

# Semantic search
zse rag search "What is batch normalization?" -k 5

# Get token-budgeted LLM context
zse rag context "What is batch normalization?" --max-tokens 500

# List all documents
zse rag list

# Export to open formats (zero vendor lock-in)
zse rag export paper.zpf -f markdown -o paper.md

# Inspect .zpf file metadata
zse rag inspect paper.zpf

# Show store stats
zse rag stats

# Re-embed with a different model
zse rag reindex --model all-MiniLM-L6-v2

# Remove a document
zse rag remove <doc_id>

Or in Python:

from zse.core.zrag.pipeline import RAGPipeline

pipeline = RAGPipeline(store_dir="./my_store")
pipeline.ingest("noisy_article.md")

# Get LLM-ready context (token-budgeted)
context = pipeline.get_context("What is transfer learning?", max_tokens=500)
# ~25% fewer tokens than plain chunking, same answer quality

How It Works

  1. Semantic chunking: splits by content type (11 block types: CODE, TABLE, DEFINITION, PROCEDURE, etc.)
  2. 10-layer compression: strips filler phrases, verbose patterns, redundant qualifiers, noise lines
  3. Contextual embedding: each block embeds with parent section hierarchy for cross-section retrieval
  4. Hybrid retrieval: 0.6× embedding similarity + 0.4× BM25, with size normalization, block-type boosting, and identifier-aware scoring
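The hybrid score in step 4 combines the two signals linearly. The 0.6/0.4 weights come from the description above, but the function below is an illustrative sketch: its name and the min-max-style BM25 normalization are assumptions, not ZSE's actual internals.

```python
def hybrid_score(embedding_sim: float, bm25_score: float, max_bm25: float) -> float:
    """Blend cosine similarity (already 0-1) with a 0-1 normalized BM25 score.

    Normalizing BM25 by the best score in the candidate set keeps both
    terms on the same scale before the weighted sum.
    """
    bm25_norm = bm25_score / max_bm25 if max_bm25 > 0 else 0.0
    return 0.6 * embedding_sim + 0.4 * bm25_norm
```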

Multi-GPU: Tensor Parallelism & Pipeline Parallelism

Run models across multiple GPUs to reduce per-GPU VRAM or serve larger models.

# Tensor Parallelism: shard weights across GPUs (NCCL all-reduce)
zse serve qwen-32b -tp 2 --port 8000

# Pipeline Parallelism: split layers across GPUs (NCCL send/recv)
zse serve qwen-72b -pp 4 --port 8000

# Hybrid TP+PP: 2D process grid
zse serve qwen-72b -tp 2 -pp 2 --port 8000

# Auto-detect: let ZSE pick the optimal layout
zse serve qwen-72b --multi-gpu --port 8000

Multi-GPU Benchmarks (Qwen2.5-7B INT4, A10G GPUs)

| Config | Speed | VRAM/GPU | vs Single GPU |
|---|---|---|---|
| Single GPU | 23.1 tok/s | 5.22 GB | baseline |
| TP=2 | 18.6 tok/s | 3.76 GB | -28% VRAM |
| PP=2 | 23.1 tok/s | 2.67 GB | -49% VRAM |
| TP=2 × PP=2 | 20.6 tok/s | 1.6-2.1 GB | -65% VRAM |

When to use what:

  • TP: Fastest inference, splits every layer. Best when GPUs are connected via NVLink.
  • PP: Best VRAM reduction, near-perfect memory split. Works over PCIe.
  • TP+PP: Maximum scale; run 72B on 4× consumer GPUs.
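The PP=2 row in the benchmark table follows from a simple even split of layers, which can be sanity-checked with back-of-envelope arithmetic; embeddings and KV cache add per-GPU overhead, which is why the measured figure is slightly above the estimate.

```python
# Pipeline parallelism splits transformer layers nearly evenly, so per-GPU
# weight memory is roughly the single-GPU figure divided by the PP degree.
single_gpu_gb = 5.22   # measured single-GPU VRAM from the table above
pp_degree = 2

estimate = single_gpu_gb / pp_degree
print(f"{estimate:.2f} GB/GPU")  # ~2.61 GB vs the measured 2.67 GB/GPU
```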

Advanced Usage

QLoRA Fine-Tuning (v1.4.0+)

Train any model with QLoRA - LoRA adapters on quantized INT4 base models.

# Install training dependencies
pip install zllm-zse[training]

Then, in Python:

from zse.format import load_zse_model, convert_model
from zse.training import (
    LoRAConfig, 
    add_lora_to_model, 
    save_lora_adapter,
    load_lora_adapter
)
import torch

# 1. Convert model to .zse (one-time)
convert_model("meta-llama/Llama-3-8B", "llama8b.zse", quantization="int4")

# 2. Load INT4 model
model, tokenizer, info = load_zse_model("llama8b.zse", device="cuda")

# 3. Add LoRA adapters
config = LoRAConfig(
    rank=16,              # LoRA rank (higher = more capacity)
    alpha=32,             # LoRA alpha (scaling factor)
    dropout=0.05,         # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"]  # Which layers
)
model = add_lora_to_model(model, config)

# 4. Train with standard PyTorch
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], 
    lr=2e-4
)

for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# 5. Save adapter (~25MB for 7B model)
save_lora_adapter(model, "my_adapter.safetensors")

# 6. Load adapter for inference
model, tokenizer, info = load_zse_model("llama8b.zse", lora="my_adapter.safetensors")

QLoRA VRAM Usage:

| Model | Base VRAM | + LoRA Training | Trainable Params |
|---|---|---|---|
| 7B | 6 GB | ~8 GB | 12M (0.2%) |
| 14B | 11 GB | ~14 GB | 25M (0.2%) |
| 32B | 20 GB | ~26 GB | 50M (0.2%) |
| 70B | 42 GB | ~52 GB | 100M (0.1%) |
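The trainable-parameter counts above can be estimated from the LoRA setup: each adapted linear layer gains two low-rank matrices (hidden×rank and rank×hidden). The model dimensions below (32 layers, hidden size 4096, 4 target projections, rank 16) are illustrative assumptions for a 7B-class model; exact counts depend on the architecture, which is why the table's numbers differ slightly.

```python
def lora_params(num_layers: int, num_targets: int, hidden: int, rank: int = 16) -> int:
    """Estimate LoRA trainable params: two rank-r matrices per adapted layer."""
    return num_layers * num_targets * 2 * rank * hidden

# e.g. a 7B-class model: 32 layers, hidden 4096, q/k/v/o projections adapted
print(lora_params(32, 4, 4096, rank=16))  # ~16.8M, same order as the table's 12M
```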

Control Caching Strategy

# Auto (default): Detect VRAM, pick optimal strategy
model, tok, info = load_zse_model("qwen7b.zse", cache_weights="auto")

# Force bnb mode (low VRAM, fast inference)
model, tok, info = load_zse_model("qwen7b.zse", cache_weights=False)

# Force FP16 cache (max speed, high VRAM)
model, tok, info = load_zse_model("qwen7b.zse", cache_weights=True)

Benchmark Your Setup

# Full benchmark
python3 -c "
import time, torch
from zse.format.reader_v2 import load_zse_model

t0 = time.time()
model, tokenizer, info = load_zse_model('qwen7b.zse')
print(f'Load: {time.time()-t0:.1f}s, VRAM: {torch.cuda.memory_allocated()/1e9:.1f}GB')

inputs = tokenizer('Hello', return_tensors='pt').to('cuda')
model.generate(**inputs, max_new_tokens=10)  # Warmup

prompt = 'Write a detailed essay about AI.'
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
torch.cuda.synchronize()
t0 = time.time()
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
torch.cuda.synchronize()
tokens = out.shape[1] - inputs['input_ids'].shape[1]
print(f'{tokens} tokens in {time.time()-t0:.2f}s = {tokens/(time.time()-t0):.1f} tok/s')
"

CLI Commands

# Model Management
zse pull <model>                    # Download pre-converted .zse model
zse list                            # Browse available models
zse list --cached                   # Show locally cached models
zse cached                          # Show cache details
zse convert <model_id> -o out.zse   # Convert HuggingFace model
zse info <model>                    # Show model info
zse hardware                        # Check GPU/VRAM

# Serving
zse serve <model> --port 8000       # Single GPU
zse serve <model> -tp 2 --port 8000 # Tensor parallel (2 GPUs)
zse serve <model> -pp 2 --port 8000 # Pipeline parallel (2 GPUs)
zse serve <model> -tp 2 -pp 2       # Hybrid TP+PP (4 GPUs)
zse serve <model> --multi-gpu       # Auto-detect optimal layout
zse chat <model>                    # Interactive chat

# RAG (.zpf)
zse rag add <file> [--title <t>]    # Ingest document
zse rag search <query> [-k <n>]     # Semantic search
zse rag context <query> [--max-tokens <n>]  # Token-budgeted context
zse rag list                        # List documents
zse rag remove <doc_id>             # Remove document
zse rag convert <file> [-o out.zpf] # Convert to .zpf only
zse rag inspect <zpf_path>          # Show .zpf metadata
zse rag export <zpf> [-f fmt]       # Export (markdown/jsonl/json)
zse rag stats                       # Store statistics
zse rag reindex [--model <name>]    # Re-embed with new model

# Auth (for gated models)
zse login
zse logout

🤗 Pre-Converted Models

13 models ready for instant download from huggingface.co/zse-zllm:

| Model | Size | Pull Command |
|---|---|---|
| Qwen2.5-0.5B-Instruct | 0.69 GB | zse pull qwen-0.5b |
| TinyLlama-1.1B-Chat | 0.71 GB | zse pull tinyllama-1.1b |
| Qwen2.5-1.5B-Instruct | 1.51 GB | zse pull qwen-1.5b |
| Qwen2.5-3B-Instruct | 2.51 GB | zse pull qwen-3b |
| Qwen2.5-Coder-1.5B | 1.51 GB | zse pull qwen-coder-1.5b |
| DeepSeek-Coder-6.7B | 3.61 GB | zse pull deepseek-6.7b |
| Mistral-7B-Instruct | 3.86 GB | zse pull mistral-7b |
| Qwen2.5-7B-Instruct | 5.18 GB | zse pull qwen-7b |
| Qwen2.5-Coder-7B | 5.18 GB | zse pull qwen-coder-7b |
| Qwen2.5-14B-Instruct | 9.26 GB | zse pull qwen-14b |
| Qwen2.5-32B-Instruct | 17.9 GB | zse pull qwen-32b |
| Mixtral-8x7B-Instruct | 85.14 GB | zse pull mixtral-8x7b |
| Qwen2.5-72B-Instruct | 38.38 GB | zse pull qwen-72b |

How It Works

  1. Conversion: Quantize HF model to INT4, pack weights, embed tokenizer + config
  2. Loading: Memory-map .zse file, load INT4 weights directly to GPU
  3. Inference: ZSE custom kernel (default) or bnb backend for matmul
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  HuggingFace    │────▶│   .zse File     │────▶│   GPU Model     │
│  Model (FP16)   │     │   (INT4 + tok)  │     │  (ZSE Kernel)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
    One-time             Single file             12% less VRAM
    conversion           ~0.5 bytes/param        vs bitsandbytes
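The "~0.5 bytes/param" figure follows directly from INT4 being 4 bits per weight. The parameter count below is approximate, and the .zse file also carries quantization scales, higher-precision embeddings, the tokenizer, and config, which is why the real 7B file (5.57 GB) is larger than the raw estimate.

```python
params = 7.6e9                 # Qwen2.5-7B parameter count (approximate)
weight_bytes = params * 0.5    # INT4 = 4 bits = 0.5 bytes per parameter
print(weight_bytes / 1e9)      # 3.8 (GB of raw INT4 weights)
```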

OpenClaw Integration

Run local models with OpenClaw - the 24/7 AI assistant by @steipete.

# Start ZSE server
zse serve <model-name> --port 8000

# Configure OpenClaw to use local ZSE
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=zse

Or in OpenClaw's config.yaml:

llm:
  provider: openai-compatible
  api_base: http://localhost:8000/v1
  api_key: zse
  model: default

Benefits: 100% private, zero API costs, works offline, run ANY model.

Docker Deployment

# CPU
docker run -p 8000:8000 ghcr.io/zyora-dev/zse:latest

# GPU (NVIDIA)
docker run --gpus all -p 8000:8000 ghcr.io/zyora-dev/zse:gpu

# With model pre-loaded
docker run -p 8000:8000 -e ZSE_MODEL=Qwen/Qwen2.5-0.5B-Instruct ghcr.io/zyora-dev/zse:latest

See deploy/DEPLOY.md for full deployment guide including Runpod, Vast.ai, Railway, Render, and Kubernetes.

License

Apache 2.0

Sponsors

This project is supported by:

DigitalOcean

Contact


Made with ❤️ by Zyora Labs

Project details


Download files

Download the file for your platform.

Source Distribution

zllm_zse-1.4.2.tar.gz (391.9 kB)

Uploaded Source

Built Distribution


zllm_zse-1.4.2-py3-none-any.whl (430.9 kB)

Uploaded Python 3

File details

Details for the file zllm_zse-1.4.2.tar.gz.

File metadata

  • Download URL: zllm_zse-1.4.2.tar.gz
  • Upload date:
  • Size: 391.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for zllm_zse-1.4.2.tar.gz
Algorithm Hash digest
SHA256 eacbc7d857742c2cc27efa81190b62d2bed88aa20734ba59bba17b3f273fcf20
MD5 0560bcb14755f6cf2ea00f7e6944f6b7
BLAKE2b-256 38ff2a81ef71860818fa2a2d57d6f032cf560cab1f606f54fa73d023c2e65249

See more details on using hashes here.

File details

Details for the file zllm_zse-1.4.2-py3-none-any.whl.

File metadata

  • Download URL: zllm_zse-1.4.2-py3-none-any.whl
  • Upload date:
  • Size: 430.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for zllm_zse-1.4.2-py3-none-any.whl
Algorithm Hash digest
SHA256 298c2c409e75b4350f75e13aa734d1cbc422437f87f4a05b5f5e091c336dbd85
MD5 b54ed250d0b935ea74a8d64a60691e2e
BLAKE2b-256 d416befe75cc53afd0ac81d3a53dc0d251423151ce7fbeb3ad9c860843a2a9bf

