
ZSE - Z Server Engine


Ultra memory-efficient LLM inference engine with native INT4 CUDA kernels.

Run 32B models on 24GB GPUs. Run 7B models on 8GB GPUs. Fast cold starts, single-file deployment.

🚀 Benchmarks (Verified, March 2026)

Custom Triton Kernel (Default - Max VRAM Efficiency)

| Model    | File Size | VRAM     | Speed      | Cold Start | GPU  |
|----------|-----------|----------|------------|------------|------|
| Qwen 7B  | 5.57 GB   | 5.67 GB  | 37.2 tok/s | 5.7s       | H200 |
| Qwen 14B | 9.95 GB   | 10.08 GB | 20.8 tok/s | 10.5s      | H200 |
| Qwen 32B | 19.23 GB  | 19.47 GB | 10.9 tok/s | 20.4s      | H200 |
| Qwen 72B | 41.21 GB  | 41.54 GB | 6.3 tok/s  | 51.8s      | H200 |

bitsandbytes Backend (Optional - Max Speed)

| Model    | VRAM     | Speed      | Cold Start |
|----------|----------|------------|------------|
| Qwen 7B  | 6.57 GB  | 45.6 tok/s | 6.0s       |
| Qwen 14B | 11.39 GB | 27.6 tok/s | 7.1s       |
| Qwen 32B | 22.27 GB | 20.4 tok/s | 20.8s      |
| Qwen 72B | 47.05 GB | 16.4 tok/s | 53.0s      |

VRAM Savings (Triton vs bnb)

| Model | Triton VRAM | bnb VRAM | Savings       |
|-------|-------------|----------|---------------|
| 7B    | 5.67 GB     | 6.57 GB  | 0.90 GB (14%) |
| 14B   | 10.08 GB    | 11.39 GB | 1.31 GB (12%) |
| 32B   | 19.47 GB    | 22.27 GB | 2.80 GB (13%) |
| 72B   | 41.54 GB    | 47.05 GB | 5.51 GB (12%) |

GPU Compatibility

| GPU              | VRAM     | Max Model (Triton) | Max Model (bnb) |
|------------------|----------|--------------------|-----------------|
| RTX 3070/4070    | 8GB      | 7B                 | 7B              |
| RTX 3080         | 12GB     | 14B                | 7B              |
| RTX 3090/4090    | 24GB     | 32B                | 32B             |
| A100-40GB        | 40GB     | 32B                | 32B             |
| A100-80GB / H200 | 80-141GB | 72B                | 72B             |

Key Features

  • 📦 Single .zse File: Model + tokenizer + config in one file
  • 🚫 No Network Calls: Everything embedded, works offline
  • ⚡ Custom Triton Kernel: Native INT4 inference, no bitsandbytes required
  • 🧠 Memory Efficient: 72B in 41GB, 32B in 19GB, 7B in 5.7GB VRAM
  • 🏃 Fast Cold Start: 5.7s for 7B, 20s for 32B, 52s for 72B
  • 🎯 Auto Backend: Triton (VRAM efficient) or bnb (max speed)

Installation

pip install zllm-zse

Requirements:

  • Python 3.11+
  • CUDA GPU (8GB+ VRAM recommended)
  • bitsandbytes (auto-installed)
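
Before converting large models, it can help to confirm that PyTorch actually sees your GPU and how much VRAM it has. A minimal sanity check (plain PyTorch, nothing ZSE-specific):

import torch

# Confirm a CUDA GPU is visible and report total VRAM on device 0
assert torch.cuda.is_available(), "No CUDA GPU detected"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM")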

Quick Start

1. Convert Model to .zse Format (One-Time)

# Convert any HuggingFace model
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen7b.zse
zse convert Qwen/Qwen2.5-32B-Instruct -o qwen32b.zse

# Or in Python
from zse.format.writer import convert_model
convert_model("Qwen/Qwen2.5-7B-Instruct", "qwen7b.zse", quantization="int4")

2. Load and Run

from zse.format.reader_v2 import load_zse_model

# Load model (auto-detects optimal settings)
model, tokenizer, info = load_zse_model("qwen7b.zse")

# Generate
inputs = tokenizer("Write a poem about AI:", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

3. Start Server (OpenAI-Compatible)

zse serve qwen7b.zse --port 8000

Then query it from any OpenAI client:

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="zse")
response = client.chat.completions.create(
    model="qwen7b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
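
If you aren't using the openai package, the same server can be reached over plain HTTP. The sketch below assumes the standard OpenAI-style /v1/chat/completions route implied by the client example above:

import requests

# POST a chat request directly to the OpenAI-compatible endpoint
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer zse"},
    json={
        "model": "qwen7b",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])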

.zse Format Benefits

| Feature               | HuggingFace | .zse Format |
|-----------------------|-------------|-------------|
| Cold start (7B)       | 45s         | 9s          |
| Cold start (32B)      | 120s        | 24s         |
| Network calls on load | Yes         | No          |
| Files to manage       | Many        | One         |
| Quantization time     | Runtime     | Pre-done    |

Advanced Usage

Control Caching Strategy

# Auto (default): Detect VRAM, pick optimal strategy
model, tok, info = load_zse_model("qwen7b.zse", cache_weights="auto")

# Force bnb mode (low VRAM, fast inference)
model, tok, info = load_zse_model("qwen7b.zse", cache_weights=False)

# Force FP16 cache (max speed, high VRAM)
model, tok, info = load_zse_model("qwen7b.zse", cache_weights=True)
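If you prefer to make that decision explicitly instead of relying on "auto", one option is to look at free VRAM before loading. This is only an illustrative heuristic under assumed thresholds, not ZSE's internal auto-detection logic:

import os
import torch
from zse.format.reader_v2 import load_zse_model

path = "qwen7b.zse"
free_bytes, _total = torch.cuda.mem_get_info()

# Illustrative heuristic only: cache FP16 weights when free VRAM comfortably
# exceeds ~4x the INT4 file size (FP16 weights are roughly 4x larger than INT4).
cache = free_bytes > 4 * os.path.getsize(path)
print(f"Free VRAM before load: {free_bytes / 1e9:.1f} GB, cache_weights={cache}")
model, tok, info = load_zse_model(path, cache_weights=cache)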

Benchmark Your Setup

# Full benchmark
python3 -c "
import time, torch
from zse.format.reader_v2 import load_zse_model

t0 = time.time()
model, tokenizer, info = load_zse_model('qwen7b.zse')
print(f'Load: {time.time()-t0:.1f}s, VRAM: {torch.cuda.memory_allocated()/1e9:.1f}GB')

inputs = tokenizer('Hello', return_tensors='pt').to('cuda')
model.generate(**inputs, max_new_tokens=10)  # Warmup

prompt = 'Write a detailed essay about AI.'
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
torch.cuda.synchronize()
t0 = time.time()
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
torch.cuda.synchronize()
dt = time.time() - t0
tokens = out.shape[1] - inputs['input_ids'].shape[1]
print(f'{tokens} tokens in {dt:.2f}s = {tokens/dt:.1f} tok/s')
"

CLI Commands

# Convert model
zse convert <model_id> -o output.zse

# Start server
zse serve <model.zse> --port 8000

# Interactive chat
zse chat <model.zse>

# Show model info
zse info <model.zse>

# Check hardware
zse hardware

How It Works

  1. Conversion: Quantize HF model to INT4, pack weights, embed tokenizer + config
  2. Loading: Memory-map .zse file, load INT4 weights directly to GPU
  3. Inference: Custom Triton kernel (default) or bnb for matmul

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  HuggingFace    │────▶│   .zse File     │────▶│   GPU Model     │
│  Model (FP16)   │     │   (INT4 + tok)  │     │  (Triton/bnb)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
    One-time             Single file             12% less VRAM
    conversion           ~0.5 bytes/param        vs bitsandbytes
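
For intuition about the storage side, the sketch below packs two signed 4-bit values per byte with a per-group scale, which is the general idea behind INT4 weight formats. It is a simplified illustration, not ZSE's actual Triton kernel or on-disk layout:

import torch

def pack_int4(w: torch.Tensor, group_size: int = 128):
    """Quantize FP weights to symmetric INT4 and pack two values per byte."""
    w = w.float().reshape(-1, group_size)
    scale = (w.abs().amax(dim=1, keepdim=True) / 7).clamp(min=1e-8)       # per-group scale
    q = (torch.clamp(torch.round(w / scale), -7, 7) + 8).to(torch.uint8)  # shift to [1, 15]
    packed = q[:, ::2] | (q[:, 1::2] << 4)                                # two nibbles per byte
    return packed, scale.half()

def unpack_int4(packed: torch.Tensor, scale: torch.Tensor):
    """Recover approximate FP16 values from the packed nibbles."""
    lo = (packed & 0x0F).to(torch.int16) - 8
    hi = (packed >> 4).to(torch.int16) - 8
    q = torch.stack([lo, hi], dim=-1).reshape(packed.shape[0], -1)
    return q.half() * scale

Stored like this, each parameter costs half a byte plus a small per-group scale, which is roughly where the ~0.5 bytes/param figure in the diagram comes from.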

OpenClaw Integration

Run local models with OpenClaw - the 24/7 AI assistant by @steipete.

# Start ZSE server
zse serve <model.zse> --port 8000

# Configure OpenClaw to use local ZSE
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=zse

Or in OpenClaw's config.yaml:

llm:
  provider: openai-compatible
  api_base: http://localhost:8000/v1
  api_key: zse
  model: default

Benefits: 100% private, zero API costs, works offline, run ANY model.

Docker Deployment

# CPU
docker run -p 8000:8000 ghcr.io/zyora-dev/zse:latest

# GPU (NVIDIA)
docker run --gpus all -p 8000:8000 ghcr.io/zyora-dev/zse:gpu

# With model pre-loaded
docker run -p 8000:8000 -e ZSE_MODEL=Qwen/Qwen2.5-0.5B-Instruct ghcr.io/zyora-dev/zse:latest

See deploy/DEPLOY.md for the full deployment guide, including Runpod, Vast.ai, Railway, Render, and Kubernetes.

License

Apache 2.0

Contact


Made with โค๏ธ by Zyora Labs
