
ZSE - Z Server Engine


Ultra memory-efficient LLM inference engine.

ZSE is designed to run large language models with minimal memory footprint while maintaining high performance. Our key innovation is the Intelligence Orchestrator that provides smart recommendations based on your available (not total) memory.
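
In practice the free-vs-total distinction matters: a 24 GB GPU running a desktop session or another process may have noticeably less VRAM actually available. A minimal sketch of that measurement, using PyTorch's torch.cuda.mem_get_info rather than any ZSE-internal API (which this README does not expose):

import torch

# mem_get_info returns (free_bytes, total_bytes) for the current CUDA device
free, total = torch.cuda.mem_get_info()
gib = 1024 ** 3
print(f"Total VRAM: {total / gib:.1f} GiB, free: {free / gib:.1f} GiB")
# Planning against `total` overestimates headroom whenever other
# processes hold VRAM; recommendations keyed to `free` avoid OOM.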

Key Features

  • 🧠 zAttention: Custom CUDA kernels for paged, flash, and sparse attention
  • 🗜️ zQuantize: Per-tensor INT2-8 mixed precision quantization
  • 💾 zKV: Quantized KV cache with sliding precision (4x memory savings; worked example after this list)
  • 🌊 zStream: Layer streaming with async prefetch (run 70B on 24GB GPU)
  • 🎯 zOrchestrator: Smart recommendations based on FREE memory
  • 📊 Efficiency Modes: speed / balanced / memory / ultra
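
The 4x figure for zKV is just element-width arithmetic: FP16 holds 2 bytes per cached element, INT4 holds 0.5. A worked example, where the 7B-class dimensions are illustrative assumptions for this sketch, not ZSE-published figures:

# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/elem
# Dimensions below are assumed for illustration only.
layers, kv_heads, head_dim, seq_len = 32, 8, 128, 8192

def kv_cache_gib(bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1024 ** 3

print(f"FP16 KV cache: {kv_cache_gib(2.0):.2f} GiB")  # 1.00 GiB
print(f"INT4 KV cache: {kv_cache_gib(0.5):.2f} GiB")  # 0.25 GiB -> 4x smaller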

⚡ Cold Start Benchmark

3.9s (7B) and 21.4s (32B) to first token with .zse format — verified on A100-80GB.

| Model | bitsandbytes | ZSE (.zse) | Speedup |
|-------|--------------|------------|---------|
| Qwen 7B | 45.4s | 3.9s | 11.6× |
| Qwen 32B | 120.0s | 21.4s | 5.6× |

# One-time conversion (~20s)
zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse

# Every subsequent start: 3.9s
zse serve qwen-7b.zse

Note: Results measured on A100-80GB with NVMe storage (Feb 2026). On consumer SSDs expect 5-10s; HDDs may be slower. Any modern SSD achieves sub-10s cold starts.
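
To sanity-check time-to-first-token on your own hardware, one approach is to time the first streamed chunk through the OpenAI-compatible endpoint (see API Server below), assuming the server is already listening on port 8000. Note this times a warm server's serving path; the cold-start numbers above also include process launch and model load:

import time
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="zse")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="qwen-7b.zse",  # hypothetical name; use whatever model you served
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    # The first chunk carrying content marks time to first token
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"First token after {time.perf_counter() - start:.2f}s")
        break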

Memory Benchmarks (Verified, A100-80GB)

| Model | FP16 | INT4/NF4 | Reduction | Throughput |
|-------|------|----------|-----------|------------|
| Qwen 7B | 14.2 GB | 5.2 GB | 63% ✅ | 12-15 tok/s |
| Qwen 32B | ~64 GB | 19.3 GB (NF4) / ~35 GB (.zse) | 70% ✅ | 7.9 tok/s |
| 14B | ~28 GB | ~7 GB | ⏳ est. | - |
| 70B | ~140 GB | ~24 GB | ⏳ est. | - |

32B note: Use NF4 (19.3 GB) on GPUs with <36 GB VRAM. Use .zse (35 GB, 5.6× faster start) on 40 GB+ GPUs.
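
That note reduces to a simple decision rule; a hedged sketch, with the thresholds copied from the note and torch used only to read free VRAM:

import torch

free_gb = torch.cuda.mem_get_info()[0] / 1024 ** 3

# Thresholds follow the 32B note above; the 20 GB floor for NF4 is an
# assumption (19.3 GB of weights plus KV-cache headroom).
if free_gb >= 40:
    print("Use .zse (35 GB, 5.6x faster start)")
elif free_gb >= 20:
    print("Use NF4 (19.3 GB)")
else:
    print("32B won't fit; try a smaller model or --efficiency ultra")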

Installation

# Clone and install (PyPI coming soon)
git clone https://github.com/Zyora-Dev/zse.git
cd zse
pip install -e ".[dev]"

With CUDA support (recommended):

pip install -e ".[cuda]"

Quick Start

Start Server

# Any HuggingFace model works!
zse serve Qwen/Qwen2.5-7B-Instruct
zse serve meta-llama/Llama-3.1-8B-Instruct
zse serve mistralai/Mistral-7B-Instruct-v0.3
zse serve microsoft/Phi-3-mini-4k-instruct
zse serve google/gemma-2-9b-it

# With memory optimization
zse serve Qwen/Qwen2.5-32B-Instruct --max-memory 24GB

# With recommendations
zse serve meta-llama/Llama-3.1-70B-Instruct --recommend

# Ultra memory efficiency
zse serve deepseek-ai/DeepSeek-V2-Lite --efficiency ultra

# GGUF models (via llama.cpp)
zse serve ./model-Q4_K_M.gguf

💡 Supported Models: Any HuggingFace transformers model, safetensors, GGUF, or .zse format. Popular choices: Qwen, Llama, Mistral, Phi, Gemma, DeepSeek, Yi, and more.

Interactive Chat

zse chat Qwen/Qwen2.5-7B-Instruct

Convert to ZSE Format

zse convert Qwen/Qwen2.5-32B-Instruct -o qwen-32b.zse --target-memory 24GB

Check Hardware

zse hardware

API Server

ZSE provides an OpenAI-compatible API:

zse serve Qwen/Qwen2.5-7B-Instruct --port 8000

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="zse")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
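
Because the API follows the OpenAI shape, streaming should work the same way it does against OpenAI itself, reusing the client from above (this assumes ZSE implements stream=True, which the README does not state explicitly):

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        # Print tokens as they arrive
        print(delta, end="", flush=True)
print()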

Efficiency Modes

| Mode | Description | Use Case |
|------|-------------|----------|
| speed | Maximum throughput | Production with ample GPU memory |
| balanced | Good throughput, moderate memory | Standard deployment (default) |
| memory | Low memory, reduced throughput | Consumer GPUs |
| ultra | Extreme memory savings | 4GB GPUs, laptops |

zse serve model --efficiency memory

Deployment

Developer Mode

zse serve model --mode dev

  • No authentication required
  • SQLite database
  • Hot reload enabled
  • Debug logging

Enterprise Mode

zse serve model --config configs/enterprise.yaml

  • API key authentication
  • PostgreSQL + Redis
  • Prometheus metrics
  • Rate limiting
  • Multi-tenancy

Architecture

zse/
├── core/                   # ZSE Native Engine (100% custom)
│   ├── zattention/         # Custom attention kernels
│   ├── zquantize/          # Quantization (GPTQ, HQQ, INT2-8)
│   ├── zkv/                # Paged + quantized KV cache
│   ├── zstream/            # Layer streaming + prefetch
│   ├── zscheduler/         # Continuous batching
│   └── zdistributed/       # Tensor/pipeline parallelism
├── models/                 # Model loaders + architectures
├── engine/                 # Executor + Orchestrator
├── api/                    # CLI, FastAPI server, Web UI
└── enterprise/             # Auth, monitoring, scaling

GGUF Support

GGUF models are supported via llama.cpp backend:

pip install -e ".[gguf]"
zse serve ./model.gguf

Note: GGUF uses llama.cpp for inference. Native ZSE engine handles HuggingFace, safetensors, and .zse formats.

Docker Deployment

# CPU
docker run -p 8000:8000 ghcr.io/zyora-dev/zse:latest

# GPU (NVIDIA)
docker run --gpus all -p 8000:8000 ghcr.io/zyora-dev/zse:gpu

# With model pre-loaded
docker run -p 8000:8000 -e ZSE_MODEL=Qwen/Qwen2.5-0.5B-Instruct ghcr.io/zyora-dev/zse:latest

Docker Compose:

docker-compose up -d                    # CPU
docker-compose --profile gpu up -d      # GPU

See deploy/DEPLOY.md for full deployment guide including Runpod, Vast.ai, Railway, Render, and Kubernetes.
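
After any of these deployments, a quick smoke test from Python, assuming the server exposes the standard OpenAI-compatible /v1/models route (typical for OpenAI-style servers, though not listed in this README):

import requests

# 8000 matches the port mappings in the Docker commands above.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])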

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=zse

# Type checking
mypy zse

# Linting
ruff check zse

License

Apache 2.0

Acknowledgments

  • PagedAttention concept from vLLM (UC Berkeley)
  • Flash Attention from Tri Dao
  • GPTQ, HQQ, and other quantization research
