ZSE - Z Server Engine: Ultra memory-efficient LLM inference engine

ZSE is designed to run large language models with a minimal memory footprint while maintaining high performance. Our key innovation is the Intelligence Orchestrator, which bases its recommendations on the memory that is currently available, not on total memory.
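
To make the available-versus-total distinction concrete, here is a minimal sketch. It is not ZSE's Orchestrator logic: `choose_precision` and the 10% `headroom` are hypothetical, and the footprints are the Qwen 7B figures from the memory benchmark table in this README.

```python
# Hypothetical illustration only; not ZSE's actual Orchestrator code.
# Footprints are the Qwen 7B rows of the memory benchmark table.
QWEN_7B_FOOTPRINT_GB = [("FP16", 14.2), ("INT4", 5.2)]


def choose_precision(free_gb, options=QWEN_7B_FOOTPRINT_GB, headroom=0.10):
    """Return the first (highest-precision) option that fits in free memory.

    `headroom` is an assumed safety margin for activations and KV cache.
    """
    for name, required_gb in options:
        if required_gb * (1 + headroom) <= free_gb:
            return name
    return None  # nothing fits


# A 24 GB GPU on which another process already holds 12 GB.
total_gb, free_gb = 24.0, 12.0
print(choose_precision(total_gb))  # FP16: sizing by total memory would overcommit
print(choose_precision(free_gb))   # INT4: sizing by free memory picks what fits
```

With PyTorch installed, `torch.cuda.mem_get_info()` returns `(free, total)` in bytes and could supply `free_gb`; whether ZSE reads memory this way is not stated here.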

Key Features

🧠 zAttention: Custom CUDA kernels for paged, flash, and sparse attention
🗜️ zQuantize: Per-tensor INT2-8 mixed precision quantization
💾 zKV: Quantized KV cache with sliding precision (4x memory savings)
🌊 zStream: Layer streaming with async prefetch (run 70B on 24GB GPU)
🎯 zOrchestrator: Smart recommendations based on FREE memory
📊 Efficiency Modes: speed / balanced / memory / ultra
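
As a rough check on the 4x figure quoted for zKV, the sketch below uses the standard KV cache size formula (keys plus values, per layer, per KV head, per token). The model shape is a hypothetical example, and the assumption that the saving comes from storing 4-bit instead of 16-bit entries is mine; quantization metadata such as scales is ignored.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits, batch=1):
    """Size of a KV cache in bytes, ignoring quantization metadata."""
    # Factor 2: one tensor for keys and one for values.
    elements = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elements * bits // 8


# Hypothetical model shape, chosen only to give round numbers.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

fp16 = kv_cache_bytes(bits=16, **shape)
int4 = kv_cache_bytes(bits=4, **shape)

print(f"FP16 cache: {fp16 / 2**30:.2f} GiB")  # 1.00 GiB
print(f"INT4 cache: {int4 / 2**30:.2f} GiB")  # 0.25 GiB
print(f"Saving: {fp16 / int4:.0f}x")          # 4x
```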

⚡ Cold Start Benchmark

Qwen 72B cold-starts in 6.5s, 79× faster than bitsandbytes (512.7s), verified on an NVIDIA H200 (150GB VRAM).

Model	bitsandbytes	ZSE (.zse)	Speedup
Qwen 7B	45.4s	3.9s	11.6×
Qwen 32B	120.0s	21.4s	5.6×
Qwen 72B	512.7s	6.5s	79×
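
The Speedup column is simply the bitsandbytes time divided by the ZSE time. A short sketch that recomputes it from the figures in the table (the variable and function names are illustrative):

```python
# (bitsandbytes seconds, ZSE seconds) copied from the table above.
cold_start_s = {
    "Qwen 7B": (45.4, 3.9),
    "Qwen 32B": (120.0, 21.4),
    "Qwen 72B": (512.7, 6.5),
}


def speedup(baseline_s, zse_s):
    """How many times faster the ZSE cold start is than the baseline."""
    return baseline_s / zse_s


for model, (bnb_s, zse_s) in cold_start_s.items():
    print(f"{model}: {speedup(bnb_s, zse_s):.1f}x")
```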

ZSE vs llama.cpp (72B)

Format	Cold Start	VRAM
bitsandbytes	512.7s	139.1 GB
llama.cpp GGUF	10.2s	36.3 GB
ZSE (.zse)	6.5s	76.6 GB

# One-time conversion (~20s)
zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse

# Every subsequent start: 3.9s
zse serve qwen-7b.zse
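
A small sketch of the trade-off, using the approximate 7B figures quoted in this section (about 20s to convert once, 3.9s per ZSE start, 45.4s per bitsandbytes start). The function name is illustrative.

```python
def total_seconds(starts, per_start_s, one_time_s=0.0):
    """Cumulative load time after `starts` cold starts."""
    return one_time_s + starts * per_start_s


for n in (1, 5, 20):
    zse = total_seconds(n, per_start_s=3.9, one_time_s=20.0)
    bnb = total_seconds(n, per_start_s=45.4)
    print(f"{n:>2} starts: ZSE {zse:.1f}s vs bitsandbytes {bnb:.1f}s")
```

On these numbers the conversion cost is recovered on the first start.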

Note: 72B results were measured on an NVIDIA H200 (150GB); 7B and 32B results on an A100-80GB. The 7B and 72B cold starts are under 10s on any modern SSD; the 32B cold start measured 21.4s.

Memory Benchmarks (Verified, A100-80GB)

Model	FP16	INT4/NF4	Reduction	Throughput
Qwen 7B	14.2 GB	5.2 GB	63% ✅	12-15 tok/s
Qwen 32B	~64 GB	19.3 GB (NF4) / ~35 GB (.zse)	70% ✅	7.9 tok/s
14B	~28 GB	~7 GB	⏳ est	-
70B	~140 GB	~24 GB	⏳ est	-

32B note: Use NF4 (19.3 GB) on GPUs with <36 GB VRAM. Use .zse (35 GB, 5.6× faster start) on 40 GB+ GPUs.
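
The Reduction column and the 32B note can be expressed in a few lines. This is a sketch with illustrative names: `reduction_pct` recomputes the percentages from the table, and `pick_32b_format` encodes only what the note says. The note does not cover GPUs between 36 GB and 40 GB, so the function returns `None` there rather than guessing.

```python
def reduction_pct(fp16_gb, quantized_gb):
    """Memory saved relative to FP16, as a percentage."""
    return 100 * (1 - quantized_gb / fp16_gb)


def pick_32b_format(vram_gb):
    """Apply the 32B note: NF4 below 36 GB, .zse at 40 GB and above."""
    if vram_gb < 36:
        return "nf4"   # 19.3 GB footprint
    if vram_gb >= 40:
        return "zse"   # ~35 GB footprint, 5.6x faster cold start
    return None        # range not covered by the note


print(f"Qwen 7B:  {reduction_pct(14.2, 5.2):.0f}%")   # 63%
print(f"Qwen 32B: {reduction_pct(64, 19.3):.0f}%")    # 70%
print(pick_32b_format(24), pick_32b_format(48))       # nf4 zse
```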

Installation

pip install zllm-zse

With CUDA support (recommended):

pip install "zllm-zse[cuda]"

From source:

git clone https://github.com/Zyora-Dev/zse.git
cd zse
pip install -e ".[dev]"

Quick Start

Start Server

# Any HuggingFace model works!
zse serve Qwen/Qwen2.5-7B-Instruct
zse serve meta-llama/Llama-3.1-8B-Instruct
zse serve mistralai/Mistral-7B-Instruct-v0.3
zse serve microsoft/Phi-3-mini-4k-instruct
zse serve google/gemma-2-9b-it

# With memory optimization
zse serve Qwen/Qwen2.5-32B-Instruct --max-memory 24GB

# With recommendations
zse serve meta-llama/Llama-3.1-70B-Instruct --recommend

# Ultra memory efficiency
zse serve deepseek-ai/DeepSeek-V2-Lite --efficiency ultra

# GGUF models (via llama.cpp)
zse serve ./model-Q4_K_M.gguf

💡 Supported Models: Any HuggingFace transformers model, safetensors, GGUF, or .zse format. Popular choices: Qwen, Llama, Mistral, Phi, Gemma, DeepSeek, Yi, and more.

Interactive Chat

zse chat Qwen/Qwen2.5-7B-Instruct

Convert to ZSE Format

zse convert Qwen/Qwen2.5-32B-Instruct -o qwen-32b.zse --target-memory 24GB

Check Hardware

zse hardware

API Server

ZSE provides an OpenAI-compatible API:

zse serve Qwen/Qwen2.5-7B-Instruct --port 8000

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="zse")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
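
If the `openai` package is not available, the same request can be sent with the standard library. This sketch assumes the server exposes the usual OpenAI path `/v1/chat/completions` under the `base_url` shown above and returns the usual response shape; `build_chat_request` and `send` are illustrative helper names, not part of ZSE.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       model="Qwen/Qwen2.5-7B-Instruct",
                       base_url="http://localhost:8000/v1",
                       api_key="zse"):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",  # assumed OpenAI-style path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running on port 8000, `print(send(build_chat_request("Hello!")))` should print the reply.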

Efficiency Modes

Mode	Description	Use Case
`speed`	Maximum throughput	Production with ample GPU memory
`balanced`	Good throughput, moderate memory	Standard deployment (default)
`memory`	Low memory, reduced throughput	Consumer GPUs
`ultra`	Extreme memory savings	4GB GPUs, laptops

zse serve model --efficiency memory
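
When launching the server from a script, it can help to validate the mode before invoking the CLI. The helper below is a hypothetical wrapper that uses only flags shown in this README (`--efficiency`, `--port`, `--max-memory`); whether they can all be combined in one call is an assumption.

```python
EFFICIENCY_MODES = ("speed", "balanced", "memory", "ultra")


def build_serve_command(model, efficiency=None, port=None, max_memory=None):
    """Return the argv list for `zse serve`, rejecting unknown modes."""
    if efficiency is not None and efficiency not in EFFICIENCY_MODES:
        raise ValueError(
            f"unknown efficiency mode {efficiency!r}; "
            f"expected one of {', '.join(EFFICIENCY_MODES)}"
        )
    cmd = ["zse", "serve", model]
    if efficiency is not None:
        cmd += ["--efficiency", efficiency]
    if port is not None:
        cmd += ["--port", str(port)]
    if max_memory is not None:
        cmd += ["--max-memory", max_memory]  # e.g. "24GB"
    return cmd


print(" ".join(build_serve_command("model", efficiency="memory")))
# zse serve model --efficiency memory
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`.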

Deployment

Developer Mode

zse serve model --mode dev

No authentication required
SQLite database
Hot reload enabled
Debug logging

Enterprise Mode

zse serve model --config configs/enterprise.yaml

API key authentication
PostgreSQL + Redis
Prometheus metrics
Rate limiting
Multi-tenancy

Architecture

zse/
├── core/                   # ZSE Native Engine (100% custom)
│   ├── zattention/         # Custom attention kernels
│   ├── zquantize/          # Quantization (GPTQ, HQQ, INT2-8)
│   ├── zkv/                # Paged + quantized KV cache
│   ├── zstream/            # Layer streaming + prefetch
│   ├── zscheduler/         # Continuous batching
│   └── zdistributed/       # Tensor/pipeline parallelism
├── models/                 # Model loaders + architectures
├── engine/                 # Executor + Orchestrator
├── api/                    # CLI, FastAPI server, Web UI
└── enterprise/             # Auth, monitoring, scaling
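
The idea behind `zstream` (layer streaming with prefetch) can be illustrated with a toy example: while layer i is being computed, layer i+1 is loaded in the background, so only a couple of layers need to be resident at a time. This is a generic sketch of the technique, not ZSE's implementation; `load_layer` and `apply_layer` are stand-ins for real weight loading and a real forward pass.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(index):
    """Stand-in for reading one layer's weights from disk or host RAM."""
    return {"scale": index + 1}


def apply_layer(weights, x):
    """Stand-in for the layer's forward pass."""
    return x * weights["scale"]


def run_streamed(num_layers, x):
    """Run all layers in order, prefetching the next one during compute."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(num_layers):
            current = pending.result()  # wait for layer i to finish loading
            # Start loading layer i+1 before computing layer i.
            pending = (pool.submit(load_layer, i + 1)
                       if i + 1 < num_layers else None)
            x = apply_layer(current, x)
            # `current` is replaced on the next iteration, so at most two
            # layers (current and prefetched) are referenced at once.
    return x


print(run_streamed(4, 1))  # 1 * 1 * 2 * 3 * 4 = 24
```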

GGUF Support

GGUF models are supported via the llama.cpp backend:

pip install "zllm-zse[gguf]"
zse serve ./model.gguf

Note: GGUF models use llama.cpp for inference. The native ZSE engine handles HuggingFace, safetensors, and .zse formats.

Docker Deployment

# CPU
docker run -p 8000:8000 ghcr.io/zyora-dev/zse:latest

# GPU (NVIDIA)
docker run --gpus all -p 8000:8000 ghcr.io/zyora-dev/zse:gpu

# With model pre-loaded
docker run -p 8000:8000 -e ZSE_MODEL=Qwen/Qwen2.5-0.5B-Instruct ghcr.io/zyora-dev/zse:latest

Docker Compose:

docker-compose up -d                    # CPU
docker-compose --profile gpu up -d      # GPU

See deploy/DEPLOY.md for the full deployment guide, including Runpod, Vast.ai, Railway, Render, and Kubernetes.

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=zse

# Type checking
mypy zse

# Linting
ruff check zse

License

Apache 2.0

Acknowledgments

PagedAttention concept from vLLM (UC Berkeley)
Flash Attention from Tri Dao
GPTQ, HQQ, and other quantization research
