Zero-dependency LLM inference engine. Fastest cold start, smallest VRAM footprint. CUDA + ROCm + Metal.

ZSE — Zero-dependency Server Engine for LLM Inference

The fastest cold start. The smallest memory footprint. On every GPU.

What is ZSE?

ZSE is a production LLM inference engine that owns the full stack — no PyTorch, no Triton, no bitsandbytes, no transformers. Just pure Python, ctypes, and a kernel compiler that emits CUDA, ROCm (HIP), and Metal directly.

The result: models load in seconds, not minutes, and serve at a fraction of the memory other engines need.

pip install zse-engine   # one package, zero transitive ML deps
zse serve qwen-7b.zse    # 7-second cold start. 5.8 GB on a T4.

Headline Numbers

Verified on Modal (T4, L4, A10G, A100), DigitalOcean (MI300X), and Apple M1. ZSE INT4 vs vLLM AWQ INT4, Qwen2.5-7B / 14B / 32B.

Cold start — every GPU, every model size

GPU	Model	ZSE	vLLM	Speedup
NVIDIA T4 (16 GB)	Qwen2.5-7B	7.25s	218.96s	30.2×
NVIDIA L4 (24 GB)	Qwen2.5-7B	5.58s	145.22s	26.0×
NVIDIA A10G (24 GB)	Qwen2.5-7B	6.01s	193.05s	32.1×
NVIDIA A100-80GB	Qwen2.5-14B	6.29s	127.02s	20.2×
AMD MI300X (192 GB)	Qwen2.5-32B	3.14s	42.65s	13.6×

VRAM — fits where others can't

GPU	Model	ZSE	vLLM	Reduction
NVIDIA T4	Qwen2.5-7B	5.79 GB	~14 GB	~2.4×
NVIDIA A100-80GB	Qwen2.5-14B	12.28 GB	71.45 GB	5.82×
AMD MI300X	Qwen2.5-32B	22.07 GB	161.77 GB	7.33×

ZSE runs 32B INT4 in 22 GB of VRAM, so a single 192 GB MI300X has room for 7 more copies of the same model. vLLM's PyTorch allocator + KV slab reserves most of the GPU (161.77 GB of 192 GB here) regardless of quantization.
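The reduction and capacity figures follow from the table by plain arithmetic. A quick check, using only the numbers quoted above (the variable names are ours, and the capacity estimate ignores KV cache growth and allocator overhead):

```python
import math

# (GPU, model, ZSE GB, vLLM GB) copied from the VRAM table above.
rows = [
    ("NVIDIA A100-80GB", "Qwen2.5-14B", 12.28, 71.45),
    ("AMD MI300X", "Qwen2.5-32B", 22.07, 161.77),
]
reductions = {gpu: round(vllm / zse, 2) for gpu, _, zse, vllm in rows}

# How many 22.07 GB copies of the 32B INT4 model fit in 192 GB of MI300X memory.
copies_total = math.floor(192 / 22.07)

print(reductions)    # reduction factor per GPU
print(copies_total)  # copies that fit by raw footprint
```

By raw footprint this gives eight copies in total on one MI300X.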

Single-sequence throughput: ahead on A100, near parity on A10G, behind on L4, MI300X, and T4

GPU	Model	ZSE (tok/s)	vLLM (tok/s)	Ratio (ZSE / vLLM)
NVIDIA A100-80GB	Qwen2.5-14B	37.0	26.5	1.40×
NVIDIA A10G	Qwen2.5-7B	48.6	50.9	0.95×
NVIDIA L4	Qwen2.5-7B	36.3	47.3	0.77×
AMD MI300X	Qwen2.5-32B	38.4	56.4	0.68×
NVIDIA T4	Qwen2.5-7B	18.8	35.2	0.53×

Why ZSE

	vLLM	ZSE
Cold start (7B)	30s – 4 min	5–7s
VRAM (14B INT4)	71 GB	12 GB
Dependencies	PyTorch + Triton + CUDA toolkit (~12 GB)	Zero
Pip install size	~3 GB	~5 MB
Backends	CUDA primarily	CUDA + ROCm + Metal
Model format	safetensors (deserialize on load)	`.zse` (mmap, pre-quantized, instant)
KV cache	Fixed 16-token blocks, LRU eviction	Adaptive blocks, token-level smart eviction
Model conversion	None — runtime quant	One-time, ~600× faster than pure Python
Built-in RAG	❌	✅ (hybrid retrieval + cross-encoder rerank + ZPF compression)
Built-in auth + rate limiting	❌	✅ (SQLite-backed)
LoRA hot-swap	✅ (S-LoRA)	✅

Hardware Validated

Hardware	Vendor	Arch	Status
NVIDIA T4	NVIDIA	Turing (sm_75)	✅
NVIDIA L4	NVIDIA	Ada (sm_89)	✅
NVIDIA A10G	NVIDIA	Ampere (sm_86)	✅
NVIDIA A100 (40 GB, 80 GB)	NVIDIA	Ampere DC (sm_80)	✅
NVIDIA H100 / H200	NVIDIA	Hopper (sm_90)	✅
AMD Instinct MI300X (192 GB)	AMD	CDNA3 (gfx942)	✅
Apple M1	Apple	Apple Silicon	✅ (kernel-level; full inference not yet run)

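A new architecture usually works on day one: the compiler queries the device's architecture at runtime (compute capability on NVIDIA, gfx target on AMD) and has NVRTC, HIPRTC, or the Metal compiler build the generated CUDA C / HIP C / MSL for it automatically.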
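To illustrate what a runtime architecture query can look like without any ML dependency, here is a minimal sketch that asks the CUDA driver for the compute capability through ctypes. It is not ZSE's actual code: the function names are ours, and only the driver calls (`cuInit`, `cuDeviceGet`, `cuDeviceGetAttribute`) and the two attribute IDs come from the CUDA driver API.

```python
import ctypes

# Attribute IDs from the CUDA driver API header (cuda.h).
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = 76


def arch_name(major, minor):
    """Format a compute capability as an architecture name, e.g. (7, 5) -> 'sm_75'."""
    return f"sm_{major}{minor}"


def query_compute_capability(ordinal=0):
    """Return (major, minor) for a CUDA device, or None if no driver or device is usable."""
    cuda = None
    for name in ("libcuda.so.1", "libcuda.so", "nvcuda.dll"):
        try:
            cuda = ctypes.CDLL(name)
            break
        except OSError:
            continue
    if cuda is None or cuda.cuInit(0) != 0:
        return None
    device = ctypes.c_int()
    if cuda.cuDeviceGet(ctypes.byref(device), ordinal) != 0:
        return None
    major, minor = ctypes.c_int(), ctypes.c_int()
    for out, attr in (
        (major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR),
        (minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR),
    ):
        if cuda.cuDeviceGetAttribute(ctypes.byref(out), attr, device) != 0:
            return None
    return major.value, minor.value


cc = query_compute_capability()
print(arch_name(*cc) if cc else "no CUDA device found")
```

The resulting name is the kind of value a runtime compiler such as NVRTC takes as its target architecture; the AMD and Apple paths would need their own queries and are not shown.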

Install

pip install zse-engine

Requirements:

Python 3.11+
One of: NVIDIA driver + CUDA runtime, AMD ROCm 6+, or Apple Silicon
That's it. No PyTorch. No Triton. No transformers.

Quick Start

1. Get a model

# Pull a pre-converted model (instant)
zse pull qwen-7b              # 5.18 GB
zse pull qwen-32b             # 17.9 GB
zse pull mistral-7b           # 3.86 GB

# Or convert any HuggingFace model yourself
zse convert Qwen/Qwen2.5-7B-Instruct qwen-7b.zse --quant int4

2. Serve

zse serve qwen-7b.zse --port 8000

OpenAI-compatible API at http://localhost:8000/v1:

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="zse")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain mixture of experts in one paragraph."}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
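
If you would rather not install the `openai` package, the stream can be consumed with the standard library alone. The sketch below assumes the server follows the usual OpenAI streaming wire format (`data: {json}` lines ending with `data: [DONE]`); `iter_deltas` and the sample lines are ours, for illustration only.

```python
import json


def iter_deltas(lines):
    """Yield text deltas from the lines of an OpenAI-style SSE chat stream."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample of what a streamed response might look like on the wire.
sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}\n',
    b"\n",
    b'data: {"choices": [{"delta": {"content": "Hello"}}]}\n',
    b'data: {"choices": [{"delta": {"content": ", world"}}]}\n',
    b"data: [DONE]\n",
]
print("".join(iter_deltas(sample)))
```

Against a live server, the same function should accept the response object returned by `urllib.request.urlopen` for a POST to `/v1/chat/completions` with `"stream": true`, since that object iterates over lines.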

3. Multi-GPU (optional)

zse serve qwen-72b.zse --tp 4 --port 8000        # tensor parallel

Features

Inference engine

OpenAI-compatible API (/v1/chat/completions, /v1/completions, /v1/models)
Continuous batching with disaggregated prefill/decode scheduling
SLO-aware request ordering, predictive memory budgeting, chunked prefill
Speculative decoding (n-gram + self-draft, lossless accept/reject)
CUDA Graphs + HIP Graphs for low-latency decode
Tensor parallelism (NCCL/RCCL, multi-process weight sharding)
LoRA hot-swap: hundreds of adapters per GPU, per-request routing
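
As an illustration of the n-gram half of speculative decoding, here is a minimal sketch under greedy sampling. It is not ZSE's scheduler: the function names are ours, `target_next` stands in for the real model, and a real engine would verify all drafted positions in one batched forward pass rather than one call per token. Self-draft is not shown.

```python
def ngram_draft(tokens, n=2, max_draft=4):
    """Propose tokens by finding the most recent earlier occurrence of the last n tokens."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan backwards, starting one position before the tail itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + max_draft]
    return []


def speculative_step(tokens, target_next, n=2, max_draft=4):
    """One greedy step: keep drafted tokens only while they equal the target's choice."""
    out = list(tokens)
    for proposed in ngram_draft(tokens, n, max_draft):
        expected = target_next(out)
        out.append(expected)      # only the target's own token is ever emitted
        if proposed != expected:  # first mismatch ends the step
            return out
    out.append(target_next(out))  # bonus token after a fully accepted draft
    return out


def toy_target(seq):
    """Stand-in model: counts 0, 1, 2, 0, 1, 2, ..."""
    return (seq[-1] + 1) % 3


print(speculative_step([0, 1, 2, 0, 1], toy_target))
```

Because every emitted token is the target's own greedy choice, the output is identical to plain greedy decoding; the draft only changes how many tokens one step can produce.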

Model format (.zse)

Pre-quantized INT4 / INT8 / FP16, mmap-friendly
One file = weights + tokenizer + config + kernel cache
Architectures supported out of the box: Llama, Mistral, Qwen2, Gemma2, Phi3
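
To show why an mmap-friendly layout loads quickly, here is a minimal sketch of a single-file container read through `mmap`. The layout (8-byte length, JSON index, raw payloads) and the function names are hypothetical and chosen for illustration; the real `.zse` binary layout is not described in this README.

```python
import json
import mmap
import os
import struct
import tempfile


def write_container(path, tensors):
    """Write {name: bytes} as: 8-byte little-endian index length, JSON index, raw payloads."""
    index, offset = {}, 0
    for name, blob in tensors.items():
        index[name] = {"offset": offset, "nbytes": len(blob)}
        offset += len(blob)
    header = json.dumps(index).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header)))
        f.write(header)
        for blob in tensors.values():
            f.write(blob)


def open_container(path):
    """Map the file and return (mmap, {name: memoryview}) without copying payload bytes."""
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (header_len,) = struct.unpack_from("<Q", mapped, 0)
    index = json.loads(mapped[8:8 + header_len].decode("utf-8"))
    base = 8 + header_len
    view = memoryview(mapped)
    views = {
        name: view[base + e["offset"]:base + e["offset"] + e["nbytes"]]
        for name, e in index.items()
    }
    return mapped, views


path = os.path.join(tempfile.mkdtemp(), "demo.bin")
write_container(path, {"wq": b"\x01\x02\x03\x04", "wk": b"\x05\x06"})
mapped, views = open_container(path)
print(bytes(views["wk"]))
```

Opening the file only parses the small index; payload pages are faulted in by the OS when first touched, which is the property that makes mmap-based loading cheap compared with deserializing every tensor up front.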

Built-in RAG

Hybrid retrieval: BM25 + TF-IDF + dense embeddings (mean-pooled LLM hidden states — no extra model)
Reciprocal Rank Fusion + LLM cross-encoder reranker
ZPF compressed document format — 25% fewer LLM tokens at 100% retrieval accuracy
PDF parser handles encrypted files (RC4 / AES-128 / AES-256), multi-column reflow, object streams (/ObjStm), and an OCR hook
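
Reciprocal Rank Fusion, used above to merge the BM25, TF-IDF, and dense rankings, is simple enough to sketch in a few lines. This is the textbook formulation, not ZSE's code: the function name and document IDs are ours, and `k=60` is the conventional constant, which may differ from what ZSE uses.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25 = ["d3", "d1", "d2"]
tfidf = ["d1", "d3", "d4"]
dense = ["d1", "d2", "d3"]
print(reciprocal_rank_fusion([bm25, tfidf, dense]))
```

Because only ranks enter the score, the three retrievers never need their raw scores calibrated against each other.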

Server

API key management + per-key RPM/TPM rate limiting (SQLite)
Admin API, LoRA management API, RAG ingest API
Web dashboard for chat + session management
SSE streaming, pure asyncio, zero web framework dependency
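
For a sense of how per-key rate limiting can sit on top of SQLite with no other dependency, here is a minimal sliding-window RPM sketch. The class name, table schema, and `now` parameter are ours and hypothetical; ZSE's actual schema and its TPM accounting are not shown in this README.

```python
import sqlite3
import time


class RpmLimiter:
    """Sliding-window requests-per-minute limiter keyed by API key, stored in SQLite."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS hits (api_key TEXT, ts REAL)")
        self.db.execute("CREATE INDEX IF NOT EXISTS hits_key_ts ON hits (api_key, ts)")

    def allow(self, api_key, rpm, now=None):
        """Record the request and return True if the key is under its limit."""
        now = time.time() if now is None else now
        with self.db:  # one transaction: prune old hits, count, maybe record
            self.db.execute(
                "DELETE FROM hits WHERE api_key = ? AND ts <= ?", (api_key, now - 60)
            )
            (used,) = self.db.execute(
                "SELECT COUNT(*) FROM hits WHERE api_key = ?", (api_key,)
            ).fetchone()
            if used >= rpm:
                return False
            self.db.execute("INSERT INTO hits VALUES (?, ?)", (api_key, now))
            return True


limiter = RpmLimiter()
print([limiter.allow("key-a", rpm=2, now=t) for t in (0.0, 1.0, 2.0, 60.5)])
```

Passing `now` explicitly keeps the example deterministic; in a server the default `time.time()` would be used.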

Kernel compiler (zse-compiler)

Write GPU kernels in pure Python with @zse.kernel
Emits CUDA C, HIP C, and Metal Shading Language
Auto-tuning, kernel fusion, WMMA / MFMA matrix-core intrinsics
Standalone — pip install zse-compiler works on its own
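
To give a feel for the Python-to-GPU-code idea, here is a toy translator built on the standard `ast` module. It is a sketch under heavy assumptions, not the zse-compiler API: it does not use `@zse.kernel`, it handles only float arrays with plain elementwise assignments, and it treats `i` as an implicit thread index by convention.

```python
import ast

SOURCE = '''
def vector_add(a, b, out, n):
    out[i] = a[i] + b[i]
'''


def emit_cuda(source):
    """Translate a one-function elementwise kernel into CUDA C source text."""
    func = ast.parse(source).body[0]
    if not isinstance(func, ast.FunctionDef):
        raise TypeError("expected a single function definition")
    # Convention of this sketch: the last argument is the element count.
    *arrays, length = [arg.arg for arg in func.args.args]
    params = [f"float* {name}" for name in arrays] + [f"int {length}"]
    body = []
    for stmt in func.body:
        if not isinstance(stmt, ast.Assign):
            raise NotImplementedError("only plain assignments are supported")
        # Subscripts and arithmetic unparse to text that is also valid C.
        body.append(f"        {ast.unparse(stmt)};")
    return "\n".join([
        f'extern "C" __global__ void {func.name}({", ".join(params)}) {{',
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;",
        f"    if (i < {length}) {{",
        *body,
        "    }",
        "}",
    ])


print(emit_cuda(SOURCE))
```

A real compiler needs type inference, an IR, and per-backend codegen, as the feature list above indicates; the sketch only shows the first step of walking a Python AST and emitting C-like text.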

Architecture

┌───────────────────────────────────────────────────────────────────┐
│  HTTP / SSE  ·  OpenAI API  ·  Web dashboard  ·  API key + RAG   │
├───────────────────────────────────────────────────────────────────┤
│           ZStreamer — continuous batching, scheduling             │
├───────────────────────────────────────────────────────────────────┤
│   Orchestrator   │   KV Cache (PagedAttention)   │   LoRA Mgr    │
│   29 GPU kernels │   adaptive blocks · token-evict│  hot-swap     │
├───────────────────────────────────────────────────────────────────┤
│  .zse format    │  VRAM allocator (unified)  │  CUDA/HIP Graphs  │
├───────────────────────────────────────────────────────────────────┤
│         ZSE Kernel Compiler — Python DSL → GPU code               │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐   │
│   │  CUDA C     │    │   HIP C     │    │  Metal Shading Lang │   │
│   │  (nvrtc)    │    │  (hiprtc)   │    │  (Metal compiler)   │   │
│   └─────────────┘    └─────────────┘    └─────────────────────┘   │
└───────────────────────────────────────────────────────────────────┘
              No PyTorch · No Triton · No transformers

Honest Limitations

We believe in numbers, not marketing. Things ZSE does not beat vLLM on yet:

Concurrent throughput at N≥4 on INT4. vLLM's hand-tuned AWQ Marlin kernels reach memory-bandwidth ceilings we haven't matched yet on NVIDIA. On AMD, our wave-64 bgemv rewrite narrowed the gap to 2.12×; the NVIDIA-side equivalent is the next throughput lever. See CLAUDE.md Gap #6 for the full story.
Apple Silicon full inference. Validated at kernel level on M1 (end-to-end vector_add: 0/1024 mismatches). The full transformer inference path is wired up but has not yet been run on hardware.
Tensor parallelism in socket-restricted environments. All NCCL primitives are validated multi-GPU on Modal, and full TP inference works on bare-metal multi-GPU servers, but the worker bootstrap needs real network access. This is not a code bug: Modal's sandbox blocks the AF_UNIX sockets that ncclCommInitRank uses in child processes.

If steady-state batched throughput is your only metric and you have several times the VRAM budget (roughly 2× to 7× in the VRAM table above), use vLLM. If you care about cold start, footprint, vendor lock-in, or running on anything other than an H100, use ZSE.

Benchmark Reproduction

All numbers in this README are reproducible. Scripts live in tests/:

modal run tests/test_modal_benchmark_7b_rtx.py        # T4, L4, A10G  vs vLLM AWQ
modal run tests/test_modal_bench_vs_vllm.py           # A100-80GB     vs vLLM AWQ + FP16
python tests/bench_zse_mi300x_v3.py                   # MI300X

Raw JSON outputs for every run are committed alongside the scripts.

What's Inside

zse-compiler/        Pure-Python kernel compiler. Standalone, pip-installable.
  ast_parser/        Python AST → IR
  ir/                25+ IR node types, fusion pass, type inference
  codegen/           CUDA · HIP · Metal backends
  runtime/           NVRTC · HIPRTC · Metal · NCCL/RCCL · auto-tune · profiler

zse-engine/          Production inference engine.
  format/            .zse binary format, quantization, conversion CLI
  orchestrator/      29 GPU kernels, model runner, sampler, VRAM allocator
  cache/             PagedAttention, dedup, smart eviction, COW forking
  zstreamer/         Continuous batching, SLO scheduling, spec-decode
  server/            HTTP, OpenAI API, auth, rate limit, admin, LoRA, RAG
  rag/               Hybrid retrieval, reranker, ZPF, PDF parser (full PDF spec)

~40,600 lines of code. Zero third-party dependencies.

License

Apache 2.0 — see LICENSE.

Acknowledgments

This project is supported by:

ZSE's AMD MI300X validation, the 32B-parameter benchmarks, and a large share of our ROCm kernel development work were made possible by DigitalOcean's Open Source Sponsorship Program, which provides cloud GPU credits to independent open-source projects. The MI300X results throughout this README (cold start, VRAM, throughput, the wave-64 INT4 GEMV rewrite) were all measured on DigitalOcean infrastructure. Thank you to the DigitalOcean team for backing zero-dep infrastructure work.

If you maintain an open-source project that needs serious GPU time, apply here: https://www.digitalocean.com/open-source/credits-for-projects

Contact

Website: zllm.in
Company: Zyora Labs
Email: zse@zyoralabs.com
Sponsor: github.com/sponsors/Zyora-Dev

Built in Nagercoil, India. Run anywhere a GPU runs.

Release history

2.0.1	May 22, 2026
2.0.0	May 22, 2026 (this version)

