First OSS production-grade serving engine for diffusion language models
Reason this release was yanked:
Crashes on first run because the bitsandbytes runtime dependency is missing; fixed in 0.1.1.
Project description
dlmserve
OpenAI-compatible HTTP serving for diffusion language models. v0.1 supports LLaDA-8B-Instruct and LLaDA-1.5; Dream-7B is planned for v0.1.1 (issue #1).
Why
Diffusion LLMs use bidirectional attention, a fixed-length canvas, and confidence-ranked parallel commit — not the causal attention, growing KV cache, and one-token decode loop that mainstream serving engines are built around. dlmserve is designed around the diffusion contract directly: per-step batching, no KV reuse assumption, and per-row acceleration (LocalLeap) that composes with batching.
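To make the "fixed-length canvas, confidence-ranked parallel commit" contract concrete, here is a toy sketch. It is not dlmserve's implementation: `toy_model` is a hypothetical stand-in that returns random confidences, and the even per-step commit schedule is an assumption chosen for simplicity.

```python
import random

MASK = None  # marks a canvas slot that has not been committed yet


def toy_model(canvas, rng):
    """Hypothetical stand-in for a bidirectional model.

    Proposes a (token, confidence) pair for every masked slot at once.
    """
    return {
        i: (f"tok{i}", rng.random())
        for i, slot in enumerate(canvas)
        if slot is MASK
    }


def denoise(canvas_len, num_steps, seed=0):
    rng = random.Random(seed)
    canvas = [MASK] * canvas_len            # fixed-length canvas, fully masked
    per_step = -(-canvas_len // num_steps)  # ceiling division: slots per step
    for _ in range(num_steps):
        proposals = toy_model(canvas, rng)
        if not proposals:
            break
        # Rank masked positions by confidence, commit the top ones together.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:per_step]:
            canvas[i] = proposals[i][0]
    return canvas


print(denoise(canvas_len=8, num_steps=4))
```

Note that the canvas never grows and positions are filled out of order, which is why a causal KV cache and a one-token decode loop do not map onto this process.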
Quick start
pip install dlmserve
# Serve LLaDA-8B-Instruct (downloads ~5.6 GB INT4 weights on first run)
dlmserve
# Or with Docker
docker run --gpus all -p 8000:8000 \
-e DLMSERVE_MODEL=gsai-ml/LLaDA-8B-Instruct \
ghcr.io/ioptimizethings/dlmserve:latest
# Use it
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gsai-ml/LLaDA-8B-Instruct","messages":[{"role":"user","content":"What is the capital of France?"}],"num_denoising_steps":16}'
Once the server is running, interactive API docs are available at http://localhost:8000/docs (Swagger UI) or http://localhost:8000/redoc, and Prometheus metrics are exposed at /metrics.
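The same request can be issued from Python. This is a minimal sketch using only the standard library; `build_chat_request` and `send` are illustrative helper names, not part of dlmserve, and reading `choices[0].message.content` assumes the response follows the standard OpenAI chat-completion shape.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="gsai-ml/LLaDA-8B-Instruct",
                       num_denoising_steps=16):
    """Build (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "num_denoising_steps": num_denoising_steps,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With a server running, `send(build_chat_request("What is the capital of France?"))["choices"][0]["message"]["content"]` would return the reply text under that assumption.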
Examples
# Interactive multi-turn chat (loads model locally, no server, single user)
uv run python examples/chat.py
uv run python examples/chat.py --model gsai-ml/LLaDA-1.5 --local-leap
# Compare dlmserve throughput vs raw HuggingFace generate() (shows batching speedup)
uv run python benchmarks/compare_hf.py
examples/chat.py runs at batch=1 by design (one user, one prompt at a time). Batching is a server feature — it kicks in when multiple clients hit the running dlmserve HTTP server concurrently. To see batching numbers, run compare_hf.py or hit the server with concurrent requests.
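One way to hit the server with concurrent requests is a small thread pool. The sketch below is an assumption-laden illustration: `run_concurrently` is a hypothetical helper, and `send` is whatever callable posts one prompt to the running server. A fake `send` is used here so the snippet runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(send, prompts, max_workers=8):
    """Call send(prompt) for every prompt in parallel threads.

    Results are returned in the same order as the prompts.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt):
    # Placeholder for a real HTTP call to /v1/chat/completions.
    return f"echo: {prompt}"


print(run_concurrently(fake_send, ["a", "b", "c"]))
```

Setting `max_workers` to the server's batch cap keeps enough requests in flight for them to share forward passes.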
Performance (RTX 5070 12 GB, INT4)
| Mode | LLaDA-8B-Instruct | LLaDA-1.5 |
|---|---|---|
| Batch=1, baseline | 32.9 tok/s (1.01× HF ref) | 32.7 tok/s (1.00× HF ref) |
| Batch=4, baseline | 81.7 tok/s (2.52× HF ref) | 82.0 tok/s (2.51× HF ref) |
| Batch=8, baseline | 110.6 tok/s | 110.3 tok/s |
| Batch=1, +LocalLeap | 58.1 tok/s (~1.8× baseline) | 56.5 tok/s (~1.7× baseline) |
| Batch=8, +LocalLeap | 146.8 tok/s (~4.5× batch=1 baseline) | 147.2 tok/s |
dlmserve batch=1 matches the HF reference loop (reference/llada_reference.py) to within measurement noise — token-identical at temperature=0 (proven by tests/test_reference_match.py). The throughput gain comes from step-level batching and optional LocalLeap, not from changing the math.
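The relative speedups can be rechecked from the LLaDA-8B-Instruct column of the table above. The short dictionary keys are labels invented for this sketch. Ratios are taken against the batch=1 baseline only; the "× HF ref" figures use the HF reference throughput, which the table does not list, so they are not recomputed here.

```python
# Throughput in tok/s, copied from the LLaDA-8B-Instruct column.
tok_s = {
    "b1": 32.9,
    "b4": 81.7,
    "b8": 110.6,
    "b1_leap": 58.1,
    "b8_leap": 146.8,
}


def speedup(mode, base="b1"):
    """Ratio of two table entries, rounded to two decimals."""
    return round(tok_s[mode] / tok_s[base], 2)


print(speedup("b8"))             # batching alone
print(speedup("b1_leap"))        # LocalLeap alone
print(speedup("b8_leap"))        # batching and LocalLeap combined
print(speedup("b8_leap", "b8"))  # LocalLeap gain at batch=8
```

The last ratio is smaller than the batch=1 LocalLeap gain, so in these numbers the two optimizations compose but not multiplicatively.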
Full numbers, settings, and reproduction: docs/benchmarks.md and docs/perf_log.md.
Supported models
| Model | Status | INT4 VRAM |
|---|---|---|
| `gsai-ml/LLaDA-8B-Instruct` | ✓ v0.1 | ~5.6 GB |
| `gsai-ml/LLaDA-1.5` | ✓ v0.1 | ~5.6 GB |
| `Dream-org/Dream-v0-Instruct-7B` | v0.1.1 (#1) | ~5.6 GB |
| `diffusionfamily/diffullama` | v0.1.1 (#3) | ~5.6 GB |
| LLaDA-2.0 (inclusionAI) | v0.1.1 (#2) | — |
Batching
Automatic continuous batching at the denoising-step level. Concurrent requests share a forward pass, capped by DLMSERVE_MAX_BATCH (default 8). LocalLeap composes per-row on top. Live batch-size distribution at /metrics (dlmserve_step_batch_size). Opt out with force_single_batch: true for bit-reproducible output.
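The scheduling idea can be illustrated with a toy simulation. This is a sketch of step-level continuous batching in general, not dlmserve's scheduler: `simulate` is a hypothetical function, requests are reduced to a count of remaining denoising steps, and admission is assumed to happen only at step boundaries.

```python
from collections import deque


def simulate(arrivals, steps_per_request, max_batch=8):
    """Return the batch size of each forward pass.

    arrivals: step indices at which requests arrive.
    """
    waiting = deque(sorted(arrivals))
    active = []        # remaining steps for each in-flight request
    batch_sizes = []
    step = 0
    while waiting or active:
        # Admit arrived requests at the step boundary, up to the cap.
        while waiting and waiting[0] <= step and len(active) < max_batch:
            waiting.popleft()
            active.append(steps_per_request)
        if active:
            batch_sizes.append(len(active))  # one shared forward pass
            active = [r - 1 for r in active if r > 1]
        step += 1
    return batch_sizes


# A second request joins one step after the first.
print(simulate(arrivals=[0, 1], steps_per_request=2))
```

A late request joins the next step instead of waiting for the earlier one to finish, which is what distinguishes step-level batching from request-level batching.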
API
OpenAI-compatible /v1/chat/completions with documented deviations (ADR 005).
Diffusion-specific parameters (beyond OpenAI spec):
| Param | Default | Description |
|---|---|---|
| `num_denoising_steps` | 16 | More steps = higher quality, lower throughput. Range [1, 64]. |
| `block_length` | = `max_tokens` | Denoising block size. |
| `use_local_leap` | false | LocalLeap anchor-propagation acceleration (arXiv:2510.07081). |
| `force_single_batch` | false | Disable batching for reproducible output. |
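A client can mirror these defaults and the documented range check before sending a request. The `DiffusionParams` class below is a hypothetical client-side helper, not part of dlmserve; leaving `block_length` unset assumes the server applies its own default of `max_tokens`.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DiffusionParams:
    num_denoising_steps: int = 16
    block_length: Optional[int] = None  # None: rely on the server default
    use_local_leap: bool = False
    force_single_batch: bool = False

    def __post_init__(self):
        # Documented range for num_denoising_steps is [1, 64].
        if not 1 <= self.num_denoising_steps <= 64:
            raise ValueError("num_denoising_steps must be in [1, 64]")

    def to_request_fields(self):
        """Fields to merge into the chat-completions JSON body."""
        return {k: v for k, v in asdict(self).items() if v is not None}


print(DiffusionParams(use_local_leap=True).to_request_fields())
```

Validating locally gives a clear error before any network round trip.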
Environment variables
| Variable | Default | Description |
|---|---|---|
| `DLMSERVE_MODEL` | `gsai-ml/LLaDA-8B-Instruct` | Model ID (HuggingFace). |
| `DLMSERVE_DTYPE` | `int4` | Weight dtype: int4, fp16, bf16. |
| `DLMSERVE_DEVICE` | `cuda` | Device. |
| `DLMSERVE_PORT` | `8000` | HTTP port. |
| `DLMSERVE_MAX_BATCH` | `8` | Max concurrent requests per step. |
| `DLMSERVE_LOG_LEVEL` | `info` | Log level: debug, info, warning, error. |
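For wrapper scripts or tests, the variables and defaults above can be resolved like this. It is a sketch only: `Settings` and `load_settings` are hypothetical names, and dlmserve's own configuration loader may behave differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    model: str
    dtype: str
    device: str
    port: int
    max_batch: int
    log_level: str


def load_settings(env=None):
    """Read DLMSERVE_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    dtype = env.get("DLMSERVE_DTYPE", "int4")
    if dtype not in {"int4", "fp16", "bf16"}:
        raise ValueError(f"unsupported DLMSERVE_DTYPE: {dtype}")
    return Settings(
        model=env.get("DLMSERVE_MODEL", "gsai-ml/LLaDA-8B-Instruct"),
        dtype=dtype,
        device=env.get("DLMSERVE_DEVICE", "cuda"),
        port=int(env.get("DLMSERVE_PORT", "8000")),
        max_batch=int(env.get("DLMSERVE_MAX_BATCH", "8")),
        log_level=env.get("DLMSERVE_LOG_LEVEL", "info"),
    )


print(load_settings({"DLMSERVE_PORT": "9000"}))
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.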
Honest limitations
- Linux x86_64 only — macOS and Windows are not supported. Windows users: WSL2 with CUDA passthrough may work but is untested and unsupported.
- Single-GPU only — multi-GPU TP is deferred to v0.5+.
- INT4 fits in 12 GB; FP16 weights (~16 GB) need a 24 GB+ card.
- Docker image targets SM 8.0–8.9 (A100, H100, RTX 3090/4090/A6000). Blackwell GPUs (RTX 50-series, SM 12.0) are not supported by the bundled image because PyTorch 2.5.1 has no SM 12.0 kernels yet. On Blackwell, install from source (`pip install dlmserve`) against a PyTorch nightly that ships SM 12.0. See docs/docker.md.
- Attention backend is PyTorch SDPA. FlashAttention-2 is optional (`pip install dlmserve[attn]`) and HF will use it automatically when present, but FA2 also lacks SM 12.0 kernels, so Blackwell stays on SDPA regardless.
- Per-step SSE streaming is not yet implemented: v0.1 emits one SSE chunk for the full output (issue #5).
- `max_tokens` is a canvas size, not a stop threshold: generation always fills the canvas and truncates at the first EOS. See ADR 005.
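The canvas-size semantics of `max_tokens` from the last item can be pictured as a post-processing step on the filled canvas. This is an illustrative sketch: `truncate_at_eos` is a hypothetical helper and the token IDs, including the EOS ID, are made up.

```python
def truncate_at_eos(canvas, eos_id):
    """Return the tokens before the first EOS.

    If no EOS appears, the whole canvas is returned.
    """
    try:
        return list(canvas[:canvas.index(eos_id)])
    except ValueError:
        return list(canvas)


EOS = 2  # made-up EOS token ID for illustration
filled = [5, 7, 9, EOS, 4, EOS, 1, 3]  # a canvas of max_tokens = 8 slots
print(truncate_at_eos(filled, EOS))
```

All eight slots are denoised regardless of where the EOS lands, so a smaller `max_tokens` reduces work even when the visible reply is short.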
Built on
- LLaDA: Large Language Diffusion with mAsking. Nie et al., 2025. Model weights: `gsai-ml/LLaDA-8B-Instruct` (MIT).
- LocalLeap: Accelerating Diffusion Language Models via Local Determinism Propagation. Klear Team, 2025 (arXiv:2510.07081). Apache-2.0.
Full attribution: CREDITS.md.
Roadmap
- v0.1.1: Dream-7B, LLaDA-2.0, DiffuLLaMA INT4, Fast-dLLM KV cache, per-step SSE
- v0.2: BD3-LMs block diffusion, AdaBlock-dLLM adaptive block size
- v0.5+: Multi-GPU tensor parallelism
Contributing
See CONTRIBUTING.md. Issues and PRs welcome.
License
MIT — see LICENSE.
File details
Details for the file dlmserve-0.1.0.tar.gz.
File metadata
- Download URL: dlmserve-0.1.0.tar.gz
- Upload date:
- Size: 251.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `3886ebda0115aa6b7081576cc17ca13ba0454a99e8a5cc0492b04eb463a762b8` |
| MD5 | `efa6e1732d95bc44abb6e4f4414ec9bf` |
| BLAKE2b-256 | `ff25cdf63bdfc8643e3863e2ef11294de8cf8d6c367e87604ddf465e186b6c6c` |
File details
Details for the file dlmserve-0.1.0-py3-none-any.whl.
File metadata
- Download URL: dlmserve-0.1.0-py3-none-any.whl
- Upload date:
- Size: 31.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `a1cf6e5d76f176b42208fcc94144e0d7e89165ec9a5da3b9363f9265901ee2b9` |
| MD5 | `4f2fae64bfeee16b43f06c16ed3fda3f` |
| BLAKE2b-256 | `c01abd8587abc4f08e398df5b1e40e417fd22855cea007c7b789b077a7657da6` |
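A downloaded file can be checked against the SHA256 digests listed above with the standard library. A minimal sketch; `sha256_of` and `verify` are illustrative helper names, and the file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's SHA256 matches the published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify("dlmserve-0.1.0.tar.gz", "3886ebda0115aa6b7081576cc17ca13ba0454a99e8a5cc0492b04eb463a762b8")` should return `True` for an intact copy of the source distribution.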