First OSS production-grade serving engine for diffusion language models

dlmserve

OpenAI-compatible HTTP serving for diffusion language models. v0.1 supports LLaDA-8B-Instruct and LLaDA-1.5; Dream-7B is targeted for v0.1.1 (issue #1).

Why

Diffusion LLMs use bidirectional attention, a fixed-length canvas, and confidence-ranked parallel commit. Mainstream serving engines are built around the opposite assumptions: causal attention, a growing KV cache, and a one-token-at-a-time decode loop. dlmserve is designed around the diffusion contract directly: per-step batching, no KV-reuse assumption, and per-row acceleration (LocalLeap) that composes with batching.
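To make the contrast with one-token decoding concrete, here is a toy sketch of confidence-ranked parallel commit on a fixed-length canvas. It is not dlmserve's or LLaDA's implementation: `toy_model`, `MASK`, and the even per-step commit schedule are hypothetical stand-ins chosen for illustration.

```python
import math
import random

MASK = None  # hypothetical marker for a not-yet-committed canvas position


def toy_model(canvas, rng):
    """Stand-in for a bidirectional forward pass.

    Proposes a (token, confidence) pair for every canvas position.
    A real model would condition on the canvas; this toy ignores it.
    """
    return [(f"tok{i}", rng.random()) for i in range(len(canvas))]


def denoise(canvas_len, num_steps, seed=0):
    """Fill a fixed-length canvas by committing the most confident masked positions each step."""
    rng = random.Random(seed)
    canvas = [MASK] * canvas_len
    history = []  # number of positions committed at each step
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(canvas) if tok is MASK]
        if not masked:
            break
        proposals = toy_model(canvas, rng)
        # Spread the remaining masked positions evenly over the remaining steps.
        k = math.ceil(len(masked) / (num_steps - step))
        # Rank masked positions by confidence and commit the top k in parallel.
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            canvas[i] = proposals[i][0]
        history.append(k)
    return canvas, history


canvas, history = denoise(canvas_len=8, num_steps=4)
print(history)  # positions committed per step
```

The canvas length never changes and several positions are committed per step, which is why a serving loop built around appending one token to a growing sequence does not map onto this workload.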

Quick start

pip install dlmserve

# Serve LLaDA-8B-Instruct (downloads ~5.6 GB INT4 weights on first run)
dlmserve

# Or with Docker
docker run --gpus all -p 8000:8000 \
  -e DLMSERVE_MODEL=gsai-ml/LLaDA-8B-Instruct \
  ghcr.io/ioptimizethings/dlmserve:latest

# Use it
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gsai-ml/LLaDA-8B-Instruct","messages":[{"role":"user","content":"What is the capital of France?"}],"num_denoising_steps":16}'

Once the server is running, interactive API docs are available at http://localhost:8000/docs (Swagger UI) and http://localhost:8000/redoc. Prometheus metrics are exposed at /metrics.
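The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above. `build_chat_request` and `chat` are hypothetical helper names, and the response is assumed to follow the usual OpenAI chat-completions shape (`choices[0].message.content`), which this README implies but does not spell out.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default port from the quick start


def build_chat_request(prompt, model="gsai-ml/LLaDA-8B-Instruct", num_denoising_steps=16):
    """Build the same JSON body as the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "num_denoising_steps": num_denoising_steps,
    }


def chat(prompt, base_url=BASE_URL, timeout=120):
    """POST to /v1/chat/completions and return the reply text.

    Assumes an OpenAI-style response body; requires a running server.
    """
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


# Building the payload needs no server, so it is safe to run anywhere.
print(json.dumps(build_chat_request("What is the capital of France?"), indent=2))
```

With a server running, `chat("What is the capital of France?")` would return the reply text under the response-shape assumption above.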

Examples

# Interactive multi-turn chat (loads model locally, no server, single user)
uv run python examples/chat.py
uv run python examples/chat.py --model gsai-ml/LLaDA-1.5 --local-leap

# Compare dlmserve throughput vs raw HuggingFace generate() (shows batching speedup)
uv run python benchmarks/compare_hf.py

examples/chat.py runs at batch=1 by design (one user, one prompt at a time). Batching is a server feature: it only takes effect when multiple clients hit the running dlmserve HTTP server concurrently. To see batching numbers, run benchmarks/compare_hf.py or send concurrent requests to the server.
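One way to send concurrent requests is a thread pool. This is a sketch, not the project's benchmark script: `fire_concurrently` is a hypothetical helper, and `send` stands for any callable that performs one blocking request (for example, a function that POSTs to /v1/chat/completions).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fire_concurrently(send, prompts, max_workers=8):
    """Call send(prompt) for every prompt in parallel.

    Returns (results, elapsed_seconds); results keep the order of prompts.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(send, prompts))
    return results, time.perf_counter() - start


def fake_send(prompt):
    """Stand-in for a real HTTP call so the sketch runs without a server."""
    time.sleep(0.05)
    return f"echo: {prompt}"


prompts = [f"question {i}" for i in range(8)]
results, elapsed = fire_concurrently(fake_send, prompts)
print(len(results), f"{elapsed:.2f}s")
```

Swapping `fake_send` for a real request function puts up to `max_workers` requests in flight at once, which gives the server something to batch.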

Performance (RTX 5070 12 GB, INT4)

Mode	LLaDA-8B-Instruct	LLaDA-1.5
Batch=1, baseline	32.9 tok/s (1.01× HF ref)	32.7 tok/s (1.00× HF ref)
Batch=4, baseline	81.7 tok/s (2.52× HF ref)	82.0 tok/s (2.51× HF ref)
Batch=8, baseline	110.6 tok/s	110.3 tok/s
Batch=1, +LocalLeap	58.1 tok/s (~1.8× baseline)	56.5 tok/s (~1.7× baseline)
Batch=8, +LocalLeap	146.8 tok/s (~4.5× batch=1 baseline)	147.2 tok/s

At batch=1, dlmserve's throughput matches the HF reference loop (reference/llada_reference.py) to within measurement noise, and its output is token-identical at temperature=0 (checked by tests/test_reference_match.py). The throughput gain comes from step-level batching and optional LocalLeap, not from changing the math.
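The ratios quoted in the table follow directly from the throughput figures. A quick check for the LLaDA-8B-Instruct column, using only the numbers reported above:

```python
# Throughput figures (tok/s) copied from the LLaDA-8B-Instruct column above.
baseline_b1 = 32.9
baseline_b8 = 110.6
leap_b1 = 58.1
leap_b8 = 146.8


def speedup(value, reference):
    """Return value / reference rounded to two decimals."""
    return round(value / reference, 2)


print(speedup(leap_b1, baseline_b1))      # LocalLeap alone at batch=1
print(speedup(baseline_b8, baseline_b1))  # batching alone at batch=8
print(speedup(leap_b8, baseline_b8))      # LocalLeap on top of batch=8
print(speedup(leap_b8, baseline_b1))      # both combined vs. batch=1 baseline
```

On these figures, LocalLeap adds roughly 1.77× at batch=1 and roughly 1.33× at batch=8, and the combined 4.46× agrees with the ~4.5× in the table.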

Full numbers, settings, and reproduction: docs/benchmarks.md and docs/perf_log.md.

Supported models

Model	Status	INT4 VRAM
`gsai-ml/LLaDA-8B-Instruct`	✓ v0.1	~5.6 GB
`gsai-ml/LLaDA-1.5`	✓ v0.1	~5.6 GB
`Dream-org/Dream-v0-Instruct-7B`	v0.1.1 (#1)	~5.6 GB
`diffusionfamily/diffullama`	v0.1.1 (#3)	~5.6 GB
`LLaDA-2.0 (inclusionAI)`	v0.1.1 (#2)	—

Batching

dlmserve batches continuously and automatically at the denoising-step level: concurrent requests share each forward pass, up to DLMSERVE_MAX_BATCH requests per step (default 8). LocalLeap composes per row on top. The live batch-size distribution is exported at /metrics as dlmserve_step_batch_size. To opt out, set force_single_batch: true for bit-reproducible output.
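The scheduling idea can be illustrated with a small simulation. This is a toy model under stated assumptions, not dlmserve's scheduler: `simulate`, the FIFO admission policy, and the arrival format are all hypothetical. Each request needs a fixed number of denoising steps; at every step, up to `max_batch` active requests share one forward pass, and waiting requests are admitted as soon as a slot is free.

```python
from collections import deque


def simulate(arrivals, max_batch=8):
    """Simulate step-level continuous batching.

    arrivals: list of (arrival_step, num_steps) pairs, sorted by arrival_step.
    Returns the batch size of every step in which a forward pass ran.
    """
    waiting = deque(arrivals)
    active = []       # remaining steps for each in-flight request
    batch_sizes = []
    step = 0
    while waiting or active:
        # Admit requests that have arrived while there is room in the batch.
        while waiting and waiting[0][0] <= step and len(active) < max_batch:
            _, num_steps = waiting.popleft()
            active.append(num_steps)
        if active:
            batch_sizes.append(len(active))            # one shared forward pass
            active = [r - 1 for r in active if r > 1]  # drop finished requests
        step += 1
    return batch_sizes


# Three requests of 4 steps each; the third arrives two steps late.
print(simulate([(0, 4), (0, 4), (2, 4)], max_batch=8))  # [2, 2, 3, 3, 1, 1]
```

The late request joins at step 2 without waiting for the first two to finish, which is the property that distinguishes step-level batching from batching whole requests.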

API

OpenAI-compatible /v1/chat/completions with documented deviations (ADR 005).

Diffusion-specific parameters (beyond OpenAI spec):

Param	Default	Description
`num_denoising_steps`	16	More steps = higher quality, lower throughput. Range [1, 64].
`block_length`	= `max_tokens`	Denoising block size.
`use_local_leap`	false	LocalLeap anchor-propagation acceleration (arXiv:2510.07081).
`force_single_batch`	false	Disable batching for reproducible output.
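As a sketch of how these defaults interact, the helper below fills in the diffusion-specific fields for a request body and checks the documented range. `diffusion_params` is a hypothetical client-side helper; the server's own validation may differ.

```python
def diffusion_params(max_tokens, num_denoising_steps=16, block_length=None,
                     use_local_leap=False, force_single_batch=False):
    """Return the diffusion-specific request fields with the documented defaults applied."""
    if not 1 <= num_denoising_steps <= 64:
        raise ValueError("num_denoising_steps must be in [1, 64]")
    if block_length is None:
        block_length = max_tokens  # documented default: block_length = max_tokens
    return {
        "max_tokens": max_tokens,
        "num_denoising_steps": num_denoising_steps,
        "block_length": block_length,
        "use_local_leap": use_local_leap,
        "force_single_batch": force_single_batch,
    }


print(diffusion_params(max_tokens=128))
print(diffusion_params(max_tokens=128, block_length=32, use_local_leap=True))
```

These fields would be merged into the same JSON body that carries `model` and `messages`.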

Environment variables

Variable	Default	Description
`DLMSERVE_MODEL`	`gsai-ml/LLaDA-8B-Instruct`	Model ID (HuggingFace).
`DLMSERVE_DTYPE`	`int4`	Weight dtype: `int4`, `fp16`, `bf16`.
`DLMSERVE_DEVICE`	`cuda`	Device.
`DLMSERVE_PORT`	`8000`	HTTP port.
`DLMSERVE_MAX_BATCH`	`8`	Max concurrent requests per step.
`DLMSERVE_LOG_LEVEL`	`info`	Log level: `debug`, `info`, `warning`, `error`.
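For scripts that need to mirror the server's configuration, the table above can be read with a few lines of Python. This is an illustrative sketch: `Settings` and `load_settings` are hypothetical names, not dlmserve's internal config loader.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    model: str
    dtype: str
    device: str
    port: int
    max_batch: int
    log_level: str


def load_settings(env=None):
    """Read the DLMSERVE_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    dtype = env.get("DLMSERVE_DTYPE", "int4")
    if dtype not in {"int4", "fp16", "bf16"}:
        raise ValueError(f"unsupported DLMSERVE_DTYPE: {dtype}")
    return Settings(
        model=env.get("DLMSERVE_MODEL", "gsai-ml/LLaDA-8B-Instruct"),
        dtype=dtype,
        device=env.get("DLMSERVE_DEVICE", "cuda"),
        port=int(env.get("DLMSERVE_PORT", "8000")),
        max_batch=int(env.get("DLMSERVE_MAX_BATCH", "8")),
        log_level=env.get("DLMSERVE_LOG_LEVEL", "info"),
    )


print(load_settings({}))
print(load_settings({"DLMSERVE_MODEL": "gsai-ml/LLaDA-1.5", "DLMSERVE_MAX_BATCH": "4"}))
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.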

Honest limitations

Linux x86_64 only — macOS and Windows are not supported. Windows users: WSL2 with CUDA passthrough may work but is untested and unsupported.
Single-GPU only — multi-GPU TP is deferred to v0.5+.
INT4 fits in 12 GB; FP16 weights (~16 GB) need a 24 GB+ card.
Docker image targets SM 8.0–9.0 (A100, RTX 3090/A6000, RTX 4090, H100). Blackwell GPUs (RTX 50-series, SM 12.0) are not supported by the bundled image because PyTorch 2.5.1 has no SM 12.0 kernels. On Blackwell, install with pip (pip install dlmserve) against a PyTorch nightly that ships SM 12.0. See docs/docker.md.
Attention backend is PyTorch SDPA. FlashAttention-2 is optional (pip install dlmserve[attn]) and HF will use it automatically when present — but FA2 also lacks SM 12.0 kernels, so Blackwell stays on SDPA regardless.
Per-step SSE streaming not yet implemented — v0.1 emits one SSE chunk for the full output (issue #5).
max_tokens is a canvas size, not a stop threshold: generation always fills the whole canvas, and the output is then truncated at the first EOS. See ADR 005.
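The max_tokens point is easy to misread, so here is a minimal sketch of fill-then-truncate semantics. The token IDs and the `EOS_ID` value are made up for illustration; real values depend on the model's tokenizer.

```python
EOS_ID = 2  # hypothetical end-of-sequence token ID


def truncate_at_eos(canvas, eos_id=EOS_ID):
    """Return the tokens before the first EOS, or the whole canvas if there is none."""
    if eos_id in canvas:
        return canvas[:canvas.index(eos_id)]
    return list(canvas)


# A canvas of max_tokens=8 positions: every position was denoised,
# but only the tokens before the first EOS are returned.
canvas = [17, 42, 99, EOS_ID, EOS_ID, 5, EOS_ID, EOS_ID]
print(truncate_at_eos(canvas))  # [17, 42, 99]
```

Under this model the work done is governed by the canvas size rather than by the length of the returned text.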

Built on

LLaDA: Large Language Diffusion with mAsking (Nie et al., 2025). Model weights: gsai-ml/LLaDA-8B-Instruct (MIT).
LocalLeap: Accelerating Diffusion Language Models via Local Determinism Propagation (Klear Team, 2025; arXiv:2510.07081). Apache-2.0.

Full attribution: CREDITS.md.

Roadmap

v0.1.1  Dream-7B, LLaDA-2.0, DiffuLLaMA INT4, Fast-dLLM KV cache, per-step SSE
v0.2    BD3-LMs block diffusion, AdaBlock-dLLM adaptive block size
v0.5+   Multi-GPU tensor parallelism

Contributing

See CONTRIBUTING.md. Issues and PRs welcome.

License

MIT — see LICENSE.

Release history

This version

0.1.1

May 25, 2026

0.1.0 yanked

May 25, 2026

Reason this release was yanked:

Crashes on first run due to missing bitsandbytes runtime dep, fixed in 0.1.1

0.0.0

May 23, 2026

Download files

Source Distribution

dlmserve-0.1.1.tar.gz (251.7 kB)

Uploaded May 25, 2026 Source

Built Distribution

dlmserve-0.1.1-py3-none-any.whl (31.0 kB)

Uploaded May 25, 2026 Python 3

File details

Details for the file dlmserve-0.1.1.tar.gz.

File metadata

Download URL: dlmserve-0.1.1.tar.gz
Upload date: May 25, 2026
Size: 251.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.15 {"installer":{"name":"uv","version":"0.11.15","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"26.04","id":"resolute","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for dlmserve-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`7e6d2fe8cf91a17f54e07f2f5e40e0b5dd84573a7a2d1d0d956245fa856f8e04`
MD5	`af78a39577fd49741f15f1e2da6a99ac`
BLAKE2b-256	`8c8a1e3f06944a7f41b732793c0778372870212559d3efb1274feecb8aaaaa51`

File details

Details for the file dlmserve-0.1.1-py3-none-any.whl.

File metadata

Download URL: dlmserve-0.1.1-py3-none-any.whl
Upload date: May 25, 2026
Size: 31.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.15 {"installer":{"name":"uv","version":"0.11.15","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"26.04","id":"resolute","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for dlmserve-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c30d9d077c9c38f8b5fffa213fc8fb92a0ce79215c77f958f826516e41d807b9`
MD5	`0916d5b245ae3cb73e1579a9ef7f7a05`
BLAKE2b-256	`d130096daaa3b3468c4da9b599debe126e37a665de5b6e16ce94a47c383e4d2b`
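
A downloaded file can be checked against the SHA256 digests listed above with the standard library. `sha256_of` and `verify` are hypothetical helper names, and the path is wherever the archive was saved.

```python
import hashlib

# SHA256 digests copied from the tables above.
EXPECTED_SHA256 = {
    "dlmserve-0.1.1.tar.gz":
        "7e6d2fe8cf91a17f54e07f2f5e40e0b5dd84573a7a2d1d0d956245fa856f8e04",
    "dlmserve-0.1.1-py3-none-any.whl":
        "c30d9d077c9c38f8b5fffa213fc8fb92a0ce79215c77f958f826516e41d807b9",
}


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks and return its hex SHA256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, filename):
    """Return True if the file at path matches the published digest for filename."""
    return sha256_of(path) == EXPECTED_SHA256[filename]
```

For example, `verify("dlmserve-0.1.1.tar.gz", "dlmserve-0.1.1.tar.gz")` returns True only if the local file's digest equals the one published for that name.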
