Skip to main content

DSpark + DFlash speculative decoding (semi-autoregressive / block-diffusion drafting) for Apple Silicon via MLX

Project description

mlx-dspark

DeepSeek's DSpark and z-lab's DFlash speculative decoding — native on Apple Silicon via MLX.
Lossless drafters (same output, just faster) for the Qwen3 and Gemma-4 families — plus any matched
DSpark / DFlash checkpoint. Run them at the CLI, from Python, or serve an OpenAI-compatible API to LM Studio / any local tool.

PyPI Python Apple Silicon License

Baseline vs DSpark — same output, ~1.8x faster on Gemma-4 12B

mlx-dspark runs two EAGLE-family speculative-decoding drafters natively on Apple Silicon: DeepSeek's DSpark (semi-autoregressive, from the DeepSpec codebase, used to accelerate DeepSeek-V4) and z-lab's DFlash (block diffusion). Both are lossless — the target verifies every token, so output is identical to normal decoding — and run under one verify loop, so you can serve them, script them, or benchmark them head-to-head.

What this is not: DeepSeek-V4 inference. The targets are dense models (Gemma-4, Qwen3) that DeepSeek published DSpark drafters for — so this runs their real drafter method on a Mac, but the model producing tokens is Gemma / Qwen, not V4. V4 Flash/Pro (MoE, batched serving) is DSpark's own headline use case.

Install

pip install mlx-dspark          # or:  uv pip install mlx-dspark

Apple Silicon + Python ≥ 3.10. Model weights download from the Hugging Face cache on first use (none bundled). No server framework is pulled in — the API server is built on the standard library.

Quickstart

You name the target model (--model, an HF repo or local path, exactly like mlx-lm); the matching drafter is resolved automatically for known targets (see Models), or pass --drafter.

Serve an OpenAI-compatible API

mlx-dspark serve --model mlx-community/Qwen3-8B-8bit        # → http://127.0.0.1:8080/v1
#   add --mode dflash for the DFlash drafter · --mode baseline for plain decoding
#       --no-thinking to silence Qwen3 <think> blocks by default · --api-key KEY to require auth

Then point any OpenAI client at it — the speculative speedup is transparent:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")
print(client.chat.completions.create(
    model="Qwen3-8B-8bit",
    messages=[{"role": "user", "content": "Explain rainbows briefly."}],
).choices[0].message.content)

For LM Studio / other tools: set the OpenAI base URL to http://127.0.0.1:8080/v1.

The server speaks the OpenAI API: POST /v1/chat/completions (streaming and non-streaming, multi-turn), POST /v1/completions, GET /v1/models, GET /health, GET /metrics. It supports temperature, top_p, top_k, max_tokens, stop, seed, tool calling (tools / tool_calls), and a per-request thinking toggle (enable_thinking). Each response carries an x_mlx_dspark block (accept length + tok/s) so the spec-decode gain is visible. Prefix caching is on by default, so multi-turn chat doesn't re-prefill the conversation each turn (~13× faster follow-up turns on a long shared context — see Prefix caching).

One-shot generation (CLI)

# downloads the drafter + instruct target on first run
mlx-dspark generate --model mlx-community/Qwen3-4B-8bit --prompt "Explain how rainbows form."

# baseline (plain target) vs dspark — same output, faster (record each, stack for a demo)
mlx-dspark generate --model mlx-community/Qwen3-4B-8bit --mode baseline --prompt "..." --max-new-tokens 400
mlx-dspark generate --model mlx-community/Qwen3-4B-8bit --mode dspark   --prompt "..." --max-new-tokens 400

# z-lab DFlash drafter (--max-draft 0 = full 16-block; best on code/math)
mlx-dspark generate --model mlx-community/gemma-4-12B-it-8bit --mode dflash --max-draft 0 --prompt "Write a binary search."

# sampled (not greedy) — lossless w.r.t. the target at temperature T (dspark and dflash)
mlx-dspark generate --model mlx-community/Qwen3-4B-8bit --prompt "Write a short poem." --temperature 1.0 --top-p 0.95 --seed 0

python -m mlx_dspark … works too, and the old flat --prompt … form still maps to generate.

Python

from mlx_dspark import load_pair, speculative_generate

target, tok, drafter, cfg = load_pair("mlx-community/Qwen3-8B-8bit")   # drafter auto-resolved
res = speculative_generate(target, tok, drafter, "Explain how rainbows form.")
print(res.text, res.mean_accept_len, res.tokens_per_sec)
from mlx_dspark import load_dflash_pair, dflash_generate   # z-lab DFlash instead

target, tok, drafter, cfg = load_dflash_pair("mlx-community/gemma-4-12B-it-8bit")
res = dflash_generate(target, tok, drafter, "Write a binary search in Python.")  # max_draft_tokens=None = full block
print(res.text, res.mean_accept_len, res.tokens_per_sec)

Models

Pass any target repo/path to --model; the matched drafter auto-resolves for the targets below (quantization-agnostic — a -4bit / -8bit / -bf16 of the same model resolves the same drafter). For anything else, add --drafter <repo>. Run mlx-dspark models to print this table.

target (--model) DSpark drafter (--mode dspark) DFlash drafter (--mode dflash) peak RAM
mlx-community/Qwen3-4B-8bit deepseek-ai/dspark_qwen3_4b_block7 z-lab/Qwen3-4B-DFlash-b16 ~8 GB
mlx-community/Qwen3-8B-8bit deepseek-ai/dspark_qwen3_8b_block7 z-lab/Qwen3-8B-DFlash-b16 ~11 GB
mlx-community/gemma-4-12B-it-8bit deepseek-ai/dspark_gemma4_12b_block7 z-lab/gemma4-12B-it-DFlash ~15 GB

Peak RAM is measured on an M4 Pro (8-bit target + 4-bit drafter + KV cache); add headroom for macOS. A 4-bit target (--model …-it-4bit) roughly halves the target's share (fits smaller Macs). Use the matched instruct target the drafter was trained against — a base model drops acceptance sharply. --drafter lets you run any other matched z-lab / DeepSpec checkpoint (e.g. Qwen3-14B) with no code change. The legacy --family qwen3|gemma4 flags still work but are deprecated in favor of --model.

How it works

  • DSpark — a parallel backbone (5 layers) consumes the target's hidden states (EAGLE3-style) and proposes a 7-token block at once; a rank-256 Markov head adds a cheap previous-token correction that kills "suffix decay"; a confidence head scores each position (optional adaptive block length).
  • DFlash (--mode dflash) — a block-diffusion drafter that denoises a whole 16-token block in one parallel pass and reuses the target's own embed/lm-head. Different trade-offs (see below).
  • The target verifies every token, so output is greedy-correct by construction (identical to plain decoding up to floating-point tie-breaking). --temperature > 0 switches to lossless speculative sampling — an exact sample from the target at temperature T (with --top-p / --top-k).

The drafter loads 1:1 from the HF checkpoint and is 4-bit quantized by default (cheap to run each round; quantization doesn't change acceptance — that's set by the drafter↔target match).

Which target & drafter should I use?

The winner is model-dependent, not just content-dependent — it comes down to how expensive the target's verify step is:

target (verify cost) DSpark (--mode dspark, cap 2) DFlash (--mode dflash --max-draft 0) pick
Gemma-4 12B — expensive verify 1.65× chat, ~1.9× code/math ~2.1× code/math, ~1.0× chat DFlash on code/math, DSpark on chat
Qwen3-8B — cheap verify ~1.6× everywhere ~0.9–1.1× (a wash) DSpark
Qwen3-4B — cheapest verify ~1.4× modest DSpark

Rule of thumb: bigger / slower-verify target → DFlash's full block pays off on code/math; smaller / fast target → DSpark wins outright. For target precision: 8-bit is the sweet spot (best acceptance + quality); 4-bit gives the highest absolute throughput and fits smaller Macs but a smaller speedup ratio; bf16 is slower on M-series (verify dominates). The drafter stays 4-bit either way. Full numbers and the reasoning are in Benchmarks & deep dive.

Results at a glance

DSpark vs the official MLX tools (mlx_lm.generate / mlx_vlm.generate) on the same model, at its cap=2 optimum (M4 Pro, warm, 8-bit instruct target, 4-bit drafter):

target accept len baseline (official) mlx-dspark speedup
Gemma-4 12B ~2.5 18.4 tok/s ~30 tok/s ~1.6× (≤2× on code/math)
Qwen3-8B ~2.44 29.4 tok/s ~47 tok/s ~1.6×
Qwen3-4B ~2.25 52.9 tok/s ~73 tok/s ~1.4×

All paths produce identical output to plain decoding — they're just faster. These land in the DSpark paper's own band (60–85% per-user speedup in batched serving = ~1.6–1.85×); the "2–4×" figures elsewhere are other papers on datacenter GPUs. Why a Mac can't go much higher, the full DSpark-vs-DFlash head-to-head, and the cost model are below.

Prefix caching

The server keeps the target KV cache (and, for DSpark, the drafter context) from the previous turn and reuses the shared conversation prefix instead of re-prefilling it. On a ~750-token shared context this makes follow-up turns ~13× faster (measured: 87 ms vs 1132 ms). It's lossless to the same standard as the rest of the project (a warm turn differs from a cold one only at logit-margin≈0 ties) and invalidates itself on any error so it can't desync.

On by default for --mode dspark / baseline on dense targets (Qwen3). It's disabled for DFlash and for Gemma-4 (whose sliding-window KV caches can't be safely rolled back to an arbitrary prefix) — those fall back to a fresh prefill. Flags: --no-prefix-cache, and --prefix-cache-dir DIR + --prefix-cache-max-ram-mb N to enable the optional SSD spill tier for very long contexts.


Benchmarks & deep dive

Everything below is for readers who want the numbers and the why. The sections above are enough to use it.

The Apple-Silicon speedup ceiling

Speculative decoding amortizes a memory-bound single-token decode across the K tokens verified in one forward. On a datacenter GPU that arbitrage is huge (parallel verify is nearly free, so speedup ≈ acceptance length). On an M-series chip it's much weaker — verify cost grows with the number of tokens (measured ≈ +14 ms/token for Gemma-4 12B, +1.5 ms/token for Qwen3-4B; multi-token verify drops out of MLX's fast quantized GEMV path). With the cost model tok/s ≈ A / (drafter + 0.035 + slope·C) for accept length A and draft cap C, even a perfect drafter accepting the whole 7-token block tops out around ~2.2× here. The binding limiter is acceptance length (set by the drafter↔target match) — not drafter quantization (4-bit / 8-bit / bf16 give identical acceptance; 4-bit is simply fastest). After a drafter-slice fix and the cap=2 default, verify dominates (~76% of each round).

DSpark vs DFlash (head-to-head)

Three drafters from the same DeepSpec lineage, all EAGLE-family (a tiny drafter that consumes the target's hidden states): EAGLE3 is autoregressive (high quality, draft latency grows with block size); DFlash drafts a whole block in one pass (fast, but later positions collide — "suffix decay"); DSpark = DFlash's parallel backbone + a rank-256 Markov head that reinjects token-to-token dependency, fixing suffix decay for ~0.6 ms/round. This is the first MLX port of DSpark; it also runs z-lab's original DFlash (block diffusion, Chen et al., arXiv:2602.06036, MIT) through the same lossless loop.

Gemma-4 12B (it-8bit, M4 Pro, warm, greedy, 4 prompts/domain — accept / tok·s; greedy ≈ 17.3 tok/s):

method chat code math
DSpark (cap 2) 2.45 / 28.5 2.78 / 32.8 2.86 / 32.4
DFlash (cap 2) 2.15 / 24.2 2.76 / 31.3 2.71 / 29.6
DFlash (full 16) 2.68 / 16.9 5.95 / 36.6 6.20 / 36.3

They're complementary, matching the paper's framing: DFlash's block-16 wins structured content on both axes (accept ~6.0 on code/math vs DSpark's block-7 ceiling ~2.8; ~2.1× throughput) because high acceptance amortizes the wide verify; DSpark's Markov head wins open chat (2.45 / 1.65×; DFlash's block never fills on unpredictable text — full-16 chat is a slight net loss).

But the winner flips on a smaller, cheap-verify target. Qwen3-8B-8bit (warm, greedy, 3 prompts/domain; greedy ≈ 28.8):

method chat code math
DSpark (cap 2) 2.38 / 45.7 2.55 / 48.8 2.40 / 46.1
DFlash (cap 2) 1.99 / 33.8 2.22 / 37.0 2.11 / 35.7
DFlash (full 16) 2.19 / 21.1 2.94 / 27.6 2.66 / 25.5

Here DSpark wins everywhere (~1.6×) and DFlash's block advantage evaporates — full-16 is a net loss (~0.9×) because the cheap verify makes the wide block cost more than it returns, and accept never climbs (~2.9 on code vs 5.95 on the 12B). Cross-checked against z-lab's own optimized runner dflash-mlx on the identical target+drafter: its baseline matches ours (29.3 tok/s) and its DFlash is also a net loss / wash at 8B (0.92× code full-block, ~1.08× adaptive) — even with its hand-written Metal verify kernels. So this is DFlash at this model scale on Apple Silicon, not an artifact of our verify loop.

Per the paper (accept length, full block, temp=1.0), DSpark beats DeepSpec's DFlash by +16–18% and EAGLE3 by +27–31%; our greedy exact-match numbers are lower than the paper's temp=1.0 speculative-sampling numbers because greedy is the strictest possible accept rule (not a bug).

Target precision

Since verify dominates, target precision is a speed/quality knob:

target 8-bit (default) 4-bit
Gemma-4 12B greedy 17.5 → spec 30 tok/s (1.73×) greedy 30.6 → spec 34–38 tok/s (1.1–1.25×)
Qwen3-4B greedy 49.8 → spec 73 tok/s (1.45×) greedy 82 → spec 96–103 tok/s (1.17–1.26×)

8-bit for the biggest spec benefit + best quality; 4-bit for max absolute throughput or small RAM (--model …-it-4bit). The drafter stays 4-bit; a bf16 target is not a win (verify roughly doubles).

Tuning

  • DSpark--max-draft 2 is the measured optimum for every target (default): verify cost grows per token and the marginal draft token rarely survives. --confidence-threshold 0.6 truncates the block adaptively via the confidence head instead.
  • DFlash — use --max-draft 0 (full 16-block, its native point) on code/math, where acceptance reaches ~6; use a short cap on open chat, where the block doesn't fill and the full block is a net loss.
  • Sampling--temperature > 0 (+ --top-p / --top-k) is lossless w.r.t. the target at temperature T (the paper's §2.1 method). On M-series it's ≈ greedy speed (the extra acceptance lives in a tail a short cap never reaches) — it's a sampled-output feature, not a speed lever.

License

MIT — see LICENSE. An independent MLX port of the inference path of DeepSeek's DSpark drafter; the z-lab DFlash drafter classes are vendored (MIT) with attribution in NOTICE. No model weights are bundled.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mlx_dspark-0.1.0.tar.gz (6.3 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mlx_dspark-0.1.0-py3-none-any.whl (50.9 kB view details)

Uploaded Python 3

File details

Details for the file mlx_dspark-0.1.0.tar.gz.

File metadata

  • Download URL: mlx_dspark-0.1.0.tar.gz
  • Upload date:
  • Size: 6.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for mlx_dspark-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e9cbe89e99dd9d870f0368956c2595d5650f1a0bd879f880f6d5d2dd4cefda6a
MD5 da91b3a91471e8553418adb1f1dec64d
BLAKE2b-256 e402405ae9f61e19d0c4bac6559a982c734a736b76e6bcf566eff10be585bee0

See more details on using hashes here.

File details

Details for the file mlx_dspark-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: mlx_dspark-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 50.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for mlx_dspark-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f60add85b94b3ba94315aa3d51ac82a0727f00724ae6a669d5710c6d1d363362
MD5 2cf63e58e7052834ad147c3ebfec1519
BLAKE2b-256 5394196a5f444acdcba9c31ad001951a458d530bb0b2f676f0e542ffca6c9008

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page