MTP speculative decoding tuner for Qwen3.6: vLLM/SGLang config generation, crossover analysis, and bug detection.
Qwen3.6-MTP
MTP speculative decoding tuner for Qwen3.6. Generates vLLM/SGLang configs, finds throughput crossover points, and catches known bugs.
What It Does
- Configuration advisor: Recommends MTP on/off with parameters via a decision tree over use case, objective, and GPU
- Backend configs: Generates vLLM (method: mtp) and SGLang (NEXTN algorithm) serve commands
- Crossover analysis: Finds the batch size where MTP flips from net-positive to net-negative throughput
- Bug detection: Detects and blocks known-broken configurations (TurboQuant + MTP, prefix cache degradation)
- Benchmark sweep: Generates latency/throughput matrices across batch size, speculative tokens, and prefix cache settings
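As a rough illustration of what a benchmark sweep matrix covers, the sketch below enumerates every combination of the three axes named above. The function `sweep_grid` and its field names are hypothetical and are not part of this package's API; treating `0` speculative tokens as the MTP-off baseline is also an assumption of the sketch.

```python
from itertools import product


def sweep_grid(batch_sizes, spec_tokens, prefix_cache_options):
    """Return one dict per benchmark cell (hypothetical helper, not the package API)."""
    return [
        {"batch_size": b, "num_spec_tokens": k, "prefix_cache": p}
        for b, k, p in product(batch_sizes, spec_tokens, prefix_cache_options)
    ]


# num_spec_tokens=0 stands in for the MTP-off baseline in this sketch.
grid = sweep_grid(
    batch_sizes=[1, 4, 16],
    spec_tokens=[0, 1, 2],
    prefix_cache_options=[True, False],
)
print(len(grid))  # 3 * 3 * 2 = 18 cells
print(grid[0])
```

Each cell would then be benchmarked separately, so the grid size grows multiplicatively with every value added to an axis.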
Installation
pip install qwen3.6-mtp
Quick Start
from qwen3_6_mtp import recommend, UseCase, Objective, Quantization
rec = recommend(
    use_case=UseCase.SINGLE_USER,
    objective=Objective.MINIMIZE_LATENCY,
    gpu_id="rtx-4090",
    quantization=Quantization.INT4,
)
print(rec.enable) # True
print(rec.expected_gain) # ~25-35% latency reduction (projected)
print(rec.vllm_command) # Full vllm serve command with MTP flags
print(rec.sglang_command) # Equivalent SGLang command
Crossover Analysis
from qwen3_6_mtp import quick_crossover
for s in quick_crossover(gpu_id="rtx-3090"):
    print(f"MTP-{s.spec_tokens}: crossover at batch {s.crossover_batch_size}, "
          f"best gain +{s.max_positive_delta_pct}%")
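To make the idea of a crossover point concrete, here is a minimal sketch that scans paired throughput measurements and reports the first batch size at which MTP no longer beats the baseline. It is not how `quick_crossover` is implemented; the helper names are hypothetical and the throughput numbers are made up purely for illustration.

```python
def find_crossover(batch_sizes, baseline_tps, mtp_tps):
    """Return the smallest batch size where MTP throughput is not above baseline.

    Assumes batch_sizes is sorted ascending. Returns None if MTP wins everywhere.
    """
    for b, base, mtp in zip(batch_sizes, baseline_tps, mtp_tps):
        if mtp <= base:
            return b
    return None


def max_gain_pct(baseline_tps, mtp_tps):
    """Largest relative throughput gain of MTP over baseline, in percent."""
    return max((m - b) / b * 100.0 for b, m in zip(baseline_tps, mtp_tps))


# Illustrative (not measured) tokens/second at each batch size.
batch_sizes = [1, 2, 4, 8, 16]
baseline = [40.0, 78.0, 150.0, 280.0, 500.0]
mtp = [50.0, 94.0, 168.0, 285.0, 470.0]

crossover = find_crossover(batch_sizes, baseline, mtp)
best = max_gain_pct(baseline, mtp)
print(f"crossover at batch {crossover}, best gain +{best:.1f}%")
```

With real measurements the crossover is only as precise as the batch sizes sampled, so a finer sweep near the sign change gives a tighter estimate.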
Backend Config Generation
from qwen3_6_mtp import vllm_mtp_command, sglang_mtp_command
vllm = vllm_mtp_command(model="Qwen/Qwen3.6-27B", num_speculative_tokens=2)
print(vllm.command)
sglang = sglang_mtp_command(model="Qwen/Qwen3.6-27B", num_speculative_tokens=2)
print(sglang.command)
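For readers who want to see roughly what a generated vLLM command could look like, the sketch below assembles one by hand. It assumes that the installed vLLM version accepts a JSON value for `--speculative-config` and the `--no-enable-prefix-caching` flag mentioned under Key Findings; `build_vllm_command` is a hypothetical helper, and the exact string emitted by `vllm_mtp_command` may differ.

```python
import json
import shlex


def build_vllm_command(model, num_speculative_tokens, prefix_caching=True):
    """Assemble a `vllm serve` command line with MTP enabled (illustrative sketch)."""
    # Assumption: vLLM parses --speculative-config as a JSON object.
    spec = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
    args = ["vllm", "serve", model, "--speculative-config", json.dumps(spec)]
    if not prefix_caching:
        args.append("--no-enable-prefix-caching")
    # shlex.join quotes the JSON so the result can be pasted into a POSIX shell.
    return shlex.join(args)


cmd = build_vllm_command("Qwen/Qwen3.6-27B", 2, prefix_caching=False)
print(cmd)
```

Check the flags against the documentation of the vLLM release in use before relying on a hand-built command.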
Bug Detection
from qwen3_6_mtp import check_turboquant_conflict, check_prefix_cache_degradation
bug = check_turboquant_conflict(enable_turboquant=True, num_spec_tokens=2)
if bug:
    print(f"BLOCKED: {bug.title} ({bug.upstream_issue})")
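A check of this kind can be pictured as a small rule that returns a bug record when a known-bad combination is requested and `None` otherwise. The sketch below is an assumption about the general shape, not the package's implementation: `KnownBug` and `check_conflict` are hypothetical names, and the issue number is the one listed under Key Findings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KnownBug:
    """Hypothetical record describing a known-broken configuration."""
    title: str
    upstream_issue: str


def check_conflict(enable_turboquant: bool, num_spec_tokens: int) -> Optional[KnownBug]:
    """Return a KnownBug if TurboQuant is combined with MTP, else None."""
    # MTP is treated as active whenever at least one speculative token is requested.
    if enable_turboquant and num_spec_tokens > 0:
        return KnownBug(
            title="TurboQuant + MTP produces degenerate token loops",
            upstream_issue="vLLM #40831",
        )
    return None


bug = check_conflict(enable_turboquant=True, num_spec_tokens=2)
if bug:
    print(f"BLOCKED: {bug.title} ({bug.upstream_issue})")
```

Returning a record rather than raising lets the caller decide whether to block, warn, or fall back to a configuration without MTP.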
Key Findings
| Finding | Detail |
|---|---|
| MTP decode speedup | +27.5% faster decode TPOT at k=1 on RTX 3090 (with --no-enable-prefix-caching) |
| Prefix cache degradation | L457 bug drops the prefix cache hit rate from ~92% to ~71% when MTP is enabled (vLLM #38182, OPEN) |
| TurboQuant conflict | TQ + MTP = degenerate token loops (vLLM #40831, CLOSED) |
| Crossover point | MTP throughput gain shrinks as batch size grows; the batch size at which it turns net-negative depends on the number of speculative tokens and the prefix cache setting (see quick_crossover()) |
| Sampling independence | MTP is algorithmically lossless; does not constrain sampling parameters |
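Two of the rows above lend themselves to a quick back-of-the-envelope calculation. The sketch below uses the standard speculative decoding estimate for tokens produced per verification step, (1 - a^(k+1)) / (1 - a), which assumes each draft token is accepted independently with probability a. The acceptance rate of 0.8 is a made-up value, not a measurement from this project. The second part only restates the hit rates from the table as miss rates.

```python
def expected_tokens_per_step(acceptance_rate, num_spec_tokens):
    """Expected tokens emitted per target-model step under i.i.d. acceptance."""
    a, k = acceptance_rate, num_spec_tokens
    if a >= 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - a ** (k + 1)) / (1.0 - a)


# Hypothetical acceptance rate, chosen only for illustration.
tokens_k1 = expected_tokens_per_step(0.8, 1)
print(f"k=1: about {tokens_k1:.1f} tokens per step")

# Hit rates from the Key Findings table, expressed as miss rates.
miss_without_mtp = 1.0 - 0.92
miss_with_mtp = 1.0 - 0.71
miss_ratio = miss_with_mtp / miss_without_mtp
print(f"uncached prefix lookups grow by about {miss_ratio:.1f}x")
```

The first figure is an upper bound on decode speedup because it ignores the cost of running the MTP head and verifying drafts; the second shows why a drop of about 21 points in hit rate can matter more than it first appears.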
Supported Models
| Model | Architecture | MTP Layers | Context |
|---|---|---|---|
| Qwen3.6-27B | Dense (GDN + Gated Attention) | 1 | 262K |
| Qwen3.6-35B-A3B | MoE (GDN + Gated Attention) | 1 | 262K |
License
Apache 2.0
File details
Details for the file qwen3_6_mtp-0.1.1.tar.gz.
File metadata
- Download URL: qwen3_6_mtp-0.1.1.tar.gz
- Upload date:
- Size: 19.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cddd38b55c17809cea8a3d113584c01fd4dc7fda2debf3f4c330f01cebd78893 |
| MD5 | b9b1f7c00d6a9f5cb46df33d082d174a |
| BLAKE2b-256 | 043593186e5e35a745beba289955ffdb715d6064b0dc3456dc40cfe9a149ee26 |
Provenance
The following attestation bundles were made for qwen3_6_mtp-0.1.1.tar.gz:
Publisher: publish.yml on ArkaD171717/Qwen3.6-MTP
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: qwen3_6_mtp-0.1.1.tar.gz
- Subject digest: cddd38b55c17809cea8a3d113584c01fd4dc7fda2debf3f4c330f01cebd78893
- Sigstore transparency entry: 1406384523
- Sigstore integration time:
- Permalink: ArkaD171717/Qwen3.6-MTP@0ea473ade416ad6015dc6cb304df798327d9331c
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/ArkaD171717
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0ea473ade416ad6015dc6cb304df798327d9331c
- Trigger Event: push
File details
Details for the file qwen3_6_mtp-0.1.1-py3-none-any.whl.
File metadata
- Download URL: qwen3_6_mtp-0.1.1-py3-none-any.whl
- Upload date:
- Size: 20.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d482b098ae5b61c91a265e661bc48826ad4583afaa0789b092c3d493819b3d4d |
| MD5 | d204347b5141c1f274250a88afa97a52 |
| BLAKE2b-256 | 5e19aaf223ec14288394b63fb931df6b66ff3207f3e7b5e3d319d96dc20868b8 |
Provenance
The following attestation bundles were made for qwen3_6_mtp-0.1.1-py3-none-any.whl:
Publisher: publish.yml on ArkaD171717/Qwen3.6-MTP
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: qwen3_6_mtp-0.1.1-py3-none-any.whl
- Subject digest: d482b098ae5b61c91a265e661bc48826ad4583afaa0789b092c3d493819b3d4d
- Sigstore transparency entry: 1406384563
- Sigstore integration time:
- Permalink: ArkaD171717/Qwen3.6-MTP@0ea473ade416ad6015dc6cb304df798327d9331c
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/ArkaD171717
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0ea473ade416ad6015dc6cb304df798327d9331c
- Trigger Event: push