Hardware-aware MTP speculative decoding auto-tuner for Qwen3.6, with vLLM/SGLang backend normalization and bug detection.

Qwen3.6-MTP

MTP speculative decoding tuner for Qwen3.6. Generates vLLM/SGLang configs, finds throughput crossover points, and catches known bugs.

What It Does

  • Auto-tuner: Recommends MTP configuration (or explains why to disable it) based on use case, objective, and GPU
  • Backend configs: Generates vLLM (method: mtp) and SGLang (NEXTN algorithm) serve commands
  • Crossover analysis: Finds the batch size where MTP flips from net-positive to net-negative throughput
  • Bug detection: Detects and blocks known-broken configurations (TurboQuant + MTP, prefix cache degradation)
  • Benchmark sweep: Generates latency/throughput matrices across batch size, speculative tokens, and prefix cache settings
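
The sweep in the last bullet can be pictured as a Cartesian product over the three axes. A minimal sketch, assuming made-up axis values and a hypothetical `SweepPoint` record; the package's own sweep API is not shown in this description and may look different:

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SweepPoint:
    """One cell of the benchmark matrix (hypothetical record, not the package's type)."""
    batch_size: int
    spec_tokens: int       # 0 means MTP disabled (baseline run)
    prefix_caching: bool


def sweep_grid(batch_sizes, spec_tokens, prefix_caching=(True, False)):
    """Enumerate every combination of the three sweep axes."""
    return [
        SweepPoint(b, k, c)
        for b, k, c in product(batch_sizes, spec_tokens, prefix_caching)
    ]


grid = sweep_grid(batch_sizes=[1, 2, 4, 8], spec_tokens=[0, 1, 2])
print(len(grid))  # 4 batch sizes * 3 token settings * 2 cache settings = 24
```

Including `spec_tokens=0` in the grid gives each MTP run a baseline at the same batch size and cache setting to compare against.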

Installation

pip install qwen3.6-mtp

Quick Start

from qwen3_6_mtp import recommend, UseCase, Objective

rec = recommend(
    use_case=UseCase.SINGLE_USER,
    objective=Objective.MINIMIZE_LATENCY,
    gpu_id="rtx-4090",
)

print(rec.enable)           # True
print(rec.expected_gain)    # ~35-42% latency reduction
print(rec.vllm_command)     # Full vllm serve command with MTP flags
print(rec.sglang_command)   # Equivalent SGLang command

Crossover Analysis

from qwen3_6_mtp import quick_crossover

for s in quick_crossover(gpu_id="rtx-3090"):
    print(f"MTP-{s.spec_tokens}: crossover at batch {s.crossover_batch_size}, "
          f"best gain +{s.max_positive_delta_pct}%")
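
To make the crossover idea concrete, here is a rough sketch of how a crossover batch size and best gain could be derived from sweep results. `find_crossover` is a hypothetical helper written for illustration, not the package's implementation, and the throughput numbers are invented:

```python
def find_crossover(baseline_tps, mtp_tps):
    """Return (crossover_batch_size, max_positive_delta_pct).

    Both arguments map batch size -> measured throughput (tokens/s).
    The crossover is the smallest batch size at which MTP is no faster
    than the baseline, or None if MTP wins at every measured batch size.
    """
    crossover = None
    best_gain = 0.0
    for batch in sorted(baseline_tps):
        base = baseline_tps[batch]
        delta_pct = 100.0 * (mtp_tps[batch] - base) / base
        if delta_pct > 0:
            best_gain = max(best_gain, delta_pct)
        elif crossover is None:
            crossover = batch
    return crossover, round(best_gain, 1)


# Invented numbers purely for illustration, not measurements.
baseline = {1: 100.0, 2: 190.0, 4: 360.0, 8: 640.0}
with_mtp = {1: 125.0, 2: 220.0, 4: 370.0, 8: 600.0}
print(find_crossover(baseline, with_mtp))  # (8, 25.0)
```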

Backend Config Generation

from qwen3_6_mtp import vllm_mtp_command, sglang_mtp_command

vllm = vllm_mtp_command(model="Qwen/Qwen3.6-27B", num_speculative_tokens=2)
print(vllm.command)

sglang = sglang_mtp_command(model="Qwen/Qwen3.6-27B", num_speculative_tokens=2)
print(sglang.command)
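
For orientation, the vLLM side of such a command can be approximated by hand. A rough sketch, assuming vLLM's `--speculative-config` flag takes a JSON object with `method` and `num_speculative_tokens` keys and that `mtp` is an accepted method name, as the feature list above suggests. `build_vllm_args` is a hypothetical helper; the command the package actually emits may carry different or additional flags, and the SGLang equivalent is left out here because its flags are not spelled out in this description:

```python
import json
import shlex


def build_vllm_args(model, num_speculative_tokens, prefix_caching=False):
    """Assemble an argv list for `vllm serve` with MTP speculative decoding."""
    # Assumed config shape: {"method": "mtp", "num_speculative_tokens": k}
    spec = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
    args = ["vllm", "serve", model, "--speculative-config", json.dumps(spec)]
    if not prefix_caching:
        # Flag quoted in the Key Findings section of this README.
        args.append("--no-enable-prefix-caching")
    return args


argv = build_vllm_args("Qwen/Qwen3.6-27B", num_speculative_tokens=2)
print(shlex.join(argv))  # shell-quoted form, suitable for pasting into a terminal
```

Building an argv list first and quoting it with `shlex.join` keeps the embedded JSON intact when the command is copied into a shell.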

Bug Detection

from qwen3_6_mtp import check_turboquant_conflict, check_prefix_cache_degradation

bug = check_turboquant_conflict(enable_turboquant=True, num_spec_tokens=2)
if bug:
    print(f"BLOCKED: {bug.title} ({bug.upstream_issue})")

Key Findings

| Finding | Detail |
| --- | --- |
| MTP decode speedup | +27.5% faster decode TPOT at k=1 on RTX 3090 (with --no-enable-prefix-caching) |
| Prefix cache degradation | L457 bug drops hit rate ~92% to ~71% when MTP is enabled (vLLM #38182, OPEN) |
| TurboQuant conflict | TQ + MTP = degenerate token loops (vLLM #40831, CLOSED) |
| Crossover point | MTP becomes net-negative at batch size 4-8 on consumer GPUs |
| Sampling independence | MTP is algorithmically lossless; does not constrain sampling parameters |
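
The "algorithmically lossless" finding is consistent with the accept/reject rule of standard speculative sampling. A minimal sketch of that rule on a toy vocabulary, assuming the textbook scheme: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p - q). Whether vLLM or SGLang verify MTP drafts in exactly this way is an assumption, and the distributions below are made up:

```python
def output_distribution(p, q):
    """Exact distribution of the token emitted by one speculative-sampling step.

    p: target-model probabilities, q: draft (MTP head) probabilities,
    both lists over the same vocabulary, each summing to 1.
    """
    # Probability that token i is drafted AND accepted: q_i * min(1, p_i / q_i).
    accept = [qi * min(1.0, pi / qi) if qi > 0 else 0.0 for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    norm = sum(residual)
    if norm == 0.0:  # p == q, so the draft is always accepted
        return accept
    return [a + reject_mass * r / norm for a, r in zip(accept, residual)]


target = [0.5, 0.3, 0.2]   # made-up target distribution
draft = [0.2, 0.2, 0.6]    # made-up, deliberately poor draft distribution
out = output_distribution(target, draft)
print([round(x, 6) for x in out])  # matches the target distribution
```

Even with a poor draft, the emitted token follows the target distribution; a worse draft only lowers the acceptance rate, which affects speed rather than output quality.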

Supported Models

| Model | Architecture | MTP Layers | Context |
| --- | --- | --- | --- |
| Qwen3.6-27B | Dense (GDN + Gated Attention) | 1 | 262K |
| Qwen3.6-35B-A3B | MoE (GDN + Gated Attention) | 1 | 262K |

License

Apache 2.0

Publisher: publish.yml on ArkaD171717/Qwen3.6-MTP
