Qwen3.6-MTP

MTP speculative decoding tuner for Qwen3.6. Generates vLLM/SGLang configs, finds throughput crossover points, and catches known bugs.

What It Does

  • Configuration advisor: Recommends whether to enable MTP, and with which parameters, using a decision tree over use case, objective, and GPU
  • Backend configs: Generates vLLM (method: mtp) and SGLang (NEXTN algorithm) serve commands
  • Crossover analysis: Finds the batch size at which MTP flips from net-positive to net-negative throughput
  • Bug detection: Detects and blocks known-broken configurations (TurboQuant + MTP, prefix cache degradation)
  • Benchmark sweep: Generates latency/throughput matrices across batch size, speculative tokens, and prefix cache settings
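
The benchmark sweep in the last bullet amounts to a Cartesian product over three settings. The following is a minimal sketch of that idea, not the package's implementation: `sweep_grid`, its parameter names, and the example values are all hypothetical.

```python
from itertools import product


def sweep_grid(batch_sizes, spec_tokens, prefix_cache_options=(True, False)):
    """Return one dict per benchmark configuration (hypothetical helper)."""
    return [
        {"batch_size": b, "num_speculative_tokens": k, "prefix_caching": p}
        for b, k, p in product(batch_sizes, spec_tokens, prefix_cache_options)
    ]


# Illustrative values only; 0 speculative tokens stands for the MTP-off baseline.
grid = sweep_grid(batch_sizes=[1, 4, 16], spec_tokens=[0, 1, 2])
print(len(grid))  # 3 batch sizes * 3 token counts * 2 cache settings = 18
```

Each entry can then be handed to whatever benchmark runner is in use, which keeps the grid definition separate from the measurement code.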

Installation

pip install qwen3.6-mtp

Quick Start

from qwen3_6_mtp import recommend, UseCase, Objective, Quantization

rec = recommend(
    use_case=UseCase.SINGLE_USER,
    objective=Objective.MINIMIZE_LATENCY,
    gpu_id="rtx-4090",
    quantization=Quantization.INT4,
)

print(rec.enable)           # True
print(rec.expected_gain)    # ~25-35% latency reduction (projected)
print(rec.vllm_command)     # Full vllm serve command with MTP flags
print(rec.sglang_command)   # Equivalent SGLang command

Crossover Analysis

from qwen3_6_mtp import quick_crossover

for s in quick_crossover(gpu_id="rtx-3090"):
    print(f"MTP-{s.spec_tokens}: crossover at batch {s.crossover_batch_size}, "
          f"best gain +{s.max_positive_delta_pct}%")
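
To make the idea of a crossover concrete: given throughput deltas (MTP versus baseline) at several batch sizes, the crossover is the smallest batch size at which the delta is no longer positive. The sketch below assumes that definition; `find_crossover` is a hypothetical helper and the numbers are made up for illustration, not measurements and not output of `quick_crossover()`.

```python
from typing import Optional, Sequence, Tuple


def find_crossover(deltas: Sequence[Tuple[int, float]]) -> Optional[int]:
    """Return the smallest batch size whose throughput delta is <= 0, else None."""
    for batch_size, delta_pct in sorted(deltas):
        if delta_pct <= 0.0:
            return batch_size
    return None


# (batch_size, throughput delta in percent); made-up values for illustration.
example = [(1, 20.0), (4, 12.0), (16, 3.0), (32, -2.0), (64, -9.0)]
print(find_crossover(example))  # 32
```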

Backend Config Generation

from qwen3_6_mtp import vllm_mtp_command, sglang_mtp_command

vllm = vllm_mtp_command(model="Qwen/Qwen3.6-27B", num_speculative_tokens=2)
print(vllm.command)

sglang = sglang_mtp_command(model="Qwen/Qwen3.6-27B", num_speculative_tokens=2)
print(sglang.command)
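
For orientation, a vLLM serve command with MTP enabled can be assembled roughly as follows. This is a sketch under assumptions: it uses vLLM's `--speculative-config` JSON flag with the `mtp` method named above, and `sketch_vllm_command` is a hypothetical helper. The command that `vllm_mtp_command` actually emits may contain additional or different flags.

```python
import json
import shlex


def sketch_vllm_command(model: str, num_speculative_tokens: int,
                        prefix_caching: bool = True) -> str:
    """Build a vllm serve command string with MTP enabled (hypothetical helper)."""
    spec = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
    args = ["vllm", "serve", model, "--speculative-config", json.dumps(spec)]
    if not prefix_caching:
        # Sidesteps the prefix cache interaction by turning the cache off.
        args.append("--no-enable-prefix-caching")
    return shlex.join(args)  # quotes the JSON argument for the shell


cmd = sketch_vllm_command("Qwen/Qwen3.6-27B", num_speculative_tokens=2)
print(cmd)
```

Building the argument list first and quoting it with `shlex.join` avoids hand-escaping the JSON payload.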

Bug Detection

from qwen3_6_mtp import check_turboquant_conflict, check_prefix_cache_degradation

bug = check_turboquant_conflict(enable_turboquant=True, num_spec_tokens=2)
if bug:
    print(f"BLOCKED: {bug.title} ({bug.upstream_issue})")
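
The shape of such a check can be sketched as a small guard that returns a bug record when a known-broken combination is requested. Everything here is illustrative: `KnownBug` and `sketch_turboquant_check` are hypothetical stand-ins, and treating `num_spec_tokens > 0` as "MTP enabled" is an assumption, not the package's documented rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KnownBug:
    """Hypothetical record mirroring the fields used in the example above."""
    title: str
    upstream_issue: str


def sketch_turboquant_check(enable_turboquant: bool,
                            num_spec_tokens: int) -> Optional[KnownBug]:
    """Return a KnownBug if TurboQuant is combined with MTP, else None."""
    if enable_turboquant and num_spec_tokens > 0:  # assumption: >0 means MTP on
        return KnownBug(
            title="TurboQuant + MTP produces degenerate token loops",
            upstream_issue="vLLM #40831",
        )
    return None


bug = sketch_turboquant_check(enable_turboquant=True, num_spec_tokens=2)
if bug:
    print(f"BLOCKED: {bug.title} ({bug.upstream_issue})")
```

Returning `None` for safe configurations lets callers use a plain truthiness test, as in the example above.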

Key Findings

| Finding | Detail |
|---|---|
| MTP decode speedup | 27.5% faster decode TPOT at k=1 on RTX 3090 (with --no-enable-prefix-caching) |
| Prefix cache degradation | L457 bug drops the hit rate from ~92% to ~71% when MTP is enabled (vLLM #38182, OPEN) |
| TurboQuant conflict | TurboQuant (TQ) + MTP produces degenerate token loops (vLLM #40831, CLOSED) |
| Crossover point | MTP throughput gain shrinks as batch size grows; the batch size at which it turns net-negative varies with the number of speculative tokens and the prefix cache setting (see quick_crossover()) |
| Sampling independence | MTP is algorithmically lossless and does not constrain sampling parameters |
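
As background for the speedup figures, a common simplified model of speculative decoding assumes each draft token is accepted independently with probability alpha. With k draft tokens, the expected number of tokens produced per target-model step is then (1 - alpha^(k+1)) / (1 - alpha). The sketch below evaluates that formula. The independence assumption and the example acceptance rate are illustrative, not values from this project, and the model ignores the cost of the draft (MTP) layer itself.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step under i.i.d. acceptance."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# With k=1 the formula reduces to 1 + alpha.
for k in (1, 2, 3):
    print(k, round(expected_tokens_per_step(0.8, k), 3))
```

Under this model the marginal benefit of each extra speculative token shrinks as k grows, which is one reason a small k can be the better choice.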

Supported Models

| Model | Architecture | MTP Layers | Context |
|---|---|---|---|
| Qwen3.6-27B | Dense (GDN + Gated Attention) | 1 | 262K |
| Qwen3.6-35B-A3B | MoE (GDN + Gated Attention) | 1 | 262K |

License

Apache 2.0
