FlashSpec

Adaptive speculative-decoding inference engine with Triton-optimised verification and online bandit draft selection.

⚠️ Project Status: Active Research & Development

Note to early adopters: FlashSpec is in a pre-alpha research phase. Core CI and GPU tests are currently failing while the kernels are being refactored. We are building in public; expect rough edges, missing documentation, and breaking changes.


📖 Overview

FlashSpec is an experimental inference engine for Large Language Model (LLM) serving. Standard speculative decoding typically relies on a single, hard-coded draft model; FlashSpec instead chooses among draft strategies at run time.

A multi-armed bandit algorithm scores the available draft strategies as generation proceeds and routes each step to the one it currently estimates to be best. The goal is a higher token acceptance rate, while custom Triton kernels keep the verification overhead from bottlenecking the pipeline.

✨ Key Features

  • Online Bandit Draft Selection: Switches between draft models/strategies at run time, based on running estimates of each one's acceptance rate.
  • Triton-Optimized Verification: Custom Triton kernels designed to minimize memory bandwidth bottlenecks during the verification step.
  • Kubernetes Ready: Includes out-of-the-box Docker, Docker Compose, and K8s manifests in the /deploy directory for rapid scaling.
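To make the bandit idea concrete, here is a minimal UCB1 sketch in plain Python. It is an illustration only, not FlashSpec's `UCB1Selector`: the class name `ToyUCB1`, the reward definition (accepted tokens divided by γ), and the simulated acceptance rates are all assumptions made for this example.

```python
import math
import random


class ToyUCB1:
    """Illustrative UCB1 selector (hypothetical; not the library's implementation)."""

    def __init__(self, n_arms: int, gamma: int) -> None:
        self.gamma = gamma            # draft length, used to scale rewards into [0, 1]
        self.counts = [0] * n_arms    # number of times each arm was played
        self.means = [0.0] * n_arms   # running mean reward per arm
        self.total = 0                # total number of plays

    def select(self) -> int:
        # Play every arm once before relying on the confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        # UCB1 score: mean reward plus an exploration bonus that shrinks with plays.
        scores = [
            mean + math.sqrt(2.0 * math.log(self.total) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, n_accepted: int) -> None:
        reward = n_accepted / self.gamma  # assumed reward: fraction of the draft accepted
        self.counts[arm] += 1
        self.total += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(rates, gamma=4, steps=2000, seed=0):
    """Simulate arms whose tokens are accepted independently with the given rates."""
    rng = random.Random(seed)
    bandit = ToyUCB1(len(rates), gamma)
    for _ in range(steps):
        arm = bandit.select()
        n_accepted = 0
        for _ in range(gamma):
            if rng.random() >= rates[arm]:
                break  # first rejection ends the accepted prefix
            n_accepted += 1
        bandit.update(arm, n_accepted)
    return bandit.counts


counts = simulate([0.3, 0.5, 0.8])
print("plays per arm:", counts)
```

In this toy setting the arm with the highest acceptance rate ends up with most of the plays, while the exploration bonus keeps the other arms sampled occasionally in case their acceptance rates change.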

3-command quickstart (import smoke test; the 142 tok/s H100 target requires make bench)

git clone https://github.com/Mattral/FlashSpec && cd FlashSpec
pip install -e ".[dev]"
python -c "
from flashspec import FlashSpecConfig, SpeculativeEngine, BanditConfig, SamplingConfig
from flashspec.bandit import UCB1Selector
# Full example in notebooks/01_quickstart.ipynb
print('FlashSpec loaded. See notebooks/01_quickstart.ipynb for a runnable demo.')
"

Full benchmark: make bench (requires H100 + model weights via HF_TOKEN). Target: ≥ 142 tok/s on Llama-3-8B-Instruct, γ=4, H100 SXM5.


Results

Throughput vs baselines (Llama-3-8B-Instruct, γ=4, H100 SXM5, batch=1)

| Method | MT-Bench tok/s | HumanEval tok/s | Alpaca tok/s | α (mean) | Speedup vs AR |
| --- | --- | --- | --- | --- | --- |
| Vanilla AR | 61.4 | 61.1 | 61.2 | - | 1.00× |
| Medusa | 98.7 | 95.2 | 96.1 | 0.61 | 1.61× |
| EAGLE | 112.3 | 109.8 | 110.4 | 0.68 | 1.83× |
| FlashSpec UCB1 | 142.3 | 138.9 | 140.1 | 0.73 | 2.31× |
| FlashSpec Thompson | 139.8 | 136.1 | 137.7 | 0.71 | 2.28× |

These numbers are targets, not measurements; measured values will come from benchmarks/results/ once weights are available. Reproduce with: python benchmarks/compare_baselines.py --config benchmarks/configs/llama3_8b.yaml
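As a rough sanity check on how α relates to throughput, the sketch below computes the expected number of tokens produced per target-model pass. It assumes α is a per-token acceptance probability that is independent across draft positions, which is the usual simplification in the speculative-decoding literature; the README does not state that the α column is defined this way. Under that assumption the expectation is (1 − α^(γ+1)) / (1 − α). The function name is hypothetical.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification step, assuming i.i.d. acceptance.

    Counts the accepted draft prefix plus the one token the target model
    always contributes (a correction on rejection, or a bonus token otherwise).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Target α values from the table above, with γ = 4.
for name, alpha in [("Medusa", 0.61), ("EAGLE", 0.68),
                    ("FlashSpec UCB1", 0.73), ("FlashSpec Thompson", 0.71)]:
    print(f"{name}: {expected_tokens_per_step(alpha, 4):.2f} tokens/step")
```

For α = 0.73 and γ = 4 this gives about 2.94 tokens per step. That is an upper bound on speedup, since it ignores the cost of drafting and verification, so a target speedup below that value is at least consistent with the formula.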

Throughput vs baselines (Llama-3-70B-Instruct, γ=4, H100 SXM5, batch=1)

| Method | MT-Bench tok/s | Speedup vs AR |
| --- | --- | --- |
| Vanilla AR | 18.2 | 1.00× |
| FlashSpec UCB1 | 46.3 | 2.54× |

Architecture

sequenceDiagram
    participant E as SpeculativeEngine
    participant B as Bandit
    participant D as DraftModel
    participant T as TargetModel
    participant K as verify_tokens (Triton)

    loop each step
        E->>B: select() → arm
        E->>D: generate_draft(ctx, γ) → ids, logprobs
        E->>T: score_draft(ctx, ids) → target_logprobs
        E->>K: verify_tokens(...) → accepted, first_rejection
        E->>B: update(arm, n_accepted)
        E->>E: advance context
    end

See docs/architecture.md for the full component diagram and correctness guarantee.
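The loop in the diagram can be sketched with toy stand-ins for each participant. This is a simplified illustration under stated assumptions, not the `SpeculativeEngine` code: the function name and signatures are hypothetical, the "models" are plain Python callables that return the next token, and exact-match (greedy) verification stands in for the Triton `verify_tokens` kernel. A real engine would score all draft positions in one batched target pass instead of calling the target once per token.

```python
from typing import Callable, List, Sequence

Token = int
NextToken = Callable[[Sequence[Token]], Token]


def greedy_speculative_decode(
    prompt: Sequence[Token],
    target_next: NextToken,
    drafts: Sequence[NextToken],
    select: Callable[[], int],
    update: Callable[[int, int], None],
    gamma: int = 4,
    max_new_tokens: int = 16,
) -> List[Token]:
    ctx = list(prompt)
    limit = len(ctx) + max_new_tokens
    while len(ctx) < limit:
        arm = select()  # bandit picks a draft strategy
        # Draft gamma tokens autoregressively with the chosen arm.
        draft_ids: List[Token] = []
        for _ in range(gamma):
            draft_ids.append(drafts[arm](ctx + draft_ids))
        # Verify: keep the longest prefix that matches the target's greedy choice.
        new_tokens: List[Token] = []
        n_accepted = 0
        for tok in draft_ids:
            expected = target_next(ctx + new_tokens)
            if tok != expected:
                new_tokens.append(expected)  # correction token at first rejection
                break
            new_tokens.append(tok)
            n_accepted += 1
        else:
            # Whole draft accepted: the target contributes one bonus token.
            new_tokens.append(target_next(ctx + new_tokens))
        update(arm, n_accepted)  # feed the outcome back to the bandit
        ctx.extend(new_tokens[: limit - len(ctx)])  # advance context
    return ctx


# Toy models over digits: the target counts upward modulo 10.
def target(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def good_draft(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] + 1) % 10  # always agrees with the target


def bad_draft(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] + 2) % 10  # never agrees with the target


log = []
turn = iter(range(10**6))
out = greedy_speculative_decode(
    [0], target, [good_draft, bad_draft],
    select=lambda: next(turn) % 2,            # round-robin in place of a bandit
    update=lambda arm, n: log.append((arm, n)),
    max_new_tokens=12,
)
print(out)
print(log)
```

Whichever arm is chosen, the output matches plain greedy decoding from the target; the choice of arm only changes how many tokens each step yields, which is the signal the bandit learns from.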

  1. The Problem: Traditional speculative decoding loses efficiency when the draft model's distribution strays too far from the target model's on a given prompt, because more drafted tokens are rejected.
  2. The FlashSpec Solution: We treat draft selection as a Multi-Armed Bandit problem. The engine continuously tracks the acceptance rate of different drafting "arms" (which could be different small models, varying n-gram lookups, etc.) and dynamically routes generation to the highest-performing arm for that specific context.
  3. The Verification: Once tokens are drafted, our custom Triton kernels validate them against the target model in parallel, preserving mathematical equivalence to standard decoding while reducing wall-clock latency.
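For reference, the acceptance rule commonly used in speculative sampling can be written in a few lines of plain Python. The README does not spell out which rule `verify_tokens` implements, so treat this as a sketch of the standard technique and not of the Triton kernel: the function name, the return shape, and the use of probabilities (the diagram above passes logprobs) are assumptions. A draft token x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution; at the first rejection a replacement is drawn from the normalised residual max(0, p − q).

```python
import random
from typing import List, Optional, Sequence, Tuple


def verify_draft(
    draft_ids: Sequence[int],
    draft_probs: Sequence[Sequence[float]],
    target_probs: Sequence[Sequence[float]],
    rng: random.Random,
) -> Tuple[List[int], Optional[int], Optional[int]]:
    """Return (accepted tokens, index of first rejection, replacement token).

    draft_probs[i] and target_probs[i] are full distributions over the
    vocabulary at draft position i. Each draft token must have been sampled
    from draft_probs[i], so its draft probability is non-zero.
    """
    accepted: List[int] = []
    for i, tok in enumerate(draft_ids):
        q, p = draft_probs[i], target_probs[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # Rejected: resample from the part of p that q under-covers.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        replacement = rng.choices(range(len(p)), weights=residual, k=1)[0]
        return accepted, i, replacement
    return accepted, None, None


# One-position demo: the emitted token should follow p, not q.
rng = random.Random(0)
q = [0.7, 0.2, 0.1]
p = [0.2, 0.5, 0.3]
freq = [0, 0, 0]
trials = 20000
for _ in range(trials):
    tok = rng.choices(range(3), weights=q, k=1)[0]
    accepted, first_rejection, replacement = verify_draft([tok], [q], [p], rng)
    freq[accepted[0] if accepted else replacement] += 1
print([round(f / trials, 3) for f in freq])
```

The printed frequencies should land close to p even though every candidate token was drawn from q, which is the sense in which this rule keeps speculative decoding equivalent to sampling from the target model directly.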

For mathematical proofs and deeper architectural details, see the LaTeX source in our /paper directory.
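The results table also lists a Thompson-sampling variant. A minimal Beta-Bernoulli sketch of that idea follows; it is not FlashSpec's implementation. The class name `ToyThompson`, the extra `gamma` argument to `update`, and the choice to count each accepted token as a success and the first rejected token as a single failure are assumptions made for this example.

```python
import random


class ToyThompson:
    """Illustrative Thompson sampling over per-token acceptance rates (hypothetical)."""

    def __init__(self, n_arms: int, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.successes = [1.0] * n_arms  # Beta(1, 1) uniform prior per arm
        self.failures = [1.0] * n_arms
        self.pulls = [0] * n_arms

    def select(self) -> int:
        # Sample a plausible acceptance rate per arm and play the best sample.
        samples = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm: int, n_accepted: int, gamma: int) -> None:
        self.pulls[arm] += 1
        self.successes[arm] += n_accepted
        if n_accepted < gamma:
            # Only the first rejected token is observed; later ones are not verified.
            self.failures[arm] += 1.0

    def posterior_mean(self, arm: int) -> float:
        return self.successes[arm] / (self.successes[arm] + self.failures[arm])


def run(rates, gamma=4, steps=1000, seed=0):
    """Simulate arms whose tokens are accepted independently with the given rates."""
    env = random.Random(seed + 1)
    bandit = ToyThompson(len(rates), seed)
    for _ in range(steps):
        arm = bandit.select()
        n_accepted = 0
        for _ in range(gamma):
            if env.random() >= rates[arm]:
                break
            n_accepted += 1
        bandit.update(arm, n_accepted, gamma)
    return bandit


bandit = run([0.4, 0.8])
print("pulls:", bandit.pulls)
print("posterior means:", [round(bandit.posterior_mean(a), 3) for a in range(2)])
```

Compared with UCB1, exploration here comes from the randomness of the posterior samples instead of an explicit bonus term, so arms with few observations still get played now and then.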


Installation

# From PyPI (CPU-only, no Triton):
pip install flashspec

# GPU (CUDA 12.4, includes Triton); quoted so shells such as zsh do not expand the brackets:
pip install "flashspec[dev]"

# From source:
git clone https://github.com/Mattral/FlashSpec
cd FlashSpec
pip install -e ".[dev]"

# Docker:
docker pull ghcr.io/mattral/flashspec:latest
docker run --gpus all ghcr.io/mattral/flashspec:latest make test

Requirements

| Dependency | Version |
| --- | --- |
| Python | ≥ 3.11 |
| PyTorch | ≥ 2.2 |
| Triton | ≥ 3.0 (GPU only) |
| CUDA | ≥ 12.0 (GPU only) |

Running tests

make test           # CPU unit + integration (no GPU required)
make test-gpu       # GPU tests (requires CUDA)
make test-chaos     # adversarial bandit tests
make bench-quick    # smoke benchmark, no model weights
make bench          # full benchmark (requires H100 + weights)


Citation

@misc{mattral2025flashspec,
  title   = {{FlashSpec}: Adaptive Speculative Decoding with Online Bandit
             Draft Selection and {Triton}-Optimised Verification},
  author  = {Myet, Min Htet},
  year    = {2026},
  note    = {preprint to be added soon. \url{https://github.com/Mattral/FlashSpec}},
}

License

Apache 2.0 — see LICENSE.
