
A unified library for creating, representing, and storing speculative decoding algorithms for LLM serving frameworks such as vLLM.


Overview

Speculators is a unified library for building, training, and storing speculative decoding algorithms for large language model (LLM) inference, including in frameworks like vLLM. Speculative decoding is a lossless technique that speeds up LLM inference by using a smaller, faster draft model (the "speculator") to propose multiple tokens ahead of time, which the larger base model then verifies in a single forward pass. Because every accepted token is guaranteed to match what the base model would have generated on its own, this approach reduces latency without compromising output quality.
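
To make the draft-and-verify loop concrete, below is a minimal, self-contained Python sketch. It is illustrative only, not the Speculators or vLLM implementation: the toy score tables stand in for real draft and verifier models, and greedy matching replaces the rejection-sampling acceptance rule used in practice.

import numpy as np

VOCAB = 50  # toy vocabulary size
rng = np.random.default_rng(0)
TABLE = rng.random((VOCAB, VOCAB))                      # toy verifier: next-token scores per token
DRAFT_TABLE = TABLE + rng.normal(0, 0.05, TABLE.shape)  # toy draft model: a noisy approximation

def draft_next_tokens(ids, num_tokens):
    # Hypothetical draft model: greedy autoregressive rollout of num_tokens tokens.
    out, last = [], ids[-1]
    for _ in range(num_tokens):
        last = int(np.argmax(DRAFT_TABLE[last]))
        out.append(last)
    return out

def verifier_logits(ids):
    # Hypothetical verifier: one "forward pass" scoring the successor of every position.
    return np.stack([TABLE[t] for t in ids])

def speculative_decode_step(prompt_ids, k=4):
    draft_ids = draft_next_tokens(prompt_ids, num_tokens=k)  # 1. draft k tokens cheaply
    logits = verifier_logits(prompt_ids + draft_ids)         # 2. verify all of them in one pass
    accepted = []
    for i, token in enumerate(draft_ids):
        # logits[p] scores the token at position p + 1
        predicted = int(np.argmax(logits[len(prompt_ids) + i - 1]))
        if predicted != token:
            accepted.append(predicted)  # first mismatch: keep the verifier's token and stop
            break
        accepted.append(token)          # match: the token is exactly what the verifier would emit
    return accepted

print(speculative_decode_step([1, 2, 3]))  # prints the accepted tokens for one decode step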

Speculators standardizes this process by providing a productionized, end-to-end framework for training draft models with reusable formats and tools. Trained models run seamlessly in vLLM, enabling the deployment of speculative decoding in production-grade inference servers.

[Figure: Speculators user flow diagram]


💬 Join us on the vLLM Community Slack and share your questions, thoughts, or ideas in:

  • #speculators
  • #feat-spec-decode

🎥 Watch our Office Hours presentation: Video | Slides


Key Features

  • Draft Model Training Support: End-to-end (E2E) training support for single- and multi-layer draft models. Training is supported for MoE, non-MoE, and vision-language models.
  • Standardized, Extensible Format: Provides a Hugging Face-compatible format for defining speculative models, with tools to convert from external research repositories into a standard speculators format for easy adoption.
  • Seamless vLLM Integration: Built for direct deployment into vLLM, enabling low-latency, production-grade inference with minimal overhead.

[!TIP] Read more about Speculators features in this vLLM blog post.

Supported Models

The following table summarizes the models that our team has trained end-to-end, along with others on the roadmap:

| Verifier Architecture | Verifier Size  | Training Support | vLLM Deployment Support |
|-----------------------|----------------|------------------|-------------------------|
| Llama                 | 8B-Instruct    | EAGLE-3          |                         |
| Llama                 | 70B-Instruct   | EAGLE-3          |                         |
| Qwen3                 | 8B             | EAGLE-3          |                         |
| Qwen3                 | 14B            | EAGLE-3          |                         |
| Qwen3                 | 32B            | EAGLE-3          |                         |
| gpt-oss               | 20b            | EAGLE-3          |                         |
| gpt-oss               | 120b           | EAGLE-3          | ✅                      |
| Qwen3 MoE             | 30B-Instruct   | EAGLE-3          |                         |
| Qwen3 MoE             | 235B-Instruct  | EAGLE-3          |                         |
| Qwen3 MoE             | 235B           | EAGLE-3          |                         |
| Qwen3-VL              | 235B-A22B      | EAGLE-3          |                         |
| Mistral 3 Large       | 675B-Instruct  | EAGLE-3          | ⏳                      |

✅ = Supported, ⏳ = In Progress, ❌ = Not Yet Supported

vLLM Inference

Models trained through Speculators can run seamlessly in vLLM using a simple vllm serve <speculator_model> command. This serves the model with the default arguments defined in the speculator_config of the model's config.json.

vllm serve RedHatAI/Qwen3-8B-speculator.eagle3
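
Once the server is running, it can be queried through vLLM's OpenAI-compatible API. A minimal sketch using the openai Python client, assuming the default host and port (http://localhost:8000):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; no real API key is needed locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="RedHatAI/Qwen3-8B-speculator.eagle3",
    prompt="Speculative decoding speeds up LLM inference by",
    max_tokens=64,
)
print(response.choices[0].text)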

Served models can then be benchmarked using GuideLLM. Below, we show sample benchmark results comparing our speculator with its dense counterpart. We also explore further performance gains from quantization by swapping the dense verifier, Qwen/Qwen3-8B, for the quantized FP8 model, RedHatAI/Qwen3-8B-FP8-dynamic, in the speculator_config.
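
As a reference point, a GuideLLM run against the server above might look like the following. This invocation is a sketch based on GuideLLM's benchmark subcommand; check guidellm benchmark --help for the exact flags in your installed version:

guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type sweep \
  --max-seconds 30 \
  --data "prompt_tokens=256,output_tokens=128"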

[Figure: GuideLLM benchmark results]

Getting Started

Installation

Prerequisites

Before installing, ensure you have the following:

  • Operating System: Linux or macOS
  • Python: 3.10 or higher
  • Package Manager: pip (recommended) or conda

Install from PyPI (Recommended)

Install the latest stable release from PyPI:

pip install speculators

Install from Source

For the latest development version or to contribute to the project:

git clone https://github.com/vllm-project/speculators.git
cd speculators

pip install -e .

For development with additional tools:

pip install -e ".[dev]"

Verify Installation

You can verify your installation by checking the version:

speculators --version

Or by importing the package in Python:

import speculators
print(speculators.__version__)

License

Speculators is licensed under the Apache License 2.0.

Cite

If you find Speculators helpful in your research or projects, please consider citing it:

@misc{speculators2025,
  title={Speculators: A Unified Library for Speculative Decoding Algorithms in LLM Serving},
  author={Red Hat},
  year={2025},
  howpublished={\url{https://github.com/vllm-project/speculators}},
}
