speculators

A unified library for creating, representing, and storing speculative decoding algorithms for LLM serving such as in vLLM.

Project description

Overview

Speculators is a library for training speculative decoding draft models that deploy directly to LLM inference engines such as vLLM. Speculative decoding is a lossless technique for speeding up LLM inference: a smaller, faster draft model (the "speculator") proposes several tokens ahead of time, and the larger base model verifies them in a single forward pass. This reduces latency without sacrificing output quality, because every accepted token is guaranteed to match what the base model would have generated on its own.
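The draft-and-verify loop described above can be sketched with toy models. This is a minimal illustration under stated assumptions, not the Speculators or vLLM implementation: both "models" are plain Python functions that return the next token greedily, and the names `speculative_step`, `generate`, `toy_draft`, and `toy_verifier` are hypothetical.

```python
from typing import Callable, List

# A "model" is any function mapping a token sequence to its greedy next token.
# Real models return logits; these are toy stand-ins for illustration only.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     verifier: NextToken, k: int) -> List[int]:
    """Run one draft-and-verify step and return the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The verifier checks each proposal. A real engine scores all k
    #    positions in one batched forward pass; this loop computes the same
    #    result one position at a time for clarity.
    accepted: List[int] = []
    for token in proposed:
        expected = verifier(prefix + accepted)
        if token != expected:
            # First mismatch: keep the verifier's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. Every proposal matched, so the verifier's next token comes for free.
    accepted.append(verifier(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: NextToken, verifier: NextToken,
             k: int, max_new_tokens: int) -> List[int]:
    """Generate tokens by repeating speculative steps."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft, verifier, k))
    return out[: len(prefix) + max_new_tokens]


def toy_verifier(tokens: List[int]) -> int:
    # Counts upward modulo 10.
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    # Agrees with the verifier except after token 4, where it guesses wrong.
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


def greedy(prefix: List[int], model: NextToken, max_new_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    out = list(prefix)
    for _ in range(max_new_tokens):
        out.append(model(out))
    return out


speculative = generate([0], toy_draft, toy_verifier, k=3, max_new_tokens=12)
reference = greedy([0], toy_verifier, 12)
print(speculative == reference)
```

Even though the toy draft model is sometimes wrong, the speculative output is identical to decoding with the verifier alone; a wrong guess only costs speed, not correctness.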

Speculators standardizes this process by providing a productionized end-to-end framework to train draft models with reusable formats and tools. Trained models can seamlessly run in vLLM, enabling the deployment of speculative decoding in production-grade inference servers.

[Figure: Speculators user flow diagram]

💬 Join us on the vLLM Community Slack and share your questions, thoughts, or ideas in:

#speculators
#feat-spec-decode

🎥 Watch our Office Hours presentation: Video | Slides

🚀 What's New!

Big updates have landed in Speculators! To get a more in-depth look, check out the Speculators documentation.

Some of the exciting new features include:

Qwen3-8B DFlash Speculator: The Red Hat team published a DFlash speculator for Qwen3-8B, achieving average speculative token acceptance lengths of up to 3.74 on math_reasoning.
Gemma 4 Speculators: The Red Hat team published speculators for Gemma 4 31B-it, including both DFlash and EAGLE-3 checkpoints, enabling production-grade speculative decoding for Gemma 4 models.
DFlash Training Algorithm: Added support for the DFlash training algorithm with anchored-block drafting, using auxiliary hidden states from multiple verifier layers. Includes CLI options for block size and max anchors, plus DFlash metrics, utilities, and draft model. DFlash models trained through Speculators can now run seamlessly in vLLM as of vLLM PR #38300.
Online Training Support: Added support for online training using the new vLLM hidden extraction system, enabling real-time hidden state generation during training without requiring separate offline data generation steps.
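For readers unfamiliar with the acceptance-length metric quoted above, the sketch below shows one common way to compute it. It assumes that each verification step emits the accepted draft tokens plus one token from the verifier itself, so the mean acceptance length is 1 plus the average number of accepted draft tokens per step. That definition, the function name `mean_acceptance_length`, and the sample counts are assumptions for illustration, not taken from the Speculators code.

```python
from typing import Sequence


def mean_acceptance_length(accepted_per_step: Sequence[int]) -> float:
    """Average number of tokens emitted per verifier forward pass.

    Assumption: every step emits its accepted draft tokens plus one token
    produced by the verifier, hence the leading 1.0.
    """
    if not accepted_per_step:
        raise ValueError("need at least one verification step")
    return 1.0 + sum(accepted_per_step) / len(accepted_per_step)


# Hypothetical counts of accepted draft tokens over five verification steps.
counts = [3, 2, 4, 0, 3]
print(mean_acceptance_length(counts))
```

Under this definition, a value of 1.0 means the draft model never helps, and larger values mean fewer verifier passes per generated token.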

Key Features

Offline Training Data Generation using vLLM: Enable the generation of hidden states using vLLM. Data samples are saved to disk and can be used for draft model training.
Draft Model Training Support: End-to-end training of single-layer and multi-layer draft models. Training is supported for MoE, non-MoE, and vision-language models.
Standardized, Extensible Format: Provides a Hugging Face-compatible format for defining speculative models, with tools to convert from external research repositories into a standard speculators format for easy adoption.
Seamless vLLM Integration: Built for direct deployment into vLLM, enabling low-latency, production-grade inference with minimal overhead.

Tip: Read more about Speculators features in this vLLM blog post.

Supported Models

The following table summarizes the models that have been trained end-to-end by our team as well as others in the roadmap:

Verifier Architecture	Verifier Size	Training Support	vLLM Deployment Support
Llama	8B-Instruct	EAGLE-3 ✅	✅
Llama	70B-Instruct	EAGLE-3 ✅	✅
Qwen3	8B	EAGLE-3 ✅, DFlash ✅	✅
Qwen3	14B	EAGLE-3 ✅	✅
Qwen3	32B	EAGLE-3 ✅	✅
gpt-oss	20b	EAGLE-3 ✅	✅
gpt-oss	120b	EAGLE-3 ✅	✅
Qwen3 MoE	30B-Instruct	EAGLE-3 ✅	✅
Qwen3 MoE	235B-Instruct	EAGLE-3 ✅	✅
Qwen3 MoE	235B	EAGLE-3 ✅	✅
Qwen3-VL	235B-A22B	EAGLE-3 ✅	✅
Mistral 3 Large	675B-Instruct	EAGLE-3 ⏳	⏳
Gemma 4	31B-it	EAGLE-3 ✅, DFlash ✅	✅
Gemma 4 MoE	26B-A4B-it	EAGLE-3 ✅	✅

✅ = Supported, ⏳ = In Progress, ❌ = Not Yet Supported

vLLM Inference

Models trained through Speculators can run seamlessly in vLLM using a simple vllm serve <speculator_model> command. This will run the model in vLLM using default arguments, defined in the speculator_config of the model's config.json.

vllm serve RedHatAI/Qwen3-8B-speculator.eagle3
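Once the server is up, it can be queried like any other vLLM deployment. The sketch below assumes vLLM's OpenAI-compatible server is listening on its default address, `http://localhost:8000`, and that the model is addressed by the same name passed to `vllm serve`. The helper names `build_completion_request` and `complete` are hypothetical, and only the Python standard library is used.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Usage (requires the server started by the command above):
# request = build_completion_request(
#     "http://localhost:8000",                # assumed default host and port
#     "RedHatAI/Qwen3-8B-speculator.eagle3",  # same name given to vllm serve
#     "Speculative decoding is",
# )
# print(complete(request))
```

Speculative decoding happens entirely on the server side, so the client code is the same as for a model served without a speculator.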

Served models can then be benchmarked using GuideLLM. The sample benchmark results compare the speculator with its dense counterpart. They also explore further gains from quantization by swapping the dense verifier, Qwen/Qwen3-8B, for the quantized FP8 model, RedHatAI/Qwen3-8B-FP8-dynamic, in the speculator_config.
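Swapping the verifier amounts to editing a local copy of the speculator's config.json. The sketch below is one hedged way to do that: because the exact layout of the speculator_config section is not described here, it makes no assumption about field names and simply replaces every value equal to the old verifier name. The helper names `replace_value` and `swap_verifier` are hypothetical.

```python
import json
from pathlib import Path
from typing import Any


def replace_value(node: Any, old: str, new: str) -> Any:
    """Return a copy of a JSON-like structure with every `old` value replaced."""
    if isinstance(node, dict):
        return {key: replace_value(value, old, new) for key, value in node.items()}
    if isinstance(node, list):
        return [replace_value(item, old, new) for item in node]
    return new if node == old else node


def swap_verifier(config_path: str, old: str, new: str) -> dict:
    """Rewrite config.json so the speculator points at a different verifier."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    # "speculator_config" is the key named in the text above; its inner
    # structure is deliberately not assumed.
    config["speculator_config"] = replace_value(
        config["speculator_config"], old, new
    )
    path.write_text(json.dumps(config, indent=2))
    return config


# Usage on a locally downloaded speculator checkpoint:
# swap_verifier("path/to/config.json",
#               "Qwen/Qwen3-8B", "RedHatAI/Qwen3-8B-FP8-dynamic")
```

Inspect the rewritten file before serving it, since a blanket replacement also changes any unrelated field that happens to hold the same string.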

Additional Utility Scripts

Regenerate responses to enhance your training data

Getting Started

Installation

Prerequisites

Before installing, ensure you have the following:

Operating System: Linux or macOS
Python: 3.10 or higher
Package Manager: pip (recommended) or conda
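The operating system and Python requirements above can be checked programmatically before installing. This is a small sketch using only the standard library; the function name `check_prerequisites` is hypothetical, and `linux` and `darwin` are the `sys.platform` prefixes for Linux and macOS.

```python
import sys
from typing import List, Sequence

MIN_PYTHON = (3, 10)
SUPPORTED_PLATFORMS = ("linux", "darwin")  # sys.platform for Linux and macOS


def check_prerequisites(version_info: Sequence[int] = sys.version_info,
                        platform: str = sys.platform) -> List[str]:
    """Return a list of unmet prerequisites (empty when all are satisfied)."""
    problems: List[str] = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        found = ".".join(str(part) for part in version_info[:2])
        problems.append(f"Python 3.10 or higher is required, found {found}")
    if not platform.startswith(SUPPORTED_PLATFORMS):
        problems.append(f"Linux or macOS is required, found {platform}")
    return problems


for problem in check_prerequisites():
    print(problem)
```

The script prints nothing when the current interpreter and platform meet the listed requirements.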

Install from PyPI (Recommended)

Install the latest stable release from PyPI:

pip install speculators

Install from Source

For the latest development version or to contribute to the project:

git clone https://github.com/vllm-project/speculators.git
cd speculators

pip install -e .

For development with additional tools:

pip install -e ".[dev]"

Verify Installation

You can verify your installation by checking the version:

speculators --version

Or by importing the package in Python:

import speculators
print(speculators.__version__)
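If you would rather check the installed version without importing the package, the standard library's `importlib.metadata` reads it from the installed distribution metadata. The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("speculators")
print(f"speculators {found}" if found else "speculators is not installed")
```

This is useful in setup scripts, where a failed import would otherwise raise before a helpful message can be shown.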

License

Speculators is licensed under the Apache License 2.0.

Cite

If you find Speculators helpful in your research or projects, please consider citing it:

@misc{speculators2025,
  title={Speculators: A Unified Library for Speculative Decoding Algorithms in LLM Serving},
  author={Red Hat},
  year={2025},
  howpublished={\url{https://github.com/vllm-project/speculators}},
}

Release history

Version	Release date
0.6.0 (this version)	Jun 16, 2026
0.5.0	Apr 24, 2026
0.5.0a0 (pre-release)	Apr 24, 2026
0.4.0.1	Mar 26, 2026
0.4.0	Mar 4, 2026
0.4.0a1 (pre-release)	Mar 25, 2026
0.4.0a0 (pre-release)	Mar 2, 2026
0.3.0	Dec 10, 2025
0.2.0	Nov 3, 2025
0.2.0a0 (pre-release)	Nov 3, 2025
0.1.0	Aug 8, 2025
0.1.0a9 (pre-release)	Jul 7, 2025
0.1.0a8 (pre-release)	Jul 3, 2025
0.1.0a7 (pre-release)	Jul 1, 2025
0.1.0a6 (pre-release)	Jun 28, 2025
0.0.1	May 1, 2025

Download files

Source Distribution

speculators-0.6.0.tar.gz (125.5 kB)

Uploaded Jun 16, 2026 Source

Built Distribution

speculators-0.6.0-py3-none-any.whl (145.0 kB)

Uploaded Jun 16, 2026 Python 3

File details

Details for the file speculators-0.6.0.tar.gz.

File metadata

Download URL: speculators-0.6.0.tar.gz
Upload date: Jun 16, 2026
Size: 125.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for speculators-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`6f8babb769bc1719883f2020f745fdc44c481d33d94114695f219011dc3fee03`
MD5	`a647839acbc563ad7fc9e23013f954f7`
BLAKE2b-256	`f32fa21297bc36aeee0d8942a4fa52a330a9acd29c93d26345c5d5bfe7e8a3cd`
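A downloaded file can be checked against the published SHA256 digest with the standard library. The sketch below copies the SHA256 values for both 0.6.0 files from this page; the helper names `sha256_of` and `verify_download` are hypothetical, and the file is assumed to keep its original name.

```python
import hashlib
from pathlib import Path

# SHA256 digests as published on this page for the 0.6.0 release files.
EXPECTED_SHA256 = {
    "speculators-0.6.0.tar.gz":
        "6f8babb769bc1719883f2020f745fdc44c481d33d94114695f219011dc3fee03",
    "speculators-0.6.0-py3-none-any.whl":
        "e2541c33d10d48b4f9641fc00650e52d99c2a66276a89a3a853a79b25ce868f1",
}


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads do not load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str) -> bool:
    """Compare a downloaded file against the digest published for its name."""
    expected = EXPECTED_SHA256[Path(path).name]
    return sha256_of(path) == expected


# Usage, after downloading the source distribution into the current directory:
# print(verify_download("speculators-0.6.0.tar.gz"))
```

pip can enforce the same check at install time through hash-pinned requirements files and its `--require-hashes` option.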

Provenance

The following attestation bundles were made for speculators-0.6.0.tar.gz:

Publisher: speculators-upload.yml on neuralmagic/llm-compressor-testing

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: speculators-0.6.0.tar.gz
- Subject digest: 6f8babb769bc1719883f2020f745fdc44c481d33d94114695f219011dc3fee03
- Sigstore transparency entry: 1839823022
- Sigstore integration time: Jun 16, 2026
Source repository:
- Permalink: neuralmagic/llm-compressor-testing@833cc3d17d2b388514d9652bdefed5f6217e8f6a
- Branch / Tag: refs/heads/main
- Owner: https://github.com/neuralmagic
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: speculators-upload.yml@833cc3d17d2b388514d9652bdefed5f6217e8f6a
- Trigger Event: workflow_dispatch

File details

Details for the file speculators-0.6.0-py3-none-any.whl.

File metadata

Download URL: speculators-0.6.0-py3-none-any.whl
Upload date: Jun 16, 2026
Size: 145.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for speculators-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e2541c33d10d48b4f9641fc00650e52d99c2a66276a89a3a853a79b25ce868f1`
MD5	`29e7713df2e14b6201270d5c1aedd564`
BLAKE2b-256	`f2719bf353a849d90705e4a4dd50b638664d83de385c27978672fd8f00857720`

Provenance

The following attestation bundles were made for speculators-0.6.0-py3-none-any.whl:

Publisher: speculators-upload.yml on neuralmagic/llm-compressor-testing

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: speculators-0.6.0-py3-none-any.whl
- Subject digest: e2541c33d10d48b4f9641fc00650e52d99c2a66276a89a3a853a79b25ce868f1
- Sigstore transparency entry: 1839823089
- Sigstore integration time: Jun 16, 2026
Source repository:
- Permalink: neuralmagic/llm-compressor-testing@833cc3d17d2b388514d9652bdefed5f6217e8f6a
- Branch / Tag: refs/heads/main
- Owner: https://github.com/neuralmagic
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: speculators-upload.yml@833cc3d17d2b388514d9652bdefed5f6217e8f6a
- Trigger Event: workflow_dispatch
