
Project description

vLLM TPU

| Documentation | Blog | User Forum | Developer Slack (#sig-tpu) |


🤝 Contribute to the Project
Looking to help? Click a badge below to find issues that need your attention.

bug good first issue enhancement contribution-welcome auto-generated View All Issues

Latest News

Previous News 🔥

About

vLLM TPU is now powered by tpu-inference, an expressive and powerful new hardware plugin that unifies JAX and PyTorch under a single lowering path within the vLLM project. The backend gives developers a framework to:

  • Push the limits of TPU hardware performance in open source.
  • Provide more flexibility to JAX and PyTorch users by running PyTorch model definitions performantly on TPU without any additional code changes, while also extending native support to JAX.
  • Retain vLLM standardization: keep the same user experience, telemetry, and interface.

Recommended models and features

Although vLLM TPU's new unified backend makes out-of-the-box, high-performance serving possible with any model supported in vLLM, a few core components are still being implemented.

For this reason, we've provided a Recommended Models and Features page detailing the models and features that are validated through unit, integration, and performance testing.


Get started

Get started with vLLM on TPUs by following the quickstart guide.

Visit our documentation to learn more.
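To make the serving interface concrete, here is a hedged sketch of a client request against vLLM's OpenAI-compatible endpoint. The local URL and port are assumptions (defaults for `vllm serve`), and the model name is one of the tested models listed further down; the snippet only builds the request, and sending it requires a server started per the quickstart guide.

```python
import json
from urllib import request

# Assumed local endpoint for a running `vllm serve` instance.
ENDPOINT = "http://localhost:8000/v1/completions"

def build_completion_request(model: str, prompt: str, max_tokens: int = 64) -> request.Request:
    """Build an OpenAI-style completion request for a vLLM server."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request("meta-llama/Llama-3.1-8B-Instruct", "Hello, TPU!")
# response = request.urlopen(req)  # requires a running server
```

The same request shape works regardless of whether the TPU backend lowers through JAX or PyTorch, since the vLLM interface is unchanged.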

Compatible TPU Generations

  • Recommended: v7x, v5e, v6e
  • Experimental: v3, v4, v5p

Recipes


TPU Support Matrix Dashboard

Below is the live status of our supported models, features, and kernels; it is updated automatically from our detailed Support Matrices. Click any category to expand its detailed support table.

Last Updated: 2026-05-01 04:54 PM UTC

🚦 Status Legend
  • ✅ Passing: Tested and works as expected. Ready for use.
  • ❌ Failing: Known to be broken or not functional. Help is wanted to fix this!
  • 🧪 Experimental: Works, but unoptimized or pending community validation.
  • 📝 Planned: Not yet implemented, but on the official roadmap.
  • ⛔️ Unplanned: Not on the roadmap; adding it is not expected to provide benefit.
  • ❓ Untested: The functionality exists but has not been recently or thoroughly verified.
📝 View Matrix Aggregation Rules (v6e/v7x & C+P)
  • 🛠️ Correctness + Performance (C + P)

    • ❌ Failing: If either check fails.
    • ✅ Passing: If BOTH checks pass successfully.
    • ❓ Untested: If any check is untested (and neither fails).
  • 🌐 Hardware Rollups (v6e + v7x)

    • ❌ Failing: If the feature fails on either v6e or v7x.
    • ✅ Passing: If the feature passes on BOTH v6e and v7x.
    • ❓ Untested: If either generation is untested (and neither fails).

Release Support Matrices


Stable support status for official releases and production deployments.

✅ Tested Models
Model Type Unit Test Correctness Test Performance Test
Qwen/Qwen2.5-VL-7B-Instruct Multimodal ✅ ✅ ✅
google/gemma-3-27b-it Text ✅ ✅ ✅
meta-llama/Llama-3.1-8B-Instruct Text ✅ ✅ ✅
meta-llama/Llama-3.3-70B-Instruct Text ✅ ✅ ✅
Qwen/Qwen3-30B-A3B Text ✅ ✅ ✅
Qwen/Qwen3-32B Text ✅ ✅ ✅
Qwen/Qwen3-4B Text ✅ ✅ ✅
Qwen/Qwen3-Coder-480B-A35B-Instruct Text ✅ ✅ ✅
Qwen/Qwen3.5-397B-A17B Text ✅ ✅ ✅
openai/gpt-oss-120b Text ✅ ✅ ❓
deepseek-ai/DeepSeek-R1 Text ✅ ❓ ❓
deepseek-ai/DeepSeek-OCR Multimodal ❓ ❓ ❓
moonshotai/Kimi-K2.5 Multimodal ❓ ❓ ❓
Qwen/Qwen3-Omni-30B-A3B-Instruct Multimodal ❓ ❓ ❓
Qwen/Qwen3-VL-8B-Instruct Multimodal ❓ ❓ ❓
Qwen/Qwen3.5-9B Multimodal ❓ ❓ ❓
deepseek-ai/DeepSeek-Math-V2 Text ❓ ❓ ❓
deepseek-ai/DeepSeek-V3.1 Text ❓ ❓ ❓
deepseek-ai/DeepSeek-V3.2 Text ❓ ❓ ❓
deepseek-ai/DeepSeek-V3.2-Speciale Text ❓ ❓ ❓
MiniMaxAI/MiniMax-M2.5 Text ❓ ❓ ❓
moonshotai/Kimi-K2-Thinking Text ❓ ❓ ❓
openai/gpt-oss-20b Text ❓ ❓ ❓
zai-org/GLM-5 Text ❓ ❓ ❓
🚀 Advanced Capabilities
Core Features
Feature Flax Torchax Default
Chunked Prefill ✅ ✅ ✅
DCN-based P/D disaggregation ✅ ✅ ✅
LoRA_Torch ✅ ✅ ✅
Out-of-tree model support ✅ ✅ ✅
Prefix Caching ✅ ✅ ✅
Single Program Multi Data ✅ ✅ ✅
Speculative Decoding: Ngram ✅ ✅ ✅
Multimodal Inputs ✅ ❌ ✅
Speculative Decoding: Eagle3 ✅ ❌ ✅
async scheduler ❌ ✅ ✅
runai_model_streamer_loader ❓ ❌ ❓
hybrid kv cache ❓ ❓ ❓
KV cache host offloading ❓ ❓ ❓
multi-host ❓ ❓ ❓
sampling_params ❓ ❓ ❓
Single-Host-P-D-disaggregation ❓ ❓ ❓
structured_decoding ❓ ❓ ❓
Parallelism Techniques
Feature Flax (Single-host, Multi-host) Torchax (Single-host, Multi-host)
PP ✅ ✅ ✅ ✅
DP ✅ ❓ ✅ ❓
EP ✅ ❓ ✅ ❓
TP ✅ ❓ ✅ ❓
CP ❓ ❓ ❓ ❓
SP (vote to prioritize) ❓ ❓ ❓ ❓
Quantization Methods
Checkpoint dtype Method Supported Hardware Flax Torchax
FP4 W4A16 mxfp4 v7 ❓ ❓
FP8 W8A16 compressed-tensor v7 ❓ ❓
FP8 W8A8 compressed-tensor v7 ❓ ❓
INT4 W4A16 awq v5, v6 ❓ ❓
INT8 W8A8 compressed-tensor v5, v6 ❓ ❓

Note:

  • This table only tests checkpoint loading compatibility.
🔬 Microbenchmark Kernel Support
Category Test W16A16 W8A8 W8A16 W4A4 W4A8 W4A16
MoE Fused MoE ❓ ❓ ❓ ❓ ❓ ❓
MoE gmm ❓ ❓ ❓ ❓ ❓ ❓
Dense All-gather matmul ❓ ❓ ❓ ❓ ❓ ❓
Attention Generic Ragged Paged Attention V3* ❓ ❓ ❓ ❓ ❓ ❓
Attention MLA ❓ ❓ ❓ ❓ ❓ ❓
Attention Ragged Paged Attention V3 Head_Dim 64* ❓ ❓ ❓ ❓ ❓ ❓

Note:

  • For attention kernels, W[x]A[y] denotes the KV-cache precision (W) and the compute precision (A), with x and y the bit widths.
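So, for example, W8A16 on an attention kernel means an 8-bit KV cache with 16-bit compute. A tiny parser for the notation, illustrative only (the function name is made up):

```python
import re

def parse_precision(label: str) -> dict:
    """Split a W[x]A[y] label into its two bit widths,
    e.g. 'W8A16' -> {'weight_bits': 8, 'activation_bits': 16}.
    For attention kernels, 'weight' means the KV cache and
    'activation' means compute precision."""
    m = re.fullmatch(r"W(\d+)A(\d+)", label)
    if m is None:
        raise ValueError(f"not a W[x]A[y] label: {label!r}")
    return {"weight_bits": int(m.group(1)), "activation_bits": int(m.group(2))}

assert parse_precision("W8A16") == {"weight_bits": 8, "activation_bits": 16}
```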

Nightly Support Matrices


Support status for the latest nightly/main branch developments.

✅ Tested Models
Model Type Unit Test Correctness Test Performance Test
Qwen/Qwen2.5-VL-7B-Instruct Multimodal ✅ ✅ ✅
google/gemma-3-27b-it Text ✅ ✅ ✅
meta-llama/Llama-3.1-8B-Instruct Text ✅ ✅ ✅
meta-llama/Llama-3.3-70B-Instruct Text ✅ ✅ ✅
Qwen/Qwen3-30B-A3B Text ✅ ✅ ✅
Qwen/Qwen3-32B Text ✅ ✅ ✅
Qwen/Qwen3-4B Text ✅ ✅ ✅
Qwen/Qwen3-Coder-480B-A35B-Instruct Text ✅ ✅ ✅
Qwen/Qwen3.5-397B-A17B Text ✅ ✅ ✅
openai/gpt-oss-120b Text ✅ ✅ ❓
deepseek-ai/DeepSeek-R1 Text ✅ ❓ ❓
deepseek-ai/DeepSeek-OCR Multimodal ❓ ❓ ❓
moonshotai/Kimi-K2.5 Multimodal ❓ ❓ ❓
Qwen/Qwen3-Omni-30B-A3B-Instruct Multimodal ❓ ❓ ❓
Qwen/Qwen3-VL-8B-Instruct Multimodal ❓ ❓ ❓
Qwen/Qwen3.5-9B Multimodal ❓ ❓ ❓
deepseek-ai/DeepSeek-Math-V2 Text ❓ ❓ ❓
deepseek-ai/DeepSeek-V3.1 Text ❓ ❓ ❓
deepseek-ai/DeepSeek-V3.2 Text ❓ ❓ ❓
deepseek-ai/DeepSeek-V3.2-Speciale Text ❓ ❓ ❓
MiniMaxAI/MiniMax-M2.5 Text ❓ ❓ ❓
moonshotai/Kimi-K2-Thinking Text ❓ ❓ ❓
openai/gpt-oss-20b Text ❓ ❓ ❓
zai-org/GLM-5 Text ❓ ❓ ❓
🚀 Advanced Capabilities
Core Features
Feature Flax Torchax Default
Chunked Prefill ✅ ✅ ✅
DCN-based P/D disaggregation ✅ ✅ ✅
LoRA_Torch ✅ ✅ ✅
Prefix Caching ✅ ✅ ✅
Single Program Multi Data ✅ ✅ ✅
Speculative Decoding: Ngram ✅ ✅ ✅
Speculative Decoding: Eagle3 ✅ ❌ ✅
async scheduler ❌ ✅ ✅
Out-of-tree model support ❌ ✅ ✅
Multimodal Inputs ❌ ❌ ❌
hybrid kv cache ❓ ❓ ❓
KV cache host offloading ❓ ❓ ❓
multi-host ❓ ❓ ❓
runai_model_streamer_loader ❓ ❓ ❓
sampling_params ❓ ❓ ❓
structured_decoding ❓ ❓ ❓
Parallelism Techniques
Feature Flax (Single-host, Multi-host) Torchax (Single-host, Multi-host)
PP ✅ ✅ ✅ ✅
EP ✅ ❓ ✅ ❓
TP ✅ ❓ ✅ ❓
DP ❌ ❓ ❌ ❓
CP ❓ ❓ ❓ ❓
SP (vote to prioritize) ❓ ❓ ❓ ❓
Quantization Methods
Checkpoint dtype Method Supported Hardware Flax Torchax
FP4 W4A16 mxfp4 v7 ❓ ❓
FP8 W8A16 compressed-tensor v7 ❓ ❓
FP8 W8A8 compressed-tensor v7 ❓ ❓
INT4 W4A16 awq v5, v6 ❓ ❓
INT8 W8A8 compressed-tensor v5, v6 ❓ ❓

Note:

  • This table only tests checkpoint loading compatibility.
🔬 Microbenchmark Kernel Support
Category Test W16A16 W8A8 W8A16 W4A4 W4A8 W4A16
MoE Fused MoE ❓ ❓ ❓ ❓ ❓ ❓
MoE gmm ❓ ❓ ❓ ❓ ❓ ❓
Dense All-gather matmul ❓ ❓ ❓ ❓ ❓ ❓
Attention Generic Ragged Paged Attention V3* ❓ ❓ ❓ ❓ ❓ ❓
Attention MLA ❓ ❓ ❓ ❓ ❓ ❓
Attention Ragged Paged Attention V3 Head_Dim 64* ❓ ❓ ❓ ❓ ❓ ❓

Note:

  • For attention kernels, W[x]A[y] denotes the KV-cache precision (W) and the compute precision (A), with x and y the bit widths.

🤝 Contribute

bug good first issue enhancement contribution-welcome auto-generated View All Issues

We're thrilled you're interested in contributing to the vLLM TPU project! Your help is essential for making our tools better for everyone. There are many ways to get involved, even if you're not ready to write code.

Ways to Contribute:

  • ๐Ÿž Submit Bugs & Suggest Features: See an issue or have an idea? Open a new issue to let us know.
  • ๐Ÿ‘€ Provide Feedback on Pull Requests: Lend your expertise by reviewing open pull requests and helping us improve the quality of our codebase.
  • ๐Ÿ“š Improve Our Documentation: Help us make our guides clearer. Fix a typo, clarify a confusing section, or write a new recipe.

If you're ready to contribute code, our Contributing Guide is the best place to start. It covers everything you need to know, including:

  • Tips for finding an issue to work on (we recommend starting with our good first issues!).

🌟 Contributors Wall

A huge thank you to everyone who has helped build and improve vllm-project/tpu-inference!

🌟 Contribution Type Legend & Ranking
Emoji Contribution Meaning
💻 Code Submitted merged pull requests or code changes.
🐛 Issues Opened valid issues or bug reports.
👀 Reviews Reviewed pull requests and provided feedback.

🏆 Ranking: Contributors are sorted from highest to lowest based on their total effort score (Total Commits + Unique Issues Opened + PRs Reviewed). Ties are displayed alphabetically.


xiangxu-google 💻
jrplatin 🐛 👀 💻
buildkite-bot 💻
kyuyeunk 🐛 👀 💻
py4 💻
fenghuizhang 💻
lk-chen 🐛 👀 💻
wenxindongwork 👀 💻
vanbasten23 👀 💻
sixiang-google 💻
lsy323 💻
Lumosis 💻
QiliangCui 👀 💻
Chenyaaang 👀 💻
bzgoogle 👀 💻
gpolovets1 👀 💻
mrjunwan-lang 👀 💻
yarongmu-google 💻
wwl2755-google 💻
yaochengji 💻
patemotter 👀 💻

...and more!

boe20211 💻
jcyang43 👀 💻
kwang3939 👀 💻
bythew3i 💻
pv97 👀 💻
karan 🐛 💻
dennisYehCienet 👀 💻
syhuang22 👀 💻
helloworld1 🐛 👀 💻
ica-chao 💻
richardsliu 👀 💻
catswe 👀 💻
RobMulla 🐛 💻
xingliu14 🐛 💻
juncgu-google 👀
saltysoup 🐛
weiyu0824 👀 💻
andrewkvuong 💻
rupengliu-meta 🐛 💻
bvrockwell 🐛 💻
sierraisland 💻
wang2yn84 💻
wdhongtw 💻
JiriesKaileh 💻
ylangtsou 💻
amacaskill 💻
BirdsOfAFthr 💻
patrickji2014 👀 💻
qihqi 🐛 💻
yuanfz98 🐛
cychiuak 💻
hosseinsarshar 🐛 💻
samos123 🐛
AlienKevin 🐛
dgouju 🐛
eitanporat 🐛
ernie-chang 💻
lepan-google 🐛 💻
muskansh-google 🐛
saikat-royc 👀
abhinavclemson 💻
aman2930 💻
BabyChouSr 🐛
CienetStingLin 💻
coolkp 💻
functionstackx 🐛
helloleah 💻
mailvijayasingh 💻
QiliangCui2023 👀
shireen-bean 🐛
utkarshsharma1 💻
A9isha 💻
AahilA 💻
amishacorns 💻
carlesoctav 🐛
dannikay 💻
depksingh 🐛
Dineshkumar-Anandan-ZS0367 🐛
dtrifiro 🐛
erfanzar 🐛
inho9606 💻
jk1333 🐛
jyj0w0 👀
kuafou 💻
kyle-google 💻
Mhdaw 🐛
mokeddembillel 🐛
oindrila-b 🐛
oliverdutton 🐛
pathfinder-pf 🐛
piotrfrankowski 🐛
reeaz27-droid 🐛
rupeng-liu 💻
salmanmohammadi 🐛
vlad-karp 💻
XMaster96 🐛
yixinshi 👀
yuyanpeng-google 💻
zixi-qi 💻
zongweiz 🐛
zzzwen 💻

💬 Contact us

Project details


Release history

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tpu_inference-0.19.0.dev20260506.tar.gz (902.7 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tpu_inference-0.19.0.dev20260506-py3-none-any.whl (1.2 MB)

Uploaded Python 3

File details

Details for the file tpu_inference-0.19.0.dev20260506.tar.gz.

File metadata

File hashes

Hashes for tpu_inference-0.19.0.dev20260506.tar.gz
Algorithm Hash digest
SHA256 db8d98f5fc934cf49e641fb12f42f768deff269b8f75cd38526f719191de733c
MD5 77a1dfc398fe68670b9f751441ae1a02
BLAKE2b-256 293319f509aeaeb6324309f70bd1b4c1b86d7e7c558caf89129a71ac03176734

See more details on using hashes here.
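The published hashes let you verify a download before installing it. A minimal sketch (the local file path in the commented line is an assumption):

```python
import hashlib

# SHA256 published for this release's sdist.
EXPECTED_SHA256 = "db8d98f5fc934cf49e641fb12f42f768deff269b8f75cd38526f719191de733c"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Verify a downloaded sdist (path is illustrative):
# assert sha256_of("tpu_inference-0.19.0.dev20260506.tar.gz") == EXPECTED_SHA256
```

Streaming in chunks keeps memory use flat even for large archives; a mismatch means the file is corrupt or not the one that was signed.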

Provenance

The following attestation bundles were made for tpu_inference-0.19.0.dev20260506.tar.gz:

Publisher: release.yml on vllm-project/tpu-inference

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tpu_inference-0.19.0.dev20260506-py3-none-any.whl.

File metadata

File hashes

Hashes for tpu_inference-0.19.0.dev20260506-py3-none-any.whl
Algorithm Hash digest
SHA256 dc5de0a984f1c17a9f6e9816c45026cde183a4f80d07bd8aced35622271fffa5
MD5 2c241812027648ea9fb9ce04f67954be
BLAKE2b-256 4363bc9a73344fb1bffe811ca00619dd756372499964823a81ad69f211c09ad0

See more details on using hashes here.

Provenance

The following attestation bundles were made for tpu_inference-0.19.0.dev20260506-py3-none-any.whl:

Publisher: release.yml on vllm-project/tpu-inference

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
