
vLLM TPU

| Documentation | Blog | User Forum | Developer Slack (#sig-tpu) |


๐Ÿค Contribute to the Project
Looking to help? Click a badge below to find issues that need your attention.

Issue labels: bug · good first issue · enhancement · contribution-welcome · auto-generated · View All Issues


About

vLLM TPU is now powered by tpu-inference, an expressive and powerful new hardware plugin that unifies JAX and PyTorch under a single lowering path within the vLLM project. The new backend provides a framework for developers to:

  • Push the limits of TPU hardware performance in open source.
  • Give JAX and PyTorch users more flexibility: PyTorch model definitions run performantly on TPU without any additional code changes, and native JAX support is provided as well.
  • Retain vLLM standardization: keep the same user experience, telemetry, and interface.

Recommended models and features

Although vLLM TPU's new unified backend makes high-performance serving possible out of the box with any model supported in vLLM, a few core components are still being implemented.

For this reason, we've provided a Recommended Models and Features page detailing the models and features that are validated through unit, integration, and performance testing.


Get started

Get started with vLLM on TPUs by following the quickstart guide.

Visit our documentation to learn more.

Compatible TPU Generations

  • Recommended: v7x, v5e, v6e
  • Experimental: v3, v4, v5p

Recipes


TPU Support Matrix Dashboard

Below is the live status of our supported models, features, and kernels. Click any category to expand its detailed support table. The dashboard is updated automatically from our detailed Support Matrices.

Last Updated: 2026-04-16 10:24 PM UTC

🚦 Status Legend
  • ✅ Passing: Tested and works as expected. Ready for use.
  • ❌ Failing: Known to be broken or not functional. Help is wanted to fix this!
  • 🧪 Experimental: Works, but unoptimized or pending community validation.
  • 📝 Planned: Not yet implemented, but on the official roadmap.
  • ⛔️ Unplanned: There is no benefit to adding this.
  • ❓ Untested: The functionality exists but has not been recently or thoroughly verified.
📝 View Matrix Aggregation Rules (v6e/v7x & C+P)
  • 🛠️ Correctness + Performance (C + P)

    • ❌ Failing: If either check fails.
    • ✅ Passing: If BOTH checks pass successfully.
    • ❓ Untested: If any check is untested (and neither fails).
  • 🌐 Hardware Rollups (v6e + v7x)

    • ❌ Failing: If the feature fails on either v6e or v7x.
    • ✅ Passing: If the feature passes on BOTH v6e and v7x.
    • ❓ Untested: If either generation is untested (and neither fails).

Release Support Matrices

Click to expand support matrices

Stable support status for official releases and production deployments.

✅ Tested Models

| Model | Type | Unit Test | Correctness Test | Performance Test |
|---|---|---|---|---|
| google/gemma-3-27b-it | Text | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.1-8B-Instruct | Text | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.3-70B-Instruct | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-30B-A3B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-32B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-4B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | ✅ | ✅ | ❌ |
| deepseek-ai/DeepSeek-OCR | Multimodal | ❓ | ❓ | ❓ |
| moonshotai/Kimi-K2.5 | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3.5-9B | Multimodal | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-Math-V2 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-R1 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.1 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.2 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | ❓ | ❓ | ❓ |
| MiniMaxAI/MiniMax-M2.5 | Text | ❓ | ❓ | ❓ |
| moonshotai/Kimi-K2-Thinking | Text | ❓ | ❓ | ❓ |
| openai/gpt-oss-120b | Text | ❓ | ❓ | ❓ |
| openai/gpt-oss-20b | Text | ❓ | ❓ | ❓ |
| zai-org/GLM-5 | Text | ❓ | ❓ | ❓ |
🚀 Advanced Capabilities

Core Features

| Feature | Flax | Torchax | Default |
|---|---|---|---|
| async scheduler | ✅ | ✅ | ✅ |
| Chunked Prefill | ✅ | ✅ | ✅ |
| DCN-based P/D disaggregation | ✅ | ✅ | ✅ |
| LoRA_Torch | ✅ | ✅ | ✅ |
| Out-of-tree model support | ✅ | ✅ | ✅ |
| Prefix Caching | ✅ | ✅ | ✅ |
| Single Program Multi Data | ✅ | ✅ | ✅ |
| Speculative Decoding: Ngram | ✅ | ✅ | ✅ |
| Multimodal Inputs | ✅ | ❌ | ✅ |
| Speculative Decoding: Eagle3 | ✅ | ❌ | ✅ |
| hybrid kv cache | ❓ | ❓ | ❓ |
| KV cache host offloading | ❓ | ❓ | ❓ |
| multi-host | ❓ | ❓ | ❓ |
| runai_model_streamer_loader | ❓ | ❓ | ❓ |
| sampling_params | ❓ | ❓ | ❓ |
| Single-Host-P-D-disaggregation | ❓ | ❓ | ❓ |
| structured_decoding | ❓ | ❓ | ❓ |
Parallelism Techniques

| Feature | Flax Single-host | Flax Multi-host | Torchax Single-host | Torchax Multi-host |
|---|---|---|---|---|
| EP | ✅ | ❓ | ✅ | ❓ |
| TP | ✅ | ❓ | ✅ | ❓ |
| PP | ❌ | ✅ | ❌ | ❌ |
| DP | ❌ | ❓ | ✅ | ❓ |
| CP | ❓ | ❓ | ❓ | ❓ |
| SP (vote to prioritize) | ❓ | ❓ | ❓ | ❓ |
Quantization Methods

| Checkpoint dtype | Method | Supported Hardware | Flax Acceleration | Torchax Acceleration |
|---|---|---|---|---|
| AWQ INT4 | | v5, v6 | ❓ | ❓ |
| FP4 W4A16 | mxfp4 | v7 | ❓ | ❓ |
| FP8 W8A16 | compressed-tensor | v7 | ❓ | ❓ |
| FP8 W8A8 | compressed-tensor | v7 | ❓ | ❓ |
| INT4 W4A16 | awq | v5, v6 | ❓ | ❓ |
| INT8 W8A8 | compressed-tensor | v5, v6 | ❓ | ❓ |

Note:

  • This table only tests checkpoint loading compatibility.
🔬 Microbenchmark Kernel Support

| Category | Test | W16A16 | W8A8 | W8A16 | W4A4 | W4A8 | W4A16 |
|---|---|---|---|---|---|---|---|
| MoE | Fused MoE | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| MoE | gmm | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Dense | All-gather matmul | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | Generic Ragged Paged Attention V3* | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | MLA | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | Ragged Paged Attention V3 Head_Dim 64* | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |

Note:

  • For attention kernels, W[x]A[y] denotes the KV-cache precision (W) and the compute precision (A), where x and y are the bit widths.
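The W[x]A[y] labels in the column headers are easy to decode mechanically. The snippet below is a minimal illustration of the notation only (the `parse_wa` helper is our own, not a tpu-inference API):

```python
import re

def parse_wa(label: str) -> tuple[int, int]:
    """Split a W[x]A[y] precision label into (w_bits, a_bits).

    For dense/MoE kernels W is the weight precision; for attention
    kernels W is the KV-cache precision. A is the compute precision."""
    m = re.fullmatch(r"W(\d+)A(\d+)", label)
    if m is None:
        raise ValueError(f"not a W[x]A[y] label: {label!r}")
    return int(m.group(1)), int(m.group(2))
```

So `parse_wa("W8A16")` yields `(8, 16)`: 8-bit weights (or KV cache, for attention) with 16-bit compute.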

Nightly Support Matrices

Click to expand support matrices

Support status for the latest nightly/main branch developments.

✅ Tested Models

| Model | Type | Unit Test | Correctness Test | Performance Test |
|---|---|---|---|---|
| google/gemma-3-27b-it | Text | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.1-8B-Instruct | Text | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.3-70B-Instruct | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-30B-A3B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-32B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-4B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3.5-397B-A17B | Text | ✅ | ✅ | ❌ |
| openai/gpt-oss-120b | Text | ✅ | ✅ | ❓ |
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | ✅ | ❌ | ❓ |
| deepseek-ai/DeepSeek-R1 | Text | ✅ | ❓ | ❓ |
| google/gemma-4-26B-A4B-it | Multimodal | ❌ | ❓ | ❓ |
| google/gemma-4-31B-it | Multimodal | ❌ | ❓ | ❓ |
| deepseek-ai/DeepSeek-OCR | Multimodal | ❓ | ❓ | ❓ |
| moonshotai/Kimi-K2.5 | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3.5-9B | Multimodal | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-Math-V2 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.1 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.2 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | ❓ | ❓ | ❓ |
| MiniMaxAI/MiniMax-M2.5 | Text | ❓ | ❓ | ❓ |
| moonshotai/Kimi-K2-Thinking | Text | ❓ | ❓ | ❓ |
| openai/gpt-oss-20b | Text | ❓ | ❓ | ❓ |
| zai-org/GLM-5 | Text | ❓ | ❓ | ❓ |
🚀 Advanced Capabilities

Core Features

| Feature | Flax | Torchax | Default |
|---|---|---|---|
| Chunked Prefill | ✅ | ✅ | ✅ |
| DCN-based P/D disaggregation | ✅ | ✅ | ✅ |
| LoRA_Torch | ✅ | ✅ | ✅ |
| Prefix Caching | ✅ | ✅ | ✅ |
| Single Program Multi Data | ✅ | ✅ | ✅ |
| Speculative Decoding: Ngram | ✅ | ✅ | ✅ |
| async scheduler | ✅ | ✅ | ❌ |
| Speculative Decoding: Eagle3 | ✅ | ❌ | ✅ |
| Out-of-tree model support | ❌ | ✅ | ❌ |
| Multimodal Inputs | ❌ | ❌ | ❌ |
| Single-Host-P-D-disaggregation | ❌ | ❓ | ❌ |
| hybrid kv cache | ❓ | ❓ | ❓ |
| KV cache host offloading | ❓ | ❓ | ❓ |
| multi-host | ❓ | ❓ | ❓ |
| runai_model_streamer_loader | ❓ | ❓ | ❓ |
| sampling_params | ❓ | ❓ | ❓ |
| structured_decoding | ❓ | ❓ | ❓ |
Parallelism Techniques

| Feature | Flax Single-host | Flax Multi-host | Torchax Single-host | Torchax Multi-host |
|---|---|---|---|---|
| EP | ✅ | ❓ | ✅ | ❓ |
| TP | ✅ | ❓ | ✅ | ❓ |
| PP | ❌ | ❌ | ✅ | ✅ |
| DP | ❌ | ❓ | ✅ | ❓ |
| CP | ❓ | ❓ | ❓ | ❓ |
| SP (vote to prioritize) | ❓ | ❓ | ❓ | ❓ |
Quantization Methods

| Checkpoint dtype | Method | Supported Hardware | Flax Acceleration | Torchax Acceleration |
|---|---|---|---|---|
| FP4 W4A16 | mxfp4 | v7 | ❓ | ❓ |
| FP8 W8A16 | compressed-tensor | v7 | ❓ | ❓ |
| FP8 W8A8 | compressed-tensor | v7 | ❓ | ❓ |
| INT4 W4A16 | awq | v5, v6 | ❓ | ❓ |
| INT8 W8A8 | compressed-tensor | v5, v6 | ❓ | ❓ |

Note:

  • This table only tests checkpoint loading compatibility.
🔬 Microbenchmark Kernel Support

| Category | Test | W16A16 | W8A8 | W8A16 | W4A4 | W4A8 | W4A16 |
|---|---|---|---|---|---|---|---|
| MoE | Fused MoE | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| MoE | gmm | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Dense | All-gather matmul | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | Generic Ragged Paged Attention V3* | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | MLA | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | Ragged Paged Attention V3 Head_Dim 64* | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |

Note:

  • For attention kernels, W[x]A[y] denotes the KV-cache precision (W) and the compute precision (A), where x and y are the bit widths.

๐Ÿค Contribute

bug good first issue enhancement contribution-welcome auto-generated View All Issues

We're thrilled you're interested in contributing to the vLLM TPU project! Your help is essential for making our tools better for everyone. There are many ways to get involved, even if you're not ready to write code.

Ways to Contribute:

  • ๐Ÿž Submit Bugs & Suggest Features: See an issue or have an idea? Open a new issue to let us know.
  • ๐Ÿ‘€ Provide Feedback on Pull Requests: Lend your expertise by reviewing open pull requests and helping us improve the quality of our codebase.
  • ๐Ÿ“š Improve Our Documentation: Help us make our guides clearer. Fix a typo, clarify a confusing section, or write a new recipe.

If you're ready to contribute code, our Contributing Guide is the best place to start. It covers everything you need to know, including:

  • Tips for finding an issue to work on (we recommend starting with our good first issues!).

🌟 Contributors Wall

A huge thank you to everyone who has helped build and improve vllm-project/tpu-inference!

🌟 Contribution Type Legend & Ranking

| Emoji | Contribution | Meaning |
|---|---|---|
| 💻 | Code | Submitted merged pull requests or code changes. |
| 🐛 | Issues | Opened valid issues or bug reports. |
| 👀 | Reviews | Reviewed pull requests and provided feedback. |

🏆 Ranking: Contributors are sorted from highest to lowest based on their total effort score (Total Commits + Unique Issues Opened + PRs Reviewed). If there is a tie, contributors are displayed alphabetically.


xiangxu-google 💻
jrplatin 🐛 👀 💻
buildkite-bot 💻
kyuyeunk 🐛 👀 💻
py4 💻
fenghuizhang 💻
lk-chen 🐛 👀 💻
wenxindongwork 👀 💻
vanbasten23 👀 💻
sixiang-google 💻
lsy323 💻
Lumosis 💻
QiliangCui 👀 💻
Chenyaaang 👀 💻
bzgoogle 👀 💻
gpolovets1 👀 💻
mrjunwan-lang 👀 💻
yarongmu-google 💻
wwl2755-google 💻
yaochengji 💻
patemotter 👀 💻
boe20211 💻
jcyang43 👀 💻
kwang3939 👀 💻
bythew3i 💻
pv97 👀 💻
karan 🐛 💻
dennisYehCienet 👀 💻
syhuang22 👀 💻
helloworld1 🐛 👀 💻
ica-chao 💻
richardsliu 👀 💻
catswe 👀 💻
RobMulla 🐛 💻
xingliu14 🐛 💻
juncgu-google 👀
saltysoup 🐛
weiyu0824 👀 💻
andrewkvuong 💻
rupengliu-meta 🐛 💻
bvrockwell 🐛 💻
sierraisland 💻
wang2yn84 💻
wdhongtw 💻
JiriesKaileh 💻
ylangtsou 💻
amacaskill 💻
BirdsOfAFthr 💻
patrickji2014 👀 💻
qihqi 🐛 💻
yuanfz98 🐛
cychiuak 💻
hosseinsarshar 🐛 💻
samos123 🐛
AlienKevin 🐛
dgouju 🐛
eitanporat 🐛
ernie-chang 💻
lepan-google 🐛 💻
muskansh-google 🐛
saikat-royc 👀
abhinavclemson 💻
aman2930 💻
BabyChouSr 🐛
CienetStingLin 💻
coolkp 💻
functionstackx 🐛
helloleah 💻
mailvijayasingh 💻
QiliangCui2023 👀
shireen-bean 🐛
utkarshsharma1 💻
A9isha 💻
AahilA 💻
amishacorns 💻
carlesoctav 🐛
dannikay 💻
depksingh 🐛
Dineshkumar-Anandan-ZS0367 🐛
dtrifiro 🐛
erfanzar 🐛
inho9606 💻
jk1333 🐛
jyj0w0 👀
kuafou 💻
kyle-google 💻
Mhdaw 🐛
mokeddembillel 🐛
oindrila-b 🐛
oliverdutton 🐛
pathfinder-pf 🐛
piotrfrankowski 🐛
reeaz27-droid 🐛
rupeng-liu 💻
salmanmohammadi 🐛
vlad-karp 💻
XMaster96 🐛
yixinshi 👀
yuyanpeng-google 💻
zixi-qi 💻
zongweiz 🐛
zzzwen 💻

💬 Contact us


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tpu_inference-0.18.0.tar.gz (750.8 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tpu_inference-0.18.0-py3-none-any.whl (970.1 kB)

Uploaded Python 3

File details

Details for the file tpu_inference-0.18.0.tar.gz.

File metadata

  • Download URL: tpu_inference-0.18.0.tar.gz
  • Upload date:
  • Size: 750.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for tpu_inference-0.18.0.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | 966a643ddab4ca8094d492a48f548a555398b37a6f463294fe6d8a6190a9d036 |
| MD5 | 9088c417f2be6ded813ed2b463ebc43a |
| BLAKE2b-256 | 4ae91439764de20923559333eee6bdd975d54cdc6c78d530e441eeb7555ef651 |

See more details on using hashes here.
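One common use of the published digests is verifying a downloaded file before installing it. A minimal sketch using only the standard library (`sha256_of` is an illustrative helper name, not part of any tool above):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest.

    Reading in chunks means even a large sdist or wheel never has
    to fit in memory all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Compare the result against the SHA256 value in the table above; any mismatch means the download is corrupt or has been tampered with.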

File details

Details for the file tpu_inference-0.18.0-py3-none-any.whl.

File metadata

  • Download URL: tpu_inference-0.18.0-py3-none-any.whl
  • Upload date:
  • Size: 970.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for tpu_inference-0.18.0-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 54ff44d86654356a73570136504d9322de3b043e952791111d76d3a574407e91 |
| MD5 | 5a92a6711899e12ab3b0c749297da8b1 |
| BLAKE2b-256 | 3f1f5f2aef5f244acc8550c2743b66acd6c8b02ec4925c45a3794ddd0fca25fb |

See more details on using hashes here.
