
FlashInfer

Kernel Library for LLM Serving

| Blog | Documentation | Slack | Discussion Forum |

FlashInfer is a library and kernel generator for Large Language Models that provides high-performance implementations of LLM GPU kernels such as FlashAttention, SparseAttention, PageAttention, Sampling, and more. FlashInfer focuses on LLM serving and inference, and delivers state-of-the-art performance across diverse scenarios.

Check out our v0.2 release blog for new features!

The core features of FlashInfer include:

  1. Efficient Sparse/Dense Attention Kernels: Efficient single/batch attention for sparse (paged) and dense KV storage on CUDA Cores and Tensor Cores (both FA2 & FA3 templates). The vector-sparse attention kernels achieve 90% of the bandwidth of dense kernels at the same problem size.
  2. Load-Balanced Scheduling: FlashInfer decouples the plan and run stages of attention computation, scheduling variable-length inputs in the plan stage to alleviate load imbalance.
  3. Memory Efficiency: FlashInfer offers Cascade Attention for hierarchical KV-Cache, Head-Query fusion to accelerate Grouped-Query Attention, and efficient kernels for low-precision attention and fused-RoPE attention on compressed KV-Cache.
  4. Customizable Attention: Bring your own attention variants through JIT-compilation.
  5. CUDAGraph and torch.compile Compatibility: FlashInfer kernels can be captured by CUDAGraphs and torch.compile for low-latency inference.
  6. Efficient LLM-specific Operators: High-performance fused kernels for Top-P/Top-K/Min-P sampling without the need for sorting (see the sketch after this list).
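
The sorting-free sampling kernels are exposed through the flashinfer.sampling module. The snippet below is a minimal sketch: the batch size, vocabulary size, and threshold values are made up for illustration, and the exact call signatures should be checked against the sampling documentation.

import torch
import flashinfer

batch_size, vocab_size = 4, 32000  # illustrative sizes
# per-request probability distribution over the vocabulary (e.g. softmax of logits)
probs = torch.softmax(torch.randn(batch_size, vocab_size, device="cuda:0"), dim=-1)

# sorting-free top-p (nucleus) sampling
samples = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9)

# combined top-k + top-p sampling
samples = flashinfer.sampling.top_k_top_p_sampling_from_probs(probs, top_k=50, top_p=0.9)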

FlashInfer supports PyTorch, TVM and C++ (header-only) APIs, and can be easily integrated into existing projects.

News

  • [Mar 10, 2025] Blog Post Sorting-Free GPU Kernels for LLM Sampling, which explains the design of sampling kernels in FlashInfer.
  • [Mar 1, 2025] Check out FlashInfer's intra-kernel profiler for visualizing the timeline of each threadblock in GPU kernels.
  • [Dec 16, 2024] Blog Post FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving
  • [Sept 2024] We've launched a Slack workspace for FlashInfer users and developers. Join us for timely support, discussions, updates, and knowledge sharing!
  • [Jan 31, 2024] Blog Post Cascade Inference: Memory-Efficient Shared Prefix Batch Decoding
  • [Jan 31, 2024] Blog Post Accelerating Self-Attentions for LLM Serving with FlashInfer

Getting Started

Using our PyTorch API is the easiest way to get started:

Install from PyPI

FlashInfer is available as a Python package for Linux. Install the core package with:

pip install flashinfer-python

Package Options:

  • flashinfer-python: Core package that compiles/downloads kernels on first use
  • flashinfer-cubin: Pre-compiled kernel binaries for all supported GPU architectures
  • flashinfer-jit-cache: Pre-built kernel cache for specific CUDA versions

For faster initialization and offline usage, install the optional packages to have most kernels pre-compiled:

pip install flashinfer-python flashinfer-cubin
# JIT cache package (replace cu129 with your CUDA version: cu128, cu129, or cu130)
pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu129

This eliminates compilation and downloading overhead at runtime.

Install from Source

Build the core package from source:

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
python -m pip install -v .

For development, install in editable mode:

python -m pip install --no-build-isolation -e . -v

Build optional packages:

flashinfer-cubin:

cd flashinfer-cubin
python -m build --no-isolation --wheel
python -m pip install dist/*.whl

flashinfer-jit-cache (customize FLASHINFER_CUDA_ARCH_LIST for your target GPUs):

export FLASHINFER_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f"
cd flashinfer-jit-cache
python -m build --no-isolation --wheel
python -m pip install dist/*.whl

For more details, see the Install from Source documentation.

Install Nightly Build

Nightly builds are available for testing the latest features:

# Core and cubin packages
pip install -U --pre flashinfer-python --index-url https://flashinfer.ai/whl/nightly/ --no-deps # Install the nightly package from custom index, without installing dependencies
pip install flashinfer-python  # Install flashinfer-python's dependencies from PyPI
pip install -U --pre flashinfer-cubin --index-url https://flashinfer.ai/whl/nightly/
# JIT cache package (replace cu129 with your CUDA version: cu128, cu129, or cu130)
pip install -U --pre flashinfer-jit-cache --index-url https://flashinfer.ai/whl/nightly/cu129

Verify Installation

After installation, verify that FlashInfer is correctly installed and configured:

flashinfer show-config

This command displays:

  • FlashInfer version and installed packages (flashinfer-python, flashinfer-cubin, flashinfer-jit-cache)
  • PyTorch and CUDA version information
  • Environment variables and artifact paths
  • Downloaded cubin status and module compilation status
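
You can also confirm that the package imports correctly and report its version from Python (assuming the package's standard __version__ attribute):

python -c "import flashinfer; print(flashinfer.__version__)"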

Trying it out

Below is a minimal example of using FlashInfer's single-request decode/append/prefill attention kernels:

import torch
import flashinfer

kv_len = 2048
num_kv_heads = 32
head_dim = 128

k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)

# decode attention

num_qo_heads = 32
q = torch.randn(num_qo_heads, head_dim).half().to(0)

o = flashinfer.single_decode_with_kv_cache(q, k, v) # decode attention without RoPE on-the-fly
o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA") # decode with LLaMA style RoPE on-the-fly

# append attention
append_qo_len = 128
q = torch.randn(append_qo_len, num_qo_heads, head_dim).half().to(0) # append attention, the last 128 tokens in the KV-Cache are the new tokens
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True) # append attention without RoPE on-the-fly, apply causal mask
o_rope_on_the_fly = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, pos_encoding_mode="ROPE_LLAMA") # append attention with LLaMA style RoPE on-the-fly, apply causal mask

# prefill attention
qo_len = 2048
q = torch.randn(qo_len, num_qo_heads, head_dim).half().to(0) # prefill attention
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=False) # prefill attention without RoPE on-the-fly, do not apply causal mask

Check out the documentation for usage of batch decode/append/prefill kernels and shared-prefix cascading kernels.
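
As a preview of the batch API, below is a minimal sketch of batch decode attention over a paged KV-Cache with BatchDecodeWithPagedKVCacheWrapper. The page-table contents and tensor shapes are made up for illustration; refer to the batch decode documentation for the authoritative plan/run signatures.

import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
page_size, max_num_pages, batch_size = 16, 64, 4

# 128 MB workspace buffer used by the scheduler during the plan stage
workspace_buffer = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda:0")
decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace_buffer, "NHD")

# paged KV-Cache: (max_num_pages, 2, page_size, num_kv_heads, head_dim) in "NHD" layout
kv_cache = torch.randn(max_num_pages, 2, page_size, num_kv_heads, head_dim).half().to(0)

# illustrative page table: the four requests own pages [0, 8), [8, 20), [20, 21), [21, 64)
kv_page_indptr = torch.tensor([0, 8, 20, 21, 64], dtype=torch.int32, device="cuda:0")
kv_page_indices = torch.arange(64, dtype=torch.int32, device="cuda:0")
kv_last_page_len = torch.tensor([5, 16, 1, 9], dtype=torch.int32, device="cuda:0")  # valid tokens in each request's last page

# plan stage: schedule the variable-length batch once, then reuse across run calls
decode_wrapper.plan(
    kv_page_indptr, kv_page_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    data_type=torch.float16,
)

q = torch.randn(batch_size, num_qo_heads, head_dim).half().to(0)  # one query token per request
o = decode_wrapper.run(q, kv_cache)  # (batch_size, num_qo_heads, head_dim)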

API Logging

FlashInfer provides comprehensive API logging for debugging. Enable it using environment variables:

# Enable logging (levels: 0=off (default), 1=basic, 3=detailed, 5=statistics)
export FLASHINFER_LOGLEVEL=3

# Set log destination (stdout (default), stderr, or file path)
export FLASHINFER_LOGDEST=stdout
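
The same variables can also be set from Python; in this sketch they are set before importing flashinfer, on the assumption that the logger reads them at import time:

import os

os.environ["FLASHINFER_LOGLEVEL"] = "3"              # detailed logging
os.environ["FLASHINFER_LOGDEST"] = "flashinfer.log"  # write logs to a file

import flashinfer  # imported after the variables are set (assumption noted above)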

For detailed information about logging levels, configuration, and advanced features, see Logging in our documentation.

Custom Attention Variants

Starting from FlashInfer v0.2, users can customize their own attention variants with additional parameters. For more details, refer to our JIT examples.

GPU and CUDA Support

FlashInfer currently supports NVIDIA SM architectures 75 and higher, with beta support for SM architectures 103, 110, 120, and 121.

Supported CUDA Versions: 12.6, 12.8, 13.0, 13.1

Note: FlashInfer strives to follow PyTorch's supported CUDA versions plus the latest CUDA release.

Adoption

We are thrilled to share that FlashInfer is being adopted by many cutting-edge projects.

Acknowledgement

FlashInfer is inspired by the FlashAttention 1&2, vLLM, stream-K, cutlass, and AITemplate projects.

Citation

If you find FlashInfer helpful in your project or research, please consider citing our paper:

@article{ye2025flashinfer,
    title = {FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving},
    author = {
      Ye, Zihao and
      Chen, Lequn and
      Lai, Ruihang and
      Lin, Wuwei and
      Zhang, Yineng and
      Wang, Stephanie and
      Chen, Tianqi and
      Kasikci, Baris and
      Grover, Vinod and
      Krishnamurthy, Arvind and
      Ceze, Luis
    },
    journal = {arXiv preprint arXiv:2501.01005},
    year = {2025},
    url = {https://arxiv.org/abs/2501.01005}
}
