FlashInfer: Kernel Library for LLM Serving

FlashInfer

High-Performance GPU Kernels for Inference

FlashInfer is a library and kernel generator for inference that delivers state-of-the-art performance across diverse GPU architectures. It provides unified APIs for attention, GEMM, and MoE operations with multiple backend implementations including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.

Why FlashInfer?

State-of-the-art Performance: Optimized kernels for prefill, decode, and mixed batching scenarios
Multiple Backends: Automatically selects the best backend for your hardware and workload
Modern Architecture Support: SM75 (Turing) and later, through Blackwell
Low-Precision Compute: FP8 and FP4 quantization for attention, GEMM, and MoE operations
Production-Ready: CUDAGraph and torch.compile compatible for low-latency serving

Core Features

Attention Kernels

Paged and Ragged KV-Cache: Efficient memory management for dynamic batch serving
Decode, Prefill, and Append: Optimized kernels for all attention phases
MLA Attention: Native support for DeepSeek's Multi-head Latent Attention (MLA)
Cascade Attention: Memory-efficient hierarchical KV-Cache for shared prefixes
Sparse Attention: Block-sparse and variable block-sparse patterns
POD-Attention: Fused prefill+decode for mixed batching
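
To make the paged KV-cache idea concrete, here is a minimal pure-Python sketch of a CSR-style page table. The names `indptr`, `indices`, and `last_page_len` and the helper functions are illustrative assumptions for this sketch, not FlashInfer API calls; the real kernels operate on GPU tensors.

```python
from typing import List, Tuple


def kv_lengths(indptr: List[int], last_page_len: List[int], page_size: int) -> List[int]:
    """Number of cached tokens per request in a CSR-style page table."""
    lengths = []
    for i in range(len(indptr) - 1):
        num_pages = indptr[i + 1] - indptr[i]
        if num_pages == 0:
            lengths.append(0)
        else:
            # Every page is full except possibly the last one.
            lengths.append((num_pages - 1) * page_size + last_page_len[i])
    return lengths


def locate_token(indptr: List[int], indices: List[int], page_size: int,
                 request: int, pos: int) -> Tuple[int, int]:
    """Map (request, token position) to (physical page id, slot within page)."""
    entry = indptr[request] + pos // page_size
    if entry >= indptr[request + 1]:
        raise IndexError("token position is beyond the pages owned by this request")
    return indices[entry], pos % page_size


# Two requests sharing one pool of pages, with page_size = 4.
indptr = [0, 2, 5]         # request 0 owns indices[0:2], request 1 owns indices[2:5]
indices = [7, 3, 0, 9, 4]  # physical page ids in logical order
last_page_len = [1, 4]     # valid tokens in each request's final page

print(kv_lengths(indptr, last_page_len, page_size=4))      # [5, 12]
print(locate_token(indptr, indices, 4, request=1, pos=6))  # (9, 2)
```

Because requests only hold page ids, a request can grow by appending one more id to its slice rather than reallocating a contiguous buffer.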

GEMM & Linear Operations

BF16 GEMM: BF16 matrix multiplication for SM 10.0+ GPUs
FP8 GEMM: Per-tensor and groupwise scaling
FP4 GEMM: NVFP4 and MXFP4 matrix multiplication for Blackwell GPUs
Grouped GEMM: Efficient batched matrix operations for LoRA and multi-expert routing
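
The difference between per-tensor and groupwise scaling can be sketched in plain Python. This is an illustration under stated assumptions, not FlashInfer code: the helper names and the group size of 4 are hypothetical, and the only fixed fact used is that the largest finite FP8 E4M3 value is 448.

```python
from typing import List

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def per_tensor_scale(values: List[float]) -> float:
    """One scale for the whole tensor: maps the largest magnitude to the FP8 maximum."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def groupwise_scales(values: List[float], group_size: int) -> List[float]:
    """One scale per contiguous group, so small-magnitude groups keep more precision."""
    return [
        per_tensor_scale(values[i:i + group_size])
        for i in range(0, len(values), group_size)
    ]


weights = [0.5, -2.0, 896.0, 1.0, 0.25, -0.125, 0.0625, 0.03125]
print(per_tensor_scale(weights))     # 2.0, dominated by the single outlier 896.0
print(groupwise_scales(weights, 4))  # the second group gets a much smaller scale
```

With a single per-tensor scale, one outlier forces every value onto a coarse grid; groupwise scaling limits that effect to the group containing the outlier.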

Mixture of Experts (MoE)

Fused MoE Kernels
Multiple Routing Methods: DeepSeek-V3, Llama-4, and standard top-k routing
Quantized MoE: FP8 and FP4 expert weights with block-wise scaling
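
As a rough illustration of standard top-k routing, the sketch below applies a softmax over expert logits, keeps the k most probable experts, and renormalizes their weights. This is one common convention and an assumption of this example; model-specific routing methods such as those of DeepSeek-V3 or Llama-4 differ, and `topk_route` is a hypothetical helper, not a FlashInfer function.

```python
import math
from typing import List, Tuple


def topk_route(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert id, weight) pairs for the k highest-scoring experts."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Select the k most probable experts; ties go to the lower expert id.
    chosen = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


routes = topk_route([1.0, 3.0, 0.5, 2.0], k=2)
print([expert for expert, _ in routes])  # [1, 3]
```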

Sampling & Decoding

Sorting-Free Sampling: Efficient Top-K, Top-P, and Min-P without sorting
Speculative Decoding: Chain speculative sampling support
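
One way to do top-p sampling without sorting is rejection sampling against a moving pivot. The sketch below is a simplified CPU illustration of that idea, not FlashInfer's kernel: a drawn token is accepted if the total probability of strictly more likely tokens is below `top_p`; otherwise its probability becomes the pivot and the next draw is restricted to more likely tokens. All function names here are hypothetical.

```python
import random
from typing import List


def sample_above(probs: List[float], pivot: float, rng: random.Random) -> int:
    """Inverse-transform sample among tokens whose probability exceeds `pivot`."""
    candidates = [i for i, p in enumerate(probs) if p > pivot]
    total = sum(probs[i] for i in candidates)
    u = rng.random() * total
    acc = 0.0
    for i in candidates:
        acc += probs[i]
        if u < acc:
            return i
    return candidates[-1]  # guard against floating-point round-off


def top_p_sample(probs: List[float], top_p: float, rng: random.Random) -> int:
    """Draw one token from the top-p nucleus without sorting the distribution."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    pivot = 0.0
    while True:
        token = sample_above(probs, pivot, rng)
        pivot = probs[token]
        mass_above = sum(p for p in probs if p > pivot)
        if mass_above < top_p:
            return token  # token lies inside the nucleus
        # Otherwise reject and resample among strictly more likely tokens.


rng = random.Random(0)
probs = [0.5, 0.3, 0.15, 0.05]
samples = [top_p_sample(probs, top_p=0.7, rng=rng) for _ in range(200)]
print(sorted(set(samples)))  # only tokens from the nucleus {0, 1} can appear
```

Each rejection strictly shrinks the candidate set, so the loop terminates; the most likely token is always accepted because no probability mass lies above it.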

Communication

AllReduce: Custom implementations
Multi-Node NVLink: MNNVL support for multi-node inference
NVSHMEM Integration: For distributed memory operations

Other Operators

RoPE: LLaMA-style rotary position embeddings (including LLaMA 3.1)
Normalization: RMSNorm, LayerNorm, Gemma-style fused operations
Activations: SiLU, GELU with fused gating
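
For reference, the math behind two of these operators is short enough to write out. The following is a pure-Python sketch of RMSNorm and a SiLU-gated product; the function names and the `eps` default are assumptions of this example, and the fused GPU kernels are not implemented this way.

```python
import math
from typing import List


def rmsnorm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu_gate(gate: List[float], up: List[float]) -> List[float]:
    """Elementwise SiLU(gate) * up, the gating used in many LLM feed-forward layers."""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]


y = rmsnorm([3.0, 4.0], [1.0, 1.0], eps=0.0)
print(sum(v * v for v in y) / len(y))  # mean square is 1.0 up to floating-point rounding
print(silu_gate([0.0], [5.0]))         # [0.0]
```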

GPU Support

Architecture	Compute Capability	Example GPUs
Turing	SM 7.5	T4, RTX 20 series
Ampere	SM 8.0, 8.6	A100, A10, RTX 30 series
Ada Lovelace	SM 8.9	L4, L40, RTX 40 series
Hopper	SM 9.0	H100, H200
Blackwell	SM 10.0, 10.3	B200, B300
Blackwell	SM 11.0	Jetson Thor
Blackwell	SM 12.0, 12.1	RTX 50 series, DGX Spark

Note: Not all features are supported across all compute capabilities.
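
A small lookup built from the table above can help decide whether a device falls in the supported range. The dictionary and function below are hypothetical helpers written for this example; only the capability-to-architecture pairs come from the table.

```python
from typing import Optional, Tuple

# Rows copied from the GPU support table.
ARCH_BY_CAPABILITY = {
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada Lovelace",
    (9, 0): "Hopper",
    (10, 0): "Blackwell",
    (10, 3): "Blackwell",
    (11, 0): "Blackwell",
    (12, 0): "Blackwell",
    (12, 1): "Blackwell",
}


def architecture_for(capability: Tuple[int, int]) -> Optional[str]:
    """Return the architecture name for a (major, minor) pair, or None if unlisted."""
    return ARCH_BY_CAPABILITY.get(capability)


print(architecture_for((9, 0)))  # Hopper
print(architecture_for((7, 0)))  # None
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed to this function.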

News

Notable updates:

[2025-10-08] Blackwell support added in v0.4.0
[2025-03-10] Blog post: "Sorting-Free GPU Kernels for LLM Sampling" explains the design of the sampling kernels in FlashInfer.

Getting Started

Installation

Quickstart:

pip install flashinfer-python

Package Options:

flashinfer-python: Core package that compiles/downloads kernels on first use
flashinfer-cubin: Pre-compiled kernel binaries for all supported GPU architectures
flashinfer-jit-cache: Pre-built kernel cache for specific CUDA versions

For faster initialization and offline usage, install the optional packages to have most kernels pre-compiled:

pip install flashinfer-python flashinfer-cubin
# JIT cache (replace cu129 with your CUDA version)
pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu129

To enable the Blackwell-optimized (SM100+) CuTe DSL kernels, install with the CUDA 13 extra:

pip install "flashinfer-python[cu13]"

Verify Installation

flashinfer show-config

Basic Usage

import torch
import flashinfer

# Single decode attention
q = torch.randn(32, 128, device="cuda", dtype=torch.float16)  # [num_qo_heads, head_dim]
k = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)  # [kv_len, num_kv_heads, head_dim]
v = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)

output = flashinfer.single_decode_with_kv_cache(q, k, v)

See the documentation for a comprehensive API reference and tutorials.
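
To clarify what single-request decode attention computes, here is a slow pure-Python reference for the shapes used above: one query vector per head attending over the whole KV cache. It is a sketch under stated assumptions (softmax scale of 1/sqrt(head_dim), no positional encoding, query heads mapped to KV heads in contiguous groups) and is not FlashInfer's implementation; `decode_attention_reference` is a hypothetical name.

```python
import math
from typing import List

Vector = List[float]


def decode_attention_reference(q: List[Vector],
                               k: List[List[Vector]],
                               v: List[List[Vector]]) -> List[Vector]:
    """q: [num_qo_heads][head_dim]; k, v: [kv_len][num_kv_heads][head_dim]."""
    num_qo_heads, head_dim = len(q), len(q[0])
    num_kv_heads = len(k[0])
    group = num_qo_heads // num_kv_heads  # query heads per KV head (1 for plain MHA)
    scale = 1.0 / math.sqrt(head_dim)     # assumed default softmax scale
    out = []
    for h in range(num_qo_heads):
        kv_h = h // group
        scores = [scale * sum(a * b for a, b in zip(q[h], k[t][kv_h]))
                  for t in range(len(k))]
        m = max(scores)                   # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        out.append([
            sum(w * v[t][kv_h][d] for t, w in enumerate(weights)) / denom
            for d in range(head_dim)
        ])
    return out


# A zero query gives uniform attention, so the output is the mean of the values.
q = [[0.0, 0.0]]
k = [[[1.0, 2.0]], [[3.0, 4.0]]]
v = [[[1.0, 0.0]], [[3.0, 2.0]]]
print(decode_attention_reference(q, k, v))  # [[2.0, 1.0]]
```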

Install from Source

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
python -m pip install -v .

For development, install in editable mode:

python -m pip install --no-build-isolation -e . -v

Note: When using --no-build-isolation, pip does not automatically install build dependencies. FlashInfer requires setuptools>=77. If you encounter an error like AttributeError: module 'setuptools.build_meta' has no attribute 'prepare_metadata_for_build_editable', upgrade pip and setuptools first:
python -m pip install --upgrade pip setuptools
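
Before an editable install, the setuptools requirement from the note above can be checked from Python. This is a minimal sketch using the standard library's `importlib.metadata`; the helper names are hypothetical and only the major version is compared.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Extract the leading major component from a version string such as '77.0.3'."""
    return int(version.split(".")[0])


def setuptools_is_new_enough(minimum_major: int = 77) -> bool:
    """True if setuptools is installed and its major version meets the minimum."""
    try:
        installed = metadata.version("setuptools")
    except metadata.PackageNotFoundError:
        return False
    return major_version(installed) >= minimum_major


if not setuptools_is_new_enough():
    print("Run: python -m pip install --upgrade pip setuptools")
```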

Build optional packages:

# flashinfer-cubin
cd flashinfer-cubin
python -m build --no-isolation --wheel
python -m pip install dist/*.whl

# flashinfer-jit-cache (customize for your target GPUs)
export FLASHINFER_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f"
cd flashinfer-jit-cache
python -m build --no-isolation --wheel
python -m pip install dist/*.whl

For more details, see the Install from Source documentation.

Nightly Builds

pip install -U --pre flashinfer-python --index-url https://flashinfer.ai/whl/nightly/ --no-deps
pip install flashinfer-python  # Install dependencies from PyPI
pip install -U --pre flashinfer-cubin --index-url https://flashinfer.ai/whl/nightly/
# JIT cache (replace cu129 with your CUDA version)
pip install -U --pre flashinfer-jit-cache --index-url https://flashinfer.ai/whl/nightly/cu129

CLI Tools

FlashInfer provides several CLI commands for configuration, module management, and development:

# Verify installation and view configuration
flashinfer show-config

# List and inspect modules
flashinfer list-modules
flashinfer module-status

# Manage artifacts and cache
flashinfer download-cubin
flashinfer clear-cache

# For developers: generate compile_commands.json for IDE integration
flashinfer export-compile-commands [output_path]

For complete documentation, see the CLI reference.
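
In scripts or CI jobs it can be convenient to call the CLI from Python and skip gracefully when it is not installed. The wrapper below is a sketch: `show_config` is a hypothetical helper, and nothing is assumed about the command's output format beyond it being text.

```python
import shutil
import subprocess
from typing import Optional


def show_config() -> Optional[str]:
    """Run `flashinfer show-config` and return its stdout, or None if the CLI is absent."""
    exe = shutil.which("flashinfer")
    if exe is None:
        return None
    result = subprocess.run([exe, "show-config"], capture_output=True, text=True, check=False)
    return result.stdout


config = show_config()
print("flashinfer CLI not found" if config is None else config)
```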

API Logging

FlashInfer provides comprehensive API logging for debugging. Enable it using environment variables:

# Enable logging (levels: 0=off (default), 1=basic, 3=detailed, 5=statistics)
export FLASHINFER_LOGLEVEL=3

# Set log destination (stdout (default), stderr, or file path)
export FLASHINFER_LOGDEST=stdout

For detailed information about logging levels, configuration, and advanced features, see Logging in our documentation.
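
The same variables can be set from Python, for example at the top of a test script. This sketch assumes the variables are read when `flashinfer` is imported, so they are set beforehand; that timing is an assumption, and `configure_flashinfer_logging` is a hypothetical helper that accepts only the levels listed above.

```python
import os

VALID_LEVELS = {0, 1, 3, 5}  # off, basic, detailed, statistics


def configure_flashinfer_logging(level: int, dest: str = "stdout") -> None:
    """Set the FlashInfer logging environment variables for the current process."""
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {sorted(VALID_LEVELS)}, got {level}")
    os.environ["FLASHINFER_LOGLEVEL"] = str(level)
    os.environ["FLASHINFER_LOGDEST"] = dest  # "stdout", "stderr", or a file path


configure_flashinfer_logging(3, "stderr")
# import flashinfer  # assumed to pick up the variables set above
```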

Custom Attention Variants

Users can define custom attention variants with additional parameters. For more details, refer to our JIT examples.

CUDA Support

Supported CUDA Versions: 12.6, 12.8, 13.0, 13.1

Note: FlashInfer strives to follow PyTorch's supported CUDA versions plus the latest CUDA release.

Adoption

FlashInfer powers inference in:

Acknowledgement

FlashInfer is inspired by FlashAttention, vLLM, stream-K, CUTLASS, and AITemplate.

Citation

If you find FlashInfer helpful in your project or research, please consider citing our paper:

@article{ye2025flashinfer,
    title = {FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving},
    author = {
      Ye, Zihao and
      Chen, Lequn and
      Lai, Ruihang and
      Lin, Wuwei and
      Zhang, Yineng and
      Wang, Stephanie and
      Chen, Tianqi and
      Kasikci, Baris and
      Grover, Vinod and
      Krishnamurthy, Arvind and
      Ceze, Luis
    },
    journal = {arXiv preprint arXiv:2501.01005},
    year = {2025},
    url = {https://arxiv.org/abs/2501.01005}
}

Release history

Version	Release date	Notes
0.6.8.post1	Apr 18, 2026
0.6.8	Apr 16, 2026
0.6.8rc1	Apr 14, 2026	pre-release (this version)
0.6.7.post3	Apr 6, 2026
0.6.7.post2	Apr 4, 2026
0.6.7.post1	Apr 3, 2026
0.6.7	Mar 25, 2026
0.6.6	Mar 11, 2026
0.6.5	Mar 4, 2026
0.6.4	Feb 19, 2026
0.6.3	Feb 6, 2026
0.6.2	Jan 23, 2026
0.6.1	Jan 14, 2026
0.6.0	Jan 8, 2026
0.6.0rc2	Dec 20, 2025	pre-release
0.6.0rc1	Dec 18, 2025	pre-release
0.5.3	Nov 20, 2025
0.5.2	Nov 7, 2025
0.5.1	Nov 4, 2025
0.5.0	Nov 2, 2025
0.5.0rc3	Nov 1, 2025	pre-release
0.5.0rc2	Oct 31, 2025	pre-release
0.5.0rc1	Oct 30, 2025	pre-release
0.4.1	Oct 14, 2025
0.4.0	Oct 9, 2025
0.4.0rc4	Oct 2, 2025	pre-release
0.4.0rc3	Sep 24, 2025	pre-release
0.4.0rc2	Sep 23, 2025	pre-release
0.4.0rc1	Sep 19, 2025	pre-release
0.4.0rc0	Sep 18, 2025	pre-release
0.3.1.post1	Sep 26, 2025
0.3.1	Sep 5, 2025
0.3.0.post1	Sep 26, 2025
0.3.0	Sep 1, 2025
0.3.0rc1	Aug 29, 2025	pre-release
0.2.14.post1	Aug 25, 2025
0.2.14	Aug 23, 2025
0.2.13	Aug 20, 2025
0.2.12	Aug 18, 2025
0.2.11.post3	Aug 14, 2025
0.2.11.post2	Aug 13, 2025
0.2.11.post1	Aug 11, 2025
0.2.11	Aug 10, 2025
0.2.10	Aug 5, 2025
0.2.9	Aug 5, 2025
0.2.9rc2	Jul 27, 2025	pre-release
0.2.9rc1	Jul 23, 2025	pre-release
0.2.8	Jul 21, 2025
0.2.8rc1	Jul 8, 2025	pre-release
0.2.7.post1	Jul 1, 2025
0.2.7	Jun 30, 2025
0.2.6.post1	Jun 7, 2025
0.2.6	Jun 6, 2025
0.2.5	Apr 4, 2025
0.2.4	Mar 30, 2025
0.2.3	Mar 11, 2025
0.2.2.post1	Feb 27, 2025
0.2.2	Feb 23, 2025
0.2.1.post2	Feb 19, 2025
0.2.1.post1	Feb 14, 2025
0.2.1	Feb 13, 2025
0.2.0.post2	Jan 31, 2025
0.2.0.post1	Jan 9, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

flashinfer_python-0.6.8rc1.tar.gz (6.7 MB)

Uploaded Apr 14, 2026 Source

Built Distribution

flashinfer_python-0.6.8rc1-py3-none-any.whl (9.4 MB)

Uploaded Apr 14, 2026 Python 3

File details

Details for the file flashinfer_python-0.6.8rc1.tar.gz.

File metadata

Download URL: flashinfer_python-0.6.8rc1.tar.gz
Upload date: Apr 14, 2026
Size: 6.7 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for flashinfer_python-0.6.8rc1.tar.gz
Algorithm	Hash digest
SHA256	`039d6883128665bf13b44c16d191a2e50d0147be6a34f5a08233e6f8ce0ed9f8`
MD5	`5020b2ece27cfb73d29e1244832a9c7c`
BLAKE2b-256	`68e167b0b5eb9f3ea23e05e7d454571ad7a186ede6a9c30fec55e51291bfa461`

File details

Details for the file flashinfer_python-0.6.8rc1-py3-none-any.whl.

File metadata

Download URL: flashinfer_python-0.6.8rc1-py3-none-any.whl
Upload date: Apr 14, 2026
Size: 9.4 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for flashinfer_python-0.6.8rc1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`eec196ad6084311fa9012b7df82c89b54b00b2267dc02138cec093d5718059d3`
MD5	`1052ddecf3409c1a39b61c0a4ba44c86`
BLAKE2b-256	`6a0ae8ae05fd59f800e74ec24fa6a58a04c6c0d9308917880c42f2b53cfe36bb`

flashinfer-python 0.6.8rc1
