FlashInfer: Kernel Library for LLM Serving

FlashInfer

High-Performance GPU Kernels for Inference

FlashInfer is a library and kernel generator for inference that delivers state-of-the-art performance across diverse GPU architectures. It provides unified APIs for attention, GEMM, and MoE operations with multiple backend implementations including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.

Why FlashInfer?

  • State-of-the-art Performance: Optimized kernels for prefill, decode, and mixed batching scenarios
  • Multiple Backends: Automatically selects the best backend for your hardware and workload
  • Modern Architecture Support: Support for SM75 (Turing) and later (through Blackwell)
  • Low-Precision Compute: FP8 and FP4 quantization for attention, GEMM, and MoE operations
  • Production-Ready: CUDAGraph and torch.compile compatible for low-latency serving

Core Features

Attention Kernels

  • Paged and Ragged KV-Cache: Efficient memory management for dynamic batch serving
  • Decode, Prefill, and Append: Optimized kernels for all attention phases
  • MLA Attention: Native support for DeepSeek's Multi-head Latent Attention
  • Cascade Attention: Memory-efficient hierarchical KV-Cache for shared prefixes
  • Sparse Attention: Block-sparse and variable block-sparse patterns
  • POD-Attention: Fused prefill+decode for mixed batching
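
To make the paged KV-cache idea concrete, here is a minimal pure-Python sketch of how a logical token position can be mapped to a physical page. The array names (kv_indptr, kv_indices, kv_last_page_len) are modeled on a CSR-style page table and the helper locate_token is hypothetical; this illustrates the layout only and is not FlashInfer's implementation.

```python
from typing import List, Tuple


def locate_token(
    request: int,
    position: int,
    kv_indptr: List[int],
    kv_indices: List[int],
    kv_last_page_len: List[int],
    page_size: int,
) -> Tuple[int, int]:
    """Return (physical_page_id, slot_in_page) for one cached token.

    Hypothetical helper: pages of request r are kv_indices[kv_indptr[r]:kv_indptr[r + 1]],
    and only the last page may be partially filled.
    """
    first, last = kv_indptr[request], kv_indptr[request + 1]
    num_pages = last - first
    if num_pages == 0:
        raise IndexError("request has no cached tokens")
    kv_len = (num_pages - 1) * page_size + kv_last_page_len[request]
    if not 0 <= position < kv_len:
        raise IndexError(f"position {position} out of range for kv_len {kv_len}")
    logical_page, slot = divmod(position, page_size)
    return kv_indices[first + logical_page], slot


# Two requests share one page pool, page_size = 4.
# Request 0 holds 6 tokens in pages [7, 2]; request 1 holds 3 tokens in page [5].
kv_indptr = [0, 2, 3]
kv_indices = [7, 2, 5]
kv_last_page_len = [2, 3]

print(locate_token(0, 5, kv_indptr, kv_indices, kv_last_page_len, page_size=4))
print(locate_token(1, 2, kv_indptr, kv_indices, kv_last_page_len, page_size=4))
```

Because requests only hold page ids, a request can grow by appending one more id to its slice, without moving any cached keys or values.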

GEMM & Linear Operations

  • FP8 GEMM: Per-tensor and groupwise scaling
  • FP4 GEMM: NVFP4 and MXFP4 matrix multiplication for Blackwell GPUs
  • Grouped GEMM: Efficient batched matrix operations for LoRA and multi-expert routing
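
The difference between per-tensor and groupwise scaling can be sketched in a few lines of plain Python. This assumes the FP8 E4M3 format (largest finite value 448) and the convention that the real value is the quantized value times the scale; the function names are hypothetical and no FlashInfer API is used.

```python
from typing import List

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_tensor_scale(values: List[float]) -> float:
    """One scale for the whole tensor, chosen so that max |value| maps to FP8_E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def groupwise_scales(values: List[float], group_size: int) -> List[float]:
    """One scale per contiguous group of group_size values."""
    return [
        per_tensor_scale(values[i : i + group_size])
        for i in range(0, len(values), group_size)
    ]


weights = [448.0, 1.0, 2.0, -224.0]
print(per_tensor_scale(weights))     # a single scale driven by the largest outlier
print(groupwise_scales(weights, 2))  # each group of two values gets its own scale
```

With a single scale, one outlier determines the resolution available to every element; groupwise scales confine that effect to the outlier's own group.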

Mixture of Experts (MoE)

  • Fused MoE Kernels
  • Multiple Routing Methods: DeepSeek-V3, Llama-4, and standard top-k routing
  • Quantized MoE: FP8 and FP4 expert weights with block-wise scaling
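
As a reference point for the routing step, the following is a minimal sketch of standard top-k routing with a softmax over router logits and renormalized weights. The function name is hypothetical, the DeepSeek-V3 and Llama-4 variants are not shown, and a fused kernel would not be written this way.

```python
import math
from typing import List, Tuple


def topk_route(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_id, weight) for the k highest-scoring experts of one token."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    # Renormalize so the selected experts' weights sum to 1.
    return [(i, probs[i] / norm) for i in chosen]


for expert, weight in topk_route([2.0, 0.0, 1.0, -1.0], k=2):
    print(expert, round(weight, 4))
```

The token's output is then the weighted sum of the selected experts' outputs.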

Sampling & Decoding

  • Sorting-Free Sampling: Efficient Top-K, Top-P, and Min-P without sorting
  • Speculative Decoding: Chain speculative sampling support
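
One way to sample from the top-p set without sorting is rejection sampling against a rising pivot: draw a token, accept it if the probability mass strictly above it is below top_p, otherwise raise the pivot and redraw among the remaining tokens. The sketch below is a sequential pure-Python illustration under that assumption, with a hypothetical function name; it is not FlashInfer's kernel.

```python
import random
from typing import List, Optional


def top_p_sample(
    probs: List[float], top_p: float, rng: Optional[random.Random] = None
) -> int:
    """Sample a token id from the top-p nucleus of probs without sorting."""
    rng = rng or random.Random()
    pivot = 0.0
    while True:
        # Only tokens with probability strictly above the pivot stay eligible.
        eligible_mass = sum(p for p in probs if p > pivot)
        if eligible_mass <= 0.0:
            raise ValueError("probs must contain at least one positive entry")
        u = rng.random() * eligible_mass
        acc = 0.0
        sampled = -1
        for i, p in enumerate(probs):
            if p > pivot:
                acc += p
                sampled = i  # falls back to the last eligible token on rounding
                if acc > u:
                    break
        # A token is in the nucleus iff the mass strictly above it is below top_p.
        mass_above = sum(p for p in probs if p > probs[sampled])
        if mass_above < top_p:
            return sampled
        pivot = probs[sampled]  # reject and exclude this token and anything smaller


rng = random.Random(0)
draws = [top_p_sample([0.5, 0.3, 0.15, 0.05], top_p=0.7, rng=rng) for _ in range(10)]
print(draws)
```

Each rejection removes at least one token from the eligible set, so the loop terminates, and accepted draws follow the original probabilities renormalized over the nucleus.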

Communication

  • AllReduce: Custom implementations
  • Multi-Node NVLink: MNNVL support for multi-node inference
  • NVSHMEM Integration: For distributed memory operations

Other Operators

  • RoPE: LLaMA-style rotary position embeddings (including LLaMA 3.1)
  • Normalization: RMSNorm, LayerNorm, Gemma-style fused operations
  • Activations: SiLU, GELU with fused gating
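
For reference, RMSNorm on a single vector can be written as below. This is a plain-Python sketch of the math only; the function name and the eps default are assumptions, and the gemma_style flag reflects the Gemma convention of scaling by (1 + weight) rather than weight.

```python
import math
from typing import List


def rmsnorm(
    x: List[float], weight: List[float], eps: float = 1e-6, gemma_style: bool = False
) -> List[float]:
    """Normalize x by its root mean square, then apply a per-channel scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [
        v * inv_rms * ((1.0 + w) if gemma_style else w)
        for v, w in zip(x, weight)
    ]


print(rmsnorm([3.0, 4.0], [1.0, 1.0]))
print(rmsnorm([3.0, 4.0], [0.0, 0.0], gemma_style=True))  # same result as above
```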

GPU Support

Architecture | Compute Capability | Example GPUs
Turing | SM 7.5 | T4, RTX 20 series
Ampere | SM 8.0, 8.6 | A100, A10, RTX 30 series
Ada Lovelace | SM 8.9 | L4, L40, RTX 40 series
Hopper | SM 9.0 | H100, H200
Blackwell | SM 10.0, 10.3 | B200, B300
Blackwell | SM 12.0, 12.1 | RTX 50 series, DGX Spark, Jetson Thor
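
The table can be turned into a small lookup for checking a device before installing. The sketch below is a hypothetical helper that only mirrors the rows above; the (major, minor) pair could come from torch.cuda.get_device_capability() on a machine with PyTorch and a GPU, which is not called here.

```python
from typing import Dict, Optional, Tuple

# Mirrors the GPU support table; capabilities not listed there map to None.
ARCH_BY_CAPABILITY: Dict[Tuple[int, int], str] = {
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada Lovelace",
    (9, 0): "Hopper",
    (10, 0): "Blackwell",
    (10, 3): "Blackwell",
    (12, 0): "Blackwell",
    (12, 1): "Blackwell",
}


def architecture_for(major: int, minor: int) -> Optional[str]:
    """Return the architecture name for a listed compute capability, else None."""
    return ARCH_BY_CAPABILITY.get((major, minor))


print(architecture_for(9, 0))
print(architecture_for(7, 0))
```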

News

Latest: GitHub Release

Notable updates:

  • [2025-10-08] Blackwell support added in v0.4.0
  • [2025-03-10] Blog post: Sorting-Free GPU Kernels for LLM Sampling, which explains the design of the sampling kernels in FlashInfer.

Getting Started

Installation

Quickstart:

pip install flashinfer-python

Package Options:

  • flashinfer-python: Core package that compiles/downloads kernels on first use
  • flashinfer-cubin: Pre-compiled kernel binaries for all supported GPU architectures
  • flashinfer-jit-cache: Pre-built kernel cache for specific CUDA versions

For faster initialization and offline usage, install the optional packages to have most kernels pre-compiled:

pip install flashinfer-python flashinfer-cubin
# JIT cache (replace cu129 with your CUDA version)
pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu129

Verify Installation

flashinfer show-config
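
To see from Python which of the three packages are present, the standard library's importlib.metadata can be queried by distribution name. This is a small sketch that does not import FlashInfer itself; the helper name is hypothetical, and it reports only installed versions, not whether kernels load.

```python
from importlib import metadata
from typing import Dict, Optional

PACKAGES = ("flashinfer-python", "flashinfer-cubin", "flashinfer-jit-cache")


def installed_flashinfer_packages() -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in PACKAGES:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_flashinfer_packages().items():
    print(f"{name}: {version or 'not installed'}")
```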

Basic Usage

import torch
import flashinfer

# Single decode attention
q = torch.randn(32, 128, device="cuda", dtype=torch.float16)  # [num_qo_heads, head_dim]
k = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)  # [kv_len, num_kv_heads, head_dim]
v = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)

output = flashinfer.single_decode_with_kv_cache(q, k, v)
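
For intuition about what a single decode step computes, here is a pure-Python reference for one head: a softmax over the scaled query-key scores, followed by a weighted sum of the values. It assumes a 1/sqrt(head_dim) softmax scale and no positional encoding, uses a hypothetical function name, and is meant for checking small cases by hand rather than as a description of the kernel.

```python
import math
from typing import List


def decode_attention_one_head(
    q: List[float], k: List[List[float]], v: List[List[float]]
) -> List[float]:
    """q: [head_dim], k and v: [kv_len][head_dim]; returns [head_dim]."""
    scale = 1.0 / math.sqrt(len(q))  # assumed default softmax scale
    scores = [scale * sum(qi * ki for qi, ki in zip(q, row)) for row in k]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [0.0] * len(v[0])
    for w, row in zip(weights, v):
        for j, vj in enumerate(row):
            out[j] += (w / total) * vj
    return out


# A zero query scores every key equally, so the output is the mean of the values.
print(decode_attention_one_head([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]]))
```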

See documentation for comprehensive API reference and tutorials.

Install from Source

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
python -m pip install -v .

For development, install in editable mode:

python -m pip install --no-build-isolation -e . -v

Build optional packages:

# flashinfer-cubin
cd flashinfer-cubin
python -m build --no-isolation --wheel
python -m pip install dist/*.whl
# flashinfer-jit-cache (customize for your target GPUs)
export FLASHINFER_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f"
cd flashinfer-jit-cache
python -m build --no-isolation --wheel
python -m pip install dist/*.whl

For more details, see the Install from Source documentation.

Nightly Builds

pip install -U --pre flashinfer-python --index-url https://flashinfer.ai/whl/nightly/ --no-deps
pip install flashinfer-python  # Install dependencies from PyPI
pip install -U --pre flashinfer-cubin --index-url https://flashinfer.ai/whl/nightly/
# JIT cache (replace cu129 with your CUDA version)
pip install -U --pre flashinfer-jit-cache --index-url https://flashinfer.ai/whl/nightly/cu129

CLI Tools

FlashInfer provides several CLI commands for configuration, module management, and development:

# Verify installation and view configuration
flashinfer show-config

# List and inspect modules
flashinfer list-modules
flashinfer module-status

# Manage artifacts and cache
flashinfer download-cubin
flashinfer clear-cache

# For developers: generate compile_commands.json for IDE integration
flashinfer export-compile-commands [output_path]

For complete documentation, see the CLI reference.

API Logging

FlashInfer provides comprehensive API logging for debugging. Enable it using environment variables:

# Enable logging (levels: 0=off (default), 1=basic, 3=detailed, 5=statistics)
export FLASHINFER_LOGLEVEL=3

# Set log destination (stdout (default), stderr, or file path)
export FLASHINFER_LOGDEST=stdout

For detailed information about logging levels, configuration, and advanced features, see Logging in our documentation.
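
The same variables can be set from Python. The sketch below assumes they should be in place before flashinfer is imported, and it conservatively accepts only the levels listed above; the helper name is hypothetical and the import is left commented out so the snippet runs without FlashInfer installed.

```python
import os

# Levels named in this README; other values may exist, this check is deliberately strict.
LISTED_LEVELS = {0, 1, 3, 5}


def configure_flashinfer_logging(level: int = 3, dest: str = "stdout") -> None:
    """Set FLASHINFER_LOGLEVEL and FLASHINFER_LOGDEST for the current process."""
    if level not in LISTED_LEVELS:
        raise ValueError(f"level must be one of {sorted(LISTED_LEVELS)}, got {level}")
    os.environ["FLASHINFER_LOGLEVEL"] = str(level)
    os.environ["FLASHINFER_LOGDEST"] = dest  # "stdout", "stderr", or a file path


configure_flashinfer_logging(level=3, dest="stderr")
# import flashinfer  # assumption: import after the variables are set
print(os.environ["FLASHINFER_LOGLEVEL"], os.environ["FLASHINFER_LOGDEST"])
```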

Custom Attention Variants

Users can define custom attention variants that take additional parameters. For more details, refer to our JIT examples.

CUDA Support

Supported CUDA Versions: 12.6, 12.8, 13.0, 13.1

Note: FlashInfer strives to follow PyTorch's supported CUDA versions plus the latest CUDA release.

Adoption

FlashInfer powers inference in:

Acknowledgement

FlashInfer is inspired by FlashAttention, vLLM, stream-K, CUTLASS, and AITemplate.

Citation

If you find FlashInfer helpful in your project or research, please consider citing our paper:

@article{ye2025flashinfer,
    title = {FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving},
    author = {
      Ye, Zihao and
      Chen, Lequn and
      Lai, Ruihang and
      Lin, Wuwei and
      Zhang, Yineng and
      Wang, Stephanie and
      Chen, Tianqi and
      Kasikci, Baris and
      Grover, Vinod and
      Krishnamurthy, Arvind and
      Ceze, Luis
    },
    journal = {arXiv preprint arXiv:2501.01005},
    year = {2025},
    url = {https://arxiv.org/abs/2501.01005}
}

This version

0.6.3
