A tile-level programming language for generating high-performance code.

Project description

Tile Language ❤️ PaddlePaddle

Note: This repo is a fork of the original tilelang project, with modifications to improve compatibility and integration with PaddlePaddle.

Installation

pip install tilelang-paddle

Usage

import paddle
paddle.enable_compat(scope={"tilelang"})  # Enable torch compat before importing tilelang
import tilelang
# use tilelang

The original README.md content is as follows:


Tile Language (tile-lang) is a concise domain-specific language designed to streamline the development of high-performance GPU/CPU kernels (e.g., GEMM, Dequant GEMM, FlashAttention, LinearAttention). By employing a Pythonic syntax with an underlying compiler infrastructure on top of TVM, tile-lang allows developers to focus on productivity without sacrificing the low-level optimizations necessary for state-of-the-art performance.

Latest News

  • 02/02/2026 🧩: Check out TileLang Puzzles, a fun and interactive way to learn TileLang programming with 10 progressively harder puzzles!
  • 12/18/2025 🚀: Added CuTeDSL backend support, enabling compilation to NVIDIA CUTLASS CuTe DSL! Join us in building and optimizing this exciting new backend: Issue #1454.
  • 12/17/2025 🔬: Integrated Z3 theorem prover into TVM Arith Analyzer, bringing SMT-based symbolic reasoning for enhanced optimizations and automatic correctness verification!
  • 10/31/2025 🔧: Migrated to apache-tvm-ffi, significantly reducing CPU overhead!
  • 10/30/2025 📦: We have released v0.1.6.post2, which is the last version compatible with Python 3.8.
  • 10/07/2025 🍎: Added Apple Metal Device support, check out Pull Request #799 for details.
  • 09/29/2025 🎉: Thrilled to announce that AscendC and Ascend NPU IR backends targeting Huawei Ascend chips are now supported! Check out the preview here: 🔗 link. This includes implementations across two branches: ascendc_pto and npuir. Feel free to explore and share your feedback!
  • 07/04/2025 🚀: Introduced T.gemm_sp for 2:4 sparse tensor core support, check out Pull Request #526 for details.
  • 06/05/2025 ✨: Added NVRTC Backend to significantly reduce compilation time for cute templates!
  • 04/14/2025 🚀: Added high-performance FlashMLA implementation for AMD MI300X, achieving performance parity with hand-optimized assembly kernels of Aiter! See example_mla_amd for details.
  • 03/03/2025 🚀: Added high-performance MLA Decoding support using only 80 lines of Python code, achieving performance on par with FlashMLA on H100 (see example_mla_decode.py)! We also provide documentation explaining how TileLang achieves this.
  • 02/15/2025 ✨: Added WebGPU Codegen support, see Pull Request #86!
  • 02/12/2025 ✨: Excited to announce the release of v0.1.0!
  • 02/10/2025 🚀: Added debug tools for TileLang—T.print for printing variables/buffers (docs) and a memory layout plotter (examples/plot_layout).
  • 01/20/2025 ✨: We are excited to announce that tile-lang, a DSL for high-performance AI workloads, is now open source and available to the public!

Tested Devices

Although tile-lang aims to be portable across a range of devices, it has been specifically tested and validated on the following: for NVIDIA GPUs, the H100 (with Auto TMA/WGMMA support), A100, V100, RTX 4090, RTX 3090, and RTX A6000; for AMD GPUs, the MI250 (with Auto MatrixCore support) and the MI300X (with Async Copy support).

OP Implementation Examples

tile-lang provides the building blocks to implement a wide variety of operators. Some examples include:

Within the examples directory, you will also find more complex kernels, such as convolutions and forward/backward passes for FlashAttention. More operators will be added over time.

Benchmark Summary

TileLang achieves exceptional performance across a variety of computational patterns. Comprehensive benchmark scripts and settings are available at tilelang-benchmark. Below are selected results showcasing its capabilities:

  • MLA Decoding Performance on H100

    [Figure: MLA decode performance, batch size 64, on H100]
    [Figure: MLA decode performance, batch size 128, on H100]
  • Flash Attention Performance on H100

    [Figure: operator performance on H100]
  • Matmul Performance on GPUs (RTX 4090, A100, H100, MI300X)

    [Figure: GEMM FP16 performance on GPUs]
  • Dequantize Matmul Performance on A100

    [Figure: dequantize GEMV performance on A100]

Installation

Method 1: Install with Pip

The quickest way to get started is to install the latest release from PyPI:

pip install tilelang

Alternatively, you can install directly from the GitHub repository:

pip install git+https://github.com/tile-ai/tilelang

Or install locally:

# install required system dependencies
sudo apt-get update
sudo apt-get install -y python3-setuptools gcc libtinfo-dev zlib1g-dev build-essential cmake libedit-dev libxml2-dev

pip install -e . -v # remove -e option if you don't want to install in editable mode, -v for verbose output

Method 2: Build from Source

We currently provide three ways to install tile-lang from source:

Method 3: Install with Nightly Version

For users who want access to the latest features and improvements before official releases, we provide nightly builds of tile-lang.

pip install tilelang -f https://tile-ai.github.io/whl/nightly
# or pip install tilelang --find-links https://tile-ai.github.io/whl/nightly

Note: Nightly builds contain the most recent code changes but may be less stable than official releases. They're ideal for testing new features or if you need a specific bugfix that hasn't been released yet.

Quick Start

In this section, you'll learn how to write and execute a straightforward GEMM (matrix multiplication) kernel using tile-lang, followed by techniques for layout optimizations, pipelining, and L2-cache–friendly swizzling.
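
To make the idea of L2-cache-friendly swizzling concrete, here is a minimal pure-Python sketch of one common block rasterization scheme. It is an illustration only: the function name `rasterize` and the exact traversal order are assumptions, not necessarily what tile-lang's `T.use_swizzle` implements. Consecutive block ids stay inside a panel of `panel_size` block rows, so blocks scheduled close together in time tend to reuse the same input tiles.

```python
def rasterize(block_id, grid_m, grid_n, panel_size):
    """Map a linear block id to (row, col), filling one panel of
    `panel_size` block rows before moving on to the next panel."""
    blocks_per_panel = panel_size * grid_n
    panel = block_id // blocks_per_panel
    first_row = panel * panel_size
    # The last panel may have fewer than panel_size rows.
    rows_in_panel = min(panel_size, grid_m - first_row)
    local = block_id % blocks_per_panel
    # Within a panel, walk down the rows first, then step to the next column.
    row = first_row + local % rows_in_panel
    col = local // rows_in_panel
    return row, col


# A 4 x 3 grid of blocks visited in panels of 2 rows.
order = [rasterize(i, grid_m=4, grid_n=3, panel_size=2) for i in range(12)]
print(order)
```

In plain row-major order, twelve consecutive blocks sweep every column for each row in turn. With panels, any short run of consecutive blocks touches at most `panel_size` block rows and a few adjacent columns, which is the locality the swizzle option is meant to exploit.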

GEMM Example with Annotations (Layout, L2 Cache Swizzling, and Pipelining, etc.)

Below is an example that demonstrates more advanced features: layout annotation, parallelized copy, and swizzle for improved L2 cache locality. This snippet shows how to adapt your kernel to maximize performance on complex hardware.

import torch
import tilelang
import tilelang.language as T

# @tilelang.jit(target="cuda")
# target currently can be "cuda" or "hip" or "cpu".
# if not specified, it will be inferred from the input tensors during compile time
@tilelang.jit
def matmul_relu(
    A, B,
    block_M: int = 64,
    block_N: int = 64,
    block_K: int = 64,
    dtype: T.dtype = T.float16,
    accum_dtype: T.dtype = T.float32,
):
    # declare compile-time shape constants
    M, N, K = T.const('M, N, K')

    # annotate input tensor shape
    A: T.Tensor[[M, K], dtype]
    B: T.Tensor[[K, N], dtype]

    # allocate output tensor
    C = T.empty([M, N], dtype)

    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_shared = T.alloc_shared((block_M, block_K), dtype)
        B_shared = T.alloc_shared((block_K, block_N), dtype)
        C_local = T.alloc_fragment((block_M, block_N), accum_dtype)

        # Enable rasterization for better L2 cache locality (Optional)
        # T.use_swizzle(panel_size=10, enable=True)

        # Clear local accumulation
        T.clear(C_local)

        for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
            # Copy tile of A
            # This is syntactic sugar for a parallelized copy
            T.copy(A[by * block_M, ko * block_K], A_shared)

            # Copy tile of B
            T.copy(B[ko * block_K, bx * block_N], B_shared)

            # Perform a tile-level GEMM on the shared buffers
            # Currently this dispatches to CuTe on NVIDIA GPUs and HIP on AMD GPUs
            T.gemm(A_shared, B_shared, C_local)

        # relu
        for i, j in T.Parallel(block_M, block_N):
            C_local[i, j] = T.max(C_local[i, j], 0)

        # Copy result back to global memory
        T.copy(C_local, C[by * block_M, bx * block_N])

    # You can write multiple CUDA kernels in one function; they execute sequentially
    # with T.Kernel(...) as ...

    # Return the tensor, you can also return multiple tensors
    return C


M, N, K = 1024, 1024, 1024

a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)
c_ref = torch.relu(a @ b)

# Call the kernel
c = matmul_relu(a, b)
torch.testing.assert_close(c, c_ref, rtol=1e-2, atol=1e-2)

# Call the kernel with overwritten compilation constants
c = matmul_relu(a, b, block_M=128, block_N=128, block_K=64)
torch.testing.assert_close(c, c_ref, rtol=1e-2, atol=1e-2)

# Retrieve the compiled kernel
kernel = matmul_relu.compile(a, b) # use torch.Tensor
kernel = matmul_relu.compile(      # use T.Tensor as placeholder
  T.Tensor((M, K), T.float16),
  T.Tensor((K, N), T.float16)
)
kernel = matmul_relu.compile(      # directly specify the shape constants
  M=M, N=N, K=K,
  block_M=128, block_N=128, block_K=64
)
print(kernel.get_kernel_source())
c = kernel(a, b)

# Profile latency with kernel
profiler = kernel.get_profiler(tensor_supply_type=tilelang.TensorSupplyType.Normal)
latency = profiler.do_bench()
print(f"Latency: {latency} ms")
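
For readers who want to check what the kernel computes without a GPU, the loop structure above can be mirrored in plain Python. This is a reference sketch on nested lists, not tile-lang code: `ceildiv` and `matmul_relu_reference` are hypothetical helpers written for this illustration, and the K loop runs sequentially with no overlap of copy and compute.

```python
def ceildiv(a, b):
    """Integer ceiling division, the role T.ceildiv plays in the kernel."""
    return -(-a // b)


def matmul_relu_reference(A, B, block_M=2, block_N=2, block_K=2):
    """Tiled C = relu(A @ B) on nested lists, mirroring the kernel's loops."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]

    # One (bx, by) pair per output tile, like the T.Kernel grid.
    for by in range(ceildiv(M, block_M)):
        for bx in range(ceildiv(N, block_N)):
            # Per-tile accumulator, standing in for C_local after T.clear.
            acc = [[0.0] * block_N for _ in range(block_M)]

            # Walk the K dimension one tile at a time.
            for ko in range(ceildiv(K, block_K)):
                for i in range(block_M):
                    for j in range(block_N):
                        row, col = by * block_M + i, bx * block_N + j
                        if row >= M or col >= N:
                            continue  # tile hangs over the matrix edge
                        for kk in range(block_K):
                            k = ko * block_K + kk
                            if k < K:
                                acc[i][j] += A[row][k] * B[k][col]

            # ReLU on the accumulator, then write the tile back.
            for i in range(block_M):
                for j in range(block_N):
                    row, col = by * block_M + i, bx * block_N + j
                    if row < M and col < N:
                        C[row][col] = max(acc[i][j], 0.0)
    return C


A = [[1.0, -2.0, 3.0], [0.5, 4.0, -1.0], [2.0, 0.0, 1.0]]
B = [[1.0, 0.0], [-1.0, 2.0], [0.5, -3.0]]
C = matmul_relu_reference(A, B)
print(C)  # [[4.5, 0.0], [0.0, 11.0], [2.5, 0.0]]
```

The explicit bounds checks are there because the 3 x 3 and 3 x 2 inputs are not multiples of the tile sizes; how the real kernel handles ragged edges is not covered by this sketch.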

Dive Deep into TileLang Beyond GEMM

In addition to GEMM, we provide a variety of examples to showcase the versatility and power of TileLang, including:

  • Dequantize GEMM: Achieve high-performance dequantization through fine-grained control over per-thread operations. Many of these features are now default behaviors in BitBLAS, which uses magic layout transformations and intrinsics to accelerate dequantize GEMM.
  • FlashAttention: Enable cross-operator fusion with simple and intuitive syntax. An auto-tuning example is also provided.
  • LinearAttention: Examples include RetNet and Mamba implementations.
  • Convolution: Implementations of Convolution with IM2Col.
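
As background for the convolution example, the IM2Col idea can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions (single channel, stride 1, no padding, cross-correlation as used in deep learning frameworks); the helper names `im2col` and `conv2d_im2col` are hypothetical and unrelated to the tile-lang example code.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw window of a 2-D image into one row."""
    H, W = len(image), len(image[0])
    rows = []
    for y in range(H - kh + 1):
        for x in range(W - kw + 1):
            rows.append([image[y + dy][x + dx]
                         for dy in range(kh) for dx in range(kw)])
    return rows


def conv2d_im2col(image, kernel):
    """Convolution expressed as (unfolded windows) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    out_w = len(image[0]) - kw + 1
    # Each output pixel is a dot product of one unfolded window with the kernel.
    flat_out = [sum(a * b for a, b in zip(patch, flat_kernel))
                for patch in im2col(image, kh, kw)]
    # Fold the flat result back into a 2-D output.
    return [flat_out[i:i + out_w] for i in range(0, len(flat_out), out_w)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
result = conv2d_im2col(image, kernel)
print(result)  # [[-4, -4], [-4, -4]]
```

Once the windows are unfolded, the convolution becomes a matrix product, which is why it can reuse the same tiled GEMM machinery shown in the Quick Start.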

Upcoming Features

Check our tilelang v0.2.0 release plan for upcoming features.


TileLang is now used in the BitBLAS and AttentionEngine projects.

Join the Discussion

You are welcome to join our Discord community for discussions, support, and collaboration!

Acknowledgments

We would like to express our gratitude to the TVM community for their invaluable contributions. The initial version of this project was mainly developed by LeiWang1999, chengyupku and nox-410 with supervision from Prof. Zhi Yang at Peking University. Part of this work was carried out during an internship at Microsoft Research, where Dr. Lingxiao Ma, Dr. Yuqing Xia, Dr. Jilong Xue, and Dr. Fan Yang offered valuable advice and support. We deeply appreciate their mentorship and contributions.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tilelang_paddle-0.1.11.tar.gz (93.3 MB)

Uploaded: Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

tilelang_paddle-0.1.11-cp38-abi3-win_amd64.whl (33.7 MB)

Uploaded: CPython 3.8+, Windows x86-64

tilelang_paddle-0.1.11-cp38-abi3-manylinux_2_34_aarch64.whl (45.9 MB)

Uploaded: CPython 3.8+, manylinux: glibc 2.34+ ARM64

tilelang_paddle-0.1.11-cp38-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (50.1 MB)

Uploaded: CPython 3.8+, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

tilelang_paddle-0.1.11-cp38-abi3-macosx_11_0_arm64.whl (38.3 MB)

Uploaded: CPython 3.8+, macOS 11.0+ ARM64

File details

Details for the file tilelang_paddle-0.1.11.tar.gz.

File metadata

  • Download URL: tilelang_paddle-0.1.11.tar.gz
  • Upload date:
  • Size: 93.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tilelang_paddle-0.1.11.tar.gz
Algorithm Hash digest
SHA256 a7247cabb3dd02d13468aa1674d86a25c620b5baa3d53476c34dbf2f72995770
MD5 c354c6d01a39221f9675b7eb7f0e441f
BLAKE2b-256 df48bcd76819bd0fd66589cfce365c18ab361a76b2c5b1b09ff1b6c42234cb3b

See more details on using hashes here.
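
A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib` module. The sketch below is one way to do it; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the file path is whatever location you downloaded the archive to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Chunked reads keep memory use flat even for large archives.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution listed above, the call would look like `matches_published_hash("tilelang_paddle-0.1.11.tar.gz", "a7247cabb3dd02d13468aa1674d86a25c620b5baa3d53476c34dbf2f72995770")`, assuming the file sits in the current directory.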

Provenance

The following attestation bundles were made for tilelang_paddle-0.1.11.tar.gz:

Publisher: dist-paddle.yml on PFCCLab/tilelang-paddle

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tilelang_paddle-0.1.11-cp38-abi3-win_amd64.whl.

File hashes

Hashes for tilelang_paddle-0.1.11-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 2edc3ba46e2c538ffd4227e7a752b5bcaef5ed91f74c54791910bd96056c7740
MD5 6d53641a3b7699daeb9464cf4f6dab0b
BLAKE2b-256 3b58989021b2c0922002ce0cde5c19160b7080088ceedae80f51028e3d386441

Provenance

The following attestation bundles were made for tilelang_paddle-0.1.11-cp38-abi3-win_amd64.whl:

Publisher: dist-paddle.yml on PFCCLab/tilelang-paddle

File details

Details for the file tilelang_paddle-0.1.11-cp38-abi3-manylinux_2_34_aarch64.whl.

File hashes

Hashes for tilelang_paddle-0.1.11-cp38-abi3-manylinux_2_34_aarch64.whl
Algorithm Hash digest
SHA256 363fb90c31f055a018801c0813c5e04890532c1d6bc6ca446ca8bec1cef317f4
MD5 2f68c2f0b3a9061065e78fbd44ba998b
BLAKE2b-256 14b386ae41566c7e1c0c6bd4a2ba580de76c6d1a9de3070c6df455b351127f10

Provenance

The following attestation bundles were made for tilelang_paddle-0.1.11-cp38-abi3-manylinux_2_34_aarch64.whl:

Publisher: dist-paddle.yml on PFCCLab/tilelang-paddle

File details

Details for the file tilelang_paddle-0.1.11-cp38-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for tilelang_paddle-0.1.11-cp38-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 3d49c78d0d9df5ab176dd7f20b3b91b7931c3033404d0e3ebe2346673b855051
MD5 6a5b27965aa2151eebf78d2b9acd207b
BLAKE2b-256 6c37d1423a7ab7846f71d7b8b5222e7befda4aa3e31b1015b3b1aee57a828e3a

Provenance

The following attestation bundles were made for tilelang_paddle-0.1.11-cp38-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl:

Publisher: dist-paddle.yml on PFCCLab/tilelang-paddle

File details

Details for the file tilelang_paddle-0.1.11-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for tilelang_paddle-0.1.11-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8787fb7d50a5074aa23be0b20bc5f4a734ad316a716a046c8c79abef964841fd
MD5 038e7c98231942b9cc71aedd34dfed63
BLAKE2b-256 362501772e1fbce46d19f750b2b966aecb7916c41d8b473e7620c68176f985d7

Provenance

The following attestation bundles were made for tilelang_paddle-0.1.11-cp38-abi3-macosx_11_0_arm64.whl:

Publisher: dist-paddle.yml on PFCCLab/tilelang-paddle
