A tile-level programming language for generating high-performance code.

These details have not been verified by PyPI

Project links

Homepage

Project description

Tile Language

Tile Language (tile-lang) is a concise domain-specific language designed to streamline the development of high-performance GPU/CPU kernels (e.g., GEMM, Dequant GEMM, FlashAttention, LinearAttention). By employing a Pythonic syntax with an underlying compiler infrastructure on top of TVM, tile-lang allows developers to focus on productivity without sacrificing the low-level optimizations necessary for state-of-the-art performance.

Latest News

04/14/2025 🚀: Added a high-performance FlashMLA implementation for AMD MI300X, achieving performance parity with the hand-optimized assembly kernels in Aiter! See example_mla_amd for details.
03/03/2025 🚀: Added high-performance MLA Decoding support in only 80 lines of Python code, achieving performance on par with FlashMLA on H100 (see example_mla_decode.py)! We also provide documentation explaining how TileLang achieves this.
02/15/2025 ✨: Added WebGPU Codegen support; see Pull Request #86!
02/12/2025 ✨: Excited to announce the release of v0.1.0!
02/10/2025 🚀: Added debug tools for TileLang: T.print for printing variables/buffers (docs) and a memory layout plotter (examples/plot_layout).
01/20/2025 ✨: We are excited to announce that tile-lang, a DSL for high-performance AI workloads, is now open source and available to the public!

Tested Devices

Although tile-lang aims to be portable across a range of devices, it has been specifically tested and validated on the following: for NVIDIA GPUs, the H100 (with Auto TMA/WGMMA support), A100, V100, RTX 4090, RTX 3090, and RTX A6000; for AMD GPUs, the MI250 (with Auto MatrixCore support) and the MI300X (with Async Copy support).

OP Implementation Examples

tile-lang provides the building blocks to implement a wide variety of operators. Some examples include:

Within the examples directory, you will also find additional complex kernels, such as convolutions and forward/backward passes for FlashAttention. More operators will be added continuously.

Benchmark Summary

TileLang achieves exceptional performance across a variety of computational patterns. Comprehensive benchmark scripts and settings are available at tilelang-benchmark. Below are selected results showcasing its capabilities:

MLA Decoding Performance on H100
Flash Attention Performance on H100
Matmul Performance on GPUs (RTX 4090, A100, H100, MI300X)
Dequantize Matmul Performance on A100

Installation

Method 1: Install with Pip

The quickest way to get started is to install the latest release from PyPI:

pip install tilelang

Alternatively, you can install directly from the GitHub repository:

pip install git+https://github.com/tile-ai/tilelang

Or install locally:

# install required system dependencies
sudo apt-get update
sudo apt-get install -y python3-setuptools gcc libtinfo-dev zlib1g-dev build-essential cmake libedit-dev libxml2-dev

pip install -e . -v # remove -e option if you don't want to install in editable mode, -v for verbose output
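
After any of the install commands above, it can be useful to confirm which version ended up in the active environment. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is an assumption made for illustration and is not part of tile-lang.

```python
from importlib import metadata


def installed_version(dist_name: str = "tilelang"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("tilelang is not installed in this environment")
    else:
        print(f"tilelang {version} is installed")
```

Querying the distribution metadata avoids importing the package itself, so the check also works on machines without a GPU toolchain.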

Method 2: Build from Source

We currently provide three ways to install tile-lang from source:

Method 3: Install with Nightly Version

For users who want access to the latest features and improvements before official releases, we provide nightly builds of tile-lang.

pip install tilelang -f https://tile-ai.github.io/whl/nightly/cu121/
# or pip install tilelang --find-links https://tile-ai.github.io/whl/nightly/cu121/

Note: Nightly builds contain the most recent code changes but may be less stable than official releases. They're ideal for testing new features or if you need a specific bugfix that hasn't been released yet.

Quick Start

In this section, you'll learn how to write and execute a straightforward GEMM (matrix multiplication) kernel using tile-lang, followed by techniques for layout optimizations, pipelining, and L2-cache–friendly swizzling.
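
As background for the kernel in this section, the tiling idea can be sketched in plain Python: the output matrix is split into tiles, each tile keeps its own accumulator, and the K dimension is walked in steps. This is an illustrative reference only, not tile-lang code and not fast; the names `ceildiv` and `tiled_matmul`, the list-of-lists matrices, and the small default tile sizes are assumptions chosen for readability.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer division rounded up, used to count tiles along a dimension."""
    return -(-a // b)


def tiled_matmul(A, B, block_M=2, block_N=2, block_K=2):
    """Multiply list-of-lists matrices tile by tile (reference only, not fast)."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # One (by, bx) pair per output tile, like one thread block per tile.
    for by in range(ceildiv(M, block_M)):
        for bx in range(ceildiv(N, block_N)):
            rows = range(by * block_M, min((by + 1) * block_M, M))
            cols = range(bx * block_N, min((bx + 1) * block_N, N))
            # Per-tile accumulator, cleared before the K loop.
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for ko in range(ceildiv(K, block_K)):
                ks = range(ko * block_K, min((ko + 1) * block_K, K))
                for i in rows:
                    for j in cols:
                        for k in ks:
                            acc[i, j] += A[i][k] * B[k][j]
            # Write the finished tile back to the output matrix.
            for (i, j), value in acc.items():
                C[i][j] = value
    return C
```

For a 1024 x 1024 x 1024 problem with 128 x 128 output tiles and a K step of 32, `ceildiv` gives an 8 x 8 grid of output tiles and 32 steps along K.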

GEMM Example with Annotations (Layout, L2 Cache Swizzling, and Pipelining, etc.)

Below is an example that demonstrates more advanced features: layout annotation, parallelized copy, and swizzle for improved L2 cache locality. This snippet shows how to adapt your kernel to maximize performance on complex hardware.

import tilelang
import tilelang.language as T
# `make_mma_swizzle_layout` is a Python-defined layout function
# designed specifically for MMA operations.
# It keeps the layout consistent with the NVIDIA CUTLASS library
# to avoid bank conflicts and maximize performance.
from tilelang.intrinsics import (
    make_mma_swizzle_layout as make_swizzle_layout,)

def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="float"):
    # add decorator @tilelang.jit if you want to return a torch function
    @T.prim_func
    def main(
        A: T.Tensor((M, K), dtype),
        B: T.Tensor((K, N), dtype),
        C: T.Tensor((M, N), dtype),
    ):
        # Initialize Kernel Context
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local  = T.alloc_fragment((block_M, block_N), accum_dtype)

            # Apply layout optimizations or define your own layout (Optional)
            # If not specified, we will deduce the layout automatically
            # T.annotate_layout({
            #     A_shared: make_swizzle_layout(A_shared),
            #     B_shared: make_swizzle_layout(B_shared),
            # })

            # Enable rasterization for better L2 cache locality (Optional)
            # T.use_swizzle(panel_size=10, enable=True)

            # Clear local accumulation
            T.clear(C_local)

            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                # Copy tile of A
                # This is syntactic sugar for a parallelized copy
                T.copy(A[by * block_M, ko * block_K], A_shared)

                # Demonstrate parallelized copy from global to shared for B
                for k, j in T.Parallel(block_K, block_N):
                    B_shared[k, j] = B[ko * block_K + k, bx * block_N + j]

                # Perform a tile-level GEMM on the shared buffers
                # Currently this dispatches to CuTe on NVIDIA GPUs and HIP on AMD GPUs
                T.gemm(A_shared, B_shared, C_local)

            # Copy result back to global memory
            T.copy(C_local, C[by * block_M, bx * block_N])

    return main


# 1. Define the kernel (matmul) with the desired dimensions
func = matmul(1024, 1024, 1024, 128, 128, 32)

# 2. Compile the kernel into a torch function
# out_idx specifies the index of the output buffer in the argument list
# if out_idx is specified, the tensor will be created during runtime
# target currently can be "cuda" or "hip" or "cpu".
jit_kernel = tilelang.compile(func, out_idx=[2], target="cuda")

# 3. Test the kernel in Python with PyTorch data
import torch

# Create random input tensors on the GPU
a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)


# Run the kernel through the JIT-compiled function
c = jit_kernel(a, b)

# Reference multiplication using PyTorch
ref_c = a @ b

# Validate correctness
torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)
print("Kernel output matches PyTorch reference.")

# 4. Retrieve and inspect the generated CUDA source (optional)
cuda_source = jit_kernel.get_kernel_source()
print("Generated CUDA kernel:\n", cuda_source)

# 5. Profile latency with the profiler
profiler = jit_kernel.get_profiler()

latency = profiler.do_bench()

print(f"Latency: {latency} ms")
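
A latency figure is easier to compare across problem sizes once it is converted to throughput. The sketch below assumes the profiler reports latency in milliseconds, as the print statement above suggests, and counts 2 * M * N * K floating-point operations for a GEMM; the helper name `gemm_tflops` is hypothetical and not part of tile-lang.

```python
def gemm_tflops(M: int, N: int, K: int, latency_ms: float) -> float:
    """Convert a GEMM latency in milliseconds to achieved TFLOPS."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    flops = 2 * M * N * K  # one multiply and one add per inner-product term
    return flops / (latency_ms * 1e-3) / 1e12
```

With the variables from the example, this could be called as `gemm_tflops(1024, 1024, 1024, latency)`.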

Dive Deep into TileLang Beyond GEMM

In addition to GEMM, we provide a variety of examples to showcase the versatility and power of TileLang, including:

Dequantize GEMM: Achieve high-performance dequantization through fine-grained control over per-thread operations. Many of these features are now default behaviors in BitBLAS, which uses magic layout transformations and intrinsics to accelerate dequantize GEMM.
FlashAttention: Enable cross-operator fusion with simple and intuitive syntax. An auto-tuning example is also provided.
LinearAttention: Examples include RetNet and Mamba implementations.
Convolution: Implementations of Convolution with IM2Col.
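
To make the Dequantize GEMM item more concrete, here is a minimal plain-Python sketch of the per-element work that dequantization involves: unpacking two 4-bit codes from each byte and mapping them back to floats. It is not TileLang or BitBLAS code. The low-nibble-first packing order, the zero point of 8, and the single per-tensor scale are all assumptions made for illustration; real kernels may use different formats and layouts.

```python
def unpack_int4(packed: bytes):
    """Split each byte into two unsigned 4-bit values, low nibble first."""
    values = []
    for byte in packed:
        values.append(byte & 0x0F)         # low 4 bits
        values.append((byte >> 4) & 0x0F)  # high 4 bits
    return values


def dequantize(packed: bytes, scale: float, zero_point: int = 8):
    """Map unsigned 4-bit codes to floats: (code - zero_point) * scale."""
    return [(code - zero_point) * scale for code in unpack_int4(packed)]
```

A dequantize GEMM applies this kind of unpacking to weight tiles just before the matrix multiply, so the packed weights are what travel through memory.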

Upcoming Features

Check our tilelang v0.2.0 release plan for upcoming features.

TileLang is now used in the BitBLAS and AttentionEngine projects.

Join the Discussion

Join our Discord community for discussions, support, and collaboration!

Acknowledgements

We would like to express our gratitude to the TVM community for their invaluable contributions. The initial version of this project was mainly developed by LeiWang1999, chengyupku and nox-410 with supervision from Prof. Zhi Yang at Peking University. Part of this work was carried out during an internship at Microsoft Research, where Dr. Lingxiao Ma, Dr. Yuqing Xia, Dr. Jilong Xue, and Dr. Fan Yang offered valuable advice and support. We deeply appreciate their mentorship and contributions.

Project details

Release history

Version	Release date	Notes
0.1.9	Apr 22, 2026
0.1.8	Feb 16, 2026
0.1.7.post3	Jan 18, 2026
0.1.7.post2	Dec 31, 2025
0.1.7.post1	Dec 24, 2025
0.1.7	Dec 7, 2025
0.1.6.post2	Oct 31, 2025
0.1.6.post1	Sep 21, 2025
0.1.6	Sep 19, 2025	yanked (reason: static link g++)
0.1.5	Jun 5, 2025
0.1.4	Apr 18, 2025	this version
0.1.3	Mar 23, 2025
0.1.2.post1	Mar 7, 2025
0.1.2	Mar 6, 2025
0.1.1	Feb 23, 2025
0.1.0	Feb 12, 2025
0.0.1	Jan 20, 2025
0.0.1.dev0	Jan 4, 2025	pre-release

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

tilelang-0.1.4-cp312-cp312-manylinux_2_27_x86_64.whl (68.0 MB)

Uploaded Apr 18, 2025, CPython 3.12, manylinux: glibc 2.27+ x86-64

tilelang-0.1.4-cp311-cp311-manylinux_2_27_x86_64.whl (68.0 MB)

Uploaded Apr 18, 2025, CPython 3.11, manylinux: glibc 2.27+ x86-64

tilelang-0.1.4-cp310-cp310-manylinux_2_27_x86_64.whl (68.0 MB)

Uploaded Apr 18, 2025, CPython 3.10, manylinux: glibc 2.27+ x86-64

tilelang-0.1.4-cp39-cp39-manylinux_2_27_x86_64.whl (68.0 MB)

Uploaded Apr 18, 2025, CPython 3.9, manylinux: glibc 2.27+ x86-64

tilelang-0.1.4-cp38-cp38-manylinux_2_27_x86_64.whl (68.0 MB)

Uploaded Apr 18, 2025, CPython 3.8, manylinux: glibc 2.27+ x86-64
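
The wheel file names listed here follow the standard dash-separated wheel naming scheme: distribution, version, an optional build tag, Python tag, ABI tag, and platform tag. The sketch below splits a name into those fields so the right file for an interpreter can be picked programmatically. The `WheelName` type and `parse_wheel_filename` function are hypothetical helpers written for illustration, and the parser does no validation beyond counting fields.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        name, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return WheelName(name, version, build, py, abi, plat)
```

For example, `parse_wheel_filename("tilelang-0.1.4-cp312-cp312-manylinux_2_27_x86_64.whl").python_tag` is `"cp312"`, which corresponds to CPython 3.12.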

File details

Details for the file tilelang-0.1.4-cp312-cp312-manylinux_2_27_x86_64.whl.

File metadata

Download URL: tilelang-0.1.4-cp312-cp312-manylinux_2_27_x86_64.whl
Upload date: Apr 18, 2025
Size: 68.0 MB
Tags: CPython 3.12, manylinux: glibc 2.27+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for tilelang-0.1.4-cp312-cp312-manylinux_2_27_x86_64.whl
Algorithm	Hash digest
SHA256	`793a7e85d7fdb656b39e262ac86915adf61d269f4168ffaeb2ae5ead8ab50d5c`
MD5	`40025bde44b5df0f4ee232289d8f0e7f`
BLAKE2b-256	`93f9b9648af677aa52f16953c50d21074cf56df3568e68413e8f7c4d24f7519b`

See more details on using hashes here.
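
The published digests can be used to check a downloaded wheel before installing it. The following is a minimal sketch using Python's standard `hashlib`; the function names `sha256_of_file` and `matches_published_hash` are assumptions made for illustration, and the expected digest has to be copied from the File hashes table for the specific file.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

Reading in chunks keeps memory use flat even for a 68 MB wheel. pip can perform a similar check itself when a requirements file pins hashes with `--hash`.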

File details

Details for the file tilelang-0.1.4-cp311-cp311-manylinux_2_27_x86_64.whl.

File metadata

Download URL: tilelang-0.1.4-cp311-cp311-manylinux_2_27_x86_64.whl
Upload date: Apr 18, 2025
Size: 68.0 MB
Tags: CPython 3.11, manylinux: glibc 2.27+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for tilelang-0.1.4-cp311-cp311-manylinux_2_27_x86_64.whl
Algorithm	Hash digest
SHA256	`9880ef36a93acf586f993cdb25668218fa989f0cdeca9c48cbbe973fa1e1ff97`
MD5	`27c499ae8061049b7789c199727e5cd6`
BLAKE2b-256	`6cc91e2ab9e0052171ec910d23346ce5d2dfeefdf9417b8c4938ab565a259855`

File details

Details for the file tilelang-0.1.4-cp310-cp310-manylinux_2_27_x86_64.whl.

File metadata

Download URL: tilelang-0.1.4-cp310-cp310-manylinux_2_27_x86_64.whl
Upload date: Apr 18, 2025
Size: 68.0 MB
Tags: CPython 3.10, manylinux: glibc 2.27+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for tilelang-0.1.4-cp310-cp310-manylinux_2_27_x86_64.whl
Algorithm	Hash digest
SHA256	`3b176d38da5be9f442d7743c05a792caf06ddada4669b7c1a9fdb309bf839685`
MD5	`47a4b3abf41575d510ff0ee5d39b121e`
BLAKE2b-256	`80317ff6f9c692ed284b3eb40d1cad5c93bc9bd933846f586de1293641613a43`

File details

Details for the file tilelang-0.1.4-cp39-cp39-manylinux_2_27_x86_64.whl.

File metadata

Download URL: tilelang-0.1.4-cp39-cp39-manylinux_2_27_x86_64.whl
Upload date: Apr 18, 2025
Size: 68.0 MB
Tags: CPython 3.9, manylinux: glibc 2.27+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for tilelang-0.1.4-cp39-cp39-manylinux_2_27_x86_64.whl
Algorithm	Hash digest
SHA256	`a1aa16d30db6f87fe901d02ef3ebd361e11e239e94a5a1b6ec104c282d7dae3d`
MD5	`1c1761dd0c8825243be5a28abea663a3`
BLAKE2b-256	`bf9f13e7298c07f891dc4ded2682f9051c659856e810c771fbb5902bc1611e85`

File details

Details for the file tilelang-0.1.4-cp38-cp38-manylinux_2_27_x86_64.whl.

File metadata

Download URL: tilelang-0.1.4-cp38-cp38-manylinux_2_27_x86_64.whl
Upload date: Apr 18, 2025
Size: 68.0 MB
Tags: CPython 3.8, manylinux: glibc 2.27+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for tilelang-0.1.4-cp38-cp38-manylinux_2_27_x86_64.whl
Algorithm	Hash digest
SHA256	`e3fa1ca304f50d90ba6c9027548f61a3e9c77f8c91a69a2c20de7c3ab0fd7f5e`
MD5	`a43514528a89e58a3a5fd0233dcf9d6d`
BLAKE2b-256	`df4d483a7762e7d7d0ccf1dc320e876b29cc702803b86e4650345afbd12357e9`

tilelang 0.1.4
