TileOPs kernels for efficient inference

TileOPs (TOP)

TileOPs (TOP) is a high-performance collection of machine learning operators built on top of TileLang. It offers efficient, modular, and composable implementations optimized for AI workloads.

Note: TileOPs is still under rapid development.


Figure: DeepSeek-V3.2-Exp DeepSeek Sparse Attention (DSA) performance on H800 SXM

📦 Installation

Requirements

  • Python 3.8+
  • PyTorch >= 2.1
  • GLIBCXX_3.4.32
  • TileLang
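
The Python and PyTorch requirements above can be checked before installing. The following is a minimal sketch, not part of TileOPs: the helper names are hypothetical, and the distribution names `torch` and `tilelang` are assumed to be the names the packages are installed under. It does not check GLIBCXX_3.4.32, which depends on the system's libstdc++.

```python
import re
import sys
from importlib import metadata


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '2.1.0+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def installed_version(package):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_requirements():
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python 3.8+ is required")
    torch_version = installed_version("torch")  # assumed distribution name
    if torch_version is None:
        problems.append("PyTorch is not installed")
    elif parse_major_minor(torch_version) < (2, 1):
        problems.append(f"PyTorch >= 2.1 is required, found {torch_version}")
    if installed_version("tilelang") is None:  # assumed distribution name
        problems.append("TileLang is not installed")
    return problems


for problem in check_requirements():
    print("requirement not met:", problem)
```

An empty result only means the version checks passed; it says nothing about CUDA availability or the C++ runtime.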

Method 1: Install with Pip

pip install tileops

Method 2: Install from source (editable mode for development)

git clone https://github.com/tile-ai/TileOPs
cd TileOPs
pip install -e '.[dev]' -v # remove -e option if you don't want to install in editable mode, -v for verbose output

🚀 Quick Usage

Sparse MLA

import torch
from top import SparseMLAKernel

batch_size = 1
seq_len = 1024
seq_len_kv = 2048
q_start_index_s = 1024
n_heads = 128
head_dim = 512
tail_dim = 64
topk = 2048
kv_stride = 1
kv_group = 1
sm_scale = None

sparse_mla = SparseMLAKernel(
    batch=batch_size,
    seq_len=seq_len,
    seq_len_kv=seq_len_kv,
    q_start_index_s=q_start_index_s,
    heads=n_heads,
    dim=head_dim,
    tail_dim=tail_dim,
    topk=topk,
    kv_stride=kv_stride,
    kv_group=kv_group,
    sm_scale=sm_scale,
    is_casual=True,
    dtype=torch.bfloat16,
    device='cuda',
)

# Evaluate the Sparse MLA kernel performance
sparse_mla.check()
latency = sparse_mla.profile()
print(f"Latency: {latency:.4f} ms")
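# FLOPs: QK^T over (head_dim + tail_dim) plus PV over head_dim, 2 FLOPs per multiply-add
flops = batch_size * seq_len * (head_dim + tail_dim + head_dim) * topk * 2 * n_heads
print(f"fwd tflops = {flops / (latency * 1e-3) / 1e12:.2f}")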
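
The throughput figure printed above comes from a FLOP count that can be worked out by hand. The sketch below reproduces that arithmetic in plain Python. It assumes the usual accounting, which the source formula is consistent with: each query attends to `topk` keys per head, the score computation spans `dim + tail_dim` channels, the value product spans `dim` channels, and each multiply-add counts as 2 FLOPs. The function names and the 1.0 ms latency are hypothetical placeholders, not measurements.

```python
def sparse_mla_fwd_flops(batch, seq_len, heads, dim, tail_dim, topk):
    """Forward FLOPs for sparse MLA under the accounting described above."""
    qk_flops = 2 * (dim + tail_dim)  # scores: dot product over dim + tail_dim
    pv_flops = 2 * dim               # output: weighted sum over dim
    return batch * seq_len * heads * topk * (qk_flops + pv_flops)


def to_tflops(flops, latency_ms):
    """Convert a FLOP count and a latency in milliseconds to TFLOPS."""
    return flops / (latency_ms * 1e-3) / 1e12


# Same sizes as the Sparse MLA example.
flops = sparse_mla_fwd_flops(
    batch=1, seq_len=1024, heads=128, dim=512, tail_dim=64, topk=2048
)
print(flops)  # 584115552256

# Hypothetical latency, only to show the unit conversion.
print(round(to_tflops(flops, latency_ms=1.0), 1))  # 584.1
```

Substitute the value returned by `sparse_mla.profile()` for the placeholder latency to get the real figure.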

MLA

import torch
from top import MLAKernel

device = "cuda"
dtype = torch.float16

batch = 128
heads = 64
kv_heads = 1
kv_ctx = 8192
dim = 512
pe_dim = 64

# Query input: [batch, heads, dim]
q = torch.randn(batch, heads, dim, device=device, dtype=dtype)

# Query positional encoding: [batch, heads, pe_dim]
q_pe = torch.randn(batch, heads, pe_dim, device=device, dtype=dtype)

# KV cache input: [batch, kv_ctx, kv_heads, dim]
kv = torch.randn(batch, kv_ctx, kv_heads, dim, device=device, dtype=dtype)

# KV positional encoding: [batch, kv_ctx, kv_heads, pe_dim]
k_pe = torch.randn(batch, kv_ctx, kv_heads, pe_dim, device=device, dtype=dtype)

# Use MLA kernel
block_N = 64
block_H = 64
num_split = 1

mla = MLAKernel(batch, heads, kv_heads, kv_ctx, dim, pe_dim, block_N, block_H, num_split)

out = mla(q, q_pe, kv, k_pe)
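
The input shapes in the MLA example imply a sizeable GPU allocation, dominated by the KV cache. The sketch below works out the size of each input tensor from the shapes and the float16 dtype used above (2 bytes per element). The `tensor_bytes` helper is hypothetical, and the totals cover only the four inputs, not the output or any workspace the kernel may allocate.

```python
import math

BYTES_PER_FLOAT16 = 2


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT16):
    """Size in bytes of a dense tensor with the given shape."""
    return math.prod(shape) * bytes_per_element


# Same sizes as the MLA example.
batch, heads, kv_heads, kv_ctx, dim, pe_dim = 128, 64, 1, 8192, 512, 64

shapes = {
    "q": (batch, heads, dim),
    "q_pe": (batch, heads, pe_dim),
    "kv": (batch, kv_ctx, kv_heads, dim),
    "k_pe": (batch, kv_ctx, kv_heads, pe_dim),
}

sizes = {name: tensor_bytes(shape) for name, shape in shapes.items()}
for name, size in sizes.items():
    print(f"{name:>5}: {size / 2**20:8.1f} MiB")

total_mib = sum(sizes.values()) / 2**20
print(f"total: {total_mib:8.1f} MiB")  # 1161.0 MiB
```

With these sizes the KV cache alone is 1 GiB, so shrinking `batch` or `kv_ctx` is the quickest way to fit the example on a smaller GPU.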

Acknowledgments
