TileRT: Tile-Based Runtime for Ultra-Low-Latency LLM Inference.
Project description
TileRT is a project for serving large language models (LLMs) in ultra-low-latency scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality, so that models with hundreds of billions of parameters can achieve millisecond-level time per output token (TPOT).
Unlike traditional inference systems optimized for high-throughput batch processing, TileRT prioritizes responsiveness, which is critical for applications such as high-frequency trading, interactive AI, real-time decision-making, long-running agents, and AI-assisted coding, where the latency of individual requests matters most.
To achieve this, TileRT introduces a tile-level runtime engine. Leveraging a compiler-driven approach, LLM operators are decomposed into fine-grained tile-level tasks, while the runtime dynamically reschedules computation, I/O, and communication across multiple devices in a highly overlapped manner. This design minimizes idle time and improves hardware utilization.
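To make the benefit of overlapping concrete, here is a toy model in Python. It is not TileRT's scheduler or API. It assumes a single operator split into a few tiles, each with three hypothetical stages (load, compute, send) that run on separate resources, and compares running everything back to back against a simple pipelined schedule. The names `TileTask`, `make_tile_tasks`, and the durations are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileTask:
    tile: int
    resource: str  # "io", "compute", or "comm" (hypothetical resource names)
    duration: int  # abstract time units


def make_tile_tasks(num_tiles, io=2, compute=3, comm=1):
    """Split one operator into per-tile stages: load -> compute -> send."""
    return [
        [
            TileTask(t, "io", io),
            TileTask(t, "compute", compute),
            TileTask(t, "comm", comm),
        ]
        for t in range(num_tiles)
    ]


def serial_makespan(tiles):
    """Total time when every stage runs back to back with no overlap."""
    return sum(task.duration for stages in tiles for task in stages)


def overlapped_makespan(tiles):
    """Total time when stages are pipelined across resources.

    Each resource handles one task at a time, and a stage starts only
    after the previous stage of the same tile has finished.
    """
    resource_free = {"io": 0, "compute": 0, "comm": 0}
    finish = 0
    for stages in tiles:
        ready = 0  # time at which this tile's next stage may start
        for task in stages:
            start = max(ready, resource_free[task.resource])
            ready = start + task.duration
            resource_free[task.resource] = ready
        finish = max(finish, ready)
    return finish


if __name__ == "__main__":
    tiles = make_tile_tasks(4)
    print("serial:", serial_makespan(tiles))
    print("overlapped:", overlapped_makespan(tiles))
```

In this toy setup the pipelined schedule finishes in 15 time units instead of 24, because loads and sends for one tile hide behind the compute of another. A real runtime that reschedules dynamically across multiple devices is considerably more involved. The sketch only shows why finer-grained tasks leave less idle time.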
For our latest v0.1.3 release, we benchmarked TileRT on the newly released GLM-5 model. We were among the first to support this model.
Figure 1. Evaluation setup. Batch size: 1; Input sequence length: 1K, 16K, 32K, 64K, 128K, 150K, 192K; Output sequence length: 1K; Benchmark with synthetic data. SGLang v0.5.9.dev0 with MTP=3; vLLM v0.16.0rc2.dev173 with MTP=1 (vLLM failed when MTP=3, so we set MTP=1 as vLLM-GPT5-recipe); TileRT v0.1.3 with MTP=3.
Figure 2. Evaluation setup. Batch size: 1; Input sequence length: 1K, 16K, 32K, 64K, 128K, 150K, 192K; Output sequence length: 1K; Benchmark with synthetic data. SGLang v0.5.9.dev0; vLLM v0.16.0rc2.dev173; TileRT v0.1.3.
Using the GLM-5 model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs, we evaluated TileRT's preliminary performance. In the benchmarks summarized in Figures 1 and 2, TileRT shows substantial improvements over the SGLang and vLLM baselines.
The project is actively evolving, and the underlying compiler techniques will be gradually shared with the community as they are integrated into TileLang and TileScale.
Installation
Before installing the TileRT wheel package, please ensure your environment meets the following requirements:
Supported Environment
This wheel is built and tested under the following conditions:
- Hardware: 8× NVIDIA B200 GPUs
- Operating System: Linux x86_64 (Ubuntu 20.04+ recommended)
- Python Versions: 3.11 – 3.12
- CUDA Version: 12.9
- CUDA Driver: Compatible with the B200 runtime environment
- PyTorch Build: PyTorch wheels compiled for CUDA 12.8 or 12.9 (matching the driver/runtime above for B200)
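The requirements above can be checked with a short script before installing. This is a minimal sketch and not part of TileRT: the function names `find_problems` and `probe_torch` are hypothetical. It only checks the Python version, platform, the CUDA version PyTorch was built for, and the visible GPU count. It does not verify the GPU model or the driver.

```python
import platform
import sys


def find_problems(py_version, system, machine, torch_cuda=None, gpu_count=None):
    """Return a list of mismatches against the supported environment."""
    problems = []
    if tuple(py_version[:2]) not in {(3, 11), (3, 12)}:
        problems.append(
            f"Python {py_version[0]}.{py_version[1]} is outside 3.11-3.12"
        )
    if (system, machine) != ("Linux", "x86_64"):
        problems.append(f"{system} {machine} is not Linux x86_64")
    # None means "could not be determined", so that check is skipped.
    if torch_cuda is not None and torch_cuda not in {"12.8", "12.9"}:
        problems.append(
            f"PyTorch built for CUDA {torch_cuda}, expected 12.8 or 12.9"
        )
    if gpu_count is not None and gpu_count != 8:
        problems.append(f"{gpu_count} GPUs visible, expected 8")
    return problems


def probe_torch():
    """Return (CUDA version of the PyTorch build, visible GPU count).

    Returns (None, None) when PyTorch is not installed.
    """
    try:
        import torch
    except ImportError:
        return None, None
    return torch.version.cuda, torch.cuda.device_count()


if __name__ == "__main__":
    cuda, gpus = probe_torch()
    issues = find_problems(
        sys.version_info, platform.system(), platform.machine(), cuda, gpus
    )
    print("\n".join(issues) if issues else "No mismatches found.")
```

An empty result only means that none of these basic checks failed. The Docker image described below remains the recommended way to get a known-good environment.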
Python Package Installation
Disclaimer: TileRT is an experimental project. The current preview build supports the 8-GPU B200 setup. For the most reliable experience, we strongly recommend installing the package within the provided Docker image. For more details on the Docker environment and usage instructions, please refer to the TileRT project homepage on GitHub.
Docker Installation
To get started, pull the Docker image:
docker pull tileai/tilert:v0.1.0
Then, launch a Docker container using the following command:
IMAGE_NAME="tileai/tilert:v0.1.0"
WORKSPACE_PATH="xxx"  # Path to the workspace you want to mount
docker run --gpus all -it \
    -v "$WORKSPACE_PATH":/workspace/ \
    "$IMAGE_NAME"
After the container starts, install the TileRT package:
pip install tilert
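To confirm that the install succeeded, you can query the installed distribution's version from Python. The sketch below uses only the standard library's `importlib.metadata` and assumes the distribution name is `tilert`, as in the `pip install` command above. The helper name `installed_version` is hypothetical, and the sketch does not call any TileRT API.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="tilert"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"tilert {found}" if found else "tilert is not installed")
```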
File details
Details for the file tilert-0.1.3-py3-none-manylinux2014_x86_64.whl.
File metadata
- Download URL: tilert-0.1.3-py3-none-manylinux2014_x86_64.whl
- Upload date:
- Size: 3.2 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 405eb82de70d08c62383516a59382b7baedc05bf550bf505eb4256e5d2c61dc1 |
| MD5 | 7099222102a64ad1152d90719985951e |
| BLAKE2b-256 | 9b2679d4d387d659091af10839d0a16746e67227258ea4af0b4a822388b2fb45 |
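If you download the wheel manually, you can compare its SHA256 digest against the value listed above. The following is a minimal sketch using Python's standard `hashlib`. The helper names and the local file path are assumptions. The expected digest is copied from the file hashes for `tilert-0.1.3-py3-none-manylinux2014_x86_64.whl`.

```python
import hashlib

# SHA256 digest listed above for tilert-0.1.3-py3-none-manylinux2014_x86_64.whl
EXPECTED_SHA256 = "405eb82de70d08c62383516a59382b7baedc05bf550bf505eb4256e5d2c61dc1"


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest equals the expected digest."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the wheel was saved.
    wheel = "tilert-0.1.3-py3-none-manylinux2014_x86_64.whl"
    try:
        print("OK" if matches_expected(wheel) else "Digest mismatch")
    except FileNotFoundError:
        print(f"{wheel} not found")
```

pip can also enforce digests itself through its hash-checking mode, driven by a requirements file. The script above is only a manual alternative.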