TileRT: Tile-Based Runtime for Ultra-Low-Latency LLM Inference.

TileRT is a project designed to serve large language models (LLMs) in ultra-low-latency scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—enabling models with hundreds of billions of parameters to achieve millisecond-level time per output token (TPOT).
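
For reference, TPOT is commonly computed as the decoding time divided by the number of tokens generated after the first one. The sketch below uses that common definition. The function name and the sample numbers are illustrative assumptions, not TileRT code or TileRT measurements.

```python
def time_per_output_token(
    total_latency_s: float, time_to_first_token_s: float, output_tokens: int
) -> float:
    """Return TPOT in seconds under the common definition:
    (end-to-end latency - time to first token) / (output tokens - 1)."""
    if output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (total_latency_s - time_to_first_token_s) / (output_tokens - 1)


# Hypothetical request (not a measurement): 2.5 s end to end,
# 0.5 s to the first token, 1001 output tokens.
tpot_ms = time_per_output_token(2.5, 0.5, 1001) * 1000.0
print(f"TPOT: {tpot_ms:.2f} ms")
```

With these made-up numbers, the 2.0 s of decoding time is spread over 1000 tokens, giving a TPOT of about 2 ms.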

Unlike traditional inference systems optimized for high-throughput batch processing, TileRT prioritizes responsiveness, which is critical for applications such as high-frequency trading, interactive AI, real-time decision-making, long-running agents, and AI-assisted coding, where the latency of individual requests matters most.

To achieve this, TileRT introduces a tile-level runtime engine. Leveraging a compiler-driven approach, LLM operators are decomposed into fine-grained tile-level tasks, while the runtime dynamically reschedules computation, I/O, and communication across multiple devices in a highly overlapped manner. This design minimizes idle time and improves hardware utilization.
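
To make the idea of tile-level tasks concrete, here is a minimal, purely illustrative sketch in plain Python. It is not TileRT's implementation or API: the names `TileTask`, `decompose_matmul`, and `run_tasks`, as well as the tile sizes, are hypothetical. It shows only how one matrix multiplication can be split into independent output-tile tasks that a scheduler is free to reorder. It does not model the overlap of computation, I/O, and communication across devices described above.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


@dataclass(frozen=True)
class TileTask:
    """Hypothetical unit of work: one rectangular tile of C = A @ B."""
    row_start: int
    row_end: int  # exclusive
    col_start: int
    col_end: int  # exclusive


def decompose_matmul(m: int, n: int, tile_m: int, tile_n: int) -> List[TileTask]:
    """Split an (m x n) output matrix into independent tile tasks."""
    tasks = []
    for r in range(0, m, tile_m):
        for c in range(0, n, tile_n):
            tasks.append(TileTask(r, min(r + tile_m, m), c, min(c + tile_n, n)))
    return tasks


def run_tasks(a: Matrix, b: Matrix, tasks: List[TileTask]) -> Matrix:
    """Execute tile tasks in the given order; each writes only its own tile."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for t in tasks:
        for i in range(t.row_start, t.row_end):
            for j in range(t.col_start, t.col_end):
                c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return c


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
tasks = decompose_matmul(3, 3, tile_m=2, tile_n=2)

# The tasks are independent, so any execution order gives the same result.
in_order = run_tasks(a, b, tasks)
reordered = run_tasks(a, b, list(reversed(tasks)))
print(len(tasks), in_order == reordered)
```

Because no tile depends on another, the order in which tasks run does not affect the result. That independence is what gives a scheduler room to interleave work.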

In the latest release, v0.1.3, we benchmarked TileRT on the recently released GLM-5 model and were among the first to support it. The results, summarized in Figures 1 and 2, demonstrate the effectiveness of the approach.

Figure 1. TileRT benchmark. Evaluation setup: batch size 1; input sequence lengths 1K, 16K, 32K, 64K, 128K, 150K, and 192K; output sequence length 1K; synthetic benchmark data. Systems: SGLang v0.5.9.dev0 with MTP=3; vLLM v0.16.0rc2.dev173 with MTP=1 (vLLM failed with MTP=3, so we set MTP=1 as in vLLM-GLM5-recipe); TileRT v0.1.3 with MTP=3. MTP stands for multi-token prediction.

Figure 2. TileRT benchmark. Evaluation setup: batch size 1; input sequence lengths 1K, 16K, 32K, 64K, 128K, 150K, and 192K; output sequence length 1K; synthetic benchmark data. Systems: SGLang v0.5.9.dev0; vLLM v0.16.0rc2.dev173; TileRT v0.1.3.

We evaluated TileRT's preliminary performance using the GLM-5 model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in Figures 1 and 2, TileRT delivers substantial improvements over existing inference systems.

The project is actively evolving, and the underlying compiler techniques will be gradually shared with the community as they are integrated into TileLang and TileScale.

Installation

Before installing the TileRT wheel package, please ensure your environment meets the following requirements:

Supported Environment

This wheel is built and tested under the following conditions:

  • Hardware: 8× NVIDIA B200 GPUs
  • Operating System: Linux x86_64 (Ubuntu 20.04+ recommended)
  • Python Versions: 3.11 – 3.12
  • CUDA Version: 12.9
  • CUDA Driver: Compatible with the B200 runtime environment
  • PyTorch Build: PyTorch wheels compiled for CUDA 12.8 or 12.9 (matching the driver/runtime above for B200)
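
Some of the requirements above can be checked before installing. The sketch below is a hypothetical pre-flight helper (`check_environment` is our name and is not part of TileRT) that uses only the standard library to check the operating system, architecture, and Python version. It does not verify the GPUs, CUDA version, driver, or PyTorch build. Those need tools such as `nvidia-smi` or, once PyTorch is installed, `torch.version.cuda`.

```python
import platform
import sys
from typing import List, Tuple

# Supported range taken from the list above: Python 3.11 to 3.12.
MIN_PYTHON = (3, 11)
MAX_PYTHON = (3, 12)


def check_environment(
    system: str, machine: str, python_version: Tuple[int, int]
) -> List[str]:
    """Return a list of problems found; an empty list means no problem."""
    problems = []
    if system != "Linux":
        problems.append(f"unsupported OS: {system} (expected Linux)")
    if machine != "x86_64":
        problems.append(f"unsupported architecture: {machine} (expected x86_64)")
    if not (MIN_PYTHON <= python_version <= MAX_PYTHON):
        found = ".".join(str(part) for part in python_version)
        problems.append(f"unsupported Python: {found} (expected 3.11 or 3.12)")
    return problems


if __name__ == "__main__":
    issues = check_environment(
        platform.system(), platform.machine(), tuple(sys.version_info[:2])
    )
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("OS, architecture, and Python version match the tested setup.")
```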

Python Package Installation

Disclaimer: TileRT is an experimental project. The current preview build supports the 8-GPU B200 setup. For the most reliable experience, we strongly recommend installing the package within the provided Docker image. For more details on the Docker environment and usage instructions, please refer to the TileRT project homepage on GitHub.

Docker Installation

To get started, pull the Docker image:

docker pull tileai/tilert:v0.1.0

Then, launch a Docker container using the following command:

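IMAGE_NAME="tileai/tilert:v0.1.0"
WORKSPACE_PATH="xxx"  # Path to the workspace you want to mount

# Expose all GPUs and mount the workspace at /workspace/ inside the container.
# The variables are quoted so that paths containing spaces are handled correctly.
docker run --gpus all -it \
    -v "$WORKSPACE_PATH":/workspace/ \
    "$IMAGE_NAME"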

After the container starts, install the TileRT package:

pip install tilert
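
To confirm which version was installed, a small check with the standard library's `importlib.metadata` can help. This sketch relies only on the distribution name `tilert` from the `pip install` command above. The helper name `installed_version` is ours, and no TileRT API is called.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "tilert") -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("tilert is not installed in this environment")
    else:
        print(f"tilert {found} is installed")
```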

Source Distributions

No source distribution files are available for this release.

Built Distribution

tilert-0.1.3-py3-none-manylinux2014_x86_64.whl (3.2 MB view details)

File hashes

Hashes for tilert-0.1.3-py3-none-manylinux2014_x86_64.whl:

  Algorithm    Hash digest
  SHA256       405eb82de70d08c62383516a59382b7baedc05bf550bf505eb4256e5d2c61dc1
  MD5          7099222102a64ad1152d90719985951e
  BLAKE2b-256  9b2679d4d387d659091af10839d0a16746e67227258ea4af0b4a822388b2fb45
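
A downloaded wheel can be checked against the SHA256 digest listed above before installing. The following sketch uses only `hashlib` from the standard library. The helper name `sha256_of` and the assumption that the wheel sits in the current directory are ours.

```python
import hashlib
from pathlib import Path

# File name and digest copied from the listing above.
WHEEL_NAME = "tilert-0.1.3-py3-none-manylinux2014_x86_64.whl"
EXPECTED_SHA256 = "405eb82de70d08c62383516a59382b7baedc05bf550bf505eb4256e5d2c61dc1"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    wheel = Path(WHEEL_NAME)  # assumed to be in the current directory
    if not wheel.exists():
        print(f"{WHEEL_NAME} not found in the current directory")
    elif sha256_of(wheel) == EXPECTED_SHA256:
        print("SHA256 matches the published digest")
    else:
        print("SHA256 MISMATCH: do not install this file")
```

pip can also enforce this at install time through its hash-checking mode, using a requirements file that pins the digest with `--hash=sha256:...`.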
