
v0.1.0-alpha.1 Release Notes


TileRT: Pushing the Boundaries of Low-Latency LLM Inference


We’re excited to announce the first preview release of TileRT (v0.1.0-alpha.1). This early, exploratory release introduces an experimental runtime built around tile-level compilation techniques for ultra-low-latency LLM inference. It serves as a starting point for evaluating TileRT’s potential to reduce end-to-end latency while remaining compatible with large-scale models, and it lays the groundwork for future integration with TileLang and TileScale.

🚀 Overview

The goal of the TileRT project is to push the latency boundaries of LLMs without compromising model size or quality: for example, enabling models with hundreds of billions of parameters to run at millisecond-level time per output token (TPOT). TileRT addresses this challenge with a new tile-level runtime engine. A compiler-driven front end decomposes LLM operators into fine-grained tile-level tasks, and a tile-level runtime reschedules compute, I/O, and communication across multiple devices in a highly overlapped manner, minimizing idle time and maximizing hardware utilization. These compiler techniques will be incorporated into TileLang and TileScale.
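To make the overlap idea concrete, here is a minimal, hypothetical sketch in PyTorch (not the actual TileRT API, whose schedules are compiler-generated): one operator is split into row tiles, and each tile's device-to-host copy is issued on a second CUDA stream so it runs concurrently with the next tile's compute.

import torch

def tiled_matmul_with_io_overlap(a, b, tile_rows=1024):
    """Split a matmul into row tiles and overlap each tile's
    compute with the device-to-host copy of the finished tile."""
    compute, io = torch.cuda.Stream(), torch.cuda.Stream()
    compute.wait_stream(torch.cuda.current_stream())  # inputs are ready
    out = torch.empty(a.shape[0], b.shape[1], device=a.device, dtype=a.dtype)
    host = torch.empty(out.shape, dtype=out.dtype, pin_memory=True)
    for r0 in range(0, a.shape[0], tile_rows):
        r1 = min(r0 + tile_rows, a.shape[0])
        with torch.cuda.stream(compute):
            out[r0:r1] = a[r0:r1] @ b          # fine-grained tile-level task
        io.wait_stream(compute)                # copy starts once this tile is done...
        with torch.cuda.stream(io):            # ...while the next tile's matmul proceeds
            host[r0:r1].copy_(out[r0:r1], non_blocking=True)
    torch.cuda.synchronize()
    return host

TileRT applies this kind of overlap far more aggressively: across communication as well as I/O, across all devices, and with the schedule derived by the compiler rather than written by hand.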

We evaluated TileRT’s preliminary performance using the DeepSeek-V3.2-Exp model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT significantly outperforms existing inference systems:

[Figure: TileRT benchmark results]
Fig. Evaluation setup: input/output sequence length 1K/1K; SGLang 0.5.5; vLLM 0.11.0; CUDA 12.9.

TileRT is a continuously evolving project. Our roadmap includes more aggressive optimizations; support for additional batch sizes, model families, and hardware platforms; and a new foundation for low-latency AI inference. Stay tuned for updates!

Installation

Before installing the TileRT wheel package, please ensure your environment meets the following requirements:

Supported Environment

This wheel is built and tested under the following conditions:

  • Hardware: 8× NVIDIA B200 GPUs
  • Operating System: Linux x86_64 (Ubuntu 20.04+ recommended)
  • Python Versions: 3.11 – 3.12
  • CUDA Version: 12.9
  • CUDA Driver: Compatible with the B200 runtime environment
  • PyTorch Build: PyTorch wheels compiled for CUDA 12.8 or 12.9 (matching the driver/runtime above for B200)
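A short script along the following lines (a quick sanity check, assuming PyTorch is already installed in your environment) can confirm the requirements above before you install the wheel:

import sys
import torch

assert sys.version_info[:2] in ((3, 11), (3, 12)), "Python 3.11-3.12 required"
assert torch.cuda.is_available(), "CUDA runtime not available"
assert torch.cuda.device_count() == 8, "expected 8x GPUs"
print("Device:", torch.cuda.get_device_name(0))   # should report an NVIDIA B200
print("PyTorch CUDA build:", torch.version.cuda)  # expect 12.8 or 12.9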

Python Package Installation

Important

Disclaimer: TileRT is an experimental project, and the current preview build supports only the 8-GPU B200 setup. For the most reliable experience, we strongly recommend installing the package inside the provided Docker image.
For more details on the Docker environment and usage instructions, please refer to the TileRT project homepage on GitHub.

Docker Installation

To get started, pull the Docker image:

docker pull tileai/tilert:v0.1.0

Then, launch a Docker container using the following command:

IMAGE_NAME="tileai/tilert:v0.1.0"
WORKSPACE_PATH="xxx"  # Path to the workspace you want to mount

docker run --gpus all -it \
    -v "$WORKSPACE_PATH":/workspace/ \
    "$IMAGE_NAME"

After the container starts, install the TileRT package:

pip install tilert
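To confirm the installation succeeded, you can query the installed version using only the Python standard library (so the check makes no assumptions about TileRT's own API surface):

# Uses only the standard library; prints the installed TileRT version.
from importlib.metadata import version
print(version("tilert"))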

🌟 Join the Journey

TileRT is developed and maintained by the TileRT team. This preview release marks just the beginning, and we’re continuing to explore new compiler techniques, improve runtime performance, and expand multi-device support.

If you’re interested in ultra-low-latency LLM inference, we invite you to follow the project, share feedback, and join us as TileRT evolves.