v0.1.0-alpha.1 Release Notes
We’re excited to announce the first preview release of TileRT (v0.1.0-alpha.1). This initial release introduces an experimental runtime that explores tile-level compilation techniques for ultra-low-latency LLM inference. It serves as a starting point for evaluating TileRT’s potential to reduce end-to-end latency while maintaining compatibility with large-scale models and supporting future integration with TileLang and TileScale.
🚀 Overview
The goal of the TileRT project is to push the latency boundaries of LLMs without compromising model size or quality: for example, enabling models with hundreds of billions of parameters to run at millisecond-level TPOT (time per output token). TileRT addresses this challenge with a new tile-level runtime engine. It uses a compiler-driven approach to decompose LLM operators into fine-grained tile-level tasks, and a tile-level runtime that reschedules compute, I/O, and communication across multiple devices in a highly overlapped manner. This allows TileRT to minimize idle time and maximize hardware utilization. These compiler techniques will be incorporated into TileLang and TileScale.
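To make the overlap idea concrete, here is a minimal, hypothetical PyTorch sketch (in no way TileRT’s actual implementation or API): it splits a matrix-vector product into row tiles and overlaps the host-to-device copy of the next tile with the compute on the current one, using two CUDA streams.

```python
import torch

# Conceptual sketch only, NOT TileRT's API: overlap the copy of weight
# tile i+1 with the matmul on tile i via two CUDA streams.
assert torch.cuda.is_available()

main_stream = torch.cuda.current_stream()
copy_stream = torch.cuda.Stream()

A_host = torch.randn(4096, 4096).pin_memory()  # weights staged on the host
x = torch.randn(4096, 1, device="cuda")
tiles = A_host.split(1024, dim=0)              # fine-grained row tiles

outs = []
with torch.cuda.stream(copy_stream):
    cur = tiles[0].to("cuda", non_blocking=True)  # prefetch first tile

for i in range(len(tiles)):
    main_stream.wait_stream(copy_stream)          # tile i must be resident
    cur.record_stream(main_stream)                # tensor is also used on main
    if i + 1 < len(tiles):
        with torch.cuda.stream(copy_stream):      # runs concurrently with
            nxt = tiles[i + 1].to("cuda", non_blocking=True)  # the matmul below
    outs.append(cur @ x)                          # tile-level compute
    if i + 1 < len(tiles):
        cur = nxt

torch.cuda.synchronize()
y = torch.cat(outs)  # same result as A_host.cuda() @ x
```

TileRT applies this kind of overlap at far finer granularity, across operators and across devices; the sketch only conveys the scheduling principle.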
We evaluated TileRT’s preliminary performance using the DeepSeek-V3.2-Exp model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT significantly outperforms existing inference systems:
Fig. Evaluation setup: input/output sequence length 1K/1K; baselines SGLang 0.5.5 and vLLM 0.11.0; CUDA 12.9.
TileRT is a continuously evolving project. Our ongoing plans include pursuing more aggressive optimizations, supporting more batch sizes, model families, and hardware platforms, and establishing a new foundation for low-latency AI inference. Stay tuned for updates!
Installation
Before installing the TileRT wheel package, please ensure your environment meets the following requirements:
Supported Environment
This wheel is built and tested under the following conditions:
- Hardware: 8× NVIDIA B200 GPUs
- Operating System: Linux x86_64 (Ubuntu 20.04+ recommended)
- Python Versions: 3.11 – 3.12
- CUDA Version: 12.9
- CUDA Driver: Compatible with the B200 runtime environment
- PyTorch Build: PyTorch wheels compiled for CUDA 12.8 or 12.9 (matching the driver/runtime above for B200)
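A quick sanity check along these lines can confirm the environment before installing. This snippet is illustrative only and is not shipped with TileRT:

```python
import sys
import torch

# Hypothetical pre-install check, mirroring the requirements listed above.
assert (3, 11) <= sys.version_info[:2] <= (3, 12), "Python 3.11-3.12 required"
assert torch.version.cuda is not None and torch.version.cuda.startswith(("12.8", "12.9")), \
    f"PyTorch must be built for CUDA 12.8/12.9, got {torch.version.cuda}"
assert torch.cuda.device_count() == 8, "expected 8 GPUs"
print(torch.cuda.get_device_name(0))  # should report an NVIDIA B200
```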
Python Package Installation
Important
Disclaimer: TileRT is an experimental project. The current preview build supports only the 8× B200 GPU setup described above. For the most reliable experience, we strongly recommend installing the package inside the provided Docker image.
For more details on the Docker environment and usage instructions, please refer to the TileRT project homepage on GitHub.
Docker Installation
To get started, pull the Docker image:
```bash
docker pull tileai/tilert:v0.1.0
```

Then, launch a Docker container using the following command:

```bash
IMAGE_NAME="tileai/tilert:v0.1.0"
WORKSPACE_PATH="xxx"  # Path to the workspace you want to mount

docker run --gpus all -it \
  -v $WORKSPACE_PATH:/workspace/ \
  $IMAGE_NAME
```

After the container starts, install the TileRT package:

```bash
pip install tilert
```
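A quick import inside the container can confirm the wheel is usable. Note that the `__version__` attribute here is an assumption on our part, not a documented guarantee, hence the guarded lookup:

```python
# Hypothetical smoke test: confirm the installed wheel imports cleanly.
import tilert

print(getattr(tilert, "__version__", "installed"))
```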
🌟 Join the Journey
TileRT is developed and maintained by the TileRT team. This preview release marks just the beginning, and we’re continuing to explore new compiler techniques, improve runtime performance, and expand multi-device support.
If you’re interested in ultra-low-latency LLM inference, we invite you to follow the project, share feedback, and join us as TileRT evolves.
- 💬 Start a conversation via Issues
- 📧 Contact the TileRT team: [email protected]