The LongCat model series has consistently followed the principle of Model–System Co-Design, which creates unique challenges for both the training and inference systems. To help the community adopt and use LongCat models, we are open-sourcing part of our inference engine (SGLang-FluentLLM) along with several key kernels.
Our inference engine is built on top of the SGLang codebase, with the following enhanced capabilities:
- Refactored the speculative decoding workflow to make it compatible with overlap scheduling
- Combined Target + Verify + Draft into a single CUDA graph to reduce speculative decoding overhead
- Support for Eagle, MTP, and PLD style speculative decoding
- Layer-wise KVCache transfer, overlapping prefill computation with KVCache communication
- Decode Radix Tree Cache to reduce the volume of KVCache transferred between prefill and decode (PD) instances
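To make the speculative decoding items above more concrete, here is a minimal sketch of the draft-then-verify idea, assuming PLD refers to prompt lookup decoding, where draft tokens are copied from an earlier occurrence of the most recent n-gram. The function names `propose_draft` and `accept_prefix` and their parameters are hypothetical and are not part of the SGLang-FluentLLM API. The real engine runs target, verify, and draft steps on the GPU inside a CUDA graph, which this plain-Python illustration does not attempt to model.

```python
from typing import List, Sequence


def propose_draft(tokens: Sequence[int], ngram: int = 3, num_draft: int = 4) -> List[int]:
    """Propose draft tokens by reusing what followed an earlier copy of the trailing n-gram."""
    if len(tokens) < ngram + 1:
        return []
    tail = list(tokens[-ngram:])
    # Scan from the most recent candidate position backwards,
    # skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == tail:
            follow = start + ngram
            return list(tokens[follow:follow + num_draft])
    return []


def accept_prefix(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Keep draft tokens only while they agree with the target model's greedy tokens."""
    accepted: List[int] = []
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted.append(d)
    return accepted


if __name__ == "__main__":
    history = [1, 2, 3, 4, 5, 6, 1, 2, 3]
    draft = propose_draft(history)
    # `target` stands in for the tokens the target model would have produced.
    print(draft, accept_prefix(draft, [4, 5, 9, 9]))
```

Every accepted draft token saves one sequential decoding step of the target model, which is why reducing the per-step overhead of the verify stage matters.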
We sincerely appreciate the solid work and inspiration brought by the SGLang community.
On the kernels side, we are open-sourcing:
- FlashMLA SwapAB optimizations
- FlashMLA FP8 KVCache + FP8 Compute optimizations
  - This optimization is detailed in the paper SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining.
- DeepGemm SwapAB Offset + PDL optimizations
- Communication-computation fused kernel optimizations in FlashInfer
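For readers unfamiliar with the term, SwapAB-style optimizations are generally understood to rest on the identity A·B = (Bᵀ·Aᵀ)ᵀ: swapping the operands lets a small batch dimension land on the side of the matrix instruction that supports finer tile sizes. The sketch below illustrates only that identity in plain Python, under the assumption that this is what SwapAB refers to here. The helper names are hypothetical, and this is not the kernel code.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    """Return the transpose of a nested-list matrix."""
    return [list(row) for row in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive (M x K) @ (K x N) product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matmul_swap_ab(a: Matrix, b: Matrix) -> Matrix:
    """Compute A @ B as (B^T @ A^T)^T, so the operands trade places."""
    return transpose(matmul(transpose(b), transpose(a)))


if __name__ == "__main__":
    a = [[1, 2, 3]]                # M = 1 (small batch), K = 3
    b = [[1, 2], [3, 4], [5, 6]]   # K = 3, N = 2
    print(matmul(a, b), matmul_swap_ab(a, b))
```

Both routes give the same result. In a real kernel the benefit comes from how the swapped shapes map onto hardware tiles, which this illustration does not measure.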
We would also like to thank the broader LLM inference community. It is an honor for us to grow together with this community.
- We use Dynamo for KVCache-aware request scheduling. As a result, in SGLang-FluentLLM we have removed SGLang’s sgl-model-gateway.
- For multimodal models, we adopt a decoupled architecture that differs from the one used in the SGLang community, so multimodal support has also been removed from SGLang-FluentLLM itself. In our internal setup, SGLang-FluentLLM still serves as the LLM backbone for multimodal inference.
- Tested on NVIDIA H800 and H20 GPUs.
Please refer to the Quick Start.