meituan-longcat/SGLang-FluentLLM

SGLang-FluentLLM

The LongCat series models have consistently followed the principle of Model–System Co-Design, which introduces unique challenges for both the training and inference systems. To help the community better adopt and use LongCat models, we are open-sourcing part of our inference engine (SGLang-FluentLLM) as well as several key kernels.

Engine

Our inference engine is built on top of the SGLang codebase, with the following enhanced capabilities:

  • Refactored the speculative decoding workflow to make it compatible with overlap scheduling
  • Combined Target + Verify + Draft into a single CUDA graph to reduce speculative decoding overhead
  • Support for Eagle, MTP, and PLD-style speculative decoding
  • Layer-wise KVCache transfer, overlapping prefill computation with KVCache communication
  • Decode Radix Tree Cache to reduce KVCache transfer volume between prefill and decode (PD) instances
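
To illustrate the kind of drafting that PLD-style speculative decoding relies on, here is a minimal sketch. It assumes PLD refers to prompt lookup decoding, where draft tokens are copied from an earlier occurrence of the current trailing n-gram in the context. The function name `propose_draft_tokens` and its parameters are hypothetical and are not taken from the SGLang-FluentLLM codebase; the engine's actual implementation may differ.

```python
from typing import List, Sequence


def propose_draft_tokens(
    token_ids: Sequence[int],
    max_ngram: int = 3,   # hypothetical parameter: longest suffix to match
    num_draft: int = 4,   # hypothetical parameter: max tokens to propose
) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram against earlier context."""
    n_tokens = len(token_ids)
    # Try longer n-grams first, since a longer match is a stronger signal.
    for n in range(min(max_ngram, n_tokens - 1), 0, -1):
        suffix = list(token_ids[n_tokens - n:])
        # Scan right to left so the most recent earlier occurrence wins.
        for start in range(n_tokens - n - 1, -1, -1):
            if list(token_ids[start:start + n]) == suffix:
                follow = start + n
                # Copy the tokens that followed the earlier occurrence.
                return list(token_ids[follow:follow + num_draft])
    # No earlier occurrence found: propose nothing for this step.
    return []
```

The proposed tokens are only candidates. In any speculative decoding scheme, the target model still verifies them and keeps only the accepted prefix, so a poor draft costs some wasted compute but does not change the output.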

We sincerely appreciate the solid work and inspiration brought by the SGLang community.

Kernels

On the kernels side, we are open-sourcing:

We would also like to thank the broader LLM inference community. It is an honor for us to grow together with this community.

Note

  • We use Dynamo for KVCache-aware request scheduling. As a result, in SGLang-FluentLLM we have removed SGLang’s sgl-model-gateway.
  • For multimodal models, we adopt a decoupled architecture that differs from the one used in the SGLang community. Therefore, multimodal support has also been removed from SGLang-FluentLLM itself. In our internal setup, SGLang-FluentLLM still serves as the LLM backbone for multimodal inference.
  • Tested on NVIDIA H800 and H20 GPUs.

How to Use

Please refer to Quick Start
