SGLang-FluentLLM

The LongCat model series has consistently followed the principle of Model–System Co-Design, which poses unique challenges for both the training and inference systems. To help the community adopt and use LongCat models, we are open-sourcing part of our inference engine (SGLang-FluentLLM) along with several key kernels.

Engine

Our inference engine is built on top of the SGLang codebase and adds the following capabilities:

- Refactored speculative decoding workflow that is compatible with overlap scheduling
- Target + Verify + Draft combined into a single CUDA graph to reduce speculative decoding overhead
- Support for Eagle, MTP, and PLD style speculative decoding
- Layer-wise KVCache transfer, overlapping prefill computation with KVCache communication
- Decode Radix Tree Cache to reduce KVCache transfer volume between prefill and decode (PD) instances
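To make the draft and verify steps of speculative decoding concrete, here is a minimal sketch in plain Python. It assumes that "PLD" refers to prompt lookup decoding, where draft tokens are copied from an earlier occurrence of the most recent n-gram in the context. The function names (`propose_draft`, `verify_greedy`), the parameters, and the toy target model are all hypothetical and are not part of SGLang-FluentLLM; a real engine scores every draft position in one batched target forward pass rather than calling the target model token by token.

```python
from typing import Callable, List, Sequence


def propose_draft(tokens: Sequence[int], ngram: int = 2, max_draft: int = 4) -> List[int]:
    """Hypothetical draft step: copy the tokens that followed the most
    recent earlier occurrence of the trailing n-gram."""
    if len(tokens) <= ngram:
        return []
    tail = list(tokens[-ngram:])
    # Scan from right to left, skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == tail:
            return list(tokens[start + ngram:start + ngram + max_draft])
    return []


def verify_greedy(
    tokens: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Hypothetical greedy verify step: keep draft tokens while they agree
    with the target model, and always emit one token from the target."""
    context = list(tokens)
    emitted: List[int] = []
    for d in draft:
        t = target_next(context)  # called sequentially here for clarity only
        emitted.append(t)
        context.append(t)
        if t != d:
            # First mismatch: the target's token replaces the draft token.
            return emitted
    # Every draft token was accepted, so the target contributes a bonus token.
    emitted.append(target_next(context))
    return emitted


def toy_target(context: List[int]) -> int:
    """Toy stand-in for a target model: cycles through tokens 1, 2, 3, 4."""
    return (context[-1] % 4) + 1


history = [1, 2, 3, 4, 1, 2]
draft = propose_draft(history, ngram=2, max_draft=3)
accepted = verify_greedy(history, draft, toy_target)
```

When the draft matches the target, one verify step emits several tokens at once; on a mismatch it still emits one correct token, so the output is the same as plain greedy decoding.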

We sincerely appreciate the solid work and inspiration brought by the SGLang community.

Kernels

On the kernels side, we are open-sourcing:

- FlashMLA SwapAB optimizations
- FlashMLA FP8 KVCache + FP8 Compute optimizations
  - This optimization is detailed in the paper SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining.
- DeepGemm SwapAB Offset + PDL optimizations
- Communication–computation fused kernel optimizations in FlashInfer

We would also like to thank the broader LLM inference community. It is an honor to grow together with this community.

Note

- We use Dynamo for KVCache-aware request scheduling, so SGLang’s sgl-model-gateway has been removed from SGLang-FluentLLM.
- For multimodal models, we adopt a decoupled architecture that differs from the one used in the SGLang community, so multimodal support has also been removed from SGLang-FluentLLM itself (in our internal setup, SGLang-FluentLLM still serves as the LLM backbone for multimodal inference).
- Tested on NVIDIA H800 and H20 GPUs.

How to Use

Please refer to the Quick Start guide (Quick_Start.md).

