SGLang-Omni is a high-performance serving framework for omni and multimodal models, built on top of SGLang. It orchestrates multi-stage pipelines with low latency and exposes OpenAI-compatible APIs.
Modern omni models, such as speech-output LLMs and multimodal generation systems, decompose into heterogeneous stages with fundamentally different computational profiles: a compute-bound thinker, a memory-bound talker, and a latency-sensitive codec. SGLang-Omni is built around a computation-centric design: each stage runs its own independent scheduler tuned to its bottleneck, communicates through a shared inbox/outbox abstraction, and transfers tensors via zero-copy shared memory. This isolation keeps any single stage from degrading the others and lets new models plug into the framework by declaring a pipeline topology rather than building an inference system from scratch.
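The inbox/outbox idea can be illustrated with a toy sketch. This is not SGLang-Omni's code: `Stage`, `run_pipeline`, and `tag` are hypothetical names, threads and `queue.Queue` stand in for separate processes and zero-copy shared memory, and a per-stage `max_batch` stands in for a scheduler tuned to each stage's bottleneck.

```python
import queue
import threading
from dataclasses import dataclass, field
from typing import Callable, List, Optional

_STOP = object()  # sentinel that tells a stage to shut down


@dataclass
class Stage:
    """Hypothetical stage: own inbox, own batching policy, own worker loop."""
    name: str
    process_batch: Callable[[List[str]], List[str]]
    max_batch: int = 1  # stand-in for a per-stage scheduling policy
    inbox: queue.Queue = field(default_factory=queue.Queue)
    outbox: Optional[queue.Queue] = None  # wired up by run_pipeline

    def run(self) -> None:
        assert self.outbox is not None, "stage must be wired before running"
        stopping = False
        while not stopping:
            item = self.inbox.get()  # block until work arrives
            if item is _STOP:
                break
            batch = [item]
            # Opportunistically batch whatever is already waiting.
            while len(batch) < self.max_batch:
                try:
                    nxt = self.inbox.get_nowait()
                except queue.Empty:
                    break
                if nxt is _STOP:
                    stopping = True
                    break
                batch.append(nxt)
            for out in self.process_batch(batch):
                self.outbox.put(out)
        self.outbox.put(_STOP)  # propagate shutdown downstream


def run_pipeline(stages: List[Stage], requests: List[str]) -> List[str]:
    """Wire a linear topology: each stage's outbox is the next stage's inbox."""
    results: queue.Queue = queue.Queue()
    for upstream, downstream in zip(stages, stages[1:]):
        upstream.outbox = downstream.inbox
    stages[-1].outbox = results

    threads = [threading.Thread(target=s.run) for s in stages]
    for t in threads:
        t.start()
    for r in requests:
        stages[0].inbox.put(r)
    stages[0].inbox.put(_STOP)
    for t in threads:
        t.join()

    collected = []
    while True:
        item = results.get()
        if item is _STOP:
            break
        collected.append(item)
    return collected


def tag(name: str) -> Callable[[List[str]], List[str]]:
    # Placeholder "model": appends the stage name to each request.
    return lambda batch: [f"{x}>{name}" for x in batch]


stages = [
    Stage("thinker", tag("thinker"), max_batch=4),
    Stage("talker", tag("talker"), max_batch=2),
    Stage("codec", tag("codec"), max_batch=1),
]
print(run_pipeline(stages, ["req0", "req1"]))
```

Because each stage only sees its own inbox and outbox, changing one stage's batching policy does not require touching the others, which is the property the design above relies on.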
Core features:
- Multi-Stage Pipeline: Flexible framework for orchestrating preprocessing, AR engine, codec, and vocoder stages across processes and GPUs.
- Native SGLang Integration: Leverages SGLang's RadixAttention, continuous batching, and CUDA Graph optimizations for the AR backbone.
- OpenAI-Compatible Server: Drop-in `/v1/audio/speech` and `/v1/chat/completions` endpoints with real-time streaming support.
- Broad Model Support: Supports a growing set of TTS and omni models including Higgs Audio, Fish Audio S2-Pro, Voxtral TTS, Qwen3 TTS, MOSS-TTS, Qwen3-Omni, Ming-Omni, and LLaDA2.0-Uni.
| Model | Type | Notes |
|---|---|---|
| boson-sglang/higgs-audio-v3-tts-4b-base | TTS | Voice cloning, streaming, 100+ languages |
| fishaudio/s2-pro | TTS | Voice cloning, streaming |
| mistralai/Voxtral-4B-TTS-2603 | TTS | Named voices, streaming, 9 languages |
| Qwen/Qwen3-TTS-12Hz-Base | TTS | Voice cloning, streaming, 10 languages, 0.6B / 1.7B |
| OpenMOSS-Team/MOSS-TTS-v1.5 | TTS | Voice cloning, streaming, 31 languages |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Omni | Text, image, audio, video → text + audio |
| inclusionAI/Ming-flash-omni-2.0 | Omni | Streaming TTS |
| inclusionAI/LLaDA2.0-Uni | Multimodal | Text + image understanding and generation |
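As a rough illustration of the OpenAI-compatible speech endpoint, the sketch below builds (but does not send) a request to `/v1/audio/speech`. The field names `model`, `input`, `voice`, and `response_format` follow OpenAI's speech API; which of them SGLang-Omni honors, the base URL `http://localhost:8000`, and the voice name `default` are all assumptions, and `build_speech_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_speech_request(
    base_url: str,
    model: str,
    text: str,
    voice: str = "default",        # assumed voice name, not from the docs
    response_format: str = "wav",  # assumed to be accepted by the server
) -> urllib.request.Request:
    """Build a POST request shaped like OpenAI's /v1/audio/speech call."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_speech_request(
    "http://localhost:8000",  # assumed address of a running server
    "fishaudio/s2-pro",       # one of the TTS models listed above
    "Hello from SGLang-Omni.",
)

# Sending it requires a running server, so it is left commented out:
# with urllib.request.urlopen(req) as resp, open("speech.wav", "wb") as f:
#     f.write(resp.read())
```

An existing OpenAI client pointed at the server's base URL should produce an equivalent request, which is what "drop-in" implies here.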