A production-ready, high-performance serving framework for large language models with OpenAI-compatible APIs.
SFLLM (Serving Framework for Large Language Models) is designed to provide efficient and scalable inference services for large language models. It focuses on maximizing GPU utilization and reducing inference latency through intelligent batching, CUDA optimizations, and memory-efficient implementations.
- OpenAI-Compatible API: Drop-in replacement for OpenAI API endpoints (see the client sketch after this list)
- High Performance: Optimized inference with intelligent request batching
- Streaming Support: Real-time streaming responses for better user experience
- CUDA Optimizations: CUDA graphs and custom kernels for maximum performance
- Memory Efficient: Optimized KV-cache management and memory allocation
- Production Ready: Built-in health checks and error handling
- Eagle3 Speculative Decoding: Advanced speculative decoding with Eagle3 algorithm for faster generation
- Full-Graph Speculative Decoding: Captures the entire draft-verify-accept loop in a single CUDA graph, substantially reducing CPU overhead
- Eagle3 with CUDA Graph: Optimized Eagle3 implementation with CUDA graph acceleration
- Overlap Scheduling: Overlaps speculative decoding with scheduling work so the GPU stays busy at all times
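Because the endpoints are OpenAI-compatible, the official `openai` Python client can talk to SFLLM directly. A minimal sketch, assuming a server started as in Basic Usage below (port 8081) and a placeholder model name; `max_new_tokens` is passed via `extra_body` to match the curl examples in this README:

```python
# Minimal client sketch. Assumes the server from "Basic Usage" below is
# running on localhost:8081; the model name is a placeholder.
# Requires: pip install openai
from openai import OpenAI

# SFLLM exposes OpenAI-compatible endpoints, so the official client works;
# the API key is unused by a local server but required by the client.
client = OpenAI(base_url="http://localhost:8081/v1", api_key="unused")

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=0.7,
    # This README's curl examples use max_new_tokens rather than OpenAI's
    # max_tokens, so it is passed through extra_body here (an assumption
    # based on those examples).
    extra_body={"max_new_tokens": 256},
)
print(response.choices[0].message.content)
```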
Requirements:
- Python 3.8+
- CUDA 11.8+ (for GPU acceleration)
- PyTorch 2.0+
```bash
# Clone the repository
git clone https://github.com/wejoncy/sfllm.git
cd sfllm

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .
```

Basic Usage:
```bash
python python/sfllm/serving/app.py \
    --model /path/to/your/model \
    --port 8081 \
    --dtype float16
```

With Eagle3 Speculative Decoding:
```bash
python python/sfllm/serving/app.py \
    --model /path/to/your/model \
    --draft-model-path /path/to/eagle3/draft/model \
    --speculative-algorithm eagle3 \
    --speculative-num-steps 4 \
    --port 8081 \
    --dtype float16
```
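To make the flags above concrete, here is a toy, framework-agnostic sketch of the draft-verify-accept loop that speculative decoding builds on. It is not SFLLM's Eagle3 implementation; `draft_step` and `target_argmax` are hypothetical stand-ins for the draft and target models, and `--speculative-num-steps` corresponds to `num_steps`:

```python
# Toy illustration of the draft-verify-accept loop behind speculative
# decoding. This is NOT SFLLM's Eagle3 implementation; draft_step and
# target_argmax are hypothetical stand-ins for the real models.
def speculative_step(prefix, draft_step, target_argmax, num_steps=4):
    """Propose num_steps tokens with the draft model, then keep the
    longest prefix the target model agrees with (greedy acceptance)."""
    # 1. Draft: cheaply propose a run of candidate tokens.
    proposal = []
    ctx = list(prefix)
    for _ in range(num_steps):
        tok = draft_step(ctx)          # draft model's next-token guess
        proposal.append(tok)
        ctx.append(tok)

    # 2. Verify: the target model checks the proposals (a real engine
    #    scores them all in ONE forward pass; simulated token by token
    #    here for clarity).
    accepted = []
    ctx = list(prefix)
    for tok in proposal:
        if target_argmax(ctx) != tok:  # first disagreement: stop
            break
        accepted.append(tok)
        ctx.append(tok)

    # 3. Accept: always emit at least one target-model token, so each
    #    step makes progress even when every draft token is rejected.
    accepted.append(target_argmax(ctx))
    return accepted
```

SFLLM's full-graph mode captures this entire loop in a single CUDA graph, which is where the CPU-overhead savings listed in the features come from.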
Chat Completions (Streaming)

```bash
curl -X POST "http://localhost:8081/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": true,
    "max_new_tokens": 256,
    "temperature": 0.7
  }'
```
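The same streaming request from Python, as a sketch using the `openai` client under the same assumptions as the earlier example:

```python
# Streaming sketch mirroring the curl example above; assumes the same
# local server and placeholder model name.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8081/v1", api_key="unused")

stream = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    stream=True,                          # ask for server-sent chunks
    temperature=0.7,
    extra_body={"max_new_tokens": 256},   # length parameter per this README
)
for chunk in stream:
    # Guard against chunks that carry no text (e.g. the final chunk).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```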
Text Completions

```bash
curl -X POST "http://localhost:8081/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "prompt": "The future of AI is",
    "max_new_tokens": 128,
    "temperature": 0.8
  }'
```

Health Check

```bash
curl http://localhost:8081/health
```
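In a deployment script you may want to wait for the server to come up before routing traffic. A small polling sketch, assuming only that `/health` returns HTTP 200 once the server is ready:

```python
# Readiness-wait sketch: poll the /health endpoint until the server is
# up before sending traffic. Assumes an HTTP 200 means healthy; the
# exact response body is not specified here.
import time
import requests

def wait_until_healthy(url="http://localhost:8081/health", timeout=120.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass                 # server not accepting connections yet
        time.sleep(1.0)
    return False

if __name__ == "__main__":
    print("healthy" if wait_until_healthy() else "timed out")
```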
Configuration Options

| Option | Description | Default |
|---|---|---|
| `--model` | Path to model directory | Required |
| `--port` | Server port | 8081 |
| `--dtype` | Model precision (float16/float32) | float16 |
| `--max-context-length` | Maximum context length | 4096 |
| `--cuda-graph-max-bs` | Max CUDA graph batch size | 32 |
| `--disable-cuda-graph` | Disable CUDA graphs | False |
| `--speculative-algorithm` | Speculative decoding algorithm (eagle3) | None |
| `--draft-model-path` | Path to Eagle3 draft model | None |
| `--speculative-num-steps` | Number of speculative steps | 4 |
| `--disable-overlap` | Disable overlap scheduling | False |
This project is licensed under the MIT License - see the LICENSE file for details.
We welcome contributions! Please feel free to submit issues and pull requests.
Made with ❤️ by wejoncy