Thanks to visit codestin.com
Credit goes to github.com

Skip to content

cdliang11/rtp-llm

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

English 中文

About

  • rtp-llm is a Large Language Model (LLM) inference acceleration engine developed by Alibaba's Foundation Model Inference Team. It is widely used within Alibaba Group, supporting LLM service across multiple business units including Taobao, Tmall, Idlefish, Cainiao, Amap, Ele.me, AE, and Lazada.
  • The rtp-llm project is a sub-project of the havenask .

Features

Production Proven

Applied in numerous LLM scenarios, such as:

High Performance

  • Utilizes high-performance CUDA kernels, including PagedAttention, FlashAttention, FlashDecoding, etc.
  • Implements WeightOnly INT8 Quantization with automatic quantization at load time
  • Adaptive KVCache Quantization
  • Detailed optimization of dynamic batching overhead at the framework level
  • Specially optimized for the V100 GPU

Flexibility and Ease of Use

  • Seamless integration with the HuggingFace models, supporting multiple weight formats such as SafeTensors, Pytorch, and Megatron
  • Deploys multiple LoRA services with a single model instance
  • Handles multimodal inputs (combining images and text)
  • Enables multi-machine/multi-GPU tensor parallelism
  • Supports P-tuning models

Advanced Acceleration Techniques

  • Loads pruned irregular models
  • Contextual Prefix Cache for multi-turn dialogues
  • System Prompt Cache
  • Speculative Decoding
  • Medusa for advanced parallelization strategies

How to Use

Requirements

  • Operating System: Linux
  • Python: 3.10
  • NVIDIA GPU: Compute Capability 7.0 or higher (e.g., RTX20xx, RTX30xx, RTX40xx, V100, T4, A10/A30/A100, L4, H100, etc.)

Startup example

# Install rtp-llm
cd rtp-llm
# For cuda12 environment, please use requirements_torch_gpu_cuda12.txt
pip3 install -r ./open_source/deps/requirements_torch_gpu.txt 
# Use the corresponding whl from the release version, here's an example for the cuda11 version 0.1.0, for the cuda12 whl package please check the release page.
pip3 install maga_transformer-0.0.1+cuda118-cp310-cp310-manylinux1_x86_64.whl
# start http service
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
# request to server
curl -XPOST http://localhost:8088 -d '{"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}}'

Documentation

Acknowledgments

Our project is mainly based on FasterTransformer, and on this basis, we have integrated some kernel implementations from TensorRT-LLM. FasterTransformer and TensorRT-LLM have provided us with reliable performance guarantees. Flash-Attention2 and cutlass have also provided a lot of help in our continuous performance optimization process. Our continuous batching and increment decoding draw on the implementation of vllm; sampling draws on transformers, with speculative sampling integrating Medusa's implementation, and the multimodal part integrating implementations from llava and qwen-vl. We thank these projects for their inspiration and help.

External Application Scenarios (Continuously Updated)

Supported Model List

LLM

  • Aquila and Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
  • Baichuan and Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B)
  • Bloom (bigscience/bloom, bigscience/bloomz)
  • ChatGlm (THUDM/chatglm2-6b, THUDM/chatglm3-6b)
  • Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
  • GptNeox (EleutherAI/gpt-neox-20b)
  • GPT BigCode (bigcode/starcoder, bigcode/starcoder2)
  • LLaMA and LLaMA-2 (meta-llama/Llama-2-7b, meta-llama/Llama-2-13b-hf, meta-llama/Llama-2-70b-hf, lmsys/vicuna-33b-v1.3, 01-ai/Yi-34B, xverse/XVERSE-13B, etc.)
  • MPT (mosaicml/mpt-30b-chat, etc.)
  • Phi (microsoft/phi-1_5, etc.)
  • Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, Qwen/Qwen-14B, Qwen/Qwen-14B-Chat, etc.)
  • InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
  • Gemma (google/gemma-it, etc)
  • Mixtral (mistralai/Mixtral-8x7B-v0.1, etc)

LLM + Multimodal

  • LLAVA (liuhaotian/llava-v1.5-13b, liuhaotian/llava-v1.5-7b)
  • Qwen-VL (Qwen/Qwen-VL)

Contact Us

DingTalk Group

WeChat Group

About

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C++ 46.5%
  • Cuda 30.4%
  • Python 21.5%
  • Starlark 1.3%
  • Shell 0.2%
  • Makefile 0.1%