System Info
When deploying GLM-4.5-Air-FP8 on 8× RTX 3090 (192 GB total), I got an error that appears to come from the mixed quantization precision:
RuntimeError: size_n = 2736 is not divisible by tile_n_size = 64
I have seen the related issue, but my GPU memory is sufficient and I have to use the FP8 version.
The detailed message is as follows (a sketch of the failing shape check is included after the traceback):
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] WorkerProc failed to start.
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] Traceback (most recent call last):
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 602, in worker_main
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] worker = WorkerProc(*args, **kwargs)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 457, in __init__
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] self.worker.load_model()
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 229, in load_model
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2873, in load_model
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] self.model = model_loader.load_model(
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 56, in load_model
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] process_weights_after_loading(model, model_config, target_device)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 118, in process_weights_after_loading
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] quant_method.process_weights_after_loading(module)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 824, in process_weights_after_loading
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] layer.scheme.process_weights_after_loading(layer)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py", line 63, in process_weights_after_loading
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] prepare_fp8_layer_for_marlin(layer)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 113, in prepare_fp8_layer_for_marlin
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] marlin_qweight = ops.gptq_marlin_repack(
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/_custom_ops.py", line 1144, in gptq_marlin_repack
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] return torch.ops._C.gptq_marlin_repack(b_q_weight, perm, size_k, size_n, num_bits)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] return self._op(*args, **kwargs)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] RuntimeError: size_n = 2736 is not divisible by tile_n_size = 64
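For context, here is a minimal sketch of the shape check that gptq_marlin_repack appears to enforce, based only on the error message above; the tile constant and the per-rank shard arithmetic are assumptions, not read from the kernel source. With --tensor-parallel-size 8, a weight with an output dimension of 21888 would be sharded to 21888 / 8 = 2736 columns per GPU, and 2736 is not a multiple of the Marlin tile width 64 (2736 / 64 = 42.75):

```python
# Hypothetical illustration of the failing check; the constants come from
# the error message and the serve command, not from vLLM's CUDA source.
TILE_N_SIZE = 64          # Marlin output-tile width (per the error message)
TP_SIZE = 8               # --tensor-parallel-size from the serve command
TOTAL_N = 2736 * TP_SIZE  # 21888: assumed full width of the offending weight

shard_n = TOTAL_N // TP_SIZE       # 2736 columns per GPU rank
print(shard_n % TILE_N_SIZE)       # 48, i.e. not divisible by 64
assert shard_n % TILE_N_SIZE != 0  # reproduces the reported failure
```

If this reading is correct, only tensor-parallel sizes that keep the per-rank shard a multiple of 64 would pass this check (e.g. 21888 / 2 = 10944 = 64 × 171, while 21888 / 4 = 5472 and 21888 / 8 = 2736 are not multiples of 64).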
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Reproduction
vLLM version: 0.11.0
vllm serve models/GLM-4.5-Air-FP8 --cuda-graph-sizes 4 --served-model-name GLM-4.5-Air-FP8-cudagraph --tensor-parallel-size 8 --gpu-memory-utilization 0.95 --enable-auto-tool-choice --tool-call-parser glm45
Expected behavior
Deploy GLM-4.5-Air-FP8 as an OpenAI-compatible API.
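Once the server starts, the intended usage would look roughly like the following sketch, which assumes vLLM's default endpoint (http://localhost:8000/v1) and the --served-model-name from the command above:

```python
# Sketch of the intended OpenAI-compatible usage; the host/port are vLLM's
# defaults and are assumptions, not part of the original report.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="GLM-4.5-Air-FP8-cudagraph",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```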