RuntimeError when deploying GLM-4.5-Air-FP8 with vLLM on 8×RTX 3090 #94

@TheKernelZ

Description

System Info / 系統信息

When deploying GLM-4.5-Air-FP8 on 8×RTX 3090 (192 GB total), I get an error that appears to stem from mixed quantization precision:

RuntimeError: size_n = 2736 is not divisible by tile_n_size = 64

I have seen the related issue, but my GPU memory is sufficient and I need to use the FP8 version. The detailed message is as follows:

(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] WorkerProc failed to start.
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] Traceback (most recent call last):
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]   File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 602, in worker_main
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]     worker = WorkerProc(*args, **kwargs)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]   File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 457, in __init__
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]     self.worker.load_model()
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]   File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 229, in load_model
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]   File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2873, in load_model
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]     self.model = model_loader.load_model(
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]   File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 56, in load_model
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]     process_weights_after_loading(model, model_config, target_device)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]   File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 118, in process_weights_after_loading
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]     quant_method.process_weights_after_loading(module)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]   File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 824, in process_weights_after_loading
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]     layer.scheme.process_weights_after_loading(layer)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]   File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py", line 63, in process_weights_after_loading
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]     prepare_fp8_layer_for_marlin(layer)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]   File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 113, in prepare_fp8_layer_for_marlin
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]     marlin_qweight = ops.gptq_marlin_repack(
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]                      ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]   File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/vllm/_custom_ops.py", line 1144, in gptq_marlin_repack
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]     return torch.ops._C.gptq_marlin_repack(b_q_weight, perm, size_k, size_n, num_bits)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]   File "/mnt/weka/home/xxxxx/LLMs/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]     return self._op(*args, **kwargs)
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=312) ERROR 10-17 14:02:34 [multiproc_executor.py:628] RuntimeError: size_n = 2736 is not divisible by tile_n_size = 64
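
For context on where the constraint comes from: the traceback shows the compressed-tensors W8A16-FP8 path (compressed_tensors_w8a16_fp8.py -> prepare_fp8_layer_for_marlin) repacking the FP8 weights for the Marlin kernel, and the repack op requires the per-rank output dimension size_n to be a multiple of tile_n_size = 64. RTX 3090 is SM86 (Ampere) with no native FP8 support, which is presumably why this fallback kernel is selected. A minimal sketch of the arithmetic, assuming 2736 is the tensor-parallel shard of a 2736 * 8 = 21888 full dimension (the log does not identify the exact layer):

# Sketch of the divisibility constraint behind the error.
# ASSUMPTION: 2736 is the per-rank shard of a 21888-wide weight under
# --tensor-parallel-size 8; the actual layer/shape is not in the log.
TILE_N_SIZE = 64  # Marlin repack tiles the N dimension in chunks of 64

total_n = 2736 * 8  # 21888, the implied unsharded output dimension
for tp in (1, 2, 4, 8):
    size_n = total_n // tp
    print(f"tp={tp}: size_n={size_n}, remainder={size_n % TILE_N_SIZE}")
# tp=1 and tp=2 leave remainder 0; tp=4 leaves 32 and tp=8 leaves 48,
# which matches the failure at --tensor-parallel-size 8.

So the crash is not a memory issue; it is a shape constraint of the Marlin repack that this model's weights violate once sharded 8 ways.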

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

vLLM version 0.11.0
vllm serve models/GLM-4.5-Air-FP8 --cuda-graph-sizes 4 --served-model-name GLM-4.5-Air-FP8-cudagraph --tensor-parallel-size 8 --gpu-memory-utilization 0.95 --enable-auto-tool-choice --tool-call-parser glm45
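
If the arithmetic sketched above holds, one possible (untested) workaround is to lower the tensor-parallel degree so each shard's size_n stays a multiple of 64, and recover the 8-GPU capacity with pipeline parallelism:

vllm serve models/GLM-4.5-Air-FP8 --cuda-graph-sizes 4 --served-model-name GLM-4.5-Air-FP8-cudagraph --tensor-parallel-size 2 --pipeline-parallel-size 4 --gpu-memory-utilization 0.95 --enable-auto-tool-choice --tool-call-parser glm45

Both --tensor-parallel-size and --pipeline-parallel-size are standard vllm serve flags; whether every layer of GLM-4.5-Air-FP8 shards cleanly at TP=2 and still fits per GPU would need to be verified.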

Expected behavior / 期待表现

Deploy GLM-4.5-Air-FP8 as an OpenAI-compatible API.
