[Feature]: Improve GGUF loading from HuggingFace user experience like repo_id:quant_type #29137

Merged
Isotr0py merged 11 commits into vllm-project:main from sts07142:feat/load-gguf-model
Nov 25, 2025

Conversation

@sts07142
Contributor

@sts07142 sts07142 commented Nov 21, 2025

Purpose

Fixes #25182
Improve GGUF loading from HuggingFace user experience like repo_id:quant_type

vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S --tokenizer Qwen/Qwen3-0.6B

Test Plan

vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S --tokenizer Qwen/Qwen3-0.6B
pytest tests/models/test_gguf_download.py
pytest tests/transformers_utils/test_utils.py

Test Result

[BEFORE] vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S --tokenizer Qwen/Qwen3-0.6B
 vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S --tokenizer Qwen/Qwen3-0.6B
INFO 11-21 10:31:03 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1210609) INFO 11-21 10:31:03 [api_server.py:1978] vLLM API server version 0.11.2.dev67+g1d642872a.d20251120
(APIServer pid=1210609) INFO 11-21 10:31:03 [utils.py:253] non-default args: {'model_tag': 'unsloth/Qwen3-0.6B-GGUF:IQ1_S', 'model': 'unsloth/Qwen3-0.6B-GGUF:IQ1_S', 'tokenizer': 'Qwen/Qwen3-0.6B'}
(APIServer pid=1210609) Traceback (most recent call last):
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 479, in cached_files
(APIServer pid=1210609)     hf_hub_download(
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
(APIServer pid=1210609)     validate_repo_id(arg_value)
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
(APIServer pid=1210609)     raise HFValidationError(
(APIServer pid=1210609) huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars, '-', '_' or '.'. The name cannot start or end with '-' or '.' and the maximum length is 96: 'unsloth/Qwen3-0.6B-GGUF:IQ1_S'.
(APIServer pid=1210609)
(APIServer pid=1210609) During handling of the above exception, another exception occurred:
(APIServer pid=1210609)
(APIServer pid=1210609) Traceback (most recent call last):
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict
(APIServer pid=1210609)     resolved_config_file = cached_file(
(APIServer pid=1210609)                            ^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 322, in cached_file
(APIServer pid=1210609)     file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
(APIServer pid=1210609)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 532, in cached_files
(APIServer pid=1210609)     _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type)
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return
(APIServer pid=1210609)     resolved_file = try_to_load_from_cache(
(APIServer pid=1210609)                     ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
(APIServer pid=1210609)     validate_repo_id(arg_value)
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
(APIServer pid=1210609)     raise HFValidationError(
(APIServer pid=1210609) huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars, '-', '_' or '.'. The name cannot start or end with '-' or '.' and the maximum length is 96: 'unsloth/Qwen3-0.6B-GGUF:IQ1_S'.
(APIServer pid=1210609)
(APIServer pid=1210609) During handling of the above exception, another exception occurred:
(APIServer pid=1210609)
(APIServer pid=1210609) Traceback (most recent call last):
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=1210609)     sys.exit(main())
(APIServer pid=1210609)              ^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1210609)     args.dispatch_function(args)
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=1210609)     uvloop.run(run_server(args))
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1210609)     return __asyncio.run(
(APIServer pid=1210609)            ^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=1210609)     return runner.run(main)
(APIServer pid=1210609)            ^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1210609)     return self._loop.run_until_complete(task)
(APIServer pid=1210609)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1210609)     return await main
(APIServer pid=1210609)            ^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/entrypoints/openai/api_server.py", line 2028, in run_server
(APIServer pid=1210609)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/entrypoints/openai/api_server.py", line 2047, in run_server_worker
(APIServer pid=1210609)     async with build_async_engine_client(
(APIServer pid=1210609)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1210609)     return await anext(self.gen)
(APIServer pid=1210609)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/entrypoints/openai/api_server.py", line 196, in build_async_engine_client
(APIServer pid=1210609)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1210609)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1210609)     return await anext(self.gen)
(APIServer pid=1210609)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/entrypoints/openai/api_server.py", line 222, in build_async_engine_client_from_engine_args
(APIServer pid=1210609)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=1210609)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/engine/arg_utils.py", line 1358, in create_engine_config
(APIServer pid=1210609)     maybe_override_with_speculators(
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/transformers_utils/config.py", line 573, in maybe_override_with_speculators
(APIServer pid=1210609)     config_dict, _ = PretrainedConfig.get_config_dict(
(APIServer pid=1210609)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict
(APIServer pid=1210609)     config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
(APIServer pid=1210609)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict
(APIServer pid=1210609)     raise OSError(
(APIServer pid=1210609) OSError: Can't load the configuration of 'unsloth/Qwen3-0.6B-GGUF:IQ1_S'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'unsloth/Qwen3-0.6B-GGUF:IQ1_S' is the correct path to a directory containing a config.json file
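The root cause above is huggingface_hub's repo-id validation, which forbids ':' in repo names, so the full tag never resolves. A simplified stand-in for the rule quoted in the HFValidationError (the real check is huggingface_hub's `validate_repo_id`; this re-implementation is for illustration only):

```python
def is_valid_repo_id(repo_id: str) -> bool:
    # Simplified version of the rule in the error message above:
    # alphanumerics, '-', '_' or '.' only; no leading/trailing '-' or '.';
    # maximum length 96.
    if not repo_id or len(repo_id) > 96:
        return False
    for segment in repo_id.split("/"):
        if not segment or segment[0] in "-." or segment[-1] in "-.":
            return False
        if not all(c.isalnum() or c in "-_." for c in segment):
            return False
    return True

print(is_valid_repo_id("unsloth/Qwen3-0.6B-GGUF:IQ1_S"))  # False: ':' is not allowed
```

This is why the PR strips the `:quant_type` suffix before anything is handed to huggingface_hub.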
[AFTER] vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S --tokenizer Qwen/Qwen3-0.6B
vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S --tokenizer Qwen/Qwen3-0.6B
INFO 11-21 10:25:52 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1206292) INFO 11-21 10:25:52 [api_server.py:1978] vLLM API server version 0.11.2.dev67+g1d642872a.d20251120
(APIServer pid=1206292) INFO 11-21 10:25:52 [utils.py:253] non-default args: {'model_tag': 'unsloth/Qwen3-0.6B-GGUF:IQ1_S', 'model': 'unsloth/Qwen3-0.6B-GGUF:IQ1_S', 'tokenizer': 'Qwen/Qwen3-0.6B'}
(APIServer pid=1206292) INFO 11-21 10:25:52 [arg_utils.py:1375]
(APIServer pid=1206292) INFO 11-21 10:25:52 [arg_utils.py:1375]
(APIServer pid=1206292) INFO 11-21 10:25:52 [arg_utils.py:1375] After maybe_override_with_speculators: unsloth/Qwen3-0.6B-GGUF
(APIServer pid=1206292) INFO 11-21 10:25:53 [model.py:652] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=1206292) INFO 11-21 10:25:53 [model.py:1783] Using max model len 40960
(APIServer pid=1206292) INFO 11-21 10:25:53 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=1206446) INFO 11-21 10:25:58 [core.py:93] Initializing a V1 LLM engine (v0.11.2.dev67+g1d642872a.d20251120) with config: model='unsloth/Qwen3-0.6B-GGUF', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=unsloth/Qwen3-0.6B-GGUF, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 
'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}
(EngineCore_DP0 pid=1206446) INFO 11-21 10:25:59 [parallel_state.py:1217] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.1.13:46217 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=1206446) INFO 11-21 10:25:59 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1206446) INFO 11-21 10:25:59 [gpu_model_runner.py:3260] Starting to load model unsloth/Qwen3-0.6B-GGUF...
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:09 [cuda.py:415] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:09 [cuda.py:424] Using FLASH_ATTN backend.
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:14 [gpu_model_runner.py:3339] Model loading took 0.2154 GiB memory and 14.283840 seconds
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:17 [backends.py:648] Using cache directory: /home/name/.cache/vllm/torch_compile_cache/124ca0c67d/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:17 [backends.py:708] Dynamo bytecode transform time: 3.01 s
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:20 [backends.py:214] Directly load the compiled graph(s) for dynamic shape from the cache, took 3.049 s
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:21 [monitor.py:34] torch.compile takes 6.06 s in total
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:22 [gpu_worker.py:337] Available KV cache memory: 65.28 GiB
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:22 [kv_cache_utils.py:1234] GPU KV cache size: 611,136 tokens
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:22 [kv_cache_utils.py:1239] Maximum concurrency for 40,960 tokens per request: 14.92x
(EngineCore_DP0 pid=1206446) 2025-11-21 10:26:22,420 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=1206446) 2025-11-21 10:26:22,433 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██| 51/51 [00:01<00:00, 31.87it/s]
Capturing CUDA graphs (decode, FULL): 100%|█████████████████████| 51/51 [00:01<00:00, 42.92it/s]
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:25 [gpu_model_runner.py:4245] Graph capturing finished in 3 secs, took 0.71 GiB
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:25 [core.py:253] init engine (profile, create kv cache, warmup model) took 11.47 seconds
(APIServer pid=1206292) INFO 11-21 10:26:27 [api_server.py:1726] Supported tasks: ['generate']
(APIServer pid=1206292) INFO 11-21 10:26:29 [api_server.py:2056] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:38] Available routes are:
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1206292) INFO:     Started server process [1206292]
(APIServer pid=1206292) INFO:     Waiting for application startup.
(APIServer pid=1206292) INFO:     Application startup complete.
pytest tests/models/test_gguf_download.py
pytest tests/models/test_gguf_download.py
===================================== test session starts ======================================
platform linux -- Python 3.12.3, pytest-9.0.1, pluggy-1.6.0
rootdir: /home/name/.test/vllm
configfile: pyproject.toml
plugins: anyio-4.11.0
collected 12 items

tests/models/test_gguf_download.py ............                                          [100%]

======================================= warnings summary =======================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================ 12 passed, 2 warnings in 3.40s ================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
pytest tests/transformers_utils/test_utils.py
pytest tests/transformers_utils/test_utils.py
===================================== test session starts ======================================
platform linux -- Python 3.12.3, pytest-9.0.1, pluggy-1.6.0
rootdir: /home/name/.test/vllm
configfile: pyproject.toml
plugins: anyio-4.11.0
collected 12 items

tests/transformers_utils/test_utils.py ............                                      [100%]

======================================= warnings summary =======================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================ 12 passed, 2 warnings in 1.78s ================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a user-friendly way to load GGUF models from Hugging Face using the repo_id:quant_type format. The changes are well-structured, touching configuration, argument parsing, and the GGUF model loader. The addition of unit tests is also a great practice. I've found one high-severity issue in the file searching logic that could prevent some models from loading correctly. My feedback includes a specific code suggestion to address this.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@sts07142
Contributor Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Member

@Isotr0py Isotr0py left a comment


Thanks for adding this feature! I just leave some initial comments. PTAL! :)

@Isotr0py Isotr0py self-assigned this Nov 21, 2025
@mergify

mergify bot commented Nov 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sts07142.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 22, 2025
Signed-off-by: Injae Ryou <[email protected]>
…ectly

- Changed `_prepare_weights` to take `ModelConfig` instead of a model path.
- Updated `download_gguf` to include an optional `revision` parameter.
- Adjusted `download_model` and `load_weights` methods to work with the new `_prepare_weights` signature.

Signed-off-by: Injae Ryou <[email protected]>
- remove 'gguf_quant_type' in ModelConfig
- move 'download_gguf' to weight_utils.py
- strictly check 'quant_type' in 'is_remote_gguf'
- leave self.model as repo_id:quant_type
- raise error in 'split_remote_gguf'
  - split invalid remote_gguf_model
  - invalid gguf_quant_type (different from GGMLQuantizationType)

Signed-off-by: Injae Ryou <[email protected]>
Signed-off-by: Injae Ryou <[email protected]>
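The strict quant-type check mentioned above validates the parsed tag against gguf's GGMLQuantizationType. A self-contained sketch of that check, using a tiny stand-in enum (the members and values here are placeholders, not the real gguf enum):

```python
from enum import Enum, auto

# Stand-in for gguf's GGMLQuantizationType: only a few members, with
# placeholder values, so the example runs without the gguf package.
class GGMLQuantizationType(Enum):
    Q8_0 = auto()
    Q4_K = auto()
    IQ1_S = auto()

def validate_quant_type(quant: str) -> str:
    """Normalize a quant-type tag and reject anything not in the enum."""
    names = {member.name for member in GGMLQuantizationType}
    normalized = quant.upper()
    if normalized not in names:
        raise ValueError(f"Invalid GGUF quant type: {quant!r}")
    return normalized

print(validate_quant_type("iq1_s"))  # → IQ1_S
```

Raising early here gives the user a clear error instead of a failed download later.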
@sts07142 sts07142 force-pushed the feat/load-gguf-model branch from 7435750 to eca3898 on November 22, 2025 12:48
@mergify mergify bot removed the needs-rebase label Nov 22, 2025
@sts07142
Contributor Author

@Isotr0py
Thank you for your review.
Please review the changes.

@sts07142 sts07142 requested a review from Isotr0py November 24, 2025 02:19
@Isotr0py
Member

Sorry for the delay! I'm relatively busy recently, will try to take a look tomorrow ASAP! :)

Member

@Isotr0py Isotr0py left a comment


Overall LGTM! Just leave a nit.

@Isotr0py Isotr0py enabled auto-merge (squash) November 25, 2025 12:16
@github-actions github-actions bot added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Nov 25, 2025
@Isotr0py Isotr0py merged commit 794029f into vllm-project:main Nov 25, 2025
53 checks passed
@ivanbaldo

ivanbaldo commented Nov 25, 2025

Nice!!! Thanks for the PR!!!
Is the --tokenizer Qwen/Qwen3-0.6B argument required or optional?

@sts07142
Contributor Author

sts07142 commented Nov 25, 2025

Nice!!! Thanks for the PR!!!

Is the --tokenizer Qwen/Qwen3-0.6B argument required or optional?

The tokenizer argument is required for GGUF.

  • We can use the original model repo as the tokenizer.

Almost all GGUF repos don't ship tokenizer files such as tokenizer.json (plus tokenizer_config.json …).

  • The GGUF format itself also embeds tokenizer information.

As far as I know, the current vLLM still needs separate tokenizer files, so the tokenizer argument must be passed when using GGUF.

@ivanbaldo

Strange, according to this diagram https://huggingface.co/docs/transformers/gguf the files can have the tokenizer embedded inside.
I wonder whether vLLM already supports it when it's there, so that the --tokenizer option could be optional.

@sts07142
Contributor Author

sts07142 commented Nov 27, 2025

Strange, according to this diagram https://huggingface.co/docs/transformers/gguf the files can have the tokenizer embedded inside. I wonder whether vLLM already supports it when it's there, so that the --tokenizer option could be optional.

That's right, GGUF files also include tokenizer information.

As far as I know, the current vLLM loads the tokenizer from the files hosted on HF, so the tokenizer argument is currently required when using a GGUF model.
But I think it could become optional in a future update.
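If vLLM ever falls back to the embedded metadata, the resolution logic could look roughly like this (an entirely hypothetical function and policy; only the `tokenizer.ggml.tokens` metadata key comes from the GGUF format):

```python
def resolve_tokenizer(tokenizer_arg, gguf_metadata):
    # Prefer an explicitly passed tokenizer repo; otherwise fall back to
    # tokenizer info embedded in the GGUF metadata, if present.
    if tokenizer_arg is not None:
        return ("hf", tokenizer_arg)
    if "tokenizer.ggml.tokens" in gguf_metadata:
        return ("gguf", gguf_metadata["tokenizer.ggml.tokens"])
    raise ValueError("No tokenizer available: pass --tokenizer explicitly")

print(resolve_tokenizer("Qwen/Qwen3-0.6B", {}))  # → ('hf', 'Qwen/Qwen3-0.6B')
```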

devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
… repo_id:quant_type (vllm-project#29137)

Signed-off-by: Injae Ryou <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
… repo_id:quant_type (vllm-project#29137)

Signed-off-by: Injae Ryou <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
… repo_id:quant_type (vllm-project#29137)

Signed-off-by: Injae Ryou <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: dsuhinin <[email protected]>

Labels

ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: improve GGUF loading from HuggingFace user experience

4 participants