[Feature]: Improve GGUF loading from HuggingFace user experience like repo_id:quant_type #29137

Merged
Isotr0py merged 11 commits into vllm-project:main from sts07142:feat/load-gguf-model
Nov 25, 2025

Conversation

@sts07142
Contributor

@sts07142 sts07142 commented Nov 21, 2025

Purpose

Fixes #25182
Improve GGUF loading from HuggingFace user experience like repo_id:quant_type

vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S --tokenizer Qwen/Qwen3-0.6B

Test Plan

vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S --tokenizer Qwen/Qwen3-0.6B
pytest tests/models/test_gguf_download.py
pytest tests/transformers_utils/test_utils.py

Test Result

[BEFORE] vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S --tokenizer Qwen/Qwen3-0.6B
 vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S --tokenizer Qwen/Qwen3-0.6B
INFO 11-21 10:31:03 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1210609) INFO 11-21 10:31:03 [api_server.py:1978] vLLM API server version 0.11.2.dev67+g1d642872a.d20251120
(APIServer pid=1210609) INFO 11-21 10:31:03 [utils.py:253] non-default args: {'model_tag': 'unsloth/Qwen3-0.6B-GGUF:IQ1_S', 'model': 'unsloth/Qwen3-0.6B-GGUF:IQ1_S', 'tokenizer': 'Qwen/Qwen3-0.6B'}
(APIServer pid=1210609) Traceback (most recent call last):
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 479, in cached_files
(APIServer pid=1210609)     hf_hub_download(
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
(APIServer pid=1210609)     validate_repo_id(arg_value)
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
(APIServer pid=1210609)     raise HFValidationError(
(APIServer pid=1210609) huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars, '-', '_' or '.'. The name cannot start or end with '-' or '.' and the maximum length is 96: 'unsloth/Qwen3-0.6B-GGUF:IQ1_S'.
(APIServer pid=1210609)
(APIServer pid=1210609) During handling of the above exception, another exception occurred:
(APIServer pid=1210609)
(APIServer pid=1210609) Traceback (most recent call last):
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict
(APIServer pid=1210609)     resolved_config_file = cached_file(
(APIServer pid=1210609)                            ^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 322, in cached_file
(APIServer pid=1210609)     file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
(APIServer pid=1210609)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 532, in cached_files
(APIServer pid=1210609)     _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type)
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return
(APIServer pid=1210609)     resolved_file = try_to_load_from_cache(
(APIServer pid=1210609)                     ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
(APIServer pid=1210609)     validate_repo_id(arg_value)
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
(APIServer pid=1210609)     raise HFValidationError(
(APIServer pid=1210609) huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars, '-', '_' or '.'. The name cannot start or end with '-' or '.' and the maximum length is 96: 'unsloth/Qwen3-0.6B-GGUF:IQ1_S'.
(APIServer pid=1210609)
(APIServer pid=1210609) During handling of the above exception, another exception occurred:
(APIServer pid=1210609)
(APIServer pid=1210609) Traceback (most recent call last):
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=1210609)     sys.exit(main())
(APIServer pid=1210609)              ^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1210609)     args.dispatch_function(args)
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=1210609)     uvloop.run(run_server(args))
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1210609)     return __asyncio.run(
(APIServer pid=1210609)            ^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=1210609)     return runner.run(main)
(APIServer pid=1210609)            ^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1210609)     return self._loop.run_until_complete(task)
(APIServer pid=1210609)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1210609)     return await main
(APIServer pid=1210609)            ^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/entrypoints/openai/api_server.py", line 2028, in run_server
(APIServer pid=1210609)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/entrypoints/openai/api_server.py", line 2047, in run_server_worker
(APIServer pid=1210609)     async with build_async_engine_client(
(APIServer pid=1210609)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1210609)     return await anext(self.gen)
(APIServer pid=1210609)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/entrypoints/openai/api_server.py", line 196, in build_async_engine_client
(APIServer pid=1210609)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1210609)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1210609)     return await anext(self.gen)
(APIServer pid=1210609)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/entrypoints/openai/api_server.py", line 222, in build_async_engine_client_from_engine_args
(APIServer pid=1210609)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=1210609)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/engine/arg_utils.py", line 1358, in create_engine_config
(APIServer pid=1210609)     maybe_override_with_speculators(
(APIServer pid=1210609)   File "/home/name/.test/vllm/vllm/transformers_utils/config.py", line 573, in maybe_override_with_speculators
(APIServer pid=1210609)     config_dict, _ = PretrainedConfig.get_config_dict(
(APIServer pid=1210609)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict
(APIServer pid=1210609)     config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
(APIServer pid=1210609)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1210609)   File "/home/name/.test/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict
(APIServer pid=1210609)     raise OSError(
(APIServer pid=1210609) OSError: Can't load the configuration of 'unsloth/Qwen3-0.6B-GGUF:IQ1_S'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'unsloth/Qwen3-0.6B-GGUF:IQ1_S' is the correct path to a directory containing a config.json file
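The root cause above is huggingface_hub's repo-id validation, which forbids ':' in repo names, so the full tag never resolves. A simplified stand-in for the rule quoted in the HFValidationError (the real check is huggingface_hub's `validate_repo_id`; this re-implementation is for illustration only):

```python
def is_valid_repo_id(repo_id: str) -> bool:
    # Simplified version of the rule in the error message above:
    # alphanumerics, '-', '_' or '.' only; no leading/trailing '-' or '.';
    # maximum length 96.
    if not repo_id or len(repo_id) > 96:
        return False
    for segment in repo_id.split("/"):
        if not segment or segment[0] in "-." or segment[-1] in "-.":
            return False
        if not all(c.isalnum() or c in "-_." for c in segment):
            return False
    return True

print(is_valid_repo_id("unsloth/Qwen3-0.6B-GGUF:IQ1_S"))  # False: ':' is not allowed
```

This is why the PR strips the `:quant_type` suffix before anything is handed to huggingface_hub.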
[AFTER] vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S --tokenizer Qwen/Qwen3-0.6B
vllm serve unsloth/Qwen3-0.6B-GGUF:IQ1_S --tokenizer Qwen/Qwen3-0.6B
INFO 11-21 10:25:52 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1206292) INFO 11-21 10:25:52 [api_server.py:1978] vLLM API server version 0.11.2.dev67+g1d642872a.d20251120
(APIServer pid=1206292) INFO 11-21 10:25:52 [utils.py:253] non-default args: {'model_tag': 'unsloth/Qwen3-0.6B-GGUF:IQ1_S', 'model': 'unsloth/Qwen3-0.6B-GGUF:IQ1_S', 'tokenizer': 'Qwen/Qwen3-0.6B'}
(APIServer pid=1206292) INFO 11-21 10:25:52 [arg_utils.py:1375]
(APIServer pid=1206292) INFO 11-21 10:25:52 [arg_utils.py:1375]
(APIServer pid=1206292) INFO 11-21 10:25:52 [arg_utils.py:1375] After maybe_override_with_speculators: unsloth/Qwen3-0.6B-GGUF
(APIServer pid=1206292) INFO 11-21 10:25:53 [model.py:652] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=1206292) INFO 11-21 10:25:53 [model.py:1783] Using max model len 40960
(APIServer pid=1206292) INFO 11-21 10:25:53 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=1206446) INFO 11-21 10:25:58 [core.py:93] Initializing a V1 LLM engine (v0.11.2.dev67+g1d642872a.d20251120) with config: model='unsloth/Qwen3-0.6B-GGUF', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=unsloth/Qwen3-0.6B-GGUF, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 
'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}
(EngineCore_DP0 pid=1206446) INFO 11-21 10:25:59 [parallel_state.py:1217] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.1.13:46217 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=1206446) INFO 11-21 10:25:59 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1206446) INFO 11-21 10:25:59 [gpu_model_runner.py:3260] Starting to load model unsloth/Qwen3-0.6B-GGUF...
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:09 [cuda.py:415] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:09 [cuda.py:424] Using FLASH_ATTN backend.
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:14 [gpu_model_runner.py:3339] Model loading took 0.2154 GiB memory and 14.283840 seconds
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:17 [backends.py:648] Using cache directory: /home/name/.cache/vllm/torch_compile_cache/124ca0c67d/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:17 [backends.py:708] Dynamo bytecode transform time: 3.01 s
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:20 [backends.py:214] Directly load the compiled graph(s) for dynamic shape from the cache, took 3.049 s
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:21 [monitor.py:34] torch.compile takes 6.06 s in total
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:22 [gpu_worker.py:337] Available KV cache memory: 65.28 GiB
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:22 [kv_cache_utils.py:1234] GPU KV cache size: 611,136 tokens
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:22 [kv_cache_utils.py:1239] Maximum concurrency for 40,960 tokens per request: 14.92x
(EngineCore_DP0 pid=1206446) 2025-11-21 10:26:22,420 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=1206446) 2025-11-21 10:26:22,433 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██| 51/51 [00:01<00:00, 31.87it/s]
Capturing CUDA graphs (decode, FULL): 100%|█████████████████████| 51/51 [00:01<00:00, 42.92it/s]
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:25 [gpu_model_runner.py:4245] Graph capturing finished in 3 secs, took 0.71 GiB
(EngineCore_DP0 pid=1206446) INFO 11-21 10:26:25 [core.py:253] init engine (profile, create kv cache, warmup model) took 11.47 seconds
(APIServer pid=1206292) INFO 11-21 10:26:27 [api_server.py:1726] Supported tasks: ['generate']
(APIServer pid=1206292) INFO 11-21 10:26:29 [api_server.py:2056] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:38] Available routes are:
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1206292) INFO 11-21 10:26:29 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1206292) INFO:     Started server process [1206292]
(APIServer pid=1206292) INFO:     Waiting for application startup.
(APIServer pid=1206292) INFO:     Application startup complete.
pytest tests/models/test_gguf_download.py
pytest tests/models/test_gguf_download.py
===================================== test session starts ======================================
platform linux -- Python 3.12.3, pytest-9.0.1, pluggy-1.6.0
rootdir: /home/name/.test/vllm
configfile: pyproject.toml
plugins: anyio-4.11.0
collected 12 items

tests/models/test_gguf_download.py ............                                          [100%]

======================================= warnings summary =======================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================ 12 passed, 2 warnings in 3.40s ================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
pytest tests/transformers_utils/test_utils.py
pytest tests/transformers_utils/test_utils.py
===================================== test session starts ======================================
platform linux -- Python 3.12.3, pytest-9.0.1, pluggy-1.6.0
rootdir: /home/name/.test/vllm
configfile: pyproject.toml
plugins: anyio-4.11.0
collected 12 items

tests/transformers_utils/test_utils.py ............                                      [100%]

======================================= warnings summary =======================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================ 12 passed, 2 warnings in 1.78s ================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a user-friendly way to load GGUF models from Hugging Face using the repo_id:quant_type format. The changes are well-structured, touching configuration, argument parsing, and the GGUF model loader. The addition of unit tests is also a great practice. I've found one high-severity issue in the file searching logic that could prevent some models from loading correctly. My feedback includes a specific code suggestion to address this.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@sts07142
Contributor Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Member

@Isotr0py Isotr0py left a comment


Thanks for adding this feature! I just leave some initial comments. PTAL! :)

@Isotr0py Isotr0py self-assigned this Nov 21, 2025
@mergify

mergify bot commented Nov 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sts07142.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 22, 2025
Signed-off-by: Injae Ryou <[email protected]>
…ectly

- Changed `_prepare_weights` to take `ModelConfig` instead of a model path.
- Updated `download_gguf` to include an optional `revision` parameter.
- Adjusted `download_model` and `load_weights` methods to work with the new `_prepare_weights` signature.

Signed-off-by: Injae Ryou <[email protected]>
- remove 'gguf_quant_type' in ModelConfig
- move 'download_gguf' to weight_utils.py
- strictly check 'quant_type' in 'is_remote_gguf'
- leave self.model as repo_id:quant_type
- raise error in 'split_remote_gguf'
  - split invalid remote_gguf_model
  - invalid gguf_quant_type (different from GGMLQuantizationType)

Signed-off-by: Injae Ryou <[email protected]>
Signed-off-by: Injae Ryou <[email protected]>
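The strict quant-type check mentioned above validates the parsed tag against gguf's GGMLQuantizationType. A self-contained sketch of that check, using a tiny stand-in enum (the members and values here are placeholders, not the real gguf enum):

```python
from enum import Enum, auto

# Stand-in for gguf's GGMLQuantizationType: only a few members, with
# placeholder values, so the example runs without the gguf package.
class GGMLQuantizationType(Enum):
    Q8_0 = auto()
    Q4_K = auto()
    IQ1_S = auto()

def validate_quant_type(quant: str) -> str:
    """Normalize a quant-type tag and reject anything not in the enum."""
    names = {member.name for member in GGMLQuantizationType}
    normalized = quant.upper()
    if normalized not in names:
        raise ValueError(f"Invalid GGUF quant type: {quant!r}")
    return normalized

print(validate_quant_type("iq1_s"))  # → IQ1_S
```

Raising early here gives the user a clear error instead of a failed download later.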
@sts07142 sts07142 force-pushed the feat/load-gguf-model branch from 7435750 to eca3898 on November 22, 2025 12:48
@mergify mergify bot removed the needs-rebase label Nov 22, 2025
@sts07142
Contributor Author

@Isotr0py
Thank you for your review.
Please review the changes.

@sts07142 sts07142 requested a review from Isotr0py November 24, 2025 02:19
@Isotr0py
Member

Sorry for the delay! I'm relatively busy recently, will try to take a look tomorrow ASAP! :)

Member

@Isotr0py Isotr0py left a comment


Overall LGTM! Just leave a nit.

@Isotr0py Isotr0py enabled auto-merge (squash) November 25, 2025 12:16
@github-actions github-actions bot added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Nov 25, 2025
@Isotr0py Isotr0py merged commit 794029f into vllm-project:main Nov 25, 2025
53 checks passed
@ivanbaldo

ivanbaldo commented Nov 25, 2025

Nice!!! Thanks for the PR!!!
Is the --tokenizer Qwen/Qwen3-0.6B argument required or optional?

@sts07142
Contributor Author

sts07142 commented Nov 25, 2025

Nice!!! Thanks for the PR!!!

Is the --tokenizer Qwen/Qwen3-0.6B argument required or optional?

The tokenizer argument is required for GGUF.

  • We can use the original model repo as the tokenizer.

Almost all GGUF repos don't ship tokenizer files such as tokenizer.json (plus tokenizer_config.json …).

  • The GGUF format itself also embeds tokenizer information.

As far as I know, the current vLLM still needs separate tokenizer files, so the tokenizer argument must be passed when using GGUF.

@ivanbaldo

Strange, according to this diagram https://huggingface.co/docs/transformers/gguf the files can have the tokenizer embedded inside.
I wonder whether vLLM already supports it when it's there, so that the --tokenizer option could be optional.

@sts07142
Contributor Author

sts07142 commented Nov 27, 2025

Strange, according to this diagram https://huggingface.co/docs/transformers/gguf the files can have the tokenizer embedded inside. I wonder whether vLLM already supports it when it's there, so that the --tokenizer option could be optional.

That's right, GGUF files also include tokenizer information.

As far as I know, the current vLLM loads the tokenizer from the files hosted on HF, so the tokenizer argument is currently required when using a GGUF model.
But I think it could become optional in a future update.
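If vLLM ever falls back to the embedded metadata, the resolution logic could look roughly like this (an entirely hypothetical function and policy; only the `tokenizer.ggml.tokens` metadata key comes from the GGUF format):

```python
def resolve_tokenizer(tokenizer_arg, gguf_metadata):
    # Prefer an explicitly passed tokenizer repo; otherwise fall back to
    # tokenizer info embedded in the GGUF metadata, if present.
    if tokenizer_arg is not None:
        return ("hf", tokenizer_arg)
    if "tokenizer.ggml.tokens" in gguf_metadata:
        return ("gguf", gguf_metadata["tokenizer.ggml.tokens"])
    raise ValueError("No tokenizer available: pass --tokenizer explicitly")

print(resolve_tokenizer("Qwen/Qwen3-0.6B", {}))  # → ('hf', 'Qwen/Qwen3-0.6B')
```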

devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
… repo_id:quant_type (vllm-project#29137)

Signed-off-by: Injae Ryou <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
… repo_id:quant_type (vllm-project#29137)

Signed-off-by: Injae Ryou <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
… repo_id:quant_type (vllm-project#29137)

Signed-off-by: Injae Ryou <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: dsuhinin <[email protected]>

Labels

ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: improve GGUF loading from HuggingFace user experience

4 participants