
Large E2E job fails when our vllm backend waits for GPU VRAM reclamation (HuggingFace download error) #3215

@courtneypacheco

Description


Describe the bug
Whenever our internal vLLM module waits for GPU VRAM reclamation, the large E2E job fails to download any models from HuggingFace:

2025-03-03T04:52:43.1424493Z INFO:instructlab.sdg.utils.chunkers:Docling models not found on disk, downloading models...
2025-03-03T04:52:43.1425463Z INFO:instructlab.model.backends.vllm:Waiting for GPU VRAM reclamation...
2025-03-03T04:52:43.1426018Z Traceback (most recent call last):
2025-03-03T04:52:43.1426899Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/connectionpool.py", line 534, in _make_request
2025-03-03T04:52:43.1427804Z response = conn.getresponse()
2025-03-03T04:52:43.1428114Z ^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1428904Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/connection.py", line 516, in getresponse
2025-03-03T04:52:43.1429849Z httplib_response = super().getresponse()
2025-03-03T04:52:43.1430310Z ^^^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1430722Z File "/usr/lib64/python3.11/http/client.py", line 1395, in getresponse
2025-03-03T04:52:43.1431207Z response.begin()
2025-03-03T04:52:43.1431584Z File "/usr/lib64/python3.11/http/client.py", line 325, in begin
2025-03-03T04:52:43.1432081Z version, status, reason = self._read_status()
2025-03-03T04:52:43.1432445Z ^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1432840Z File "/usr/lib64/python3.11/http/client.py", line 286, in _read_status
2025-03-03T04:52:43.1433393Z line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
2025-03-03T04:52:43.1433800Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1434195Z File "/usr/lib64/python3.11/socket.py", line 718, in readinto
2025-03-03T04:52:43.1434648Z return self._sock.recv_into(b)
2025-03-03T04:52:43.1434952Z ^^^^^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1435466Z File "/usr/lib64/python3.11/ssl.py", line 1314, in recv_into
2025-03-03T04:52:43.1435918Z return self.read(nbytes, buffer)
2025-03-03T04:52:43.1436228Z ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1436572Z File "/usr/lib64/python3.11/ssl.py", line 1166, in read
2025-03-03T04:52:43.1436991Z return self._sslobj.read(len, buffer)
2025-03-03T04:52:43.1437325Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1437644Z TimeoutError: The read operation timed out
2025-03-03T04:52:43.1438149Z The above exception was the direct cause of the following exception:
2025-03-03T04:52:43.1438641Z Traceback (most recent call last):
2025-03-03T04:52:43.1439440Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/requests/adapters.py", line 667, in send
2025-03-03T04:52:43.1440239Z resp = conn.urlopen(
2025-03-03T04:52:43.1440513Z ^^^^^^^^^^^^^
2025-03-03T04:52:43.1441272Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/connectionpool.py", line 841, in urlopen
2025-03-03T04:52:43.1442409Z retries = retries.increment(
2025-03-03T04:52:43.1442718Z ^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1443459Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/util/retry.py", line 474, in increment
2025-03-03T04:52:43.1444375Z raise reraise(type(error), error, _stacktrace)
2025-03-03T04:52:43.1444769Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1445739Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/util/util.py", line 39, in reraise
2025-03-03T04:52:43.1446533Z raise value
2025-03-03T04:52:43.1447272Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/connectionpool.py", line 787, in urlopen
2025-03-03T04:52:43.1448107Z response = self._make_request(
2025-03-03T04:52:43.1448405Z ^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1449192Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/connectionpool.py", line 536, in _make_request
2025-03-03T04:52:43.1450355Z self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
2025-03-03T04:52:43.1451346Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/connectionpool.py", line 367, in _raise_timeout
2025-03-03T04:52:43.1452233Z raise ReadTimeoutError(
2025-03-03T04:52:43.1452987Z urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)
2025-03-03T04:52:43.1453913Z During handling of the above exception, another exception occurred:
2025-03-03T04:52:43.1454397Z Traceback (most recent call last):

(Job logs: direct link)
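For what it's worth, the `read timeout=10` in the ReadTimeoutError above matches huggingface_hub's default download timeout of 10 seconds. A minimal diagnostic sketch, assuming the failing request actually goes through huggingface_hub (the repo id below is only an example), would be to raise `HF_HUB_DOWNLOAD_TIMEOUT` / `HF_HUB_ETAG_TIMEOUT` before the download and see whether the failure mode changes:

```python
import os

# These must be set before huggingface_hub is imported, since the defaults
# are read at import time. Both default to 10 seconds.
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "60"
os.environ["HF_HUB_ETAG_TIMEOUT"] = "60"

from huggingface_hub import snapshot_download

# Example repo id only; the Docling/easyocr downloads in the job may go
# through a different code path entirely.
snapshot_download("ds4sd/docling-models")
```

If the job still times out with a 60-second read timeout, that would point at something stalling the connection on our side (e.g., the runner being busy while vLLM waits on the GPU) rather than a slow response from HuggingFace.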

However, whenever vLLM doesn't need to wait for GPU reclamation, we don't have any download issues:

2025-03-01T11:46:27.8554008Z INFO:instructlab.sdg.utils.chunkers:Docling models not found on disk, downloading models...
2025-03-01T11:46:27.8554521Z WARNING:easyocr.easyocr:Using CPU. Note: This module is much faster with a GPU.
2025-03-01T11:46:27.8555127Z WARNING:easyocr.easyocr:Downloading detection model, please wait. This may take several minutes depending upon your network connection.
2025-03-01T11:46:27.8555692Z Progress: |--------------------------------------------------| 0.0% Complete
2025-03-01T11:46:27.8556051Z Progress: |--------------------------------------------------| 0.0% Complete
2025-03-01T11:46:27.8556397Z Progress: |--------------------------------------------------| 0.0% Complete
2025-03-01T11:46:27.8556767Z Progress: |--------------------------------------------------| 0.0% Complete
2025-03-01T11:46:27.8557119Z Progress: |--------------------------------------------------| 0.0% Complete
2025-03-01T11:46:27.8557467Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8557813Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8558151Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8558492Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8558992Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8559382Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8559727Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8560066Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8560416Z Progress: |--------------------------------------------------| 0.1% Complete

(Job logs: direct link)
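To make the distinguishing condition concrete: "Waiting for GPU VRAM reclamation..." means the process sits in a blocking loop until the previous vLLM instance has released its VRAM. The snippet below is not the actual instructlab implementation, just a minimal sketch of what that kind of wait typically looks like, to show that the failing jobs spend time in a loop like this right before the HuggingFace downloads start:

```python
import time

import torch

# Illustrative only -- NOT the instructlab code. Poll free VRAM on device 0
# until enough has been released, or give up after a deadline.
def wait_for_vram(min_free_bytes: int, timeout_s: float = 300.0) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        free, _total = torch.cuda.mem_get_info()  # (free_bytes, total_bytes)
        if free >= min_free_bytes:
            return True
        time.sleep(1.0)
    return False
```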

All of the failing jobs from the past two weeks (since around Feb 17, 2025) failed while trying to use the GPU during the download step; all of the passing jobs from the same period downloaded on CPU. (A small helper for checking this against raw job logs is sketched after the example lists below.)

Example additional passing jobs -- all of which downloaded on CPU:

Example additional failing jobs -- all of which try to download on GPU:
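As mentioned above, a rough helper for checking this correlation against raw job logs might look like the following (the directory and file names are hypothetical; it just greps each log for the reclamation marker and the HuggingFace read timeout):

```python
from pathlib import Path

RECLAIM_MARKER = "Waiting for GPU VRAM reclamation"
TIMEOUT_MARKER = "ReadTimeoutError: HTTPSConnectionPool(host='huggingface.co'"

# Assumes the raw job logs have been downloaded into ./job-logs/ as *.txt files.
for log_file in sorted(Path("job-logs").glob("*.txt")):
    text = log_file.read_text(errors="replace")
    print(
        f"{log_file.name}: "
        f"reclamation_wait={RECLAIM_MARKER in text}, "
        f"hf_read_timeout={TIMEOUT_MARKER in text}"
    )
```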

To Reproduce
Steps to reproduce the behavior:

  1. Trigger a large E2E job on any PR, on main, on release-v0.24, or on release-v0.23.

Expected behavior
Our vLLM Python logic (in particular, the wait for GPU VRAM reclamation) should not impact the HuggingFace model downloads.

Screenshots
N/A

Device Info (please complete the following information):

  • Hardware Specs: n/a
  • OS Version: n/a
  • Python Version: Python 3.11 (what our E2E job uses), but Python 3.10 is likely affected as well
  • InstructLab Version: main, release-v0.24, and release-v0.23 branches

Additional context
