
Large E2E job fails when our vllm backend waits for GPU VRAM reclamation (HuggingFace download error) #3215

@courtneypacheco

Description


Describe the bug
Whenever our internal vLLM module waits for GPU VRAM reclamation, the large E2E job fails to download any models from HuggingFace:

2025-03-03T04:52:43.1424493Z INFO:instructlab.sdg.utils.chunkers:Docling models not found on disk, downloading models...
2025-03-03T04:52:43.1425463Z INFO:instructlab.model.backends.vllm:Waiting for GPU VRAM reclamation...
2025-03-03T04:52:43.1426018Z Traceback (most recent call last):
2025-03-03T04:52:43.1426899Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/connectionpool.py", line 534, in _make_request
2025-03-03T04:52:43.1427804Z response = conn.getresponse()
2025-03-03T04:52:43.1428114Z ^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1428904Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/connection.py", line 516, in getresponse
2025-03-03T04:52:43.1429849Z httplib_response = super().getresponse()
2025-03-03T04:52:43.1430310Z ^^^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1430722Z File "/usr/lib64/python3.11/http/client.py", line 1395, in getresponse
2025-03-03T04:52:43.1431207Z response.begin()
2025-03-03T04:52:43.1431584Z File "/usr/lib64/python3.11/http/client.py", line 325, in begin
2025-03-03T04:52:43.1432081Z version, status, reason = self._read_status()
2025-03-03T04:52:43.1432445Z ^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1432840Z File "/usr/lib64/python3.11/http/client.py", line 286, in _read_status
2025-03-03T04:52:43.1433393Z line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
2025-03-03T04:52:43.1433800Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1434195Z File "/usr/lib64/python3.11/socket.py", line 718, in readinto
2025-03-03T04:52:43.1434648Z return self._sock.recv_into(b)
2025-03-03T04:52:43.1434952Z ^^^^^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1435466Z File "/usr/lib64/python3.11/ssl.py", line 1314, in recv_into
2025-03-03T04:52:43.1435918Z return self.read(nbytes, buffer)
2025-03-03T04:52:43.1436228Z ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1436572Z File "/usr/lib64/python3.11/ssl.py", line 1166, in read
2025-03-03T04:52:43.1436991Z return self._sslobj.read(len, buffer)
2025-03-03T04:52:43.1437325Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1437644Z TimeoutError: The read operation timed out
2025-03-03T04:52:43.1438149Z The above exception was the direct cause of the following exception:
2025-03-03T04:52:43.1438641Z Traceback (most recent call last):
2025-03-03T04:52:43.1439440Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/requests/adapters.py", line 667, in send
2025-03-03T04:52:43.1440239Z resp = conn.urlopen(
2025-03-03T04:52:43.1440513Z ^^^^^^^^^^^^^
2025-03-03T04:52:43.1441272Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/connectionpool.py", line 841, in urlopen
2025-03-03T04:52:43.1442409Z retries = retries.increment(
2025-03-03T04:52:43.1442718Z ^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1443459Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/util/retry.py", line 474, in increment
2025-03-03T04:52:43.1444375Z raise reraise(type(error), error, _stacktrace)
2025-03-03T04:52:43.1444769Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1445739Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/util/util.py", line 39, in reraise
2025-03-03T04:52:43.1446533Z raise value
2025-03-03T04:52:43.1447272Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/connectionpool.py", line 787, in urlopen
2025-03-03T04:52:43.1448107Z response = self._make_request(
2025-03-03T04:52:43.1448405Z ^^^^^^^^^^^^^^^^^^^
2025-03-03T04:52:43.1449192Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/connectionpool.py", line 536, in _make_request
2025-03-03T04:52:43.1450355Z self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
2025-03-03T04:52:43.1451346Z File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/urllib3/connectionpool.py", line 367, in _raise_timeout
2025-03-03T04:52:43.1452233Z raise ReadTimeoutError(
2025-03-03T04:52:43.1452987Z urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)
2025-03-03T04:52:43.1453913Z During handling of the above exception, another exception occurred:
2025-03-03T04:52:43.1454397Z Traceback (most recent call last):

(Job logs: direct link)
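For what it's worth, the `read timeout=10` in the ReadTimeoutError above matches huggingface_hub's default download timeout of 10 seconds. A minimal diagnostic sketch, assuming the failing request actually goes through huggingface_hub (the repo id below is only an example), would be to raise `HF_HUB_DOWNLOAD_TIMEOUT` / `HF_HUB_ETAG_TIMEOUT` before the download and see whether the failure mode changes:

```python
import os

# These must be set before huggingface_hub is imported, since the defaults
# are read at import time. Both default to 10 seconds.
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "60"
os.environ["HF_HUB_ETAG_TIMEOUT"] = "60"

from huggingface_hub import snapshot_download

# Example repo id only; the Docling/easyocr downloads in the job may go
# through a different code path entirely.
snapshot_download("ds4sd/docling-models")
```

If the job still times out with a 60-second read timeout, that would point at something stalling the connection on our side (e.g., the runner being busy while vLLM waits on the GPU) rather than a slow response from HuggingFace.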

However, whenever vLLM doesn't need to wait for GPU reclamation, we don't have any download issues:

2025-03-01T11:46:27.8554008Z INFO:instructlab.sdg.utils.chunkers:Docling models not found on disk, downloading models...
2025-03-01T11:46:27.8554521Z WARNING:easyocr.easyocr:Using CPU. Note: This module is much faster with a GPU.
2025-03-01T11:46:27.8555127Z WARNING:easyocr.easyocr:Downloading detection model, please wait. This may take several minutes depending upon your network connection.
2025-03-01T11:46:27.8555692Z Progress: |--------------------------------------------------| 0.0% Complete
2025-03-01T11:46:27.8556051Z Progress: |--------------------------------------------------| 0.0% Complete
2025-03-01T11:46:27.8556397Z Progress: |--------------------------------------------------| 0.0% Complete
2025-03-01T11:46:27.8556767Z Progress: |--------------------------------------------------| 0.0% Complete
2025-03-01T11:46:27.8557119Z Progress: |--------------------------------------------------| 0.0% Complete
2025-03-01T11:46:27.8557467Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8557813Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8558151Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8558492Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8558992Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8559382Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8559727Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8560066Z Progress: |--------------------------------------------------| 0.1% Complete
2025-03-01T11:46:27.8560416Z Progress: |--------------------------------------------------| 0.1% Complete

(Job logs: direct link)
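To make the distinguishing condition concrete: "Waiting for GPU VRAM reclamation..." means the process sits in a blocking loop until the previous vLLM instance has released its VRAM. The snippet below is not the actual instructlab implementation, just a minimal sketch of what that kind of wait typically looks like, to show that the failing jobs spend time in a loop like this right before the HuggingFace downloads start:

```python
import time

import torch

# Illustrative only -- NOT the instructlab code. Poll free VRAM on device 0
# until enough has been released, or give up after a deadline.
def wait_for_vram(min_free_bytes: int, timeout_s: float = 300.0) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        free, _total = torch.cuda.mem_get_info()  # (free_bytes, total_bytes)
        if free >= min_free_bytes:
            return True
        time.sleep(1.0)
    return False
```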

All of the failing jobs from the past two weeks (since around Feb 17, 2025) failed while trying to use the GPU during the download step; all of the passing jobs from the same period downloaded on CPU. (A small helper for checking this against raw job logs is sketched after the example lists below.)

Example additional passing jobs -- all of which downloaded on CPU:

Example additional failing jobs -- all of which try to download on GPU:
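As mentioned above, a rough helper for checking this correlation against raw job logs might look like the following (the directory and file names are hypothetical; it just greps each log for the reclamation marker and the HuggingFace read timeout):

```python
from pathlib import Path

RECLAIM_MARKER = "Waiting for GPU VRAM reclamation"
TIMEOUT_MARKER = "ReadTimeoutError: HTTPSConnectionPool(host='huggingface.co'"

# Assumes the raw job logs have been downloaded into ./job-logs/ as *.txt files.
for log_file in sorted(Path("job-logs").glob("*.txt")):
    text = log_file.read_text(errors="replace")
    print(
        f"{log_file.name}: "
        f"reclamation_wait={RECLAIM_MARKER in text}, "
        f"hf_read_timeout={TIMEOUT_MARKER in text}"
    )
```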

To Reproduce
Steps to reproduce the behavior:

  1. Trigger a large E2E job on any PR, on main, on release-v0.24, or on release-v0.23.

Expected behavior
Our vLLM Python logic (in particular, the wait for GPU VRAM reclamation) should not impact the HuggingFace model downloads.

Screenshots
N/A

Device Info (please complete the following information):

  • Hardware Specs: n/a
  • OS Version: n/a
  • Python Version: Python 3.11 (what our E2E job uses), but Python 3.10 is likely affected as well
  • InstructLab Version: main, release-v0.24, and release-v0.23 branches

Additional context
