feat(deps): Bump `torch` upper bound (<2.7.0) + set allowable range for `vllm` versions (>=0.8.0,<0.9.0) #3283

courtneypacheco · 2025-04-10T22:12:15Z

Checklist:

Commit Message Formatting: Commit titles and messages follow the Conventional Commits guidelines.
Changelog updated with breaking and/or notable changes for the next minor release.
Documentation has been updated, if necessary.
Unit tests have been added, if necessary.
Functional tests have been added, if necessary.
E2E Workflow tests have been added, if necessary.

Changes

In requirements.txt: replace vllm==0.7.3 with vllm>=0.8.0, and use a new constraints-dev.txt file to cap the upper bound on vllm in the CI only
In requirements.txt: change torch>=2.3.0,<2.6.0 to torch>=2.6.0,<2.7.0, and pin torch==2.6.0 in the new constraints-dev.txt file for the CI only

Note: vllm>=0.8.0,<0.9.0 is currently compatible only with torch>=2.6.0, hence we have bumped the torch lower bound.
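
To make the new ranges concrete, here is a minimal Python sketch that checks a few versions against the bounds named in this PR. The helper names `parse_version` and `in_range` are illustrative only, and the comparison handles plain `X.Y.Z` release strings rather than the full version-specifier rules that pip applies.

```python
def parse_version(text):
    """Turn a plain 'X.Y.Z' release string into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def in_range(version, lower, upper):
    """Return True when lower <= version < upper (inclusive lower, exclusive upper)."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Ranges taken from the PR title and description.
TORCH_RANGE = ("2.6.0", "2.7.0")  # torch>=2.6.0,<2.7.0
VLLM_RANGE = ("0.8.0", "0.9.0")   # vllm>=0.8.0,<0.9.0

print(in_range("2.6.0", *TORCH_RANGE))  # the CI pin torch==2.6.0 sits inside the range
print(in_range("2.5.1", *TORCH_RANGE))  # a torch release below the new lower bound
print(in_range("0.8.3", *VLLM_RANGE))   # a vllm 0.8.z release
print(in_range("0.7.3", *VLLM_RANGE))   # the previous vllm pin
```

The exclusive upper bound is what lets a requirement such as `<0.9.0` admit every 0.8.z release without naming each one.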

requirements-vllm-cuda.txt

constraints.txt

src/instructlab/train/linux_train.py

github-actions · 2025-04-14T14:04:36Z

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

courtneypacheco · 2025-04-14T14:39:32Z

@RobotSail when the large E2E job passes, will we be okay to merge this change and release it? Or, would you like any additional testing performed to ensure that training works as intended?

github-actions · 2025-04-14T16:24:41Z

e2e workflow succeeded on this PR: View run, congrats!

courtneypacheco · 2025-04-14T18:14:44Z

We need to validate this change in an E2E test with use_dolomite=True set. We also need to validate this against the Llama 70B model, which requires a new config.

courtneypacheco · 2025-04-14T23:04:19Z

Please see: #3288

ktdreyer · 2025-04-15T21:34:25Z

Should we merge #3288 instead of this?

github-actions · 2025-04-16T17:06:02Z

E2E (NVIDIA L40S x4) LLAMA workflow launched on this PR: View run

github-actions · 2025-04-16T17:11:59Z

e2e workflow failed on this PR: View run, please investigate.

github-actions · 2025-04-16T18:13:56Z

E2E (NVIDIA L40S x4) LLAMA workflow launched on this PR: View run

github-actions · 2025-04-16T18:19:17Z

e2e workflow failed on this PR: View run, please investigate.

dhellmann · 2025-04-17T13:50:55Z

requirements-vllm-cuda.txt

-# vLLM only supports Linux platform (including WSL)
-vllm==0.7.3 ; sys_platform == 'linux' and platform_machine == 'x86_64'
+# vLLM only supports Linux platform (including WSL). Do not cap this dependency here. Cap in constraints-dev.txt
+vllm>=0.8.0 ; sys_platform == 'linux' and platform_machine == 'x86_64'


We're building for aarch64 now. Why is this limited to x86_64?

Good catch. Let me update this to remove the x86_64 limitation.
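
The effect of the marker under review can be sketched in Python. Environment markers compare against `sys.platform` and `platform.machine()`; the function `marker_allows` and its `required_machine` parameter are hypothetical names used only to illustrate the two variants discussed here.

```python
import platform
import sys


def marker_allows(sys_platform, machine, required_machine=None):
    """Mimic a marker of the form
    sys_platform == 'linux' [and platform_machine == required_machine].
    When required_machine is None, the machine clause is dropped.
    """
    if sys_platform != "linux":
        return False
    return required_machine is None or machine == required_machine


# With the x86_64 clause, an aarch64 Linux host does not match the marker.
print(marker_allows("linux", "x86_64", required_machine="x86_64"))
print(marker_allows("linux", "aarch64", required_machine="x86_64"))

# With the clause removed, any Linux host matches.
print(marker_allows("linux", "aarch64"))

# Evaluate against the machine running this script.
print(marker_allows(sys.platform, platform.machine()))
```

When a marker evaluates to false, the installer skips that requirement line, which is why the `x86_64` clause matters for aarch64 builds.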

Add `constraints.txt` to restrict the CI to using torch==2.6.0 and vllm<0.9.0. Also bump the minimum `torch` version because vLLM 0.8.z is only compatible with `torch==2.6.0` (so far). Also update the large Llama job to use fallback logic to try other availability zones.

Signed-off-by: Courtney Pacheco <[email protected]>
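
The availability-zone fallback mentioned in the commit message can be pictured roughly as follows. This is a generic sketch, not the workflow's actual implementation: `CapacityError`, `launch_with_fallback`, `fake_launch`, and the zone names are all hypothetical.

```python
class CapacityError(Exception):
    """Raised by the hypothetical launcher when a zone has no capacity."""


def launch_with_fallback(zones, launch):
    """Try each availability zone in order and return the first success."""
    errors = {}
    for zone in zones:
        try:
            return zone, launch(zone)
        except CapacityError as exc:
            # Remember why this zone failed, then move on to the next one.
            errors[zone] = str(exc)
    raise RuntimeError(f"no capacity in any zone: {errors}")


def fake_launch(zone):
    # Stand-in for a real cloud API call; only 'us-east-2c' has capacity here.
    if zone != "us-east-2c":
        raise CapacityError(f"insufficient capacity in {zone}")
    return f"instance-in-{zone}"


print(launch_with_fallback(["us-east-2a", "us-east-2b", "us-east-2c"], fake_launch))
```

Collecting the per-zone errors before raising keeps the final failure message useful when every zone is exhausted.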

courtneypacheco · 2025-04-17T17:06:47Z

This PR did not auto-merge because the large Llama E2E job failed (the job itself has a bug), and Mergify will not auto-merge a PR on which it detects a CI failure. Since the required CI checks all pass, I'm going to merge this PR. The large Llama E2E job will be fixed in a future PR.

mergify bot added CI/CD Affects CI/CD configuration testing Relates to testing dependencies Relates to dependencies labels Apr 10, 2025

courtneypacheco force-pushed the add-constraints branch from 60e0a66 to ccd6781 Compare April 10, 2025 22:14

mergify bot added the ci-failure PR has at least one CI failure label Apr 10, 2025

courtneypacheco force-pushed the add-constraints branch from ccd6781 to 643c5ef Compare April 10, 2025 22:41

mergify bot removed the ci-failure PR has at least one CI failure label Apr 10, 2025

courtneypacheco changed the title ~~chore: Bump torch + vllm versions~~ feat: Bump torch + vllm versions Apr 10, 2025

dhellmann reviewed Apr 10, 2025

requirements-vllm-cuda.txt (outdated, resolved)

constraints.txt (outdated, resolved)

courtneypacheco changed the title ~~feat: Bump torch + vllm versions~~ feat: Bump vllm to 0.8.3 Apr 10, 2025

mergify bot added the ci-failure PR has at least one CI failure label Apr 10, 2025

courtneypacheco force-pushed the add-constraints branch from 643c5ef to 283f011 Compare April 11, 2025 00:36

mergify bot removed the ci-failure PR has at least one CI failure label Apr 11, 2025

courtneypacheco changed the title ~~feat: Bump vllm to 0.8.3~~ feat(deps): Bump torch upper bound (<2.7.0) + set allowable range for vllm versions (>=0.8.0,<0.9.0) Apr 11, 2025

mergify bot added the ci-failure PR has at least one CI failure label Apr 11, 2025

courtneypacheco force-pushed the add-constraints branch from 283f011 to eb8ee94 Compare April 11, 2025 01:23

mergify bot removed the ci-failure PR has at least one CI failure label Apr 11, 2025

courtneypacheco commented Apr 11, 2025

src/instructlab/train/linux_train.py (outdated, resolved)

mergify bot added the ci-failure PR has at least one CI failure label Apr 11, 2025

courtneypacheco force-pushed the add-constraints branch from eb8ee94 to 41878de Compare April 11, 2025 02:13

mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 11, 2025

courtneypacheco force-pushed the add-constraints branch from 41878de to cf0e64c Compare April 11, 2025 02:16

mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 11, 2025

courtneypacheco force-pushed the add-constraints branch from cf0e64c to e2f7068 Compare April 11, 2025 03:02

mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Apr 11, 2025

courtneypacheco force-pushed the add-constraints branch from e2f7068 to 66e27ac Compare April 11, 2025 03:41

mergify bot removed the ci-failure PR has at least one CI failure label Apr 11, 2025

mergify bot added the ci-failure PR has at least one CI failure label Apr 14, 2025

ktdreyer approved these changes Apr 14, 2025

mergify bot removed the one-approval PR has one approval from a maintainer label Apr 14, 2025

mergify bot removed the ci-failure PR has at least one CI failure label Apr 14, 2025

courtneypacheco added the hold In-progress PR. Tag should be removed before merge. label Apr 14, 2025

courtneypacheco force-pushed the add-constraints branch 2 times, most recently from b84ed08 to a5db9e3 Compare April 16, 2025 17:54

mergify bot added the ci-failure PR has at least one CI failure label Apr 16, 2025

courtneypacheco force-pushed the add-constraints branch from a5db9e3 to e5386ea Compare April 16, 2025 18:05

mergify bot removed the ci-failure PR has at least one CI failure label Apr 16, 2025

mergify bot added the ci-failure PR has at least one CI failure label Apr 16, 2025

dhellmann reviewed Apr 17, 2025

courtneypacheco force-pushed the add-constraints branch from e5386ea to 8733d7c Compare April 17, 2025 14:58

mergify bot removed the ci-failure PR has at least one CI failure label Apr 17, 2025

tiran approved these changes Apr 17, 2025

courtneypacheco merged commit 480c8f0 into main Apr 17, 2025
29 checks passed

ktdreyer deleted the add-constraints branch April 17, 2025 18:45

booxter mentioned this pull request Apr 18, 2025

CI: improve smoke job AWS resource management to look into other AZs for available instances instructlab/training#480

Closed

8 participants
