feat: Add fallback logic to the small, medium, and large E2E tests to select a new AZ when AWS has insufficient capacity #3161

courtneypacheco · 2025-02-18T00:32:44Z

Issue resolved by this Pull Request:
Resolves #3160

Checklist:

Commit Message Formatting: Commit titles and messages follow the Conventional Commits guidelines.
Changelog updated with breaking and/or notable changes for the next minor release.
Documentation has been updated, if necessary.
Unit tests have been added, if necessary.
Functional tests have been added, if necessary.
E2E Workflow tests have been added, if necessary.

Overview

The large E2E test has been failing due to "insufficient instance capacity" errors from AWS. Whenever the large E2E job is kicked off, whether manually or at its regularly scheduled interval, AWS almost always returns an error saying that not enough machines (VMs) are available to run it. As a result, the large E2E job fails to run.

See linked issue for more details.

Proposed Solution

This PR takes the fallback logic I added in #2975 and tweaks it so that it can also be used in our small, medium, and large E2E jobs.
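
For readers unfamiliar with the idea, here is a minimal sketch of AZ fallback, not the actual workflow code from this PR. It assumes a caller-supplied `launch` function and a hypothetical `InsufficientCapacityError` standing in for AWS's `InsufficientInstanceCapacity` error. The AZ names and `fake_launch` are illustrative only.

```python
from typing import Callable, Optional, Sequence


class InsufficientCapacityError(Exception):
    """Hypothetical stand-in for AWS's InsufficientInstanceCapacity error."""


def launch_with_fallback(
    launch: Callable[[str], str],
    availability_zones: Sequence[str],
) -> str:
    """Try each AZ in order and return the first successful instance ID."""
    last_error: Optional[Exception] = None
    for az in availability_zones:
        try:
            return launch(az)
        except InsufficientCapacityError as err:
            # A capacity shortage is specific to one AZ, so try the next one.
            last_error = err
    # Every candidate AZ was out of capacity (or the list was empty).
    raise RuntimeError(
        f"no capacity in any of: {', '.join(availability_zones)}"
    ) from last_error


def fake_launch(az: str) -> str:
    # Simulated launcher for illustration: only us-east-2c has capacity.
    if az != "us-east-2c":
        raise InsufficientCapacityError(az)
    return f"i-simulated-{az}"


print(launch_with_fallback(fake_launch, ["us-east-2a", "us-east-2b", "us-east-2c"]))
```

Only the capacity error triggers a fallback here. Any other launch failure propagates immediately, so unrelated problems are not hidden behind retries.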

github-actions · 2025-04-19T02:05:57Z

This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

Signed-off-by: Courtney Pacheco <[email protected]>

github-actions · 2025-04-28T21:28:26Z

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

github-actions · 2025-04-28T22:31:40Z

e2e workflow failed on this PR: View run, please investigate.

booxter · 2025-04-28T22:40:02Z

The linked job failed when running tests:

+ ilab model train --strategy=lab-skills-only --phased-phase2-data= --phased-phase2-num-epochs=1 --skip-user-confirm --phased-base-dir=/home/tmp/tmp.L2P9KCOOvo/.local/share/instructlab/skills-only
+ tee /home/tmp/tmp.L2P9KCOOvo/skills_only_training.log
Usage: ilab model train [OPTIONS]
Try 'ilab model train --help' for help.

Error: Invalid value for '--phased-phase2-data': File '' is a directory.
+ grep 'Best final checkpoint: ' /home/tmp/tmp.L2P9KCOOvo/skills_only_training.log
+ grep -o '/[^ ]*'
+ rm -rf /home/tmp/tmp.L2P9KCOOvo

This failure is unrelated to the patch here, because the patch only affects the start and stop steps of the CI jobs, not the test itself.

booxter

This approach has already been validated in other jobs, both in this repo and in training (maybe more). I checked that the instance types are the same, so no change there. We also have at least several different workflow runs showing that the retry logic worked and that the test phase started successfully (even if it failed later for unrelated reasons). I think we should get this in ASAP so that we stop wasting engineers' time and CI resources on respinning jobs over and over, only to see them fail.

courtneypacheco · 2025-04-30T16:56:41Z

@mergify backport release-v0.26

mergify · 2025-04-30T16:56:49Z

backport release-v0.26

✅ Backports have been created

Details

#3331 feat: Add fallback logic to the small, medium, and large E2E tests to select a new AZ when AWS has insufficient capacity (backport #3161) has been created for branch release-v0.26 but encountered conflicts

mergify bot added CI/CD Affects CI/CD configuration ci-failure PR has at least one CI failure labels Feb 18, 2025

courtneypacheco force-pushed the add-fallback-logic-to-small-med-and-large-e2e-tests branch from 8bec4e9 to c7a6acd Compare February 18, 2025 00:35

mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Feb 18, 2025

github-actions bot added the stale label Apr 19, 2025

courtneypacheco force-pushed the add-fallback-logic-to-small-med-and-large-e2e-tests branch from c7a6acd to d37f039 Compare April 28, 2025 21:03

mergify bot removed the ci-failure PR has at least one CI failure label Apr 28, 2025

courtneypacheco force-pushed the add-fallback-logic-to-small-med-and-large-e2e-tests branch from d37f039 to 75b7b8d Compare April 28, 2025 21:17

courtneypacheco marked this pull request as ready for review April 28, 2025 21:18

courtneypacheco removed the stale label Apr 28, 2025

mergify bot added the ci-failure PR has at least one CI failure label Apr 28, 2025

Add fallback logic to small, medium, and large E2E tests

f98f92b

Signed-off-by: Courtney Pacheco <[email protected]>

courtneypacheco force-pushed the add-fallback-logic-to-small-med-and-large-e2e-tests branch from 75b7b8d to f98f92b Compare April 28, 2025 21:21

mergify bot removed the ci-failure PR has at least one CI failure label Apr 28, 2025

mergify bot added the ci-failure PR has at least one CI failure label Apr 28, 2025

booxter approved these changes Apr 28, 2025

View reviewed changes

booxter requested a review from a team April 28, 2025 22:43

mergify bot added the one-approval PR has one approval from a maintainer label Apr 28, 2025

booxter mentioned this pull request Apr 28, 2025

Flip to python 3.12 #3101

Closed

6 tasks

RobotSail approved these changes Apr 28, 2025

View reviewed changes

mergify bot merged commit 4ca72aa into main Apr 28, 2025
28 of 30 checks passed

mergify bot removed the one-approval PR has one approval from a maintainer label Apr 28, 2025

mergify bot deleted the add-fallback-logic-to-small-med-and-large-e2e-tests branch April 28, 2025 23:10

mergify bot mentioned this pull request Apr 30, 2025

feat: Add fallback logic to the small, medium, and large E2E tests to select a new AZ when AWS has insufficient capacity (backport #3161) #3331

Closed

6 tasks

booxter mentioned this pull request Apr 30, 2025

Use actions/launch-ec2-runner-with-fallback for all e2e jobs #3304

Closed
