feat: Add fallback logic to the small, medium, and large E2E tests to select a new AZ when AWS has insufficient capacity #3161
8bec4e9 to
c7a6acd
Compare
This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.
c7a6acd to
d37f039
Compare
d37f039 to
75b7b8d
Compare
Signed-off-by: Courtney Pacheco <[email protected]>
75b7b8d to
f98f92b
Compare
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
e2e workflow failed on this PR: View run, please investigate.
The linked job failed while running tests. That failure is unrelated to this patch, which only affects the start and stop steps of the CI jobs, not the tests themselves.
booxter
left a comment
This approach has already been validated in other jobs, both in this repo and in training (maybe more). I checked that the instance types are the same, so no change there. We also have several different workflow runs showing that the retry logic worked and that the test phase started successfully (even if it later failed for unrelated reasons). I think we should get this in asap so we stop wasting engineers' time and CI resources on respinning jobs over and over, only to see them fail.
@mergify backport release-v0.26
✅ Backports have been created
Issue resolved by this Pull Request:
Resolves #3160
Checklist:
conventional commits.
Overview
The large E2E test has been failing due to "insufficient instance capacity" in AWS. In other words, whenever we kick off the large E2E job manually, or it starts at its regularly scheduled interval, AWS almost always returns an error telling us that there aren't enough machines (VMs) available to run that job. As a result, the large E2E job fails to run.
See linked issue for more details.
Proposed Solution
This PR takes the fallback logic I added in #2975 and tweaks it so that it can also be used in our small, medium, and large E2E jobs.
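For readers unfamiliar with the idea, here is a minimal sketch of availability-zone fallback in Python. It is not the code from this PR or from #2975: the function name, the `launch` callable, the exception class, and the zone names are all hypothetical, chosen only for illustration. The sketch assumes the launcher signals a capacity shortage by raising an error, in the spirit of the `InsufficientInstanceCapacity` error code that EC2 returns.

```python
from typing import Callable, Iterable, Optional


class InsufficientCapacityError(Exception):
    """Hypothetical stand-in for EC2's InsufficientInstanceCapacity error."""


def launch_with_az_fallback(
    launch: Callable[[str], str],
    zones: Iterable[str],
) -> Optional[str]:
    """Try each availability zone in order until one launch succeeds.

    `launch` is a hypothetical callable that takes a zone name and returns
    an instance ID, or raises InsufficientCapacityError if the zone is full.
    Returns the instance ID, or None if every zone lacked capacity.
    """
    for zone in zones:
        try:
            return launch(zone)
        except InsufficientCapacityError:
            # Capacity problems are zone-specific, so move on to the next zone.
            print(f"No capacity in {zone}; trying the next zone")
    return None
```

Only the capacity error triggers a fallback here; any other exception propagates, so that genuine misconfiguration still fails the job quickly instead of being retried across zones.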