Make SDG batch size configurable via system profile (backport #3157) #3208
Signed-off-by: Nikhil Palaskar <[email protected]> (cherry picked from commit 4a309ce)
There is a failure in the large E2E job on I'm going to run the large E2E job on this branch to ensure no conflicts on this particular release branch.
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
e2e workflow failed on this PR: View run, please investigate.
The above job failed due to a download failure with HuggingFace. Rerunning.
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
e2e workflow failed on this PR: View run, please investigate.
The same HuggingFace download error occurred. It seems like a server-side error, so I will trigger the job once more.
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
e2e workflow succeeded on this PR: View run, congrats!
Courtney & I discussed this today in a call. Even though we fixed the unit tests in #3210, this PR does not reflect that change until we rebase on that. In the interest of time, we will not manually rebase this PR. We need to ship the upcoming Summary: We will merge this PR over the failing unit tests. After it is merged, we'll verify that unit tests continue to pass on the
Currently, the batch size for SDG is only configurable via the CLI, but a single batch size across all hardware profiles is not optimal. Different hardware configurations have varying capabilities, and using a fixed batch size can lead to under-utilization or over-utilization of resources during the SDG process.
To ensure efficient performance across different hardware, we should set the batch sizes independently in each system profile.
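As a rough illustration of the idea (not the code in this PR), the sketch below resolves the SDG batch size from a per-profile default. The profile names, the values, `PROFILE_BATCH_SIZES`, `FALLBACK_BATCH_SIZE`, and `resolve_batch_size` are all hypothetical. It also assumes that an explicit CLI value should still take precedence over the profile.

```python
from typing import Optional

# Hypothetical per-profile defaults; names and values are illustrative only
# and are not taken from the project's real system profiles.
PROFILE_BATCH_SIZES = {
    "laptop_cpu": 0,   # assumption: 0 stands for "batching disabled"
    "l40s_x4": 8,
    "a100_x8": 16,
}

# Hypothetical fallback used when a profile does not define a batch size.
FALLBACK_BATCH_SIZE = 8


def resolve_batch_size(profile: str, cli_batch_size: Optional[int] = None) -> int:
    """Pick the SDG batch size for a run.

    Assumed precedence: explicit CLI value, then the profile's value,
    then the global fallback.
    """
    if cli_batch_size is not None:
        if cli_batch_size < 0:
            raise ValueError("batch size must be non-negative")
        return cli_batch_size
    return PROFILE_BATCH_SIZES.get(profile, FALLBACK_BATCH_SIZE)


if __name__ == "__main__":
    print(resolve_batch_size("l40s_x4"))                     # profile default
    print(resolve_batch_size("l40s_x4", cli_batch_size=4))   # CLI override
    print(resolve_batch_size("unknown_profile"))             # global fallback
```

Keeping the lookup in one function means each hardware profile can carry its own value while the existing CLI option keeps working as an override.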
This is an automatic backport of pull request #3157 done by Mergify.