Enable CI for torch ops #22548
Signed-off-by: Erick Ochoa <[email protected]>
pytest iree-test-suites/pytorch_ops/ \
  -rpfE \
  --timeout=30 \
  --durations=20 \
  --report-log=${LOG_FILE_PATH} \
  --config-files=${CONFIG_FILE_PATH}
How many tests is this going to run in parallel?
This no longer runs in parallel. I removed parallel test execution in 9c911f7 after a PkgCI run failed, specifically this one: https://github.com/iree-org/iree/actions/runs/19083410154/job/54518386104
I had already disabled some failing tests, but a different set of tests then failed, so I disabled parallelism. I decided to open this up for review to surface the issue. We may want to merge the CPU-only configuration first and then investigate the GPU failures. It may also be a different CI issue altogether.
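To make the sequential versus parallel choice explicit, here is a minimal Python sketch that assembles the same pytest arguments shown in the diff above. The function name `build_pytest_args` and its `numprocesses` parameter are hypothetical; the `-n` flag is assumed to come from the pytest-xdist plugin, and it is only added when more than one process is requested.

```python
from typing import List, Optional


def build_pytest_args(
    log_file_path: str,
    config_file_path: str,
    numprocesses: Optional[int] = None,  # hypothetical knob, not part of the PR
) -> List[str]:
    """Return the pytest argv for the torch ops step.

    With numprocesses left as None (or 1) the run is sequential.
    A larger value appends pytest-xdist's -n flag (assumes the plugin is installed).
    """
    args = [
        "pytest",
        "iree-test-suites/pytorch_ops/",
        "-rpfE",
        "--timeout=30",
        "--durations=20",
        f"--report-log={log_file_path}",
        f"--config-files={config_file_path}",
    ]
    if numprocesses is not None and numprocesses > 1:
        args += ["-n", str(numprocesses)]
    return args


# Sequential run, matching the current state of the PR.
sequential_args = build_pytest_args("report.json", "config.json")
# Parallel run, for comparison while investigating the GPU failures.
parallel_args = build_pytest_args("report.json", "config.json", numprocesses=4)
```

Keeping the process count as a single parameter would make it easy to re-enable parallelism per runner once the GPU failures are understood.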
@@ -0,0 +1,18 @@
{
  "config_name": "gpu_hip_rdna3",
We should probably call this gpu_hip_gfx1100, since other gfx IP versions also fall under the rdna3 umbrella.
"config_name": "gpu_vulkan",
"iree_compile_flags": [
  "--iree-hal-target-device=vulkan",
  "--iree-opt-level=O0"
do these work in -O3? If yes, we probably want to use that
Let me set up a run with O3. I copied over ONNX's settings. Let me know if changing the ONNX setting would also be a good idea.
iree/.github/workflows/pkgci_test_onnx.yml
Lines 46 to 53 in 6816a39
- name: amdgpu_hip_rdna3_O3
  numprocesses: 1
  config-file: onnx_ops_gpu_hip_rdna3_O3.json
  runs-on: [Linux, X64, gfx1100]
- name: amdgpu_vulkan_O0
  numprocesses: 4
  config-file: onnx_ops_gpu_vulkan_O0.json
  runs-on: [Linux, X64, rdna3]
I will update this comment after the run has either failed or succeeded.
We disabled O3 in onnx because of some regression but we'd prefer to run pytorch in O3 if it works
Yes, it worked! Here's the link to the successful test https://github.com/iree-org/iree/actions/runs/19104758812
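For anyone trying the same O0 to O3 switch on another config, the change amounts to swapping one entry in `iree_compile_flags`. The sketch below is only illustrative: `with_opt_level` is a hypothetical helper, and the config dictionary mirrors just the two keys visible in the diff excerpt, not the full file.

```python
import copy
from typing import Any, Dict

OPT_FLAG_PREFIX = "--iree-opt-level="


def with_opt_level(config: Dict[str, Any], level: str) -> Dict[str, Any]:
    """Return a copy of a test config with its optimization level replaced.

    Hypothetical helper: it only touches the --iree-opt-level flag and
    leaves every other compile flag as it was.
    """
    new_config = copy.deepcopy(config)
    flags = [
        flag
        for flag in new_config.get("iree_compile_flags", [])
        if not flag.startswith(OPT_FLAG_PREFIX)
    ]
    flags.append(OPT_FLAG_PREFIX + level)
    new_config["iree_compile_flags"] = flags
    return new_config


# Only the keys shown in the review excerpt; the real file has more content.
vulkan_o0 = {
    "config_name": "gpu_vulkan",
    "iree_compile_flags": [
        "--iree-hal-target-device=vulkan",
        "--iree-opt-level=O0",
    ],
}
vulkan_o3 = with_opt_level(vulkan_o0, "O3")
```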
- name: amdgpu_hip_rdna3_O3
  config-file: torch_ops_gpu_hip_gfx1100_O3.json
  runs-on: [Linux, X64, gfx1100]
- name: amdgpu_vulkan_O0
  config-file: torch_ops_gpu_vulkan_O3.json
  runs-on: [Linux, X64, rdna3]
Let's add mi300 also
Ideally we should have mi250 also, but okay for now
(I'm running a test pkgCI here https://github.com/iree-org/iree/actions/runs/19139123834)
Please note that once the changes you requested here iree-org/iree-test-suites#128 are merged, I'll just need to revert the last commit.
Rerunning here after disabling failing test https://github.com/iree-org/iree/actions/runs/19140548793 f8b1c7d
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
  repository: iree-org/iree-test-suites
  ref: main
@Groverkss as discussed offline, below you have a commit to one of your branches. I have created a PR for your branch here: iree-org/iree-test-suites#127 . I think it is best to have these reference main. That way as iree-test-suites changes, the code pulled in here gets updated automatically. If you prefer to have a ref to a specific commit id, then let me know and I can change this one here.
Please pin to a specific commit (from the main branch) and not main. That allows you to make breaking changes in the other repository and then take time adapting to them.
Changed 5e3554c. Thanks!
I accidentally committed a reference to my branch. It should always reference a specific commit on main, not the main branch itself, as Scott mentioned.
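A small check along these lines could guard against a branch name slipping back into the `ref:` field. This is a sketch under assumptions: `is_pinned_ref` is a hypothetical helper, and it treats only a full 40-character hexadecimal SHA as pinned. It cannot tell whether that commit actually lives on main, which would need a lookup against the repository.

```python
import re

# A full Git commit SHA-1: exactly 40 lowercase hexadecimal characters.
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_ref(ref: str) -> bool:
    """Return True if ref looks like a full commit SHA rather than a branch or tag.

    Hypothetical helper: abbreviated SHAs are rejected on purpose, since
    they can become ambiguous as a repository grows.
    """
    return FULL_SHA.fullmatch(ref) is not None


# The checkout action in the diff is itself pinned this way.
pinned = is_pinned_ref("08c6903cd8c0fde910a37f88322edcfb5dd907a8")
# A branch name is not a pin.
floating = is_pinned_ref("main")
```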
--------- Signed-off-by: Erick Ochoa <[email protected]>
Sample run here https://github.com/iree-org/iree/actions/runs/19085734122/job/54525888442
I'm not yet sure why I had to disable some of the initial tests on GPUs. They were failing; after I made pytest run sequentially the failures seemed to stop, but maybe the cause was something else?
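One way to narrow this down would be to compare the failing test IDs from a parallel run against those from a sequential run, using the files written by `--report-log`. The sketch below assumes those files follow the pytest-reportlog layout (one JSON object per line, with `$report_type`, `nodeid`, and `outcome` fields); `failed_tests` is a hypothetical helper and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Set


def failed_tests(report_lines: Iterable[str]) -> Set[str]:
    """Collect node IDs of failed tests from report-log lines.

    Assumes the pytest-reportlog JSON-lines layout; entries that are not
    test reports (session start/finish, collection) are skipped.
    """
    failed = set()
    for line in report_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("$report_type") == "TestReport" and entry.get("outcome") == "failed":
            failed.add(entry["nodeid"])
    return failed


# Made-up sample lines standing in for two real report logs.
parallel_log = [
    '{"$report_type": "TestReport", "nodeid": "ops/a::test_1", "outcome": "failed", "when": "call"}',
    '{"$report_type": "TestReport", "nodeid": "ops/b::test_2", "outcome": "passed", "when": "call"}',
]
sequential_log = [
    '{"$report_type": "TestReport", "nodeid": "ops/a::test_1", "outcome": "passed", "when": "call"}',
    '{"$report_type": "TestReport", "nodeid": "ops/b::test_2", "outcome": "passed", "when": "call"}',
]

# Tests that fail only when run in parallel are candidates for a concurrency issue.
parallel_only = failed_tests(parallel_log) - failed_tests(sequential_log)
```

If the parallel-only set changes from run to run, that would point toward resource contention on the runner rather than a problem in specific tests.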