MDRangePolicy: Nested loops w/o tiles; Host backends (begin with Serial) #8721
science-enthusiast wants to merge 4 commits into kokkos:develop
Conversation
Stencil-based example shared in #8652. In total there are three stencil computations, all involving 2D views. Multiple Views are involved, with only a few of them shared across the stencils. Only the timings for MDRange are expected to change. Size of each dimension of the 2D Views: 6144. Tile dimensions: 8 x 256
Size of each dimension of the 2D Views: 8193. Tile dimensions: 8 X 256
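A minimal sketch of how one such stencil kernel might be written with an MDRangePolicy and explicit tile dimensions (the view names, the 5-point update, and the extent `N` are illustrative assumptions, not the code from #8652):

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical 5-point stencil over 2D views; call between
// Kokkos::initialize() and Kokkos::finalize().
void run_stencil(const int N) {
  Kokkos::View<double**> in("in", N, N), out("out", N, N);

  using policy_t = Kokkos::MDRangePolicy<Kokkos::Rank<2>>;
  // Interior points only; tile dimensions set to 8 x 256 as in the benchmark.
  policy_t policy({1, 1}, {N - 1, N - 1}, {8, 256});

  Kokkos::parallel_for(
      "stencil", policy, KOKKOS_LAMBDA(const int i, const int j) {
        out(i, j) = 0.25 * (in(i - 1, j) + in(i + 1, j) +
                            in(i, j - 1) + in(i, j + 1));
      });
  Kokkos::fence();
}
```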
Stream benchmark: based on PR [8462](https://github.com//pull/8462/) (2c3ba82). In all cases, the total number of elements in the View =
| Operation<Rank> | Tiles | No Tiles | FOM |
|---|---|---|---|
| Copy<1> | 1055 ms | 968 ms | FOM: GB/s=0.277449/s MB=268.435 |
| Copy<2> | 1065 ms | 1036 ms | FOM: GB/s=0.259217/s MB=268.435 |
| Copy<3> | 1015 ms | 1043 ms | FOM: GB/s=0.257328/s MB=268.435 |
| Copy<4> | 1185 ms | 1039 ms | FOM: GB/s=0.258273/s MB=268.435 |
| Copy<5> | 1167 ms | 1038 ms | FOM: GB/s=0.258493/s MB=268.435 |
| Copy<6> | 1214 ms | 1068 ms | FOM: GB/s=0.251298/s MB=268.435 |
| Set<1> | 801 ms | 746 ms | FOM: GB/s=0.179817/s MB=134.218 |
| Set<2> | 799 ms | 741 ms | FOM: GB/s=0.181102/s MB=134.218 |
| Set<3> | 715 ms | 742 ms | FOM: GB/s=0.180931/s MB=134.218 |
| Set<4> | 771 ms | 739 ms | FOM: GB/s=0.181647/s MB=134.218 |
| Set<5> | 691 ms | 743 ms | FOM: GB/s=0.180738/s MB=134.218 |
| Set<6> | 671 ms | 749 ms | FOM: GB/s=0.179249/s MB=134.218 |
| Add<1> | 1440 ms | 1206 ms | FOM: GB/s=0.333874/s MB=402.653 |
| Add<2> | 1442 ms | 1305 ms | FOM: GB/s=0.308443/s MB=402.653 |
| Add<3> | 1387 ms | 1299 ms | FOM: GB/s=0.309954/s MB=402.653 |
| Add<4> | 1892 ms | 1302 ms | FOM: GB/s=0.309343/s MB=402.653 |
| Add<5> | 1909 ms | 1302 ms | FOM: GB/s=0.309343/s MB=402.653 |
| Add<6> | 1696 ms | 1413 ms | FOM: GB/s=0.284918/s MB=402.653 |
| Scale<1> | 1067 ms | 956 ms | FOM: GB/s=0.280883/s MB=268.435 |
| Scale<2> | 1047 ms | 1046 ms | FOM: GB/s=0.25656/s MB=268.435 |
| Scale<3> | 1021 ms | 1043 ms | FOM: GB/s=0.257302/s MB=268.435 |
| Scale<4> | 1272 ms | 1046 ms | FOM: GB/s=0.256544/s MB=268.435 |
| Scale<5> | 1212 ms | 1047 ms | FOM: GB/s=0.256409/s MB=268.435 |
| Scale<6> | 1218 ms | 1073 ms | FOM: GB/s=0.25028/s MB=268.435 |
| Triad<1> | 1383 ms | 1219 ms | FOM: GB/s=0.330189/s MB=402.653 |
| Triad<2> | 1396 ms | 1314 ms | FOM: GB/s=0.306464/s MB=402.653 |
| Triad<3> | 1373 ms | 1321 ms | FOM: GB/s=0.304741/s MB=402.653 |
| Triad<4> | 1808 ms | 1360 ms | FOM: GB/s=0.296073/s MB=402.653 |
| Triad<5> | 1889 ms | 1404 ms | FOM: GB/s=0.286864/s MB=402.653 |
| Triad<6> | 1653 ms | 1397 ms | FOM: GB/s=0.288239/s MB=402.653 |
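For reference, a rank-2 Copy kernel of the kind timed above might look roughly as follows (view names and extents are assumptions; the actual benchmark lives in the referenced PR). The "Tiles" vs. "No Tiles" columns compare two internal iteration schemes for the same user-level code:

```cpp
#include <Kokkos_Core.hpp>

// Sketch of a stream-style rank-2 "Copy": b(i,j) = a(i,j).
void stream_copy_rank2(const int n0, const int n1) {
  Kokkos::View<double**> a("a", n0, n1), b("b", n0, n1);

  // Default tile dimensions; the "No Tiles" timings correspond to the
  // direct nested-loop iteration path added for the Serial backend.
  Kokkos::MDRangePolicy<Kokkos::Rank<2>> policy({0, 0}, {n0, n1});

  Kokkos::parallel_for(
      "copy_rank2", policy,
      KOKKOS_LAMBDA(const int i, const int j) { b(i, j) = a(i, j); });
  Kokkos::fence();
}
```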
Stencil benchmark: based on PR [8458](https://github.com//pull/8458/) (ac90be7). Default tiling was used in all cases. Note: unlike the PR, the `parallel_for` was launched 100 times to obtain larger total run times.
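A sketch of the repeated-launch timing pattern described in the note (the helper name is a placeholder; the original benchmark may structure this differently):

```cpp
#include <Kokkos_Core.hpp>

// Launch the same kernel 100 times and report the total elapsed time, so
// per-launch noise is amortized over a longer measurement window.
template <class Policy, class Functor>
double time_100_launches(const Policy& policy, const Functor& functor) {
  Kokkos::Timer timer;
  for (int rep = 0; rep < 100; ++rep) {
    Kokkos::parallel_for("stencil_rep", policy, functor);
  }
  Kokkos::fence();  // ensure all launches finished before reading the clock
  return timer.seconds();
}
```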
The title should mention that this only concerns the
This work could also be useful for multi-core CPUs.
I'd be curious how you would distribute the execution range for host parallel backends with this implementation.
There are no dependencies between the loop iterations, except for fetching data in a cache-friendly manner. Parallelizing over batches of the "higher level" (outer) iterations could help. Also, the results in #8652 are quite good for the RangePolicy with a for loop inside the functor. One of the reasons could be better vectorization with
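One way to realize that suggestion on a host parallel backend is to distribute only the outer index across threads and keep a plain, contiguous inner loop inside the functor; a rough sketch under that assumption, not code from this PR:

```cpp
#include <Kokkos_Core.hpp>

// Parallelize over the outer index with an ordinary RangePolicy while the
// inner index is traversed by a serial for loop inside the functor.
// View names and extents are illustrative.
void copy_outer_parallel(const int n0, const int n1) {
  Kokkos::View<double**> a("a", n0, n1), b("b", n0, n1);

  Kokkos::parallel_for(
      "outer_batches", Kokkos::RangePolicy<>(0, n0),
      KOKKOS_LAMBDA(const int i) {
        // Inner loop stays serial: with the default host layout (LayoutRight)
        // these accesses are contiguous and easy for the compiler to
        // vectorize, which may explain the RangePolicy results in #8652.
        for (int j = 0; j < n1; ++j) b(i, j) = a(i, j);
      });
  Kokkos::fence();
}
```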
yasahi-hpc left a comment:
Do you have a test for the no-tile version?
I will add test(s).
yasahi-hpc left a comment:
Thanks. Other than the simd directives, it looks good.
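For context, the kind of directive under discussion would be a vectorization hint on the innermost loop of the no-tile path; one possible form is shown below (whether and how the PR actually emits such a directive is an assumption here):

```cpp
// Rank-2 no-tile iteration with a vectorization hint on the inner loop.
// The pragma requires OpenMP support and is shown for illustration only.
template <class Functor>
void loop2d_no_tiles(const int begin0, const int end0,
                     const int begin1, const int end1,
                     const Functor& functor) {
  for (int i = begin0; i < end0; ++i) {
#pragma omp simd
    for (int j = begin1; j < end1; ++j) {
      functor(i, j);
    }
  }
}
```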
Issue #8652 reported a performance issue concerning MDRangePolicy in the Serial backend.
Two approaches were tried for iterating through the elements of a View via MDRangePolicy. The approach based on a direct nested for loop over the elements (without tiles) shows a consistent speed improvement in the benchmarks tested so far (see the results above).
Please share your thoughts on how to choose between the various approaches: looping over tiles, nested for loops without tiles, a flat loop with index reconstruction, etc. Currently, the approach in this PR is chosen if the tile dimensions are all set to 1; otherwise, the existing approach (a single loop over the tiles) is chosen. @crtrott also mentioned that this approach could be useful for multi-core CPU backends.
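A rough sketch of that selection for rank 2 (the function and its arguments are hypothetical; the actual dispatch inside the Serial backend will differ):

```cpp
#include <array>
#include <cstdint>

// Hypothetical host-side dispatch: if every tile dimension is 1, iterate the
// whole range with direct nested loops (the path added in this PR); otherwise
// keep the existing single loop over tiles.
template <class Functor>
void execute_rank2(const std::array<std::int64_t, 2>& begin,
                   const std::array<std::int64_t, 2>& end,
                   const std::array<std::int64_t, 2>& tile,
                   const Functor& functor) {
  if (tile[0] == 1 && tile[1] == 1) {
    // No-tile path: plain nested loops, friendly to vectorization and
    // unroll-and-jam.
    for (std::int64_t i = begin[0]; i < end[0]; ++i)
      for (std::int64_t j = begin[1]; j < end[1]; ++j) functor(i, j);
  } else {
    // Existing path: a single loop over tiles, reconstructing the tile
    // coordinates and iterating within each tile (elided in this sketch).
  }
}
```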
This nested for loop structure is well suited to unroll-and-jam code optimization. It could be one of the knobs we provide to the user.
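For illustration, unroll-and-jam unrolls the outer loop by some factor and fuses ("jams") the copies into the inner loop body, increasing data reuse and instruction-level parallelism; a generic sketch, not code from this PR:

```cpp
// Unroll-and-jam of the outer loop by 4 for a rank-2 iteration space.
// The trailing loop handles trip counts not divisible by 4.
template <class Functor>
void loop2d_unroll_and_jam(const int n0, const int n1, const Functor& functor) {
  int i = 0;
  for (; i + 4 <= n0; i += 4) {
    for (int j = 0; j < n1; ++j) {
      // Four outer iterations share one pass over j.
      functor(i + 0, j);
      functor(i + 1, j);
      functor(i + 2, j);
      functor(i + 3, j);
    }
  }
  for (; i < n0; ++i)
    for (int j = 0; j < n1; ++j) functor(i, j);  // remainder
}
```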