
MDRangePolicy: Nested loops w/o tiles; Host backends (begin with Serial) #8721

Open

science-enthusiast wants to merge 4 commits into kokkos:develop from science-enthusiast:serial-mdrange-without-tiles

Conversation

@science-enthusiast (Contributor) commented Dec 3, 2025

Issue #8652 reported a performance problem with MDRangePolicy in the Serial backend.

Two approaches were tried for iterating through the elements of a View via MDRangePolicy. The approach based on direct nested for loops over the elements (without tiles) shows a consistent speedup in the benchmarks tested so far (see below).

Please share your thoughts on how to choose between the various approaches: looping over tiles, nested for loops without tiles, a flat loop with index reconstruction, etc. Currently, the approach in this PR is chosen if the tile dimensions are all set to 1; otherwise, the existing approach (a single loop over the tiles) is used.

@crtrott also mentioned that this approach could be useful for multi-core CPU backends.

This nested for loop structure is well suited to unroll-and-jam code optimization. It could be one of the knobs we provide to the user.

@science-enthusiast (Contributor, Author) commented Dec 3, 2025

A stencil-based example was shared in #8652. In total there are three stencil computations, all involving 2D Views. Multiple Views are involved, with only a few of them shared across the stencils.

Only the MDRange timings are expected to change.

Size of each dimension of the 2D Views: 6144. Tile dimensions: 8 × 256

|  | Current tile-based approach | Direct iteration without tiles |
| --- | --- | --- |
| Time: Range, 3 kernels | 1.255342 s | 1.271328 s |
| Time: MDRange, 3 kernels | 2.196737 s | 1.853119 s |
| Time: Range, 1 kernel | 1.417772 s | 1.425144 s |
| Time: MDRange, 1 kernel | 2.901179 s | 2.291974 s |
| Time: MDRange Flat, 1 kernel | 2.705818 s | 2.732882 s |
| Time: Team Policy | 1.407209 s | 1.417905 s |

Size of each dimension of the 2D Views: 8193. Tile dimensions: 8 × 256

|  | Current tile-based approach | Direct iteration without tiles |
| --- | --- | --- |
| Time: Range, 3 kernels | 2.221814 s | 2.213865 s |
| Time: MDRange, 3 kernels | 3.976642 s | 3.359611 s |
| Time: Range, 1 kernel | 2.511044 s | 2.535955 s |
| Time: MDRange, 1 kernel | 4.690899 s | 4.040167 s |
| Time: MDRange Flat, 1 kernel | 4.793878 s | 4.884660 s |
| Time: Team Policy | 2.488391 s | 2.491935 s |

@science-enthusiast (Contributor, Author) commented Dec 3, 2025

Stream benchmark: based on PR #8462 (commit 2c3ba82). In all cases, the total number of elements in the View is $16^6$. Note: unlike in that PR, the `parallel_for` was launched 100 times so that the timings are on the order of seconds.

| Operation\<Rank\> | Tiles | No tiles | FOM |
| --- | --- | --- | --- |
| Copy\<1\> | 1055 ms | 968 ms | 0.277449 GB/s, 268.435 MB |
| Copy\<2\> | 1065 ms | 1036 ms | 0.259217 GB/s, 268.435 MB |
| Copy\<3\> | 1015 ms | 1043 ms | 0.257328 GB/s, 268.435 MB |
| Copy\<4\> | 1185 ms | 1039 ms | 0.258273 GB/s, 268.435 MB |
| Copy\<5\> | 1167 ms | 1038 ms | 0.258493 GB/s, 268.435 MB |
| Copy\<6\> | 1214 ms | 1068 ms | 0.251298 GB/s, 268.435 MB |
| Set\<1\> | 801 ms | 746 ms | 0.179817 GB/s, 134.218 MB |
| Set\<2\> | 799 ms | 741 ms | 0.181102 GB/s, 134.218 MB |
| Set\<3\> | 715 ms | 742 ms | 0.180931 GB/s, 134.218 MB |
| Set\<4\> | 771 ms | 739 ms | 0.181647 GB/s, 134.218 MB |
| Set\<5\> | 691 ms | 743 ms | 0.180738 GB/s, 134.218 MB |
| Set\<6\> | 671 ms | 749 ms | 0.179249 GB/s, 134.218 MB |
| Add\<1\> | 1440 ms | 1206 ms | 0.333874 GB/s, 402.653 MB |
| Add\<2\> | 1442 ms | 1305 ms | 0.308443 GB/s, 402.653 MB |
| Add\<3\> | 1387 ms | 1299 ms | 0.309954 GB/s, 402.653 MB |
| Add\<4\> | 1892 ms | 1302 ms | 0.309343 GB/s, 402.653 MB |
| Add\<5\> | 1909 ms | 1302 ms | 0.309343 GB/s, 402.653 MB |
| Add\<6\> | 1696 ms | 1413 ms | 0.284918 GB/s, 402.653 MB |
| Scale\<1\> | 1067 ms | 956 ms | 0.280883 GB/s, 268.435 MB |
| Scale\<2\> | 1047 ms | 1046 ms | 0.25656 GB/s, 268.435 MB |
| Scale\<3\> | 1021 ms | 1043 ms | 0.257302 GB/s, 268.435 MB |
| Scale\<4\> | 1272 ms | 1046 ms | 0.256544 GB/s, 268.435 MB |
| Scale\<5\> | 1212 ms | 1047 ms | 0.256409 GB/s, 268.435 MB |
| Scale\<6\> | 1218 ms | 1073 ms | 0.25028 GB/s, 268.435 MB |
| Triad\<1\> | 1383 ms | 1219 ms | 0.330189 GB/s, 402.653 MB |
| Triad\<2\> | 1396 ms | 1314 ms | 0.306464 GB/s, 402.653 MB |
| Triad\<3\> | 1373 ms | 1321 ms | 0.304741 GB/s, 402.653 MB |
| Triad\<4\> | 1808 ms | 1360 ms | 0.296073 GB/s, 402.653 MB |
| Triad\<5\> | 1889 ms | 1404 ms | 0.286864 GB/s, 402.653 MB |
| Triad\<6\> | 1653 ms | 1397 ms | 0.288239 GB/s, 402.653 MB |

@science-enthusiast (Contributor, Author) commented Dec 3, 2025

Stencil benchmark: based on PR #8458 (commit ac90be7). Default tiling was used in all cases. Note: unlike in that PR, the `parallel_for` was launched 100 times to get larger total durations.

| Case | Tiles | No tiles | Default tile dimensions |
| --- | --- | --- | --- |
| 2D, LayoutRight, each dim. = 512 | 29.3 ms | 21.1 ms | tile_0=2, tile_1=512 |
| 2D, LayoutRight, each dim. = 1024 | 118 ms | 85.9 ms | tile_0=2, tile_1=1024 |
| 2D, LayoutRight, each dim. = 2048 | 482 ms | 372 ms | tile_0=2, tile_1=2048 |
| 2D, LayoutRight, each dim. = 4096 | 1937 ms | 1509 ms | tile_0=2, tile_1=4096 |
| 2D, LayoutLeft, each dim. = 512 | 29.5 ms | 21.7 ms | tile_0=512, tile_1=2 |
| 2D, LayoutLeft, each dim. = 1024 | 118 ms | 85.3 ms | tile_0=1024, tile_1=2 |
| 2D, LayoutLeft, each dim. = 2048 | 480 ms | 371 ms | tile_0=2048, tile_1=2 |
| 2D, LayoutLeft, each dim. = 4096 | 1931 ms | 1504 ms | tile_0=4096, tile_1=2 |
| 3D, LayoutRight, each dim. = 128 | 328 ms | 233 ms | tile_0=2, tile_1=2, tile_2=128 |
| 3D, LayoutRight, each dim. = 192 | 1048 ms | 795 ms | tile_0=2, tile_1=2, tile_2=192 |
| 3D, LayoutRight, each dim. = 256 | 2505 ms | 1937 ms | tile_0=2, tile_1=2, tile_2=256 |
| 3D, LayoutLeft, each dim. = 128 | 316 ms | 230 ms | tile_0=128, tile_1=2, tile_2=2 |
| 3D, LayoutLeft, each dim. = 192 | 1077 ms | 788 ms | tile_0=192, tile_1=2, tile_2=2 |
| 3D, LayoutLeft, each dim. = 256 | 2557 ms | 1902 ms | tile_0=256, tile_1=2, tile_2=2 |
| 4D, LayoutRight, each dim. = 32 | 229 ms | 151 ms | tile_0=2, tile_1=2, tile_2=2, tile_3=32 |
| 4D, LayoutRight, each dim. = 64 | 3348 ms | 2386 ms | tile_0=2, tile_1=2, tile_2=2, tile_3=64 |
| 4D, LayoutLeft, each dim. = 32 | 216 ms | 161 ms | tile_0=32, tile_1=2, tile_2=2, tile_3=2 |
| 4D, LayoutLeft, each dim. = 64 | 3368 ms | 2514 ms | tile_0=64, tile_1=2, tile_2=2, tile_3=2 |

@AdRi1t (Member) commented Dec 3, 2025

The title should mention that this only concerns the serial backend.

@science-enthusiast (Contributor, Author)

The title should mention that this only concerns the serial backend.

This work could be useful for multi-core CPUs also.

@science-enthusiast changed the title from "Iterate w/o tiles in MDRangePolicy" to "Iterate w/o tiles in MDRangePolicy for Host backends" on Dec 3, 2025
@masterleinad (Contributor)

This work could be useful for multi-core CPUs also.

I'd be curious how you would distribute the execution range for host parallel backends with this implementation.

@science-enthusiast (Contributor, Author) commented Dec 3, 2025

I'd be curious how you would distribute the execution range for host parallel backends with this implementation.

There are no dependencies between the loop iterations; the only ordering concern is fetching data in a cache-friendly manner. Parallelizing over batches of "higher level" (outermost) iterations could help.

Also, the results in #8652 are quite good for RangePolicy with a for loop inside the functor. One reason could be better vectorization at the -O3 optimization level. I will add `_Pragma("ivdep")` to the innermost loop, as is done in core/src/impl/KokkosExp_Host_IterateTile.hpp. I will also re-run that test with the flags `Kokkos_ENABLE_AGGRESSIVE_VECTORIZATION` and `Kokkos_ENABLE_PRAGMA_IVDEP` to see whether the performance of the current MDRangePolicy can be improved for the Serial backend.

@science-enthusiast changed the title from "Iterate w/o tiles in MDRangePolicy for Host backends" to "MDRangePolicy: Iterate w/o tiles; Host backends (begin with Serial)" on Dec 4, 2025
@science-enthusiast changed the title from "MDRangePolicy: Iterate w/o tiles; Host backends (begin with Serial)" to "MDRangePolicy: Nested for loops w/o tiles; Host backends (begin with Serial)" on Dec 4, 2025
@science-enthusiast changed the title from "MDRangePolicy: Nested for loops w/o tiles; Host backends (begin with Serial)" to "MDRangePolicy: Nested loops w/o tiles; Host backends (begin with Serial)" on Dec 4, 2025
@science-enthusiast force-pushed the serial-mdrange-without-tiles branch 2 times, most recently from 3c02fea to 4a5e2b8, on December 9, 2025 12:39
@science-enthusiast force-pushed the serial-mdrange-without-tiles branch from 4a5e2b8 to 46f6091 on December 9, 2025 15:33
@science-enthusiast marked this pull request as ready for review on December 9, 2025 18:07
@yasahi-hpc (Contributor) left a comment


Do you have a test for no-tile version?

@science-enthusiast (Contributor, Author)

Do you have a test for no-tile version?

I will add test(s).

@yasahi-hpc (Contributor) left a comment


Thanks. Other than the simd directives, it looks good.
