Codestin Search App

science-enthusiast · 2025-12-03T14:04:57Z

Issue 8652 reported a performance issue concerning MDRangePolicy in Serial backend.

Two approaches were tried to iterate through the elements of a View via MDRangePolicy. The approach based on a direct nested for loop over the elements (without tiles) is showing consistent speed improvement in the benchmarks tested so far (see below).

Please share your thoughts on how to choose between various approaches like looping over tiles, nested for loops without tiles, flat loop with index reconstruction etc. Currently, the approach in this PR is chosen if the tile dimensions are all set to 1, otherwise, the existing approach (a single loop over the tiles) is chosen.

@crtrott was also mentioning that this approach could be useful for multi-core CPU backends.

This nested for loop structure is very much suitable for unroll-and-jam based code optimization. It could be one of the knobs that we provide to the user.

science-enthusiast · 2025-12-03T14:17:33Z

Stencils based example shared in #8652. Totally there are three stencil computations. All involving 2D views. There are multiple Views involved, with only a few of them shared across the stencils.

Only timings regarding MDRange are expected to change.

Size of each dimension of the 2D Views: 6144. Tile dimensions: 8 X 256

	Current tiles based approach	Directly iterating over the elements without tiles
Time: Range 3 Kernels	1.255342 s	1.271328 s
Time: MDRange 3 Kernels	2.196737 s	1.853119 s
Time: Range 1 Kernel	1.417772 s	1.425144 s
Time: MDRange 1 Kernel	2.901179 s	2.291974 s
Time: MDRange Flat 1 Kernel	2.705818 s	2.732882 s
Time: Team Policy	1.407209 s	1.417905 s

Size of each dimension of the 2D Views: 8193. Tile dimensions: 8 X 256

	Current tiles based approach	Directly iterating over the elements without tiles
Time: Range 3 Kernels	2.221814 s	2.213865 s
Time: MDRange 3 Kernels	3.976642 s	3.359611 s
Time: Range 1 Kernel	2.511044 s	2.535955 s
Time: MDRange 1 Kernel	4.690899 s	4.040167 s
Time: MDRange Flat 1 Kernel	4.793878 s	4.884660 s
Time: Team Policy	2.488391 s	2.491935 s

science-enthusiast · 2025-12-03T14:22:47Z

Stream benchmark: based on PR[8462](https://github.com//pull/8462/) (2c3ba82) In all cases, the total number of elements in the View = $16^6$. Note: Unlike the PR, the `parallel_for` was launched 100 times to get numbers in the order of seconds.

Operation<Rank>	Tiles	No Tiles
Copy<1>	1055 ms	968 ms	FOM: GB/s=0.277449/s MB=268.435
Copy<2>	1065 ms	1036 ms	FOM: GB/s=0.259217/s MB=268.435
Copy<3>	1015 ms	1043 ms	FOM: GB/s=0.257328/s MB=268.435
Copy<4>	1185 ms	1039 ms	FOM: GB/s=0.258273/s MB=268.435
Copy<5>	1167 ms	1038 ms	FOM: GB/s=0.258493/s MB=268.435
Copy<6>	1214 ms	1068 ms	FOM: GB/s=0.251298/s MB=268.435
Set<1>	801 ms	746 ms	FOM: GB/s=0.179817/s MB=134.218
Set<2>	799 ms	741 ms	FOM: GB/s=0.181102/s MB=134.218
Set<3>	715 ms	742 ms	FOM: GB/s=0.180931/s MB=134.218
Set<4>	771 ms	739 ms	FOM: GB/s=0.181647/s MB=134.218
Set<5>	691 ms	743 ms	FOM: GB/s=0.180738/s MB=134.218
Set<6>	671 ms	749 ms	FOM: GB/s=0.179249/s MB=134.218
Add<1>	1440 ms	1206 ms	FOM: GB/s=0.333874/s MB=402.653
Add<2>	1442 ms	1305 ms	FOM: GB/s=0.308443/s MB=402.653
Add<3>	1387 ms	1299 ms	FOM: GB/s=0.309954/s MB=402.653
Add<4>	1892 ms	1302 ms	FOM: GB/s=0.309343/s MB=402.653
Add<5>	1909 ms	1302 ms	FOM: GB/s=0.309343/s MB=402.653
Add<6>	1696 ms	1413 ms	FOM: GB/s=0.284918/s MB=402.653
Scale<1>	1067 ms	956 ms	FOM: GB/s=0.280883/s MB=268.435
Scale<2>	1047 ms	1046 ms	FOM: GB/s=0.25656/s MB=268.435
Scale<3>	1021 ms	1043 ms	FOM: GB/s=0.257302/s MB=268.435
Scale<4>	1272 ms	1046 ms	FOM: GB/s=0.256544/s MB=268.435
Scale<5>	1212 ms	1047 ms	FOM: GB/s=0.256409/s MB=268.435
Scale<6>	1218 ms	1073 ms	FOM: GB/s=0.25028/s MB=268.435
Triad<1>	1383 ms	1219 ms	FOM: GB/s=0.330189/s MB=402.653
Triad<2>	1396 ms	1314 ms	FOM: GB/s=0.306464/s MB=402.653
Triad<3>	1373 ms	1321 ms	FOM: GB/s=0.304741/s MB=402.653
Triad<4>	1808 ms	1360 ms	FOM: GB/s=0.296073/s MB=402.653
Triad<5>	1889 ms	1404 ms	FOM: GB/s=0.286864/s MB=402.653
Triad<6>	1653 ms	1397 ms	FOM: GB/s=0.288239/s MB=402.653

science-enthusiast · 2025-12-03T14:25:13Z

Stencil benchmark: based on PR[8458](https://github.com//pull/8458/) (ac90be7). Default tiling was used in all cases. Note: Unlike the PR, the `parallel_for` was launched 100 times to get bigger total time durations.

	Tiles	No Tiles
2D, LayoutRight, each dim. = 512	29.3 ms	21.1 ms	tile_0=2 tile_1=512
2D, LayoutRight, each dim. = 1024	118 ms	85.9 ms	tile_0=2 tile_1=1.024k
2D, LayoutRight, each dim. = 2048	482 ms	372 ms	tile_0=2 tile_1=2.048k
2D, LayoutRight, each dim. = 4096	1937 ms	1509 ms	tile_0=2 tile_1=4.096k
2D, LayoutLeft, each dim. = 512	29.5 ms	21.7 ms	tile_0=512 tile_1=2
2D, LayoutLeft, each dim. = 1024	118 ms	85.3 ms	tile_0=1.024k tile_1=2
2D, LayoutLeft, each dim. = 2048	480 ms	371 ms	tile_0=2.048k tile_1=2
2D, LayoutLeft, each dim. = 4096	1931 ms	1504 ms	tile_0=4.096k tile_1=2
3D, LayoutRight, each dim. = 128	328 ms	233 ms	tile_0=2 tile_1=2 tile_2=128
3D, LayoutRight, each dim. = 192	1048 ms	795 ms	tile_0=2 tile_1=2 tile_2=192
3D, LayoutRight, each dim. = 256	2505 ms	1937 ms	tile_0=2 tile_1=2 tile_2=256
3D, LayoutLeft, each dim. = 128	316 ms	230 ms	tile_0=128 tile_1=2 tile_2=2
3D, LayoutLeft, each dim. = 192	1077 ms	788 ms	tile_0=192 tile_1=2 tile_2=2
3D, LayoutLeft, each dim. = 256	2557 ms	1902 ms	tile_0=256 tile_1=2 tile_2=2
4D, LayoutRight, each dim. = 32	229 ms	151 ms	tile_0=2 tile_1=2 tile_2=2 tile_3=32
4D, LayoutRight, each dim. = 64	3348 ms	2386 ms	tile_0=2 tile_1=2 tile_2=2 tile_3=64
4D, LayoutLeft, each dim. = 32	216 ms	161 ms	tile_0=32 tile_1=2 tile_2=2 tile_3=2
4D, LayoutLeft, each dim. = 64	3368 ms	2514 ms	tile_0=64 tile_1=2 tile_2=2 tile_3=2

AdRi1t · 2025-12-03T14:30:08Z

The title should mention that this only concerns the serial backend.

science-enthusiast · 2025-12-03T14:36:47Z

The title should mention that this only concerns the serial backend.

This work could be useful for multi-core CPUs also.

core/src/impl/KokkosExp_Host_IterateElement.hpp

masterleinad · 2025-12-03T14:55:30Z

This work could be useful for multi-core CPUs also.

I'd be curious how you would distribute the execution range for host parallel backends with this implementation.

science-enthusiast · 2025-12-03T15:14:39Z

I'd be curious how you would distribute the execution range for host parallel backends with this implementation.

There are no dependencies between the loop iterations, except for fetching data in a cache friendly manner. Parallelization in terms of batches of "higher level" iterations could help.

Also, the results in #8652 are quite good for the RangePolicy with a for loop inside the functor. One of the reasons could be better vectorization with -O3 level optimization. I will add _Pragma(ivdep) to the inner most loop like it is done in core/src/impl/KokkosExp_Host_IterateTile.hpp. I will also re-run that test with the flags Kokkos_ENABLE_AGGRESSIVE_VECTORIZATION and Kokkos_ENABLE_PRAGMA_IVDEP, to see whether the performance for current MDRangePolicy can be improved for Serial backend.

core/src/impl/KokkosExp_Host_IterateNestLoopWoTile.hpp

Signed-off-by: Hariprasad Kannan <[email protected]>

core/src/Serial/Kokkos_Serial_Parallel_MDRange.hpp

yasahi-hpc

Do you have a test for no-tile version?

core/src/Serial/Kokkos_Serial_Parallel_MDRange.hpp

science-enthusiast · 2025-12-10T08:36:39Z

Do you have a test for no-tile version?

I will add test(s).

Signed-off-by: Hariprasad Kannan <[email protected]>

yasahi-hpc

Thanks
Other than simd directives, it looks good

core/src/impl/KokkosExp_Host_IterateNestLoopWoTile.hpp

Signed-off-by: Hariprasad Kannan <[email protected]>

science-enthusiast requested review from AdRi1t, crtrott, dalg24, masterleinad, nmm0, tretre91 and yasahi-hpc December 3, 2025 14:25

science-enthusiast mentioned this pull request Dec 3, 2025

Performance question for a single CPU core with different Kokkos policies #8652

Open

science-enthusiast changed the title ~~Iterate w/o tiles in MDRangePolicy~~ Iterate w/o tiles in MDRangePolicy for Host backends Dec 3, 2025

AdRi1t reviewed Dec 3, 2025

View reviewed changes

core/src/impl/KokkosExp_Host_IterateElement.hpp Outdated Show resolved Hide resolved

science-enthusiast changed the title ~~Iterate w/o tiles in MDRangePolicy for Host backends~~ MDRangePolicy: Iterate w/o tiles; Host backends (begin with Serial) Dec 4, 2025

science-enthusiast changed the title ~~MDRangePolicy: Iterate w/o tiles; Host backends (begin with Serial)~~ MDRangePolicy: Nested for loops w/o tiles; Host backends (begin with Serial) Dec 4, 2025

science-enthusiast changed the title ~~MDRangePolicy: Nested for loops w/o tiles; Host backends (begin with Serial)~~ MDRangePolicy: Nested loops w/o tiles; Host backends (begin with Serial) Dec 4, 2025

science-enthusiast force-pushed the serial-mdrange-without-tiles branch 2 times, most recently from 3c02fea to 4a5e2b8 Compare December 9, 2025 12:39

tretre91 reviewed Dec 9, 2025

View reviewed changes

core/src/impl/KokkosExp_Host_IterateNestLoopWoTile.hpp Show resolved Hide resolved

core/src/impl/KokkosExp_Host_IterateNestLoopWoTile.hpp Show resolved Hide resolved

Nested loop w/o tiles over the elements in MDRangePolicy

46f6091

Signed-off-by: Hariprasad Kannan <[email protected]>

science-enthusiast force-pushed the serial-mdrange-without-tiles branch from 4a5e2b8 to 46f6091 Compare December 9, 2025 15:33

science-enthusiast marked this pull request as ready for review December 9, 2025 18:07

yasahi-hpc reviewed Dec 10, 2025

View reviewed changes

core/src/Serial/Kokkos_Serial_Parallel_MDRange.hpp Show resolved Hide resolved

yasahi-hpc requested changes Dec 10, 2025

View reviewed changes

core/src/Serial/Kokkos_Serial_Parallel_MDRange.hpp Show resolved Hide resolved

Added unit tests. Fixed local variable name.

03c2640

Signed-off-by: Hariprasad Kannan <[email protected]>

yasahi-hpc requested changes Dec 10, 2025

View reviewed changes

core/src/impl/KokkosExp_Host_IterateNestLoopWoTile.hpp Outdated Show resolved Hide resolved

core/src/impl/KokkosExp_Host_IterateNestLoopWoTile.hpp Show resolved Hide resolved

science-enthusiast added 2 commits February 12, 2026 09:51

Merge 'origin/develop' into 'serial-mdrange-without-tiles'

076946d

Signed-off-by: Hariprasad Kannan <[email protected]>

Added ivdep pragmas.

156ce5c

Signed-off-by: Hariprasad Kannan <[email protected]>

Conversation

science-enthusiast commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

science-enthusiast commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

science-enthusiast commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

science-enthusiast commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AdRi1t commented Dec 3, 2025

Uh oh!

science-enthusiast commented Dec 3, 2025

Uh oh!

Uh oh!

masterleinad commented Dec 3, 2025

Uh oh!

science-enthusiast commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yasahi-hpc left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

science-enthusiast commented Dec 10, 2025

Uh oh!

yasahi-hpc left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Comments

science-enthusiast commented Dec 3, 2025 •

edited

Loading

science-enthusiast commented Dec 3, 2025 •

edited

Loading

science-enthusiast commented Dec 3, 2025 •

edited

Loading

science-enthusiast commented Dec 3, 2025 •

edited

Loading

science-enthusiast commented Dec 3, 2025 •

edited

Loading