[libc++] Optimize ranges::{for_each, for_each_n} for segmented iterators #132896

Open: wants to merge 14 commits into main

winner245 (Contributor) commented Mar 25, 2025

Previously, the segmented iterator optimization for std::for_each was restricted to C++23 and later because it depended on __movable_box (which requires >= C++23 to perform move semantics). The optimization was not applied at all to std::for_each_n, std::ranges::for_each, or std::ranges::for_each_n.

This patch:

  1. Extends the segmented iterator optimization to C++11 and later by removing the dependence on __movable_box;
  2. Further extends the optimization to std::for_each_n, std::ranges::for_each, and std::ranges::for_each_n, so that all of these algorithms are optimized consistently.
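
For readers unfamiliar with the idea, the following is a minimal sketch of what a segmented traversal looks like, assuming a sequence modelled as a vector of contiguous blocks. The helpers `for_each_segmented` and `for_each_n_segmented` and the two demo functions are hypothetical names made up for illustration; they are not libc++ internals and do not reproduce the code in this patch. The point is only that the per-element "am I at a segment boundary?" check moves out of the inner loop, which then becomes a plain pointer walk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a libc++ API): visits every element of a
// "segmented" sequence, modelled here as a vector of contiguous blocks.
template <class T, class Func>
Func for_each_segmented(std::vector<std::vector<T>>& segments, Func f) {
  for (std::vector<T>& segment : segments) {  // outer loop: one step per segment
    T* first = segment.data();
    T* last  = first + segment.size();
    for (; first != last; ++first)            // inner loop: plain pointer walk
      f(*first);
  }
  return f;
}

// Hypothetical counterpart for the first n elements: each segment contributes
// min(segment size, elements still wanted). Returns how many were visited.
template <class T, class Func>
std::size_t for_each_n_segmented(std::vector<std::vector<T>>& segments,
                                 std::size_t n, Func f) {
  std::size_t visited = 0;
  for (std::vector<T>& segment : segments) {
    if (visited == n)
      break;                                  // nothing left to visit
    const std::size_t take = std::min<std::size_t>(segment.size(), n - visited);
    T* first = segment.data();
    for (std::size_t i = 0; i != take; ++i)   // contiguous run inside one segment
      f(first[i]);
    visited += take;
    assert(visited <= n);                     // never overshoots the requested count
  }
  return visited;
}

// Demo: sums all elements of a small segmented sequence.
int sum_all_demo() {
  std::vector<std::vector<int>> blocks = {{1, 2, 3}, {}, {4, 5}, {6}};
  int sum = 0;
  for_each_segmented(blocks, [&sum](int& x) { sum += x; });
  return sum;
}

// Demo: sums only the first n elements (or all of them if n is too large).
int sum_first_n_demo(std::size_t n) {
  std::vector<std::vector<int>> blocks = {{1, 2, 3}, {}, {4, 5}, {6}};
  int sum = 0;
  for_each_n_segmented(blocks, n, [&sum](int& x) { sum += x; });
  return sum;
}
```

Here `sum_all_demo()` adds up all six elements, and `sum_first_n_demo(4)` stops after the first element of the third block. Empty blocks are skipped naturally because their inner loop runs zero times.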

Benchmarks demonstrate significant performance improvements for both deque and join_view iterators: up to 21.3x for deque and 24.9x for join_view.

Addresses a subtask of #102817.

Summary of speedups for deque iterators

-------------------------------------------------------------------------------
Benchmark                        deque<char>    deque<short>    deque<int>
-------------------------------------------------------------------------------
std::for_each                        1.0x           1.0x           1.0x
rng::for_each                       13.1x          21.3x           4.4x
std::for_each_n                     13.1x          17.7x           3.7x
rng::for_each_n                     13.8x          15.5x           3.6x
-------------------------------------------------------------------------------

Summary of speedups for join_view iterators

-----------------------------------------------------------------------------------------
Benchmark          vector<vector<char>>    vector<vector<short>>    vector<vector<int>>
-----------------------------------------------------------------------------------------
std::for_each              1.0x                    1.0x                    1.0x
rng::for_each             17.8x                   22.1x                    4.3x
std::for_each_n           11.5x                   11.2x                    3.2x
rng::for_each_n           24.9x                   23.1x                    4.0x
-----------------------------------------------------------------------------------------

Note: std::for_each shows no change because it was already optimized (for >= C++23).

Benchmarks:

{std, ranges}::for_each_n with deque iterators

--------------------------------------------------------------------------
Benchmark                                    Before       After    Speedup
--------------------------------------------------------------------------
std::for_each_n(vector<char>)/8             4.26 ns     4.23 ns      1.0x
std::for_each_n(vector<char>)/32            2.68 ns     2.67 ns      1.0x
std::for_each_n(vector<char>)/50            9.49 ns     9.36 ns      1.0x
std::for_each_n(vector<char>)/1024          42.3 ns     40.1 ns      1.1x
std::for_each_n(vector<char>)/4096           163 ns      151 ns      1.1x
std::for_each_n(vector<char>)/8192           308 ns      294 ns      1.0x
std::for_each_n(vector<char>)/16384          608 ns      593 ns      1.0x
std::for_each_n(vector<char>)/65536         2435 ns     2464 ns      1.0x
std::for_each_n(vector<char>)/262144       10029 ns    10190 ns      1.0x
std::for_each_n(deque<char>)/8              6.57 ns     2.43 ns      2.7x
std::for_each_n(deque<char>)/32             24.0 ns     2.73 ns      8.8x
std::for_each_n(deque<char>)/50             33.2 ns     4.53 ns      7.3x
std::for_each_n(deque<char>)/1024            541 ns     44.9 ns     12.0x
std::for_each_n(deque<char>)/4096           2067 ns      169 ns     12.2x
std::for_each_n(deque<char>)/8192           4005 ns      305 ns     13.1x
std::for_each_n(deque<char>)/16384          7831 ns      639 ns     12.3x
std::for_each_n(deque<char>)/65536         31819 ns     2717 ns     11.7x
std::for_each_n(deque<char>)/262144       120801 ns    10674 ns     11.3x
std::for_each_n(list<char>)/8               4.97 ns     5.16 ns      1.0x
std::for_each_n(list<char>)/32              19.9 ns     20.6 ns      1.0x
std::for_each_n(list<char>)/50              40.6 ns     42.7 ns      1.0x
std::for_each_n(list<char>)/1024             996 ns     1038 ns      1.0x
std::for_each_n(list<char>)/4096            6186 ns     6341 ns      1.0x
std::for_each_n(list<char>)/8192           12522 ns    12391 ns      1.0x
std::for_each_n(list<char>)/16384          26158 ns    25739 ns      1.0x
std::for_each_n(list<char>)/65536         106410 ns   105299 ns      1.0x
std::for_each_n(list<char>)/262144        621473 ns   625741 ns      1.0x
rng::for_each_n(vector<char>)/8             3.85 ns     4.99 ns      0.8x
rng::for_each_n(vector<char>)/32            2.75 ns     2.91 ns      0.9x
rng::for_each_n(vector<char>)/50            9.67 ns     13.3 ns      0.7x
rng::for_each_n(vector<char>)/1024          41.4 ns     42.4 ns      1.0x
rng::for_each_n(vector<char>)/4096           154 ns      171 ns      0.9x
rng::for_each_n(vector<char>)/8192           308 ns      340 ns      0.9x
rng::for_each_n(vector<char>)/16384          608 ns      673 ns      0.9x
rng::for_each_n(vector<char>)/65536         2471 ns     2867 ns      0.9x
rng::for_each_n(vector<char>)/262144       10138 ns    10882 ns      0.9x
rng::for_each_n(deque<char>)/8              5.71 ns     2.32 ns      2.5x
rng::for_each_n(deque<char>)/32             24.0 ns     2.74 ns      8.8x
rng::for_each_n(deque<char>)/50             33.3 ns     5.00 ns      6.7x
rng::for_each_n(deque<char>)/1024            554 ns     42.1 ns     13.2x
rng::for_each_n(deque<char>)/4096           2194 ns      159 ns     13.8x
rng::for_each_n(deque<char>)/8192           4265 ns      337 ns     12.7x
rng::for_each_n(deque<char>)/16384          8539 ns      672 ns     12.7x
rng::for_each_n(deque<char>)/65536         33510 ns     2775 ns     12.1x
rng::for_each_n(deque<char>)/262144       136651 ns    11271 ns     12.1x
rng::for_each_n(list<char>)/8               5.37 ns     6.21 ns      0.9x
rng::for_each_n(list<char>)/32              20.3 ns     23.1 ns      0.9x
rng::for_each_n(list<char>)/50              41.3 ns     42.3 ns      1.0x
rng::for_each_n(list<char>)/1024            1036 ns     1064 ns      1.0x
rng::for_each_n(list<char>)/4096            6310 ns     6645 ns      0.9x
rng::for_each_n(list<char>)/8192           12996 ns    13245 ns      1.0x
rng::for_each_n(list<char>)/16384          24803 ns    25932 ns      1.0x
rng::for_each_n(list<char>)/65536         103587 ns   105354 ns      1.0x
rng::for_each_n(list<char>)/262144        550281 ns   753493 ns      0.7x
std::for_each_n(vector<short>)/8            4.42 ns     3.92 ns      1.1x
std::for_each_n(vector<short>)/32           1.62 ns     1.64 ns      1.0x
std::for_each_n(vector<short>)/50           2.74 ns     2.75 ns      1.0x
std::for_each_n(vector<short>)/1024         34.0 ns     33.6 ns      1.0x
std::for_each_n(vector<short>)/4096          120 ns      117 ns      1.0x
std::for_each_n(vector<short>)/8192          229 ns      267 ns      0.9x
std::for_each_n(vector<short>)/16384         452 ns      469 ns      1.0x
std::for_each_n(vector<short>)/65536        2262 ns     2265 ns      1.0x
std::for_each_n(vector<short>)/262144       9129 ns     9140 ns      1.0x
std::for_each_n(deque<short>)/8             5.28 ns     1.78 ns      3.0x
std::for_each_n(deque<short>)/32            22.8 ns     2.08 ns     11.0x
std::for_each_n(deque<short>)/50            32.3 ns     4.46 ns      7.2x
std::for_each_n(deque<short>)/1024           545 ns     35.2 ns     15.5x
std::for_each_n(deque<short>)/4096          2158 ns      128 ns     16.9x
std::for_each_n(deque<short>)/8192          4303 ns      243 ns     17.7x
std::for_each_n(deque<short>)/16384         8624 ns      516 ns     16.7x
std::for_each_n(deque<short>)/65536        34569 ns     2336 ns     14.8x
std::for_each_n(deque<short>)/262144      137820 ns     9319 ns     14.8x
std::for_each_n(list<short>)/8              4.66 ns     4.95 ns      0.9x
std::for_each_n(list<short>)/32             19.9 ns     20.4 ns      1.0x
std::for_each_n(list<short>)/50             41.3 ns     41.1 ns      1.0x
std::for_each_n(list<short>)/1024           1018 ns     1021 ns      1.0x
std::for_each_n(list<short>)/4096           6110 ns     6294 ns      1.0x
std::for_each_n(list<short>)/8192          12433 ns    12692 ns      1.0x
std::for_each_n(list<short>)/16384         24739 ns    24820 ns      1.0x
std::for_each_n(list<short>)/65536        103376 ns   102812 ns      1.0x
std::for_each_n(list<short>)/262144       538314 ns   555664 ns      1.0x
rng::for_each_n(vector<short>)/8            3.84 ns     3.90 ns      1.0x
rng::for_each_n(vector<short>)/32           1.60 ns     1.63 ns      1.0x
rng::for_each_n(vector<short>)/50           2.88 ns     2.88 ns      1.0x
rng::for_each_n(vector<short>)/1024         33.6 ns     33.8 ns      1.0x
rng::for_each_n(vector<short>)/4096          117 ns      117 ns      1.0x
rng::for_each_n(vector<short>)/8192          229 ns      233 ns      1.0x
rng::for_each_n(vector<short>)/16384         456 ns      479 ns      1.0x
rng::for_each_n(vector<short>)/65536        2256 ns     2288 ns      1.0x
rng::for_each_n(vector<short>)/262144       8966 ns     9078 ns      1.0x
rng::for_each_n(deque<short>)/8             6.52 ns     1.97 ns      3.3x
rng::for_each_n(deque<short>)/32            23.7 ns     2.10 ns     11.3x
rng::for_each_n(deque<short>)/50            34.1 ns     4.74 ns      7.2x
rng::for_each_n(deque<short>)/1024           539 ns     35.1 ns     15.4x
rng::for_each_n(deque<short>)/4096          1920 ns      131 ns     14.7x
rng::for_each_n(deque<short>)/8192          3957 ns      255 ns     15.5x
rng::for_each_n(deque<short>)/16384         7807 ns      505 ns     15.5x
rng::for_each_n(deque<short>)/65536        30293 ns     2435 ns     12.4x
rng::for_each_n(deque<short>)/262144      119499 ns     9667 ns     12.4x
rng::for_each_n(list<short>)/8              5.08 ns     5.38 ns      0.9x
rng::for_each_n(list<short>)/32             20.1 ns     20.5 ns      1.0x
rng::for_each_n(list<short>)/50             42.6 ns     41.1 ns      1.0x
rng::for_each_n(list<short>)/1024           1028 ns     1025 ns      1.0x
rng::for_each_n(list<short>)/4096           6857 ns     6311 ns      1.1x
rng::for_each_n(list<short>)/8192          13336 ns    12807 ns      1.0x
rng::for_each_n(list<short>)/16384         26031 ns    25081 ns      1.0x
rng::for_each_n(list<short>)/65536        101849 ns   109759 ns      0.9x
rng::for_each_n(list<short>)/262144       582600 ns   554157 ns      1.1x
std::for_each_n(vector<int>)/8              2.78 ns     2.73 ns      1.0x
std::for_each_n(vector<int>)/32             5.22 ns     5.26 ns      1.0x
std::for_each_n(vector<int>)/50             8.20 ns     8.65 ns      0.9x
std::for_each_n(vector<int>)/1024            156 ns      175 ns      0.9x
std::for_each_n(vector<int>)/4096            602 ns      758 ns      0.8x
std::for_each_n(vector<int>)/8192           1214 ns     1393 ns      0.9x
std::for_each_n(vector<int>)/16384          2417 ns     2690 ns      0.9x
std::for_each_n(vector<int>)/65536          9989 ns    10703 ns      0.9x
std::for_each_n(vector<int>)/262144        41512 ns    43798 ns      0.9x
std::for_each_n(deque<int>)/8               5.04 ns     2.75 ns      1.8x
std::for_each_n(deque<int>)/32              19.1 ns     5.56 ns      3.4x
std::for_each_n(deque<int>)/50              30.6 ns     8.55 ns      3.6x
std::for_each_n(deque<int>)/1024             567 ns      152 ns      3.7x
std::for_each_n(deque<int>)/4096            2241 ns      657 ns      3.4x
std::for_each_n(deque<int>)/8192            4512 ns     1334 ns      3.4x
std::for_each_n(deque<int>)/16384           9066 ns     2701 ns      3.4x
std::for_each_n(deque<int>)/65536          35955 ns    10887 ns      3.3x
std::for_each_n(deque<int>)/262144        146489 ns    44361 ns      3.3x
std::for_each_n(list<int>)/8                4.68 ns     6.05 ns      0.8x
std::for_each_n(list<int>)/32               21.0 ns     21.9 ns      1.0x
std::for_each_n(list<int>)/50               43.0 ns     42.2 ns      1.0x
std::for_each_n(list<int>)/1024             1015 ns     1035 ns      1.0x
std::for_each_n(list<int>)/4096             6373 ns     6331 ns      1.0x
std::for_each_n(list<int>)/8192            12757 ns    12836 ns      1.0x
std::for_each_n(list<int>)/16384           24879 ns    25035 ns      1.0x
std::for_each_n(list<int>)/65536          103931 ns   103773 ns      1.0x
std::for_each_n(list<int>)/262144         536841 ns   555330 ns      1.0x
rng::for_each_n(vector<int>)/8              2.76 ns     2.79 ns      1.0x
rng::for_each_n(vector<int>)/32             5.30 ns     5.22 ns      1.0x
rng::for_each_n(vector<int>)/50             8.09 ns     8.17 ns      1.0x
rng::for_each_n(vector<int>)/1024            152 ns      153 ns      1.0x
rng::for_each_n(vector<int>)/4096            612 ns      608 ns      1.0x
rng::for_each_n(vector<int>)/8192           1206 ns     1220 ns      1.0x
rng::for_each_n(vector<int>)/16384          2428 ns     2451 ns      1.0x
rng::for_each_n(vector<int>)/65536          9852 ns    10112 ns      1.0x
rng::for_each_n(vector<int>)/262144        39133 ns    42646 ns      0.9x
rng::for_each_n(deque<int>)/8               4.39 ns     2.79 ns      1.6x
rng::for_each_n(deque<int>)/32              18.3 ns     5.75 ns      3.2x
rng::for_each_n(deque<int>)/50              29.7 ns     9.29 ns      3.2x
rng::for_each_n(deque<int>)/1024             571 ns      167 ns      3.4x
rng::for_each_n(deque<int>)/4096            2297 ns      649 ns      3.5x
rng::for_each_n(deque<int>)/8192            4497 ns     1248 ns      3.6x
rng::for_each_n(deque<int>)/16384           9025 ns     2513 ns      3.6x
rng::for_each_n(deque<int>)/65536          36321 ns    10063 ns      3.6x
rng::for_each_n(deque<int>)/262144        144304 ns    40555 ns      3.6x
rng::for_each_n(list<int>)/8                6.00 ns     5.12 ns      1.2x
rng::for_each_n(list<int>)/32               22.3 ns     20.5 ns      1.1x
rng::for_each_n(list<int>)/50               41.5 ns     40.5 ns      1.0x
rng::for_each_n(list<int>)/1024             1041 ns     1004 ns      1.0x
rng::for_each_n(list<int>)/4096             6455 ns     6347 ns      1.0x
rng::for_each_n(list<int>)/8192            12870 ns    12753 ns      1.0x
rng::for_each_n(list<int>)/16384           25525 ns    25135 ns      1.0x
rng::for_each_n(list<int>)/65536          103878 ns   103348 ns      1.0x
rng::for_each_n(list<int>)/262144         576571 ns   548541 ns      1.1x
--------------------------------------------------------------------------

{std, ranges}::for_each with deque iterators

--------------------------------------------------------------------------
Benchmark                                    Before       After    Speedup
--------------------------------------------------------------------------
std::for_each(vector<char>)/8               2.36 ns     2.27 ns      1.0x
std::for_each(vector<char>)/32              2.71 ns     2.72 ns      1.0x
std::for_each(vector<char>)/50              3.93 ns     4.17 ns      0.9x
std::for_each(vector<char>)/1024            40.6 ns     41.3 ns      1.0x
std::for_each(vector<char>)/4096             150 ns      158 ns      0.9x
std::for_each(vector<char>)/8192             293 ns      304 ns      1.0x
std::for_each(vector<char>)/16384            597 ns      615 ns      1.0x
std::for_each(vector<char>)/65536           2471 ns     2478 ns      1.0x
std::for_each(vector<char>)/262144          9665 ns     9878 ns      1.0x
std::for_each(deque<char>)/8                2.33 ns     2.36 ns      1.0x
std::for_each(deque<char>)/32               2.79 ns     2.87 ns      1.0x
std::for_each(deque<char>)/50               4.13 ns     4.13 ns      1.0x
std::for_each(deque<char>)/1024             43.3 ns     42.6 ns      1.0x
std::for_each(deque<char>)/4096              171 ns      177 ns      1.0x
std::for_each(deque<char>)/8192              337 ns      336 ns      1.0x
std::for_each(deque<char>)/16384             658 ns      664 ns      1.0x
std::for_each(deque<char>)/65536            2658 ns     2727 ns      1.0x
std::for_each(deque<char>)/262144          10916 ns    11005 ns      1.0x
std::for_each(list<char>)/8                 4.19 ns     3.94 ns      1.1x
std::for_each(list<char>)/32                35.1 ns     34.6 ns      1.0x
std::for_each(list<char>)/50                57.1 ns     54.2 ns      1.1x
std::for_each(list<char>)/1024              1044 ns     1034 ns      1.0x
std::for_each(list<char>)/4096              6214 ns     6225 ns      1.0x
std::for_each(list<char>)/8192             11791 ns    11629 ns      1.0x
std::for_each(list<char>)/16384            21278 ns    21767 ns      1.0x
std::for_each(list<char>)/65536            97876 ns    97773 ns      1.0x
std::for_each(list<char>)/262144          497406 ns   498083 ns      1.0x
rng::for_each(vector<char>)/8               3.72 ns     2.40 ns      1.5x
rng::for_each(vector<char>)/32              2.94 ns     2.79 ns      1.1x
rng::for_each(vector<char>)/50              9.81 ns     4.08 ns      2.4x
rng::for_each(vector<char>)/1024            46.2 ns     42.2 ns      1.1x
rng::for_each(vector<char>)/4096             171 ns      156 ns      1.1x
rng::for_each(vector<char>)/8192             334 ns      307 ns      1.1x
rng::for_each(vector<char>)/16384            675 ns      611 ns      1.1x
rng::for_each(vector<char>)/65536           2665 ns     2449 ns      1.1x
rng::for_each(vector<char>)/262144         10656 ns     9963 ns      1.1x
rng::for_each(deque<char>)/8                5.16 ns     2.37 ns      2.2x
rng::for_each(deque<char>)/32               23.2 ns     2.80 ns      8.3x
rng::for_each(deque<char>)/50               33.1 ns     4.15 ns      8.0x
rng::for_each(deque<char>)/1024              551 ns     41.9 ns     13.1x
rng::for_each(deque<char>)/4096             2179 ns      170 ns     12.8x
rng::for_each(deque<char>)/8192             4404 ns      344 ns     12.8x
rng::for_each(deque<char>)/16384            8719 ns      666 ns     13.1x
rng::for_each(deque<char>)/65536           34988 ns     2702 ns     13.0x
rng::for_each(deque<char>)/262144         141022 ns    11098 ns     12.7x
rng::for_each(list<char>)/8                 3.86 ns     4.07 ns      0.9x
rng::for_each(list<char>)/32                22.2 ns     34.9 ns      0.6x
rng::for_each(list<char>)/50                55.6 ns     54.2 ns      1.0x
rng::for_each(list<char>)/1024              1018 ns     1025 ns      1.0x
rng::for_each(list<char>)/4096              6661 ns     6690 ns      1.0x
rng::for_each(list<char>)/8192             11840 ns    11128 ns      1.1x
rng::for_each(list<char>)/16384            21107 ns    21612 ns      1.0x
rng::for_each(list<char>)/65536            97611 ns    99755 ns      1.0x
rng::for_each(list<char>)/262144          488435 ns   484463 ns      1.0x
std::for_each(vector<short>)/8              1.56 ns     1.61 ns      1.0x
std::for_each(vector<short>)/32             1.57 ns     1.63 ns      1.0x
std::for_each(vector<short>)/50             2.83 ns     2.82 ns      1.0x
std::for_each(vector<short>)/1024           37.1 ns     33.7 ns      1.1x
std::for_each(vector<short>)/4096            134 ns      133 ns      1.0x
std::for_each(vector<short>)/8192            235 ns      232 ns      1.0x
std::for_each(vector<short>)/16384           461 ns      457 ns      1.0x
std::for_each(vector<short>)/65536          2307 ns     2486 ns      0.9x
std::for_each(vector<short>)/262144         9273 ns     9248 ns      1.0x
std::for_each(deque<short>)/8               1.59 ns     1.56 ns      1.0x
std::for_each(deque<short>)/32              1.55 ns     1.55 ns      1.0x
std::for_each(deque<short>)/50              2.79 ns     2.81 ns      1.0x
std::for_each(deque<short>)/1024            34.0 ns     37.1 ns      0.9x
std::for_each(deque<short>)/4096             122 ns      127 ns      1.0x
std::for_each(deque<short>)/8192             247 ns      236 ns      1.0x
std::for_each(deque<short>)/16384            484 ns      469 ns      1.0x
std::for_each(deque<short>)/65536           2328 ns     2272 ns      1.0x
std::for_each(deque<short>)/262144          9203 ns     9214 ns      1.0x
std::for_each(list<short>)/8                3.44 ns     3.64 ns      0.9x
std::for_each(list<short>)/32               23.7 ns     20.8 ns      1.1x
std::for_each(list<short>)/50               52.6 ns     56.3 ns      0.9x
std::for_each(list<short>)/1024             1025 ns     1031 ns      1.0x
std::for_each(list<short>)/4096             6100 ns     6250 ns      1.0x
std::for_each(list<short>)/8192            11627 ns    11765 ns      1.0x
std::for_each(list<short>)/16384           22026 ns    21348 ns      1.0x
std::for_each(list<short>)/65536          104321 ns   102664 ns      1.0x
std::for_each(list<short>)/262144         521524 ns   498252 ns      1.0x
rng::for_each(vector<short>)/8              4.56 ns     1.55 ns      2.9x
rng::for_each(vector<short>)/32             1.76 ns     1.61 ns      1.1x
rng::for_each(vector<short>)/50             2.69 ns     2.90 ns      0.9x
rng::for_each(vector<short>)/1024           33.3 ns     34.4 ns      1.0x
rng::for_each(vector<short>)/4096            121 ns      117 ns      1.0x
rng::for_each(vector<short>)/8192            231 ns      232 ns      1.0x
rng::for_each(vector<short>)/16384           461 ns      457 ns      1.0x
rng::for_each(vector<short>)/65536          2251 ns     2249 ns      1.0x
rng::for_each(vector<short>)/262144         9080 ns     9064 ns      1.0x
rng::for_each(deque<short>)/8               4.86 ns     1.59 ns      3.1x
rng::for_each(deque<short>)/32              23.9 ns     1.56 ns     15.3x
rng::for_each(deque<short>)/50              36.2 ns     2.91 ns     12.4x
rng::for_each(deque<short>)/1024             637 ns     34.4 ns     18.5x
rng::for_each(deque<short>)/4096            2486 ns      125 ns     19.9x
rng::for_each(deque<short>)/8192            5039 ns      237 ns     21.3x
rng::for_each(deque<short>)/16384           9968 ns      474 ns     21.0x
rng::for_each(deque<short>)/65536          39995 ns     2294 ns     17.4x
rng::for_each(deque<short>)/262144        161619 ns     9273 ns     17.4x
rng::for_each(list<short>)/8                3.92 ns     3.85 ns      1.0x
rng::for_each(list<short>)/32               35.6 ns     21.4 ns      1.7x
rng::for_each(list<short>)/50               53.8 ns     53.9 ns      1.0x
rng::for_each(list<short>)/1024             1026 ns     1027 ns      1.0x
rng::for_each(list<short>)/4096             6646 ns     6574 ns      1.0x
rng::for_each(list<short>)/8192            11429 ns    11104 ns      1.0x
rng::for_each(list<short>)/16384           21677 ns    21029 ns      1.0x
rng::for_each(list<short>)/65536          105132 ns   102157 ns      1.0x
rng::for_each(list<short>)/262144         483564 ns   482510 ns      1.0x
std::for_each(vector<int>)/8                2.76 ns     2.76 ns      1.0x
std::for_each(vector<int>)/32               5.28 ns     5.24 ns      1.0x
std::for_each(vector<int>)/50               7.93 ns     8.06 ns      1.0x
std::for_each(vector<int>)/1024              156 ns      155 ns      1.0x
std::for_each(vector<int>)/4096              609 ns      615 ns      1.0x
std::for_each(vector<int>)/8192             1187 ns     1217 ns      1.0x
std::for_each(vector<int>)/16384            2385 ns     2446 ns      1.0x
std::for_each(vector<int>)/65536            9613 ns     9735 ns      1.0x
std::for_each(vector<int>)/262144          38775 ns    40545 ns      1.0x
std::for_each(deque<int>)/8                 2.74 ns     2.77 ns      1.0x
std::for_each(deque<int>)/32                5.36 ns     5.32 ns      1.0x
std::for_each(deque<int>)/50                8.44 ns     7.94 ns      1.1x
std::for_each(deque<int>)/1024               178 ns      156 ns      1.1x
std::for_each(deque<int>)/4096               689 ns      644 ns      1.1x
std::for_each(deque<int>)/8192              1345 ns     1273 ns      1.1x
std::for_each(deque<int>)/16384             2877 ns     2556 ns      1.1x
std::for_each(deque<int>)/65536            11167 ns    10196 ns      1.1x
std::for_each(deque<int>)/262144           42527 ns    40692 ns      1.0x
std::for_each(list<int>)/8                  4.02 ns     3.74 ns      1.1x
std::for_each(list<int>)/32                 38.4 ns     21.0 ns      1.8x
std::for_each(list<int>)/50                 56.9 ns     54.2 ns      1.0x
std::for_each(list<int>)/1024               1018 ns     1021 ns      1.0x
std::for_each(list<int>)/4096               6570 ns     6640 ns      1.0x
std::for_each(list<int>)/8192              11447 ns    11230 ns      1.0x
std::for_each(list<int>)/16384             20943 ns    21013 ns      1.0x
std::for_each(list<int>)/65536            106761 ns   106624 ns      1.0x
std::for_each(list<int>)/262144           533213 ns   545600 ns      1.0x
rng::for_each(vector<int>)/8                2.93 ns     2.82 ns      1.0x
rng::for_each(vector<int>)/32               5.57 ns     5.42 ns      1.0x
rng::for_each(vector<int>)/50               8.27 ns     7.99 ns      1.0x
rng::for_each(vector<int>)/1024              154 ns      156 ns      1.0x
rng::for_each(vector<int>)/4096              611 ns      606 ns      1.0x
rng::for_each(vector<int>)/8192             1194 ns     1203 ns      1.0x
rng::for_each(vector<int>)/16384            2423 ns     2442 ns      1.0x
rng::for_each(vector<int>)/65536            9702 ns     9960 ns      1.0x
rng::for_each(vector<int>)/262144          39326 ns    41502 ns      0.9x
rng::for_each(deque<int>)/8                 4.64 ns     2.81 ns      1.7x
rng::for_each(deque<int>)/32                20.8 ns     5.28 ns      3.9x
rng::for_each(deque<int>)/50                35.5 ns     8.01 ns      4.4x
rng::for_each(deque<int>)/1024               640 ns      170 ns      3.8x
rng::for_each(deque<int>)/4096              2589 ns      672 ns      3.9x
rng::for_each(deque<int>)/8192              5033 ns     1340 ns      3.8x
rng::for_each(deque<int>)/16384            10136 ns     2794 ns      3.6x
rng::for_each(deque<int>)/65536            40210 ns    10524 ns      3.8x
rng::for_each(deque<int>)/262144          164145 ns    42007 ns      3.9x
rng::for_each(list<int>)/8                  4.08 ns     3.88 ns      1.1x
rng::for_each(list<int>)/32                 35.1 ns     21.5 ns      1.6x
rng::for_each(list<int>)/50                 54.1 ns     55.8 ns      1.0x
rng::for_each(list<int>)/1024               1041 ns     1094 ns      1.0x
rng::for_each(list<int>)/4096               6607 ns     6955 ns      1.0x
rng::for_each(list<int>)/8192              11412 ns    11509 ns      1.0x
rng::for_each(list<int>)/16384             21225 ns    21480 ns      1.0x
rng::for_each(list<int>)/65536            102125 ns   106719 ns      1.0x
rng::for_each(list<int>)/262144           521829 ns   521055 ns      1.0x
--------------------------------------------------------------------------

{std, ranges}::for_each{, _n} with join_view iterators

---------------------------------------------------------------------------------------------
Benchmark                                                       Before      After    Speedup
---------------------------------------------------------------------------------------------
std::for_each(join_view(vector<vector<char>>))/8               2.25 ns     2.25 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/32              2.66 ns     2.65 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/50              4.81 ns     4.89 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/1024            40.5 ns     40.3 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/4096             159 ns      160 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/8192             324 ns      324 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/16384            651 ns      639 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/65536           2645 ns     2617 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/262144         10690 ns    10415 ns      1.0x
std::for_each(join_view(vector<vector<short>>))/8              2.23 ns     2.15 ns      1.0x
std::for_each(join_view(vector<vector<short>>))/32             2.26 ns     2.29 ns      1.0x
std::for_each(join_view(vector<vector<short>>))/50             4.30 ns     4.60 ns      0.9x
std::for_each(join_view(vector<vector<short>>))/1024           39.4 ns     41.0 ns      1.0x
std::for_each(join_view(vector<vector<short>>))/4096            182 ns      182 ns      1.0x
std::for_each(join_view(vector<vector<short>>))/8192            350 ns      363 ns      1.0x
std::for_each(join_view(vector<vector<short>>))/16384           707 ns      716 ns      1.0x
std::for_each(join_view(vector<vector<short>>))/65536          2992 ns     3164 ns      0.9x
std::for_each(join_view(vector<vector<short>>))/262144        11883 ns    12178 ns      1.0x
std::for_each(join_view(vector<vector<int>>))/8                2.83 ns     2.92 ns      1.0x
std::for_each(join_view(vector<vector<int>>))/32               6.01 ns     6.33 ns      0.9x
std::for_each(join_view(vector<vector<int>>))/50               9.27 ns     9.60 ns      1.0x
std::for_each(join_view(vector<vector<int>>))/1024              172 ns      173 ns      1.0x
std::for_each(join_view(vector<vector<int>>))/4096              695 ns      699 ns      1.0x
std::for_each(join_view(vector<vector<int>>))/8192             1361 ns     1387 ns      1.0x
std::for_each(join_view(vector<vector<int>>))/16384            2789 ns     2993 ns      0.9x
std::for_each(join_view(vector<vector<int>>))/65536           11228 ns    11184 ns      1.0x
std::for_each(join_view(vector<vector<int>>))/262144          44412 ns    47894 ns      0.9x
rng::for_each(join_view(vector<vector<char>>))/8               6.39 ns     2.44 ns      2.6x
rng::for_each(join_view(vector<vector<char>>))/32              32.3 ns     2.84 ns     11.4x
rng::for_each(join_view(vector<vector<char>>))/50              41.8 ns     5.14 ns      8.1x
rng::for_each(join_view(vector<vector<char>>))/1024             744 ns     44.5 ns     16.7x
rng::for_each(join_view(vector<vector<char>>))/4096            3069 ns      172 ns     17.8x
rng::for_each(join_view(vector<vector<char>>))/8192            5988 ns      345 ns     17.4x
rng::for_each(join_view(vector<vector<char>>))/16384          11820 ns      696 ns     17.0x
rng::for_each(join_view(vector<vector<char>>))/65536          48948 ns     2764 ns     17.7x
rng::for_each(join_view(vector<vector<char>>))/262144        192328 ns    10913 ns     17.6x
rng::for_each(join_view(vector<vector<short>>))/8              7.07 ns     2.42 ns      2.9x
rng::for_each(join_view(vector<vector<short>>))/32             37.1 ns     2.67 ns     13.9x
rng::for_each(join_view(vector<vector<short>>))/50             50.4 ns     4.99 ns     10.1x
rng::for_each(join_view(vector<vector<short>>))/1024            738 ns     34.5 ns     21.4x
rng::for_each(join_view(vector<vector<short>>))/4096           2943 ns      138 ns     21.3x
rng::for_each(join_view(vector<vector<short>>))/8192           5828 ns      265 ns     22.0x
rng::for_each(join_view(vector<vector<short>>))/16384         11746 ns      531 ns     22.1x
rng::for_each(join_view(vector<vector<short>>))/65536         48087 ns     2594 ns     18.5x
rng::for_each(join_view(vector<vector<short>>))/262144       188488 ns    10406 ns     18.1x
rng::for_each(join_view(vector<vector<int>>))/8                6.28 ns     2.81 ns      2.2x
rng::for_each(join_view(vector<vector<int>>))/32               28.2 ns     6.53 ns      4.3x
rng::for_each(join_view(vector<vector<int>>))/50               41.6 ns     10.1 ns      4.1x
rng::for_each(join_view(vector<vector<int>>))/1024              720 ns      178 ns      4.0x
rng::for_each(join_view(vector<vector<int>>))/4096             2772 ns      744 ns      3.7x
rng::for_each(join_view(vector<vector<int>>))/8192             5575 ns     1502 ns      3.7x
rng::for_each(join_view(vector<vector<int>>))/16384           11323 ns     2988 ns      3.8x
rng::for_each(join_view(vector<vector<int>>))/65536           44912 ns    11843 ns      3.8x
rng::for_each(join_view(vector<vector<int>>))/262144         184685 ns    47666 ns      3.9x
std::for_each_n(join_view(vector<vector<char>>))/8             5.03 ns     2.44 ns      2.1x
std::for_each_n(join_view(vector<vector<char>>))/32            22.5 ns     2.80 ns      8.0x
std::for_each_n(join_view(vector<vector<char>>))/50            30.5 ns     5.26 ns      5.8x
std::for_each_n(join_view(vector<vector<char>>))/1024           478 ns     51.9 ns      9.2x
std::for_each_n(join_view(vector<vector<char>>))/4096          1896 ns      165 ns     11.5x
std::for_each_n(join_view(vector<vector<char>>))/8192          3867 ns      346 ns     11.2x
std::for_each_n(join_view(vector<vector<char>>))/16384         7660 ns      682 ns     11.2x
std::for_each_n(join_view(vector<vector<char>>))/65536        30498 ns     4234 ns      7.2x
std::for_each_n(join_view(vector<vector<char>>))/262144      122379 ns    12491 ns      9.8x
std::for_each_n(join_view(vector<vector<short>>))/8            5.59 ns     2.42 ns      2.3x
std::for_each_n(join_view(vector<vector<short>>))/32           22.8 ns     2.50 ns      9.1x
std::for_each_n(join_view(vector<vector<short>>))/50           30.0 ns     5.05 ns      5.9x
std::for_each_n(join_view(vector<vector<short>>))/1024          481 ns     42.9 ns     11.2x
std::for_each_n(join_view(vector<vector<short>>))/4096         1943 ns      199 ns      9.8x
std::for_each_n(join_view(vector<vector<short>>))/8192         3840 ns      371 ns     10.3x
std::for_each_n(join_view(vector<vector<short>>))/16384        7638 ns      728 ns     10.5x
std::for_each_n(join_view(vector<vector<short>>))/65536       31207 ns     2920 ns     10.7x
std::for_each_n(join_view(vector<vector<short>>))/262144     125150 ns    11799 ns     10.6x
std::for_each_n(join_view(vector<vector<int>>))/8              5.40 ns     2.90 ns      1.9x
std::for_each_n(join_view(vector<vector<int>>))/32             21.6 ns     6.82 ns      3.2x
std::for_each_n(join_view(vector<vector<int>>))/50             29.0 ns     9.53 ns      3.0x
std::for_each_n(join_view(vector<vector<int>>))/1024            473 ns      173 ns      2.7x
std::for_each_n(join_view(vector<vector<int>>))/4096           1890 ns      707 ns      2.7x
std::for_each_n(join_view(vector<vector<int>>))/8192           3763 ns     1397 ns      2.7x
std::for_each_n(join_view(vector<vector<int>>))/16384          7690 ns     2835 ns      2.7x
std::for_each_n(join_view(vector<vector<int>>))/65536         30403 ns    11352 ns      2.7x
std::for_each_n(join_view(vector<vector<int>>))/262144       124215 ns    46235 ns      2.7x
rng::for_each_n(join_view(vector<vector<char>>))/8             5.93 ns     2.39 ns      2.5x
rng::for_each_n(join_view(vector<vector<char>>))/32            26.4 ns     2.84 ns      9.3x
rng::for_each_n(join_view(vector<vector<char>>))/50            38.6 ns     5.59 ns      6.9x
rng::for_each_n(join_view(vector<vector<char>>))/1024           686 ns     44.0 ns     15.6x
rng::for_each_n(join_view(vector<vector<char>>))/4096          3223 ns      172 ns     18.7x
rng::for_each_n(join_view(vector<vector<char>>))/8192          8771 ns      352 ns     24.9x
rng::for_each_n(join_view(vector<vector<char>>))/16384        15115 ns      701 ns     21.6x
rng::for_each_n(join_view(vector<vector<char>>))/65536        62153 ns     3017 ns     20.6x
rng::for_each_n(join_view(vector<vector<char>>))/262144      249936 ns    11436 ns     21.9x
rng::for_each_n(join_view(vector<vector<short>>))/8            7.30 ns     2.52 ns      2.9x
rng::for_each_n(join_view(vector<vector<short>>))/32           30.6 ns     2.47 ns     12.4x
rng::for_each_n(join_view(vector<vector<short>>))/50           37.1 ns     4.78 ns      7.8x
rng::for_each_n(join_view(vector<vector<short>>))/1024          674 ns     36.8 ns     18.3x
rng::for_each_n(join_view(vector<vector<short>>))/4096         2686 ns      141 ns     19.0x
rng::for_each_n(join_view(vector<vector<short>>))/8192         5415 ns      273 ns     19.8x
rng::for_each_n(join_view(vector<vector<short>>))/16384       12075 ns      523 ns     23.1x
rng::for_each_n(join_view(vector<vector<short>>))/65536       45979 ns     2495 ns     18.4x
rng::for_each_n(join_view(vector<vector<short>>))/262144     188528 ns    10266 ns     18.4x
rng::for_each_n(join_view(vector<vector<int>>))/8              6.71 ns     2.89 ns      2.3x
rng::for_each_n(join_view(vector<vector<int>>))/32             26.1 ns     6.48 ns      4.0x
rng::for_each_n(join_view(vector<vector<int>>))/50             37.6 ns     9.55 ns      3.9x
rng::for_each_n(join_view(vector<vector<int>>))/1024            636 ns      168 ns      3.8x
rng::for_each_n(join_view(vector<vector<int>>))/4096           2657 ns      697 ns      3.8x
rng::for_each_n(join_view(vector<vector<int>>))/8192           5082 ns     1363 ns      3.7x
rng::for_each_n(join_view(vector<vector<int>>))/16384         10629 ns     2764 ns      3.8x
rng::for_each_n(join_view(vector<vector<int>>))/65536         42324 ns    11006 ns      3.8x
rng::for_each_n(join_view(vector<vector<int>>))/262144       169755 ns    44317 ns      3.8x
---------------------------------------------------------------------------------------------
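The join_view rows above are consistent with handling each inner range as a unit instead of advancing a flattening iterator one element at a time. A minimal sketch of that idea, assuming a plain vector<vector<T>> as the segmented structure; `for_each_by_segment` is a hypothetical name used for illustration only and is not libc++'s internal implementation:

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical helper for illustration only; not a libc++ API.
// Applies f to every element by running one tight std::for_each loop per
// inner vector (one "segment"), rather than stepping a flattening iterator.
template <class T, class Func>
Func for_each_by_segment(std::vector<std::vector<T>>& segments, Func f) {
  for (auto& segment : segments) {
    // std::ref keeps a single function object alive across all segments,
    // so any state it carries is shared by every inner loop.
    std::for_each(segment.begin(), segment.end(), std::ref(f));
  }
  return f;
}
```

Each inner loop runs over contiguous storage, which is the kind of loop a compiler can usually unroll or vectorize; empty segments simply contribute no iterations.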

@winner245 winner245 marked this pull request as ready for review March 25, 2025 15:59
@winner245 winner245 requested a review from a team as a code owner March 25, 2025 15:59
@llvmbot llvmbot added the libc++ label (libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.) Mar 25, 2025
@llvmbot
Member

llvmbot commented Mar 25, 2025

@llvm/pr-subscribers-libcxx

Author: Peng Liu (winner245)

Changes

This patch extends segmented iterator optimizations, previously applied to std::for_each, to std::for_each_n, std::ranges::for_each, and std::ranges::for_each_n by forwarding to std::for_each. New tests validate these optimizations for segmented iterators (e.g., deque<int> and join_view iterators). Benchmarks demonstrate up to 3.9x performance improvement for deque<int> iterators, aligning their performance with contiguous iterators (e.g., vector<int>). The vector<int> performance serves as a baseline for contiguous iterators, representing the upper bound for segmented deque<int> inputs.

Addresses a subtask of #102817.

for_each_n

--------------------------------------------------------------------------------
Benchmark                                       Before          After    Speedup
--------------------------------------------------------------------------------
std::for_each_n(deque<int>)/8                  5.31 ns         3.39 ns      1.6x
std::for_each_n(deque<int>)/32                 20.1 ns         6.89 ns      2.9x
std::for_each_n(deque<int>)/1024                612 ns          171 ns      3.6x
std::for_each_n(deque<int>)/8192               4892 ns         1350 ns      3.6x
std::for_each_n(deque<int>)/16384              9786 ns         2774 ns      3.5x
std::for_each_n(deque<int>)/65536             39026 ns        11339 ns      3.4x
std::for_each_n(deque<int>)/262144           157897 ns        45166 ns      3.5x
std::for_each_n(deque<int>)/1048576          643836 ns       184999 ns      3.5x
rng::for_each_n(deque<int>)/8                  4.85 ns         4.94 ns      1.0x
rng::for_each_n(deque<int>)/32                 18.1 ns         8.47 ns      2.1x
rng::for_each_n(deque<int>)/1024                622 ns          171 ns      3.6x
rng::for_each_n(deque<int>)/8192               5008 ns         1363 ns      3.7x
rng::for_each_n(deque<int>)/16384              9952 ns         2744 ns      3.6x
rng::for_each_n(deque<int>)/65536             40204 ns        10841 ns      3.7x
rng::for_each_n(deque<int>)/262144           157713 ns        43386 ns      3.6x
rng::for_each_n(deque<int>)/1048576          637549 ns       177042 ns      3.6x
std::for_each_n(vector<int>)/8                 2.91 ns         2.94 ns      1.0x
std::for_each_n(vector<int>)/32                5.42 ns         5.54 ns      1.0x
std::for_each_n(vector<int>)/1024               161 ns          165 ns      1.0x
std::for_each_n(vector<int>)/8192              1271 ns         1292 ns      1.0x
std::for_each_n(vector<int>)/16384             2556 ns         2619 ns      1.0x
std::for_each_n(vector<int>)/65536            10125 ns        10659 ns      1.0x
std::for_each_n(vector<int>)/262144           44572 ns        44372 ns      1.0x
std::for_each_n(vector<int>)/1048576         180804 ns       183389 ns      1.0x
rng::for_each_n(vector<int>)/8                 3.05 ns         3.05 ns      1.0x
rng::for_each_n(vector<int>)/32                5.71 ns         5.85 ns      1.0x
rng::for_each_n(vector<int>)/1024               167 ns          183 ns      0.9x
rng::for_each_n(vector<int>)/8192              1298 ns         1429 ns      0.9x
rng::for_each_n(vector<int>)/16384             2691 ns         2870 ns      0.9x
rng::for_each_n(vector<int>)/65536            10632 ns        11465 ns      0.9x
rng::for_each_n(vector<int>)/262144           53031 ns        45948 ns      1.2x
rng::for_each_n(vector<int>)/1048576         174328 ns       184270 ns      0.9x

for_each

--------------------------------------------------------------------------------
Benchmark                                     Before           After     Speedup
--------------------------------------------------------------------------------
std::for_each(deque<int>)/8                  3.18 ns         2.96 ns        1.1x
std::for_each(deque<int>)/32                 5.70 ns         5.54 ns        1.0x
std::for_each(deque<int>)/1024                183 ns          180 ns        1.0x
std::for_each(deque<int>)/8192               1435 ns         1422 ns        1.0x
std::for_each(deque<int>)/16384              2885 ns         2879 ns        1.0x
std::for_each(deque<int>)/65536             11423 ns        11378 ns        1.0x
std::for_each(deque<int>)/262144            45203 ns        43686 ns        1.0x
std::for_each(deque<int>)/1048576          181832 ns       173832 ns        1.0x
rng::for_each(deque<int>)/8                  5.10 ns         3.75 ns        1.4x
rng::for_each(deque<int>)/32                 23.5 ns         7.49 ns        3.1x
rng::for_each(deque<int>)/1024                693 ns          184 ns        3.8x
rng::for_each(deque<int>)/8192               5522 ns         1430 ns        3.9x
rng::for_each(deque<int>)/16384             11112 ns         2930 ns        3.8x
rng::for_each(deque<int>)/65536             44390 ns        11656 ns        3.8x
rng::for_each(deque<int>)/262144           179419 ns        46582 ns        3.9x
rng::for_each(deque<int>)/1048576          711406 ns       189658 ns        3.8x
std::for_each(vector<int>)/8                 2.96 ns         2.91 ns        1.0x
std::for_each(vector<int>)/32                5.54 ns         5.49 ns        1.0x
std::for_each(vector<int>)/1024               165 ns          162 ns        1.0x
std::for_each(vector<int>)/8192              1269 ns         1257 ns        1.0x
std::for_each(vector<int>)/16384             2636 ns         2567 ns        1.0x
std::for_each(vector<int>)/65536            10231 ns        10215 ns        1.0x
std::for_each(vector<int>)/262144           41544 ns        40719 ns        1.0x
std::for_each(vector<int>)/1048576         173667 ns       167878 ns        1.0x
rng::for_each(vector<int>)/8                 3.09 ns         3.06 ns        1.0x
rng::for_each(vector<int>)/32                5.85 ns         5.77 ns        1.0x
rng::for_each(vector<int>)/1024               179 ns          168 ns        1.1x
rng::for_each(vector<int>)/8192              1346 ns         1309 ns        1.0x
rng::for_each(vector<int>)/16384             2714 ns         2664 ns        1.0x
rng::for_each(vector<int>)/65536            10979 ns        10523 ns        1.0x
rng::for_each(vector<int>)/262144           42994 ns        42535 ns        1.0x
rng::for_each(vector<int>)/1048576         175633 ns       173933 ns        1.0x

Full diff: https://github.com/llvm/llvm-project/pull/132896.diff

8 Files Affected:

  • (modified) libcxx/include/__algorithm/for_each_n.h (+24-1)
  • (modified) libcxx/include/__algorithm/ranges_for_each.h (+11-3)
  • (modified) libcxx/include/__algorithm/ranges_for_each_n.h (+11-4)
  • (added) libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp (+57)
  • (modified) libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp (+1-1)
  • (modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp (+82-38)
  • (modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp (+41-5)
  • (modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp (+44-2)
diff --git a/libcxx/include/__algorithm/for_each_n.h b/libcxx/include/__algorithm/for_each_n.h
index fce380b49df3e..3d91124432f56 100644
--- a/libcxx/include/__algorithm/for_each_n.h
+++ b/libcxx/include/__algorithm/for_each_n.h
@@ -10,7 +10,11 @@
 #ifndef _LIBCPP___ALGORITHM_FOR_EACH_N_H
 #define _LIBCPP___ALGORITHM_FOR_EACH_N_H
 
+#include <__algorithm/for_each.h>
 #include <__config>
+#include <__iterator/iterator_traits.h>
+#include <__iterator/segmented_iterator.h>
+#include <__type_traits/enable_if.h>
 #include <__utility/convert_to_integral.h>
 
 #if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
@@ -21,7 +25,13 @@ _LIBCPP_BEGIN_NAMESPACE_STD
 
 #if _LIBCPP_STD_VER >= 17
 
-template <class _InputIterator, class _Size, class _Function>
+template <class _InputIterator,
+          class _Size,
+          class _Function,
+          __enable_if_t<!__is_segmented_iterator<_InputIterator>::value ||
+                            (__has_input_iterator_category<_InputIterator>::value &&
+                             !__has_random_access_iterator_category<_InputIterator>::value),
+                        int> = 0>
 inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _InputIterator
 for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
   typedef decltype(std::__convert_to_integral(__orig_n)) _IntegralSize;
@@ -34,6 +44,19 @@ for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
   return __first;
 }
 
+template <class _InputIterator,
+          class _Size,
+          class _Function,
+          __enable_if_t<__is_segmented_iterator<_InputIterator>::value &&
+                            __has_random_access_iterator_category<_InputIterator>::value,
+                        int> = 0>
+inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _InputIterator
+for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
+  _InputIterator __last = __first + __orig_n;
+  std::for_each(__first, __last, __f);
+  return __last;
+}
+
 #endif
 
 _LIBCPP_END_NAMESPACE_STD
diff --git a/libcxx/include/__algorithm/ranges_for_each.h b/libcxx/include/__algorithm/ranges_for_each.h
index de39bc5522753..475f85366188e 100644
--- a/libcxx/include/__algorithm/ranges_for_each.h
+++ b/libcxx/include/__algorithm/ranges_for_each.h
@@ -9,6 +9,7 @@
 #ifndef _LIBCPP___ALGORITHM_RANGES_FOR_EACH_H
 #define _LIBCPP___ALGORITHM_RANGES_FOR_EACH_H
 
+#include <__algorithm/for_each.h>
 #include <__algorithm/in_fun_result.h>
 #include <__config>
 #include <__functional/identity.h>
@@ -41,9 +42,16 @@ struct __for_each {
   template <class _Iter, class _Sent, class _Proj, class _Func>
   _LIBCPP_HIDE_FROM_ABI constexpr static for_each_result<_Iter, _Func>
   __for_each_impl(_Iter __first, _Sent __last, _Func& __func, _Proj& __proj) {
-    for (; __first != __last; ++__first)
-      std::invoke(__func, std::invoke(__proj, *__first));
-    return {std::move(__first), std::move(__func)};
+    if constexpr (random_access_iterator<_Iter> && sized_sentinel_for<_Sent, _Iter>) {
+      auto __n   = __last - __first;
+      auto __end = __first + __n;
+      std::for_each(__first, __end, [&](auto&& __val) { std::invoke(__func, std::invoke(__proj, __val)); });
+      return {std::move(__end), std::move(__func)};
+    } else {
+      for (; __first != __last; ++__first)
+        std::invoke(__func, std::invoke(__proj, *__first));
+      return {std::move(__first), std::move(__func)};
+    }
   }
 
 public:
diff --git a/libcxx/include/__algorithm/ranges_for_each_n.h b/libcxx/include/__algorithm/ranges_for_each_n.h
index 603cb723233c8..3108d66001295 100644
--- a/libcxx/include/__algorithm/ranges_for_each_n.h
+++ b/libcxx/include/__algorithm/ranges_for_each_n.h
@@ -9,6 +9,7 @@
 #ifndef _LIBCPP___ALGORITHM_RANGES_FOR_EACH_N_H
 #define _LIBCPP___ALGORITHM_RANGES_FOR_EACH_N_H
 
+#include <__algorithm/for_each.h>
 #include <__algorithm/in_fun_result.h>
 #include <__config>
 #include <__functional/identity.h>
@@ -40,11 +41,17 @@ struct __for_each_n {
   template <input_iterator _Iter, class _Proj = identity, indirectly_unary_invocable<projected<_Iter, _Proj>> _Func>
   _LIBCPP_HIDE_FROM_ABI constexpr for_each_n_result<_Iter, _Func>
   operator()(_Iter __first, iter_difference_t<_Iter> __count, _Func __func, _Proj __proj = {}) const {
-    while (__count-- > 0) {
-      std::invoke(__func, std::invoke(__proj, *__first));
-      ++__first;
+    if constexpr (random_access_iterator<_Iter>) {
+      auto __last = __first + __count;
+      std::for_each(__first, __last, [&](auto&& __val) { std::invoke(__func, std::invoke(__proj, __val)); });
+      return {std::move(__last), std::move(__func)};
+    } else {
+      while (__count-- > 0) {
+        std::invoke(__func, std::invoke(__proj, *__first));
+        ++__first;
+      }
+      return {std::move(__first), std::move(__func)};
     }
-    return {std::move(__first), std::move(__func)};
   }
 };
 
diff --git a/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp b/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp
new file mode 100644
index 0000000000000..af46371881577
--- /dev/null
+++ b/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp
@@ -0,0 +1,57 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// UNSUPPORTED: c++03, c++11, c++14, c++17
+
+#include <algorithm>
+#include <cstddef>
+#include <deque>
+#include <list>
+#include <string>
+#include <vector>
+
+#include <benchmark/benchmark.h>
+
+int main(int argc, char** argv) {
+  auto std_for_each_n = [](auto first, auto n, auto f) { return std::for_each_n(first, n, f); };
+
+  // {std,ranges}::for_each_n
+  {
+    auto bm = []<class Container>(std::string name, auto for_each_n) {
+      benchmark::RegisterBenchmark(
+          name,
+          [for_each_n](auto& st) {
+            std::size_t const n = st.range(0);
+            Container c(n, 1);
+            auto first = c.begin();
+
+            for ([[maybe_unused]] auto _ : st) {
+              benchmark::DoNotOptimize(c);
+              auto result = for_each_n(first, n, [](int& x) { x = std::clamp(x, 10, 100); });
+              benchmark::DoNotOptimize(result);
+            }
+          })
+          ->Arg(8)
+          ->Arg(32)
+          ->Arg(50) // non power-of-two
+          ->Arg(8192)
+          ->Arg(1 << 20);
+    };
+    bm.operator()<std::vector<int>>("std::for_each_n(vector<int>)", std_for_each_n);
+    bm.operator()<std::deque<int>>("std::for_each_n(deque<int>)", std_for_each_n);
+    bm.operator()<std::list<int>>("std::for_each_n(list<int>)", std_for_each_n);
+    bm.operator()<std::vector<int>>("rng::for_each_n(vector<int>)", std::ranges::for_each_n);
+    bm.operator()<std::deque<int>>("rng::for_each_n(deque<int>)", std::ranges::for_each_n);
+    bm.operator()<std::list<int>>("rng::for_each_n(list<int>)", std::ranges::for_each_n);
+  }
+
+  benchmark::Initialize(&argc, argv);
+  benchmark::RunSpecifiedBenchmarks();
+  benchmark::Shutdown();
+  return 0;
+}
diff --git a/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp b/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
index dd026444330ea..beb4c7f675a6e 100644
--- a/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
+++ b/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
@@ -258,7 +258,7 @@ constexpr bool all_the_algorithms()
 int main(int, char**)
 {
     all_the_algorithms();
-    static_assert(all_the_algorithms());
+    // static_assert(all_the_algorithms());
 
     return 0;
 }
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
index 371f6c92f1ed1..42f1a41a27096 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
@@ -13,69 +13,113 @@
 //    constexpr InputIterator      // constexpr after C++17
 //    for_each_n(InputIterator first, Size n, Function f);
 
-
 #include <algorithm>
 #include <cassert>
+#include <deque>
 #include <functional>
+#include <iterator>
+#include <ranges>
+#include <vector>
 
 #include "test_macros.h"
 #include "test_iterators.h"
 
-#if TEST_STD_VER > 17
-TEST_CONSTEXPR bool test_constexpr() {
-    int ia[] = {1, 3, 6, 7};
-    int expected[] = {3, 5, 8, 9};
-    const std::size_t N = 4;
+struct for_each_test {
+  TEST_CONSTEXPR for_each_test(int c) : count(c) {}
+  int count;
+  TEST_CONSTEXPR_CXX14 void operator()(int& i) {
+    ++i;
+    ++count;
+  }
+};
 
-    auto it = std::for_each_n(std::begin(ia), N, [](int &a) { a += 2; });
-    return it == (std::begin(ia) + N)
-        && std::equal(std::begin(ia), std::end(ia), std::begin(expected))
-        ;
-    }
-#endif
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
 
-struct for_each_test
-{
-    for_each_test(int c) : count(c) {}
-    int count;
-    void operator()(int& i) {++i; ++count;}
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
 };
 
-int main(int, char**)
-{
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::for_each_n(d.begin(), d.size(), deque_test(d, index));
+  }
+}
+
+TEST_CONSTEXPR_CXX20 bool test() {
+  {
     typedef cpp17_input_iterator<int*> Iter;
-    int ia[] = {0, 1, 2, 3, 4, 5};
-    const unsigned s = sizeof(ia)/sizeof(ia[0]);
+    int ia[]         = {0, 1, 2, 3, 4, 5};
+    const unsigned s = sizeof(ia) / sizeof(ia[0]);
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), 0, std::ref(f));
-    assert(it == Iter(ia));
-    assert(f.count == 0);
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), 0, std::ref(f));
+      assert(it == Iter(ia));
+      assert(f.count == 0);
     }
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), s, std::ref(f));
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), s, std::ref(f));
 
-    assert(it == Iter(ia+s));
-    assert(f.count == s);
-    for (unsigned i = 0; i < s; ++i)
-        assert(ia[i] == static_cast<int>(i+1));
+      assert(it == Iter(ia + s));
+      assert(f.count == s);
+      for (unsigned i = 0; i < s; ++i)
+        assert(ia[i] == static_cast<int>(i + 1));
     }
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), 1, std::ref(f));
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), 1, std::ref(f));
 
-    assert(it == Iter(ia+1));
-    assert(f.count == 1);
-    for (unsigned i = 0; i < 1; ++i)
-        assert(ia[i] == static_cast<int>(i+2));
+      assert(it == Iter(ia + 1));
+      assert(f.count == 1);
+      for (unsigned i = 0; i < 1; ++i)
+        assert(ia[i] == static_cast<int>(i + 2));
     }
+  }
+
+#if TEST_STD_VER > 11
+  {
+    int ia[]            = {1, 3, 6, 7};
+    int expected[]      = {3, 5, 8, 9};
+    const std::size_t N = 4;
+
+    auto it = std::for_each_n(std::begin(ia), N, [](int& a) { a += 2; });
+    assert(it == (std::begin(ia) + N) && std::equal(std::begin(ia), std::end(ia), std::begin(expected)));
+  }
+#endif
+
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+#if TEST_STD_VER >= 20
+  { // Make sure that the segmented iterator optimization works during constant evaluation
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::for_each_n(v.begin(), std::ranges::distance(v), [i = 0](int& a) mutable { assert(a == i++); });
+  }
+#endif
+
+  return true;
+}
 
+int main(int, char**) {
+  assert(test());
 #if TEST_STD_VER > 17
-    static_assert(test_constexpr());
+  static_assert(test());
 #endif
 
   return 0;
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
index 8b9b6e82cbcb2..2f4bfb9db6dba 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
@@ -20,7 +20,10 @@
 
 #include <algorithm>
 #include <array>
+#include <cassert>
+#include <deque>
 #include <ranges>
+#include <vector>
 
 #include "almost_satisfies_types.h"
 #include "test_iterators.h"
@@ -30,7 +33,7 @@ struct Callable {
 };
 
 template <class Iter, class Sent = Iter>
-concept HasForEachIt = requires (Iter iter, Sent sent) { std::ranges::for_each(iter, sent, Callable{}); };
+concept HasForEachIt = requires(Iter iter, Sent sent) { std::ranges::for_each(iter, sent, Callable{}); };
 
 static_assert(HasForEachIt<int*>);
 static_assert(!HasForEachIt<InputIteratorNotDerivedFrom>);
@@ -47,7 +50,7 @@ static_assert(!HasForEachItFunc<IndirectUnaryPredicateNotPredicate>);
 static_assert(!HasForEachItFunc<IndirectUnaryPredicateNotCopyConstructible>);
 
 template <class Range>
-concept HasForEachR = requires (Range range) { std::ranges::for_each(range, Callable{}); };
+concept HasForEachR = requires(Range range) { std::ranges::for_each(range, Callable{}); };
 
 static_assert(HasForEachR<UncheckedRange<int*>>);
 static_assert(!HasForEachR<InputRangeNotDerivedFrom>);
@@ -68,7 +71,7 @@ constexpr void test_iterator() {
   { // simple test
     {
       auto func = [i = 0](int& a) mutable { a += i++; };
-      int a[] = {1, 6, 3, 4};
+      int a[]   = {1, 6, 3, 4};
       std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> decltype(auto) ret =
           std::ranges::for_each(Iter(a), Sent(Iter(a + 4)), func);
       assert(a[0] == 1);
@@ -81,8 +84,8 @@ constexpr void test_iterator() {
       assert(i == 4);
     }
     {
-      auto func = [i = 0](int& a) mutable { a += i++; };
-      int a[] = {1, 6, 3, 4};
+      auto func  = [i = 0](int& a) mutable { a += i++; };
+      int a[]    = {1, 6, 3, 4};
       auto range = std::ranges::subrange(Iter(a), Sent(Iter(a + 4)));
       std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> decltype(auto) ret =
           std::ranges::for_each(range, func);
@@ -110,6 +113,30 @@ constexpr void test_iterator() {
   }
 }
 
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
+
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
+};
+
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::ranges::for_each(d, deque_test(d, index));
+  }
+}
+
 constexpr bool test() {
   test_iterator<cpp17_input_iterator<int*>, sentinel_wrapper<cpp17_input_iterator<int*>>>();
   test_iterator<cpp20_input_iterator<int*>, sentinel_wrapper<cpp20_input_iterator<int*>>>();
@@ -146,6 +173,15 @@ constexpr bool test() {
     }
   }
 
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+  {
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::ranges::for_each(v, [i = 0](int x) mutable { assert(x == 2 * i++); }, [](int x) { return 2 * x; });
+  }
+
   return true;
 }
 
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
index d4b2d053d08ce..ad1447b7348f5 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
@@ -17,7 +17,12 @@
 
 #include <algorithm>
 #include <array>
+#include <cassert>
+#include <deque>
+#include <iterator>
 #include <ranges>
+#include <ranges>
+#include <vector>
 
 #include "almost_satisfies_types.h"
 #include "test_iterators.h"
@@ -27,7 +32,7 @@ struct Callable {
 };
 
 template <class Iter>
-concept HasForEachN = requires (Iter iter) { std::ranges::for_each_n(iter, 0, Callable{}); };
+concept HasForEachN = requires(Iter iter) { std::ranges::for_each_n(iter, 0, Callable{}); };
 
 static_assert(HasForEachN<int*>);
 static_assert(!HasForEachN<InputIteratorNotDerivedFrom>);
@@ -45,7 +50,7 @@ template <class Iter>
 constexpr void test_iterator() {
   { // simple test
     auto func = [i = 0](int& a) mutable { a += i++; };
-    int a[] = {1, 6, 3, 4};
+    int a[]   = {1, 6, 3, 4};
     std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> auto ret =
         std::ranges::for_each_n(Iter(a), 4, func);
     assert(a[0] == 1);
@@ -64,6 +69,30 @@ constexpr void test_iterator() {
   }
 }
 
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
+
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
+};
+
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::ranges::for_each_n(d.begin(), d.size(), deque_test(d, index));
+  }
+}
+
 constexpr bool test() {
   test_iterator<cpp17_input_iterator<int*>>();
   test_iterator<cpp20_input_iterator<int*>>();
@@ -89,6 +118,19 @@ constexpr bool test() {
     assert(a[2].other == 6);
   }
 
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+  {
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::ranges::for_each_n(
+        v.begin(),
+        std::ranges::distance(v),
+        [i = 0](int x) mutable { assert(x == 2 * i++); },
+        [](int x) { return 2 * x; });
+  }
+
   return true;
 }
 

@winner245 force-pushed the for-each-segment branch 2 times, most recently from 16438be to 047acfd (March 27, 2025 01:08)
Member

@ldionne left a comment

Thanks for the patch! I left some comments but I think this is going to be a nice optimization.

Comment on lines 64 to 65
resulting in performance improvements of up to 21.3x for ``std::deque::iterator`` segmented inputs and 24.9x for
``join_view`` of ``vector<vector<T>>``.
Member

Suggested change
- resulting in performance improvements of up to 21.3x for ``std::deque::iterator`` segmented inputs and 24.9x for
- ``join_view`` of ``vector<vector<T>>``.
+ resulting in performance improvements of up to 21.3x for ``std::deque::iterator`` and 24.9x for
+ ``join_view`` of ``vector<vector<T>>``.

Contributor Author

Fixed.

_LIBCPP_BEGIN_NAMESPACE_STD

// __for_each_n_segment optimizes linear iteration over segmented iterators. It processes a segmented
// input range defined by (__first, __orig_n), where __first is the starting segmented iterator and
Member

Suggested change
- // input range defined by (__first, __orig_n), where __first is the starting segmented iterator and
+ // input range defined by [__first, __first + __n), where __first is the starting segmented iterator and

__orig_n is just an artifact of the conversion inside the function, let's use __n in the documentation for clarity.

Contributor Author

Done.

auto __lfirst = _Traits::__local(__first);
auto __seg_size = static_cast<_IntegralSize>(std::distance(__lfirst, __slast));

// Single-segment case: input range fits within a single segment (may not align with segment boundaries)
Member

It feels a bit like this could be merged inside the loop. But I failed to actually do it myself within a few minutes, so you can look into it but it's not a hard request.

Contributor Author

Thank you for your suggestion! I agree that the current implementation may not be in its ideal form. However, we are dealing with multiple corner cases here: single-segment vs. multi-segment ranges, partial first and/or last segments, and combinations of these. Given these complexities, I found it challenging to simplify the logic further.

While it might be possible to reduce a few lines of code, I am concerned that doing so could compromise clarity. After several attempts, I wasn't able to come up with a refactoring that I feel is an improvement over the current approach.
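
The corner cases discussed above (a partial first segment, whole middle segments, a partial last segment) can be illustrated with a small sketch. This is not the libc++ implementation: `Segments` and `for_each_n_segmented` are hypothetical names, and a vector of vectors stands in for a real segmented iterator. It only shows how a single loop can cover all the cases by clamping each step to whichever is smaller, the rest of the segment or the rest of the count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a segmented range: each inner vector is one segment.
using Segments = std::vector<std::vector<int>>;

// Applies f to n elements starting at segs[seg][offset], one segment at a time.
// Returns the number of elements visited (less than n only if the data runs out).
template <class F>
std::size_t for_each_n_segmented(Segments& segs, std::size_t seg, std::size_t offset, std::size_t n, F f) {
  std::size_t visited = 0;
  while (visited < n && seg < segs.size()) {
    std::vector<int>& s = segs[seg];
    assert(offset <= s.size());
    // Clamp to the elements left in this segment or still requested, whichever is smaller.
    const std::size_t take = std::min(s.size() - offset, n - visited);
    // Tight inner loop over contiguous storage: no per-element segment-boundary check.
    for (std::size_t i = 0; i != take; ++i)
      f(s[offset + i]);
    visited += take;
    ++seg;
    offset = 0; // only the first segment can start part-way through
  }
  return visited;
}
```

Calling `for_each_n_segmented(segs, 0, 1, 6, f)` on segments `{0, 1, 2}`, `{}`, `{3, 4}`, `{5, 6, 7, 8}` visits the values 1 through 6, crossing an empty segment and stopping part-way through the last one. Unlike the patch, this sketch relies on `size()` being O(1) for each segment.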

auto __sfirst = _Traits::__begin(__seg);
auto __slast = _Traits::__end(__seg);
auto __lfirst = _Traits::__local(__first);
auto __seg_size = static_cast<_IntegralSize>(std::distance(__lfirst, __slast));
Member

You're making an important assumption here that the local iterator is random access. If that's not the case, then this is doing a separate O(N) traversal of the segment, which might not be OK either. So I think we can only provide this function when the local iterator is random access.

Some enable_if based on the iterator category of Traits::__local_iterator is probably what we need.

Contributor Author

Great catch! The random-access assumption on local iterators is indeed needed, so I have made it explicit. Since the segmented-iterator overload of __for_each_n already requires enable_if, and __for_each_n_segment has no other overloads, I think we can use enable_if for __for_each_n and a static_assert for __for_each_n_segment. Please let me know if you think differently.
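
The two mechanisms mentioned in this exchange can be sketched side by side. The names below (`is_random_access`, `distance_is_cheap`, `checked_distance`) are hypothetical and do not appear in the patch. The sketch assumes only the standard iterator category tags: `enable_if` removes an overload from consideration, while `static_assert` turns misuse of a single, non-overloaded helper into a readable compile error.

```cpp
#include <cassert>
#include <deque>
#include <iterator>
#include <list>
#include <type_traits>

// True when It advertises (at least) the random-access iterator category.
template <class It>
using is_random_access =
    std::is_base_of<std::random_access_iterator_tag, typename std::iterator_traits<It>::iterator_category>;

// enable_if: this overload participates only for random-access iterators,
// where std::distance is O(1).
template <class It, typename std::enable_if<is_random_access<It>::value, int>::type = 0>
bool distance_is_cheap(It, It) {
  return true;
}

// Fallback overload for everything else, where std::distance walks the range.
template <class It, typename std::enable_if<!is_random_access<It>::value, int>::type = 0>
bool distance_is_cheap(It, It) {
  return false;
}

// static_assert: a helper with no alternative overload states its requirement directly.
template <class It>
typename std::iterator_traits<It>::difference_type checked_distance(It first, It last) {
  static_assert(is_random_access<It>::value, "checked_distance requires a random-access iterator");
  return last - first;
}
```

With a `std::deque<int>` the first overload is selected and `checked_distance` compiles. With a `std::list<int>` the fallback is selected, and calling `checked_distance` would fail to compile with the message above.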

Comment on lines 62 to 63
for_each(_InputIterator __first, _InputIterator __last, _Function __f) {
return std::__for_each(__first, __last, __f);
Member

I'd take a projection inside std::__for_each and create an identity projection here.

Contributor Author

Agreed. I have done the suggested change.
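
The shape of that change can be sketched as follows. This is a simplified illustration, not the libc++ code: `identity_proj`, `for_each_impl`, and `for_each_classic` are hypothetical names, and plain function-call syntax is used where the library would go through its internal invoke machinery. The implementation function takes a projection, and the classic entry point passes an identity projection, so a ranges-style caller can reuse the same implementation.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical identity projection: returns its argument unchanged.
struct identity_proj {
  template <class T>
  T&& operator()(T&& t) const noexcept {
    return std::forward<T>(t);
  }
};

// Shared implementation: applies f to the projection of each element.
template <class It, class F, class Proj>
F for_each_impl(It first, It last, F f, Proj proj) {
  for (; first != last; ++first)
    f(proj(*first));
  return f;
}

// Classic entry point: no projection parameter, so it supplies the identity.
template <class It, class F>
F for_each_classic(It first, It last, F f) {
  return for_each_impl(first, last, std::move(f), identity_proj{});
}
```

A ranges-style wrapper would call `for_each_impl` with the user's projection instead of `identity_proj{}`.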

Comment on lines 62 to 69
bm.operator()<std::vector<std::vector<char>>>("std::for_each(join_view(vector<vector<char>>))", std_for_each);
bm.operator()<std::vector<std::vector<short>>>("std::for_each(join_view(vector<vector<short>>))", std_for_each);
bm.operator()<std::vector<std::vector<int>>>("std::for_each(join_view(vector<vector<int>>))", std_for_each);
bm.operator()<std::vector<std::vector<char>>>(
"rng::for_each(join_view(vector<vector<char>>)", std::ranges::for_each);
bm.operator()<std::vector<std::vector<short>>>(
"rng::for_each(join_view(vector<vector<short>>)", std::ranges::for_each);
bm.operator()<std::vector<std::vector<int>>>("rng::for_each(join_view(vector<vector<int>>)", std::ranges::for_each);
Member

Similarly here, I would only add the int ones to keep this lightweight.

Contributor Author

Done.

Comment on lines 50 to 62
bm.operator()<std::vector<char>>("std::for_each_n(vector<char>)", std_for_each_n);
bm.operator()<std::deque<char>>("std::for_each_n(deque<char>)", std_for_each_n);
bm.operator()<std::list<char>>("std::for_each_n(list<char>)", std_for_each_n);
bm.operator()<std::vector<char>>("rng::for_each_n(vector<char>)", std::ranges::for_each_n);
bm.operator()<std::deque<char>>("rng::for_each_n(deque<char>)", std::ranges::for_each_n);
bm.operator()<std::list<char>>("rng::for_each_n(list<char>)", std::ranges::for_each_n);

bm.operator()<std::vector<short>>("std::for_each_n(vector<short>)", std_for_each_n);
bm.operator()<std::deque<short>>("std::for_each_n(deque<short>)", std_for_each_n);
bm.operator()<std::list<short>>("std::for_each_n(list<short>)", std_for_each_n);
bm.operator()<std::vector<short>>("rng::for_each_n(vector<short>)", std::ranges::for_each_n);
bm.operator()<std::deque<short>>("rng::for_each_n(deque<short>)", std::ranges::for_each_n);
bm.operator()<std::list<short>>("rng::for_each_n(list<short>)", std::ranges::for_each_n);
Member

Suggested change
bm.operator()<std::vector<char>>("std::for_each_n(vector<char>)", std_for_each_n);
bm.operator()<std::deque<char>>("std::for_each_n(deque<char>)", std_for_each_n);
bm.operator()<std::list<char>>("std::for_each_n(list<char>)", std_for_each_n);
bm.operator()<std::vector<char>>("rng::for_each_n(vector<char>)", std::ranges::for_each_n);
bm.operator()<std::deque<char>>("rng::for_each_n(deque<char>)", std::ranges::for_each_n);
bm.operator()<std::list<char>>("rng::for_each_n(list<char>)", std::ranges::for_each_n);
bm.operator()<std::vector<short>>("std::for_each_n(vector<short>)", std_for_each_n);
bm.operator()<std::deque<short>>("std::for_each_n(deque<short>)", std_for_each_n);
bm.operator()<std::list<short>>("std::for_each_n(list<short>)", std_for_each_n);
bm.operator()<std::vector<short>>("rng::for_each_n(vector<short>)", std::ranges::for_each_n);
bm.operator()<std::deque<short>>("rng::for_each_n(deque<short>)", std::ranges::for_each_n);
bm.operator()<std::list<short>>("rng::for_each_n(list<short>)", std::ranges::for_each_n);

Contributor Author

Done.

test_segmented_deque_iterator();

#if TEST_STD_VER >= 20
{ // Make sure that the segmented iterator optimization works during constant evaluation
Member

I don't think this test is specific to constant evaluation? I think I'd remove that comment, unless I missed something.

Contributor Author

Yes, this comment is misleading. Removed.

while (__count-- > 0) {
std::invoke(__func, std::invoke(__proj, *__first));
++__first;
if constexpr (forward_iterator<_Iter>) {
Member

Why do you need to check for a forward iterator here?

Contributor Author

The problem is now fixed since ranges::for_each_n now directly calls for_each_n and for_each_n has the updated enable_if constraint.

#include <__algorithm/in_fun_result.h>
#include <__config>
#include <__functional/identity.h>
#include <__functional/invoke.h>
#include <__iterator/concepts.h>
#include <__iterator/incrementable_traits.h>
#include <__iterator/iterator_traits.h>
#include <__iterator/next.h>
Member

Suggested change
- #include <__iterator/next.h>

Unused.

Contributor Author

Removed.

Comment on lines 29 to 40
template <class _InputIterator,
class _Size,
class _Function,
class _Proj,
__enable_if_t<!__has_random_access_iterator_category<_InputIterator>::value &&
(!__is_segmented_iterator<_InputIterator>::value
// || !__has_random_access_iterator_category<
// typename __segmented_iterator_traits<_InputIterator>::__local_iterator>::value
), // TODO: __segmented_iterator_traits<_InputIterator> results in template instantiation
// during SFINAE, which is a hard error to be fixed. Once fixed, we should uncomment.
int> = 0>
Contributor Author

When using __segmented_iterator_traits<_Iterator> in SFINAE, I encountered a hard error caused by the instantiation of __segmented_iterator_traits<_Iterator> for unsupported _Iterator types. This appears to be a separate issue with __segmented_iterator_traits, so I have submitted PR #134304 as a separate fix.
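
The kind of hard error described here, and one common way to sidestep it, can be sketched with stand-in types. Everything below is hypothetical (`seg_traits`, `is_segmented`, `has_random_access_local`, and the two fake iterator types), and it is not necessarily the approach taken in the referenced PR. The idea is to name the traits only inside a partial specialization that is selected after a cheap check has succeeded, so unsupported iterator types never trigger the failing instantiation.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <type_traits>

// Hypothetical traits class. Instantiating it for a type without a nested
// local_iterator fails inside the class body, which is a hard error rather
// than a substitution failure.
template <class It>
struct seg_traits {
  using local_iterator = typename It::local_iterator;
};

// Hypothetical opt-in flag that is always safe to instantiate.
template <class It>
struct is_segmented : std::false_type {};

struct fake_segmented_iterator {
  using local_iterator = int*;
};
template <>
struct is_segmented<fake_segmented_iterator> : std::true_type {};

struct fake_list_segmented_iterator {
  using local_iterator = std::list<int>::iterator;
};
template <>
struct is_segmented<fake_list_segmented_iterator> : std::true_type {};

template <class It>
using is_random_access =
    std::is_base_of<std::random_access_iterator_tag, typename std::iterator_traits<It>::iterator_category>;

// Primary template: used for non-segmented iterators; never touches seg_traits.
template <class It, bool = is_segmented<It>::value>
struct has_random_access_local : std::false_type {};

// Only this specialization names seg_traits<It>, and it is selected only
// when is_segmented<It> is true.
template <class It>
struct has_random_access_local<It, true> : is_random_access<typename seg_traits<It>::local_iterator> {};
```

Here `has_random_access_local<int*>::value` is simply `false`, whereas naming `seg_traits<int*>::local_iterator` directly in an `enable_if` condition would fail to compile.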

@winner245 force-pushed the for-each-segment branch 2 times, most recently from 8548154 to d14bde4 (April 5, 2025 02:14)
Contributor

@philnik777 left a comment

I feel like the scope of this patch is getting a bit out of hand. The title says that you're optimizing ranges::for_each{,_n}, but you're also back-porting the std::for_each optimization to C++03 and adding an optimization to std::for_each_n. Could we split this up to make it clear what changes are required for what optimizations? Also, why do we want to back-port the std::for_each optimization now? Do we think the extra complexity is worth the improved performance?

@winner245
Contributor Author

winner245 commented Apr 5, 2025

> I feel like the scope of this patch is getting a bit out of hand. The title says that you're optimizing ranges::for_each{,_n}, but you're also back-porting the std::for_each optimization to C++03 and adding an optimization to std::for_each_n. Could we split this up to make it clear what changes are required for what optimizations? Also, why do we want to back-port the std::for_each optimization now? Do we think the extra complexity is worth the improved performance?

Thank you for your feedback! I agree that the scope of the patch has expanded beyond its original intent. Initially, the goal was simple: only to extend the optimization for std::for_each to its variants ranges::for_each{,_n}. However, as the review and revision progressed, I aimed to address the inconsistent segmented iterator optimization support between for_each_n and for_each, as the optimization for for_each_n includes C++03. I think back-porting the optimization for std::for_each to C++03 could be useful as we may be able to extend the optimization to other algorithms by letting them simply forward to std::for_each (as per your comment in another PR).

However, I agree that this made the patch diverge from its original purpose and may complicate the review process. Following your suggestion, I will work on splitting it to make it clear what this patch focuses on.

-------------- Update --------------
As per your suggestion, I have split this into the following PRs, each focusing on an independent and self-contained subtask for the classical algorithms:

This separation allows the current PR to focus exclusively on optimizing the ranges algorithms. I will rebase this patch on the split pieces above once they have landed.

Labels: libc++, performance