Introduce options to control profiling-specific optimizations #614

MatthewFluet · 2025-06-18T18:08:31Z

Add the -profile-tail-call-opt {true|false} expert compile-time option, which controls whether or not the SSA{,2} shrinker optimizes tail calls in the presence of profiling.

Add the -profile-intro-loops-opt {true|false} expert compile-time option, which controls whether or not the SSA IntroduceLoops optimization applies in the presence of profiling and -profile-tail-call-opt false, that is, when the SSA{,2} shrinker does not optimize tail calls in the presence of profiling. In particular, with -profile-tail-call-opt false but -profile-intro-loops-opt true, IntroduceLoops will recognize self non-tail calls with eta return and handler continuations as tail calls.

-profile-tail-call-opt false and -profile-intro-loops-opt false is expected to have a significant performance penalty, but can improve the accuracy of exception history. It likely worsens the accuracy of time profiling, since the profiled program (without tail call and introduce loops optimizations) will be significantly different from the non-profiled program (with tail call and introduce loops optimizations).

-profile-tail-call-opt false and -profile-intro-loops-opt true is expected to recover some of the performance penalty, at the expense of less accurate exception history; the exception history will have only one entry for the recursive function, even if the exception is raised by a deeply nested recursive (tail) call.
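
The trade-off described above can be pictured with a toy model. The following Python sketch is not MLton code and does not reproduce the output of `MLton.Exn.history`; it only models which function entries would be on the stack when a self tail-recursive function raises. The function name `countdown` and the frame-list representation are assumptions made for illustration.

```python
from typing import List


def simulate_history(depth: int, self_calls_become_loop: bool) -> List[str]:
    """Model the stack of function entries at the point of a raise.

    `countdown` is a hypothetical self tail-recursive function that calls
    itself `depth` times and then raises an exception.
    """
    frames = ["main", "countdown"]  # main calls countdown once
    remaining = depth
    while remaining > 0:
        if not self_calls_become_loop:
            # The self call is executed as a real call: a new frame is pushed.
            frames.append("countdown")
        # Otherwise the self call is a jump back to the loop header,
        # so no new frame is pushed.
        remaining -= 1
    return frames  # snapshot taken when the exception is raised


# Every recursive call is visible when self calls stay calls.
print(simulate_history(3, self_calls_become_loop=False))
# Only one entry for the recursive function when self calls become a loop.
print(simulate_history(3, self_calls_become_loop=True))
```

In this model the first configuration keeps one entry per recursive call (more informative, but the stack grows with the recursion depth), while the second keeps a single entry regardless of depth.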

Profiling results:

config command                                                                                                                                                                                       
C02    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen amd64                                                                          
C03    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen amd64 -profile drop -profile-tail-call-opt true -profile-intro-loops-opt true  
C04    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen amd64 -profile drop -profile-tail-call-opt false -profile-intro-loops-opt true 
C05    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen amd64 -profile drop -profile-tail-call-opt false -profile-intro-loops-opt false
C08    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen c                                                                              
C09    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen c -profile drop -profile-tail-call-opt true -profile-intro-loops-opt true      
C10    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen c -profile drop -profile-tail-call-opt false -profile-intro-loops-opt true     
C11    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen c -profile drop -profile-tail-call-opt false -profile-intro-loops-opt false    
C14    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen llvm                                                                           
C15    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen llvm -profile drop -profile-tail-call-opt true -profile-intro-loops-opt true   
C16    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen llvm -profile drop -profile-tail-call-opt false -profile-intro-loops-opt true  
C17    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen llvm -profile drop -profile-tail-call-opt false -profile-intro-loops-opt false 

task_clock [email protected] (2-level)
program           `C03/C02` `C04/C02` `C05/C02` `C09/C08` `C10/C08` `C11/C08` `C15/C14` `C16/C14` `C17/C14`
barnes-hut           1.107     1.055     1.214     1.131     0.9522    1.264     0.9007    0.9306    0.9785
boyer                1.053     1.372     1.339     1.233     1.429     1.340     1.113     1.314     1.128 
checksum             1.026     1.032    10.34      1.102     1.099    18.20      1.029     1.163    18.62  
count-graphs         1.036     1.117     2.023     1.120     1.260     2.044     1.033     1.133     1.942 
DLXSimulator         1.050     1.042     2.450     1.020     1.039     2.519     1.130     0.9209    2.201 
fft                  1.038     1.076     1.692     0.9071    0.8427    1.476     1.101     1.114     1.682 
fib                  1.183     1.209     1.273     1.143     1.174     1.128     1.299     1.231     1.288 
flat-array           1.010     0.9847    3.701     0.9900    0.9762    3.655     1.212     1.262     3.145 
hamlet               1.035     1.022     1.210     1.047     1.029     1.414     1.040     1.097     1.281 
imp-for              1.072     1.034     5.059     1.023     1.044     6.872 
knuth-bendix         1.231     1.346     1.387     1.151     1.382     1.302     1.220     1.571     1.382 
lexgen               1.090     1.034     1.647     1.066     1.043     1.602     1.088     0.9897    1.535 
life                 0.9971    0.9810    2.423     0.9802    1.045     2.443     0.9818    0.9713    2.124 
logic                1.002     1.512     1.517     0.9268    1.533     1.747     1.063     1.631     1.986 
mandelbrot           0.9804    0.9926    1.754     0.9579    1.072     2.830     1.150     1.097     3.716 
matrix-multiply      0.9985    1.209     3.598     0.9716    1.096     6.408     0.9968    0.9880    8.000 
md5                  0.9943    0.9730    1.050     1.081     1.130     1.102     1.021     1.030     1.057 
merge                0.9251    1.146     1.262     0.9866    0.9012    1.009     0.9017    0.8910    1.075 
mlyacc               1.029     1.053     1.011     0.9717    1.020     1.025     1.050     1.033     1.072 
model-elimination    1.147     1.055     1.638     1.075     1.059     1.911     1.058     1.003     1.971 
mpuz                 1.059     1.077     3.122     1.020     1.094     3.827     0.9822    1.024     4.534 
nucleic              1.033     1.128     1.140     1.047     0.9743    1.053     0.9830    1.057     1.174 
peek                 1.059     1.026     3.912     1.010     1.066     7.083     0.9948    1.001     8.791 
pidigits             0.9842    0.9689    1.124     1.098     0.9231    1.125     1.110     0.9834    1.116 
ratio-regions        0.9516    1.087     6.217     1.118     1.059     7.095     1.020     0.9765    6.952 
ray                  1.067     1.077     1.242     0.9726    1.001     1.078     0.9786    1.001     1.146 
raytrace             0.9722    1.040     1.306     1.010     1.142     1.384     1.109     1.240     1.581 
simple               1.353     1.144     1.566     1.170     0.9980    1.103     1.122     0.9829    1.294 
smith-normal-form    0.9465    0.9540    0.9851    1.026     0.9895    0.9820    0.8872    0.9533    1.039 
string-concat        0.9844    0.9919    8.312     1.024     1.193     7.942     0.9610    0.9303    9.124 
tailmerge            1.051     1.003     1.980     1.044     0.9581    1.729     1.073     0.9661    1.723 
tak                  1.304     1.382     1.691     1.159     1.110     1.591     1.088     1.023     1.306 
tensor               0.9748    0.9561    2.294     1.143     1.190     3.024 
tsp                  1.056     1.036     1.217     0.9907    1.016     1.523     1.002     1.033     1.597 
tyan                 1.201     1.161     1.533     1.010     1.026     1.253     1.003     1.083     1.537 
vector-rev           1.042     1.131     9.978     0.8691    0.9528   11.91      1.016     0.9505   13.12  
vector32-concat      0.8325    0.9266    6.271     1.096     1.078     7.314     0.8915    0.9622    9.874 
vector64-concat      0.9219    1.081     5.738     0.9804    0.9389    3.569     0.8714    0.8335    5.495 
vliw                 1.132     1.055     1.460     0.8508    1.159     1.796     1.019     1.109     2.006 
wc-input1            1.016     1.032    11.17      0.8939    0.9587   11.11      0.9507    1.012    11.02  
wc-scanStream        1.005     1.121    18.73      0.9619    1.026    11.54      0.9254    0.9501   16.69  
zebra                1.082     1.059     2.532     1.082     1.091     3.880     1.044     1.003     4.285 
zern                 0.9558    1.008     1.755     1.062     1.033     1.952     0.9511    1.015     2.039 
MIN                  0.8325    0.9266    0.9851    0.8508    0.8427    0.9820    0.8714    0.8335    0.9785
GMEAN                1.042     1.080     2.334     1.032     1.065     2.474     1.029     1.050     2.555 
MAX                  1.353     1.512    18.73      1.233     1.533    18.20      1.299     1.631    18.62
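
Each cell in the table is a run-time ratio relative to the non-profiled baseline for the same codegen, so values above 1 mean the profiled build was slower. Assuming the GMEAN row is the geometric mean of its column (the label suggests this, but the PR does not say how it was computed), a summary of that kind could be calculated as in the following Python sketch. The helper name `geometric_mean` and the three-value sample are illustrative only.

```python
import math
from typing import Iterable


def geometric_mean(ratios: Iterable[float]) -> float:
    """Geometric mean of positive run-time ratios."""
    values = list(ratios)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one ratio, and all ratios must be positive")
    # Averaging the logarithms avoids overflow from multiplying many values.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# First three C03/C02 ratios from the table (barnes-hut, boyer, checksum).
sample = [1.107, 1.053, 1.026]
print(f"{geometric_mean(sample):.3f}")
```

The geometric mean is the usual choice for summarizing ratios because a 2x slowdown and a 2x speedup cancel out, which an arithmetic mean would not capture.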

For the benchmarks that run in all configurations (see below), -profile-tail-call-opt false -profile-intro-loops-opt false introduces considerable overhead, while -profile-tail-call-opt false -profile-intro-loops-opt true is not significantly worse than -profile-tail-call-opt true.

With -profile-tail-call-opt false -profile-intro-loops-opt false, even-odd, output1, psdes-random, reduce, and tailfib terminate with Out of memory at a max heap size of 4G, because tail-recursive functions that are normally turned into loops are executed as non-tail-recursive functions with explosive stack growth.
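
As a rough analogy for this failure mode, the Python sketch below contrasts a tail-recursive function executed with real calls against the equivalent loop. Python never eliminates tail calls, so it behaves like the unoptimized case; it is only an analogy and says nothing about MLton's runtime. The function names are hypothetical.

```python
def count_down_recursive(n: int) -> int:
    """Tail-recursive in form, but every call still consumes a stack frame."""
    if n == 0:
        return 0
    return count_down_recursive(n - 1)


def count_down_loop(n: int) -> int:
    """The same computation after the self tail call is turned into a loop."""
    while n != 0:
        n -= 1
    return 0


# The loop runs in constant stack space for any depth.
print(count_down_loop(100_000))

# The recursive form exhausts Python's stack limit at the same depth.
try:
    count_down_recursive(100_000)
except RecursionError:
    print("stack exhausted")
```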

Interestingly, with -profile-tail-call-opt false -profile-intro-loops-opt true, even-odd also terminates with Out of memory at a max heap size of 4G.

With -codegen llvm, for imp-for and tensor (with -profile-tail-call-opt true or -profile-tail-call-opt false -profile-intro-loops-opt true), LLVM is able to completely optimize away the inner loops, leading to run times of 0 (and meaningless run-time ratios), so those cells are left blank in the table.

It may be worth considering making -profile-tail-call-opt false -profile-intro-loops-opt true the default when -const 'Exn.keepHistory true' is used, in order to improve the accuracy of exception history. However, the fact that one benchmark (even-odd) exhausts the heap with explosive stack growth is worrisome.

See #609.

The `-profile-tail-call-opt {true|false}` option controls whether or not the SSA{,2} shrinker optimizes tail calls in the presence of profiling. `-profile-tail-call-opt false` is expected to have a significant performance penalty, but can improve the accuracy of exception history. It likely worsens the accuracy of time profiling, since the profiled program (without tail call optimizations) will be significantly different from the non-profiled program (with tail call optimizations).

The `-profile-intro-loops-opt {true|false}` option controls whether or not the SSA IntroduceLoops optimization applies in the presence of profiling. In particular, with `-profile-tail-call-opt false` but `-profile-intro-loops-opt true`, IntroduceLoops will recognize self non-tail calls with eta return and handler continuations as tail calls. This is expected to recover some of the performance penalty of `-profile-tail-call-opt false`, at the expense of less accurate exception history; the exception history will have only one entry for the recursive function, even if the exception is raised by a deeply nested recursive (tail) call.

YawarRaza7349 · 2025-06-20T18:30:26Z

Legendary implementation speed.

Also, if -profile-tail-call-opt true -profile-intro-loops-opt false is not allowed, then perhaps it might be clearer to instead have a single ternary flag, such as -profile-tail-call-opt {true|loops-only|false}.

MatthewFluet · 2025-06-20T19:28:29Z

> Also, if -profile-tail-call-opt true -profile-intro-loops-opt false is not allowed, then perhaps it might be clearer to instead have a single ternary flag, such as -profile-tail-call-opt {true|loops-only|false}.

Excellent idea! It is especially nice, because we can distinguish self tail calls from non-self tail calls in the shrinker; that would allow us to revert the changes to IntroduceLoops, returning it to the "simple" implementation. With that observation, we can even more nicely have the ternary flag as {true|self-only|false}.
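
To make the self versus non-self distinction concrete, here is a small Python sketch of how a ternary flag could select which tail calls are optimized. It is a reading of the proposal above, not MLton's shrinker: the `TailCall` record, the function `optimized_tail_calls`, and the example function names `loop`, `even`, and `odd` are all hypothetical.

```python
from typing import List, NamedTuple


class TailCall(NamedTuple):
    caller: str  # function containing the tail call
    callee: str  # function being called in tail position


def optimized_tail_calls(calls: List[TailCall], mode: str) -> List[TailCall]:
    """Return the tail calls that would be optimized under the given mode."""
    if mode == "true":
        return list(calls)  # every tail call is optimized
    if mode == "self-only":
        # Only calls from a function to itself are optimized.
        return [c for c in calls if c.caller == c.callee]
    if mode == "false":
        return []  # no tail call is optimized
    raise ValueError(f"unknown mode: {mode!r}")


calls = [
    TailCall("loop", "loop"),  # self tail call
    TailCall("even", "odd"),   # non-self (mutually recursive) tail call
    TailCall("odd", "even"),   # non-self (mutually recursive) tail call
]
for mode in ("true", "self-only", "false"):
    print(mode, optimized_tail_calls(calls, mode))
```

Under this reading, `self-only` keeps non-self tail calls as real calls, so they remain visible as separate entries, while self tail calls are still optimized.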


MatthewFluet added 11 commits May 28, 2025 14:35

Assert that movable stmts are profiling stmts in SSA{,2} shrinker

a8b379a

Rename canMove-vars to profileStmts-vars in SSA{,2} shrinker

7994349

Improve SSA{,2} shrinking of Return and Raise transfers

124c52c

Improve variable names for SSA{,2} tail-call optimization

44bcb5a

Delete handler when SSA{,2} Shrink performs tail-call optimization

4278014

Delete handler when SSA{,2} Shrink converts eta Handle to Caller

c3b72ef

Update args of Raise/Return from Free to Formal

17551b3

Add Exp.isProfile and Statement.isProfile to structure SsaTree

aee7031

Update description of -profile-tail-call-opt compile-time option

ee7459c

MatthewFluet merged commit 497710e into MLton:master Jun 18, 2025
20 checks passed

MatthewFluet deleted the profile-optional-opts branch June 18, 2025 23:43

MatthewFluet mentioned this pull request Jun 18, 2025

MLton.Exn.history returns different callstack if the function is used in another way #609

Closed

MatthewFluet mentioned this pull request Jun 23, 2025

Revise -profile-tail-call-opt to {always|self-only|never} #615

Merged

MatthewFluet added a commit that referenced this pull request Jun 24, 2025

Merge pull request #615 from MatthewFluet/profile-tail-opt-option

8c9c850

Revise `-profile-tail-call-opt` to `{always|self-only|never}` Thanks to @YawarRaza7349 for the suggestion (#614 (comment)).
