@MatthewFluet
Member

@MatthewFluet MatthewFluet commented Jun 18, 2025

Add -profile-tail-call-opt {true|false} expert compile-time option. This option controls whether or not the SSA{,2} shrinker optimizes tail calls in the presence of profiling.

Add -profile-intro-loops-opt {true|false} expert compile-time option. This option controls whether or not the SSA IntroduceLoops optimization applies in the presence of profiling when -profile-tail-call-opt false is given, that is, when the SSA{,2} shrinker does not optimize tail calls in the presence of profiling. In particular, when -profile-tail-call-opt false but -profile-intro-loops-opt true, IntroduceLoops will recognize self non-tail calls with eta return and handler continuations as tail calls.

-profile-tail-call-opt false and -profile-intro-loops-opt false is expected to have a significant performance penalty, but can improve the accuracy of exception history. It likely worsens the accuracy of time profiling, since the profiled program (without tail call and introduce loops optimizations) will be significantly different from the non-profiled program (with tail call and introduce loops optimizations).

-profile-tail-call-opt false and -profile-intro-loops-opt true is expected to recover some of the performance penalty, at the expense of less accurate exception history; the exception history will have only one entry for the recursive function, even if the exception is raised by a deeply nested recursive (tail) call.
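The trade-off between exception history accuracy and loop conversion can be illustrated with a rough analogy in Python, which never optimizes tail calls. This is only a sketch of the idea, not MLton code: the function names are hypothetical, and Python's traceback stands in for MLton's exception history. The recursive version leaves one history entry per call, while the loop version (the shape a self tail call takes after it is turned into a loop) leaves a single entry.

```python
import traceback


def count_down_recursive(n):
    """Tail-recursive in shape; Python keeps one frame per call."""
    if n == 0:
        raise ValueError("boom")
    return count_down_recursive(n - 1)


def count_down_loop(n):
    """Same computation with the self tail call rewritten as a loop."""
    while n != 0:
        n -= 1
    raise ValueError("boom")


def history_entries(f, n):
    """Count the traceback entries that belong to f when f(n) raises."""
    try:
        f(n)
    except ValueError as exn:
        frames = traceback.extract_tb(exn.__traceback__)
        return sum(1 for frame in frames if frame.name == f.__name__)
    return 0


print(history_entries(count_down_recursive, 50))  # 51: one entry per call
print(history_entries(count_down_loop, 50))       # 1: a single entry
```

The 51 frames in the first case are also the stack growth that the loop version avoids, which is the performance side of the same trade-off.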

Profiling results:

config command                                                                                                                                                                                       
C02    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen amd64                                                                          
C03    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen amd64 -profile drop -profile-tail-call-opt true -profile-intro-loops-opt true  
C04    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen amd64 -profile drop -profile-tail-call-opt false -profile-intro-loops-opt true 
C05    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen amd64 -profile drop -profile-tail-call-opt false -profile-intro-loops-opt false
C08    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen c                                                                              
C09    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen c -profile drop -profile-tail-call-opt true -profile-intro-loops-opt true      
C10    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen c -profile drop -profile-tail-call-opt false -profile-intro-loops-opt true     
C11    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen c -profile drop -profile-tail-call-opt false -profile-intro-loops-opt false    
C14    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen llvm                                                                           
C15    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen llvm -profile drop -profile-tail-call-opt true -profile-intro-loops-opt true   
C16    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen llvm -profile drop -profile-tail-call-opt false -profile-intro-loops-opt true  
C17    /home/mtf/devel/mlton/builds/gc0622b9f4/bin/mlton @MLton max-heap 4G -- -runtime max-heap -runtime 4G -codegen llvm -profile drop -profile-tail-call-opt false -profile-intro-loops-opt false 

task_clock [email protected] (2-level)
program           `C03/C02` `C04/C02` `C05/C02` `C09/C08` `C10/C08` `C11/C08` `C15/C14` `C16/C14` `C17/C14`
barnes-hut           1.107     1.055     1.214     1.131     0.9522    1.264     0.9007    0.9306    0.9785
boyer                1.053     1.372     1.339     1.233     1.429     1.340     1.113     1.314     1.128 
checksum             1.026     1.032    10.34      1.102     1.099    18.20      1.029     1.163    18.62  
count-graphs         1.036     1.117     2.023     1.120     1.260     2.044     1.033     1.133     1.942 
DLXSimulator         1.050     1.042     2.450     1.020     1.039     2.519     1.130     0.9209    2.201 
fft                  1.038     1.076     1.692     0.9071    0.8427    1.476     1.101     1.114     1.682 
fib                  1.183     1.209     1.273     1.143     1.174     1.128     1.299     1.231     1.288 
flat-array           1.010     0.9847    3.701     0.9900    0.9762    3.655     1.212     1.262     3.145 
hamlet               1.035     1.022     1.210     1.047     1.029     1.414     1.040     1.097     1.281 
imp-for              1.072     1.034     5.059     1.023     1.044     6.872 
knuth-bendix         1.231     1.346     1.387     1.151     1.382     1.302     1.220     1.571     1.382 
lexgen               1.090     1.034     1.647     1.066     1.043     1.602     1.088     0.9897    1.535 
life                 0.9971    0.9810    2.423     0.9802    1.045     2.443     0.9818    0.9713    2.124 
logic                1.002     1.512     1.517     0.9268    1.533     1.747     1.063     1.631     1.986 
mandelbrot           0.9804    0.9926    1.754     0.9579    1.072     2.830     1.150     1.097     3.716 
matrix-multiply      0.9985    1.209     3.598     0.9716    1.096     6.408     0.9968    0.9880    8.000 
md5                  0.9943    0.9730    1.050     1.081     1.130     1.102     1.021     1.030     1.057 
merge                0.9251    1.146     1.262     0.9866    0.9012    1.009     0.9017    0.8910    1.075 
mlyacc               1.029     1.053     1.011     0.9717    1.020     1.025     1.050     1.033     1.072 
model-elimination    1.147     1.055     1.638     1.075     1.059     1.911     1.058     1.003     1.971 
mpuz                 1.059     1.077     3.122     1.020     1.094     3.827     0.9822    1.024     4.534 
nucleic              1.033     1.128     1.140     1.047     0.9743    1.053     0.9830    1.057     1.174 
peek                 1.059     1.026     3.912     1.010     1.066     7.083     0.9948    1.001     8.791 
pidigits             0.9842    0.9689    1.124     1.098     0.9231    1.125     1.110     0.9834    1.116 
ratio-regions        0.9516    1.087     6.217     1.118     1.059     7.095     1.020     0.9765    6.952 
ray                  1.067     1.077     1.242     0.9726    1.001     1.078     0.9786    1.001     1.146 
raytrace             0.9722    1.040     1.306     1.010     1.142     1.384     1.109     1.240     1.581 
simple               1.353     1.144     1.566     1.170     0.9980    1.103     1.122     0.9829    1.294 
smith-normal-form    0.9465    0.9540    0.9851    1.026     0.9895    0.9820    0.8872    0.9533    1.039 
string-concat        0.9844    0.9919    8.312     1.024     1.193     7.942     0.9610    0.9303    9.124 
tailmerge            1.051     1.003     1.980     1.044     0.9581    1.729     1.073     0.9661    1.723 
tak                  1.304     1.382     1.691     1.159     1.110     1.591     1.088     1.023     1.306 
tensor               0.9748    0.9561    2.294     1.143     1.190     3.024 
tsp                  1.056     1.036     1.217     0.9907    1.016     1.523     1.002     1.033     1.597 
tyan                 1.201     1.161     1.533     1.010     1.026     1.253     1.003     1.083     1.537 
vector-rev           1.042     1.131     9.978     0.8691    0.9528   11.91      1.016     0.9505   13.12  
vector32-concat      0.8325    0.9266    6.271     1.096     1.078     7.314     0.8915    0.9622    9.874 
vector64-concat      0.9219    1.081     5.738     0.9804    0.9389    3.569     0.8714    0.8335    5.495 
vliw                 1.132     1.055     1.460     0.8508    1.159     1.796     1.019     1.109     2.006 
wc-input1            1.016     1.032    11.17      0.8939    0.9587   11.11      0.9507    1.012    11.02  
wc-scanStream        1.005     1.121    18.73      0.9619    1.026    11.54      0.9254    0.9501   16.69  
zebra                1.082     1.059     2.532     1.082     1.091     3.880     1.044     1.003     4.285 
zern                 0.9558    1.008     1.755     1.062     1.033     1.952     0.9511    1.015     2.039 
MIN                  0.8325    0.9266    0.9851    0.8508    0.8427    0.9820    0.8714    0.8335    0.9785
GMEAN                1.042     1.080     2.334     1.032     1.065     2.474     1.029     1.050     2.555 
MAX                  1.353     1.512    18.73      1.233     1.533    18.20      1.299     1.631    18.62  
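For readers who want to reproduce the summary rows, here is a minimal sketch. It assumes (based on the column labels, not on the benchmark scripts themselves) that each cell is the profiled configuration's task_clock divided by the baseline's, and that MIN, GMEAN, and MAX are taken over each column with blank cells skipped. The helper names are hypothetical.

```python
import math


def geometric_mean(ratios):
    """Geometric mean, computed via logs to avoid overflow on long columns."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def summarize(column):
    """Summary rows for one ratio column; None marks a blank cell."""
    ratios = [r for r in column if r is not None]
    return {
        "MIN": min(ratios),
        "GMEAN": geometric_mean(ratios),
        "MAX": max(ratios),
    }


# Three C05/C02 ratios copied from the table (fib, md5, checksum),
# plus one blank cell to show that it is ignored.
sample = [1.273, 1.050, 10.34, None]
print(summarize(sample))
```

A geometric mean is the usual choice for averaging ratios because a 2x slowdown and a 2x speedup cancel out, which an arithmetic mean would not capture.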

For the benchmarks that run in all configurations (see below), -profile-tail-call-opt false -profile-intro-loops-opt false introduces considerable overhead, while -profile-tail-call-opt false -profile-intro-loops-opt true is not significantly worse than -profile-tail-call-opt true.

With -profile-tail-call-opt false -profile-intro-loops-opt false, the even-odd, output1, psdes-random, reduce, and tailfib benchmarks terminate with Out of memory at the 4G max heap size, because tail-recursive functions that are normally turned into loops are executed as non-tail-recursive functions with explosive stack growth.

Interestingly, with -profile-tail-call-opt false -profile-intro-loops-opt true, even-odd also terminates with Out of memory at the 4G max heap size.

With -codegen llvm (and either -profile-tail-call-opt true or -profile-tail-call-opt false -profile-intro-loops-opt true), LLVM is able to completely optimize away the inner loops of imp-for and tensor, leading to run times of 0 (and meaningless run-time ratios).

It may be worth considering making -profile-tail-call-opt false -profile-intro-loops-opt true the default when -const 'Exn.keepHistory true' is used in order to improve the accuracy of exception history. However, the fact that one benchmark (even-odd) exhausts heap with explosive stack growth is worrisome.

See #609.

The `-profile-tail-call-opt {true|false}` controls whether or not the
SSA{,2} shrinker optimizes tail calls in the presence of profiling.

`-profile-tail-call-opt false` is expected to have a significant
performance penalty, but can improve the accuracy of exception
history.  It likely worsens the accuracy of time profiling, since the
profiled program (without tail call optimizations) will be
significantly different from the non-profiled program (with tail call
optimizations).

The `-profile-intro-loops-opt {true|false}` controls whether or not the
SSA IntroduceLoops optimization applies in presence of profiling.

In particular, when `-profile-tail-call-opt false` but
`-profile-intro-loops-opt true`, then IntroduceLoops will recognize
self non-tail calls with eta return and handler continuations as tail
calls.  This is expected to recover some of the performance penalty of
`-profile-tail-call-opt false`, at the expense of less accurate
exception history; the exception history will have only one entry for
the recursive function, even if the exception is raised by a deeply
nested recursive (tail) call.
@MatthewFluet MatthewFluet merged commit 497710e into MLton:master Jun 18, 2025
20 checks passed
@MatthewFluet MatthewFluet deleted the profile-optional-opts branch June 18, 2025 23:43
@YawarRaza7349
Contributor

Legendary implementation speed.

Also, if -profile-tail-call-opt true -profile-intro-loops-opt false is not allowed, then perhaps it might be clearer to instead have a single ternary flag, such as -profile-tail-call-opt {true|loops-only|false}.

@MatthewFluet
Member Author

> Also, if -profile-tail-call-opt true -profile-intro-loops-opt false is not allowed, then perhaps it might be clearer to instead have a single ternary flag, such as -profile-tail-call-opt {true|loops-only|false}.

Excellent idea! It is especially nice, because we can distinguish self tail calls from non-self tail calls in the shrinker; that would allow us to revert the changes to IntroduceLoops, returning it to the "simple" implementation. With that observation, we can even more nicely have the ternary flag as {true|self-only|false}.

MatthewFluet added a commit that referenced this pull request Jun 24, 2025
Revise `-profile-tail-call-opt` to `{always|self-only|never}`

Thanks to @YawarRaza7349 for the suggestion (#614 (comment)).
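The decision that the revised ternary flag encodes can be modeled in a few lines. The sketch below is a hypothetical Python model, not MLton's shrinker code: `TailCallOpt` and `may_optimize_tail_call` are made-up names, and it assumes that `always` and `never` correspond to the earlier `true` and `false`, with `self-only` permitting only tail calls from a function to itself.

```python
from enum import Enum


class TailCallOpt(Enum):
    """The three values of the revised flag, as named in the follow-up commit."""
    ALWAYS = "always"
    SELF_ONLY = "self-only"
    NEVER = "never"


def may_optimize_tail_call(mode, caller, callee):
    """Decide whether a call in tail position may be optimized under profiling.

    caller and callee are function names; a self tail call has caller == callee.
    """
    if mode is TailCallOpt.ALWAYS:
        return True
    if mode is TailCallOpt.SELF_ONLY:
        return caller == callee
    return False


mode = TailCallOpt("self-only")  # parse the flag value from its string form
print(may_optimize_tail_call(mode, "loop", "loop"))  # True
print(may_optimize_tail_call(mode, "even", "odd"))   # False
```

A single three-valued flag also rules out the disallowed combination by construction, since there is no way to spell "optimize all tail calls but do not introduce loops".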