Fuse emitting and printing of trees in the backend #4917
Conversation
Performance measurements

Sorry for the swapped colors between inc and batch.

[Figures: Incremental (including warmup run), full and cropped (same data); Batch]

Supplemental info: script to calculate "Full Backend" timing.

```r
library(readr)
library(ggplot2)
library(dplyr)
d_full <- read_csv("logger-timings-batch.csv", col_names = c("variant", "op", "t_ns"), col_types = "ffn")
d_backend <- bind_cols(
d_full %>% filter(op == "Emitter") %>% select(variant, t_ns),
d_full %>% filter(op == "BasicBackend: Write result") %>% select(t_ns) %>% rename(t_ns2 = t_ns),
) %>% mutate(op = "Full Backend", t_ns = t_ns + t_ns2, t_ns2 = NULL)
d <- bind_rows(
d_full %>% filter(grepl('Emitter|BasicBackend', op)),
d_backend,
)
ggplot(d, aes(x = op, color = variant, y = t_ns)) + geom_boxplot()
```
I've had a look at this with YourKit. Findings:
Thoughts:
I don't think I'm comfortable proceeding with this given the potential negative impacts depending on the codebase. Options I see:
FWIW: I'm going to try and benchmark the second option :)
That is probably viable, yes.
Numbers look comparable (incremental):
I have made some progress here. @sjrd there are two points I'd like your input on:
TODOs for me:
Updated benchmarking results (incremental only): [figures: full and cropped]
Note to self: the adjustments to the JS printer are incorrect: there are semicolons after control-flow statements (e.g. if-else chains).
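For illustration, a minimal sketch of the rule the note is about, using a hypothetical toy printer (not the actual `Printers.scala` code): statements that end in a block, such as if-else chains, must not get a trailing semicolon, while expression statements must.

```scala
// Hypothetical toy printer illustrating the semicolon rule; the real
// Printers.scala handles the full JS AST.
object SemicolonRule {
  sealed trait Stat
  final case class ExprStat(code: String) extends Stat
  final case class IfElse(cond: String, thenp: String, elsep: String) extends Stat

  def printStat(s: Stat): String = s match {
    case ExprStat(code) =>
      code + ";" // expression statements need a terminating semicolon
    case IfElse(cond, thenp, elsep) =>
      s"if ($cond) { $thenp } else { $elsep }" // no trailing semicolon after a block
  }
}
```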
Force-pushed from 268775c to bea04d2.
The CI failure reveals a problem with the
Fixing the show (and running scripted locally) reveals another problem: we cannot cache GCC trees that easily. It seems they keep a reference to their parents (and siblings), so caching them potentially leaks large amounts of memory, but also doesn't actually let us re-use the nodes. Maybe the best strategy is to not cache AST transforms for GCC for now?
That's probably fine. When we run GCC we're not incremental anyway, because GCC itself isn't. One level of transformation is not going to change much, especially if we have to recreate new
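A hedged sketch of why such parent back-references defeat caching (an illustrative node type, not the actual GCC `Node` API): holding on to any cached node retains its whole enclosing tree, and the node cannot be re-attached under a new parent without mutation.

```scala
// Illustrative stand-in for a GCC-style AST node that knows its parent.
final class Node(val kind: String) {
  var parent: Node = null // set when attached; GCC nodes track parents/siblings

  def addChild(child: Node): Unit = {
    require(child.parent == null, "node is already attached to a tree")
    child.parent = this
  }
}

// Caching a leaf after a run keeps its whole root reachable via `parent`
// (a memory leak), and a second run cannot reuse the leaf under a fresh
// parent because it is still attached to the old tree.
```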
Force-pushed from 0885927 to b255236.
I have added a commit (to be squashed) to not cache GCC trees. TODO: write a unit test for the GCC backend that links twice consecutively (we should not have had to rely on a scripted test to catch this).
Ensures that we catch issues like this: scala-js#4917 (comment)
PR for test is here: #4924.
```scala
(_tree, false)
// Input has not changed and we were not invalidated.
// --> nothing has changed (we recompute to save memory).
(compute, false)
```
This seems weird. Now, both sides of the `if` call `compute`, and they do not use its result other than to return it. Shouldn't we return only the `Boolean` then, and let the caller do the `compute` themselves? That would avoid the tuple and, more importantly, the `extractChangedAndWithGlobals`.
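A minimal sketch of the suggested shape, with hypothetical names standing in for the actual linker types: the cache reports only whether the input changed, and the caller invokes the computation itself, so no tuple (and no `extractChangedAndWithGlobals`) is needed.

```scala
object ChangeTrackingSketch {
  /** Hypothetical input-tracking cache; all names here are illustrative. */
  final class InputTracker[I] {
    private var last: Option[I] = None

    /** Returns true iff `input` differs from the one seen last time. */
    def trackChanged(input: I): Boolean = {
      val changed = !last.contains(input)
      last = Some(input)
      changed
    }
  }

  private def markModuleInvalidated(): Unit = () // stand-in bookkeeping hook

  /** The caller performs `compute` itself; only the Boolean crosses the API. */
  def emit[I, R](tracker: InputTracker[I], input: I)(compute: => R): R = {
    if (tracker.trackChanged(input)) markModuleInvalidated()
    compute
  }
}
```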
Hum, yes. You have discovered what I have attempted to brush under the rug 😅 If I extract `compute` after calling `trackChanged`, `org.scalajs.linker.BasicLinkerBackendTest.noInvalidatedModuleInSecondRun` fails.

I'm still at the stage of debugging where I think "this cannot happen". So, all in all, I guess this clearly warrants further investigation :P
I found the issue: the alternative code was using a short-circuiting or-assignment (`||=`), so the RHS was not always executed. Putting it into a `val` first fixed the problem. (Updated.)
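For illustration, a self-contained sketch of the pitfall (hypothetical names): `changed ||= recompute()` desugars to `changed = changed || recompute()`, so once `changed` is true, the RHS, and any side effect it carries, is silently skipped.

```scala
object ShortCircuitPitfall {
  var changed = false

  def recompute(): Boolean = {
    println("recompute ran") // side effect we must not skip
    true
  }

  def main(args: Array[String]): Unit = {
    changed = true
    changed ||= recompute() // BUG: desugars to `changed = changed || recompute()`,
                            // so recompute() is never evaluated here

    // Fix: force evaluation into a val first, then combine.
    val didChange = recompute() // always runs
    changed = changed || didChange
  }
}
```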
This allows us to use the Emitter's powerful caching mechanism to directly cache printed trees (as byte buffers) and not cache JavaScript trees anymore at all. This reduces in-between run memory usage on the test suite from 1.12 GB (not GiB) to 1.00 GB on my machine (roughly 10%). Runtime performance (both batch and incremental) is unaffected. It is worth pointing out that, due to how the Emitter caches trees, classes that end up being emitted as ES6 classes will be held twice in memory (once as the individual methods, once as the entire class). On the test suite, this is the case for 710 out of 6538 classes.
In the next commit, we want to avoid caching entire classes because of the memory cost. However, the BasicLinkerBackend relies on the identity of the generated trees to detect changes: Since that identity will change if we stop caching them, we need to provide an explicit "changed" signal.
This reduces some memory overhead at negligible performance cost. Residual (post-link) memory benchmarks for the test suite: baseline 1.13 GB, new 1.01 GB.
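To make the overall idea concrete, here is a hedged sketch (illustrative names only, not the Emitter's actual API) of fusing emitting and printing: each tree is printed at most once, and the resulting UTF-8 bytes are cached, so no JavaScript trees need to be retained between runs.

```scala
import java.nio.charset.StandardCharsets

object FusedEmitPrintSketch {
  final case class JSTree(render: String) // stand-in for the real JS AST

  final class PrintedCache {
    private val cache = scala.collection.mutable.Map.empty[String, Array[Byte]]

    /** Prints `tree` at most once per `key`; later runs reuse the bytes. */
    def getOrPrint(key: String)(tree: => JSTree): Array[Byte] =
      cache.getOrElseUpdate(key, tree.render.getBytes(StandardCharsets.UTF_8))
  }
}
```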
Cool. Looks good! Thanks!
This allows us to use the Emitter's powerful caching mechanism to
directly cache printed trees (as byte buffers) and not cache
JavaScript trees anymore at all.
This reduces in-between run memory usage on the test suite from
1.13 GB (not GiB) to 1.01 GB on my machine (roughly 10%).
Runtime performance (both batch and incremental) is unaffected.
It is worth pointing out that, in order to avoid duplicate caching, we no longer cache full class trees.