Optimize last_modified tracking #950

Korijn · 2025-01-22T23:12:48Z

🚀 Another massive performance boost! The skinning animation example runs at 180 fps on my machine now.

- The callback mechanism and weakrefs are completely removed from the Transform API. This required refactoring in a few areas, notably lighting, but it is a big improvement: light buffers are now only updated just before rendering, instead of on every transform update.
  - Note: I didn't look much further into the lights than needed to make the tests pass.
  - Note 2: ⚠️ This is technically a breaking change, since on_update is removed from the Transform API.
- last_modified has its own specialized implementation in AffineTransform, RecursiveTransform and Camera, to be as simple as it can be in each case.
- flag_update no longer propagates immediately. Instead, propagation happens lazily when a cached attribute is accessed, in RecursiveTransform.last_modified and Camera.last_modified (and not in AffineTransform).
- The cache decorator only accesses last_modified once instead of three times.
- WorldObjects have a _world_last_modified flag, which is used to track whether the buffers need an update when a frame is rendered.
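To illustrate the lazy scheme in the list above, here is a minimal sketch of a cache decorator that reads last_modified exactly once per access, paired with a flag_update that only bumps a local counter. This is not the pygfx implementation: the `cached` decorator, the `Node` class, and the use of a 1D offset in place of a matrix are all assumptions made for illustration, and the summed counter is only a valid stamp because parents never change in this sketch.

```python
import functools


def cached(func):
    """Cache a computed property until the owner's last_modified changes."""
    name = "_cache_" + func.__name__

    @functools.wraps(func)
    def wrapper(self):
        stamp = self.last_modified  # read exactly once per access
        entry = self.__dict__.get(name)
        if entry is not None and entry[0] == stamp:
            return entry[1]
        value = func(self)
        self.__dict__[name] = (stamp, value)
        return value

    return property(wrapper)


class Node:
    """Hypothetical stand-in for a transform with a fixed parent link."""

    def __init__(self, offset=0.0, parent=None):
        self.parent = parent
        self._offset = offset
        self._last_modified = 0
        self.computations = 0  # counts real recomputations, for inspection

    def flag_update(self):
        # Cheap: bump a local counter, nothing is pushed to children.
        self._last_modified += 1

    @property
    def last_modified(self):
        # Lazy: ancestors are only consulted when someone asks.
        if self.parent is None:
            return self._last_modified
        return self._last_modified + self.parent.last_modified

    @property
    def offset(self):
        return self._offset

    @offset.setter
    def offset(self, value):
        self._offset = value
        self.flag_update()

    @cached
    def world_offset(self):
        self.computations += 1
        base = self.parent.world_offset if self.parent is not None else 0.0
        return base + self._offset
```

Reading `child.world_offset` twice in a row computes it once. After `root.offset = 5.0`, the next read of `child.world_offset` notices the changed stamp and recomputes.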

⏭️ Next up: I am working on another improvement to eliminate redundant propagation of last_modified, but I will do that in a separate pull request.

🥳 Cool detail: with this PR, the Transform API is finally no longer at the top of the cProfile report!

ncalls: Total number of calls to the function. If there are two numbers, that means the function recursed and the first is the total number of calls and the second is the number of primitive (non-recursive) calls.

So RecursiveTransform.last_modified is called ~460k times in one full animation run, and 4M times if you include recursive calls! 😱 I think it's definitely worthwhile to keep optimizing this 😁 It may be pygfx's hottest code path!
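For readers unfamiliar with the two-number ncalls format, the following small sketch profiles a recursive function with the standard library's cProfile and pstats modules. The `count_nodes` function and the nested-list tree are made up for this example and have nothing to do with pygfx.

```python
import cProfile
import pstats


def count_nodes(node):
    """Recursively count the nodes in a nested-list tree."""
    total = 1
    for child in node:
        total += count_nodes(child)
    return total


# A small tree with 7 nodes in total (including the root).
tree = [[[], []], [[[]]]]

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    count_nodes(tree)
profiler.disable()

stats = pstats.Stats(profiler)
# The ncalls column for count_nodes reads "70/10":
# 70 calls in total, of which 10 are primitive (non-recursive).
stats.sort_stats("ncalls").print_stats("count_nodes")
```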

Korijn · 2025-01-23T22:22:39Z

Ready to merge on my end!

almarklein

Epic work 😄 🚀

panxinmiao · 2025-01-24T10:23:46Z

Great! 🚀

I am a firm believer in the approach of "traversing the scene graph before rendering and actively updating the scene's Transform Matrix." 😄

Korijn · 2025-01-24T10:52:38Z

> Great! 🚀
>
> I am a firm believer in the approach of "traversing the scene graph before rendering and actively updating the scene's Transform Matrix." 😄

I agree, I think that's where we are now 😁

panxinmiao · 2025-01-25T07:22:44Z

> ⏭️ Next up: I am working on another improvement to eliminate redundant propagation of last_modified, but I will do that in a separate pull request.

Nice to hear that 😄

I have always had an idea:

During each frame's render (in the renderer's render() method), when we traverse the scene graph to update the world matrices of nodes, is there a way to completely bypass the finer-grained propagation mechanism of RecursiveTransform based on flag_update()?

In this case, we only need to simply start from the root node (the scene object) and recursively update the world matrix for each child node.

You might think that for static scenes, updating the world matrix here is redundant (an unnecessary 4x4 matrix multiplication). However, for nearly any dynamic scene, this approach is undoubtedly faster than the one that requires tracking the update propagation state of the world transform.

This method incurs almost no additional overhead—it's just matrix multiplication. Furthermore, since the update starts from the root node and proceeds downward, the number of matrix multiplications is fixed. For each node in the scene, matrix multiplication occurs only once, and 4x4 matrix multiplication is extremely fast.

I even suspect that the cost of performing a single 4x4 matrix multiplication might be smaller than the extra performance overhead caused by RecursiveTransform automatically tracking "whether the world matrix needs to be updated." If that’s the case, even for static scenes, where the world matrix doesn’t need updating, the act of "checking whether the world matrix needs to be updated" might itself introduce enough performance overhead to counterbalance the cost of matrix multiplication.

If that’s the case, it’s like saying, "If we need to compute c = a + b, we only need to perform the addition to get the value of c, without needing to track whether a and b have changed, nor determining whether we need to recompute c."

I will try to do some testing and verification when I have time.
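The idea above can be sketched as follows: a single top-down pass that recomputes every world matrix unconditionally, with exactly one 4x4 multiplication per non-root node and no change tracking at all. The `Node` class, `matmul4`, `translation`, and `update_world` are hypothetical names for this illustration, not part of the pygfx API, and plain nested lists stand in for real matrix types.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices stored as row-major nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def translation(x, y, z):
    """Build a 4x4 translation matrix."""
    return [
        [1.0, 0.0, 0.0, x],
        [0.0, 1.0, 0.0, y],
        [0.0, 0.0, 1.0, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


class Node:
    """Hypothetical scene-graph node holding a local and a world matrix."""

    def __init__(self, local, children=()):
        self.local = local
        self.world = local
        self.children = list(children)


def update_world(node, parent_world=None):
    """Recompute all world matrices from the root down, unconditionally."""
    if parent_world is None:
        node.world = node.local
    else:
        node.world = matmul4(parent_world, node.local)
    for child in node.children:
        update_world(child, node.world)


leaf = Node(translation(0.0, 0.0, 3.0))
mid = Node(translation(0.0, 2.0, 0.0), [leaf])
root = Node(translation(1.0, 0.0, 0.0), [mid])

# Called once per frame; no flags are checked and none are set.
update_world(root)
```

After the call, the last column of `leaf.world` holds the accumulated translation (1, 2, 3). Whether this beats a tracked approach in practice is exactly the open question raised here, so the sketch makes no performance claim.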

panxinmiao · 2025-01-25T07:31:36Z

> You might think that for static scenes, updating the world matrix here is redundant (an unnecessary 4x4 matrix multiplication). However, for nearly any dynamic scene, this approach is undoubtedly faster than the one that requires tracking the update propagation state of the world transform.

In addition, if the user determines that it is a static scene (or if the user wants to fully control the update timing of the Transform matrix), we can also provide an option to disable automatic updates of the world matrix.

Korijn · 2025-01-25T08:52:33Z

I respectfully disagree. In this PR, the overhead of flag_update has been eliminated entirely. flag_update does not propagate anymore.

It's now working exactly as you describe, only when a frame is rendered are world matrices computed by traversing the scene graph.

And I have one more PR to go to make it even more efficient.

panxinmiao · 2025-01-25T09:10:05Z

> It's now working exactly as you describe, only when a frame is rendered are world matrices computed by traversing the scene graph.

Yes, I’m aware of that.

However, right now, when traversing the scene graph before rendering, we first need to determine whether the world matrix of each object needs to be updated, and only update those that actually need it. The problem is, checking whether the world matrix needs updating isn’t as simple or lightweight as it seems (because you have to check if the transformations of all its ancestor nodes have changed). I suspect that the cost of this check might not be smaller than the overhead of simply performing a single 4x4 matrix multiplication.

So, my thought is that maybe it’s better not to worry too much about whether the world matrix needs to be updated, and just update from the root node down. It might be simpler and more efficient.

Korijn · 2025-01-25T09:28:32Z

I see. Well, I think that you have predicted my next move almost exactly, so I'll ask you to wait for the next PR to come.

I will also measure the difference with your proposal!

almarklein · 2025-01-25T20:10:20Z

Maybe I'm stating the obvious here, but IIUC the plan is that during the scene-graph traversal, you check the flag (without any propagation) and when it is dirty, you update its matrix, and from that the matrices of all its children (and their children, etc.). So you kind of have the best of both worlds.
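A minimal sketch of that plan, under stated assumptions: each node carries a local dirty flag that is never propagated, and the traversal forces an update of the whole subtree once it meets a dirty node. The `Node` class, `set_local`, and `update_world` are hypothetical and do not reflect how pygfx or its cache decorator actually work.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices stored as row-major nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def translation(x, y, z):
    """Build a 4x4 translation matrix."""
    return [
        [1.0, 0.0, 0.0, x],
        [0.0, 1.0, 0.0, y],
        [0.0, 0.0, 1.0, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


class Node:
    """Hypothetical scene-graph node with a local dirty flag."""

    def __init__(self, local, children=()):
        self.local = local
        self.world = local
        self.children = list(children)
        self.dirty = True  # a new node still needs its first world matrix

    def set_local(self, local):
        self.local = local
        self.dirty = True  # flag only this node, no propagation


def update_world(node, parent_world=None, force=False):
    """Update dirty subtrees; return how many matrices were recomputed."""
    updated = 0
    if force or node.dirty:
        if parent_world is None:
            node.world = node.local
        else:
            node.world = matmul4(parent_world, node.local)
        node.dirty = False
        force = True  # everything below a changed node must follow
        updated = 1
    for child in node.children:
        updated += update_world(child, node.world, force)
    return updated


b = Node(translation(0.0, 0.0, 1.0))
a = Node(translation(0.0, 1.0, 0.0), [b])
c = Node(translation(5.0, 0.0, 0.0))
root = Node(translation(1.0, 0.0, 0.0), [a, c])

first_pass = update_world(root)   # every node is new, so all are updated
second_pass = update_world(root)  # nothing changed, nothing recomputed
```

In a static frame the traversal only reads one boolean per node, and after `a.set_local(...)` only `a` and its descendant `b` are recomputed while `c` is left alone.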

Korijn · 2025-01-25T21:02:25Z

> Maybe I'm stating the obvious here, but IIUC the plan is that during the scene-graph traversal, you check the flag (without any propagation) and when it is dirty, you update its matrix, and from that the matrices of all its children (and their children, etc.). So you kind of have the best of both worlds.

That's the plan.

The tricky bit is working around the cache decorator in this scenario. But it's not impossible. :)

almarklein mentioned this pull request Jan 23, 2025

Refactor to have a single scene traverse in the renderer #953

Merged

Korijn requested a review from almarklein January 23, 2025 22:07

Korijn force-pushed the transform-profiling branch from 41ca27e to d19369b January 23, 2025 22:08

Korijn marked this pull request as ready for review January 23, 2025 22:10

Korijn force-pushed the transform-profiling branch from 518f336 to 7643b68 January 23, 2025 22:21

Korijn added 4 commits January 24, 2025 09:41

- profile and optimize transform some more (6758025)
- make flag_update much cheaper (f6cf57a)
- that cache decorator was the real problem (aeb28a3)
- this underscore makes a big difference! less function call overhead during the propagation (fd217e5)

Korijn force-pushed the transform-profiling branch from 7643b68 to fd217e5 January 24, 2025 08:41

almarklein approved these changes Jan 24, 2025

Korijn merged commit 3f3b45d into main Jan 24, 2025
14 checks passed

Korijn deleted the transform-profiling branch January 24, 2025 08:59

This was referenced Jan 24, 2025

Auto update skeleton and skeleton_helper #894

Merged

Make bind_matrix a read-only property, and clean up np.linalg.inv. #948

Merged
