feat(pregel): add requiredSrcColumns/requiredDstColumns API for triplet memory optimization #763
…et memory optimization Signed-off-by: joelrobin18 <[email protected]>
a82cd81 to be302db
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #763      +/-   ##
==========================================
- Coverage   84.49%   84.12%   -0.38%
==========================================
  Files          66       66
  Lines        3179     3262      +83
  Branches      387      376      -11
==========================================
+ Hits         2686     2744      +58
- Misses        493      518      +25
SemyonSinchenko left a comment
Thanks a lot @joelrobin18! Looks very cool! I left two comments, and I would also like to ask you to add a couple of words about the new API to https://github.com/graphframes/graphframes/blob/main/docs/src/04-user-guide/10-pregel.md
Thanks!
SemyonSinchenko left a comment
LGTM! Nice work, thanks a lot @joelrobin18!
What changes were proposed in this pull request?
This PR adds a new API to Pregel for optimizing triplet memory consumption: requiredSrcColumns and requiredDstColumns.
Why are the changes needed?
When constructing triplets in Pregel, we currently select all source and destination vertex columns (*), creating large intermediate DataFrames in memory. This is especially problematic for algorithms with large per-vertex state (e.g., cycle detection with stored sequences, future random walks).
This optimization lets users explicitly specify which source and destination vertex columns are needed, so triplets carry only those columns instead of the full vertex state. That reduces memory pressure for large-scale graph processing.
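To make the motivation concrete, here is a minimal sketch of the idea in plain Python. It is a toy model, not the GraphFrames implementation and not the new API itself: the function build_triplets, its src_cols/dst_cols parameters, and the sample columns (rank, path) are all hypothetical names chosen for illustration. It only shows how projecting each endpoint down to the needed columns shrinks what every triplet carries.

```python
# Toy model of triplet construction. Plain Python, no Spark or GraphFrames.
# All names here (build_triplets, src_cols, dst_cols, rank, path) are
# hypothetical and exist only for this illustration.

def build_triplets(vertices, edges, src_cols=None, dst_cols=None):
    """Join each edge with its two endpoint vertices.

    vertices: dict mapping vertex id -> dict of column name -> value
    edges:    list of (src_id, dst_id) pairs
    src_cols / dst_cols: columns to keep for each endpoint.
        None keeps every column, which mirrors selecting `*`.
    """
    def project(row, cols):
        if cols is None:
            return dict(row)              # keep all columns
        return {c: row[c] for c in cols}  # keep only the requested ones

    return [
        {"src": project(vertices[s], src_cols),
         "dst": project(vertices[d], dst_cols)}
        for s, d in edges
    ]


# Each vertex carries a large "path" column, standing in for heavy
# per-vertex state such as stored sequences.
vertices = {
    1: {"id": 1, "rank": 0.5, "path": list(range(1000))},
    2: {"id": 2, "rank": 0.25, "path": list(range(1000))},
}
edges = [(1, 2), (2, 1)]

# Default behaviour: every triplet drags along all columns of both endpoints.
full = build_triplets(vertices, edges)

# Pruned behaviour: message logic that only needs src.id, src.rank and dst.id.
pruned = build_triplets(vertices, edges,
                        src_cols=["id", "rank"], dst_cols=["id"])

def columns_per_triplet(triplet):
    return len(triplet["src"]) + len(triplet["dst"])

print(columns_per_triplet(full[0]), columns_per_triplet(pruned[0]))
```

In the pruned case the heavy path column never enters the triplets at all, which is the effect the description attributes to specifying the required columns up front.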