[CELEBORN-2347] Change get reducer file group cache to expireAfterAccess#3717

Open
s0nskar wants to merge 2 commits into
apache:mainfrom
s0nskar:cache_policy
Contributor

@s0nskar s0nskar commented Jun 2, 2026

What changes were proposed in this pull request?

Change get reducer file group cache to expireAfterAccess

Why are the changes needed?

Currently the policy is expireAfterWrite, which is inefficient: it evicts every entry once the timeout elapses, regardless of whether the entry is hot. expireAfterAccess evicts an entry only if it has not been accessed within the timeout.
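
To make the difference between the two policies concrete, here is a minimal sketch that counts how often a single cached entry would have to be rebuilt under each policy. It is a plain-Scala model with explicit timestamps, not Guava and not Celeborn code; `ExpiryPolicyDemo`, `Policy`, and `countLoads` are hypothetical names introduced only for illustration.

```scala
object ExpiryPolicyDemo {
  sealed trait Policy
  case object ExpireAfterWrite extends Policy
  case object ExpireAfterAccess extends Policy

  /** Counts how many times the value for one key must be (re)built when the
    * key is read at each timestamp in `readTimesMs` (ascending order). */
  def countLoads(policy: Policy, ttlMs: Long, readTimesMs: Seq[Long]): Int = {
    var loads = 0
    var deadline = Long.MinValue // nothing cached yet, so the first read misses
    for (now <- readTimesMs) {
      if (now >= deadline) {
        // Miss: the entry is absent or expired, so rebuild it and start the timer.
        loads += 1
        deadline = now + ttlMs
      } else if (policy == ExpireAfterAccess) {
        // Hit under access-based expiry: the read pushes the deadline out.
        deadline = now + ttlMs
      }
      // Hit under write-based expiry: the deadline stays where it was.
    }
    loads
  }

  def main(args: Array[String]): Unit = {
    // One read per second for 60 s with a 15 s timeout.
    val reads = (0L until 60L).map(_ * 1000L)
    println(countLoads(ExpireAfterWrite, 15000L, reads))  // prints 4
    println(countLoads(ExpireAfterAccess, 15000L, reads)) // prints 1
  }
}
```

In this model a continuously read entry is rebuilt four times in a minute under write-based expiry and only once under access-based expiry.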

Does this PR resolve a correctness bug?

  • Yes

Does this PR introduce any user-facing change?

  • Yes

How was this patch tested?

Existing UTs

Contributor Author

s0nskar commented Jun 2, 2026

Also, I think we should increase the default value of celeborn.client.rpc.cache.expireTime from 15s to 1 min to reduce cache misses. Please share your thoughts on this.

Copilot AI left a comment

Pull request overview

This PR updates the reducer file group RPC response cache in ReducePartitionCommitHandler to use an access-based expiry policy, so frequently accessed entries are less likely to be evicted purely due to time since creation.

Changes:

  • Switch Guava cache eviction from expireAfterWrite to expireAfterAccess for the reducer file group RPC cache.

Comment on lines 103 to 108
   // noinspection UnstableApiUsage
   private val getReducerFileGroupRpcCache: Cache[Int, ByteBuffer] = CacheBuilder.newBuilder()
     .concurrencyLevel(rpcCacheConcurrencyLevel)
-    .expireAfterWrite(rpcCacheExpireTime, TimeUnit.MILLISECONDS)
+    .expireAfterAccess(rpcCacheExpireTime, TimeUnit.MILLISECONDS)
     .maximumSize(rpcCacheSize)
     .build().asInstanceOf[Cache[Int, ByteBuffer]]
Contributor Author

@Copilot, can you please elaborate on this?

Member

@SteNicholas SteNicholas left a comment

The switch looks reasonable and I think it's safe, but it hinges on one property that's worth making explicit in the PR.

Why it's likely safe

getReducerFileGroupRpcCache is consulted only in replyGetReducerFileGroup, which is gated behind isStageEnd(shuffleId). The cached content — reducerFileGroupsMap, getMapperAttempts, shufflePushFailedBatches — is all finalized at/before stage end (shufflePushFailedBatches is last written in handleMapperEnd, reducerFileGroupsMap in collectResult before setStageEnd). After stage end the response for a given shuffleId is effectively immutable until the whole shuffle is dropped by removeExpiredShuffle.

So keeping a hot entry alive longer can't serve stale data, and expireAfterAccess strictly improves on expireAfterWrite here — the old policy was re-serializing an identical response every 15s for actively-read shuffles. maximumSize still bounds memory, so there's no unbounded-growth risk.

One thing worth confirming

This cache has no explicit invalidate() anywhere (only construction + .get), so the expiry policy is the sole freshness control. expireAfterWrite gave a hard 15s staleness ceiling; expireAfterAccess removes that ceiling for continuously-accessed shuffles (a hot entry can live for the entire read phase). That's fine iff the response truly never changes after it's first served. Could you confirm no path mutates reducerFileGroupsMap / shufflePushFailedBatches for an already-served shuffleId — in particular a stage rerun / recompute that reuses the same Celeborn shuffleId, or a partition-split update during read? If reruns always allocate a fresh shuffleId (which I believe is the case), this is a non-issue.
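
The staleness concern can be illustrated with a small model. This is a sketch under stated assumptions, not Celeborn or Guava code: the source value is assumed to flip from "v1" to "v2" at a given time (exactly the kind of mutation the question asks to rule out), and `StalenessDemo` and `lastObserved` are hypothetical names.

```scala
object StalenessDemo {
  /** Returns the value seen by the final read when the underlying source
    * flips from "v1" to "v2" at `changeAtMs`. `refreshOnAccess = true`
    * models access-based expiry, `false` models write-based expiry. */
  def lastObserved(
      refreshOnAccess: Boolean,
      ttlMs: Long,
      changeAtMs: Long,
      readTimesMs: Seq[Long]): String = {
    def source(now: Long): String = if (now < changeAtMs) "v1" else "v2"

    var cached: Option[(String, Long)] = None // (value, expiry deadline)
    var observed = ""
    for (now <- readTimesMs) {
      // Keep the cached entry only if it has not reached its deadline.
      val live = cached.filter { case (_, deadline) => now < deadline }
      val (value, deadline) = live match {
        case Some((v, d)) => (v, if (refreshOnAccess) now + ttlMs else d)
        case None         => (source(now), now + ttlMs) // miss: reload from source
      }
      cached = Some((value, deadline))
      observed = value
    }
    observed
  }

  def main(args: Array[String]): Unit = {
    val reads = (0L until 60L).map(_ * 1000L) // one read per second
    println(lastObserved(refreshOnAccess = false, 15000L, 5000L, reads)) // prints v2
    println(lastObserved(refreshOnAccess = true, 15000L, 5000L, reads))  // prints v1
  }
}
```

With write-based expiry the change is picked up at the next reload; with access-based expiry a hot entry keeps returning the old value. This is harmless only if the response never changes after it is first served.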

Minor

The doc for celeborn.client.rpc.cache.expireTime ("The time before a cache item is removed.") now describes idle time rather than age — a one-line wording tweak would avoid confusion.

Otherwise the rationale (don't evict hot entries on a fixed timer) is sound. LGTM pending the confirmation above.

Contributor

@RexXiong RexXiong left a comment

The change is safe. This cache is only accessed in replyGetReducerFileGroup, which is gated by isStageEnd(shuffleId). After stage end, reducerFileGroupsMap, mapperAttempts, and shufflePushFailedBatches are all immutable, so a hot entry staying alive longer cannot serve stale data.

Performance-wise, expireAfterWrite forces re-serialization of an identical response every 15s for actively-read shuffles. expireAfterAccess avoids this unnecessary work. maximumSize still bounds memory.

One minor pre-existing issue (not blocking this PR): removeExpiredShuffle cleans up stageEndShuffleSet, shuffleMapperAttempts, etc., but does not invalidate getReducerFileGroupRpcCache. With expireAfterAccess, orphaned entries will still expire naturally (no requests after shuffle removal), but an explicit getReducerFileGroupRpcCache.invalidate(shuffleId) in removeExpiredShuffle would be cleaner.
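
The suggested cleanup could look roughly like the sketch below. It is a simplified, hypothetical stand-in that uses plain Scala collections rather than the real handler state; `ShuffleStateSketch`, `markStageEnd`, and `isCached` are names invented for illustration. With a Guava `Cache`, the removal step would be a call to `invalidate(shuffleId)`.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for the handler's per-shuffle state.
final class ShuffleStateSketch {
  private val stageEndShuffles = mutable.Set.empty[Int]
  private val rpcCache = mutable.Map.empty[Int, Array[Byte]]

  /** Records that a shuffle reached stage end and caches its serialized response. */
  def markStageEnd(shuffleId: Int, response: Array[Byte]): Unit = {
    stageEndShuffles += shuffleId
    rpcCache.put(shuffleId, response)
  }

  def isCached(shuffleId: Int): Boolean = rpcCache.contains(shuffleId)

  /** Drops all state for the shuffle, including its cached response, so the
    * entry does not linger until the expiry timer fires. */
  def removeExpiredShuffle(shuffleId: Int): Unit = {
    stageEndShuffles -= shuffleId
    rpcCache.remove(shuffleId) // plays the role of Cache.invalidate(shuffleId)
  }
}
```

Removing the entry together with the rest of the shuffle state frees the cached buffer immediately rather than relying on the idle timeout.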

LGTM.

Reviewed with Claude Code

Contributor Author

s0nskar commented Jun 5, 2026

@SteNicholas @RexXiong Addressed review comments.

4 participants