[CELEBORN-2347] Change get reducer file group cache to expireAfterAccess #3717
s0nskar wants to merge 2 commits into
Also, I think we should increase the default value for
Pull request overview
This PR updates the reducer file group RPC response cache in ReducePartitionCommitHandler to use an access-based expiry policy, so frequently accessed entries are less likely to be evicted purely due to time since creation.
Changes:
- Switch Guava cache eviction from expireAfterWrite to expireAfterAccess for the reducer file group RPC cache.
  // noinspection UnstableApiUsage
  private val getReducerFileGroupRpcCache: Cache[Int, ByteBuffer] = CacheBuilder.newBuilder()
    .concurrencyLevel(rpcCacheConcurrencyLevel)
-   .expireAfterWrite(rpcCacheExpireTime, TimeUnit.MILLISECONDS)
+   .expireAfterAccess(rpcCacheExpireTime, TimeUnit.MILLISECONDS)
    .maximumSize(rpcCacheSize)
    .build().asInstanceOf[Cache[Int, ByteBuffer]]
@Copilot can you please explain more on this.
SteNicholas left a comment
The switch looks reasonable and I think it's safe, but it hinges on one property that's worth making explicit in the PR.
Why it's likely safe
getReducerFileGroupRpcCache is consulted only in replyGetReducerFileGroup, which is gated behind isStageEnd(shuffleId). The cached content — reducerFileGroupsMap, getMapperAttempts, shufflePushFailedBatches — is all finalized at/before stage end (shufflePushFailedBatches is last written in handleMapperEnd, reducerFileGroupsMap in collectResult before setStageEnd). After stage end the response for a given shuffleId is effectively immutable until the whole shuffle is dropped by removeExpiredShuffle.
So keeping a hot entry alive longer can't serve stale data, and expireAfterAccess strictly improves on expireAfterWrite here — the old policy was re-serializing an identical response every 15s for actively-read shuffles. maximumSize still bounds memory, so there's no unbounded-growth risk.
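The difference between the two policies can be sketched with a toy cache driven by an explicit clock. This is only a model for illustration, not Guava's implementation: ToyCache, ExpiryPolicyDemo and countLoads are hypothetical names, and the 15s expiry and 5s read interval are assumed values chosen to match the figure quoted above.

```scala
import scala.collection.mutable

// Toy model of the two expiry policies; NOT Guava's implementation.
final class ToyCache[K, V](ttlMs: Long, expireAfterAccess: Boolean) {
  // Each entry stores the value plus the timestamp its expiry is measured from.
  private val entries = mutable.Map.empty[K, (V, Long)]
  // Number of times the loader ran, i.e. how often the response was rebuilt.
  var loads = 0

  def get(key: K, nowMs: Long)(load: => V): V = {
    entries.get(key) match {
      case Some((value, stamp)) if nowMs - stamp < ttlMs =>
        // Hit: access-based expiry restarts the timer, write-based does not.
        if (expireAfterAccess) entries(key) = (value, nowMs)
        value
      case _ =>
        // Miss or expired: rebuild the value and restart the timer.
        loads += 1
        val value = load
        entries(key) = (value, nowMs)
        value
    }
  }
}

object ExpiryPolicyDemo {
  // Read one shuffle's response every 5s for 60s with a 15s expiry (assumed values).
  def countLoads(expireAfterAccess: Boolean): Int = {
    val cache = new ToyCache[Int, String](ttlMs = 15000L, expireAfterAccess = expireAfterAccess)
    (0L to 60000L by 5000L).foreach { now =>
      cache.get(1, now)("serialized-response")
    }
    cache.loads
  }
}
```

In this model the write-based policy rebuilds the identical response at 0s, 15s, 30s, 45s and 60s (five loads), while the access-based policy builds it once and keeps it for as long as reads keep arriving within the expiry window.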
One thing worth confirming
This cache has no explicit invalidate() anywhere (only construction + .get), so the expiry policy is the sole freshness control. expireAfterWrite gave a hard 15s staleness ceiling; expireAfterAccess removes that ceiling for continuously-accessed shuffles (a hot entry can live for the entire read phase). That's fine iff the response truly never changes after it's first served. Could you confirm no path mutates reducerFileGroupsMap / shufflePushFailedBatches for an already-served shuffleId — in particular a stage rerun / recompute that reuses the same Celeborn shuffleId, or a partition-split update during read? If reruns always allocate a fresh shuffleId (which I believe is the case), this is a non-issue.
Minor
The doc for celeborn.client.rpc.cache.expireTime ("The time before a cache item is removed.") now describes idle time rather than age — a one-line wording tweak would avoid confusion.
Otherwise the rationale (don't evict hot entries on a fixed timer) is sound. LGTM pending the confirmation above.
RexXiong left a comment
The change is safe. This cache is only accessed in replyGetReducerFileGroup, which is gated by isStageEnd(shuffleId). After stage end, reducerFileGroupsMap, mapperAttempts, and shufflePushFailedBatches are all immutable, so a hot entry staying alive longer cannot serve stale data.
Performance-wise, expireAfterWrite forces re-serialization of an identical response every 15s for actively-read shuffles. expireAfterAccess avoids this unnecessary work. maximumSize still bounds memory.
One minor pre-existing issue (not blocking this PR): removeExpiredShuffle cleans up stageEndShuffleSet, shuffleMapperAttempts, etc., but does not invalidate getReducerFileGroupRpcCache. With expireAfterAccess, orphaned entries will still expire naturally (no requests after shuffle removal), but an explicit getReducerFileGroupRpcCache.invalidate(shuffleId) in removeExpiredShuffle would be cleaner.
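A minimal sketch of what that explicit cleanup could look like, under stated assumptions: CommitHandlerSketch is a hypothetical stand-in for the real handler, and TrieMap is used in place of the Guava Cache so the snippet runs without extra dependencies. Only the field and method names are taken from the comment above.

```scala
import java.nio.ByteBuffer
import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for the commit handler. Field names follow the review
// comments, but the real class holds a Guava Cache and more per-shuffle state.
final class CommitHandlerSketch {
  val stageEndShuffleSet: TrieMap[Int, Unit] = TrieMap.empty[Int, Unit]
  val getReducerFileGroupRpcCache: TrieMap[Int, ByteBuffer] = TrieMap.empty[Int, ByteBuffer]

  def removeExpiredShuffle(shuffleId: Int): Unit = {
    stageEndShuffleSet.remove(shuffleId)
    // Drop the cached response together with the rest of the shuffle state.
    // With a Guava Cache this line would be cache.invalidate(shuffleId).
    getReducerFileGroupRpcCache.remove(shuffleId)
  }
}
```

Removing the entry eagerly means a dropped shuffle does not occupy one of the maximumSize slots until its idle timer runs out.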
LGTM.
Reviewed with Claude Code
@SteNicholas @RexXiong Addressed review comments.
What changes were proposed in this pull request?
Change get reducer file group cache to expireAfterAccess
Why are the changes needed?
Currently the policy is expireAfterWrite, which is inefficient: it clears an entry strictly after the timeout, regardless of whether that entry is hot.
expireAfterAccess will make sure an entry is cleared only if it is not actively being accessed.
Does this PR resolve a correctness bug?
Does this PR introduce any user-facing change?
How was this patch tested?
Existing UTs