
s0nskar · 2026-06-02T13:26:47Z

What changes were proposed in this pull request?

Change the expiry policy of the get reducer file group RPC cache from expireAfterWrite to expireAfterAccess.

Why are the changes needed?

Currently the policy is expireAfterWrite, which is inefficient: it evicts an entry as soon as the timeout has elapsed since the entry was written, regardless of whether the entry is hot. expireAfterAccess evicts an entry only if it has not been accessed within the timeout.
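To make the difference between the two policies concrete, here is a minimal sketch using Guava's `CacheBuilder` with a manually advanced `Ticker`. It is a toy model, not Celeborn code: `ExpiryPolicyDemo`, `FakeTicker`, `survives`, the key `7`, and the 15s timeout are all illustrative assumptions.

```scala
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicLong
import com.google.common.base.Ticker
import com.google.common.cache.{Cache, CacheBuilder}

object ExpiryPolicyDemo {
  // Manually advanced clock so that expiry is deterministic (hypothetical helper).
  final class FakeTicker extends Ticker {
    private val nanos = new AtomicLong(0L)
    def advanceMillis(ms: Long): Unit = nanos.addAndGet(TimeUnit.MILLISECONDS.toNanos(ms))
    override def read(): Long = nanos.get()
  }

  // Builds a cache with a 15s timeout under either policy (timeout chosen for illustration).
  private def newCache(accessBased: Boolean, ticker: Ticker): Cache[Integer, String] = {
    val builder = CacheBuilder.newBuilder().ticker(ticker)
    val withExpiry =
      if (accessBased) builder.expireAfterAccess(15000L, TimeUnit.MILLISECONDS)
      else builder.expireAfterWrite(15000L, TimeUnit.MILLISECONDS)
    withExpiry.build[Integer, String]()
  }

  // Writes one entry at t=0, then reads it at t=10s, t=20s and t=36s.
  // Returns, for each read, whether the entry was still cached.
  def survives(accessBased: Boolean): Seq[Boolean] = {
    val ticker = new FakeTicker
    val cache = newCache(accessBased, ticker)
    val shuffleId: Integer = Integer.valueOf(7)
    cache.put(shuffleId, "serialized-response")
    Seq(10000L, 10000L, 16000L).map { idleMs =>
      ticker.advanceMillis(idleMs)
      cache.getIfPresent(shuffleId) != null
    }
  }
}
```

Under these assumptions, the write-based cache drops the entry by the second read (20s after the write), while the access-based cache keeps it until the entry has been idle for longer than the timeout.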

Does this PR resolve a correctness bug?

Yes

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Existing UTs

s0nskar · 2026-06-02T13:27:44Z

Also, I think we should increase the default value of celeborn.client.rpc.cache.expireTime from 15s to 1 min to reduce cache misses. Please share your thoughts on this.

Copilot

Pull request overview

This PR updates the reducer file group RPC response cache in ReducePartitionCommitHandler to use an access-based expiry policy, so frequently accessed entries are less likely to be evicted purely due to time since creation.

Changes:

Switch Guava cache eviction from expireAfterWrite to expireAfterAccess for the reducer file group RPC cache.


s0nskar · 2026-06-03T08:12:56Z

  // noinspection UnstableApiUsage
  private val getReducerFileGroupRpcCache: Cache[Int, ByteBuffer] = CacheBuilder.newBuilder()
    .concurrencyLevel(rpcCacheConcurrencyLevel)
-    .expireAfterWrite(rpcCacheExpireTime, TimeUnit.MILLISECONDS)
+    .expireAfterAccess(rpcCacheExpireTime, TimeUnit.MILLISECONDS)
    .maximumSize(rpcCacheSize)
    .build().asInstanceOf[Cache[Int, ByteBuffer]]


@Copilot, can you please explain this in more detail?

SteNicholas

The switch looks reasonable and I think it's safe, but it hinges on one property that's worth making explicit in the PR.

Why it's likely safe

getReducerFileGroupRpcCache is consulted only in replyGetReducerFileGroup, which is gated behind isStageEnd(shuffleId). The cached content — reducerFileGroupsMap, getMapperAttempts, shufflePushFailedBatches — is all finalized at/before stage end (shufflePushFailedBatches is last written in handleMapperEnd, reducerFileGroupsMap in collectResult before setStageEnd). After stage end the response for a given shuffleId is effectively immutable until the whole shuffle is dropped by removeExpiredShuffle.

So keeping a hot entry alive longer can't serve stale data, and expireAfterAccess strictly improves on expireAfterWrite here — the old policy was re-serializing an identical response every 15s for actively-read shuffles. maximumSize still bounds memory, so there's no unbounded-growth risk.

One thing worth confirming

This cache has no explicit invalidate() anywhere (only construction + .get), so the expiry policy is the sole freshness control. expireAfterWrite gave a hard 15s staleness ceiling; expireAfterAccess removes that ceiling for continuously-accessed shuffles (a hot entry can live for the entire read phase). That's fine iff the response truly never changes after it's first served. Could you confirm no path mutates reducerFileGroupsMap / shufflePushFailedBatches for an already-served shuffleId — in particular a stage rerun / recompute that reuses the same Celeborn shuffleId, or a partition-split update during read? If reruns always allocate a fresh shuffleId (which I believe is the case), this is a non-issue.
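The staleness concern above can be illustrated with a small sketch. It is a toy model under stated assumptions, not Celeborn code: `StalenessDemo`, `FakeTicker`, `readEvery10s`, and the mutation of `source` are hypothetical, and the mutation stands in for exactly the kind of post-stage-end change that the question asks to rule out.

```scala
import java.util.concurrent.{Callable, TimeUnit}
import java.util.concurrent.atomic.AtomicLong
import com.google.common.base.Ticker
import com.google.common.cache.{Cache, CacheBuilder}

object StalenessDemo {
  // Manually advanced clock so that expiry is deterministic (hypothetical helper).
  final class FakeTicker extends Ticker {
    private val nanos = new AtomicLong(0L)
    def advanceMillis(ms: Long): Unit = nanos.addAndGet(TimeUnit.MILLISECONDS.toNanos(ms))
    override def read(): Long = nanos.get()
  }

  // Reads the same key every 10s through a cache with a 15s timeout.
  // The underlying value changes from "v1" to "v2" just before the second read.
  def readEvery10s(accessBased: Boolean, reads: Int): Seq[String] = {
    val ticker = new FakeTicker
    val builder = CacheBuilder.newBuilder().ticker(ticker)
    val cache: Cache[Integer, String] =
      (if (accessBased) builder.expireAfterAccess(15000L, TimeUnit.MILLISECONDS)
       else builder.expireAfterWrite(15000L, TimeUnit.MILLISECONDS)).build[Integer, String]()

    var source = "v1" // stands in for the data the response is serialized from
    val loader = new Callable[String] { override def call(): String = source }
    val shuffleId: Integer = Integer.valueOf(7)

    (1 to reads).map { i =>
      if (i == 2) source = "v2" // hypothetical mutation after the first response was served
      ticker.advanceMillis(10000L)
      cache.get(shuffleId, loader) // loads from `source` only on a miss
    }
  }
}
```

In this model the write-based cache picks up `v2` once the entry ages out, whereas the access-based cache keeps returning `v1` for as long as the key is read more often than the timeout. That only matters if the underlying data can change after it is first served.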

Minor

The doc for celeborn.client.rpc.cache.expireTime ("The time before a cache item is removed.") now describes idle time rather than age — a one-line wording tweak would avoid confusion.

Otherwise the rationale (don't evict hot entries on a fixed timer) is sound. LGTM pending the confirmation above.

RexXiong

The change is safe. This cache is only accessed in replyGetReducerFileGroup, which is gated by isStageEnd(shuffleId). After stage end, reducerFileGroupsMap, mapperAttempts, and shufflePushFailedBatches are all immutable, so a hot entry staying alive longer cannot serve stale data.

Performance-wise, expireAfterWrite forces re-serialization of an identical response every 15s for actively-read shuffles. expireAfterAccess avoids this unnecessary work. maximumSize still bounds memory.

One minor pre-existing issue (not blocking this PR): removeExpiredShuffle cleans up stageEndShuffleSet, shuffleMapperAttempts, etc., but does not invalidate getReducerFileGroupRpcCache. With expireAfterAccess, orphaned entries will still expire naturally (no requests after shuffle removal), but an explicit getReducerFileGroupRpcCache.invalidate(shuffleId) in removeExpiredShuffle would be cleaner.
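A rough sketch of what the explicit invalidation could look like is shown below. The class `ReducerFileGroupCacheSketch` and its methods `cacheResponse` and `cached` are hypothetical stand-ins that model only the cache interaction; the real `removeExpiredShuffle` cleans up other state that is not shown here.

```scala
import java.nio.ByteBuffer
import java.util.concurrent.TimeUnit
import com.google.common.cache.{Cache, CacheBuilder}

// Hypothetical stand-in for the commit handler; only the RPC cache is modeled.
final class ReducerFileGroupCacheSketch(expireMs: Long, maxSize: Long) {
  private val rpcCache: Cache[Integer, ByteBuffer] = CacheBuilder.newBuilder()
    .expireAfterAccess(expireMs, TimeUnit.MILLISECONDS)
    .maximumSize(maxSize)
    .build[Integer, ByteBuffer]()

  // Stores a serialized response for a shuffle (illustrative helper).
  def cacheResponse(shuffleId: Int, response: ByteBuffer): Unit =
    rpcCache.put(Integer.valueOf(shuffleId), response)

  // Returns the cached response, if any (illustrative helper).
  def cached(shuffleId: Int): Option[ByteBuffer] =
    Option(rpcCache.getIfPresent(Integer.valueOf(shuffleId)))

  // Drops the cached response as part of shuffle cleanup instead of
  // waiting for the idle timeout to remove it.
  def removeExpiredShuffle(shuffleId: Int): Unit = {
    // ... cleanup of other per-shuffle state would happen here ...
    rpcCache.invalidate(Integer.valueOf(shuffleId))
  }
}
```

With this shape, a removed shuffle's entry is freed immediately rather than occupying one of the `maximumSize` slots until it idles out.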

LGTM.

Reviewed with Claude Code

s0nskar · 2026-06-05T08:36:45Z

@SteNicholas @RexXiong Addressed review comments.

[CELEBORN-XXXX] Change get reducer file group cache to expireAfterAccess

596ebfe

github-actions Bot added the module:client label Jun 2, 2026

SteNicholas requested a review from Copilot June 2, 2026 21:16

Copilot started reviewing on behalf of SteNicholas June 2, 2026 21:16

Copilot AI reviewed Jun 2, 2026


SteNicholas reviewed Jun 3, 2026


RexXiong reviewed Jun 4, 2026


review comments

b829b54

github-actions Bot added kind:documentation module:common labels Jun 5, 2026


[CELEBORN-2347] Change get reducer file group cache to expireAfterAccess #3717

s0nskar wants to merge 2 commits into apache:main from s0nskar:cache_policy
