
Conversation

@themitvp
Contributor

Fix Race Condition Bug

Package: @fortedigital/nextjs-cache-handler v2.5.0
File: packages/nextjs-cache-handler/src/handlers/redis-strings.ts
Issue: Cache hit ratio degraded to ~66.6% in production (should be ~100%)
Root Cause: Race condition in set() method


Problem Summary

In production with 28-31 Kubernetes pods running Next.js applications using this cache handler, we observed a consistent 66.6% cache hit ratio when it should be near 100%. This results in:

  • 33.4% false cache misses - data that exists in cache returns null
  • Wasted backend resources regenerating already-cached content
  • Degraded user experience due to slower response times
  • Unnecessary load on databases and APIs

Root Cause

The set() method in redis-strings.ts performs Redis operations in two sequential Promise.all() calls:

Buggy Code (Current)

// Phase 1: Write tags and TTL
await Promise.all([setTagsOperation, setSharedTtlOperation]);

// ... switch statement to prepare value operations ...

// Phase 2: Write value and expiration
await Promise.all([setOperation, expireOperation]);

The Race Condition

Between Phase 1 and Phase 2, there's a critical race window:

  1. Phase 1 completes → Tags exist in Redis
  2. Concurrent get() calls arrive during the gap
  3. get() logic sees tags exist (lines 187-193)
  4. But the value doesn't exist yet (Phase 2 hasn't completed)
  5. get() returns null → false cache miss ❌

With 28-31 pods and high concurrency, approximately 1/3 of requests fall into this race window, explaining the 66.6% hit ratio.
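
For illustration, here is a minimal sketch of the read path that falls into this window. This is not the actual redis-strings.ts implementation; the key names and client wiring are hypothetical, assuming a node-redis v4 client.

// Simplified illustration of the race window (hypothetical key names,
// not the real handler code).
import { createClient } from "redis";

type RedisClient = ReturnType<typeof createClient>;

async function readLikeGet(client: RedisClient, key: string) {
  // The reader first sees the tag entry that Phase 1 already wrote...
  const tags = await client.get(`${key}:tags`);
  // ...but the value from Phase 2 may not have been written yet.
  const value = await client.get(key);
  if (tags !== null && value === null) {
    return null; // false cache miss: tags exist, value does not
  }
  return value;
}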


Reproduction

I created a test that simulates cache invalidation with concurrent readers:

Test Setup

  • 100 iterations of cache invalidation + regeneration
  • 20 concurrent readers per iteration
  • 2,000 total read attempts
  • Simulates production patterns
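
A minimal sketch of such a harness (illustrative only, with hypothetical key names and a simulated two-phase write; it is not the actual test code, and it assumes a node-redis v4 client):

// Illustrative reproduction harness: a writer mimics the buggy two-phase
// set(), while concurrent readers count "tags present, value missing"
// as false cache misses.
import { createClient } from "redis";

async function main() {
  const client = createClient({ url: "redis://localhost:6379" });
  await client.connect();

  const key = "page:/demo";
  let reads = 0;
  let falseMisses = 0;

  for (let i = 0; i < 100; i++) {
    // Invalidation: drop both the value and its tag entry.
    await client.del([key, `${key}:tags`]);

    // Writer: Phase 1 (tags), then Phase 2 (value), like the buggy set().
    const writer = (async () => {
      await client.set(`${key}:tags`, "tag-a,tag-b");
      await client.set(key, JSON.stringify({ iteration: i }));
    })();

    // 20 concurrent readers, mimicking get(): tags first, then value.
    const readers = Array.from({ length: 20 }, async () => {
      const tags = await client.get(`${key}:tags`);
      const value = await client.get(key);
      reads += 1;
      if (tags !== null && value === null) falseMisses += 1;
    });

    await Promise.all([writer, ...readers]);
  }

  console.log(`reads=${reads}, false misses=${falseMisses}`);
  await client.quit();
}

main().catch(console.error);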

Results - BEFORE Fix

📊 FINAL RESULTS:

   Total cache reads:       2,000
   Successful hits:         1,989 (99.45%)
   False cache misses:      11 (0.55%)
   Cache hit ratio:         99.45%

😱 RACE CONDITION DETECTED

Note: Local testing shows only a 0.55% miss rate because localhost Redis is so fast. In production, with network latency across 28-31 pods, this amplifies to 33.4% (a 66.6% hit ratio).

Results - AFTER Fix

📊 FINAL RESULTS:

   Total cache reads:       2,000
   Successful hits:         2,000 (100.00%)
   False cache misses:      0 (0.00%)
   Cache hit ratio:         100.00%

✅ NO RACE CONDITIONS DETECTED

   The atomic Promise.all() implementation successfully eliminates
   the race window by executing all operations together.

   100% cache hit ratio achieved! 🎉

The Fix

Issue all Redis operations in a single Promise.all(), so there is no await between the tag writes and the value writes:

// Fixed: all operations issued together in one Promise.all()
await Promise.all(
  [
    setTagsOperation,
    setSharedTtlOperation,
    setOperation,
    expireOperation,
  ].filter(Boolean), // drop any operation that is undefined for this entry
);

Why This Works

  • All Redis operations are issued concurrently
  • No await between the tag writes and the value writes
  • Concurrent get() calls either see:
    • ✅ a complete cache entry (tags + value), or
    • ✅ no cache entry at all (while the write is in flight)
  • They never see tags without a value

@themitvp
Contributor Author

@AyronK is it possible to deploy this as a release candidate or something? Then I can test it out in our production env and see if it really improves our metrics before we merge it into master

@AyronK
Collaborator

AyronK commented Jan 23, 2026

@themitvp I'll take a look on Monday, but yes I can draft a prerelease.

@AyronK
Collaborator

AyronK commented Jan 23, 2026

Btw you worried me so I quickly checked our production. It's at a 98% hit ratio 🤔. Running with 10-30 Next.js applications at a time. I'll investigate further next week.

@AyronK
Collaborator

AyronK commented Jan 23, 2026

But from a quick look at your issue and code, it may make sense in some scenarios. Thanks for reporting and suggesting a fix!

@themitvp
Contributor Author

Btw you worried me so I quickly checked our production. It's at a 98% hit ratio 🤔. Running with 10-30 Next.js applications at a time. I'll investigate further next week.

Oh interesting, thanks for checking! I wonder if the issue is something else, but I can't see what else it could be. I also implemented a lock mechanism in my project in case of thundering-herd issues, but it made no impact on the cache hit ratio. We did add some logs, though, and can see that the pods really do compete to set data quite often, so at least the lock works.

I am also considering opening another PR to add the lock mechanism, if you are interested.

I also thought that maybe we need to add some jitter to the TTLs, as we may have many keys that expire at the same time 🤔 I will need to double-check this.
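
Something along these lines, just as a hypothetical sketch (not handler code):

// Hypothetical TTL jitter: spread expirations so many keys don't expire at once.
function ttlWithJitter(baseTtlSeconds: number, maxJitterSeconds = 30): number {
  return baseTtlSeconds + Math.floor(Math.random() * maxJitterSeconds);
}

// e.g. await client.expire(key, ttlWithJitter(300));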

@AyronK
Collaborator

AyronK commented Jan 24, 2026

I also thought that maybe we need to add some jitter to the TTLs, as we may have many keys that expire at the same time 🤔 I will need to double-check this.

We do that; we have a random skew for each cache TTL (well, most of them). And I'm not saying there is no race condition, just that I haven't experienced one, and nobody else did (or is aware of it) until you, I suppose 😅.

I am also considering opening another PR to add the lock mechanism, if you are interested.

Not sure what kind of locks you have in mind and where exactly.

@AyronK
Collaborator

AyronK commented Jan 26, 2026

@themitvp 2.5.1-alhpa.1. Forgive me a typo 😆.

@themitvp
Contributor Author

We do that; we have a random skew for each cache TTL (well, most of them). And I'm not saying there is no race condition, just that I haven't experienced one, and nobody else did (or is aware of it) until you, I suppose 😅.

Uh nice, good to know! I will experiment with that as well, but I need to test each change separately, otherwise I won't know what fixed it 😅

Not sure what kind of locks you have in mind and where exactly.

It is whenever we are trying to set a new value in Redis. We thought that maybe the odd cache hit ratio was because multiple pods were trying to set a value at the same time. So we are doing a distributed Redis lock: only the first request that sees expired cache acquires the lock and fetches fresh data, while the others just see the lock and skip setting data. We added some logging to it and can actually see that it happens quite often. However, this means that we still call our backend multiple times, so there is still some more work for us to do here. When we finalize a nice solution I can share it with you, if it makes sense to add it as part of the library.
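
Roughly this shape, as an illustrative sketch (hypothetical names, assuming a node-redis v4 client; fetchFromBackend is just a stand-in for our data fetch):

import { createClient } from "redis";

type RedisClient = ReturnType<typeof createClient>;

// Hypothetical stand-in for the real backend call.
declare function fetchFromBackend(key: string): Promise<unknown>;

// Only the pod that wins SET NX regenerates the entry; the others skip the write.
async function regenerateWithLock(client: RedisClient, key: string) {
  const lockKey = `${key}:lock`;
  // NX: set only if the lock does not exist; PX: auto-expire so a crashed pod
  // cannot hold the lock forever.
  const acquired = await client.set(lockKey, "1", { NX: true, PX: 10_000 });
  if (acquired !== "OK") {
    return; // another pod is already regenerating this entry
  }
  try {
    const fresh = await fetchFromBackend(key);
    await client.set(key, JSON.stringify(fresh), { EX: 300 });
  } finally {
    await client.del(lockKey);
  }
}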

@themitvp 2.5.1-alhpa.1. Forgive me a typo 😆.

Nice, thank you @AyronK 🙌 will update you when we know more 👌

@themitvp
Contributor Author

@AyronK So the "fix" has been running stably in production for a little less than a day now, but it has made no big impact on any of our metrics, and the cache hit ratio remains at 66.6% 😞 So it is up to you whether you wanna continue with this PR or close it. I think it does improve the code slightly, but it's also fair if you wanna leave things as they are.

So I will try adding jitter to our TTLs and see if that fixes anything, wish me luck 🀞

@AyronK
Collaborator

AyronK left a comment

Please remove the examples/race-condition-demo. I am not convinced it actually proves anything, especially after your experience from production. I would like to keep the improvement, but the code itself is enough 😊. Thanks for the contribution!

@themitvp
Contributor Author

Done, I removed the example and comment 👌

AyronK merged commit c4569a5 into fortedigital:master on Jan 27, 2026
1 check passed