
Conversation

@themitvp
Contributor

Fix Race Condition Bug

Package: @fortedigital/nextjs-cache-handler v2.5.0
File: packages/nextjs-cache-handler/src/handlers/redis-strings.ts
Issue: Cache hit ratio degraded to ~66.6% in production (should be ~100%)
Root Cause: Race condition in set() method


Problem Summary

In production with 28-31 Kubernetes pods running Next.js applications using this cache handler, we observed a consistent 66.6% cache hit ratio when it should be near 100%. This results in:

  • 33.4% false cache misses - data that exists in cache returns null
  • Wasted backend resources regenerating already-cached content
  • Degraded user experience due to slower response times
  • Unnecessary load on databases and APIs

Root Cause

The set() method in redis-strings.ts performs Redis operations in two sequential Promise.all() calls:

Buggy Code (Current)

// Phase 1: Write tags and TTL
await Promise.all([setTagsOperation, setSharedTtlOperation]);

// ... switch statement to prepare value operations ...

// Phase 2: Write value and expiration
await Promise.all([setOperation, expireOperation]);

The Race Condition

Between Phase 1 and Phase 2, there's a critical race window:

  1. Phase 1 completes → Tags exist in Redis
  2. Concurrent get() calls arrive during the gap
  3. get() logic sees tags exist (lines 187-193)
  4. But the value doesn't exist yet (Phase 2 hasn't completed)
  5. get() returns null → false cache miss ❌

With 28-31 pods and high concurrency, approximately 1/3 of requests fall into this race window, explaining the 66.6% hit ratio.
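
For illustration, here is a minimal sketch of the read path that falls into this window. This is not the actual redis-strings.ts implementation; the key names and client wiring are hypothetical, assuming a node-redis v4 client.

// Simplified illustration of the race window (hypothetical key names,
// not the real handler code).
import { createClient } from "redis";

type RedisClient = ReturnType<typeof createClient>;

async function readLikeGet(client: RedisClient, key: string) {
  // The reader first sees the tag entry that Phase 1 already wrote...
  const tags = await client.get(`${key}:tags`);
  // ...but the value from Phase 2 may not have been written yet.
  const value = await client.get(key);
  if (tags !== null && value === null) {
    return null; // false cache miss: tags exist, value does not
  }
  return value;
}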


Reproduction

I created a test that simulates cache invalidation with concurrent readers:

Test Setup

  • 100 iterations of cache invalidation + regeneration
  • 20 concurrent readers per iteration
  • 2,000 total read attempts
  • Simulates production patterns
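
A minimal sketch of such a harness (illustrative only, with hypothetical key names and a simulated two-phase write; it is not the actual test code, and it assumes a node-redis v4 client):

// Illustrative reproduction harness: a writer mimics the buggy two-phase
// set(), while concurrent readers count "tags present, value missing"
// as false cache misses.
import { createClient } from "redis";

async function main() {
  const client = createClient({ url: "redis://localhost:6379" });
  await client.connect();

  const key = "page:/demo";
  let reads = 0;
  let falseMisses = 0;

  for (let i = 0; i < 100; i++) {
    // Invalidation: drop both the value and its tag entry.
    await client.del([key, `${key}:tags`]);

    // Writer: Phase 1 (tags), then Phase 2 (value), like the buggy set().
    const writer = (async () => {
      await client.set(`${key}:tags`, "tag-a,tag-b");
      await client.set(key, JSON.stringify({ iteration: i }));
    })();

    // 20 concurrent readers, mimicking get(): tags first, then value.
    const readers = Array.from({ length: 20 }, async () => {
      const tags = await client.get(`${key}:tags`);
      const value = await client.get(key);
      reads += 1;
      if (tags !== null && value === null) falseMisses += 1;
    });

    await Promise.all([writer, ...readers]);
  }

  console.log(`reads=${reads}, false misses=${falseMisses}`);
  await client.quit();
}

main().catch(console.error);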

Results - BEFORE Fix

📊 FINAL RESULTS:

   Total cache reads:       2,000
   Successful hits:         1,989 (99.45%)
   False cache misses:      11 (0.55%)
   Cache hit ratio:         99.45%

😱 RACE CONDITION DETECTED

Note: Local testing shows only a 0.55% miss rate because localhost Redis is so fast. In production, with network latency across 28-31 pods, this amplifies to 33.4% (a 66.6% hit ratio).

Results - AFTER Fix

📊 FINAL RESULTS:

   Total cache reads:       2,000
   Successful hits:         2,000 (100.00%)
   False cache misses:      0 (0.00%)
   Cache hit ratio:         100.00%

✅ NO RACE CONDITIONS DETECTED

   The atomic Promise.all() implementation successfully eliminates
   the race window by executing all operations together.

   100% cache hit ratio achieved! 🎉

The Fix

Issue all Redis operations in a single Promise.all(), so there is no await between the tag writes and the value writes:

// Fixed: all operations issued together in one Promise.all()
await Promise.all(
  [
    setTagsOperation,
    setSharedTtlOperation,
    setOperation,
    expireOperation,
  ].filter(Boolean), // drop any operation that is undefined for this entry
);

Why This Works

  • All Redis operations are issued concurrently
  • No await between the tag writes and the value writes
  • Concurrent get() calls either see:
    • ✅ a complete cache entry (tags + value), or
    • ✅ no cache entry at all (while the write is in flight)
  • They never see tags without a value

@themitvp
Contributor Author

@AyronK is it possible to deploy this as a release candidate or something? Then I can test it out in our production env and see if it really improves our metrics before we merge it into master

@AyronK
Collaborator

AyronK commented Jan 23, 2026

@themitvp I'll take a look on Monday, but yes I can draft a prerelease.

@AyronK
Collaborator

AyronK commented Jan 23, 2026

Btw you worried me so I quickly checked our production. It's at a 98% hit ratio 🤔. Running with 10-30 Next.js applications at a time. I'll investigate further next week.

@AyronK
Collaborator

AyronK commented Jan 23, 2026

But from a quick look at your issue and code, it may make sense in some scenarios. Thanks for reporting and suggesting a fix!

@themitvp
Contributor Author

Btw you worried me so I quickly checked our production. It's at a 98% hit ratio 🤔. Running with 10-30 Next.js applications at a time. I'll investigate further next week.

Oh interesting, thanks for checking! I wonder if the issue is something else, but I can't see what else it could be. I also implemented a lock mechanism in my project in case of thundering-herd issues, but it made no impact on the cache hit ratio. We did add some logs, though, and can see that the pods really do compete to set data quite often, so at least the lock works.

I am also considering opening another PR to add the lock mechanism, if you are interested.

I also thought that maybe we need to add some jitter to the TTLs, as we may have many keys that expire at the same time 🤔 I will need to double-check this.
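
Something along these lines, just as a hypothetical sketch (not handler code):

// Hypothetical TTL jitter: spread expirations so many keys don't expire at once.
function ttlWithJitter(baseTtlSeconds: number, maxJitterSeconds = 30): number {
  return baseTtlSeconds + Math.floor(Math.random() * maxJitterSeconds);
}

// e.g. await client.expire(key, ttlWithJitter(300));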

@AyronK
Collaborator

AyronK commented Jan 24, 2026

I also thought that maybe we need to add some jitter to the TTLs, as we may have many keys that expire at the same time 🤔 I will need to double-check this.

We do that; we have a random skew for each cache TTL (well, most of them). And I'm not saying there is no race condition, just that I haven't experienced one, and nobody else did (or is aware of it) until you, I suppose 😅.

I am also considering opening another PR to add the lock mechanism, if you are interested.

Not sure what kind of locks you have in mind and where exactly.

@AyronK
Collaborator

AyronK commented Jan 26, 2026

@themitvp 2.5.1-alhpa.1. Forgive me a typo 😆.

@themitvp
Contributor Author

We do that; we have a random skew for each cache TTL (well, most of them). And I'm not saying there is no race condition, just that I haven't experienced one, and nobody else did (or is aware of it) until you, I suppose 😅.

Uh nice, good to know! I will experiment with that as well, but I need to test each change separately, otherwise I won't know what fixed it 😅

Not sure what kind of locks you have in mind and where exactly.

It is whenever we are trying to set a new value in Redis. We thought that maybe the odd cache hit ratio was because multiple pods were trying to set a value at the same time. So we are doing a distributed Redis lock: only the first request that sees expired cache acquires the lock and fetches fresh data, while the others just see the lock and skip setting data. We added some logging to it and can actually see that it happens quite often. However, this means that we still call our backend multiple times, so there is still some more work for us to do here. When we finalize a nice solution I can share it with you, if it makes sense to add it as part of the library.
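
Roughly this shape, as an illustrative sketch (hypothetical names, assuming a node-redis v4 client; fetchFromBackend is just a stand-in for our data fetch):

import { createClient } from "redis";

type RedisClient = ReturnType<typeof createClient>;

// Hypothetical stand-in for the real backend call.
declare function fetchFromBackend(key: string): Promise<unknown>;

// Only the pod that wins SET NX regenerates the entry; the others skip the write.
async function regenerateWithLock(client: RedisClient, key: string) {
  const lockKey = `${key}:lock`;
  // NX: set only if the lock does not exist; PX: auto-expire so a crashed pod
  // cannot hold the lock forever.
  const acquired = await client.set(lockKey, "1", { NX: true, PX: 10_000 });
  if (acquired !== "OK") {
    return; // another pod is already regenerating this entry
  }
  try {
    const fresh = await fetchFromBackend(key);
    await client.set(key, JSON.stringify(fresh), { EX: 300 });
  } finally {
    await client.del(lockKey);
  }
}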

@themitvp 2.5.1-alhpa.1. Forgive me a typo 😆.

Nice, thank you @AyronK 🙌 will update you when we know more 👌

@themitvp
Contributor Author

@AyronK So the "fix" has been running stably in production for a little less than a day now, but it has made no big impact on any of our metrics, and the cache hit ratio remains at 66.6% 😞 So it is up to you whether you wanna continue with this PR or close it. I think it does improve the code slightly, but it's also fair if you wanna leave things as they are.

So I will try adding jitter to our TTLs and see if that fixes anything, wish me luck 🀞

@AyronK
Collaborator

AyronK left a comment

Please remove the examples/race-condition-demo. I am not convinced it actually proves anything, especially after your experience from production. I would like to keep the improvement, but the code itself is enough 😊. Thanks for the contribution!

@themitvp
Contributor Author

Done, I removed the example and comment 👌

AyronK merged commit c4569a5 into fortedigital:master on Jan 27, 2026
1 check passed