Fix: Race condition #192
Conversation
@AyronK is it possible to deploy this as a release candidate or something? Then I can test it out in our production env and see if it really improves our metrics before we merge it into master.
@themitvp I'll take a look on Monday, but yes, I can draft a prerelease.
Btw, you worried me, so I quickly checked our production: it's at 98% cache hits, running with 10-30 Next.js applications at a time. I'll investigate further next week.
That said, from a quick look at your issue and code, the change may make sense in some scenarios. Thanks for reporting this and suggesting a fix!
Oh interesting, thanks for checking! I wonder if the issue is something else, but I can't see what else it could be. I also implemented a lock mechanism in my project to guard against thundering-herd issues, but it made no impact on the cache hit ratio. We added some logs, though, and can see that the pods really do compete to set data quite often, so at least the lock works. I am also considering opening another PR to add the lock mechanism if you are interested. I also thought that maybe we need to add some jitter to the TTLs, since we may have many keys that expire at the same time; I will need to double-check this.
We already do that: we apply a random skew to each cache TTL (well, to most of them). And I'm not saying there is no race condition, just that I haven't experienced one; nobody has (or is aware of it) until you, I suppose.
Not sure what kind of locks you have in mind and where exactly.
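For illustration, the TTL skew mentioned above usually amounts to adding a small random offset when an entry is written, so that keys cached together do not all expire in the same instant. A minimal sketch; the 10% skew ratio and helper name are arbitrary examples, not the library's actual values:

```typescript
// Hypothetical helper: spread expirations out by adding up to 10% random
// skew to the base TTL, so keys written together do not expire together.
function jitteredTtl(baseTtlSeconds: number, maxSkewRatio = 0.1): number {
  const skew = Math.floor(baseTtlSeconds * maxSkewRatio * Math.random());
  return baseTtlSeconds + skew;
}

// Usage, e.g.: client.expire(key, jitteredTtl(300));
```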
@themitvp |
Ah nice, good to know! I will experiment with that as well, but I need to test each change separately, otherwise I won't know what fixed it.
It is whenever we try to set a new value in Redis. We thought that maybe the odd cache hit ratio was caused by multiple pods trying to set a value at the same time. So we use a distributed Redis lock: only the first request that sees an expired cache entry acquires the lock and fetches fresh data, while the others see the lock and skip setting data. We added some logging to it and can see that this actually happens quite often. However, this still means we call our backend multiple times, so there is more work to do here. When we land on a nice solution I can share it with you, if it makes sense to add it to the library.
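A minimal sketch of the kind of distributed lock described here, assuming a node-redis v4 client; the key names, TTLs, and helper names are illustrative, not the code actually used:

```typescript
import { createClient } from "redis";

const client = createClient({ url: process.env.REDIS_URL });
// Call `await client.connect()` once at startup before using these helpers.

// SET ... NX EX: only the first pod to ask gets the lock; the lock expires
// on its own so a crashed pod cannot hold it forever.
async function tryAcquireRefreshLock(cacheKey: string, ttlSeconds = 30): Promise<boolean> {
  const result = await client.set(`lock:${cacheKey}`, "1", { NX: true, EX: ttlSeconds });
  return result === "OK";
}

// Only the lock holder fetches fresh data and writes it back; other pods
// that see an expired entry simply skip the write (they still hit the
// backend for their own response, which is the remaining problem noted above).
async function refreshEntry(cacheKey: string, fetchFresh: () => Promise<string>): Promise<void> {
  if (!(await tryAcquireRefreshLock(cacheKey))) {
    return;
  }
  try {
    const fresh = await fetchFresh();
    await client.set(cacheKey, fresh, { EX: 300 });
  } finally {
    await client.del(`lock:${cacheKey}`);
  }
}
```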
Nice, thank you @AyronK, I will update you when we know more!
@AyronK So the "fix" has been running stably in production for a little less than a day now; however, it has made no big impact on any of our metrics, and the cache hit ratio remains at 66.6%. So it is up to you whether you want to continue with this PR or close it: I think it does improve the code slightly, but it is also fair if you want to keep it as it is. I will try adding jitter to our TTLs next and see if that fixes anything, wish me luck!
AyronK left a comment
Please remove the examples/race-condition-demo. I am not convinced it actually proves anything, especially after your experience from production. I would like to keep the improvement, but the code itself is enough. Thanks for the contribution!
Done, I removed the example and the comment.
Fix Race Condition Bug

Package: @fortedigital/nextjs-cache-handler v2.5.0
File: packages/nextjs-cache-handler/src/handlers/redis-strings.ts
Issue: Cache hit ratio degraded to ~66.6% in production (should be ~100%)
Root Cause: Race condition in the set() method

Problem Summary
In production with 28-31 Kubernetes pods running Next.js applications that use this cache handler, we observed a consistent 66.6% cache hit ratio when it should be near 100%. The resulting false misses translate into unnecessary calls to our backend for data that should have been served from cache.
The set() method in redis-strings.ts performs its Redis operations in two sequential Promise.all() calls:

Buggy Code (Current)
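The snippet below is a simplified sketch of that two-phase shape, assuming a node-redis v4 client; the command mix and key names are illustrative rather than the handler's verbatim code:

```typescript
import { createClient } from "redis";

const client = createClient({ url: process.env.REDIS_URL });
// Call `await client.connect()` once before using the handler.

// Illustrative two-phase set(): the value lands first, tag metadata second.
async function set(key: string, value: unknown, ttlSeconds: number, tags: string[]): Promise<void> {
  // Phase 1: write the cache entry and its TTL.
  await Promise.all([
    client.set(key, JSON.stringify(value)),
    client.expire(key, ttlSeconds),
  ]);

  // <-- race window: a concurrent get() that needs both the value and the
  //     tag metadata can land here, see an inconsistent entry, and report
  //     a miss (null) even though fresh data is mid-write.

  // Phase 2: write the tag bookkeeping used for revalidation.
  await Promise.all([
    client.hSet("sharedTags", key, JSON.stringify(tags)),
  ]);
}
```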
The Race Condition

Between Phase 1 and Phase 2 there is a critical race window: a concurrent get() can observe a half-written entry, treat it as missing or stale, and return null (a false cache miss). With 28-31 pods and high concurrency, approximately 1/3 of requests fall into this race window, explaining the 66.6% hit ratio.
Reproduction
I created a test that simulates cache invalidation with concurrent readers:
Test Setup
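An illustrative harness for the described setup (not the original test): one writer repeatedly invalidates and re-writes an entry with the two-phase pattern while concurrent readers count how often they observe an inconsistent, half-written entry. Key names, iteration counts, and reader counts are arbitrary.

```typescript
import { createClient } from "redis";

async function main(): Promise<void> {
  const client = createClient({ url: process.env.REDIS_URL });
  await client.connect();

  const key = "repro:value";
  const tagKey = "repro:tags";
  let done = false;
  let reads = 0;
  let inconsistent = 0;

  // Writer: invalidate, then re-write the entry with the two-phase pattern.
  const writer = (async () => {
    for (let i = 0; i < 2000; i++) {
      await client.del([key, tagKey]);                      // cache invalidation
      await Promise.all([client.set(key, `value-${i}`)]);   // phase 1
      await Promise.all([client.set(tagKey, `tags-${i}`)]); // phase 2
    }
    done = true;
  })();

  // Readers: an entry is consistent only if both pieces are present or both
  // are absent; seeing exactly one of them means the read hit the phase gap.
  const readers = Array.from({ length: 20 }, async () => {
    while (!done) {
      const [value, tags] = await Promise.all([client.get(key), client.get(tagKey)]);
      reads += 1;
      if ((value === null) !== (tags === null)) inconsistent += 1;
    }
  });

  await Promise.all([writer, ...readers]);
  console.log(`inconsistent reads: ${inconsistent}/${reads}`);
  await client.quit();
}

main().catch(console.error);
```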
Results - BEFORE Fix
Note: Local testing shows a 0.55% miss rate because of localhost Redis speed; in production, with network latency across 28-31 pods, this amplifies to a 33.4% miss rate (66.6% hit ratio).
Results - AFTER Fix
The Fix
Execute all Redis operations in a single Promise.all():
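A sketch of the proposed shape, under the same illustrative assumptions as the earlier snippet (node-redis v4 client, hypothetical key names): every command for the entry is issued in one batch, so there is no await gap between the value and its metadata.

```typescript
// Proposed shape: one batch instead of two sequential phases.
// `client` is the same node-redis v4 client as in the earlier sketch.
async function set(key: string, value: unknown, ttlSeconds: number, tags: string[]): Promise<void> {
  await Promise.all([
    client.set(key, JSON.stringify(value)),
    client.expire(key, ttlSeconds),
    client.hSet("sharedTags", key, JSON.stringify(tags)),
  ]);
}
```

Note that Promise.all only issues the commands concurrently over the same connection; Redis still executes them one by one, so strict all-or-nothing semantics would require a MULTI/EXEC transaction. The change simply removes the await gap in which an entry could sit half-written.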
Why This Works

Concurrent get() calls either see no entry at all (a genuine miss while the write is in flight) or the complete new entry; they no longer observe a half-written entry and report a false miss.