Adds caller cancellation token propagation in hedging and timeout strategies #3094
Conversation
martincostello
left a comment
This looks a lot less involved than I thought it might have been.
Just a few comments.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3094   +/-   ##
=======================================
  Coverage   96.16%   96.16%
=======================================
  Files         310      311    +1
  Lines        7136     7139    +3
  Branches     1005     1006    +1
=======================================
+ Hits         6862     6865    +3
  Misses        221      221
  Partials       53       53
CI failure is fixed by #3096.
Some test fixes needed for
Yeah just noticed, didn't know that was still a target; will fix.
Didn't think it was worth doing the whole
Thanks for your contribution @DaRosenberg - the changes from this pull request have been published as part of version 8.7.0 📦, which is now available from NuGet.org 🚀
Pull Request
The issue or feature being addressed
Fixes #3086 (Timeout strategy does not propagate the caller's CancellationToken)
When a resilience pipeline contains a strategy that substitutes the execution CancellationToken with an internal one (timeout, and also hedging), a caller-initiated cancellation surfaces an OperationCanceledException whose CancellationToken is Polly's internal token rather than the caller's. This breaks the common pattern of letting caller cancellation pass through unchanged while wrapping other failures, because callers cannot reliably compare OperationCanceledException.CancellationToken to their own token.
Details on the issue fix or feature implementation
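To make the pattern from the issue concrete, here is a minimal caller-side sketch. It assumes the Polly.Core v8 package is referenced. The timeout and delay values and the InvalidOperationException wrapper are illustrative choices, not taken from the issue or the PR.

```csharp
// Sketch only: requires the Polly.Core NuGet package (v8 API).
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;

// A pipeline whose only strategy substitutes the execution token (timeout).
var pipeline = new ResiliencePipelineBuilder()
    .AddTimeout(TimeSpan.FromSeconds(30))
    .Build();

using var callerCts = new CancellationTokenSource();
callerCts.CancelAfter(TimeSpan.FromMilliseconds(100)); // caller gives up early
CancellationToken callerToken = callerCts.Token;

try
{
    await pipeline.ExecuteAsync(
        async ct => { await Task.Delay(TimeSpan.FromSeconds(5), ct); },
        callerToken);
}
catch (OperationCanceledException ex) when (ex.CancellationToken == callerToken)
{
    // Caller-initiated cancellation: handled as a normal cancellation.
    // This filter only matches if the exception carries the caller's token.
    Console.WriteLine("Cancelled by the caller.");
}
catch (Exception ex)
{
    // Every other failure is wrapped (the wrapper type is a placeholder).
    throw new InvalidOperationException("The operation failed.", ex);
}
```

According to the PR description, before this change the exception would carry Polly's internal token, so the first filter would not match and the caller's own cancellation would fall through to the wrapping handler.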
Goal
Polly should throw an OperationCanceledException carrying the caller's token if and only if the cancellation was actually caused by a cancellation request on that token, for any pipeline, regardless of which strategies it is composed of or how they are nested.
Approach
A repo-wide audit shows that within Polly.Core exactly two strategies substitute the execution token: timeout (TimeoutResilienceStrategy) and hedging (TaskExecution via ResilienceContext.InitializeFrom). Every other strategy and the pipeline plumbing only read context.CancellationToken, so they already emit the correct token at their own level.
The fix therefore lives in those two strategies, via a small shared helper:
TimeoutRejectedException) is unchanged.
Because each substituting strategy normalizes back to its own previous token, the behavior composes correctly through arbitrary nesting: an inner timeout rewrites to the mid-level token, the outer timeout rewrites that to the caller's token, and so on. The simplest case (AddTimeout only) and deeply nested cases both end up with the caller's token.
Design decisions and trade-offs
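The nesting argument from the Approach section can be sketched with plain BCL types. This is not Polly's helper: RunWithSubstitutedTokenAsync is a hypothetical name, and the linked CancellationTokenSource is only an assumed stand-in for a strategy that substitutes the execution token.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

using var callerCts = new CancellationTokenSource();
CancellationToken callerToken = callerCts.Token;

try
{
    // The outer scope substitutes the caller's token; the inner scope
    // substitutes the outer scope's token (two levels of nesting).
    await Scopes.RunWithSubstitutedTokenAsync(callerToken, outerToken =>
        Scopes.RunWithSubstitutedTokenAsync(outerToken, async innerToken =>
        {
            callerCts.Cancel(); // the caller cancels mid-execution
            await Task.Delay(Timeout.InfiniteTimeSpan, innerToken);
        }));
}
catch (OperationCanceledException ex)
{
    Console.WriteLine(ex.CancellationToken == callerToken); // prints True
}

static class Scopes
{
    // Hypothetical stand-in for a token-substituting strategy.
    public static async Task RunWithSubstitutedTokenAsync(
        CancellationToken previousToken,
        Func<CancellationToken, Task> callback)
    {
        using var internalCts =
            CancellationTokenSource.CreateLinkedTokenSource(previousToken);
        try
        {
            await callback(internalCts.Token);
        }
        catch (OperationCanceledException) when (previousToken.IsCancellationRequested)
        {
            // Normalize back to the token this scope received, one level up.
            throw new OperationCanceledException(previousToken);
        }
    }
}
```

Each scope only knows the token it was handed, so neither needs to know how deeply it is nested. The inner scope rewrites to the outer scope's token, and the outer scope rewrites that to the caller's.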
new OperationCanceledException(callerToken).TrySetStackTrace(), matching the existing convention in DelegatingComponent, CompositeComponent, and hedging's pre-execution cancellation check. We deliberately did not chain the original exception as an InnerException; it keeps the behavior consistent with the rest of the codebase, at the cost of not preserving the original deep stack trace in this specific path.
Polly.Core) only. The legacy v7 Policy API uses a combined linked token and already documents (in AsyncTimeoutEngine/TimeoutEngine) that the token on the exception is not reliable for this determination. We left v7 untouched to avoid changing long-stable behavior.
callerToken.IsCancellationRequested, read after the exception has been produced. Because a CancellationToken is a monotonic latch with no record of when or why it fired, there is an inherent, unavoidable race: if a non-caller cause produces the exception and the caller then cancels within the small window before we inspect the token, the cancellation is attributed to the caller. We chose not to add machinery to fight this, for three reasons:
previousToken.IsCancellationRequested proxy, so this change does not make it worse (it only makes the resulting token more coherent).
Tests
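The attribution race described under the design decisions can be shown deterministically with BCL types alone. This sketch is not one of the PR's tests: the unrelated token source and the Capture helper are hypothetical, and the race window is forced by cancelling the caller's token between producing the exception and inspecting the token.

```csharp
using System;
using System.Threading;

using var callerCts = new CancellationTokenSource();
using var unrelatedCts = new CancellationTokenSource();
CancellationToken callerToken = callerCts.Token;

// 1. A non-caller cause produces the exception.
unrelatedCts.Cancel();
OperationCanceledException produced = Capture(unrelatedCts.Token);

// 2. The caller cancels inside the window, after the exception exists.
callerCts.Cancel();

// 3. The latch is inspected: it records only that the token fired,
//    not when or why.
bool attributedToCaller = callerToken.IsCancellationRequested;

Console.WriteLine(produced.CancellationToken == callerToken); // prints False
Console.WriteLine(attributedToCaller);                        // prints True

// Hypothetical helper: returns the exception thrown for a cancelled token.
static OperationCanceledException Capture(CancellationToken token)
{
    try
    {
        token.ThrowIfCancellationRequested();
    }
    catch (OperationCanceledException ex)
    {
        return ex;
    }

    throw new InvalidOperationException("The token was not cancelled.");
}
```

The exception was not caused by the caller, yet a check of IsCancellationRequested made after the fact attributes it to the caller. This is the window the PR accepts rather than adding machinery to close.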
Issues (IssuesTests.CancellationTokenPropagation_3086.cs) with end-to-end coverage: the exact issue repro, timeout±retry in both orders, hedging, nested timeouts, a no-substitution baseline, and two "only if" guards (a real timeout still throws TimeoutRejectedException; an unrelated OperationCanceledException is preserved when the caller token is not cancelled).
All Polly.Core.Tests pass; branch and method coverage remain at 100% and the new code is fully covered.
Confirm the following