[core] Handle 429 and 500 errors from worlds in runtime#966
VaguelySerious merged 12 commits into main
Conversation
Signed-off-by: Peter Wielander <[email protected]>
🦋 Changeset detected. Latest commit: 9f75d33. The changes in this PR will be included in the next version bump. This PR includes changesets to release 19 packages.
🧪 E2E Test Results

❌ Some tests failed

Failed tests: 🌍 Community Worlds, 42 failed (mongodb: 1 failed, turso: 41 failed)

Details by category:
- ✅ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ❌ 🌍 Community Worlds
- ✅ 📋 Other
📊 Benchmark Results

(Benchmark tables omitted: the report covers a workflow with no steps; workflows with 1, 10, 25, and 50 sequential steps; Promise.all and Promise.race with 10, 25, and 50 concurrent steps; and stream benchmarks with TTFB metrics. Each is measured in 💻 Local Development and ▲ Production (Vercel), with observability links for Express and Next.js (Turbopack). The summary lists the fastest framework by world and the fastest world by framework, with the winner determined by most benchmark wins.)
- Propagate `Retry-After` header as `WorkflowAPIError.retryAfter` on 429 responses
- Add `withThrottleRetry` wrapper for both workflow and step handlers
- Re-enqueue workflows on 5xx errors with exponential backoff (5s, 30s, 120min)
- Track `serverErrorRetryCount` in queue payload for retry budgeting
- Expose `delaySeconds` on `QueueOptions` interface

Co-Authored-By: Claude Opus 4.6 <[email protected]>
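For illustration, the short-wait half of such a wrapper might look roughly like this. This is a sketch, not the PR's actual code: `ThrottleError` stands in for the real `WorkflowAPIError`, and the 10-second threshold mirrors the `retryAfterSeconds < 10` check quoted in the reviews.

```typescript
// Stand-in for the real WorkflowAPIError, carrying the parsed Retry-After value.
class ThrottleError extends Error {
  constructor(
    public status: number,
    public retryAfter?: number,
  ) {
    super(`HTTP ${status}`);
  }
}

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// For short waits, sleep in-process and retry the handler once.
// For long waits (or non-429 errors), rethrow so the caller can
// defer to the queue (re-enqueue with delaySeconds) instead.
async function withThrottleRetry<T>(
  fn: () => Promise<T>,
  shortWaitThresholdSeconds = 10,
): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    if (
      err instanceof ThrottleError &&
      err.status === 429 &&
      (err.retryAfter ?? 1) < shortWaitThresholdSeconds
    ) {
      await sleep((err.retryAfter ?? 1) * 1000);
      return await fn(); // a second 429 here propagates to the queue path
    }
    throw err;
  }
}
```

A second 429 on the in-process retry is rethrown rather than looped, so the queue-deferral path can take over.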
);
// Short wait: sleep in-process, then retry once
await new Promise((resolve) =>
  setTimeout(resolve, retryAfterSeconds * 1000)
Should we account for function execution time limits, specifically in the case of the Vercel world? If the serverless function is already close to the end of its limit and the workflow server throws a 429, adding a 10-second sleep could exceed the function execution limit and the function could get SIGKILLed midway.
The workflow layer should never take much more than a few seconds, so I think it's highly unlikely that we'd run into timeouts. I'm not too worried about this, but it's technically a concern.
Rest of the code looks good except the one concern I have around function execution limits in the case of the Vercel world.
pranaygp left a comment
Overall this looks solid — the 429 throttle retry and 5xx backoff are clean additions. The withThrottleRetry wrapper with the "safe to retry because 429 is pre-processing" invariant is a nice design. A few things to consider:
world-local doesn't support delaySeconds — The QueueOptions interface was updated and world-vercel passes it through, but world-local/src/queue.ts silently ignores opts?.delaySeconds. If a 5xx retry fires in local dev, the message will be re-enqueued but delivered immediately instead of after the intended backoff. Not a blocker since 5xx from world-local is uncommon, but could cause confusing behavior during local fault testing (which the PR description mentions wanting to add).
Step handler diff is mostly re-indentation — confirmed the actual logic changes are just wrapping in withThrottleRetry. The step handler does NOT get 5xx retry (only 429), which makes sense since the queue's built-in redelivery + step_started idempotency handles that.
Inline review notes

Since GitHub's review API is having issues, posting inline comments here:
The third backoff step is 7200s (2 hours). Is the expectation that persistent 5xx would trigger an alert/incident on the server side well before this fires? A brief comment explaining the rationale for these specific values would help future readers.
If the first in-process retry succeeds, great. But if it gets a different error (not 429), that error is thrown — which means the handler fails entirely rather than deferring to queue. This seems correct since a non-429 error should propagate, just want to confirm the intent.
nit:
pranaygp left a comment
Overall this is clean and well-structured. The two-tier 429 strategy (in-process for short waits, defer to queue for long waits) is nice. A few questions and concerns below.
*
* Safe to retry the entire handler because 429 is sent from server middleware
* before the request is processed — no server state has changed.
*/
The comment says "Safe to retry the entire handler because 429 is sent from server middleware before the request is processed — no server state has changed."
This holds if the 429 always hits on the first API call in the handler. But the handler makes multiple world API calls sequentially (e.g. runs.get → events.create(run_started) → replay → events.create(run_completed)). If a later call gets 429'd, the retry re-executes everything from the top.
For the workflow handler this is probably fine since replay is deterministic and events are idempotent.
For the step handler this is more concerning — the retry would re-execute user step code, which may not be idempotent. Is the assumption that the workflow-server's rate limiting middleware rejects at the connection level (so all calls in one handler invocation either succeed or all fail)? If so, worth documenting that assumption.
packages/core/src/runtime.ts
Outdated
err.status >= 500
) {
const retryCount = serverErrorRetryCount ?? 0;
const delaySecondSteps = [5, 30, 7200]; // 5s, 30s, 120min
The third step is 7200 seconds = 2 hours. Is that intentional? That's a very long time for a workflow to be stalled on a server error. If the 500 was truly transient, 2 hours feels excessive. If it's a sustained outage, you'd probably want to fail rather than retry after 2h when the server might have moved on.
Maybe something like [5, 30, 120] (5s, 30s, 2min) would be more practical?
Changed to [5, 30, 120]. Realistically, if we keep getting 500s it'd be nice to retry the run much later so we have a chance to incident-mitigate, but I think it's fine to fail runs instead too.
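A sketch of the revised retry budget. The name `serverErrorRetryCount` comes from the PR description; the helper itself is hypothetical:

```typescript
// Revised 5xx backoff schedule after this review: 5s, 30s, 2min.
const DELAY_SECOND_STEPS = [5, 30, 120];

// Returns the delay before the next re-enqueue, or null when the
// retry budget is exhausted and the run should fail instead.
function nextServerErrorDelay(serverErrorRetryCount: number): number | null {
  if (serverErrorRetryCount >= DELAY_SECOND_STEPS.length) return null;
  return DELAY_SECOND_STEPS[serverErrorRetryCount];
}
```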
cool. this was a fully AI-generated review btw. idk where it came up with 5s, 30s, 120s, but I do agree we should have some shorter ones before getting to 7200 anyway
}

const maxRetries = stepFn.maxRetries ?? DEFAULT_STEP_MAX_RETRIES;
return await withThrottleRetry(async () => {
The step handler gets withThrottleRetry for 429s but no equivalent 5xx retry logic like the workflow handler has. If a world API call (e.g. events.create for step_started or step_completed) returns a 5xx, it'll be caught by the step's error handling and burn a step retry attempt — re-executing potentially expensive user code for what was a transient infrastructure error.
Is this intentional for this PR's scope, or should step handlers also get 5xx backoff retry?
500s are handled via the regular step retry. 429s are only special-cased because they allow the step-retry mechanism to use the Retry-After header instead of guessing at the retry delay.
right but the comment is about not burning a full step attempt for transient 5xx. doesn't need to be addressed in this PR necessarily but good to think about. happy to move in any direction for now since they're all better than nothing and then come back to this in the future with real experience
Makes sense that steps rely on their existing retry mechanism. One concern though: with withThrottleRetry removed, a 429 from the world API during step execution (e.g. on step_started or step_completed) will now burn a step retry attempt — potentially exhausting maxRetries without the user's code ever being at fault.
Ideally 429s should be retried transparently without consuming an attempt. A targeted retry around just the world API calls (not wrapping the user code execution) could thread that needle. Worth tracking as a follow-up.
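A hypothetical shape for that follow-up, retrying only the world API call so user step code is never re-executed. `withApiThrottleRetry` is an invented name, and the error shape (an object with `status` and `retryAfter`) is assumed:

```typescript
const wait = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Wrap an individual world API call (e.g. events.create for step_started),
// NOT the user's step function. A 429 here is retried transparently
// without consuming one of the step's maxRetries attempts.
async function withApiThrottleRetry<T>(
  callWorldApi: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await callWorldApi();
    } catch (err: any) {
      const throttled = err?.status === 429;
      if (!throttled || attempt >= maxAttempts) throw err;
      // Honor Retry-After when present; fall back to 1s.
      await wait((err.retryAfter ?? 1) * 1000);
    }
  }
}
```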
idempotencyKey?: string;
headers?: Record<string, string>;
/** Delay message delivery by this many seconds */
delaySeconds?: number;
Good that this was promoted from the vercel-specific type to the shared interface. Note that world-local currently ignores delaySeconds entirely — so 5xx retries in local dev will fire immediately instead of with backoff. Not a blocker but worth a follow-up or at minimum a comment.
Seems fine to ignore for local world, since there are no 429s
although we probably do actually want to support delaySeconds in local world anyway as we start using this option in queue more often in the future for more use cases. even if it's not implemented in this PR, I think we should leave an explicit comment that local world ignores this since it's a nuance. We should later have an e2e test that checks for this behaviour and would fail on local world without proper queue delaySeconds implementation (cc @TooTallNate )
pranaygp left a comment
Looks good! The removal of withThrottleRetry from the step handler was the right fix for the re-execution safety concern.
One thing worth tracking as a follow-up: with withThrottleRetry removed from steps, a 429 from the world API during step execution (e.g. on step_started or step_completed) will now burn a step retry attempt. This means a transient throttle from the server could exhaust a step's maxRetries without the user's code ever being at fault. Ideally 429s should be retried transparently without consuming an attempt — maybe a targeted retry around just the world API calls (not the user code execution) would thread that needle.
pranaygp left a comment
Review: [core] Handle 429 and 500 errors from worlds in runtime
Overall this is a solid approach to handling transient server errors. The 429 throttle retry with short/long path split is well-designed, and the 500 retry via re-enqueue with exponential backoff is a safe pattern. A few items to address below.
);
runtimeLogger.warn(
  'Throttled again on retry, deferring to queue',
  {
Redundant check: this is already inside if (WorkflowAPIError.is(err)), so the second WorkflowAPIError.is(err) on this line is always true. Should just be if (err.status === 429).
step = startResult.step;
} catch (err) {
if (WorkflowAPIError.is(err)) {
  if (WorkflowAPIError.is(err) && err.status === 429) {
The step handler has 429 handling here inside the step_started catch block, but unlike the workflow handler, it does not use withThrottleRetry and does not have the short-wait in-process retry path. This means a brief 429 (e.g., retryAfter=2s) will always defer to the queue rather than sleeping in-process. Is that intentional? If steps should also get the in-process retry for short waits, consider wrapping with withThrottleRetry similarly to the workflow handler.
if (retryAfterSeconds < 10) {
  runtimeLogger.warn(
    'Throttled by workflow-server (429), retrying in-process',
    {
Nit: if retryAfter is undefined (header missing), this defaults to 1 second. For 429 responses without a Retry-After header, 1 second might be too aggressive. A slightly higher default (e.g., 3-5s) would be more conservative and reduce the chance of hitting the server again immediately while it's under load.
workflowStartedAt = +workflowRun.startedAt;

// At this point, the workflow is "running" and `startedAt` should
// definitely be set.
The entire workflow handler body is now wrapped in withThrottleRetry. This means if a 429 is thrown by any world call during workflow execution (e.g., world.runs.get, world.events.create, runWorkflow), the entire handler will be retried from scratch.
- If the `run_started` event was already created successfully and then a later call throws 429, the retry will call `world.runs.get` again and find the run already `running`, which is handled correctly.
- However, if `run_completed` event creation throws 429, the retry would re-run the entire workflow (replay + execution). The workflow replay should be deterministic, but this could be expensive. Worth documenting this tradeoff in a comment.
});
// Use the run entity from the event response (no extra get call needed)
if (!result.run) {
if (workflowRun.status === 'pending') {
The 500 retry re-enqueues with a new requestedAt and incremented serverErrorRetryCount. The traceCarrier is serialized fresh via serializeTraceCarrier(). If the trace context matters for correlating retries, this creates a new trace for each retry attempt. Consider whether preserving the original traceContext from the message would be preferable for observability (linking all retry attempts to the same parent trace).
WorldParseFormat,
PeerService,
RpcSystem,
RpcService,
Note: the HTTP Retry-After header can also be an HTTP-date (e.g., Wed, 21 Oct 2015 07:28:00 GMT). The parseInt approach correctly returns NaN for date values, falling back to undefined. This is fine since the workflow-server likely only sends numeric values, but a comment noting this would be helpful.
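A sketch of a `Retry-After` parser that accepts both forms allowed by RFC 9110 (delta-seconds and an HTTP-date). The function name and signature are illustrative, not the PR's code:

```typescript
// Parse a Retry-After header value into seconds.
// Handles both "120" (delta-seconds) and "Wed, 21 Oct 2015 07:28:00 GMT"
// (HTTP-date). Returns undefined for missing or unparseable values,
// matching the parseInt-based fallback described in the review.
function parseRetryAfter(header: string | null, now = Date.now()): number | undefined {
  if (!header) return undefined;
  const seconds = Number(header);
  if (Number.isFinite(seconds) && seconds >= 0) return seconds;
  const dateMs = Date.parse(header);
  if (!Number.isNaN(dateMs)) return Math.max(0, (dateMs - now) / 1000);
  return undefined;
}
```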
Uses the Retry-After header for 429 when provided.
500s are limited to 3 retries with exponential backoff.
Review with whitespace changes hidden for sanity.
Claude suggested e2e tests, which I think we might do separately. I think we should make some sort of testbench for world errors and how the runtime reacts.