[core] Handle 429 and 500 errors from worlds in runtime #966

Merged
VaguelySerious merged 12 commits into main from peter/429-core
Feb 9, 2026

Conversation

VaguelySerious (Member) commented Feb 6, 2026

Uses the Retry-After header for 429s when provided.
500s are limited to 3 retries, with exponential backoff.

Review with whitespace off for sanity.

Below is Claude's suggestion for e2e tests, which I think we might do separately. I think we should build some sort of testbench for world errors and how the runtime reacts to them.

  3. E2E with a fault-injection endpoint (moderate effort, very targeted)

  Add a test-only endpoint to the workbench app (or a middleware) that can be told to return 429/500 for the next N requests. Something like:

  POST /e2e/inject-fault { status: 429, retryAfter: 3, count: 2 }

  Then the e2e test would:
  1. Inject a fault (e.g. "next 2 requests to workflow handler return 429 with Retry-After: 3")
  2. Start a workflow
  3. Assert it completes successfully despite the transient errors
  4. Optionally check logs/traces for the retry events

  This is the most surgical approach but requires adding test-only infrastructure. The world-local queue already makes HTTP requests to the app endpoints, so you could intercept at that
  layer.
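
As a rough illustration, a fault-injection middleware along those lines could look like the sketch below (hypothetical: the Express-style API, handler names, and in-module state are assumptions, not code from this PR):

  // Hypothetical test-only fault injection for the workbench app.
  // State is module-local; a multi-process setup would need shared storage.
  import type { Request, Response, NextFunction } from 'express';

  let fault: { status: number; retryAfter?: number; count: number } | null = null;

  // POST /e2e/inject-fault { status: 429, retryAfter: 3, count: 2 }
  export function injectFaultHandler(req: Request, res: Response) {
    fault = req.body; // assumes a JSON body-parsing middleware is mounted
    res.status(204).end();
  }

  // Mount before the workflow/step handlers so injected faults hit first.
  export function faultMiddleware(req: Request, res: Response, next: NextFunction) {
    if (fault && fault.count > 0) {
      fault.count -= 1;
      if (fault.retryAfter !== undefined) {
        res.setHeader('Retry-After', String(fault.retryAfter));
      }
      res.status(fault.status).end();
      return;
    }
    next();
  }

The e2e test would then inject the fault, start the workflow, and assert completion as described above.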

Signed-off-by: Peter Wielander <[email protected]>
changeset-bot commented Feb 6, 2026

🦋 Changeset detected

Latest commit: 9f75d33

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 19 packages
Name Type
@workflow/errors Patch
@workflow/world Patch
@workflow/world-vercel Patch
@workflow/core Patch
@workflow/builders Patch
@workflow/cli Patch
workflow Patch
@workflow/world-local Patch
@workflow/world-postgres Patch
@workflow/web-shared Patch
@workflow/world-testing Patch
@workflow/next Patch
@workflow/nitro Patch
@workflow/astro Patch
@workflow/nest Patch
@workflow/rollup Patch
@workflow/sveltekit Patch
@workflow/vite Patch
@workflow/nuxt Patch


vercel bot commented Feb 6, 2026

github-actions bot commented Feb 6, 2026

🧪 E2E Test Results

Some tests failed

Summary

Passed Failed Skipped Total
✅ ▲ Vercel Production 490 0 38 528
✅ 💻 Local Development 418 0 62 480
✅ 📦 Local Production 418 0 62 480
✅ 🐘 Local Postgres 418 0 62 480
✅ 🪟 Windows 45 0 3 48
❌ 🌍 Community Worlds 105 42 9 156
✅ 📋 Other 123 0 21 144
Total 2017 42 257 2316

❌ Failed Tests

🌍 Community Worlds (42 failed)

mongodb (1 failed):

  • webhookWorkflow

turso (41 failed):

  • addTenWorkflow
  • addTenWorkflow
  • should work with react rendering in step
  • promiseAllWorkflow
  • promiseRaceWorkflow
  • promiseAnyWorkflow
  • hookWorkflow
  • webhookWorkflow
  • sleepingWorkflow
  • nullByteWorkflow
  • workflowAndStepMetadataWorkflow
  • fetchWorkflow
  • promiseRaceStressTestWorkflow
  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling error propagation workflow errors cross-file imports preserve message and stack trace
  • error handling error propagation step errors basic step error preserves message and stack trace
  • error handling error propagation step errors cross-file step error preserves message and function names in stack
  • error handling retry behavior regular Error retries until success
  • error handling retry behavior FatalError fails immediately without retries
  • error handling retry behavior RetryableError respects custom retryAfter delay
  • error handling retry behavior maxRetries=0 disables retries
  • error handling catchability FatalError can be caught and detected with FatalError.is()
  • hookCleanupTestWorkflow - hook token reuse after workflow completion
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously
  • stepFunctionPassingWorkflow - step function references can be passed as arguments (without closure vars)
  • stepFunctionWithClosureWorkflow - step function with closure variables passed as argument
  • closureVariableWorkflow - nested step functions with closure variables
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step
  • health check (queue-based) - workflow and step endpoints respond to health check messages
  • pathsAliasWorkflow - TypeScript path aliases resolve correctly
  • Calculator.calculate - static workflow method using static step methods from another class
  • AllInOneService.processNumber - static workflow method using sibling static step methods
  • ChainableService.processWithThis - static step methods using this to reference the class
  • thisSerializationWorkflow - step function invoked with .call() and .apply()
  • customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE
  • instanceMethodStepWorkflow - instance methods with "use step" directive
  • crossContextSerdeWorkflow - classes defined in step code are deserializable in workflow context
  • stepFunctionAsStartArgWorkflow - step function reference passed as start() argument
  • pages router addTenWorkflow via pages router
  • pages router promiseAllWorkflow via pages router
  • pages router sleepingWorkflow via pages router

Details by Category

✅ ▲ Vercel Production
App Passed Failed Skipped
✅ astro 44 0 4
✅ example 44 0 4
✅ express 44 0 4
✅ fastify 44 0 4
✅ hono 44 0 4
✅ nextjs-turbopack 47 0 1
✅ nextjs-webpack 47 0 1
✅ nitro 44 0 4
✅ nuxt 44 0 4
✅ sveltekit 44 0 4
✅ vite 44 0 4
✅ 💻 Local Development
App Passed Failed Skipped
✅ astro-stable 41 0 7
✅ express-stable 41 0 7
✅ fastify-stable 41 0 7
✅ hono-stable 41 0 7
✅ nextjs-turbopack-stable 45 0 3
✅ nextjs-webpack-stable 45 0 3
✅ nitro-stable 41 0 7
✅ nuxt-stable 41 0 7
✅ sveltekit-stable 41 0 7
✅ vite-stable 41 0 7
✅ 📦 Local Production
App Passed Failed Skipped
✅ astro-stable 41 0 7
✅ express-stable 41 0 7
✅ fastify-stable 41 0 7
✅ hono-stable 41 0 7
✅ nextjs-turbopack-stable 45 0 3
✅ nextjs-webpack-stable 45 0 3
✅ nitro-stable 41 0 7
✅ nuxt-stable 41 0 7
✅ sveltekit-stable 41 0 7
✅ vite-stable 41 0 7
✅ 🐘 Local Postgres
App Passed Failed Skipped
✅ astro-stable 41 0 7
✅ express-stable 41 0 7
✅ fastify-stable 41 0 7
✅ hono-stable 41 0 7
✅ nextjs-turbopack-stable 45 0 3
✅ nextjs-webpack-stable 45 0 3
✅ nitro-stable 41 0 7
✅ nuxt-stable 41 0 7
✅ sveltekit-stable 41 0 7
✅ vite-stable 41 0 7
✅ 🪟 Windows
App Passed Failed Skipped
✅ nextjs-turbopack 45 0 3
❌ 🌍 Community Worlds
App Passed Failed Skipped
✅ mongodb-dev 3 0 0
❌ mongodb 44 1 3
✅ redis-dev 3 0 0
✅ redis 45 0 3
✅ starter-dev 3 0 0
✅ turso-dev 3 0 0
❌ turso 4 41 3
✅ 📋 Other
App Passed Failed Skipped
✅ e2e-local-dev-nest-stable 41 0 7
✅ e2e-local-postgres-nest-stable 41 0 7
✅ e2e-local-prod-nest-stable 41 0 7

📋 View full workflow run

github-actions bot commented Feb 6, 2026

📊 Benchmark Results

📈 Comparing against baseline from main branch. Green 🟢 = faster, Red 🔺 = slower.

workflow with no steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Nitro 0.032s (+22.9% 🔺) 1.005s (~) 0.973s 10 1.00x
💻 Local Express 0.032s (-2.7%) 1.005s (~) 0.973s 10 1.02x
💻 Local Next.js (Turbopack) 0.046s 1.005s 0.959s 10 1.44x
🌐 Redis Next.js (Turbopack) 0.051s 1.005s 0.953s 10 1.62x
🐘 Postgres Nitro 0.106s (+13.5% 🔺) 1.009s (~) 0.903s 10 3.36x
🌐 MongoDB Next.js (Turbopack) 0.110s 1.007s 0.897s 10 3.48x
🐘 Postgres Express 0.194s (-45.0% 🟢) 1.021s (-2.1%) 0.827s 10 6.12x
🐘 Postgres Next.js (Turbopack) 0.444s 1.009s 0.565s 10 14.01x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Express 0.784s (-7.5% 🟢) 2.468s (-4.7%) 1.684s 10 1.00x
▲ Vercel Next.js (Turbopack) 0.786s (-14.1% 🟢) 2.385s (-1.5%) 1.599s 10 1.00x
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Express | Next.js (Turbopack)

workflow with 1 step

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Next.js (Turbopack) 1.105s 2.006s 0.901s 10 1.00x
💻 Local Nitro 1.105s (+2.8%) 2.006s (~) 0.901s 10 1.00x
💻 Local Express 1.108s (~) 2.006s (~) 0.898s 10 1.00x
🌐 Redis Next.js (Turbopack) 1.110s 2.006s 0.896s 10 1.00x
🌐 MongoDB Next.js (Turbopack) 1.300s 2.007s 0.707s 10 1.18x
🐘 Postgres Next.js (Turbopack) 1.907s 2.113s 0.206s 10 1.73x
🐘 Postgres Express 2.460s (+10.6% 🔺) 3.014s (~) 0.554s 10 2.23x
🐘 Postgres Nitro 2.500s (+27.1% 🔺) 3.013s (+36.1% 🔺) 0.513s 10 2.26x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 2.957s (+12.5% 🔺) 3.946s (+7.1% 🔺) 0.988s 10 1.00x
▲ Vercel Express 3.215s (+22.0% 🔺) 4.569s (+21.6% 🔺) 1.354s 10 1.09x
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack) | Express

workflow with 10 sequential steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🌐 Redis 🥇 Next.js (Turbopack) 10.722s 11.022s 0.300s 3 1.00x
💻 Local Next.js (Turbopack) 10.745s 11.023s 0.278s 3 1.00x
💻 Local Nitro 10.838s (+2.4%) 11.022s (~) 0.184s 3 1.01x
💻 Local Express 10.839s (~) 11.024s (~) 0.185s 3 1.01x
🌐 MongoDB Next.js (Turbopack) 12.349s 13.025s 0.677s 3 1.15x
🐘 Postgres Next.js (Turbopack) 15.512s 16.049s 0.537s 2 1.45x
🐘 Postgres Express 20.303s (~) 21.061s (~) 0.758s 2 1.89x
🐘 Postgres Nitro 20.372s (+32.8% 🔺) 21.053s (+31.2% 🔺) 0.682s 2 1.90x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Express 18.895s (-5.9% 🟢) 20.661s (-1.5%) 1.766s 2 1.00x
▲ Vercel Next.js (Turbopack) 21.076s (+8.2% 🔺) 22.174s (+8.3% 🔺) 1.098s 2 1.12x
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Express | Next.js (Turbopack)

workflow with 25 sequential steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🌐 Redis 🥇 Next.js (Turbopack) 26.866s 27.050s 0.185s 3 1.00x
💻 Local Next.js (Turbopack) 27.305s 28.051s 0.746s 3 1.02x
💻 Local Nitro 27.454s (+2.6%) 28.049s (+3.7%) 0.596s 3 1.02x
💻 Local Express 27.557s (~) 28.052s (~) 0.495s 3 1.03x
🌐 MongoDB Next.js (Turbopack) 30.556s 31.046s 0.490s 2 1.14x
🐘 Postgres Nitro 50.218s (+21.3% 🔺) 51.134s (+21.4% 🔺) 0.915s 2 1.87x
🐘 Postgres Express 50.303s (~) 50.630s (-1.0%) 0.327s 2 1.87x
🐘 Postgres Next.js (Turbopack) 50.312s 50.623s 0.312s 2 1.87x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Express 48.126s (-3.0%) 48.969s (-5.3% 🟢) 0.843s 2 1.00x
▲ Vercel Next.js (Turbopack) 49.886s (-4.5%) 51.786s (-3.7%) 1.900s 2 1.04x
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Express | Next.js (Turbopack)

workflow with 50 sequential steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🌐 Redis 🥇 Next.js (Turbopack) 54.374s 55.101s 0.726s 2 1.00x
💻 Local Next.js (Turbopack) 56.919s 57.103s 0.184s 2 1.05x
💻 Local Nitro 57.120s (+2.8%) 58.098s (+3.6%) 0.978s 2 1.05x
💻 Local Express 57.342s (~) 58.106s (~) 0.764s 2 1.05x
🌐 MongoDB Next.js (Turbopack) 61.157s 61.592s 0.435s 2 1.12x
🐘 Postgres Next.js (Turbopack) 91.740s 92.206s 0.466s 1 1.69x
🐘 Postgres Express 100.159s (~) 100.227s (-1.0%) 0.068s 1 1.84x
🐘 Postgres Nitro 100.479s (+26.0% 🔺) 101.227s (+26.2% 🔺) 0.748s 1 1.85x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Express 101.457s (-0.6%) 102.618s (-0.8%) 1.161s 1 1.00x
▲ Vercel Next.js (Turbopack) 104.877s (+1.8%) 105.825s (+2.1%) 0.948s 1 1.03x
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Express | Next.js (Turbopack)

Promise.all with 10 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🌐 Redis 🥇 Next.js (Turbopack) 1.254s 2.006s 0.753s 15 1.00x
💻 Local Nitro 1.405s (+4.5%) 2.006s (~) 0.601s 15 1.12x
💻 Local Express 1.416s (~) 2.005s (~) 0.589s 15 1.13x
💻 Local Next.js (Turbopack) 1.432s 2.005s 0.573s 15 1.14x
🐘 Postgres Next.js (Turbopack) 2.095s 2.474s 0.379s 13 1.67x
🌐 MongoDB Next.js (Turbopack) 2.164s 3.008s 0.844s 10 1.73x
🐘 Postgres Express 2.194s (-6.6% 🟢) 3.014s (~) 0.820s 10 1.75x
🐘 Postgres Nitro 2.485s (+40.2% 🔺) 3.014s (+45.0% 🔺) 0.529s 10 1.98x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 2.728s (+1.1%) 3.889s (+4.9%) 1.161s 8 1.00x
▲ Vercel Express 3.000s (+10.3% 🔺) 4.204s (+9.7% 🔺) 1.204s 8 1.10x
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack) | Express

Promise.all with 25 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🌐 Redis 🥇 Next.js (Turbopack) 2.485s 3.008s 0.523s 10 1.00x
💻 Local Nitro 2.544s (+13.2% 🔺) 3.007s (~) 0.463s 10 1.02x
💻 Local Express 2.619s (+2.4%) 3.007s (~) 0.388s 10 1.05x
💻 Local Next.js (Turbopack) 2.642s 3.007s 0.364s 10 1.06x
🌐 MongoDB Next.js (Turbopack) 4.767s 5.178s 0.411s 6 1.92x
🐘 Postgres Express 8.884s (-0.5%) 9.546s (+2.9%) 0.661s 4 3.58x
🐘 Postgres Nitro 9.960s (-10.4% 🟢) 10.697s (-8.6% 🟢) 0.737s 3 4.01x
🐘 Postgres Next.js (Turbopack) 12.648s 13.037s 0.389s 3 5.09x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 3.552s (+9.2% 🔺) 5.099s (+14.9% 🔺) 1.547s 6 1.00x
▲ Vercel Express 3.700s (+17.3% 🔺) 5.005s (+20.3% 🔺) 1.305s 7 1.04x
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack) | Express

Promise.all with 50 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🌐 Redis 🥇 Next.js (Turbopack) 4.067s 4.725s 0.658s 7 1.00x
💻 Local Nitro 7.071s (+19.5% 🔺) 7.767s (+21.1% 🔺) 0.696s 4 1.74x
💻 Local Next.js (Turbopack) 7.627s 8.016s 0.390s 4 1.88x
💻 Local Express 7.708s (+5.2% 🔺) 8.273s (+3.2%) 0.565s 4 1.90x
🌐 MongoDB Next.js (Turbopack) 9.877s 10.352s 0.475s 3 2.43x
🐘 Postgres Nitro 50.050s (-4.4%) 50.119s (-5.7% 🟢) 0.069s 1 12.31x
🐘 Postgres Express 51.197s (+8.8% 🔺) 52.128s (+8.4% 🔺) 0.931s 1 12.59x
🐘 Postgres Next.js (Turbopack) 55.317s 56.128s 0.811s 1 13.60x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 3.682s (-1.3%) 5.309s (+5.1% 🔺) 1.626s 6 1.00x
▲ Vercel Express 4.333s (+22.4% 🔺) 5.927s (+17.6% 🔺) 1.594s 7 1.18x
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack) | Express

Promise.race with 10 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🌐 Redis 🥇 Next.js (Turbopack) 1.259s 2.006s 0.747s 15 1.00x
💻 Local Nitro 1.425s (+5.2% 🔺) 2.004s (~) 0.580s 15 1.13x
💻 Local Express 1.431s (~) 2.006s (~) 0.574s 15 1.14x
💻 Local Next.js (Turbopack) 1.464s 2.006s 0.542s 15 1.16x
🐘 Postgres Nitro 2.141s (+27.4% 🔺) 2.831s (+36.3% 🔺) 0.691s 11 1.70x
🐘 Postgres Express 2.142s (+2.0%) 2.741s (+5.6% 🔺) 0.599s 11 1.70x
🌐 MongoDB Next.js (Turbopack) 2.169s 3.008s 0.839s 10 1.72x
🐘 Postgres Next.js (Turbopack) 2.308s 2.741s 0.432s 11 1.83x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Express 2.631s (+1.9%) 3.817s (-0.8%) 1.186s 8 1.00x
▲ Vercel Next.js (Turbopack) 2.777s (+2.8%) 4.032s (+4.4%) 1.254s 8 1.06x
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Express | Next.js (Turbopack)

Promise.race with 25 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🌐 Redis 🥇 Next.js (Turbopack) 2.493s 3.008s 0.515s 10 1.00x
💻 Local Nitro 2.635s (+9.4% 🔺) 3.007s (~) 0.372s 10 1.06x
💻 Local Next.js (Turbopack) 2.722s 3.008s 0.285s 10 1.09x
💻 Local Express 2.837s (+3.1%) 3.008s (~) 0.170s 10 1.14x
🌐 MongoDB Next.js (Turbopack) 4.725s 5.176s 0.451s 6 1.89x
🐘 Postgres Express 11.149s (+2.8%) 11.698s (+2.9%) 0.548s 3 4.47x
🐘 Postgres Nitro 12.296s (+14.2% 🔺) 13.039s (+14.7% 🔺) 0.743s 3 4.93x
🐘 Postgres Next.js (Turbopack) 12.730s 13.368s 0.638s 3 5.11x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Express 2.981s (-6.6% 🟢) 4.089s (-0.8%) 1.108s 8 1.00x
▲ Vercel Next.js (Turbopack) 3.066s (-10.0% 🟢) 3.923s (-15.5% 🟢) 0.857s 8 1.03x
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Express | Next.js (Turbopack)

Promise.race with 50 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🌐 Redis 🥇 Next.js (Turbopack) 4.079s 4.725s 0.646s 7 1.00x
💻 Local Nitro 7.739s (+16.5% 🔺) 8.016s (+14.3% 🔺) 0.277s 4 1.90x
💻 Local Next.js (Turbopack) 8.017s 8.518s 0.501s 4 1.97x
💻 Local Express 8.163s (+4.8%) 9.023s (+12.5% 🔺) 0.860s 4 2.00x
🌐 MongoDB Next.js (Turbopack) 9.734s 10.348s 0.614s 3 2.39x
🐘 Postgres Nitro 52.679s (+7.4% 🔺) 53.124s (+8.1% 🔺) 0.445s 1 12.91x
🐘 Postgres Next.js (Turbopack) 54.508s 55.127s 0.619s 1 13.36x
🐘 Postgres Express 54.742s (+4.8%) 55.132s (+3.8%) 0.390s 1 13.42x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Express 3.188s (-6.3% 🟢) 4.374s (-13.7% 🟢) 1.186s 7 1.00x
▲ Vercel Next.js (Turbopack) 3.623s (-14.2% 🟢) 5.046s (-9.5% 🟢) 1.422s 6 1.14x
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Express | Next.js (Turbopack)

Stream Benchmarks (includes TTFB metrics)
workflow with stream

💻 Local Development

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Next.js (Turbopack) 0.147s 1.001s 0.011s 1.016s 0.869s 10 1.00x
🌐 Redis Next.js (Turbopack) 0.148s 1.000s 0.001s 1.007s 0.859s 10 1.00x
💻 Local Nitro 0.173s (+55.5% 🔺) 1.002s (~) 0.010s (+14.6% 🔺) 1.015s (~) 0.842s 10 1.17x
💻 Local Express 0.175s (~) 1.002s (~) 0.011s (+9.1% 🔺) 1.016s (~) 0.841s 10 1.19x
🌐 MongoDB Next.js (Turbopack) 0.496s 0.951s 0.001s 1.009s 0.513s 10 3.37x
🐘 Postgres Next.js (Turbopack) 1.459s 1.650s 0.001s 2.012s 0.553s 10 9.90x
🐘 Postgres Nitro 2.144s (+68.7% 🔺) 2.900s (+63.8% 🔺) 0.001s (~) 3.015s (+49.7% 🔺) 0.871s 10 14.54x
🐘 Postgres Express 2.271s (+4.8%) 2.771s (-3.6%) 0.001s (-16.7% 🟢) 3.015s (~) 0.744s 10 15.41x

▲ Production (Vercel)

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Express 2.396s (-6.6% 🟢) 3.058s (-8.5% 🟢) 0.167s (+8.2% 🔺) 3.971s (-6.6% 🟢) 1.575s 10 1.00x
▲ Vercel Next.js (Turbopack) 2.609s (+8.1% 🔺) 2.867s (-6.0% 🟢) 0.198s (+87.7% 🔺) 3.836s (-0.9%) 1.227s 10 1.09x
▲ Vercel Nitro ⚠️ missing - - - - -

🔍 Observability: Express | Next.js (Turbopack)

Summary

Fastest Framework by World

Winner determined by most benchmark wins

World 🥇 Fastest Framework Wins
💻 Local Nitro 7/12
🐘 Postgres Nitro 5/12
▲ Vercel Express 8/12
Fastest World by Framework

Winner determined by most benchmark wins

Framework 🥇 Fastest World Wins
Express 💻 Local 10/12
Next.js (Turbopack) 🌐 Redis 7/12
Nitro 💻 Local 12/12
Column Definitions
  • Workflow Time: Runtime reported by workflow (completedAt - createdAt) - primary metric
  • TTFB: Time to First Byte - time from workflow start until first stream byte received (stream benchmarks only)
  • Slurp: Time from first byte to complete stream consumption (stream benchmarks only)
  • Wall Time: Total testbench time (trigger workflow + poll for result)
  • Overhead: Testbench overhead (Wall Time - Workflow Time)
  • Samples: Number of benchmark iterations run
  • vs Fastest: How much slower compared to the fastest configuration for this benchmark

Worlds:

  • 💻 Local: In-memory filesystem world (local development)
  • 🐘 Postgres: PostgreSQL database world (local development)
  • ▲ Vercel: Vercel production/preview deployment
  • 🌐 Starter: Community world (local development)
  • 🌐 Turso: Community world (local development)
  • 🌐 MongoDB: Community world (local development)
  • 🌐 Redis: Community world (local development)
  • 🌐 Jazz: Community world (local development)

📋 View full workflow run

- Propagate Retry-After header as WorkflowAPIError.retryAfter on 429 responses
- Add withThrottleRetry wrapper for both workflow and step handlers
- Re-enqueue workflows on 5xx errors with exponential backoff (5s, 30s, 120min)
- Track serverErrorRetryCount in queue payload for retry budgeting
- Expose delaySeconds on QueueOptions interface

Co-Authored-By: Claude Opus 4.6 <[email protected]>
);
// Short wait: sleep in-process, then retry once
await new Promise((resolve) =>
  setTimeout(resolve, retryAfterSeconds * 1000)
Collaborator

Should we account for function execution time limits, specifically in the case of the Vercel world? If the serverless function is already close to the end of its limit and the workflow server throws a 429, adding a 10 sec sleep could potentially exceed the function execution limit, and the function could get SIGKILLed midway.

Member Author

The workflow layer should never take more than a few seconds, so I think it's highly unlikely that we'd run into timeouts. I'm not too worried about this, but it's technically a concern.

@karthikscale3 (Collaborator)

The rest of the code looks good, except for the one concern I have around function execution limits in the case of the Vercel world.

@pranaygp (Collaborator) left a comment

Overall this looks solid — the 429 throttle retry and 5xx backoff are clean additions. The withThrottleRetry wrapper with the "safe to retry because 429 is pre-processing" invariant is a nice design. A few things to consider:

world-local doesn't support delaySeconds — The QueueOptions interface was updated and world-vercel passes it through, but world-local/src/queue.ts silently ignores opts?.delaySeconds. If a 5xx retry fires in local dev, the message will be re-enqueued but delivered immediately instead of after the intended backoff. Not a blocker since 5xx from world-local is uncommon, but could cause confusing behavior during local fault testing (which the PR description mentions wanting to add).

Step handler diff is mostly re-indentation — confirmed the actual logic changes are just wrapping in withThrottleRetry. The step handler does NOT get 5xx retry (only 429), which makes sense since the queue's built-in redelivery + step_started idempotency handles that.
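
For readers without the diff open, here is a sketch of the wrapper's shape as described in this thread (reconstructed from the review comments, the 10-second cutoff at helpers.ts:425, and the 1-second default noted later in this review; not the PR's actual implementation):

  import { WorkflowAPIError } from '@workflow/errors';

  // Two-tier 429 strategy: short waits sleep in-process and retry once;
  // long waits propagate so the caller can defer to the queue.
  const SHORT_WAIT_THRESHOLD_SECONDS = 10;

  async function withThrottleRetry<T>(fn: () => Promise<T>): Promise<T> {
    try {
      return await fn();
    } catch (err) {
      if (!WorkflowAPIError.is(err) || err.status !== 429) throw err;
      const retryAfterSeconds = err.retryAfter ?? 1;
      if (retryAfterSeconds >= SHORT_WAIT_THRESHOLD_SECONDS) throw err; // defer to queue
      // Short wait: sleep in-process, then retry once
      await new Promise((resolve) =>
        setTimeout(resolve, retryAfterSeconds * 1000)
      );
      return await fn(); // a second 429 propagates and defers to the queue
    }
  }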

@pranaygp (Collaborator) commented Feb 9, 2026

Inline review notes

Since GitHub's review API is having issues, posting inline comments here:


packages/core/src/runtime.ts:291 - delaySecondSteps = [5, 30, 7200]

The third backoff step is 7200s (2 hours). Is the expectation that persistent 5xx would trigger an alert/incident on the server side well before this fires? A brief comment explaining the rationale for these specific values would help future readers.


packages/core/src/runtime.ts:290-307 - serverErrorRetryCount lifecycle

The serverErrorRetryCount resets on each new workflow invocation (since the next queueMessage after a successful step won't include it). This means each invocation gets a fresh retry budget. That seems right for transient 5xx, but if the server is persistently down, each invocation will independently burn through all 3 retries before failing. Just want to confirm this is the intended behavior vs. tracking the count on the run itself.


packages/core/src/runtime/helpers.ts:425 - withThrottleRetry short-wait path

If the first in-process retry succeeds, great. But if it gets a different error (not 429), that error is thrown — which means the handler fails entirely rather than deferring to queue. This seems correct since a non-429 error should propagate, just want to confirm the intent.


packages/world-vercel/src/utils.ts:298 - Retry-After parsing

nit: Retry-After can also be an HTTP-date (RFC 9110). This only handles the delay-seconds format. Fine if you control the server, but worth a comment noting the assumption.
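
For illustration, a parser that accepts both formats might look like this (a sketch, not the PR's utils.ts code):

  // Retry-After is either delay-seconds ("120") or an HTTP-date (RFC 9110).
  // Returns seconds to wait, or undefined if the header is absent or unparseable.
  function parseRetryAfter(header: string | null): number | undefined {
    if (!header) return undefined;
    const seconds = Number(header);
    if (Number.isFinite(seconds) && seconds >= 0) return seconds;
    const dateMs = Date.parse(header); // NaN for anything that is not a date
    if (Number.isNaN(dateMs)) return undefined;
    return Math.max(0, (dateMs - Date.now()) / 1000);
  }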

@pranaygp (Collaborator) left a comment

Overall this is clean and well-structured. The two-tier 429 strategy (in-process for short waits, defer to queue for long waits) is nice. A few questions and concerns below.

*
* Safe to retry the entire handler because 429 is sent from server middleware
* before the request is processed — no server state has changed.
*/
Collaborator

The comment says "Safe to retry the entire handler because 429 is sent from server middleware before the request is processed — no server state has changed."

This holds if the 429 always hits on the first API call in the handler. But the handler makes multiple world API calls sequentially (e.g. runs.get → events.create(run_started) → replay → events.create(run_completed)). If a later call gets 429'd, the retry re-executes everything from the top.

For the workflow handler this is probably fine since replay is deterministic and events are idempotent.

For the step handler this is more concerning — the retry would re-execute user step code, which may not be idempotent. Is the assumption that the workflow-server's rate limiting middleware rejects at the connection level (so all calls in one handler invocation either succeed or all fail)? If so, worth documenting that assumption.

Member Author

Good catch, fixed

err.status >= 500
) {
  const retryCount = serverErrorRetryCount ?? 0;
  const delaySecondSteps = [5, 30, 7200]; // 5s, 30s, 120min
Collaborator

The third step is 7200 seconds = 2 hours. Is that intentional? That's a very long time for a workflow to be stalled on a server error. If the 500 was truly transient, 2 hours feels excessive. If it's a sustained outage, you'd probably want to fail rather than retry after 2h when the server might have moved on.

Maybe something like [5, 30, 120] (5s, 30s, 2min) would be more practical?

Member Author

Changed to [5, 30, 120]. Realistically, if we keep getting 500s it'd be nice to retry the run much later so we have a chance to incident-mitigate, but I think it's fine to fail runs instead too.

Collaborator

cool. this was a fully AI-generated review btw. idk where it came up with 5s, 30s, 120s, but I do agree we should have some shorter ones before getting to 7200 anyway

}

const maxRetries = stepFn.maxRetries ?? DEFAULT_STEP_MAX_RETRIES;
return await withThrottleRetry(async () => {
Collaborator

The step handler gets withThrottleRetry for 429s but no equivalent 5xx retry logic like the workflow handler has. If a world API call (e.g. events.create for step_started or step_completed) returns a 5xx, it'll be caught by the step's error handling and burn a step retry attempt — re-executing potentially expensive user code for what was a transient infrastructure error.

Is this intentional for this PR's scope, or should step handlers also get 5xx backoff retry?

Member Author

500s are handled via the regular step retry. 429s are only special-cased because they allow the step-retry mechanism to use the Retry-After header instead of guessing at the retry delay.

Collaborator

Right, but the comment is about not burning a full step attempt on a transient 5xx. It doesn't necessarily need to be addressed in this PR, but it's good to think about. Happy to move in any direction for now, since they're all better than nothing, and come back to this in the future with real experience.

Collaborator

Makes sense that steps rely on their existing retry mechanism. One concern though: with withThrottleRetry removed, a 429 from the world API during step execution (e.g. on step_started or step_completed) will now burn a step retry attempt — potentially exhausting maxRetries without the user's code ever being at fault.

Ideally 429s should be retried transparently without consuming an attempt. A targeted retry around just the world API calls (not wrapping the user code execution) could thread that needle. Worth tracking as a follow-up.

idempotencyKey?: string;
headers?: Record<string, string>;
/** Delay message delivery by this many seconds */
delaySeconds?: number;
Collaborator

Good that this was promoted from the vercel-specific type to the shared interface. Note that world-local currently ignores delaySeconds entirely — so 5xx retries in local dev will fire immediately instead of with backoff. Not a blocker but worth a follow-up or at minimum a comment.

Member Author

Seems fine to ignore for local world, since there are no 429s

Collaborator

Although we probably do want to support delaySeconds in the local world anyway, as we start using this queue option more often for more use cases in the future. Even if it's not implemented in this PR, I think we should leave an explicit comment that the local world ignores this, since it's a nuance. We should later have an e2e test that checks for this behaviour and would fail on the local world without a proper queue delaySeconds implementation, as sketched below (cc @TooTallNate)
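
A minimal sketch of what delaySeconds support in the local world's queue could look like (hypothetical; world-local's actual queue internals are not shown in this thread, and a setTimeout-based delay would not survive a process restart):

  interface QueueOptions {
    idempotencyKey?: string;
    headers?: Record<string, string>;
    /** Delay message delivery by this many seconds */
    delaySeconds?: number;
  }

  // `deliver` stands in for whatever currently dispatches a message to its handler.
  function enqueue(
    message: unknown,
    deliver: (msg: unknown) => void,
    opts?: QueueOptions
  ): void {
    const delayMs = (opts?.delaySeconds ?? 0) * 1000;
    setTimeout(() => deliver(message), delayMs);
  }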

Signed-off-by: Peter Wielander <[email protected]>
Signed-off-by: Peter Wielander <[email protected]>
Signed-off-by: Peter Wielander <[email protected]>
Signed-off-by: Peter Wielander <[email protected]>
@pranaygp (Collaborator) left a comment

Looks good! The removal of withThrottleRetry from the step handler was the right fix for the re-execution safety concern.

One thing worth tracking as a follow-up: with withThrottleRetry removed from steps, a 429 from the world API during step execution (e.g. on step_started or step_completed) will now burn a step retry attempt. This means a transient throttle from the server could exhaust a step's maxRetries without the user's code ever being at fault. Ideally 429s should be retried transparently without consuming an attempt — maybe a targeted retry around just the world API calls (not the user code execution) would thread that needle.
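
One possible shape for that follow-up (illustrative only; the wrapper name and the idea of passing each world call as a closure are assumptions):

  import { WorkflowAPIError } from '@workflow/errors';

  // Hypothetical: retry only the world API call on 429, so user step code
  // is never re-executed and no step retry attempt is consumed.
  async function worldCallWithThrottleRetry<T>(
    call: () => Promise<T>,
    attempts = 3
  ): Promise<T> {
    for (let i = 0; ; i++) {
      try {
        return await call();
      } catch (err) {
        if (!WorkflowAPIError.is(err) || err.status !== 429 || i >= attempts - 1) {
          throw err;
        }
        await new Promise((r) => setTimeout(r, (err.retryAfter ?? 1) * 1000));
      }
    }
  }

  // Usage: wrap just the event write, not the step execution itself, e.g.
  //   await worldCallWithThrottleRetry(() => world.events.create(stepCompletedEvent));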

@pranaygp (Collaborator) left a comment

Review: [core] Handle 429 and 500 errors from worlds in runtime

Overall this is a solid approach to handling transient server errors. The 429 throttle retry with short/long path split is well-designed, and the 500 retry via re-enqueue with exponential backoff is a safe pattern. A few items to address below.

);
runtimeLogger.warn(
  'Throttled again on retry, deferring to queue',
  {
Collaborator

Redundant check: this is already inside if (WorkflowAPIError.is(err)), so the second WorkflowAPIError.is(err) on this line is always true. Should just be if (err.status === 429).

  step = startResult.step;
} catch (err) {
  if (WorkflowAPIError.is(err)) {
    if (WorkflowAPIError.is(err) && err.status === 429) {
Collaborator

The step handler has 429 handling here inside the step_started catch block, but unlike the workflow handler, it does not use withThrottleRetry and does not have the short-wait in-process retry path. This means a brief 429 (e.g., retryAfter=2s) will always defer to the queue rather than sleeping in-process. Is that intentional? If steps should also get the in-process retry for short waits, consider wrapping with withThrottleRetry similarly to the workflow handler.

if (retryAfterSeconds < 10) {
  runtimeLogger.warn(
    'Throttled by workflow-server (429), retrying in-process',
    {
Collaborator

Nit: if retryAfter is undefined (header missing), this defaults to 1 second. For 429 responses without a Retry-After header, 1 second might be too aggressive. A slightly higher default (e.g., 3-5s) would be more conservative and reduce the chance of hitting the server again immediately while it's under load.

workflowStartedAt = +workflowRun.startedAt;

// At this point, the workflow is "running" and `startedAt` should
// definitely be set.
Collaborator

The entire workflow handler body is now wrapped in withThrottleRetry. This means if a 429 is thrown by any world call during workflow execution (e.g., world.runs.get, world.events.create, runWorkflow), the entire handler will be retried from scratch.

  1. If run_started event was already created successfully and then a later call throws 429, the retry will call world.runs.get again and find the run already running, which is handled correctly.
  2. However, if run_completed event creation throws 429, the retry would re-run the entire workflow (replay + execution). The workflow replay should be deterministic, but this could be expensive. Worth documenting this tradeoff in a comment.

});
// Use the run entity from the event response (no extra get call needed)
if (!result.run) {
  if (workflowRun.status === 'pending') {
Collaborator

The 500 retry re-enqueues with a new requestedAt and incremented serverErrorRetryCount. The traceCarrier is serialized fresh via serializeTraceCarrier(). If the trace context matters for correlating retries, this creates a new trace for each retry attempt. Consider whether preserving the original traceContext from the message would be preferable for observability (linking all retry attempts to the same parent trace).
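
Putting the pieces of this review together, the 5xx re-enqueue path plausibly looks something like the sketch below (a reconstruction from the comments above, using the post-review [5, 30, 120] backoff; the queue callback signature here is an assumption):

  const delaySecondSteps = [5, 30, 120]; // 5s, 30s, 2min, per the review thread

  async function reEnqueueOnServerError(
    err: unknown,
    message: { serverErrorRetryCount?: number } & Record<string, unknown>,
    enqueue: (msg: unknown, opts: { delaySeconds: number }) => Promise<void>
  ): Promise<void> {
    const retryCount = message.serverErrorRetryCount ?? 0;
    if (retryCount >= delaySecondSteps.length) throw err; // budget exhausted: fail the run
    await enqueue(
      { ...message, serverErrorRetryCount: retryCount + 1 },
      { delaySeconds: delaySecondSteps[retryCount] }
    );
  }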

WorldParseFormat,
PeerService,
RpcSystem,
RpcService,
Collaborator

Note: the HTTP Retry-After header can also be an HTTP-date (e.g., Wed, 21 Oct 2015 07:28:00 GMT). The parseInt approach correctly returns NaN for date values, falling back to undefined. This is fine since the workflow-server likely only sends numeric values, but a comment noting this would be helpful.

Signed-off-by: Peter Wielander <[email protected]>