Add stress benchmarks: 1000-step, data payload, and stream tests#1214
Add stress benchmarks: 1000-step, data payload, and stream tests#1214
Conversation
- Enable full benchmark suite via `stress-test` PR label (in addition to workflow_dispatch) - Add 1000-step sequential (100ms sleep) and concurrent benchmarks - Add 1MB data payload benchmarks (sequential + concurrent) with correctness assertions - Add stream stress benchmarks: pipeline transforms, parallel streams, and fan-out fan-in patterns with TTFB/slurp reporting - Extract consumeStreamWithMetrics helper for unified stream measurement - All stream benchmarks auto-integrate with existing PR comment reporting Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
🦋 Changeset detectedLatest commit: 9b1bd90 The changes in this PR will be included in the next version bump. This PR includes changesets to release 14 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro workflow with 10 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro workflow with 25 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro workflow with 50 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro workflow with 10 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro workflow with 25 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro workflow with 50 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro stream pipeline with 5 transform steps (1MB)💻 Local Development
▲ Production (Vercel)No data available 10 parallel streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro fan-out fan-in 10 streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
❌ Some benchmark jobs failed:
Check the workflow run for details. |
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests🌍 Community Worlds (45 failed)turso (45 failed):
Details by Category✅ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
❌ 🌍 Community Worlds
✅ 📋 Other
|
There was a problem hiding this comment.
Pull request overview
This PR adds comprehensive stress benchmarks for the workflow system, focusing on high-volume step execution, large data payloads, and stream processing capabilities. The PR enables triggering the full benchmark suite via a stress-test label in addition to manual workflow dispatch.
Changes:
- Add 1000-step sequential benchmark with 100ms sleep (vs 1000ms for smaller counts) to the full suite
- Add 1MB data payload benchmarks (sequential and concurrent) with correctness assertions
- Add stream stress benchmarks: pipeline transforms, parallel streams, and fan-out/fan-in patterns with TTFB/slurp metrics
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| workbench/example/workflows/97_bench.ts | Add doWorkWithDelay helper, update sequentialStepsWorkflow to accept sleep parameter, add data payload workflows and stream stress test workflows |
| packages/core/e2e/bench.bench.ts | Add consumeStreamWithMetrics helper, update sequential step configs with sleep parameter, add data payload benchmarks and stream stress benchmarks |
| .github/workflows/benchmarks.yml | Add labeled trigger type and enable full suite when stress-test label is present |
| .changeset/stress-benchmarks.md | Add changeset describing the new benchmarks |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| args: [50, 1024 * 1024], | ||
| skip: !fullSuite, | ||
| time: 180000, | ||
| expectedTotalBytes: 50 * 1024 * 1024, |
There was a problem hiding this comment.
Incorrect expected byte count: Same issue as line 510 - the workflow returns a JSON summary stream (approximately 40-50 bytes), not 50MB of actual stream data. The correctness check will fail.
There was a problem hiding this comment.
Fixed in a3e2e05 — same fix as above, uses summaryStream: true to parse the JSON summary.
| args: [10, 1024 * 1024], | ||
| skip: false, | ||
| time: 60000, | ||
| expectedTotalBytes: 10 * 1024 * 1024, |
There was a problem hiding this comment.
Incorrect expected byte count: Same issue as lines 510 and 518 - the workflow returns a JSON summary stream (approximately 40-50 bytes), not 10MB of actual stream data. The correctness check will fail.
| args: [50, 1024 * 1024], | ||
| skip: !fullSuite, | ||
| time: 180000, | ||
| expectedTotalBytes: 50 * 1024 * 1024, |
There was a problem hiding this comment.
Incorrect expected byte count: Same issue as other parallel/fan-out tests - the workflow returns a JSON summary stream (approximately 40-50 bytes), not 50MB of actual stream data. The correctness check will fail.
| while (remaining > 0) { | ||
| const size = Math.min(chunkSize, remaining); | ||
| const chunk = new Uint8Array(size); | ||
| for (let i = 0; i < size; i++) { | ||
| chunk[i] = (chunkIndex + i) % 256; | ||
| } | ||
| controller.enqueue(chunk); | ||
| remaining -= size; |
There was a problem hiding this comment.
The byte pattern logic is incorrect. The comment on line 195 states "Chunk i contains bytes where each byte = (chunkIndex + byteOffset) % 256", but the implementation uses (chunkIndex + i) % 256 where chunkIndex is incremented once per chunk (not per byte). This means all chunks will have similar patterns offset by 1. To match the documented pattern, change to use a running byte counter: track totalBytesProcessed and use (totalBytesProcessed + i) % 256, or update the comment to accurately describe the current implementation.
| while (remaining > 0) { | |
| const size = Math.min(chunkSize, remaining); | |
| const chunk = new Uint8Array(size); | |
| for (let i = 0; i < size; i++) { | |
| chunk[i] = (chunkIndex + i) % 256; | |
| } | |
| controller.enqueue(chunk); | |
| remaining -= size; | |
| let totalBytesProcessed = 0; | |
| while (remaining > 0) { | |
| const size = Math.min(chunkSize, remaining); | |
| const chunk = new Uint8Array(size); | |
| for (let i = 0; i < size; i++) { | |
| chunk[i] = (totalBytesProcessed + i) % 256; | |
| } | |
| controller.enqueue(chunk); | |
| remaining -= size; | |
| totalBytesProcessed += size; |
There was a problem hiding this comment.
Fixed in a3e2e05 — now uses a running totalBytesProcessed counter and updated the comment to match.
| const { done, value } = await reader.read(); | ||
| if (done) break; | ||
| streamBytes += value.length; | ||
| totalBytes += value.length; |
There was a problem hiding this comment.
Race condition: totalBytes is being incremented from multiple parallel async operations without synchronization. While JavaScript is single-threaded and this may work in practice due to the event loop, it's safer to calculate totalBytes after the Promise.all completes by summing streamByteCounts, or use proper synchronization. Recommend: const totalBytes = streamByteCounts.reduce((sum, count) => sum + count, 0); after line 260.
There was a problem hiding this comment.
Fixed in a3e2e05 — removed the shared accumulator and now sums streamByteCounts after Promise.all completes.
| args: [10, 1024 * 1024], | ||
| skip: false, | ||
| time: 60000, | ||
| expectedTotalBytes: 10 * 1024 * 1024, |
There was a problem hiding this comment.
Incorrect expected byte count: The parallelStreamsWorkflow returns the result of consumeAndVerifyStreams, which returns a JSON summary stream (approximately 40-50 bytes), not the actual stream data. The correctness check on line 557 will fail because it expects 10 * 1024 * 1024 bytes, but will only receive the size of the JSON string like {"totalBytes":10485760,"streamCount":10}. Either update expectedTotalBytes to match the JSON summary size, or modify the workflow to return the consumed stream data instead of a summary.
There was a problem hiding this comment.
Fixed in a3e2e05 — the harness now parses the JSON summary stream and verifies the reported totalBytes field, with a summaryStream flag distinguishing pipeline (raw data) from parallel/fan-out (summary) benchmarks.
- Fix genLargeStream byte pattern: use running byte counter instead of chunk index so each byte has a unique pattern across the stream - Fix consumeAndVerifyStreams race condition: sum streamByteCounts after Promise.all instead of incrementing totalBytes in parallel - Fix parallel/fan-out stream benchmarks: parse the JSON summary stream to verify reported totalBytes rather than comparing raw stream size (these workflows return a small summary, not the full data) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…m check - Move parallel streams and fan-out fan-in benchmarks to full-suite-only (they exercise patterns that aren't stable on all backends/deployments) - Change stream correctness check from fatal throw to console.warn (some Vercel deployments don't return stream data) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
🐛 Bugs Uncovered by Stress BenchmarksThe new stress benchmarks have uncovered real bugs in the workflow runtime. Here are the detailed findings: Bug 1:
|
These are real bugs, not false positives. Revert the console.warn back to throw, and restore parallel/fan-out stream benchmarks as always-on. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
VaguelySerious
left a comment
There was a problem hiding this comment.
LGTM so far, I think you can merge already if you want given it's tag controlled
Summary
stress-testPR label (in addition to manualworkflow_dispatch)Test plan
pnpm typecheckpasses (verified locally)stress-testlabel triggers full suite in CIfirstByteTimeMs)🤖 Generated with Claude Code