feat(failure-handler): add cascade detection when ≥10 [aw] failures fire within 60 min #34060
Merged
pelikhan merged 4 commits on May 22, 2026
Co-authored-by: pelikhan <[email protected]>
When ≥10 [aw] * failed issues are filed within 60 minutes:
- Detect the cascade via GitHub search API
- Create (or update) a single [aw] Failure cascade detected rollup issue
- Label every individual issue in the window with cascade-suspected
- Ensure both cascade-suspected and cascade-rollup labels exist
- Integrate cascade check into both the new-issue and existing-issue paths
- Export helpers and constants for testability
- Add 9 unit tests covering all cascade paths

Co-authored-by: pelikhan <[email protected]>
…TLE_PATTERN, makeFailureItems helper Co-authored-by: pelikhan <[email protected]>
Copilot AI changed the title from "[WIP] Add failure-cascade detection to autotriage for high-volume issues" to "feat(failure-handler): add cascade detection when ≥10 [aw] failures fire within 60 min" on May 22, 2026
Pull request overview
Adds automated detection/rollup for bursts of [aw] … failed issues so that high-volume, correlated failures are summarized and triaged as a single “cascade” event, reducing issue noise and improving operational signal.
Changes:
- Introduces failure-cascade detection in `handle_agent_failure.cjs` (search recent failures, create/update rollup issue, label impacted issues).
- Adds unit tests covering cascade detection behavior and API-error resilience.
- Updates generated workflow/action lock artifacts (including a schedule change in one workflow and a pinned action SHA update).
Show a summary per file
| File | Description |
|---|---|
| actions/setup/js/handle_agent_failure.cjs | Implements cascade detection, rollup issue management, and labeling logic; wires it into the existing failure-issue flow. |
| actions/setup/js/handle_agent_failure.test.cjs | Adds unit tests for cascade detection scenarios (thresholds, indexing lag, rollup update, label creation, non-fatal errors). |
| .github/workflows/release.lock.yml | Updates pinned docker/setup-buildx-action reference in the release workflow lockfile. |
| .github/workflows/developer-docs-consolidator.lock.yml | Changes the scheduled trigger cadence (daily → weekly) in the generated lock workflow. |
| .github/aw/actions-lock.json | Adds a new pinned entry for docker/setup-buildx-action@v4. |
Copilot's findings
- Files reviewed: 5/5 changed files
- Comments generated: 6
Comment on lines +1727 to +1736

    const recentIssues = await findRecentFailureIssues(owner, repo);

    // Ensure the triggering issue is included even if GitHub search indexing lags
    const issueNumbers = new Set(recentIssues.map(i => i.number));
    issueNumbers.add(triggeringIssueNumber);

    if (issueNumbers.size < CASCADE_THRESHOLD) {
      core.info(
        `Cascade check: ${issueNumbers.size} failure issue(s) in the last ${CASCADE_WINDOW_MINUTES} min (threshold: ${CASCADE_THRESHOLD}) — no cascade`
      );
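The threshold check in this excerpt can be sketched in isolation. This is a minimal illustration, not the PR's code: `countWindowIssues` and `isCascade` are hypothetical helper names, and the threshold value of 10 is taken from the PR title. It shows why the triggering issue is added to the set before comparing against the threshold.

```javascript
// Sketch only: helper names are hypothetical; the threshold comes from the PR title.
const CASCADE_THRESHOLD = 10;

// Count distinct failure issues in the window, always including the
// triggering issue in case the search index has not caught up yet.
function countWindowIssues(recentIssues, triggeringIssueNumber) {
  const issueNumbers = new Set(recentIssues.map(i => i.number));
  issueNumbers.add(triggeringIssueNumber);
  return issueNumbers.size;
}

// A cascade is declared once the distinct count reaches the threshold.
function isCascade(recentIssues, triggeringIssueNumber, threshold = CASCADE_THRESHOLD) {
  return countWindowIssues(recentIssues, triggeringIssueNumber) >= threshold;
}

// Nine indexed issues (#100 to #108) plus a not-yet-indexed triggering issue
// reach the threshold; the same nine with an already-indexed trigger do not.
const indexed = Array.from({ length: 9 }, (_, k) => ({ number: 100 + k }));
console.log(isCascade(indexed, 109)); // true
console.log(isCascade(indexed, 100)); // false
```

Using a `Set` keeps the count correct whether or not the search results already contain the triggering issue.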
Comment on lines +1748 to +1763

    // Build rollup body
    const affectedList = recentIssues
      .map(i => `- [#${i.number}](${i.html_url}) — ${i.title}`)
      .join("\n");
    const windowStart = new Date(Date.now() - CASCADE_WINDOW_MS);
    const rollupBody = [
      `## ⚠️ Failure Cascade Detected`,
      ``,
      `**${issueNumbers.size} \`[aw] * failed\` issues** were filed within the last **${CASCADE_WINDOW_MINUTES} minutes** (since ${windowStart.toUTCString()}).`,
      ``,
      `This volume suggests a common root cause (e.g., lockfile drift, provider outage, infrastructure change) rather than isolated workflow failures.`,
      ``,
      `### Affected Workflows`,
      ``,
      affectedList || `_(none indexed yet — search indexing may lag)_`,
      ``,

      const color = ((r << 16) | (g << 8) | b).toString(16).padStart(6, "0");
      await github.rest.issues.createLabel({ owner, repo, name: labelName, color });
      core.info(`✓ Created label "${labelName}" (#${color})`);
    } catch (createErr) {
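The "create the label only if it is missing" pattern around this excerpt can be sketched as follows. This is a hedged illustration rather than the PR's implementation: `rgbToHex` and `ensureLabelExists` are hypothetical names, and `github` is assumed to be an Octokit-style client exposing `rest.issues.getLabel` and `rest.issues.createLabel`, as in the excerpt.

```javascript
// Pack 0-255 RGB components into the 6-digit hex string used for label
// colors (no leading "#"), mirroring the expression in the excerpt.
function rgbToHex(r, g, b) {
  return ((r << 16) | (g << 8) | b).toString(16).padStart(6, "0");
}

// Hypothetical helper: `github` is assumed to be an Octokit-style client.
// Returns "exists" or "created"; any error other than a 404 is rethrown.
async function ensureLabelExists(github, owner, repo, name, color) {
  try {
    await github.rest.issues.getLabel({ owner, repo, name });
    return "exists";
  } catch (err) {
    // Only a missing label (HTTP 404) is recoverable here.
    if (!err || err.status !== 404) throw err;
  }
  await github.rest.issues.createLabel({ owner, repo, name, color });
  return "created";
}

console.log(rgbToHex(0, 128, 255)); // "0080ff"
```

The `padStart(6, "0")` call matters for colors with a small red component, which would otherwise produce fewer than six hex digits.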
Comment on lines +1669 to +1675

    const result = await github.rest.search.issuesAndPullRequests({
      q: searchQuery,
      per_page: 100,
      sort: "created",
      order: "asc",
    });
    return result.data.items
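The excerpt passes a `searchQuery` whose construction is not shown. The following is a rough sketch of how such a query might be built and how results might be narrowed by title. The query string, the helper names, and the regular expression are assumptions for illustration only; the PR's actual `FAILURE_TITLE_PATTERN` is not visible here. It relies only on the standard GitHub search qualifiers `repo:`, `is:issue`, `is:open`, `in:title`, and `created:>=`.

```javascript
const CASCADE_WINDOW_MS = 60 * 60 * 1000; // 60 min, per the PR description

// Assumed pattern: the real FAILURE_TITLE_PATTERN is not shown in the excerpt.
const ASSUMED_FAILURE_TITLE_PATTERN = /^\[aw\] .+ failed$/;

// Hypothetical helper: build a search query limited to open issues created
// inside the cascade window. Milliseconds are dropped from the timestamp.
function buildFailureSearchQuery(owner, repo, now = new Date(), windowMs = CASCADE_WINDOW_MS) {
  const since = new Date(now.getTime() - windowMs)
    .toISOString()
    .replace(/\.\d{3}Z$/, "Z");
  return `repo:${owner}/${repo} is:issue is:open in:title "[aw]" created:>=${since}`;
}

// Hypothetical helper: keep only items whose title matches the failure
// pattern, and skip pull requests defensively.
function filterFailureItems(items, pattern = ASSUMED_FAILURE_TITLE_PATTERN) {
  return items.filter(item => !item.pull_request && pattern.test(item.title));
}

const q = buildFailureSearchQuery("octo", "demo", new Date("2026-05-22T12:00:00Z"));
console.log(q);
// repo:octo/demo is:issue is:open in:title "[aw]" created:>=2026-05-22T11:00:00Z
```

Filtering by title after the search keeps the rollup issue itself (`[aw] Failure cascade detected`) out of the count, since its title does not end in "failed".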
Comment on lines 68 to +71

     on:
       schedule:
    -  - cron: "19 13 * * *"
    -    # Friendly format: daily (scattered)
    +  - cron: "19 13 * * 6"
    +    # Friendly format: weekly (scattered)
Comment on lines 1374 to 1376

     - name: Setup Docker Buildx (pre-validation)
    -  uses: docker/setup-buildx-action@4d04d5d9486b7bd6fa91e7baf45bbb4f8b9deedd # v4.0.0 (source v4)
    +  uses: docker/setup-buildx-action@d7f5e7f509e45cec5c76c4d5afdd7de93d0b3df5 # v4
     - name: Build Docker image (validation only)
On 2026-05-22, ~27 `[aw] X failed` issues were filed in 12 hours due to coincident root causes (lockfile drift, Codex outage, AI Moderator loop), crowding out genuine signal with no indication of shared cause.

Changes

`actions/setup/js/handle_agent_failure.cjs`

`detectAndHandleFailureCascade(owner, repo, triggeringIssueNumber)` is called after every failure issue create/update. It searches for open `[aw] * failed` issues (via `FAILURE_TITLE_PATTERN`) created in the last `CASCADE_WINDOW_MS` (60 min). When `CASCADE_THRESHOLD` (10) is reached, it:
- Creates (or updates) a single `[aw] Failure cascade detected` rollup issue listing all affected workflows
- Applies the `cascade-suspected` label to every issue in the window for batch-close once the root cause is patched
- Creates the `cascade-suspected` / `cascade-rollup` labels if missing (404-safe)
- Runs inside `try/catch` with `core.warning()`; a cascade failure never breaks the underlying issue-tracking path

`actions/setup/js/handle_agent_failure.test.cjs`

9 new unit tests covering: below-threshold no-op, exactly-at-threshold create+label, search indexing lag, existing rollup update, auto-label creation on 404, rollup body content, title filtering, and API error resilience.
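The non-fatal behaviour described above (cascade errors are logged as warnings and never break the issue-tracking path) can be sketched as a small wrapper. This is an illustration under assumptions: `runCascadeCheckSafely` is a hypothetical name, and `warn` stands in for a logger such as `core.warning`.

```javascript
// Hypothetical wrapper: run an async cascade check and convert any failure
// into a warning so the caller's main path continues unaffected.
async function runCascadeCheckSafely(detect, warn) {
  try {
    return await detect();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    warn(`Cascade detection failed (non-fatal): ${message}`);
    return null; // signal "no result" instead of propagating the error
  }
}

// Usage sketch: a failing check produces a warning, not an exception.
runCascadeCheckSafely(
  async () => { throw new Error("search API unavailable"); },
  msg => console.warn(msg)
).then(result => console.log(result)); // null
```

Returning `null` rather than rethrowing keeps the decision about how to continue with the caller, which matches the intent that issue creation and updates proceed even when the search or labeling calls fail.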