feat(failure-handler): add cascade detection when ≥10 [aw] failures fire within 60 min#34060

Merged
pelikhan merged 4 commits into main from copilot/deep-report-add-failure-cascade-detection
May 22, 2026

Conversation

Contributor

Copilot AI commented May 22, 2026

On 2026-05-22, ~27 [aw] X failed issues were filed in 12 hours due to coincident root causes (lockfile drift, Codex outage, AI Moderator loop), crowding out genuine signal with no indication of shared cause.

Changes

actions/setup/js/handle_agent_failure.cjs

  • detectAndHandleFailureCascade(owner, repo, triggeringIssueNumber) — called after every failure issue create/update. Searches for open [aw] * failed issues (via FAILURE_TITLE_PATTERN) created in the last CASCADE_WINDOW_MS (60 min). When CASCADE_THRESHOLD (10) is reached:
    • Creates (or updates) a single [aw] Failure cascade detected rollup issue listing all affected workflows
    • Adds cascade-suspected label to every issue in the window for batch-close once root cause is patched
    • Lazily creates cascade-suspected / cascade-rollup labels if missing (404-safe)
  • Triggering issue number is always merged into the window set to handle GitHub search indexing lag
  • All operations are non-fatal — wrapped in try/catch with core.warning(); cascade failure never breaks the underlying issue-tracking path
// Constants driving the detection
const CASCADE_WINDOW_MS  = CASCADE_WINDOW_MINUTES * 60 * 1000; // 60 min
const CASCADE_THRESHOLD  = 10;
const FAILURE_TITLE_PATTERN = /^\[aw\] .+ failed$/;
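
The decision described above (filter by title, restrict to the window, merge in the triggering issue, compare against the threshold) can be sketched as a pure function. This is an illustration only: `evaluateCascade` and the issue shape are hypothetical names introduced here, and `CASCADE_WINDOW_MINUTES = 60` is inferred from the "60 min" comment rather than copied from the PR.

```javascript
// Sketch under stated assumptions; not the PR's actual implementation.
const CASCADE_WINDOW_MINUTES = 60; // assumed from the "60 min" comment
const CASCADE_WINDOW_MS = CASCADE_WINDOW_MINUTES * 60 * 1000;
const CASCADE_THRESHOLD = 10;
const FAILURE_TITLE_PATTERN = /^\[aw\] .+ failed$/;

/**
 * Hypothetical helper: counts distinct failure issues inside the window.
 * @param {{number: number, title: string, created_at: string}[]} issues
 * @param {number} triggeringIssueNumber issue that triggered this check
 * @param {number} [now] current time in ms since the epoch
 */
function evaluateCascade(issues, triggeringIssueNumber, now = Date.now()) {
  const windowStart = now - CASCADE_WINDOW_MS;
  const numbers = new Set(
    issues
      .filter(i => FAILURE_TITLE_PATTERN.test(i.title)) // drops the rollup issue itself
      .filter(i => Date.parse(i.created_at) >= windowStart)
      .map(i => i.number)
  );
  // Always include the triggering issue, in case search indexing lags.
  numbers.add(triggeringIssueNumber);
  return { count: numbers.size, isCascade: numbers.size >= CASCADE_THRESHOLD };
}
```

With nine indexed failures plus a not-yet-indexed triggering issue, the set reaches ten and the check reports a cascade; if the triggering issue is already among the nine, the count stays at nine and nothing happens.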

actions/setup/js/handle_agent_failure.test.cjs

9 new unit tests covering: below-threshold no-op, exactly-at-threshold create+label, search indexing lag, existing rollup update, auto-label creation on 404, rollup body content, title filtering, and API error resilience.
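
One of the cases listed above, auto-label creation on 404, might be exercised along the following lines. This is a minimal sketch, not the PR's test code: `ensureLabel` and `makeFakeGithub` are hypothetical names, and the fake client only mimics the two Octokit calls used (`issues.getLabel` and `issues.createLabel`).

```javascript
// Hypothetical helper: look the label up, create it only on a 404.
async function ensureLabel(github, { owner, repo, name, color }) {
  try {
    await github.rest.issues.getLabel({ owner, repo, name });
    return "exists";
  } catch (err) {
    if (err.status !== 404) throw err; // anything but "not found" is a real error
    await github.rest.issues.createLabel({ owner, repo, name, color });
    return "created";
  }
}

// Hypothetical in-memory stand-in for the Octokit client used in tests.
function makeFakeGithub(existingLabels) {
  const labels = new Set(existingLabels);
  const created = [];
  return {
    created,
    rest: {
      issues: {
        async getLabel({ name }) {
          if (!labels.has(name)) {
            throw Object.assign(new Error("Not Found"), { status: 404 });
          }
          return { data: { name } };
        },
        async createLabel({ name, color }) {
          labels.add(name);
          created.push({ name, color });
          return { data: { name, color } };
        },
      },
    },
  };
}
```

A test can then assert that a missing label is created exactly once and that a second call finds it already present.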

Copilot AI and others added 3 commits May 22, 2026 16:20
When ≥10 [aw] * failed issues are filed within 60 minutes:
- Detect the cascade via GitHub search API
- Create (or update) a single [aw] Failure cascade detected rollup issue
- Label every individual issue in the window with cascade-suspected
- Ensure both cascade-suspected and cascade-rollup labels exist
- Integrate cascade check into both the new-issue and existing-issue paths
- Export helpers and constants for testability
- Add 9 unit tests covering all cascade paths

Co-authored-by: pelikhan <[email protected]>
…TLE_PATTERN, makeFailureItems helper

Co-authored-by: pelikhan <[email protected]>
Copilot AI changed the title from "[WIP] Add failure-cascade detection to autotriage for high-volume issues" to "feat(failure-handler): add cascade detection when ≥10 [aw] failures fire within 60 min" May 22, 2026
Copilot AI requested a review from pelikhan May 22, 2026 16:30
pelikhan marked this pull request as ready for review May 22, 2026 17:31
Copilot AI review requested due to automatic review settings May 22, 2026 17:31
pelikhan merged commit 440f340 into main May 22, 2026
pelikhan deleted the copilot/deep-report-add-failure-cascade-detection branch May 22, 2026 17:31
Contributor

Copilot AI left a comment

Pull request overview

Adds automated detection/rollup for bursts of [aw] … failed issues so that high-volume, correlated failures are summarized and triaged as a single “cascade” event, reducing issue noise and improving operational signal.

Changes:

  • Introduces failure-cascade detection in handle_agent_failure.cjs (search recent failures, create/update rollup issue, label impacted issues).
  • Adds unit tests covering cascade detection behavior and API-error resilience.
  • Updates generated workflow/action lock artifacts (including a schedule change in one workflow and a pinned action SHA update).

| File | Description |
| --- | --- |
| actions/setup/js/handle_agent_failure.cjs | Implements cascade detection, rollup issue management, and labeling logic; wires it into the existing failure-issue flow. |
| actions/setup/js/handle_agent_failure.test.cjs | Adds unit tests for cascade detection scenarios (thresholds, indexing lag, rollup update, label creation, non-fatal errors). |
| .github/workflows/release.lock.yml | Updates pinned docker/setup-buildx-action reference in the release workflow lockfile. |
| .github/workflows/developer-docs-consolidator.lock.yml | Changes the scheduled trigger cadence (daily → weekly) in the generated lock workflow. |
| .github/aw/actions-lock.json | Adds a new pinned entry for docker/setup-buildx-action@v4. |

Copilot's findings

  • Files reviewed: 5/5 changed files
  • Comments generated: 6

Comment on lines +1727 to +1736
const recentIssues = await findRecentFailureIssues(owner, repo);

// Ensure the triggering issue is included even if GitHub search indexing lags
const issueNumbers = new Set(recentIssues.map(i => i.number));
issueNumbers.add(triggeringIssueNumber);

if (issueNumbers.size < CASCADE_THRESHOLD) {
  core.info(
    `Cascade check: ${issueNumbers.size} failure issue(s) in the last ${CASCADE_WINDOW_MINUTES} min (threshold: ${CASCADE_THRESHOLD}) — no cascade`
  );
Comment on lines +1748 to +1763
// Build rollup body
const affectedList = recentIssues
.map(i => `- [#${i.number}](${i.html_url}) — ${i.title}`)
.join("\n");
const windowStart = new Date(Date.now() - CASCADE_WINDOW_MS);
const rollupBody = [
`## ⚠️ Failure Cascade Detected`,
``,
`**${issueNumbers.size} \`[aw] * failed\` issues** were filed within the last **${CASCADE_WINDOW_MINUTES} minutes** (since ${windowStart.toUTCString()}).`,
``,
`This volume suggests a common root cause (e.g., lockfile drift, provider outage, infrastructure change) rather than isolated workflow failures.`,
``,
`### Affected Workflows`,
``,
affectedList || `_(none indexed yet — search indexing may lag)_`,
``,
const color = ((r << 16) | (g << 8) | b).toString(16).padStart(6, "0");
await github.rest.issues.createLabel({ owner, repo, name: labelName, color });
core.info(`✓ Created label "${labelName}" (#${color})`);
} catch (createErr) {
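
The color expression in the excerpt above packs three 8-bit channels into one integer and prints it as six hex digits, which is the form the GitHub labels API expects (no leading `#`). A standalone sketch follows; `rgbToLabelColor` is a hypothetical name, and how the PR derives `r`, `g`, and `b` is not shown in the excerpt.

```javascript
// Hypothetical helper wrapping the expression from the excerpt.
// Each channel is assumed to be an integer in the range 0..255.
function rgbToLabelColor(r, g, b) {
  // Shift red into bits 16-23 and green into bits 8-15, then pad so that
  // small values such as pure blue keep their leading zeros.
  return ((r << 16) | (g << 8) | b).toString(16).padStart(6, "0");
}

console.log(rgbToLabelColor(215, 58, 74)); // prints "d73a4a"
console.log(rgbToLabelColor(0, 0, 255));   // prints "0000ff"
```

Without `padStart`, pure blue would serialize as `ff`, which is not a six-digit hex color.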
Comment on lines +1669 to +1675
const result = await github.rest.search.issuesAndPullRequests({
  q: searchQuery,
  per_page: 100,
  sort: "created",
  order: "asc",
});
return result.data.items
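
The excerpt above does not show how `searchQuery` is built. One plausible construction, using standard GitHub search qualifiers (`repo:`, `is:issue`, `is:open`, `in:title`, `created:>=`), is sketched below. The helper names and the exact query string are assumptions, not the PR's code; the title filter is applied afterwards because search matching is fuzzy.

```javascript
const FAILURE_TITLE_PATTERN = /^\[aw\] .+ failed$/;

// Hypothetical: build a search query for open issues created inside the window.
function buildFailureSearchQuery(owner, repo, windowMs, now = Date.now()) {
  // Drop milliseconds so the timestamp reads YYYY-MM-DDTHH:MM:SSZ.
  const since = new Date(now - windowMs).toISOString().replace(/\.\d{3}Z$/, "Z");
  return `repo:${owner}/${repo} is:issue is:open in:title "[aw]" created:>=${since}`;
}

// Hypothetical: keep only items whose title is exactly "[aw] <name> failed".
function filterFailureItems(items) {
  return items.filter(item => FAILURE_TITLE_PATTERN.test(item.title));
}
```

Note that `per_page: 100` caps a single page of results; a burst larger than that would need pagination, which this sketch does not cover.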
Comment on lines 68 to +71
  on:
    schedule:
-   - cron: "19 13 * * *"
-   # Friendly format: daily (scattered)
+   - cron: "19 13 * * 6"
+   # Friendly format: weekly (scattered)
Comment on lines 1374 to 1376
  - name: Setup Docker Buildx (pre-validation)
-   uses: docker/setup-buildx-action@4d04d5d9486b7bd6fa91e7baf45bbb4f8b9deedd # v4.0.0 (source v4)
+   uses: docker/setup-buildx-action@d7f5e7f509e45cec5c76c4d5afdd7de93d0b3df5 # v4
  - name: Build Docker image (validation only)
Development

Successfully merging this pull request may close these issues.

[deep-report] Add failure-cascade detection to autotriage when >10 [aw] X failed issues fire within 60 min

3 participants