Agent Workflow Metrics via GitHub Labels #33986
kubaflo wants to merge 130 commits into dotnet:main
Clarify the Copilot PR-review prompt to execute five explicit phases (Understanding, Test Review, Fix Exploration, Alternative Comparison, Final Review). Add a new pipeline step that invokes a Copilot "post-comment" skill, captures its exit code, logs output to $(Build.ArtifactStagingDirectory)/copilot-logs, and sets a PostCommentFailed variable on failure. Ensure artifacts dir exists, surface warnings on failure, and make the original post-comment fallback step run only when the skill step failed. Update step display names and preserve artifact publishing.
Add a pipeline step (Cache Prompt File) that loads eng/pipelines/prompts/pr-review-prompt.md and copies it to /tmp/copilot-prompts/pr-review-prompt.md before the PR branch checkout. Update the later Copilot step to read the prompt from the cached location and adjust the error message. This prevents failures when the prompt file is absent on the PR branch by ensuring a stable copy is available for the review step.
Replace brittle iPhone Xs + iOS 18.5 lookup with a prioritized selection routine. The script now iterates preferred iOS versions and device models, picks the first available preferred device or falls back to the first available iPhone, and finally falls back to any available simulator. It also surfaces runtime info when listing available simulators and reports the selected device name and UDID before booting.
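The prioritized selection described in this commit can be sketched as follows. This is a minimal illustration in Python rather than the script's PowerShell; the preference lists, the device dictionary shape (`name`, `ios`, `udid`) and the function name `pick_simulator` are assumptions, not the script's actual values.

```python
from typing import Optional

# Hypothetical preference lists; the real values live in the script.
PREFERRED_VERSIONS = ["18.5", "18.4", "17.5"]
PREFERRED_MODELS = ["iPhone Xs", "iPhone 11 Pro"]


def pick_simulator(devices: list[dict]) -> Optional[dict]:
    """Return the best available device, or None when the list is empty.

    Each device is assumed to be a dict with 'name', 'ios' and 'udid' keys.
    """
    # 1. Walk preferred iOS versions first, then preferred models within each.
    for version in PREFERRED_VERSIONS:
        for model in PREFERRED_MODELS:
            for device in devices:
                if device["ios"] == version and device["name"] == model:
                    return device
    # 2. Fall back to the first available iPhone.
    for device in devices:
        if device["name"].startswith("iPhone"):
            return device
    # 3. Finally fall back to any available simulator.
    return devices[0] if devices else None
```

Ordering the loops by version first means a preferred model on an older preferred runtime still wins over a non-preferred model on the newest runtime, which is what keeps snapshot baselines stable.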
Replace direct checkout of the PR branch with logic that fetches the PR, computes the merge-base against the current branch, and cherry-picks commits from the merge-base..PR_HEAD onto the current branch (using --no-commit). Adds commit counting, a warning when no commits are found, conflict handling with status and diff output, and extra logging (current branch, merge base, commit count, last commit and status). Also updates the pipeline step display name to 'Cherry-pick PR Changes' and tweaks the fetch message.
Add a CI step to run ./build.ps1 --target=dotnet-buildtasks (Release, diagnostic) to compile MSBuild tasks required for MAUI builds. The step includes a retry on failure and sets DOTNET_TOKEN and PRIVATE_BUILD environment variables for accessing internal artifacts. Placed before the simulator/emulator listing to ensure tasks are available for subsequent MAUI jobs.
Introduce structured PRAgent phase output directories and content.md artifacts, and make agent workflows non-interactive/CI-first. Documentation (.github/agents/*, SKILLs and plan templates) updated to require writing phase outputs to CustomAgentLogsTmp/PRState/{PRNumber}/PRAgent/{phase}/content.md, to prefer continuing autonomously on environment blockers (retry once, then skip) and to remove strict requirements to create a single monolithic state file before starting. Review-PR.ps1 now creates PRAgent phase directories, documents CI-mode behavior and phase output paths, adjusts pr-finalize output locations, and updates labeling invocation. Several ai-summary-comment docs/scripts were updated to read generic "content" artifacts (and to remove SkipValidation usage). Overall changes align agent scripts and docs with CI-friendly, structured phase outputs and clearer failure/retry semantics.
Remove stale agent session notes and tighten PR scripting behavior.
Changes:
- Deleted .github/agent-pr-session/*.md (removed archived agent session files).
- .github/scripts/Review-PR.ps1: updated PR log directory and display path to use "PRAgent/copilot-logs" subfolder.
- .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1: stop falling back to an "unknown" PR folder; now errors and exits if -PRNumber is not provided.
Reason: avoid ambiguous "unknown" PR artifacts and standardize log location under PRAgent; fail early when PR number is missing to prevent accidental runs with incorrect paths.
Align docs and script to the new path layout under CustomAgentLogsTmp/PRState/{number}/PRAgent. Updated PLAN-TEMPLATE.md (post-pr-finalize SummaryFile path), SKILL.md (auto-loading description), and post-try-fix-comment.ps1 (examples, parameter docs, and path regex) so try-fix and finalize operations look in the PRAgent/try-fix and PRAgent/pr-finalize locations.
Replace CI-specific language with general "autonomous/non-interactive" phrasing and tighten guidance to not prompt a human operator. Updates remove or reword references to "CI mode" and emphasize skipping blocked phases, retrying once, and continuing autonomously. Affected files: .github/agents/pr.md, .github/agents/pr/PLAN-TEMPLATE.md, .github/agents/pr/SHARED-RULES.md, .github/agents/pr/post-gate.md, and .github/scripts/Review-PR.ps1.
Remove the optional -Content parameter and make post-ai-summary-comment.ps1 always load phase content from CustomAgentLogsTmp/PRState/<PRNumber>/PRAgent/*/content.md. Update script help, examples, and validation messages; refactor auto-load logic to locate the repo root, load available phase files (pre-flight, gate, try-fix, report), build a status table and per-phase details, and synthesize the final comment. Also update SKILL.md to remove the "Provide content directly" section, adjust the Parameters table, and clarify the auto-loading behavior in the documentation.
Save the current branch and commit SHA before running the PR agent and use that pinned restore point to reliably restore the working tree between phases. Detects if the agent or finalize step changed branch/HEAD and recovers via git checkout/reset to the saved branch+SHA; otherwise performs targeted checkouts from the pinned SHA. Also update targeted file recoveries to use the pinned SHA. Additionally, clarify the try-fix skill docs: the baseline script requires the PR changes to be present on the current branch and should be reported as Blocked rather than switching branches when fix files are missing.
677d745 to
fc148e2
Introduce centralized agent label management and documentation. Adds a new shared script (.github/scripts/shared/Update-AgentLabels.ps1) that parses phase content.md files and idempotently creates/applies outcome, signal, and tracking labels (s/agent-*) via the GH API. Integrates label application into Review-PR.ps1 as Phase 4 (with a recovery attempt if the helper is missing). Adds comprehensive docs (.github/docs/agent-labels.md) and documents labeling behavior in .github/agents/pr/SHARED-RULES.md. Removes the older, in-file verification label logic from verify-tests-fail.ps1 and its calls, consolidating label responsibilities into the new helper. Labels are applied non-fatally and auto-created/updated on first use.
Update Update-AgentLabels.ps1
Rename s/agent-fix-lose label to s/agent-fix-pr-picked
Co-Authored-By: Copilot
Introduce a -Unified mode to post-pr-finalize-comment.ps1 and call it from Review-PR.ps1. When enabled, the script injects or updates a PR Finalization section inside the existing AI Summary comment (or creates a new unified AI Summary comment) using explicit markers and a collapsible details block; it also removes any legacy standalone finalize comment. Dry-run preview support was added (writes preview file), and existing standalone behavior remains the default when -Unified is not passed. Changes made in .github/scripts/Review-PR.ps1 and .github/skills/ai-summary-comment/scripts/post-pr-finalize-comment.ps1.
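The inject-or-update behavior of the unified mode can be sketched as below. This is an illustrative Python sketch, not the script's code: the marker strings and the function name `upsert_section` are hypothetical, since the commit message does not show the actual markers.

```python
# Hypothetical marker strings; the script's real markers are not shown above.
START = "<!-- PR-FINALIZE:START -->"
END = "<!-- PR-FINALIZE:END -->"


def upsert_section(comment_body: str, section: str) -> str:
    """Insert or replace the finalization section inside a comment body."""
    block = (
        f"{START}\n<details>\n<summary>PR Finalization</summary>\n\n"
        f"{section}\n\n</details>\n{END}"
    )
    start = comment_body.find(START)
    end = comment_body.find(END)
    if start != -1 and end > start:
        # Markers already present: replace everything between them, inclusive.
        return comment_body[:start] + block + comment_body[end + len(END):]
    # No markers yet: append the section to the existing comment.
    separator = "\n\n" if comment_body else ""
    return comment_body + separator + block
```

Because the replacement spans both markers, running the update twice with the same content leaves the comment unchanged, which is the property a re-runnable pipeline step needs.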
Update Start-Emulator.ps1 to select iOS simulators that match UI test baseline devices. Replace the single preferred device list with a per-iOS-version mapping (iOS-18/iOS-17 prefer iPhone Xs; iOS-26 prefers iPhone 11 Pro) and adjust the selection logic to use the version-specific preferences. Comments were updated to document why certain devices are preferred to ensure consistency with UITest.cs baselines.
When deploying or starting iOS simulators, add logic to detect any other booted simulators and shut them down to prevent Appium from connecting to the wrong device. Implements parsing of `xcrun simctl list devices --json` and shuts down any booted simulator whose UDID does not match the target in both Build-AndDeploy.ps1 and Start-Emulator.ps1. Also update the success message to include the simulator name for clearer logs.
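A rough sketch of the detection step, written in Python for illustration only (the PR implements it in PowerShell). It assumes the usual shape of `xcrun simctl list devices --json` output: a top-level `devices` object keyed by runtime identifier, each holding a list of devices with `udid` and `state` fields. The function name is hypothetical.

```python
import json


def other_booted_udids(simctl_json: str, target_udid: str) -> list[str]:
    """Return UDIDs of booted simulators other than the target device."""
    data = json.loads(simctl_json)
    booted = []
    # "devices" maps each runtime identifier to a list of device records.
    for runtime_devices in data.get("devices", {}).values():
        for device in runtime_devices:
            if device.get("state") == "Booted" and device.get("udid") != target_udid:
                booted.append(device["udid"])
    return booted
```

Each returned UDID would then be passed to `xcrun simctl shutdown <udid>` before the target simulator is used.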
…ype unavailable
iPhone Xs device type (com.apple.CoreSimulator.SimDeviceType.iPhone-Xs) is not available on newer Xcode versions on CI agents. iPhone 11 Pro has the same screen resolution (1125x2436 @3x) so snapshots match the baselines captured on iPhone Xs.
Fallback order: iPhone Xs (existing) → iPhone 11 Pro (existing) → create iPhone Xs → create iPhone 11 Pro → first available iPhone.
Fix: Start-Emulator.ps1 respects DEVICE_UDID env var and prefers iPhone 11 Pro
Two fixes:
1. Check $env:DEVICE_UDID before auto-detecting - the CI pipeline sets this via ##vso[task.setvariable] but Start-Emulator.ps1 was ignoring it
2. Add iPhone 11 Pro as second preferred device for iOS 18/17 (same 1125x2436 resolution as iPhone Xs) - iPhone Xs device type is unavailable on CI agents
Fix CI iOS simulator selection to use iPhone Xs for snapshot baselines
The CI pipeline was selecting iPhone 16 Pro (1206x2472) which doesn't match the UI test baseline screenshots captured on iPhone Xs (1124x2286).
Changes:
- Create iPhone Xs simulator if not available on CI agent
- Target the latest stable iOS runtime (18.x preferred)
- Shutdown other booted simulators to prevent Appium conflicts
Co-Authored-By: Copilot
dotnet#34156)

Pipeline runs for the Copilot CI pipeline had no meaningful title, making it hard to identify runs at a glance. This adds a step immediately after `Validate Parameters` that renames the run to `PR: {PRNumber} {Platform}` using the Azure DevOps logging command.

## Change

- **`eng/pipelines/ci-copilot.yml`**: Adds a `Set Pipeline Run Title` step after `Validate Parameters`:

```yaml
- script: |
    echo "##vso[build.updatebuildnumber]PR: ${{ parameters.PRNumber }} ${{ parameters.Platform }}"
  displayName: 'Set Pipeline Run Title'
```

Produces titles like `PR: 1234 android` or `PR: 5678 ios`. Implemented as a bash `script:` for compatibility with the macOS agents used by this pipeline.

<details>
<summary>Original prompt</summary>

> Create a pull request in `dotnet/maui` (base branch `copilot-ci`) to update the Azure DevOps pipeline at `eng/pipelines/ci-copilot.yml` so that the pipeline run title/build number is updated early in the run.
>
> Requirements:
> - Add a step shortly after the existing **Validate Parameters** step to rename the pipeline run using Azure DevOps logging command `##vso[build.updatebuildnumber]...`.
> - The run title should be exactly: `PR: {PR number} {Platform}` where:
>   - PR number comes from parameter `${{ parameters.PRNumber }}`
>   - Platform comes from parameter `${{ parameters.Platform }}`
> - Use a clear `displayName`, e.g. `Set Pipeline Run Title`.
> - Keep the change minimal and do not alter existing behavior beyond setting the run title.
>
> Context:
> - File source URL: https://github.com/dotnet/maui/blob/copilot-ci/eng/pipelines/ci-copilot.yml
> - CommitOID (context): 4896e12
>
> Notes:
> - Implement as a YAML step using `script:` (bash) for maximum compatibility on macOS agents.
> - Ensure the title format does not include parentheses; use a single space between PR number and platform, e.g. `PR: 1234 android`.

</details>

*This pull request was created from Copilot chat.*

Co-authored-by: copilot-swe-agent[bot]
Co-authored-by: jfversluis
Second test
This reverts commit 28bc3cc.
Introduce automated agent labeling for PR reviews: add a new shared labeler script (.github/scripts/shared/Update-AgentLabels.ps1) and wire it into Review-PR.ps1 as Phase 4 (Apply Labels). The labeler parses phase content.md files (gate/try-fix/report) to determine outcome, gate and fix signal labels, ensures labels exist, and applies/removes mutually-exclusive outcome/signal labels plus a tracking label (s/agent-reviewed). Add comprehensive docs (.github/docs/agent-labels.md) and update the PR agent SHARED-RULES.md to describe label meanings and expectations. Operations are idempotent and non-fatal; Review-PR.ps1 attempts a targeted recovery if the helper is missing.
de1a7e8 to
a530d85
Pull request overview
This PR implements a comprehensive GitHub label-based metrics system for tracking AI agent PR review workflow outcomes. The system uses s/agent-* prefixed labels to track review outcomes, test verification results, and fix comparison results across the automated PR review pipeline.
Changes:
- Introduces new label management module (`Update-AgentLabels.ps1`) with idempotent label operations
- Adds Phase 4 to Review-PR.ps1 for automatic label application based on phase outcomes
- Refactors agent output from centralized state files to distributed `content.md` files per phase
- Updates all agent instructions and skill documentation to reflect the new phase output artifact structure
- Removes old label management code (`Update-VerificationLabels`) in favor of centralized system
- Adds new CI/Copilot pipeline configuration for automated agent PR reviews
- Cleans up Azure DevOps variable groups and pipeline configuration
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `.github/scripts/shared/Update-AgentLabels.ps1` | New module implementing label management with parsing, application, and self-bootstrapping |
| `.github/scripts/Review-PR.ps1` | Adds Phase 4 for label application; implements pinned SHA restoration; adds phase output directories |
| `.github/docs/agent-labels.md` | Comprehensive documentation of the label system, architecture, and usage examples |
| `eng/pipelines/ci-copilot.yml` | New Azure DevOps pipeline for running Copilot PR reviewer agent with full environment setup |
| `eng/pipelines/common/variables.yml` | Simplifies variable group structure; removes unused conditional logic |
| `eng/pipelines/common/provision.yml` | Adds skipCertificates parameter for CI scenarios |
| `.github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1` | Removes old label management; updates output path to new structure |
| `.github/skills/try-fix/SKILL.md` | Updates documentation to remove state file references |
| `.github/skills/learn-from-pr/SKILL.md` | Removes session markdown references |
| `.github/skills/ai-summary-comment/scripts/*.ps1` | Updates all scripts to auto-load from PRAgent phase content.md files instead of state files |
| `.github/skills/ai-summary-comment/SKILL.md` | Documents new auto-loading behavior from phase files |
| `.github/skills/ai-summary-comment/NO-EXTERNAL-REFERENCES-RULE.md` | Simplifies by removing state file references |
| `.github/skills/ai-summary-comment/IMPROVEMENTS.md` | Updates terminology from "state file" to "content" |
| `.github/scripts/shared/Start-Emulator.ps1` | Improves iOS simulator selection logic for UI test baseline compatibility |
| `.github/scripts/shared/Build-AndDeploy.ps1` | Adds logic to shutdown other booted simulators before deployment |
| `.github/scripts/BuildAndRunHostApp.ps1` | Adds test artifact collection for screenshots and page source |
| `.github/copilot-instructions.md` | Updates agent documentation to reflect new output structure |
| `.github/agents/pr/post-gate.md` | Updates for autonomous execution mode and phase output artifacts |
| `.github/agents/pr/SHARED-RULES.md` | Major update: documents phase output artifacts and agent label system; changes blocking behavior to autonomous |
| `.github/agents/pr/PLAN-TEMPLATE.md` | Updates plan template to reflect new phase output requirements |
| `.github/agents/pr.md` | Removes state file creation steps; updates for phase output artifacts |
| `.github/agents/learn-from-pr.md` | Removes session markdown references |
| `s/agent-fix-win` | 🟢 `#66BB6A` | AI found a better alternative fix than the PR | Fix phase: alternative selected over PR's fix |
| `s/agent-fix-pr-picked` | 🟠 `#FF7043` | AI could not beat the PR fix — PR is the best among all candidates | Fix phase: PR selected as best after comparison |

Gate labels (`gate-passed`/`gate-failed`) are mutually exclusive with each other. Fix labels (`fix-win`/`fix-lose`) are mutually exclusive with each other.
The label name s/agent-fix-pr-picked in the code does not match the documentation which refers to it as s/agent-fix-lose.
In the documentation at line 34, the table mentions "Fix labels (fix-win/fix-lose)" suggesting the label should be called s/agent-fix-lose, but the actual label defined in Update-AgentLabels.ps1 line 35 is s/agent-fix-pr-picked.
Either the code should use s/agent-fix-lose to match the documentation's naming pattern, or the documentation should be updated to consistently use s/agent-fix-pr-picked. The current mismatch could cause confusion when users try to query these labels.
Suggested change:
- Gate labels (`gate-passed`/`gate-failed`) are mutually exclusive with each other. Fix labels (`fix-win`/`fix-lose`) are mutually exclusive with each other.
+ Gate labels (`gate-passed`/`gate-failed`) are mutually exclusive with each other. Fix labels (`fix-win`/`fix-pr-picked`) are mutually exclusive with each other.
# PRs the agent approved
is:pr label:s/agent-approved

# PRs where agent found a better fix
The comment at line 107 says "PRs where agent found a better fix" but queries for s/agent-fix-pr-picked. This is semantically backwards.
According to the label definitions:
- s/agent-fix-win = "AI found a better alternative fix than the PR"
- s/agent-fix-pr-picked = "AI could not beat the PR fix — PR is the best"
So the query comment should say "PRs where agent could NOT beat the PR fix" or the query should use label:s/agent-fix-win instead.
Suggested change:
- # PRs where agent found a better fix
+ # PRs where agent could NOT beat the PR fix (PR fix was best)
- group: SDL_Settings
- group: AzureDevOps-Artifact-Feeds-Pats
- ${{ if eq(variables['Build.DefinitionName'], 'dotnet-maui') }}:
  - group: Publish-Build-Assets # This variable group contains secrets to publis to BAR
Typo in the comment: "publis" should be "publish". The comment reads "This variable group contains secrets to publis to BAR" but should read "This variable group contains secrets to publish to BAR".
Suggested change:
- - group: Publish-Build-Assets # This variable group contains secrets to publis to BAR
+ - group: Publish-Build-Assets # This variable group contains secrets to publish to BAR
$script:ManualLabels = @{
    's/agent-fix-implemented' = @{ Description = 'PR author implemented the agent suggested fix'; Color = '7B1FA2' }
}
The PR description mentions TWO manual labels (s/agent-fix-implemented and s/agent-suggestions-implemented), but the code only defines ONE manual label (s/agent-fix-implemented).
The PR description states:
| `s/agent-fix-implemented` | 🟣 `#7B1FA2` | PR author implemented the agent's suggested fix | Maintainer applies when PR author adopts agent's recommendation |
| `s/agent-suggestions-implemented` | 🟣 `#7B1FA2` | PR author implemented the agent's code suggestions | Maintainer applies when PR author adopts agent's recommendation |
However, Update-AgentLabels.ps1 only defines s/agent-fix-implemented (line 39), and the documentation only documents s/agent-fix-implemented (line 50). Either add the second manual label to the code, or remove it from the PR description.
if ($reportContent -match '(?i)Final\s+Recommendation:\s*APPROVE|✅\s*Final\s+Recommendation:\s*APPROVE') {
    $result.Outcome = 'approved'
}
elseif ($reportContent -match '(?i)Final\s+Recommendation:\s*REQUEST.CHANGES|⚠️\s*Final\s+Recommendation:\s*REQUEST.CHANGES') {
The regex pattern on line 387 uses REQUEST.CHANGES with an unescaped dot (.), but the pattern likely intends to match either "REQUEST CHANGES" or "REQUEST_CHANGES".
In regex, . matches any character, so this would also match "REQUESTXCHANGES" or "REQUEST-CHANGES" etc. If the intent is to match a space or underscore, the pattern should be REQUEST[\s_]CHANGES. If the intent is only to match with a space (which seems more likely based on line 384's APPROVE pattern), then it should be REQUEST\s+CHANGES.
Suggested change:
- elseif ($reportContent -match '(?i)Final\s+Recommendation:\s*REQUEST.CHANGES|⚠️\s*Final\s+Recommendation:\s*REQUEST.CHANGES') {
+ elseif ($reportContent -match '(?i)Final\s+Recommendation:\s*REQUEST\s+CHANGES|⚠️\s*Final\s+Recommendation:\s*REQUEST\s+CHANGES') {
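To make the difference between the three candidate patterns concrete, here is a small check written in Python's `re` module rather than .NET regex. The handling of `.`, `\s+` and `[\s_]` is the same in both engines for these inputs; the helper name `accepted_by` is only for illustration.

```python
import re

PREFIX = r"(?i)Final\s+Recommendation:\s*"

# Pattern under review: the unescaped dot accepts any single character.
any_char = re.compile(PREFIX + r"REQUEST.CHANGES")
# Suggested fix: one or more whitespace characters only.
whitespace_only = re.compile(PREFIX + r"REQUEST\s+CHANGES")
# Alternative from the comment: exactly one whitespace character or underscore.
space_or_underscore = re.compile(PREFIX + r"REQUEST[\s_]CHANGES")


def accepted_by(pattern: re.Pattern, verdict: str) -> bool:
    """Check whether a report line with the given verdict matches the pattern."""
    return pattern.search(f"Final Recommendation: {verdict}") is not None
```

With these definitions, `accepted_by(any_char, "REQUESTXCHANGES")` is true while the other two patterns reject it, and only `space_or_underscore` accepts `REQUEST_CHANGES`. If the agent ever emits the underscore form, the `\s+` suggestion would silently fall through to the next branch, so the choice depends on which spellings the report phase can produce.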
Gate phase result: 'passed', 'failed', or $null (skipped)

.PARAMETER FixResult
Fix phase result: 'win' (PR best), 'lose' (alternative better), or $null (skipped)
The parameter documentation comment on line 214 is backwards. It says:
'win' (PR best), 'lose' (alternative better)
But based on the actual logic in the function (lines 259-286) and the label descriptions:
- 'win' → applies s/agent-fix-win → "AI found a better alternative fix than the PR" (agent wins, not PR)
- 'lose' → applies s/agent-fix-pr-picked → "AI could not beat the PR fix — PR is the best" (agent loses, not alternative better)
The comment should say: 'win' (agent found better alternative), 'lose' (PR is best) which is already correctly stated on line 219.
Suggested change:
- Fix phase result: 'win' (PR best), 'lose' (alternative better), or $null (skipped)
+ Fix phase result: 'win' (agent found better alternative), 'lose' (PR is best), or $null (skipped)
# Agent Workflow Labels
GitHub labels for tracking outcomes of the AI agent PR review workflow (`Review-PR.ps1`).
All labels use the `s/agent-*` prefix for easy querying on GitHub.
## Label Categories
### Outcome Labels
Mutually exclusive: exactly one is applied per PR review run.
| Label | Color |
|---|---|
| `s/agent-approved` | `#2E7D32` |
| `s/agent-changes-requested` | `#E65100` |
| `s/agent-review-incomplete` | `#B71C1C` |

When a new outcome label is applied, any previously applied outcome label is automatically removed.
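The replace-on-apply rule for outcome labels can be sketched as a pure set operation. This is an illustrative Python sketch; the function name `apply_outcome` is hypothetical, and the real script performs the equivalent add/remove calls against the GitHub API.

```python
OUTCOME_LABELS = {
    "s/agent-approved",
    "s/agent-changes-requested",
    "s/agent-review-incomplete",
}


def apply_outcome(current: set[str], new_outcome: str) -> set[str]:
    """Return the PR's label set after applying a new outcome label."""
    if new_outcome not in OUTCOME_LABELS:
        raise ValueError(f"not an outcome label: {new_outcome}")
    # Drop any previous outcome label, keep everything else, add the new one.
    return (current - OUTCOME_LABELS) | {new_outcome}
```

Applying the same outcome twice yields the same set, so a re-run of the review does not accumulate labels.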
### Signal Labels
Additive: multiple can coexist on a single PR.
| Label | Color |
|---|---|
| `s/agent-gate-passed` | `#4CAF50` |
| `s/agent-gate-failed` | `#FF9800` |
| `s/agent-fix-win` | `#66BB6A` |
| `s/agent-fix-lose` | `#FF7043` |

Gate labels (`gate-passed`/`gate-failed`) are mutually exclusive with each other. Fix labels (`fix-win`/`fix-lose`) are mutually exclusive with each other.
### Tracking Label
Always applied on every completed agent run.
| Label | Color |
|---|---|
| `s/agent-reviewed` | `#1565C0` |

### Manual Label
Applied by MAUI maintainers, not by automation.
| Label | Color |
|---|---|
| `s/agent-fix-implemented` | `#7B1FA2` |
| `s/agent-suggestions-implemented` | `#7B1FA2` |

## How It Works
### Architecture
Labels are applied exclusively from `Review-PR.ps1` Phase 4. No other script applies agent labels. This single-source design avoids label conflicts and simplifies debugging.
### How Labels Are Parsed
The `Parse-PhaseOutcomes` function in `Update-AgentLabels.ps1` reads `content.md` files from each phase directory:
| Source | Pattern | Label |
|---|---|---|
| `gate/content.md` | `**Result:** ✅ PASSED` | `s/agent-gate-passed` |
| `gate/content.md` | `**Result:** ❌ FAILED` | `s/agent-gate-failed` |
| `try-fix/content.md` | `**Selected Fix:** Candidate ...` | `s/agent-fix-win` |
| `try-fix/content.md` | `**Selected Fix:** PR ...` | `s/agent-fix-lose` |
| `report/content.md` | `Final Recommendation: APPROVE` | `s/agent-approved` |
| `report/content.md` | `Final Recommendation: REQUEST CHANGES` | `s/agent-changes-requested` |
| | | `s/agent-review-incomplete` |

### Self-Bootstrapping
Labels are created automatically on first use via `Ensure-LabelExists`. No manual setup required. If a label already exists but has a stale description or color, it is updated.
## Querying Labels
All labels use the `s/agent-*` prefix, making them easy to filter on GitHub.
### Common Queries
### Metrics You Can Derive
- `is:pr label:s/agent-reviewed`
- `label:s/agent-approved` vs `label:s/agent-changes-requested` counts
- `label:s/agent-gate-passed` vs `label:s/agent-gate-failed` counts
- `label:s/agent-fix-win` vs `label:s/agent-fix-lose` counts
- `label:s/agent-fix-implemented` / `label:s/agent-changes-requested`
- `label:s/agent-review-incomplete` / `label:s/agent-reviewed`

## Implementation Details
### Files
- `.github/scripts/shared/Update-AgentLabels.ps1`
- `.github/scripts/Review-PR.ps1`: `Apply-AgentLabels` in Phase 4
- `.github/agents/pr/SHARED-RULES.md`

### Key Functions
- `Apply-AgentLabels`
- `Parse-PhaseOutcomes`: `content.md` files, returns outcome/gate/fix results
- `Update-AgentOutcomeLabel`
- `Update-AgentSignalLabels`
- `Update-AgentReviewedLabel`
- `Ensure-LabelExists`

### Design Principles
- `Review-PR.ps1` only: no other scripts touch labels

### Migrated From
The following old infrastructure was removed as part of this implementation:
- `Update-VerificationLabels` function in `verify-tests-fail.ps1`: removed (labels now come from `Review-PR.ps1` only)
- `s/ai-reproduction-confirmed`/`s/ai-reproduction-failed` labels: superseded by `s/agent-gate-passed`/`s/agent-gate-failed`
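For readers who want a feel for the mapping under "How Labels Are Parsed", the following is a rough Python sketch of the same idea, not a port of `Parse-PhaseOutcomes`. The function name, its signature, and the fallback to `s/agent-review-incomplete` when no recommendation is found are assumptions; label names follow this document (a later commit in the PR renames `s/agent-fix-lose` to `s/agent-fix-pr-picked`).

```python
import re
from typing import Optional


def parse_phase_outcomes(
    gate: Optional[str], try_fix: Optional[str], report: Optional[str]
) -> list[str]:
    """Map the text of each phase's content.md to agent labels."""
    labels = []
    if gate:
        if re.search(r"\*\*Result:\*\*.*\bPASSED\b", gate):
            labels.append("s/agent-gate-passed")
        elif re.search(r"\*\*Result:\*\*.*\bFAILED\b", gate):
            labels.append("s/agent-gate-failed")
    if try_fix:
        selected = re.search(r"\*\*Selected Fix:\*\*\s*(Candidate|PR)\b", try_fix)
        if selected:
            is_candidate = selected.group(1) == "Candidate"
            labels.append("s/agent-fix-win" if is_candidate else "s/agent-fix-lose")
    # Assumption: a missing or unrecognized report counts as an incomplete review.
    if report and re.search(r"(?i)Final\s+Recommendation:\s*APPROVE", report):
        labels.append("s/agent-approved")
    elif report and re.search(r"(?i)Final\s+Recommendation:\s*REQUEST\s+CHANGES", report):
        labels.append("s/agent-changes-requested")
    else:
        labels.append("s/agent-review-incomplete")
    return labels
```

Skipped phases simply contribute no signal label, while the outcome branch always yields exactly one label, matching the "exactly one outcome per run" rule.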