Improve Tower telemetry error handling on transient gateway failures #7190

Merged
pditommaso merged 4 commits into master from fix-tower-502-error-reason on May 29, 2026

@pditommaso (Member)

Problem

A pipeline run aborted because the Seqera Platform progress/telemetry endpoint returned a transient HTTP 502 Bad Gateway. From the user's perspective the run "stopped quietly": the failure reason was hard to find and the run gave up after only a few seconds of retries.

Investigation of the run log surfaced three distinct issues:

  1. The reason was an unreadable HTML dump. TowerClient could not parse the gateway's HTML error page as JSON, so it surfaced the entire <html>…502 Bad Gateway…</html> markup as the AbortRunException message.
  2. The retry window was too short. The retry policy gave up in ~5 seconds (5 attempts) on a transient 502 that is usually short-lived.
  3. (Discovered while testing) AuthCommandImplTest was hitting the real network and could run for >6 minutes.

Changes

1. Surface a concise reason for HTTP gateway errors

TowerClient.parseCause now reduces an HTML error body to a concise reason by extracting the <title> text (gateways/proxies put the status there, e.g. 502 Bad Gateway); the match tolerates attributes and collapses internal whitespace. Non-HTML bodies fall back to the plain text as-is, and JSON error objects are unchanged.

Resulting abort message:

Unexpected HTTP response
- endpoint    : https://cloud.seqera.io/api/trace/<id>/progress?workspaceId=<id>
- status code : 502
- response msg: 502 Bad Gateway
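
The title extraction described in this section can be illustrated with a minimal sketch. It is written in Python for brevity and is not the TowerClient source; the helper name `concise_reason` and the regular expression are assumptions made for illustration, and the JSON error-object path is left out.

```python
import re

# Hypothetical pattern: tolerate attributes on <title> and a title that
# spans several lines. Not taken from the Nextflow code base.
_TITLE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def concise_reason(body: str) -> str:
    """Reduce an HTTP error body to a short, human-readable reason."""
    text = (body or "").strip()
    match = _TITLE.search(text)
    if match:
        # Collapse newlines and runs of spaces inside the title.
        reason = " ".join(match.group(1).split())
        if reason:
            return reason
    # Non-HTML body (or empty title): fall back to the plain text.
    return text


html = "<html><head><title>502 Bad Gateway</title></head><body>...</body></html>"
print(concise_reason(html))
```

With a typical gateway error page the sketch yields only the title text, which is what ends up in the `response msg` line of the abort message.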

2. Extend the retry window to ~3 minutes

Raise the Tower retry policy default maxAttempts from 5 to 10. Combined with the existing delay (350ms), multiplier (2.0) and maxDelay (90s), the 9 exponential backoff gaps span a retry window of ~3 minutes (0.35·(2^9−1) ≈ 178.9s), so transient gateway errors are ridden out before the run fails.
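
The arithmetic behind the retry window can be checked with a short sketch. It assumes a plain exponential backoff with no jitter, where each gap is the base delay times the multiplier raised to the gap index, capped at maxDelay; the function name `backoff_gaps` is hypothetical and the actual retry library may compute delays differently.

```python
def backoff_gaps(max_attempts, delay=0.35, multiplier=2.0, max_delay=90.0):
    """Return the waits (in seconds) between consecutive attempts.

    Assumption: no jitter, gap i = min(delay * multiplier**i, max_delay).
    N attempts are separated by N - 1 gaps.
    """
    return [min(delay * multiplier ** i, max_delay) for i in range(max_attempts - 1)]


old_window = sum(backoff_gaps(5))    # previous default: 4 gaps
new_window = sum(backoff_gaps(10))   # new default: 9 gaps

print(f"5 attempts : {old_window:.2f}s")
print(f"10 attempts: {new_window:.2f}s")
print(f"largest gap: {max(backoff_gaps(10)):.1f}s")
```

Under these assumptions the old window is about 5.25s and the new one about 178.85s. The largest gap is 89.6s, just below the 90s maxDelay, so the cap never takes effect with 10 attempts.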

3. Make AuthCommandImplTest hermetic

Several collectStatus test cases stubbed checkApiConnection but neither isolated SysEnv nor stubbed createTowerClient. Because collectStatus calls createTowerClient(endpoint, accessToken).getUserInfo() when a token is present, these tests picked up a real TOWER_ACCESS_TOKEN from the developer's environment and made live network calls. One of them targeted https://unreachable.example.com, which (after extending the retry window) hung for ~3 minutes. Isolating the environment with SysEnv.push([:])/pop() in the affected cases drops the class runtime from >6.5 min to ~8s.
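
Why pushing an empty environment makes the tests hermetic can be shown with a small analog. This sketch does not reproduce Nextflow's SysEnv; the `EnvStack` class, the `collect_status` function and the client factory are all hypothetical stand-ins that only mimic the push/pop idea.

```python
import os


class EnvStack:
    """Hypothetical push/pop environment holder (not Nextflow's SysEnv)."""

    def __init__(self):
        self._stack = []  # empty stack means "read the real environment"

    def push(self, env):
        self._stack.append(dict(env))

    def pop(self):
        return self._stack.pop()

    def get(self, name, default=None):
        source = self._stack[-1] if self._stack else os.environ
        return source.get(name, default)


def collect_status(env, make_client):
    """Only build a client (and hit the network) when a token is visible."""
    token = env.get("TOWER_ACCESS_TOKEN")
    if token is None:
        return "no token"
    return make_client(token)


env = EnvStack()
env.push({})  # isolate: the developer's real token is no longer visible
try:
    status = collect_status(env, lambda token: "live network call")
finally:
    env.pop()  # always restore, even if the check above raises
print(status)
```

The try/finally pairing matters: a push without a matching pop would leak the empty environment into later tests.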

Testing

  • TowerClientTest — 30 tests, incl. new cases for HTML→reason extraction (attributes, multiline title, plain-text fallback) and the traceProgress 502 abort path.
  • TowerRetryPolicyTest — asserts the new default and that the computed backoff window is ~3 min.
  • AuthCommandImplTest — 58 tests, now ~8s.

🤖 Generated with Claude Code

pditommaso and others added 3 commits May 29, 2026 17:23
When the Seqera Platform progress/telemetry endpoint returns a non-JSON
error body (e.g. an HTML `502 Bad Gateway` page from a gateway/proxy),
TowerClient previously surfaced the whole HTML payload as the failure
cause. The resulting AbortRunException message was an unreadable wall of
markup, obscuring the real reason.

Reduce HTML error bodies to a concise reason by extracting the `<title>`
(which gateways/proxies use to carry the status reason, e.g. "502 Bad
Gateway"); the match tolerates attributes and collapses internal
whitespace. Non-HTML bodies fall back to the plain text as-is, and JSON
error objects are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>

A transient gateway error (e.g. HTTP 502) on the Tower telemetry endpoint
exhausted the retry policy in ~5 seconds (5 attempts) and aborted the run,
even though such errors are usually short-lived.

Raise the default maxAttempts for the Tower retry policy from 5 to 10.
Combined with the existing delay (350ms), multiplier (2.0) and maxDelay
(90s), the 9 exponential backoff gaps span a retry window of about 3
minutes, allowing transient gateway errors to be ridden out before
failing the run.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>

Several collectStatus test cases stubbed checkApiConnection but did not
isolate SysEnv nor stub createTowerClient. Since collectStatus calls
createTowerClient(endpoint, accessToken).getUserInfo() when an access
token is present, these tests picked up a real TOWER_ACCESS_TOKEN from the
developer environment and made live network calls -- one of them against
https://unreachable.example.com, which (after extending the retry window)
hung for ~3 minutes. The whole test class could exceed 6 minutes.

Isolate the environment with SysEnv.push([:]) / SysEnv.pop() (the pattern
already used elsewhere in this class) in the affected cases, so no test
contacts the network. Runtime drops from >6.5 min to ~8s.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>

@netlify

netlify Bot commented May 29, 2026

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 53af6b0
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/6a19b6e7579467000867fc16

@pditommaso pditommaso requested a review from bentsherman May 29, 2026 15:26

Document and lock the behaviour that a background session abort (e.g. a
Tower telemetry 502) is captured by `workflow.errorReport` and
`workflow.errorMessage` via the `session.error` branch of
`WorkflowMetadata.setErrorAttributes()` when no task fault is present.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>

@pditommaso pditommaso merged commit b0d7e33 into master May 29, 2026
18 of 22 checks passed
@pditommaso pditommaso deleted the fix-tower-502-error-reason branch May 29, 2026 20:02