Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Make plugin metadata prefetch resilient to registry rate limiting [#7176]#7181

Merged
pditommaso merged 5 commits into
masterfrom
fix-plugin-registry-rate-limit
May 28, 2026
Merged

Make plugin metadata prefetch resilient to registry rate limiting [#7176]#7181
pditommaso merged 5 commits into
masterfrom
fix-plugin-registry-rate-limit

Conversation

@pditommaso

@pditommaso pditommaso commented May 27, 2026

Copy link
Copy Markdown
Member

Summary

Fixes #7176 — under burst CI load the plugin registry returns HTTP 429 with Retry-After, but the HTTP client ignored the hint and every attempt landed inside the same rate-limit window, aborting Nextflow startup.

Four layered changes:

  1. Skip registry round-trip when plugins are already cachedPluginUpdater.prefetchMetadata filters out plugins that are pinned to a specific version and already installed locally. When every requested plugin matches, the registry call is skipped entirely. Unpinned specs still go through so latest-release resolution keeps working.

  2. Non-fatal prefetchHttpPluginRepository.prefetch catches PluginRuntimeException, logs a warn, and leaves an empty plugins map. Prefetch is an optimization, not a hard requirement; a sustained rate-limit window or true outage must not abort startup. The lazy getPlugin(id) path still surfaces per-plugin errors when an actually-missing plugin is needed downstream. The plugins field is initialised eagerly so getPlugin and refresh are safe to call when prefetch was skipped or failed.

  3. Honor Retry-After in the HTTP layer — picks up [email protected] + [email protected] (libseqera#72) so HxClient retries on 429/503 honor the server-supplied delay as a lower bound, capped by the configured maxDelay.

  4. Explicit NXF_OFFLINE guardPluginsFacade.start() skips prefetchMetadata entirely when NXF_OFFLINE=true. Currently handled transitively (no PrefetchUpdateRepository is wired in offline mode, so the iteration is a no-op), but an explicit if (!offline) at the call site makes the no-network guarantee unmissable in the code and protects against regressions if a remote PrefetchUpdateRepository is later added to the offline-mode repo list by mistake.

Why the Failsafe bump

[email protected] targets Failsafe ≥3.2.1 because RetryPolicyBuilder.handleIf now requires CheckedPredicate and ExecutionAttemptedEvent renamed getLastFailuregetLastException. Bumping Nextflow's direct Failsafe dependency from 3.1.03.3.2 aligns with the lib and surfaces some pre-existing API drift that is migrated mechanically (no behavior change):

  • event.lastFailureevent.lastException
  • Predicate<? extends Throwable>CheckedPredicate<? extends Throwable> on RetryPolicyBuilder.handleIf

Sites touched: PluginUpdater, PublishDir, GridTaskHandler, K8sClient, BatchClient (nf-google), AzBatchService, AzFileSystem (nf-azure), and the matching AzBatchServiceTest Mock declaration.

End-to-end behavior

  • CI common case (pinned + already cached): zero registry calls.
  • CI cold case (first install of a pinned plugin): registry returns 429 → HxClient waits min(maxDelay=90s, max(backoff, Retry-After)) → retries up to 5x → usually succeeds.
  • Sustained outage / hard rate-limit: retries exhaust → prefetch() degrades non-fatally → startup continues; downstream operations that genuinely need the metadata fail with a specific error.

Test plan

  • :nf-commons:test --tests HttpPluginRepositoryTest --tests PluginUpdaterTest — green
  • :plugins:nf-k8s:test, :plugins:nf-google:test, :plugins:nf-azure:test --tests AzBatchServiceTest --tests AzFileSystemTest — green
  • All affected plugins compile cleanly under Failsafe 3.3.2 (./gradlew compileGroovy on each)
  • New WireMock test 'prefetch degrades gracefully when the registry keeps returning 429' in HttpPluginRepositoryTest
  • CI to verify end-to-end on a real Nextflow invocation

🤖 Generated with Claude Code

]

Under burst CI load the plugin registry returns HTTP 429 with a
Retry-After hint that the client previously ignored, so every attempt
landed inside the same rate-limit window and Nextflow startup aborted.

Two defensive layers around the registry metadata prefetch in
nf-commons, paired with a dependency bump that brings Retry-After
honoring into the HTTP layer:

  1. PluginUpdater.prefetchMetadata filters out plugins that are
     pinned to a specific version and already installed locally. When
     every requested plugin matches, the registry call is skipped
     entirely - no network round-trip on every Nextflow invocation in
     the common CI case (pinned versions, plugins already cached).
     Unpinned specs still go through so latest-release resolution
     keeps working.

  2. HttpPluginRepository.prefetch wraps the fetch in a try/catch on
     PluginRuntimeException, logs a warn, and leaves an empty plugins
     map. Prefetch is an optimization, not a hard requirement: a
     sustained registry rate-limit window or true outage must not
     abort Nextflow startup. The lazy getPlugin(id) path still
     surfaces per-plugin errors when an actually-missing plugin is
     needed downstream. The plugins field is initialised eagerly so
     getPlugin and refresh are safe to call when prefetch was skipped
     or failed.

To get HxClient to honor Retry-After on 429/503 the build picks up
[email protected] and [email protected], which require Failsafe >=3.2.1
(the new RetryPolicyBuilder.handleIf takes CheckedPredicate, and
ExecutionAttemptedEvent renamed getLastFailure to getLastException).
Bumping Failsafe 3.1.0 -> 3.3.2 surfaces a small amount of API drift
in pre-existing call sites that is migrated mechanically:

  - event.lastFailure -> event.lastException
  - Predicate<? extends Throwable> -> CheckedPredicate<? extends Throwable>
    on RetryPolicyBuilder.handleIf

Sites updated: PluginUpdater, PublishDir, GridTaskHandler, K8sClient,
BatchClient (nf-google), AzBatchService, AzFileSystem (nf-azure), plus
the matching AzBatchServiceTest Mock declaration.

Signed-off-by: Paolo Di Tommaso <[email protected]>
@netlify

netlify Bot commented May 27, 2026

Copy link
Copy Markdown

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit e4cc899
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/6a1722b8f9a4ef00082df36f

@pditommaso pditommaso requested review from bentsherman and jorgee May 27, 2026 12:49
Defence-in-depth guard at PluginsFacade.start(): when NXF_OFFLINE=true
the registry call must never happen. The current code achieves this
transitively (PluginUpdater wraps repos with LocalUpdateRepository in
offline mode, so prefetchMetadata's PrefetchUpdateRepository iteration
is a no-op), but an explicit if(!offline) at the call site makes the
no-network guarantee unmissable in the code and protects against
regressions if a remote PrefetchUpdateRepository is later wired in
offline mode by mistake.

Signed-off-by: Paolo Di Tommaso <[email protected]>
The Failsafe 3.1.0 -> 3.3.2 bump exposed a runtime NoSuchMethodError on
nf-google and nf-seqera: both transitively pull older lib-httpx versions
(2.1.1 via lib-cloudinfo for nf-google, 2.2.0 via sched-client for
nf-seqera), which in turn drag failsafe-3.1.0 into each plugin's bundled
runtime libs.

At runtime the plugin classes - compiled against the new Failsafe and
referencing handleIf(CheckedPredicate) - resolve against the plugin's
own classpath first (PF4J classloader), find failsafe-3.1.0 there, and
throw java.lang.NoSuchMethodError because that overload only exists
since Failsafe 3.2.1.

Declaring [email protected] explicitly in each affected plugin forces
Gradle to use the newer version; that in turn pulls in [email protected]
and [email protected]. Verified by inspecting build/target/libs and
re-running each plugin's test suite.

Signed-off-by: Paolo Di Tommaso <[email protected]>
Comment thread modules/nf-commons/src/main/nextflow/plugin/HttpPluginRepository.groovy Outdated
Comment thread modules/nf-commons/src/main/nextflow/plugin/PluginsFacade.groovy
Address review feedback: use an early-return guard instead of nesting
the try/catch inside the (plugins not empty) check. In Groovy an empty
or null list is falsy, so `!plugins` covers both cases.

Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso pditommaso force-pushed the fix-plugin-registry-rate-limit branch from b8d1407 to ebe7286 Compare May 27, 2026 15:03
The skip-if-installed optimization in `prefetchMetadata` relied on
`pluginManager.getPlugin()`, but at prefetch time the local plugins have
not been loaded into the manager yet (its per-run root is still empty),
so the check always returned false and the `/v1/plugins/dependencies`
round-trip happened on every startup even when the plugin was cached.

Check the on-disk plugin store instead, mirroring `installPlugin()` which
reuses the cached copy whenever the `<id>-<version>` store directory
exists (it only downloads when missing). Its presence is the authoritative
signal that no remote metadata is required to start the plugin.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: jorgee <[email protected]>
@jorgee

jorgee commented May 27, 2026

Copy link
Copy Markdown
Contributor

Pushed e4cc899d4 — fixes a gap in the skip-if-installed optimization.

The problem

prefetchMetadata filtered out already-installed plugins via isAlreadyInstalled, which checked pluginManager.getPlugin(ref.id). But at prefetch time (PluginsFacade.start()prefetchMetadata, before prepareAndStart) the local plugins have not been loaded into the manager yet — its per-run plugins root (.nextflow/plr/<uuid>/) is still empty, and pf4j logs No plugins. So getPlugin() always returned null, isAlreadyInstalled always returned false, and the /v1/plugins/dependencies round-trip fired on every startup even when the plugin was cached — exactly the condition that triggers the 429 storm under parallel CI load.

Verified empirically before the fix: a run (and plugin install) of an already-installed [email protected] still made 1 registry call (and reused the cached zip without downloading).

The change

isAlreadyInstalled now checks the on-disk plugin store instead of the runtime manager, mirroring installPlugin() which reuses the cached copy whenever the <id>-<version> store directory exists (it only downloads when missing — the classes/META-INF/MANIFEST.MF check there is a corruption warning, not a gate):

return pluginsStore != null && FilesEx.exists(pluginsStore.resolve("${ref.id}-${ref.version}"))

Directory presence is the authoritative "no remote metadata needed" signal and works for both real installs and the test fixtures. (privateprotected for direct unit testing.)

New unit test

PluginUpdaterTest › 'should detect already-installed pinned plugin from the on-disk store' — confirms getPlugin() is null at this point (the trap the old code fell into), that a pinned + cached plugin is detected, that a different version is not, and that an unpinned spec still needs metadata. Verified red against the old code, green after the fix.

Validation

  • :nf-commons:test --tests nextflow.plugin.* — green (incl. HttpPluginRepositoryTest, PluginUpdaterTest).
  • End-to-end with a make pack binary (NXF_PLUGINS_DIR per process):
    • run / plugin install of cached [email protected]skipping registry metadata prefetch, 0 registry calls, exit 0.
    • cold install (empty dir) → 1 registry call, plugin installed, exit 0 (normal path intact).
    • 100 parallel jobs, plugin cached (the CI scenario)100/100 success, 0 total registry calls (before this fix all 100 would hit the registry).

@jorgee jorgee left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine with me.

@pditommaso pditommaso merged commit 1c47215 into master May 28, 2026
27 checks passed
@pditommaso pditommaso deleted the fix-plugin-registry-rate-limit branch May 28, 2026 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HttpPluginRepository.prefetch() crashes startup on HTTP 429 — registry.nextflow.io/api/v1/plugins/dependencies called unconditionally since 26.04.2

3 participants