Make plugin metadata prefetch resilient to registry rate limiting [#7176] #7181
Under burst CI load the plugin registry returns HTTP 429 with a Retry-After hint that the client previously ignored, so every attempt landed inside the same rate-limit window and Nextflow startup aborted.

Two defensive layers around the registry metadata prefetch in nf-commons, paired with a dependency bump that brings Retry-After honoring into the HTTP layer:

1. PluginUpdater.prefetchMetadata filters out plugins that are pinned to a specific version and already installed locally. When every requested plugin matches, the registry call is skipped entirely - no network round-trip on every Nextflow invocation in the common CI case (pinned versions, plugins already cached). Unpinned specs still go through so latest-release resolution keeps working.

2. HttpPluginRepository.prefetch wraps the fetch in a try/catch on PluginRuntimeException, logs a warn, and leaves an empty plugins map. Prefetch is an optimization, not a hard requirement: a sustained registry rate-limit window or true outage must not abort Nextflow startup. The lazy getPlugin(id) path still surfaces per-plugin errors when an actually-missing plugin is needed downstream. The plugins field is initialised eagerly so getPlugin and refresh are safe to call when prefetch was skipped or failed.

To get HxClient to honor Retry-After on 429/503 the build picks up [email protected] and [email protected], which require Failsafe >=3.2.1 (the new RetryPolicyBuilder.handleIf takes CheckedPredicate, and ExecutionAttemptedEvent renamed getLastFailure to getLastException). Bumping Failsafe 3.1.0 -> 3.3.2 surfaces a small amount of API drift in pre-existing call sites that is migrated mechanically:

- event.lastFailure -> event.lastException
- Predicate<? extends Throwable> -> CheckedPredicate<? extends Throwable> on RetryPolicyBuilder.handleIf

Sites updated: PluginUpdater, PublishDir, GridTaskHandler, K8sClient, BatchClient (nf-google), AzBatchService, AzFileSystem (nf-azure), plus the matching AzBatchServiceTest Mock declaration.

Signed-off-by: Paolo Di Tommaso <[email protected]>
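The filtering rule in point 1 can be sketched roughly as follows. This is a minimal illustration, not the actual PluginUpdater code: `PluginRef`, `PrefetchFilter` and `needingMetadata` are hypothetical names, a null version is assumed to mean "unpinned", and how "installed locally" is decided is left as a pluggable predicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a plugin spec; a null version means "unpinned".
final class PluginRef {
    final String id;
    final String version;

    PluginRef(String id, String version) {
        this.id = id;
        this.version = version;
    }

    boolean isPinned() {
        return version != null;
    }
}

final class PrefetchFilter {
    private PrefetchFilter() {}

    // Returns the specs that still need registry metadata.
    // A spec is dropped only when it is pinned AND already installed locally;
    // unpinned specs always stay so latest-release resolution keeps working.
    static List<PluginRef> needingMetadata(List<PluginRef> requested,
                                           Predicate<PluginRef> installedLocally) {
        List<PluginRef> remaining = new ArrayList<>();
        for (PluginRef ref : requested) {
            boolean canSkip = ref.isPinned() && installedLocally.test(ref);
            if (!canSkip) {
                remaining.add(ref);
            }
        }
        return remaining;
    }
}
```

When the returned list is empty the caller can skip the registry call altogether, which is the common CI case described above.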
Defence-in-depth guard at PluginsFacade.start(): when NXF_OFFLINE=true the registry call must never happen. The current code achieves this transitively (PluginUpdater wraps repos with LocalUpdateRepository in offline mode, so prefetchMetadata's PrefetchUpdateRepository iteration is a no-op), but an explicit if(!offline) at the call site makes the no-network guarantee unmissable in the code and protects against regressions if a remote PrefetchUpdateRepository is later wired in offline mode by mistake. Signed-off-by: Paolo Di Tommaso <[email protected]>
The Failsafe 3.1.0 -> 3.3.2 bump exposed a runtime NoSuchMethodError on nf-google and nf-seqera: both transitively pull older lib-httpx versions (2.1.1 via lib-cloudinfo for nf-google, 2.2.0 via sched-client for nf-seqera), which in turn drag failsafe-3.1.0 into each plugin's bundled runtime libs. At runtime the plugin classes - compiled against the new Failsafe and referencing handleIf(CheckedPredicate) - resolve against the plugin's own classpath first (PF4J classloader), find failsafe-3.1.0 there, and throw java.lang.NoSuchMethodError because that overload only exists since Failsafe 3.2.1. Declaring [email protected] explicitly in each affected plugin forces Gradle to use the newer version; that in turn pulls in [email protected] and [email protected]. Verified by inspecting build/target/libs and re-running each plugin's test suite. Signed-off-by: Paolo Di Tommaso <[email protected]>
Address review feedback: use an early-return guard instead of nesting the try/catch inside the (plugins not empty) check. In Groovy an empty or null list is falsy, so `!plugins` covers both cases. Signed-off-by: Paolo Di Tommaso <[email protected]>
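A rough Java sketch of the early-return shape combined with the non-fatal catch, for illustration only. `PrefetchSketch`, `RegistryException` (standing in for PluginRuntimeException) and the function-typed `registry` are hypothetical; Java has no Groovy truthiness, so the `!plugins` test is spelled out as a null-or-empty check.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for PluginRuntimeException.
final class RegistryException extends RuntimeException {
    RegistryException(String message) {
        super(message);
    }
}

final class PrefetchSketch {
    // Initialised eagerly so lookups are safe when prefetch was skipped or failed.
    private Map<String, String> plugins = new HashMap<>();
    // Hypothetical registry call: plugin ids in, metadata by id out.
    private final Function<List<String>, Map<String, String>> registry;

    PrefetchSketch(Function<List<String>, Map<String, String>> registry) {
        this.registry = registry;
    }

    void prefetch(List<String> ids) {
        // Early return: explicit equivalent of Groovy's `!plugins`.
        if (ids == null || ids.isEmpty()) {
            return;
        }
        try {
            plugins = new HashMap<>(registry.apply(ids));
        } catch (RegistryException e) {
            // Prefetch is an optimization: warn and continue with an empty map.
            System.err.println("WARN: plugin metadata prefetch failed: " + e.getMessage());
            plugins = new HashMap<>();
        }
    }

    // Returns the cached metadata, or null when nothing was prefetched for this id.
    String getPlugin(String id) {
        return plugins.get(id);
    }
}
```

The guard keeps the try/catch at a single indentation level, and a failing registry leaves the object in the same state as a skipped prefetch.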
The skip-if-installed optimization in `prefetchMetadata` relied on `pluginManager.getPlugin()`, but at prefetch time the local plugins have not been loaded into the manager yet (its per-run root is still empty), so the check always returned false and the `/v1/plugins/dependencies` round-trip happened on every startup even when the plugin was cached. Check the on-disk plugin store instead, mirroring `installPlugin()` which reuses the cached copy whenever the `<id>-<version>` store directory exists (it only downloads when missing). Its presence is the authoritative signal that no remote metadata is required to start the plugin. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> Signed-off-by: jorgee <[email protected]>
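As a plain-Java illustration of the on-disk check, assuming a store layout of one `<id>-<version>` directory per installed plugin. `PluginStoreCheck` and `isCached` are hypothetical names, and `java.nio.file.Files.exists` stands in for the existence check used in the actual change.

```java
import java.nio.file.Files;
import java.nio.file.Path;

final class PluginStoreCheck {
    private PluginStoreCheck() {}

    // True when the store holds an entry named "<id>-<version>".
    // A missing store, id or version means "not cached", so the caller
    // falls back to asking the registry.
    static boolean isCached(Path pluginsStore, String id, String version) {
        if (pluginsStore == null || id == null || version == null) {
            return false;
        }
        return Files.exists(pluginsStore.resolve(id + "-" + version));
    }
}
```

Because the check only touches the filesystem, it gives the same answer whether or not the plugin manager has loaded anything yet.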
Pushed

The problem

Verified empirically before the fix: a

The change

`return pluginsStore != null && FilesEx.exists(pluginsStore.resolve("${ref.id}-${ref.version}"))`

Directory presence is the authoritative "no remote metadata needed" signal and works for both real installs and the test fixtures. (

New unit test

Validation
Summary
Fixes #7176: under burst CI load the plugin registry returns HTTP 429 with `Retry-After`, but the HTTP client ignored the hint and every attempt landed inside the same rate-limit window, aborting Nextflow startup.

Four layered changes:

1. Skip registry round-trip when plugins are already cached: `PluginUpdater.prefetchMetadata` filters out plugins that are pinned to a specific version and already installed locally. When every requested plugin matches, the registry call is skipped entirely. Unpinned specs still go through so latest-release resolution keeps working.
2. Non-fatal prefetch: `HttpPluginRepository.prefetch` catches `PluginRuntimeException`, logs a `warn`, and leaves an empty plugins map. Prefetch is an optimization, not a hard requirement; a sustained rate-limit window or true outage must not abort startup. The lazy `getPlugin(id)` path still surfaces per-plugin errors when an actually-missing plugin is needed downstream. The `plugins` field is initialised eagerly so `getPlugin` and `refresh` are safe to call when prefetch was skipped or failed.
3. Honor `Retry-After` in the HTTP layer: picks up [email protected] + [email protected] (libseqera#72) so `HxClient` retries on 429/503 honor the server-supplied delay as a lower bound, capped by the configured `maxDelay`.
4. Explicit `NXF_OFFLINE` guard: `PluginsFacade.start()` skips `prefetchMetadata` entirely when `NXF_OFFLINE=true`. Currently handled transitively (no `PrefetchUpdateRepository` is wired in offline mode, so the iteration is a no-op), but an explicit `if (!offline)` at the call site makes the no-network guarantee unmissable in the code and protects against regressions if a remote `PrefetchUpdateRepository` is later added to the offline-mode repo list by mistake.

Why the Failsafe bump

[email protected] targets Failsafe ≥3.2.1 because `RetryPolicyBuilder.handleIf` now requires `CheckedPredicate` and `ExecutionAttemptedEvent` renamed `getLastFailure` → `getLastException`. Bumping Nextflow's direct Failsafe dependency from `3.1.0` → `3.3.2` aligns with the lib and surfaces some pre-existing API drift that is migrated mechanically (no behavior change):

- `event.lastFailure` → `event.lastException`
- `Predicate<? extends Throwable>` → `CheckedPredicate<? extends Throwable>` on `RetryPolicyBuilder.handleIf`

Sites touched: `PluginUpdater`, `PublishDir`, `GridTaskHandler`, `K8sClient`, `BatchClient` (nf-google), `AzBatchService`, `AzFileSystem` (nf-azure), and the matching `AzBatchServiceTest` `Mock` declaration.

End-to-end behavior

- `HxClient` waits `min(maxDelay=90s, max(backoff, Retry-After))` → retries up to 5x → usually succeeds.
- `prefetch()` degrades non-fatally → startup continues; downstream operations that genuinely need the metadata fail with a specific error.

Test plan

- `:nf-commons:test --tests HttpPluginRepositoryTest --tests PluginUpdaterTest`: green
- `:plugins:nf-k8s:test`, `:plugins:nf-google:test`, `:plugins:nf-azure:test --tests AzBatchServiceTest --tests AzFileSystemTest`: green
- `./gradlew compileGroovy` on each)
- `'prefetch degrades gracefully when the registry keeps returning 429'` in `HttpPluginRepositoryTest`

🤖 Generated with Claude Code
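For reference, the delay rule quoted under End-to-end behavior, `min(maxDelay, max(backoff, Retry-After))`, works out as in the sketch below. This is a standalone illustration of the arithmetic only, not the HxClient implementation: `RetryDelay` and `effectiveDelay` are hypothetical names, and a null `retryAfter` is assumed to mean the server sent no hint.

```java
import java.time.Duration;

final class RetryDelay {
    private RetryDelay() {}

    // Computes min(maxDelay, max(backoff, retryAfter)).
    // The server hint acts as a lower bound on the wait and maxDelay as the cap.
    static Duration effectiveDelay(Duration backoff, Duration retryAfter, Duration maxDelay) {
        Duration lowerBounded = backoff;
        if (retryAfter != null && retryAfter.compareTo(backoff) > 0) {
            lowerBounded = retryAfter;
        }
        return lowerBounded.compareTo(maxDelay) > 0 ? maxDelay : lowerBounded;
    }
}
```

With a 90s cap, a 2s backoff and `Retry-After: 30` give a 30s wait, while `Retry-After: 120` is clamped to 90s. Without a hint the plain backoff applies.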