refactor: unify resilience controls #1449
Code Review
This pull request refactors the resilience configuration system by centralizing settings into a new ResilienceSettings structure, removing legacy model-level availability tracking, and replacing the per-target circuit breaker logic with a global provider-level breaker. While the changes improve resilience management, the new implementation introduces tight coupling by allowing the open-sse workspace package to import directly from the host application's internal modules (@/lib/db/readCache and @/lib/localDb). This violates the intended isolation of the workspace package. Additionally, the NumberField component in the resilience settings UI prevents users from clearing input fields, which should be addressed to improve usability.
| export async function getRuntimeProviderProfile(provider: string | null | undefined) { | ||
| const fallback = getProviderProfile(provider); | ||
| try { | ||
| const { getCachedSettings } = await import("@/lib/db/readCache"); |
The workspace package @omniroute/open-sse is using a host-application alias (@/lib/db/readCache) in its imports. Since this directory is intended to be a standalone workspace package (as noted in ARCHITECTURE.md), it should not depend on the internal modules of the application that consumes it. This creates a circular dependency and prevents the package from being used independently or published correctly. Consider refactoring this to inject the settings dependency or use a shared interface.
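One way to read the "inject the settings dependency" suggestion is sketched below. Everything here is hypothetical and for illustration only: the names SettingsLoader, configureSettingsLoader, and loadProviderProfile, the reduced ProviderProfile shape, and the 1000 ms fallback are assumptions, not the package's actual API. The point is only that the package can depend on a small interface that the host fills in at startup, instead of importing a host alias.

```typescript
// Hypothetical, reduced profile shape: only the field this sketch needs.
interface ProviderProfile {
  baseCooldownMs: number;
}

// The package depends on this function type, not on "@/lib/db/readCache".
type SettingsLoader = () => Promise<{ baseCooldownMs?: number } | null>;

let settingsLoader: SettingsLoader | null = null;

// Called once by the host application at startup (hypothetical hook).
function configureSettingsLoader(loader: SettingsLoader): void {
  settingsLoader = loader;
}

// Assumed default used when no loader is registered or loading fails.
const FALLBACK_PROFILE: ProviderProfile = { baseCooldownMs: 1000 };

async function loadProviderProfile(): Promise<ProviderProfile> {
  if (!settingsLoader) return FALLBACK_PROFILE;
  try {
    const settings = await settingsLoader();
    return {
      baseCooldownMs: settings?.baseCooldownMs ?? FALLBACK_PROFILE.baseCooldownMs,
    };
  } catch {
    // A failing host loader must not break the package: use the fallback.
    return FALLBACK_PROFILE;
  }
}
```

With this shape the host would register its own cached-settings reader once, and the package would never need to resolve an application alias.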
| try { | ||
| const { getProviderConnections } = await import("@/lib/localDb"); | ||
| const connections = await getProviderConnections(); | ||
| const { getProviderConnections, getSettings } = await import("@/lib/localDb"); |
This workspace package is importing directly from the application layer (@/lib/localDb). This tight coupling violates package boundaries and makes the @omniroute/open-sse package non-portable. Dependencies on the host application's database or configuration should be injected at runtime rather than being hardcoded via application-level aliases.
| if (event.target.value === "") return; | ||
| const nextValue = Number(event.target.value); |
The NumberField component returns early when the input is an empty string, which prevents users from clearing the field to type a new value (the controlled input will immediately revert to its previous state). Removing this check allows the state to be updated correctly when the field is cleared, improving the user experience during configuration tuning.
| if (event.target.value === "") return; | |
| const nextValue = Number(event.target.value); | |
| const nextValue = Number(event.target.value); |
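A small sketch of one way to let the field be cleared without committing a bogus number. It assumes the component keeps the raw text as draft state and only commits parsed values; parseNumberDraft and NumberDraft are hypothetical names, not part of the PR. Note that Number("") evaluates to 0, so simply dropping the early return would commit 0 whenever the field is emptied.

```typescript
// Result of interpreting the raw text of a numeric input.
type NumberDraft =
  | { kind: "empty" } // user cleared the field; keep it empty on screen
  | { kind: "invalid" } // not a finite number; show the text, commit nothing
  | { kind: "value"; value: number }; // safe to commit to settings state

function parseNumberDraft(raw: string): NumberDraft {
  if (raw.trim() === "") return { kind: "empty" };
  const value = Number(raw);
  return Number.isFinite(value) ? { kind: "value", value } : { kind: "invalid" };
}
```

In a controlled input, the change handler would always store the raw string for display and call the settings updater only when kind is "value".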
Pull request overview
Refactors OmniRoute’s resilience model to a unified, plan-aligned stack (request queue → connection cooldown → provider breaker → wait-for-cooldown), removing legacy global model availability/quarantine surfaces and aligning runtime, dashboard, API, MCP, and docs for the 3.7.0 release.
Changes:
- Introduces ResilienceSettings with normalization/legacy-compat wiring and updates /api/resilience + dashboard/MCP presets to the new shape.
- Removes legacy model availability/quarantine domain + API/UI, shifting runtime health to provider breaker + connection cooldowns.
- Updates combo routing and chat pipeline behavior/tests to respect provider breaker responses and the unified cooldown-aware retry semantics.
Reviewed changes
Copilot reviewed 68 out of 68 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/unit/thundering-herd.test.ts | Updates backoff ceiling expectations to use BACKOFF_CONFIG clamping. |
| tests/unit/sse-auth.test.ts | Aligns auth/cooldown expectations with unified cooldown constants and terminal statuses. |
| tests/unit/rate-limit-manager.test.ts | Updates assertions to reflect new limiter state exposure vs model lockout behavior. |
| tests/unit/error-classification.test.ts | Expands provider profile assertions and updates backoff/upstream-hint semantics. |
| tests/unit/domain-branch-hardening.test.ts | Removes modelAvailability reset/coverage now that the domain module is removed. |
| tests/unit/combo-routing-engine.test.ts | Updates combo tests to skip provider-breaker-open responses and renames execution key helper. |
| tests/unit/combo-context-relay.test.ts | Updates context-relay behavior to skip provider-breaker-open responses instead of per-target breaker state. |
| tests/unit/combo-config.test.ts | Removes legacy combo resilience fields (timeoutMs, healthcheck) from defaults/config resolution. |
| tests/unit/combo-circuit-breaker.test.ts | Removes legacy per-target combo circuit-breaker integration tests. |
| tests/unit/chat-route-coverage.test.ts | Shifts 503 coverage from model cooldown to connection cooldown and adds provider breaker response headers/code assertions. |
| tests/unit/chat-helpers.test.ts | Updates pipeline gate checks to provider breaker response and simplifies breaker execution path. |
| tests/unit/chat-combo-live-test.test.ts | Updates “live test” semantics to bypass connection cooldown and sets breaker timing fields explicitly. |
| tests/unit/batch-a-domain.test.ts | Removes model availability domain tests from Batch A suite. |
| tests/unit/autocombo-unification.test.ts | Updates intelligent routing helper test to config-only scoring (no health extraction). |
| tests/unit/auth-terminal-status.test.ts | Adds coverage for terminal/non-terminal auth classifications without adding cooldown. |
| tests/unit/account-fallback-service.test.ts | Updates profile expectations and removes provider-cooldown mutation behavior assertions. |
| tests/integration/security-hardening.test.ts | Removes legacy /api/models/availability route from endpoint validation checks. |
| tests/integration/proxy-pipeline.test.ts | Updates pipeline wiring assertions to new breaker response helper and credential preflight usage. |
| tests/integration/integration-wiring.test.ts | Removes /api/models/availability existence/method checks; updates wiring expectations. |
| tests/integration/chatcore-compression-integration.test.ts | Removes modelAvailability resets from integration harness reset logic. |
| tests/integration/chat-pipeline.test.ts | Removes global model-unavailable integration test and associated resets. |
| tests/integration/_chatPipelineHarness.ts | Removes modelAvailability helpers from harness surface and reset flow. |
| tests/e2e/resilience-plan-alignment.spec.ts | Adds Playwright coverage ensuring UI surfaces align to the new resilience plan and stop calling legacy endpoints. |
| src/types/settings.ts | Adds resilienceSettings?: ResilienceSettings and removes legacy combo defaults fields. |
| src/sse/services/cooldownAwareRetry.ts | Reads wait-for-cooldown behavior from resolveResilienceSettings and renames settings fields. |
| src/sse/services/auth.ts | Refactors markAccountUnavailable to unify cooldown decisions, terminal statuses, and provider error classification. |
| src/sse/handlers/chatHelpers.ts | Uses providerCircuitOpenResponse for breaker gating; removes model availability gating and breaker.execute usage. |
| src/sse/handlers/chat.ts | Removes model availability quarantine integration and updates provider breaker handling + wait-for-cooldown semantics. |
| src/shared/validation/schemas.ts | Adds plan-aligned resilience schemas while keeping legacy profile/default schemas for compatibility. |
| src/shared/utils/circuitBreaker.ts | Extends breaker status payload with retryAfterMs. |
| src/lib/resilience/settings.ts | New module defining defaults, normalization, merge, and legacy compatibility mapping for resilience settings. |
| src/lib/monitoring/observability.ts | Normalizes breaker status and exposes providerBreakers array + retryAfterMs in health payload. |
| src/lib/dataPaths.js | Removes legacy compiled JS dataPaths artifact. |
| src/lib/combos/intelligentRouting.ts | Removes health extraction; keeps provider scoring based on config only. |
| src/domain/modelAvailability.ts | Removes legacy global model availability/quarantine domain module. |
| src/app/api/settings/combo-defaults/route.ts | Sanitizes legacy combo resilience keys out of persisted defaults/overrides. |
| src/app/api/resilience/route.ts | Implements plan-aligned GET/PATCH, legacy patch normalization, persistence, and runtime sync. |
| src/app/api/resilience/reset/route.ts | Updates endpoint semantics/docs to “reset provider circuit breakers” only. |
| src/app/api/policies/route.ts | Removes circuit breaker states from policies API payload (locked identifiers only). |
| src/app/api/models/availability/route.ts | Removes legacy model availability API route. |
| src/app/(dashboard)/layout.tsx | Removes ModelStatusProvider wrapper now that model availability UI is removed. |
| src/app/(dashboard)/dashboard/settings/components/PoliciesPanel.tsx | Removes circuit breaker UI from Policies panel; focuses on locked identifiers. |
| src/app/(dashboard)/dashboard/settings/components/ComboDefaultsTab.tsx | Removes legacy combo resilience fields and sanitizes legacy keys on load/save. |
| src/app/(dashboard)/dashboard/providers/page.tsx | Removes ModelAvailabilityBadge from providers page header. |
| src/app/(dashboard)/dashboard/providers/components/ModelStatusContext.tsx | Removes legacy shared polling context for model availability. |
| src/app/(dashboard)/dashboard/providers/components/ModelStatusBadge.tsx | Removes per-model status badge component. |
| src/app/(dashboard)/dashboard/providers/components/ModelAvailabilityPanel.tsx | Removes legacy model availability panel. |
| src/app/(dashboard)/dashboard/providers/components/ModelAvailabilityBadge.tsx | Removes legacy model availability badge/popup. |
| src/app/(dashboard)/dashboard/providers/[id]/page.tsx | Removes per-model status badge in provider model rows. |
| src/app/(dashboard)/dashboard/health/page.tsx | Displays breaker retryAfterMs in provider health UI. |
| src/app/(dashboard)/dashboard/endpoint/components/MCPDashboard.tsx | Updates resilience presets to new plan-aligned structure. |
| src/app/(dashboard)/dashboard/combos/page.tsx | Sanitizes legacy combo config keys and removes legacy advanced fields from templates/UI. |
| src/app/(dashboard)/dashboard/combos/IntelligentComboPanel.tsx | Removes health polling/exclusions UI; makes panel explicitly config-only. |
| open-sse/utils/error.ts | Adds structured providerCircuitOpenResponse (503 + code + headers). |
| open-sse/services/rateLimitManager.ts | Integrates request queue settings from ResilienceSettings and applies them across Bottleneck limiters. |
| open-sse/services/comboConfig.ts | Filters legacy combo resilience keys during config cascade resolution. |
| open-sse/services/combo.ts | Removes per-target breaker logic; skips targets when provider breaker open responses are returned. |
| open-sse/services/accountFallback.ts | Reworks provider profile derivation to use ResilienceSettings and maps legacy “provider cooldown” helpers to shared breaker. |
| open-sse/services/AGENTS.md | Updates architecture notes to reflect provider-breaker/global model. |
| open-sse/mcp-server/tools/advancedTools.ts | Updates MCP resilience profile payloads to new structure. |
| open-sse/handlers/chatCore.ts | Removes legacy per-model quota lockout handling in favor of shared auth/fallback path. |
| docs/openapi.yaml | Removes /api/models/availability and updates /api/resilience to GET/PATCH config semantics. |
| docs/USER_GUIDE.md | Updates resilience documentation to the new multi-layer model and surface responsibilities. |
| docs/ARCHITECTURE.md | Updates architecture references to remove model availability and reflect new resilience surfaces. |
| docs/API_REFERENCE.md | Removes model availability API docs and updates resilience section wording. |
| README.md | Updates resilience feature description to match the new plan-aligned model. |
| settings: null, | ||
| relayOptions: null as any, | ||
| allCombos: null, | ||
| relayOptions: null, | ||
| }); |
Object literal passed to handleComboChat defines relayOptions twice. In TypeScript this is a compile error (duplicate property name) and the first value is silently overwritten at runtime. Remove the duplicate and keep a single relayOptions entry (and do the same for other call sites in this file).
Great refactor! This unified resilience model is much cleaner than the fragmented approach we had before.
I have a suggestion: Consider removing 429 from PROVIDER_FAILURE_ERROR_CODES. Rate limiting (429) is expected behavior that's already handled at model-level and account-level cooldowns. Including it in the provider-wide circuit breaker causes premature cooldown of the entire provider, reducing availability for legitimate traffic that could be routed to other accounts or models.
const PROVIDER_FAILURE_ERROR_CODES = new Set([408, 500, 502, 503, 504]);
I had a related PR (#1442) that addressed this and added configurable thresholds. Since your refactor already covers the threshold configuration via providerBreaker.failureThreshold, I'm happy to close mine and let this proceed. The only remaining item would be the 429 removal and a small test isolation fix in settings-api.test.ts that I can submit separately after this merges.
Thanks for the comprehensive cleanup!
@clousky2020 Good catch. I adopted this in the latest update.
I also updated the Resilience settings copy and the related tests/docs so the UI, runtime behavior, and validation all match this rule. Thanks for calling it out.
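The rule agreed on in this exchange can be sketched as a small classifier. The set literal is the one proposed in the comment above; the FailureScope type and classifyFailure function are hypothetical names used only to show how 429 ends up scoped to a connection rather than to the whole provider.

```typescript
// Status codes that count toward the provider-wide breaker (from the review comment).
const PROVIDER_FAILURE_ERROR_CODES = new Set([408, 500, 502, 503, 504]);

// Hypothetical label for where a failure is accounted.
type FailureScope = "provider-breaker" | "connection-cooldown" | "other";

function classifyFailure(status: number): FailureScope {
  if (PROVIDER_FAILURE_ERROR_CODES.has(status)) return "provider-breaker";
  // Rate limits cool down one connection and never trip the provider breaker.
  if (status === 429) return "connection-cooldown";
  // Anything else (for example auth errors) is assumed to be handled elsewhere.
  return "other";
}
```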
Summary
This PR refactors OmniRoute's entire circuit-breaker and cooldown stack for release/v3.7.0.
Before this work, resilience behavior was spread across too many overlapping layers: connection cooldowns, provider breakers, model-availability state, combo-specific health state, and dashboard/API-specific views of runtime health. The result was hard to reason about and hard to trust: the same failure could be interpreted differently depending on where it was handled, and different surfaces could drift away from the actual runtime behavior.
The goal of this refactor is to give OmniRoute one clear resilience model with clear responsibilities at each layer, so the runtime, dashboard, MCP, API, and docs all describe the same system.
What the new system is for
The rebuilt resilience system is split into four layers. Each layer exists to answer a different operational question.
1. requestQueue
What this layer is for
This layer protects a single connection/account from being hit by overlapping local requests when that connection should process work sequentially.
What it means operationally
2. connectionCooldown
What this layer is for
This layer decides when one specific connection/account should temporarily step out of rotation after a retryable failure.
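As a rough illustration of how such a cooldown duration might be derived, the sketch below uses the three connectionCooldown settings named in this PR (baseCooldownMs, useUpstreamRetryHints, maxBackoffSteps). The doubling rule, the precedence given to the upstream hint, and the computeCooldownMs name are assumptions; the actual runtime logic may differ.

```typescript
// Field names follow the PR description; the semantics below are assumed.
interface ConnectionCooldownSettings {
  baseCooldownMs: number;
  useUpstreamRetryHints: boolean;
  maxBackoffSteps: number;
}

function computeCooldownMs(
  settings: ConnectionCooldownSettings,
  consecutiveFailures: number,
  upstreamRetryAfterMs?: number,
): number {
  // Assumption: a usable upstream hint (e.g. from Retry-After) wins when enabled.
  if (
    settings.useUpstreamRetryHints &&
    upstreamRetryAfterMs !== undefined &&
    upstreamRetryAfterMs > 0
  ) {
    return upstreamRetryAfterMs;
  }
  // Bounded exponential backoff: double per consecutive failure, capped at maxBackoffSteps.
  const step = Math.min(Math.max(consecutiveFailures - 1, 0), settings.maxBackoffSteps);
  return settings.baseCooldownMs * 2 ** step;
}
```

Capping the exponent rather than the result keeps the ceiling proportional to baseCooldownMs, so tuning one setting scales the whole curve.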
What it means operationally
- If a connection is rate limited (429) or temporarily unstable, OmniRoute cools down that connection instead of repeatedly hammering it
- It uses baseCooldownMs, upstream retry hints, and bounded backoff to decide how long that connection should rest
- 429 rate limits stay here and do not trip the provider breaker
3. providerBreaker
What this layer is for
This layer decides when a provider as a whole is unhealthy enough that OmniRoute should stop routing traffic into it for a while.
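A minimal breaker sketch in the spirit of this layer. failureThreshold and retryAfterMs are names that appear in this PR (providerBreaker.failureThreshold, the breaker status payload); resetTimeoutMs, the class name, and the state handling are assumptions and much simpler than a production breaker.

```typescript
class ProviderBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number; // hypothetical setting name

  constructor(failureThreshold: number, resetTimeoutMs: number) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
  }

  // Record one provider-wide failure observed at time `now` (ms).
  recordFailure(now: number): void {
    this.failures += 1;
    // At or past the threshold, (re)open the breaker starting from this failure.
    if (this.failures >= this.failureThreshold) this.openedAt = now;
  }

  // Any success closes the breaker and clears the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // Milliseconds until traffic may be tried again; 0 means not open.
  retryAfterMs(now: number): number {
    if (this.openedAt === null) return 0;
    return Math.max(this.openedAt + this.resetTimeoutMs - now, 0);
  }

  isOpen(now: number): boolean {
    return this.retryAfterMs(now) > 0;
  }
}
```

Passing the clock in as `now` keeps the sketch deterministic; a real implementation would more likely read the time itself.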
What it means operationally
- It only counts provider-wide final transient failures (408, 500, 502, 503, 504)
- 429 rate limits do not contribute to the provider-wide breaker
4. waitForCooldown
What this layer is for
This layer decides what OmniRoute should do when every eligible connection is temporarily unavailable only because of cooldowns.
What it means operationally
What combo routing is responsible for
Combo routing is now treated as a routing layer, not as a second resilience system.
That distinction is important.
A combo should answer questions like:
A combo should not maintain its own parallel understanding of provider health, cooldown state, or breaker state.
What this means operationally
In other words: combo decides where to try next; the resilience system decides whether that target is currently eligible.
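That division of labor can be sketched as follows. ComboTarget, pickNextTarget, and the isEligible callback are hypothetical names; the idea is only that the combo walks its ordered targets and asks the resilience system about eligibility, without keeping health state of its own.

```typescript
// Hypothetical target shape for illustration.
interface ComboTarget {
  provider: string;
  model: string;
}

// The combo owns ordering; eligibility is answered by the resilience layer.
function pickNextTarget(
  targets: ComboTarget[],
  isEligible: (target: ComboTarget) => boolean,
): ComboTarget | null {
  for (const target of targets) {
    if (isEligible(target)) return target;
  }
  // No eligible target: the caller decides what to do next (for example, wait for a cooldown).
  return null;
}
```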
Runtime behavior after the refactor
The runtime now follows one consistent flow:
That gives each decision a single home instead of spreading the same responsibility across multiple overlapping subsystems.
Account and fallback handling
This refactor also makes the fallback path easier to reason about.
- 429 rate limits stay in cooldown/backoff and wait-for-cooldown behavior instead of being counted as provider-breaker failures
Dashboard, API, and MCP behavior
The non-runtime surfaces now describe the same resilience model instead of inventing their own parallel layers.
Dashboard
- [...] 429 handling as part of Connection Cooldown
API
- /api/resilience exposes the same settings shape used by the runtime and dashboard
- connectionCooldown is configured with:
  - baseCooldownMs
  - useUpstreamRetryHints
  - maxBackoffSteps
- providerBreaker only counts provider-wide final transient failures; connection-scoped 429 handling stays outside this layer
MCP
Documentation
The docs now describe the same 3.7.0 resilience model used by the implementation:
- README.md
- docs/API_REFERENCE.md
- docs/ARCHITECTURE.md
- docs/USER_GUIDE.md
- docs/openapi.yaml
Validation
Type and test coverage
- npm run typecheck:core
- node --import tsx/esm --test tests/integration/integration-wiring.test.ts
- node --import tsx/esm --test tests/unit/autocombo-unification.test.ts
- node --import tsx/esm --test tests/unit/account-fallback-service.test.ts
- node --import tsx/esm --test tests/integration/chat-pipeline.test.ts
- git push pre-push hook: npm run test:unit (3240 passing tests)
Playwright coverage
Playwright coverage was run in dev-server mode and verifies the behavior of the new model across UI surfaces:
Commands used:
OMNIROUTE_PLAYWRIGHT_SERVER_MODE=dev npx playwright test tests/e2e/resilience-plan-alignment.spec.ts
OMNIROUTE_PLAYWRIGHT_SERVER_MODE=dev npx playwright test tests/e2e/combo-unification.spec.ts
Note
Playwright was executed in dev-server mode because the current branch still has unrelated Next build/prerender failures in start/build mode (/_global-error, /callback). The behavior covered above was validated successfully in dev mode.