refactor: unify resilience controls #1449

Merged
diegosouzapw merged 2 commits into diegosouzapw:release/v3.7.0 from rdself:coder/pr1443-v370-squash
Apr 21, 2026

@rdself rdself commented Apr 20, 2026

Summary

This PR refactors OmniRoute's entire circuit-breaker and cooldown stack for release/v3.7.0.

Before this work, resilience behavior was spread across too many overlapping layers: connection cooldowns, provider breakers, model-availability state, combo-specific health state, and dashboard/API-specific views of runtime health. The result was hard to reason about and hard to trust: the same failure could be interpreted differently depending on where it was handled, and different surfaces could drift away from the actual runtime behavior.

The goal of this refactor is to give OmniRoute one clear resilience model with clear responsibilities at each layer, so the runtime, dashboard, MCP, API, and docs all describe the same system.

What the new system is for

The rebuilt resilience system is split into four layers. Each layer exists to answer a different operational question.

1. requestQueue

What this layer is for

This layer protects a single connection/account from being hit by overlapping local requests when that connection should process work sequentially.

What it means operationally

  • if a connection is healthy but already busy, requests can be serialized instead of racing each other
  • this is about local concurrency control, not failure handling
  • it prevents self-inflicted contention on the same account
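The serialization described above can be sketched with a per-connection promise chain. This is only an illustration: `ConnectionQueue` and `run` are hypothetical names, and the file summary in this PR indicates the actual runtime applies these settings through Bottleneck limiters in `open-sse/services/rateLimitManager.ts` rather than a hand-rolled queue.

```typescript
// Hypothetical sketch: serialize work per connection id with a promise chain.
type Task<T> = () => Promise<T>;

class ConnectionQueue {
  // Tail of the chain for each connection; it never rejects.
  private tails = new Map<string, Promise<unknown>>();

  run<T>(connectionId: string, task: Task<T>): Promise<T> {
    const previous = this.tails.get(connectionId) ?? Promise.resolve();
    // Start this task only after the previous task on the same connection settles.
    const result = previous.then(() => task());
    // Swallow the outcome in the stored tail so one failure does not block later tasks.
    this.tails.set(connectionId, result.then(() => undefined, () => undefined));
    return result;
  }
}
```

Requests for different connection ids run independently; only requests that share an id wait for each other.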

2. connectionCooldown

What this layer is for

This layer decides when one specific connection/account should temporarily step out of rotation after a retryable failure.

What it means operationally

  • if one account is rate-limited (429) or temporarily unstable, OmniRoute cools down that connection instead of repeatedly hammering it
  • this is connection-scoped protection, not provider-wide protection
  • the runtime uses baseCooldownMs, upstream retry hints, and bounded backoff to decide how long that connection should rest
  • connection-scoped 429 rate limits stay here and do not trip the provider breaker
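As a rough sketch of how the three settings could combine: the PR names `baseCooldownMs`, `useUpstreamRetryHints`, and `maxBackoffSteps`, but does not spell out the formula. The doubling policy and the `computeCooldownMs` helper below are assumptions for illustration only.

```typescript
interface ConnectionCooldownSettings {
  baseCooldownMs: number;
  useUpstreamRetryHints: boolean;
  maxBackoffSteps: number;
}

// Assumed policy: an upstream retry hint wins when enabled and present;
// otherwise the base cooldown doubles per consecutive failure, bounded by maxBackoffSteps.
function computeCooldownMs(
  settings: ConnectionCooldownSettings,
  consecutiveFailures: number,
  retryAfterMs?: number,
): number {
  if (settings.useUpstreamRetryHints && retryAfterMs !== undefined && retryAfterMs > 0) {
    return retryAfterMs;
  }
  // First failure uses the base cooldown (zero doublings).
  const steps = Math.min(Math.max(consecutiveFailures - 1, 0), settings.maxBackoffSteps);
  return settings.baseCooldownMs * 2 ** steps;
}
```

Under these assumptions, a connection with `baseCooldownMs: 1000` and `maxBackoffSteps: 3` never rests longer than 8000 ms unless the upstream asks for more.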

3. providerBreaker

What this layer is for

This layer decides when a provider as a whole is unhealthy enough that OmniRoute should stop routing traffic into it for a while.

What it means operationally

  • this is the provider-wide safety gate
  • if failures are no longer isolated to one bad account and look like provider-level instability, the breaker opens
  • the breaker tracks provider-wide final transient failures after fallback exhaustion (408, 500, 502, 503, 504)
  • connection-scoped 429 rate limits do not contribute to the provider-wide breaker
  • direct requests and combo routing use the same provider-level breaker state
  • the Health page is the runtime view of this provider-wide state
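A minimal breaker sketch that follows the counting rule above. The status-code set and the `failureThreshold` setting come from this PR's discussion; the class name, the `resetTimeoutMs` parameter, and the method names are hypothetical and not taken from `src/shared/utils/circuitBreaker.ts`.

```typescript
// Only provider-wide final transient failures count; 429 is deliberately absent.
const PROVIDER_FAILURE_ERROR_CODES = new Set([408, 500, 502, 503, 504]);

class ProviderBreakerSketch {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number; // hypothetical setting name

  constructor(failureThreshold: number, resetTimeoutMs: number) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
  }

  // Call once per request, after fallback across connections is exhausted.
  recordFinalFailure(status: number, now: number): void {
    if (!PROVIDER_FAILURE_ERROR_CODES.has(status)) return;
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = now;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // Remaining time until traffic may be routed to the provider again.
  retryAfterMs(now: number): number {
    if (this.openedAt === null) return 0;
    return Math.max(this.openedAt + this.resetTimeoutMs - now, 0);
  }

  isOpen(now: number): boolean {
    return this.retryAfterMs(now) > 0;
  }
}
```

Passing `now` explicitly keeps the sketch deterministic; a real breaker would likely read the clock itself and model a half-open probe state.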

4. waitForCooldown

What this layer is for

This layer decides what OmniRoute should do when every eligible connection is temporarily unavailable only because of cooldowns.

What it means operationally

  • OmniRoute can either fail fast or wait briefly for the earliest cooldown to expire
  • this is the policy layer for cooldown-aware retry behavior
  • it prevents the system from either giving up too early or retrying forever
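The wait-or-fail decision can be sketched as a pure function. The field names `enabled` and `maxWaitMs` are assumptions; the PR only states that the policy chooses between failing fast and waiting briefly for the earliest cooldown to expire.

```typescript
interface WaitForCooldownPolicy {
  enabled: boolean;  // hypothetical field name
  maxWaitMs: number; // hypothetical field name
}

type CooldownDecision = { action: "wait"; waitMs: number } | { action: "fail" };

// cooldownUntil holds the expiry timestamp (ms) of each cooling-down connection.
function decideOnAllCoolingDown(
  policy: WaitForCooldownPolicy,
  cooldownUntil: number[],
  now: number,
): CooldownDecision {
  if (!policy.enabled || cooldownUntil.length === 0) return { action: "fail" };
  const earliest = Math.min(...cooldownUntil);
  const waitMs = Math.max(earliest - now, 0);
  // Wait only when the earliest expiry fits in the configured budget.
  return waitMs <= policy.maxWaitMs ? { action: "wait", waitMs } : { action: "fail" };
}
```

Bounding the wait by a budget is what keeps this layer from retrying forever while still avoiding an immediate failure when a connection is about to recover.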

What combo routing is responsible for

Combo routing is now treated as a routing layer, not as a second resilience system.

That distinction is important.

A combo should answer questions like:

  • which providers/models are candidates for this request
  • in what order or strategy they should be tried
  • how intelligent routing should score or prioritize those candidates

A combo should not maintain its own parallel understanding of provider health, cooldown state, or breaker state.

What this means operationally

  • combos use the same provider breaker state as direct model requests
  • combos use the same connection cooldown/account availability state as the rest of the runtime
  • combo fallback happens on top of the unified runtime resilience state, not beside it
  • intelligent combo panels focus on routing inputs such as candidate pool, mode pack, and exploration rate
  • runtime health stays on the Health page instead of being reinterpreted inside combo-specific UI panels

In other words: combo decides where to try next; the resilience system decides whether that target is currently eligible.
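To illustrate that split, the sketch below has the combo supply an ordered candidate list while a shared resilience view answers eligibility. `ComboTarget`, `ResilienceView`, and `eligibleTargets` are hypothetical names, not identifiers from `open-sse/services/combo.ts`.

```typescript
interface ComboTarget {
  provider: string;
  connectionId: string;
  model: string;
}

// Read-only view of the shared runtime state; the combo never owns this state.
interface ResilienceView {
  isProviderBreakerOpen(provider: string): boolean;
  isConnectionCoolingDown(connectionId: string): boolean;
}

// Keeps the combo's ordering and drops targets the resilience layers currently exclude.
function eligibleTargets(ordered: ComboTarget[], resilience: ResilienceView): ComboTarget[] {
  return ordered.filter(
    (target) =>
      !resilience.isProviderBreakerOpen(target.provider) &&
      !resilience.isConnectionCoolingDown(target.connectionId),
  );
}
```

Because the filter only reads from the shared view, a direct model request and a combo request would see the same eligibility answer for the same target.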

Runtime behavior after the refactor

The runtime now follows one consistent flow:

  1. queue locally when the selected connection should not run concurrent work
  2. cool down individual connections when a failure is connection-scoped and retryable
  3. open the provider breaker only when failures are provider-scoped after fallback exhaustion
  4. let combo routing choose among the remaining eligible targets using its configured strategy
  5. decide whether to wait or fail when all remaining candidates are only cooling down

That gives each decision a single home instead of spreading the same responsibility across multiple overlapping subsystems.
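One way to picture "a single home per decision" is a small classifier that maps a failed response to the layers it touches. This is a sketch under assumptions: the PR does not confirm that transient 5xx responses also cool down the individual connection, and `actionsForFailure` is a hypothetical helper.

```typescript
type FailureAction = "cooldown-connection" | "count-toward-provider-breaker";

const PROVIDER_TRANSIENT_STATUSES = new Set([408, 500, 502, 503, 504]);

function actionsForFailure(status: number, fallbackExhausted: boolean): FailureAction[] {
  const actions: FailureAction[] = [];
  // Assumption: any retryable failure rests the connection that produced it.
  if (status === 429 || PROVIDER_TRANSIENT_STATUSES.has(status)) {
    actions.push("cooldown-connection");
  }
  // Only final transient failures after fallback exhaustion reach the provider breaker;
  // 429 never does.
  if (fallbackExhausted && PROVIDER_TRANSIENT_STATUSES.has(status)) {
    actions.push("count-toward-provider-breaker");
  }
  return actions;
}
```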

Account and fallback handling

This refactor also makes the fallback path easier to reason about.

  • cooldown writes, account suppression, and model lockouts all flow through the shared auth/fallback path
  • connection-scoped 429 rate limits stay in cooldown/backoff and wait-for-cooldown behavior instead of being counted as provider-breaker failures
  • provider-specific project-routing failures are treated as retryable provider responses instead of terminal account death
  • OAuth invalid-token responses stay in the provider recovery path instead of being flattened into a generic terminal state too early

Dashboard, API, and MCP behavior

The non-runtime surfaces now describe the same resilience model instead of inventing their own parallel layers.

Dashboard

  • the Resilience settings page exposes the same four-part model used by the runtime
  • the Resilience settings copy explicitly describes connection-scoped 429 handling as part of Connection Cooldown
  • the Health page shows provider-wide breaker state
  • intelligent combo panels show routing configuration inputs, not a second runtime health model
  • provider pages reflect provider/connection state from the unified runtime model

API

  • /api/resilience exposes the same settings shape used by the runtime and dashboard
  • connectionCooldown is configured with:
    • baseCooldownMs
    • useUpstreamRetryHints
    • maxBackoffSteps
  • providerBreaker only counts provider-wide final transient failures; connection-scoped 429 handling stays outside this layer
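A partial view of the settings shape, plus a patch merge in the spirit of the normalization described for `src/lib/resilience/settings.ts`. Only the three `connectionCooldown` fields and `providerBreaker.failureThreshold` are named in this PR; the default values, the `ResilienceSettingsSketch` type, and `applyResiliencePatch` are placeholders, and the real settings object contains more than is shown here.

```typescript
interface ConnectionCooldown {
  baseCooldownMs: number;
  useUpstreamRetryHints: boolean;
  maxBackoffSteps: number;
}

interface ResilienceSettingsSketch {
  connectionCooldown: ConnectionCooldown;
  providerBreaker: { failureThreshold: number };
}

interface ResiliencePatch {
  connectionCooldown?: Partial<ConnectionCooldown>;
  providerBreaker?: { failureThreshold?: number };
}

// Placeholder defaults for illustration; not the project's actual values.
const DEFAULT_SKETCH: ResilienceSettingsSketch = {
  connectionCooldown: { baseCooldownMs: 1000, useUpstreamRetryHints: true, maxBackoffSteps: 3 },
  providerBreaker: { failureThreshold: 5 },
};

// Accept only finite, non-negative numbers; otherwise keep the current value.
function nonNegativeInt(value: unknown, fallback: number): number {
  return typeof value === "number" && Number.isFinite(value) && value >= 0
    ? Math.trunc(value)
    : fallback;
}

function applyResiliencePatch(
  current: ResilienceSettingsSketch,
  patch: ResiliencePatch,
): ResilienceSettingsSketch {
  const cc: Partial<ConnectionCooldown> = patch.connectionCooldown ?? {};
  const pb: { failureThreshold?: number } = patch.providerBreaker ?? {};
  return {
    connectionCooldown: {
      baseCooldownMs: nonNegativeInt(cc.baseCooldownMs, current.connectionCooldown.baseCooldownMs),
      useUpstreamRetryHints:
        typeof cc.useUpstreamRetryHints === "boolean"
          ? cc.useUpstreamRetryHints
          : current.connectionCooldown.useUpstreamRetryHints,
      maxBackoffSteps: nonNegativeInt(cc.maxBackoffSteps, current.connectionCooldown.maxBackoffSteps),
    },
    providerBreaker: {
      failureThreshold: nonNegativeInt(pb.failureThreshold, current.providerBreaker.failureThreshold),
    },
  };
}
```

Returning a new object instead of mutating `current` makes it straightforward to persist the result and sync it to the runtime in one step.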

MCP

  • MCP dashboard presets use the same resilience structure
  • advanced MCP resilience tools switch profiles using the same model

Documentation

The docs now describe the same 3.7.0 resilience model used by the implementation:

  • README.md
  • docs/API_REFERENCE.md
  • docs/ARCHITECTURE.md
  • docs/USER_GUIDE.md
  • docs/openapi.yaml

Validation

Type and test coverage

  • npm run typecheck:core
  • node --import tsx/esm --test tests/integration/integration-wiring.test.ts
  • node --import tsx/esm --test tests/unit/autocombo-unification.test.ts
  • node --import tsx/esm --test tests/unit/account-fallback-service.test.ts
  • node --import tsx/esm --test tests/integration/chat-pipeline.test.ts
  • git push pre-push hook: npm run test:unit (3240 passing tests)

Playwright coverage

Playwright coverage was run in dev-server mode and verifies the behavior of the new model across UI surfaces:

  • resilience settings UI with the 3.7.0 connection cooldown fields
  • Health page provider breaker rendering for multiple provider states
  • provider dashboard behavior under the unified provider/runtime model
  • intelligent combo panels showing configuration-oriented routing inputs

Commands used:

  • OMNIROUTE_PLAYWRIGHT_SERVER_MODE=dev npx playwright test tests/e2e/resilience-plan-alignment.spec.ts
  • OMNIROUTE_PLAYWRIGHT_SERVER_MODE=dev npx playwright test tests/e2e/combo-unification.spec.ts

Note

Playwright was executed in dev-server mode because the current branch still has unrelated Next build/prerender failures in start/build mode (/_global-error, /callback). The behavior covered above was validated successfully in dev mode.

@rdself rdself requested a review from diegosouzapw as a code owner April 20, 2026 07:28
Copilot AI review requested due to automatic review settings April 20, 2026 07:28

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the resilience configuration system by centralizing settings into a new ResilienceSettings structure, removing legacy model-level availability tracking, and replacing the per-target circuit breaker logic with a global provider-level breaker. While the changes improve resilience management, the new implementation introduces tight coupling by allowing the open-sse workspace package to import directly from the host application's internal modules (@/lib/db/readCache and @/lib/localDb). This violates the intended isolation of the workspace package. Additionally, the NumberField component in the resilience settings UI prevents users from clearing input fields, which should be addressed to improve usability.

export async function getRuntimeProviderProfile(provider: string | null | undefined) {
  const fallback = getProviderProfile(provider);
  try {
    const { getCachedSettings } = await import("@/lib/db/readCache");

high

The workspace package @omniroute/open-sse is using a host-application alias (@/lib/db/readCache) in its imports. Since this directory is intended to be a standalone workspace package (as noted in ARCHITECTURE.md), it should not depend on the internal modules of the application that consumes it. This creates a circular dependency and prevents the package from being used independently or published correctly. Consider refactoring this to inject the settings dependency or use a shared interface.
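One possible shape for the injection suggested here, sketched under assumptions: `SettingsSource`, `registerSettingsSource`, and `loadRuntimeSettings` are hypothetical names and are not part of this PR. The workspace package would own the interface, and the host application would register an implementation at startup.

```typescript
// Lives in the workspace package; no "@/..." host alias is needed.
interface SettingsSource {
  getSettings(): Promise<Record<string, unknown>>;
}

let settingsSource: SettingsSource | null = null;

// Called once by the host application during startup.
function registerSettingsSource(source: SettingsSource): void {
  settingsSource = source;
}

// Used by package code in place of a dynamic import of host modules.
async function loadRuntimeSettings(): Promise<Record<string, unknown>> {
  if (settingsSource === null) return {}; // standalone use: fall back to static defaults
  try {
    return await settingsSource.getSettings();
  } catch {
    return {}; // a failing host source should not break request handling
  }
}
```

The host could then call something like `registerSettingsSource({ getSettings: getCachedSettings })`, which keeps the import direction pointing from the application to the package only.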

try {
const { getProviderConnections } = await import("@/lib/localDb");
const connections = await getProviderConnections();
const { getProviderConnections, getSettings } = await import("@/lib/localDb");

high

This workspace package is importing directly from the application layer (@/lib/localDb). This tight coupling violates package boundaries and makes the @omniroute/open-sse package non-portable. Dependencies on the host application's database or configuration should be injected at runtime rather than being hardcoded via application-level aliases.

Comment on lines +97 to +98
if (event.target.value === "") return;
const nextValue = Number(event.target.value);

medium

The NumberField component returns early when the input is an empty string, which prevents users from clearing the field to type a new value (the controlled input will immediately revert to its previous state). Removing this check allows the state to be updated correctly when the field is cleared, improving the user experience during configuration tuning.

Suggested change
- if (event.target.value === "") return;
- const nextValue = Number(event.target.value);
+ const nextValue = Number(event.target.value);
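Note that `Number("")` evaluates to `0`, so dropping the early return alone would turn a cleared field into `0`. An alternative, sketched here as a pure state transition, keeps the raw text as a draft and only commits values that parse. `NumberFieldState` and `onNumberInput` are hypothetical names and are not the component's actual code.

```typescript
interface NumberFieldState {
  draft: string;     // what the input element displays
  committed: number; // last valid value passed to the settings form
}

function onNumberInput(state: NumberFieldState, raw: string): NumberFieldState {
  // Let the user clear the field: update the draft, keep the last valid value.
  if (raw.trim() === "") return { draft: raw, committed: state.committed };
  const parsed = Number(raw);
  // Keep showing what was typed, but do not commit values that fail to parse.
  if (!Number.isFinite(parsed)) return { draft: raw, committed: state.committed };
  return { draft: raw, committed: parsed };
}
```

In a React component the `draft` string would back the controlled input's `value`, so clearing the field no longer snaps back to the previous number.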

Copilot AI left a comment

Pull request overview

Refactors OmniRoute’s resilience model to a unified, plan-aligned stack (request queue → connection cooldown → provider breaker → wait-for-cooldown), removing legacy global model availability/quarantine surfaces and aligning runtime, dashboard, API, MCP, and docs for the 3.7.0 release.

Changes:

  • Introduces ResilienceSettings with normalization/legacy-compat wiring and updates /api/resilience + dashboard/MCP presets to the new shape.
  • Removes legacy model availability/quarantine domain + API/UI, shifting runtime health to provider breaker + connection cooldowns.
  • Updates combo routing and chat pipeline behavior/tests to respect provider breaker responses and the unified cooldown-aware retry semantics.

Reviewed changes

Copilot reviewed 68 out of 68 changed files in this pull request and generated 1 comment.

File Description
tests/unit/thundering-herd.test.ts Updates backoff ceiling expectations to use BACKOFF_CONFIG clamping.
tests/unit/sse-auth.test.ts Aligns auth/cooldown expectations with unified cooldown constants and terminal statuses.
tests/unit/rate-limit-manager.test.ts Updates assertions to reflect new limiter state exposure vs model lockout behavior.
tests/unit/error-classification.test.ts Expands provider profile assertions and updates backoff/upstream-hint semantics.
tests/unit/domain-branch-hardening.test.ts Removes modelAvailability reset/coverage now that the domain module is removed.
tests/unit/combo-routing-engine.test.ts Updates combo tests to skip provider-breaker-open responses and renames execution key helper.
tests/unit/combo-context-relay.test.ts Updates context-relay behavior to skip provider-breaker-open responses instead of per-target breaker state.
tests/unit/combo-config.test.ts Removes legacy combo resilience fields (timeoutMs, healthcheck) from defaults/config resolution.
tests/unit/combo-circuit-breaker.test.ts Removes legacy per-target combo circuit-breaker integration tests.
tests/unit/chat-route-coverage.test.ts Shifts 503 coverage from model cooldown to connection cooldown and adds provider breaker response headers/code assertions.
tests/unit/chat-helpers.test.ts Updates pipeline gate checks to provider breaker response and simplifies breaker execution path.
tests/unit/chat-combo-live-test.test.ts Updates “live test” semantics to bypass connection cooldown and sets breaker timing fields explicitly.
tests/unit/batch-a-domain.test.ts Removes model availability domain tests from Batch A suite.
tests/unit/autocombo-unification.test.ts Updates intelligent routing helper test to config-only scoring (no health extraction).
tests/unit/auth-terminal-status.test.ts Adds coverage for terminal/non-terminal auth classifications without adding cooldown.
tests/unit/account-fallback-service.test.ts Updates profile expectations and removes provider-cooldown mutation behavior assertions.
tests/integration/security-hardening.test.ts Removes legacy /api/models/availability route from endpoint validation checks.
tests/integration/proxy-pipeline.test.ts Updates pipeline wiring assertions to new breaker response helper and credential preflight usage.
tests/integration/integration-wiring.test.ts Removes /api/models/availability existence/method checks; updates wiring expectations.
tests/integration/chatcore-compression-integration.test.ts Removes modelAvailability resets from integration harness reset logic.
tests/integration/chat-pipeline.test.ts Removes global model-unavailable integration test and associated resets.
tests/integration/_chatPipelineHarness.ts Removes modelAvailability helpers from harness surface and reset flow.
tests/e2e/resilience-plan-alignment.spec.ts Adds Playwright coverage ensuring UI surfaces align to the new resilience plan and stop calling legacy endpoints.
src/types/settings.ts Adds resilienceSettings?: ResilienceSettings and removes legacy combo defaults fields.
src/sse/services/cooldownAwareRetry.ts Reads wait-for-cooldown behavior from resolveResilienceSettings and renames settings fields.
src/sse/services/auth.ts Refactors markAccountUnavailable to unify cooldown decisions, terminal statuses, and provider error classification.
src/sse/handlers/chatHelpers.ts Uses providerCircuitOpenResponse for breaker gating; removes model availability gating and breaker.execute usage.
src/sse/handlers/chat.ts Removes model availability quarantine integration and updates provider breaker handling + wait-for-cooldown semantics.
src/shared/validation/schemas.ts Adds plan-aligned resilience schemas while keeping legacy profile/default schemas for compatibility.
src/shared/utils/circuitBreaker.ts Extends breaker status payload with retryAfterMs.
src/lib/resilience/settings.ts New module defining defaults, normalization, merge, and legacy compatibility mapping for resilience settings.
src/lib/monitoring/observability.ts Normalizes breaker status and exposes providerBreakers array + retryAfterMs in health payload.
src/lib/dataPaths.js Removes legacy compiled JS dataPaths artifact.
src/lib/combos/intelligentRouting.ts Removes health extraction; keeps provider scoring based on config only.
src/domain/modelAvailability.ts Removes legacy global model availability/quarantine domain module.
src/app/api/settings/combo-defaults/route.ts Sanitizes legacy combo resilience keys out of persisted defaults/overrides.
src/app/api/resilience/route.ts Implements plan-aligned GET/PATCH, legacy patch normalization, persistence, and runtime sync.
src/app/api/resilience/reset/route.ts Updates endpoint semantics/docs to “reset provider circuit breakers” only.
src/app/api/policies/route.ts Removes circuit breaker states from policies API payload (locked identifiers only).
src/app/api/models/availability/route.ts Removes legacy model availability API route.
src/app/(dashboard)/layout.tsx Removes ModelStatusProvider wrapper now that model availability UI is removed.
src/app/(dashboard)/dashboard/settings/components/PoliciesPanel.tsx Removes circuit breaker UI from Policies panel; focuses on locked identifiers.
src/app/(dashboard)/dashboard/settings/components/ComboDefaultsTab.tsx Removes legacy combo resilience fields and sanitizes legacy keys on load/save.
src/app/(dashboard)/dashboard/providers/page.tsx Removes ModelAvailabilityBadge from providers page header.
src/app/(dashboard)/dashboard/providers/components/ModelStatusContext.tsx Removes legacy shared polling context for model availability.
src/app/(dashboard)/dashboard/providers/components/ModelStatusBadge.tsx Removes per-model status badge component.
src/app/(dashboard)/dashboard/providers/components/ModelAvailabilityPanel.tsx Removes legacy model availability panel.
src/app/(dashboard)/dashboard/providers/components/ModelAvailabilityBadge.tsx Removes legacy model availability badge/popup.
src/app/(dashboard)/dashboard/providers/[id]/page.tsx Removes per-model status badge in provider model rows.
src/app/(dashboard)/dashboard/health/page.tsx Displays breaker retryAfterMs in provider health UI.
src/app/(dashboard)/dashboard/endpoint/components/MCPDashboard.tsx Updates resilience presets to new plan-aligned structure.
src/app/(dashboard)/dashboard/combos/page.tsx Sanitizes legacy combo config keys and removes legacy advanced fields from templates/UI.
src/app/(dashboard)/dashboard/combos/IntelligentComboPanel.tsx Removes health polling/exclusions UI; makes panel explicitly config-only.
open-sse/utils/error.ts Adds structured providerCircuitOpenResponse (503 + code + headers).
open-sse/services/rateLimitManager.ts Integrates request queue settings from ResilienceSettings and applies them across Bottleneck limiters.
open-sse/services/comboConfig.ts Filters legacy combo resilience keys during config cascade resolution.
open-sse/services/combo.ts Removes per-target breaker logic; skips targets when provider breaker open responses are returned.
open-sse/services/accountFallback.ts Reworks provider profile derivation to use ResilienceSettings and maps legacy “provider cooldown” helpers to shared breaker.
open-sse/services/AGENTS.md Updates architecture notes to reflect provider-breaker/global model.
open-sse/mcp-server/tools/advancedTools.ts Updates MCP resilience profile payloads to new structure.
open-sse/handlers/chatCore.ts Removes legacy per-model quota lockout handling in favor of shared auth/fallback path.
docs/openapi.yaml Removes /api/models/availability and updates /api/resilience to GET/PATCH config semantics.
docs/USER_GUIDE.md Updates resilience documentation to the new multi-layer model and surface responsibilities.
docs/ARCHITECTURE.md Updates architecture references to remove model availability and reflect new resilience surfaces.
docs/API_REFERENCE.md Removes model availability API docs and updates resilience section wording.
README.md Updates resilience feature description to match the new plan-aligned model.

Comment on lines 1415 to 1419
settings: null,
relayOptions: null as any,
allCombos: null,
relayOptions: null,
});

Copilot AI Apr 20, 2026

Object literal passed to handleComboChat defines relayOptions twice. In TypeScript this is a compile error (duplicate property name) and the first value is silently overwritten at runtime. Remove the duplicate and keep a single relayOptions entry (and do the same for other call sites in this file).

@clousky2020

Great refactor! This unified resilience model is much cleaner than the fragmented approach we had before.

I have a suggestion: Consider removing 429 from PROVIDER_FAILURE_ERROR_CODES.

Rate limiting (429) is expected behavior that's already handled at model-level and account-level cooldowns. Including it in the provider-wide circuit breaker causes premature cooldown of the entire provider, reducing availability for legitimate traffic that could be routed to other accounts or models.

const PROVIDER_FAILURE_ERROR_CODES = new Set([408, 500, 502, 503, 504]);

I had a related PR (#1442) that addressed this and added configurable thresholds. Since your refactor already covers the threshold configuration via providerBreaker.failureThreshold, I'm happy to close mine and let this proceed. The only remaining item would be the 429 removal and a small test isolation fix in settings-api.test.ts that I can submit separately after this merges.

Thanks for the comprehensive cleanup!

rdself commented Apr 20, 2026

@clousky2020 Good catch. I adopted this in the latest update.

429 now stays at the connection-cooldown layer and no longer contributes to the provider-wide circuit breaker. The provider breaker now counts provider-wide final transient failures after fallback exhaustion (408, 500, 502, 503, 504), while rate-limit handling stays connection-scoped.

I also updated the Resilience settings copy and the related tests/docs so the UI, runtime behavior, and validation all match this rule.

Thanks for calling it out.

@diegosouzapw diegosouzapw merged commit b38e57d into diegosouzapw:release/v3.7.0 Apr 21, 2026
2 checks passed
@rdself rdself deleted the coder/pr1443-v370-squash branch April 23, 2026 23:11