Feature update#1
📝 Walkthrough
Greenfield full-stack InsightSQL: FastAPI backend with asyncpg/Postgres schema and seed; Azure OpenAI LLM integrations for Text‑to‑SQL, RCA, and SQL healing; WebSocket real‑time broadcasts; and a Vite+React dashboard with hooks, components, and demo scenarios.
Changes
InsightSQL Application
🎯 4 (Complex) | ⏱️ ~60 minutes
sequenceDiagram
participant Browser
participant API as FastAPI
participant LLM as AzureOpenAI
participant DB as PostgreSQL
participant WS as WS Manager
Browser->>API: POST /api/incidents/{id}/ask (question)
API->>DB: build schema map, read incident
API->>LLM: generate_sql(prompt with schema_map)
LLM-->>API: SQL text
API->>DB: explain_query / execute_readonly (validate & shadow-run)
DB-->>API: plan / rows
API->>DB: INSERT ops.evidence_runs
API->>WS: broadcast(evidence_added)
WS-->>Browser: realtime update
@coderabbitai help what's the status?
Actionable comments posted: 2
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🟠 Major comments (32)
frontend/src/hooks/useWebSocket.js-3-4 (1)
3-4: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Avoid hardcoding the WebSocket endpoint to localhost.
This will fail outside local dev and can break under HTTPS due to mixed-content restrictions. Make the endpoint environment-driven (or protocol-aware) so it works across environments.
Suggested fix
-const WS_URL = 'ws://localhost:8000/ws';
+const WS_URL =
+  import.meta.env.VITE_WS_URL ??
+  `${window.location.protocol === 'https:' ? 'wss' : 'ws'}://${window.location.host}/ws`;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/src/hooks/useWebSocket.js` around lines 3 - 4, The WS_URL constant is hardcoded to ws://localhost:8000/ws which breaks non-local and HTTPS deployments; update the logic that defines WS_URL (and leave RECONNECT_DELAYS unchanged) to read an environment-configurable URL (e.g. import.meta.env.VITE_WS_URL, matching the Vite build) and/or derive the scheme from window.location.protocol (use wss: when page is https:, ws: otherwise) and window.location.host as a fallback so the socket endpoint is environment-driven and protocol-aware; change the WS_URL declaration to prefer the env var, then fall back to a constructed ws/wss URL using window.location.protocol and host.
frontend/src/components/IncidentBanner.jsx-5-8 (1)
5-8: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Reset elapsed on incident switch and depend on a stable incident identity.
The timer currently doesn’t reset for a new incident, and using the whole incident object as a dependency can cause unnecessary interval restarts.
Suggested fix
   useEffect(() => {
+    setElapsed(0);
     const t = setInterval(() => setElapsed(e => e + 1), 1000);
     return () => clearInterval(t);
-  }, [incident]);
+  }, [incident?.incident_id]);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/src/components/IncidentBanner.jsx` around lines 5 - 8, The useEffect in IncidentBanner.jsx currently depends on the whole incident object and doesn't reset the elapsed counter when a new incident appears; update the effect to depend on a stable identity (e.g., incident?.incident_id) instead of incident, and inside the effect reset the state with setElapsed(0) before creating the interval (keep the interval logic using setInterval and clearInterval as-is). This ensures the timer restarts only when the incident identity changes and avoids unnecessary interval restarts due to object reference changes.
frontend/src/components/TimelineChart.jsx-4-4 (1)
4-4: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Guard metricsTimeline with a default to prevent undefined access.
Line 9 and Line 76 assume an array is always passed. During initial load this can throw.
Suggested fix
-export default function TimelineChart({ metricsTimeline }) {
+export default function TimelineChart({ metricsTimeline = [] }) {
Also applies to: 9-9, 76-76
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/src/components/TimelineChart.jsx` at line 4, The component TimelineChart assumes metricsTimeline is always an array and can throw when undefined; update the component to guard it (e.g., set a default empty array) so accesses at TimelineChart (function) usages around the spots that read metricsTimeline (the map/filter at the locations noted) won't crash: either change the function signature to accept a default (metricsTimeline = []) or create a local const safeMetrics = metricsTimeline || [] and use safeMetrics where the code currently uses metricsTimeline.
frontend/src/components/PanelHealth.jsx-5-6 (1)
5-6: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Add a safe default for panels to avoid first-render crashes.
Line 16 and Line 56 assume panels is always defined. If data is still loading, this throws before the empty state can render.
Suggested fix
-export default function PanelHealth({ panels, healing }) {
+export default function PanelHealth({ panels = [], healing }) {
Also applies to: 16-16, 56-56
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/src/components/PanelHealth.jsx` around lines 5 - 6, PanelHealth can crash on first render because panels may be undefined; give panels a safe default and guard its usage: change the component signature to accept a default (e.g., function PanelHealth({ panels = [], healing }) ) and/or replace direct uses with safe access (e.g., panels?.map or panels.length checks). Keep the existing healing handling (healingEntries = Object.values(healing || {})) but ensure anywhere PanelHealth reads panels (iterations or indexing) uses the default or optional chaining to avoid runtime errors.
frontend/src/index.css-162-162 (1)
162-162: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Keyframe names violate configured kebab-case pattern.
slideIn and fadeIn should be renamed and their animation references updated to clear lint errors.
Suggested fix
-.banner {
+.banner {
   background: var(--crit-bg);
@@
-  animation: slideIn 0.35s ease;
+  animation: slide-in 0.35s ease;
 }
@@
-@keyframes slideIn {
+@keyframes slide-in {
   from { opacity: 0; transform: translateY(-8px); }
   to { opacity: 1; transform: translateY(0); }
 }
-@keyframes fadeIn {
+@keyframes fade-in {
   from { opacity: 0; }
   to { opacity: 1; }
 }
@@
-.anim-in { animation: fadeIn 0.3s ease; }
+.anim-in { animation: fade-in 0.3s ease; }
Also applies to: 645-645, 649-649, 657-657
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/src/index.css` at line 162, Keyframe names slideIn and fadeIn violate kebab-case linting; rename the `@keyframes` declarations (e.g., slide-in and fade-in) and update all animation property references (e.g., animation: slideIn 0.35s ease) to use the new kebab-case identifiers (animation: slide-in 0.35s ease) so the `@keyframes` declarations and every use site match; ensure you update both the declarations and uses of slideIn and fadeIn across the file (including other occurrences noted) to eliminate the lint errors.
frontend/src/components/PanelHealth.jsx-35-40 (1)
35-40: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Error-only healing entries are currently dropped.
Line 36 returns early when SQL fields are missing, which also suppresses h.error rendering for those entries.
Suggested fix
-  {healingEntries.map(h => {
-    if (!h.old_sql && !h.new_sql) return null;
+  {healingEntries.map(h => {
+    if (!h.error && !h.old_sql && !h.new_sql) return null;
     return (
       <div key={h.panel_id} className="anim-in">
         {h.error && <div className="error-pill">Error: {h.error}</div>}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/src/components/PanelHealth.jsx` around lines 35 - 40, The current map over healingEntries in PanelHealth.jsx returns null whenever both h.old_sql and h.new_sql are falsy, which unintentionally drops entries that carry only h.error; update the rendering logic in the healingEntries.map callback (the block that checks h.old_sql and h.new_sql and returns early) to still render a container when h.error exists even if SQL fields are missing. That is, change the early return to only skip when there is no h.error and no SQL, and ensure the existing error-pill render (h.error && <div className="error-pill">...) can run for entries lacking SQL.
frontend/src/components/TopologyGraph.jsx-65-65 (1)
65-65: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Dependency array is too narrow and can leave stale topology content.
Rebuild logic depends only on counts, so same-length updates (renames, rewired edges, replaced IDs) won’t refresh the graph.
Suggested fix
-  }, [topology.nodes.length, topology.edges.length]);
+  }, [topology.nodes, topology.edges]);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/src/components/TopologyGraph.jsx` at line 65, The effect in TopologyGraph.jsx currently only depends on topology.nodes.length and topology.edges.length which misses content-only changes; update the useEffect dependency so it reacts to actual topology changes. For example, replace [topology.nodes.length, topology.edges.length] with a dependency that tracks node/edge content such as [topology] or [JSON.stringify(topology.nodes), JSON.stringify(topology.edges)] (or include topology.nodes and topology.edges arrays directly) in the useEffect inside the TopologyGraph component so renames, rewired edges or ID replacements correctly trigger a rebuild.
backend/app/panels/router.py-131-142 (1)
131-142: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Wrap INSERT and UPDATE operations in an explicit transaction to ensure atomic active-version switch.
The two separate conn.execute() calls create a window for inconsistency: the INSERT could succeed while the UPDATE fails (leaving two active versions), or vice versa. Add async with conn.transaction(): around both operations. Also consider including the final UPDATE ops.dashboard_panels in the same transaction for consistency.
Suggested change
-        # Insert broken version
-        new_version = row["version_no"] + 1
-        await conn.execute(
-            """INSERT INTO ops.panel_query_versions
-               (panel_id, version_no, sql_text, generated_by, is_active)
-               VALUES ($1, $2, $3, 'human', true)""",
-            panel_id, new_version, broken_sql,
-        )
-        # Deactivate old version
-        await conn.execute(
-            """UPDATE ops.panel_query_versions SET is_active = false
-               WHERE panel_id = $1 AND version_no = $2""",
-            panel_id, row["version_no"],
-        )
+        new_version = row["version_no"] + 1
+        async with conn.transaction():
+            await conn.execute(
+                """UPDATE ops.panel_query_versions SET is_active = false
+                   WHERE panel_id = $1 AND version_no = $2""",
+                panel_id, row["version_no"],
+            )
+            await conn.execute(
+                """INSERT INTO ops.panel_query_versions
+                   (panel_id, version_no, sql_text, generated_by, is_active)
+                   VALUES ($1, $2, $3, 'human', true)""",
+                panel_id, new_version, broken_sql,
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/panels/router.py` around lines 131 - 142, The INSERT into ops.panel_query_versions and the subsequent UPDATE that deactivates the old version must be executed inside a single database transaction to avoid transient inconsistent state; wrap the two conn.execute(...) calls (and also the final UPDATE ops.dashboard_panels if present in the same flow) in an async with conn.transaction(): block so both the INSERT of the new version and the UPDATE of the old version are atomic and will roll back together on error, locating the changes around the existing calls to conn.execute(...) that insert into ops.panel_query_versions and update ops.panel_query_versions (and the UPDATE ops.dashboard_panels statement) in router.py.
backend/app/main.py-42-47 (1)
42-47: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Fix CORS wildcard + credentials configuration.
The combination allow_origins=["*"] with allow_credentials=True conflicts with the CORS specification: a credentialed request must not be answered with a wildcard Access-Control-Allow-Origin, and browsers block such responses, so cross-origin calls that carry cookies or authorization headers fail. Use an explicit allowlist of trusted origins instead:
Suggested change
 app.add_middleware(
     CORSMiddleware,
-    allow_origins=["*"],
+    allow_origins=[
+        "http://localhost:5173",
+        # add deployed frontend origin(s) explicitly
+    ],
     allow_credentials=True,
     allow_methods=["*"],
     allow_headers=["*"],
 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/main.py` around lines 42 - 47, The CORS config in app.add_middleware using CORSMiddleware currently sets allow_origins=["*"] while allow_credentials=True which violates the CORS spec; update the CORSMiddleware configuration in main.py (the app.add_middleware call) to replace the wildcard origin with an explicit allowlist of trusted origins (e.g., load a list from an env var like TRUSTED_ORIGINS or a settings variable and pass that list to allow_origins) and keep allow_credentials=True only when the origins list is explicit; ensure allow_methods and allow_headers remain as needed and validate the trusted-origins parsing so the middleware receives a proper list rather than a single comma string.
backend/app/panels/router.py-68-69 (1)
68-69: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Use HTTP error status codes for missing panel query states.
These are client-visible error states that should return 404 status codes instead of success responses. Replace the error dict returns at lines 68 and 121 with proper HTTP exceptions. This requires importing HTTPException from fastapi.
Suggested change
-from fastapi import APIRouter
+from fastapi import APIRouter, HTTPException
@@
     if not row:
-        return {"error": "No active query for panel"}
+        raise HTTPException(status_code=404, detail="No active query for panel")
@@
     if not row:
-        return {"error": "No active query found"}
+        raise HTTPException(status_code=404, detail="No active query found")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/panels/router.py` around lines 68 - 69, Replace the client-visible error dict responses ({"error": "No active query for panel"} and {"error": "No active query found"}) with proper FastAPI HTTP exceptions: import HTTPException from fastapi and raise HTTPException(status_code=404, detail=...) with the matching message in both places (the two return sites that currently return those error dicts). Ensure you remove or replace the original return statements so the endpoint raises the exception instead of returning a 200 JSON payload.
backend/app/main.py-62-71 (1)
62-71: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Move manager.disconnect() to a finally block to ensure cleanup on all exceptions.
Currently, cleanup only happens when WebSocketDisconnect is caught. Other exceptions raised by websocket.receive_text() or manager.send_personal() will leave stale sockets in manager.active_connections, causing resource leaks.
Suggested change
 @app.websocket("/ws")
 async def websocket_endpoint(websocket: WebSocket):
     """WebSocket endpoint for real-time dashboard updates."""
     await manager.connect(websocket)
     try:
         while True:
             # Keep connection alive, listen for client messages
             data = await websocket.receive_text()
             # Client can send pings or commands
             if data == "ping":
                 await manager.send_personal(websocket, "pong", {})
-    except WebSocketDisconnect:
-        manager.disconnect(websocket)
+    except WebSocketDisconnect:
+        pass
+    finally:
+        manager.disconnect(websocket)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/main.py` around lines 62 - 71, The cleanup call manager.disconnect(websocket) must run for all exit paths, not just on WebSocketDisconnect; wrap the loop in try/finally so that after the async loop using websocket.receive_text() and manager.send_personal(...) ends (whether due to WebSocketDisconnect or any other exception) manager.disconnect(websocket) is invoked in the finally block; leave the existing WebSocketDisconnect except block only if you need special handling, but ensure the final cleanup is performed in the finally section referencing manager.disconnect(websocket).
backend/app/incidents/router.py-35-35 (1)
35-35: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Use HTTPException for error responses instead of returning error dictionaries with 200 status.
Lines 35 and 67 return error payloads as JSON with HTTP 200 (success) status. Use HTTPException to return proper HTTP error codes: 404 for the missing incident (line 35) and 422 for the missing question validation error (line 67).
Suggested change
-from fastapi import APIRouter
+from fastapi import APIRouter, HTTPException
@@
     if not incident:
-        return {"error": "Incident not found"}
+        raise HTTPException(status_code=404, detail="Incident not found")
@@
     question = body.get("question", "")
     if not question:
-        return {"error": "question is required"}
+        raise HTTPException(status_code=422, detail="question is required")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/incidents/router.py` at line 35, Replace the plain JSON error returns with FastAPI HTTPException raises: where the code currently returns {"error": "Incident not found"} (in the incident lookup handler, symbol: router or the endpoint function that checks for the incident) raise HTTPException(status_code=404, detail="Incident not found"); likewise replace the validation error return (the branch that returns the missing-question payload at the location around line 67 in the same endpoint or validation helper) with raise HTTPException(status_code=422, detail="question is required"); ensure you import HTTPException from fastapi at the top of the module if not already imported.
backend/app/ws/manager.py-35-39 (1)
35-39: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Harden broadcast loop against concurrent list mutation and blind exception swallowing.
Iterate over a snapshot and catch expected socket-send failures explicitly; avoid masking unrelated bugs. The current code directly iterates over self.active_connections while awaiting sends, so a disconnect() that runs during iteration can mutate the collection mid-loop: a list may silently skip connections, and a set or dict raises a RuntimeError. Additionally, the broad except Exception: masks unexpected errors.
Suggested change
-from fastapi import WebSocket
+from fastapi import WebSocket, WebSocketDisconnect
@@
-        for connection in self.active_connections:
+        for connection in list(self.active_connections):
             try:
                 await connection.send_text(message_json)
-            except Exception:
+            except (WebSocketDisconnect, RuntimeError):
                 disconnected.append(connection)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/ws/manager.py` around lines 35 - 39, The broadcast loop iterates self.active_connections directly and uses a bare except which can both raise RuntimeError if the list mutates and hide real bugs; copy the connections into a snapshot (e.g., list(self.active_connections)) before iterating, catch only expected send failures (e.g., WebSocketDisconnect, ConnectionClosedError/ConnectionClosedOK, asyncio.CancelledError) when awaiting connection.send_text, append those failed connections to disconnected, and after the loop call the existing disconnect/cleanup logic for each entry; re-raise or log other unexpected exceptions instead of swallowing them. Reference symbols: self.active_connections, connection.send_text, disconnected, disconnect().
backend/app/panels/router.py-80-86 (1)
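For the broadcast-loop finding in backend/app/ws/manager.py, a minimal standalone sketch of snapshot iteration with deferred cleanup might look like the following. FakeSocket and SendFailed are hypothetical stand-ins for FastAPI's WebSocket and its send errors, used only so the example runs without a server; the Manager class borrows the names from the review (active_connections, disconnect) but is not the project's actual code.

```python
import asyncio


class SendFailed(Exception):
    """Hypothetical stand-in for a socket send error (e.g. WebSocketDisconnect)."""


class FakeSocket:
    """Hypothetical stand-in for fastapi.WebSocket; records what was sent."""

    def __init__(self, alive: bool = True) -> None:
        self.alive = alive
        self.sent = []

    async def send_text(self, message: str) -> None:
        if not self.alive:
            raise SendFailed("peer went away")
        self.sent.append(message)


class Manager:
    def __init__(self) -> None:
        self.active_connections = []

    def disconnect(self, ws: FakeSocket) -> None:
        if ws in self.active_connections:
            self.active_connections.remove(ws)

    async def broadcast(self, message: str) -> None:
        disconnected = []
        # Iterate over a copy so connects/disconnects that happen while we
        # await a send cannot disturb the loop.
        for connection in list(self.active_connections):
            try:
                await connection.send_text(message)
            except SendFailed:
                # Only the expected send failure marks a socket as dead;
                # any other exception propagates to the caller.
                disconnected.append(connection)
        # Remove dead sockets only after the loop has finished.
        for connection in disconnected:
            self.disconnect(connection)


async def demo():
    manager = Manager()
    good, dead = FakeSocket(), FakeSocket(alive=False)
    manager.active_connections += [dead, good]
    await manager.broadcast("evidence_added")
    return good.sent, len(manager.active_connections)


sent, remaining = asyncio.run(demo())
```

The dead socket is dropped and the healthy one still receives the message, while an unrelated bug inside send_text would surface instead of being recorded as a disconnect.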
80-86: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Catch database errors explicitly instead of masking programming errors.
The broad except Exception catch silently records application bugs (e.g., attribute errors, type mismatches) as panel SQL failures. Catch asyncpg.PostgresError to only record legitimate database errors and let unexpected exceptions surface for debugging.
Suggested change
+import asyncpg
@@
-    except Exception as e:
+    except asyncpg.PostgresError as e:
         # Record failure
         await conn.execute(
             """INSERT INTO ops.query_failures (panel_id, error_text, bad_sql)
                VALUES ($1, $2, $3)""",
             panel_id, str(e), row["sql_text"],
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/panels/router.py` around lines 80 - 86, Replace the broad "except Exception as e" that records failures in ops.query_failures with an exception handler that only catches asyncpg.PostgresError (import asyncpg at the top), so that only database errors trigger the INSERT via conn.execute; let other exceptions propagate (or re-raise) so programming errors aren't masked. Specifically, update the try/except around the query execution (the block that calls conn.execute to insert into ops.query_failures) to "except asyncpg.PostgresError as e:" and preserve the existing INSERT logic inside that handler.
backend/app/demo/scenarios.py-274-296 (1)
274-296: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Make panel-break mutation atomic to avoid partial panel state.
This sequence updates ops.panel_query_versions, ops.dashboard_panels, and ops.query_failures without a transaction. If any statement fails after is_active=false, the panel can be left with no active query version.
Suggested fix
     # Break the panel
     async with pool.acquire() as conn:
-        active = await conn.fetchrow(
-            "SELECT version_no, sql_text FROM ops.panel_query_versions WHERE panel_id = $1 AND is_active = true",
-            panel_id,
-        )
-        if not active:
-            return
-
-        broken_sql = active["sql_text"].replace("display_name", "resource_name")
-        new_version = active["version_no"] + 1
-
-        await conn.execute("UPDATE ops.panel_query_versions SET is_active = false WHERE panel_id = $1", panel_id)
-        await conn.execute(
-            """INSERT INTO ops.panel_query_versions (panel_id, version_no, sql_text, generated_by, is_active)
-               VALUES ($1, $2, $3, 'human', true)""",
-            panel_id, new_version, broken_sql,
-        )
-        await conn.execute("UPDATE ops.dashboard_panels SET status = 'failed' WHERE panel_id = $1", panel_id)
-        await conn.execute(
-            """INSERT INTO ops.query_failures (panel_id, error_text, bad_sql)
-               VALUES ($1, 'column \"resource_name\" does not exist', $2)""",
-            panel_id, broken_sql,
-        )
+        async with conn.transaction():
+            active = await conn.fetchrow(
+                "SELECT version_no, sql_text FROM ops.panel_query_versions WHERE panel_id = $1 AND is_active = true",
+                panel_id,
+            )
+            if not active:
+                return
+
+            broken_sql = active["sql_text"].replace("display_name", "resource_name")
+            new_version = active["version_no"] + 1
+
+            await conn.execute("UPDATE ops.panel_query_versions SET is_active = false WHERE panel_id = $1", panel_id)
+            await conn.execute(
+                """INSERT INTO ops.panel_query_versions (panel_id, version_no, sql_text, generated_by, is_active)
+                   VALUES ($1, $2, $3, 'human', true)""",
+                panel_id, new_version, broken_sql,
+            )
+            await conn.execute("UPDATE ops.dashboard_panels SET status = 'failed' WHERE panel_id = $1", panel_id)
+            await conn.execute(
+                """INSERT INTO ops.query_failures (panel_id, error_text, bad_sql)
+                   VALUES ($1, 'column \"resource_name\" does not exist', $2)""",
+                panel_id, broken_sql,
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/demo/scenarios.py` around lines 274 - 296, The sequence that updates ops.panel_query_versions, ops.dashboard_panels, and inserts into ops.query_failures (using variables active, broken_sql, new_version) must be executed atomically; wrap the block that sets is_active = false, INSERTs the new version, updates dashboard_panels status, and INSERTs query_failures in a single database transaction (use conn.transaction() or equivalent on the acquired conn) so that any error rolls back all changes and prevents leaving no active version.
backend/app/demo/router.py-53-68 (1)
53-68: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Run the reset DELETE sequence inside one transaction.
If any DELETE fails, the current approach can leave a partially reset database.
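To illustrate the all-or-nothing behaviour being asked for, here is a small sketch that uses Python's standard-library sqlite3 as a stand-in for asyncpg and Postgres, so it runs without a database server. The table names parent and child and the reset helper are hypothetical; the real code would use async with conn.transaction(): on the asyncpg connection instead.

```python
import sqlite3


def reset(conn: sqlite3.Connection, statements) -> bool:
    """Run all statements in one transaction; roll back everything on failure."""
    try:
        # Using the connection as a context manager commits on success and
        # rolls back if any statement inside the block raises.
        with conn:
            for sql in statements:
                conn.execute(sql)
        return True
    except sqlite3.Error:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE child (id INTEGER PRIMARY KEY)")
conn.execute("INSERT INTO parent VALUES (1)")
conn.execute("INSERT INTO child VALUES (1)")
conn.commit()

# The second statement targets a table that does not exist, so it fails
# after the first DELETE has already run inside the transaction.
ok = reset(conn, ["DELETE FROM child", "DELETE FROM missing_table"])
child_rows = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]
```

Because the failed statement rolls back the earlier DELETE, the child table still holds its row afterwards; without the transaction the first table would have been emptied and the reset left half-done.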
Suggested fix
     pool = await get_pool()
     async with pool.acquire() as conn:
-        # Truncate all telemetry and state tables (order matters for FK constraints)
-        await conn.execute("DELETE FROM ops.remediation_actions")
-        await conn.execute("DELETE FROM ops.evidence_runs")
-        await conn.execute("DELETE FROM ops.incidents")
-        await conn.execute("DELETE FROM ops.query_failures")
-        await conn.execute("DELETE FROM ops.panel_query_versions")
-        await conn.execute("DELETE FROM ops.dashboard_panels")
-        await conn.execute("DELETE FROM ops.sap_alerts")
-        await conn.execute("DELETE FROM ops.sap_backups")
-        await conn.execute("DELETE FROM ops.events_norm")
-        await conn.execute("DELETE FROM ops.alerts_raw")
-        await conn.execute("DELETE FROM ops.metrics_norm")
-        await conn.execute("DELETE FROM ops.resource_edges")
-        await conn.execute("DELETE FROM ops.resources")
+        async with conn.transaction():
+            # Truncate all telemetry and state tables (order matters for FK constraints)
+            await conn.execute("DELETE FROM ops.remediation_actions")
+            await conn.execute("DELETE FROM ops.evidence_runs")
+            await conn.execute("DELETE FROM ops.incidents")
+            await conn.execute("DELETE FROM ops.query_failures")
+            await conn.execute("DELETE FROM ops.panel_query_versions")
+            await conn.execute("DELETE FROM ops.dashboard_panels")
+            await conn.execute("DELETE FROM ops.sap_alerts")
+            await conn.execute("DELETE FROM ops.sap_backups")
+            await conn.execute("DELETE FROM ops.events_norm")
+            await conn.execute("DELETE FROM ops.alerts_raw")
+            await conn.execute("DELETE FROM ops.metrics_norm")
+            await conn.execute("DELETE FROM ops.resource_edges")
+            await conn.execute("DELETE FROM ops.resources")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/demo/router.py` around lines 53 - 68, The DELETE sequence executed under async with pool.acquire() should be run inside a single database transaction so any failure rolls back all prior deletes; update the block that acquires a connection (the async with pool.acquire() usage in backend/app/demo/router.py) to open a transaction (e.g., async with conn.transaction():) and move all the await conn.execute("DELETE FROM ...") calls into that transaction scope so the DB will commit only if all deletes succeed and automatically rollback on error.
backend/app/demo/router.py-48-51 (1)
48-51: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Await task cancellation before wiping and reseeding state.
cancel() is fire-and-forget here; the demo task can still run while reset is deleting/reseeding tables, causing interleaved writes.
Suggested fix
     if _demo_state["task"] and not _demo_state["task"].done():
         _demo_state["task"].cancel()
+        try:
+            await _demo_state["task"]
+        except asyncio.CancelledError:
+            pass
     _demo_state["running"] = False
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/demo/router.py` around lines 48 - 51, The reset currently cancels the background demo task with _demo_state["task"].cancel() but does not await it, so the task may continue running while you wipe/reseed; change the flow to call _demo_state["task"].cancel(), then await _demo_state["task"] (handling asyncio.CancelledError and other exceptions) before clearing or reseeding tables and setting _demo_state["running"]=False; ensure any exceptions from awaiting the task are logged/handled and that _demo_state is only reset after the awaited cancellation completes.
backend/app/demo/router.py-24-29 (1)
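For the cancellation finding on lines 48-51 of backend/app/demo/router.py, the ordering guarantee can be shown with plain asyncio. This is a sketch under stated assumptions: demo_task, reset, and the log list are hypothetical names, and the sleeps stand in for real database work.

```python
import asyncio

log = []


async def demo_task() -> None:
    """Hypothetical long-running demo loop that writes until cancelled."""
    try:
        while True:
            log.append("write")
            await asyncio.sleep(0.01)
    finally:
        # Runs when CancelledError is delivered at the await above.
        log.append("stopped")


async def reset() -> None:
    task = asyncio.create_task(demo_task())
    await asyncio.sleep(0.03)  # let the task run for a moment
    task.cancel()
    try:
        # cancel() only requests cancellation; awaiting the task waits
        # until it has actually finished unwinding.
        await task
    except asyncio.CancelledError:
        pass
    log.append("reset")  # the task can no longer write after this point


asyncio.run(reset())
```

The "stopped" entry always precedes "reset", which is the property the reset endpoint needs before it starts deleting and reseeding tables.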
24-29: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Serialize demo startup state transitions and reject duplicate starts with HTTP 409.
The check/set on _demo_state["running"] is non-atomic: await get_pool() sits between the check and the set, so concurrent /start requests can both pass the check and launch multiple demo tasks. Also, returning {"error": ...} with 200 makes client error handling unreliable.
Suggested fix
 import asyncio
-from fastapi import APIRouter, BackgroundTasks
+from fastapi import APIRouter, BackgroundTasks, HTTPException
 ...
 router = APIRouter()
+_demo_lock = asyncio.Lock()
 ...
 @router.post("/start")
 async def start_demo(background_tasks: BackgroundTasks):
     """Start the full 3-incident demo sequence."""
-    if _demo_state["running"]:
-        return {"error": "Demo already running"}
-
-    pool = await get_pool()
-    _demo_state["running"] = True
-    _demo_state["phase"] = "starting"
+    async with _demo_lock:
+        if _demo_state["running"]:
+            raise HTTPException(status_code=409, detail="Demo already running")
+        pool = await get_pool()
+        _demo_state["running"] = True
+        _demo_state["phase"] = "starting"
 ...
-    _demo_state["task"] = asyncio.create_task(_run())
+        _demo_state["task"] = asyncio.create_task(_run())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/demo/router.py` around lines 24 - 29, The non-atomic check/set on _demo_state["running"] can allow concurrent /start requests to race; wrap the startup state transition in a serialized async lock (e.g., an asyncio.Lock or FastAPI dependency) so only one coroutine may check-and-set _demo_state at a time, perform the get_pool() call after acquiring the lock, and set _demo_state["running"]=True and _demo_state["phase"]="starting" while holding the lock; if _demo_state["running"] is already true, return an HTTP 409 response (e.g., raise an HTTPException(status_code=409) or equivalent) instead of returning a 200 JSON error so clients see a proper conflict status.
backend/app/config.py-1-34 (1)
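For the duplicate-start finding on lines 24-29 of backend/app/demo/router.py, the effect of the lock can be reproduced without FastAPI. In this sketch Conflict is a hypothetical stand-in for HTTPException(status_code=409), DemoController is an illustrative wrapper around the state the review calls _demo_state, and get_pool is a dummy awaitable that merely yields control the way the real pool lookup would.

```python
import asyncio


class Conflict(Exception):
    """Hypothetical stand-in for HTTPException(status_code=409)."""


async def get_pool() -> None:
    """Dummy dependency; the await is the point where a race could occur."""
    await asyncio.sleep(0.01)


class DemoController:
    def __init__(self) -> None:
        self.lock = asyncio.Lock()
        self.running = False
        self.starts = 0

    async def start(self) -> str:
        async with self.lock:
            # The check and the set happen under the same lock, so a second
            # caller cannot slip in while the first is awaiting get_pool().
            if self.running:
                raise Conflict("Demo already running")
            await get_pool()
            self.running = True
            self.starts += 1
        return "started"


async def main():
    demo = DemoController()  # created inside the running event loop
    # Fire two concurrent start requests; exactly one should win.
    results = await asyncio.gather(demo.start(), demo.start(), return_exceptions=True)
    return results, demo.starts


results, starts = asyncio.run(main())
```

Removing the lock in this sketch would let both callers pass the running check before either sets the flag, which is the race the review describes.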
1-34: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Remove duplicate Settings definition.
This entire file duplicates backend/app/__init__.py. Both files define identical Settings classes and get_settings() functions. Consolidate to a single location to prevent maintenance issues and confusion about which module to import from.
♻️ Recommended approach
Keep the configuration in backend/app/config.py (more explicit module name), and remove the duplicate from backend/app/__init__.py:
backend/app/__init__.py:
# Application package
Files importing Settings should use:
from app.config import get_settings
This eliminates duplication and establishes app.config as the canonical configuration source.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/config.py` around lines 1 - 34, Remove the duplicate Settings class and get_settings function from backend/app/__init__.py and keep the canonical definitions in backend/app/config.py (Settings, get_settings); then update all imports across the codebase to import the configuration from app.config (e.g. from app.config import get_settings) and simplify backend/app/__init__.py to a minimal package initializer (no Settings/get_settings definitions) so there is a single source of truth.
backend/.env.example-1-4 (1)
1-4: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Replace real credentials and endpoints with placeholders.
The example file contains what appear to be real Azure endpoints and database credentials. Example configuration should use placeholder values to prevent accidental credential leakage if this file is committed or shared.
🔒 Proposed fix with placeholder values
-AZURE_OPENAI_ENDPOINT=https://oai-gopoc-prod-northcentralus-001.openai.azure.com/
+AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
 AZURE_OPENAI_KEY=your-key-here
 AZURE_OPENAI_API_VERSION=2024-12-01-preview
-DATABASE_URL=postgresql://insightsql:[email protected]:5432/insightsql
+DATABASE_URL=postgresql://username:password@localhost:5432/insightsql
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/.env.example` around lines 1 - 4, The .env.example currently contains real-looking secrets; replace the concrete values for AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_KEY, AZURE_OPENAI_API_VERSION, and DATABASE_URL with generic placeholder values (e.g., use example URLs/keys and a template DB connection string) so no real endpoints/credentials remain in the example file; ensure the placeholders clearly indicate they are fake (like YOUR_AZURE_ENDPOINT, YOUR_AZURE_KEY, YOUR_API_VERSION, and POSTGRESQL://USER:PASSWORD@HOST:PORT/DBNAME) and keep the same variable names so consuming code examples remain valid.
backend/app/config.py-9-9 (1)
9-9: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Remove hardcoded database credentials from code.
The default value contains hardcoded credentials and an internal IP address. Credentials should never appear in source code, even as defaults. Force explicit configuration via environment variables.
🔒 Proposed fix
-    database_url: str = "postgresql://insightsql:[email protected]:5432/insightsql"
+    database_url: str = ""  # Required: set via DATABASE_URL environment variable
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/config.py` at line 9, The database_url default in config.py contains hardcoded credentials; remove the literal default and require configuration from the environment instead: change the database_url declaration (symbol: database_url) so it is populated from an environment variable (e.g., os.environ["DATABASE_URL"] or via your settings loader) or left unset and raise a clear error if not provided, ensuring no credential string remains in source; update any usage of database_url to expect an explicitly provided value at startup and add a short startup-time validation that fails fast when DATABASE_URL is missing.
backend/app/__init__.py-7-7 (1)
7-7: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Remove hardcoded database credentials from code.
The default value contains hardcoded credentials and an internal IP address. Credentials should never appear in source code, even as defaults. Force explicit configuration via environment variables by using an empty string or raising an error if DATABASE_URL is not set.
🔒 Proposed fix to require explicit configuration
-    database_url: str = "postgresql://insightsql:[email protected]:5432/insightsql"
+    database_url: str = ""  # Required: set via DATABASE_URL environment variable
Or use Pydantic's field validation to enforce it:
+from pydantic import Field
+
 class Settings(BaseSettings):
     # Database
-    database_url: str = "postgresql://insightsql:[email protected]:5432/insightsql"
+    database_url: str = Field(..., description="PostgreSQL connection string")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/__init__.py` at line 7, The variable database_url in backend/app/__init__.py currently contains hardcoded credentials; remove the hardcoded default and require explicit configuration from environment or validation instead. Change database_url so it does not include secrets (set to an empty string or None by default) and read the real value from an environment variable like DATABASE_URL at startup; if DATABASE_URL is missing, raise an explicit error or use Pydantic/Field validation on the settings class to fail fast. Update any code that references database_url to expect the new required value and ensure no secrets remain in source.
backend/app/ingestion/router.py-12-67 (1)
12-67: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Protect ingestion endpoints with webhook authentication.
All three endpoints accept and process unauthenticated payloads. This allows arbitrary event/metric injection and broadcast abuse.
Use a shared secret signature (e.g., HMAC header) or API key dependency before calling normalizers/broadcast.
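A minimal sketch of the shared-secret check, assuming the sender signs the raw request body with HMAC-SHA256 and sends the hex digest in a header such as `X-Signature`. The `WEBHOOK_SECRET` value and the `sign` / `verify_signature` helpers are hypothetical and not part of this PR; in a FastAPI handler the body would likely come from `await request.body()`, and a failed check would map to a 401.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical shared secret; in practice this would be loaded from the environment.
WEBHOOK_SECRET = b"change-me"


def sign(body: bytes, secret: bytes = WEBHOOK_SECRET) -> str:
    """Return the hex HMAC-SHA256 digest a sender would attach to the payload."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(
    body: bytes, signature: Optional[str], secret: bytes = WEBHOOK_SECRET
) -> bool:
    """Check a received signature against the raw body in constant time."""
    if not signature:
        return False  # a missing header is treated as unauthenticated
    expected = sign(body, secret)
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Verifying against the raw bytes rather than the parsed JSON matters here, since re-serializing the payload can change whitespace or key order and invalidate an otherwise correct signature.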
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/ingestion/router.py` around lines 12 - 67, Add webhook authentication to the ingestion endpoints (receive_alert, receive_metrics, receive_compute_event) by verifying a shared secret before calling get_pool, normalize_* functions or manager.broadcast: read the raw request body and a signature header (e.g., X-Signature) or API key header (e.g., X-Api-Key), compute/compare an HMAC using the shared secret from config/env, and return 401/raise an HTTPException when the header is missing or the signature/key is invalid; apply the same verification logic to all three handlers so no unauthenticated payload reaches normalize_alert, normalize_metrics or normalize_compute_event or triggers manager.broadcast.
backend/app/validation/executor.py-22-24 (1)
22-24: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Add execution guardrails (timeout + row cap) to prevent resource exhaustion.
Current execution can run very expensive queries and materialize unlimited result sets.
At minimum, set a local `statement_timeout` in the transaction and cap the rows returned (explicit limit or cursor/pagination).
Also applies to: 36-50
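One way the two guardrails could fit together, as a sketch rather than the PR's actual executor: `cap_rows` and `run_guarded` are hypothetical helpers, the limits are placeholder values, and `conn` is assumed to behave like an asyncpg connection (`transaction()`, `execute()`, `fetch()`).

```python
MAX_ROWS = 500      # hypothetical row cap
TIMEOUT_MS = 5000   # hypothetical per-statement timeout in milliseconds


def cap_rows(sql: str, max_rows: int = MAX_ROWS) -> str:
    """Wrap a query in a subselect so an outer LIMIT always applies."""
    inner = sql.strip().rstrip(";").strip()
    return f"SELECT * FROM ({inner}) AS capped LIMIT {int(max_rows)}"


async def run_guarded(conn, sql: str, timeout_ms: int = TIMEOUT_MS,
                      max_rows: int = MAX_ROWS):
    """Run a read-only query with a transaction-scoped timeout and a row cap."""
    async with conn.transaction(readonly=True):
        # SET LOCAL reverts automatically when the transaction ends.
        # SET cannot take bind parameters, so the value is forced to int first.
        await conn.execute(f"SET LOCAL statement_timeout = {int(timeout_ms)}")
        return await conn.fetch(cap_rows(sql, max_rows))
```

Wrapping the query instead of appending `LIMIT` keeps the cap effective even when the inner SQL already carries its own, larger limit. The wrapper assumes the inner SQL is a single statement that does not end in a line comment.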
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/validation/executor.py` around lines 22 - 24, Add execution guardrails inside the transaction around conn.transaction(): set a local statement timeout (e.g. await conn.execute("SET LOCAL statement_timeout = '5000ms'")) before executing EXPLAIN, and enforce a maximum row cap when executing the actual query (either append/ensure a LIMIT n to the SQL or use a server-side cursor/fetching API to read at most N rows). Update both the EXPLAIN call that uses conn.fetch (plan_rows) and the subsequent query execution path (the block around conn.transaction() and the code referenced in lines 36-50) to apply the same statement_timeout and row cap behavior so expensive queries cannot run indefinitely or materialize unlimited result sets.
backend/app/db/seed.py-8-22 (1)
8-22: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Make seeding idempotency atomic across concurrent startups.
Line 12’s pre-check and the subsequent inserts are not atomic. Two app instances can both pass the empty check, then both seed; `ops.metrics_norm` writes are not conflict-protected, so duplicates are possible.
Suggested direction
 async def seed_all(pool: asyncpg.Pool):
     """Run all seed functions."""
-    async with pool.acquire() as conn:
-        # Check if already seeded
-        count = await conn.fetchval("SELECT count(*) FROM ops.resources")
-        if count > 0:
-            print("[SEED] Database already seeded, skipping.")
-            return
-
-    await seed_resources(pool)
-    await seed_topology(pool)
-    await seed_baseline_metrics(pool)
-    await seed_baseline_backups(pool)
-    await seed_dashboard_panels(pool)
+    async with pool.acquire() as conn:
+        async with conn.transaction():
+            # Serialize seeding across instances
+            await conn.execute("SELECT pg_advisory_xact_lock(984321)")
+            count = await conn.fetchval("SELECT count(*) FROM ops.resources")
+            if count > 0:
+                print("[SEED] Database already seeded, skipping.")
+                return
+            # call conn-scoped seed helpers (or pass conn instead of pool)
+            ...
     print("[SEED] All seed data loaded.")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/db/seed.py` around lines 8 - 22, The current pre-check in seed_all is not atomic so concurrent app startups can both see an empty ops.resources and run seed_* functions (seed_resources, seed_topology, seed_baseline_metrics, seed_baseline_backups, seed_dashboard_panels) causing duplicate writes (notably to ops.metrics_norm). Make seeding atomic by acquiring a database-wide lock or using an idempotent guard row inside a transaction before running seeds: for example obtain a Postgres advisory lock (or INSERT INTO a dedicated seed_control table with a unique key using INSERT ... ON CONFLICT DO NOTHING and check the result) and only run seed_* when the lock/guard indicates this process owns the seed operation; release the lock after seeding. Ensure ops.metrics_norm inserts are using upsert/unique constraints or guarded by the same atomic check to prevent duplicates.
backend/app/ingestion/normalizer.py-54-57 (1)
54-57: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Guard per-metric parsing so one bad value doesn’t drop the whole payload.
`float(metric_value)` can raise and abort the entire request. Skip invalid points (or collect errors) and insert valid metrics.
Proposed fix
 rows = []
 for metric_name, metric_value in metrics.items():
     unit = _infer_unit(metric_name)
-    rows.append((event_ts, resource_id, metric_name, float(metric_value), unit, json.dumps(labels)))
+    try:
+        value = float(metric_value)
+    except (TypeError, ValueError):
+        continue
+    rows.append((event_ts, resource_id, metric_name, value, unit, json.dumps(labels)))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/ingestion/normalizer.py` around lines 54 - 57, The loop that appends rows uses float(metric_value) directly and a single bad metric will abort processing; wrap the per-metric parsing inside a try/except around converting metric_value to float (and around any unit inference if needed) in the for metric_name, metric_value in metrics.items() loop, skip invalid metrics instead of raising, and optionally collect or log parsing errors (e.g., append to an errors list or call a logger) while still appending valid rows with rows.append((event_ts, resource_id, metric_name, float_val, _infer_unit(metric_name), json.dumps(labels))).
backend/app/ingestion/normalizer.py-127-135 (1)
127-135: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Normalize naive timestamps to UTC in parser.
`datetime.fromisoformat(...)` may return a naive datetime. Persisting that can shift event ordering depending on DB/session timezone.
Proposed fix
 def _parse_ts(ts_str: str | None) -> datetime:
@@
     try:
-        return datetime.fromisoformat(ts_str.replace("Z", "+00:00"))
+        dt = datetime.fromisoformat(ts_str.replace("Z", "+00:00"))
+        return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)
     except (ValueError, AttributeError):
         return datetime.now(timezone.utc)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/ingestion/normalizer.py` around lines 127 - 135, The _parse_ts function can return naive datetimes from datetime.fromisoformat which can lead to inconsistent ordering; update _parse_ts so that after parsing (in both success and fallback paths) any naive datetime is converted to an aware UTC datetime (e.g., check if parsed_dt.tzinfo is None and set/replace it to timezone.utc) and continue to return timezone-aware datetimes (use timezone.utc for the now() fallback as already used).
backend/app/db/seed.py-255-266 (1)
255-266: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Insert panel metadata and version in one transaction.
If Line 258 succeeds and Line 263 fails, you can leave a panel without an active query version.
Proposed fix
 async with pool.acquire() as conn:
-    for p in panels:
-        await conn.execute(
+    async with conn.transaction():
+        for p in panels:
+            await conn.execute(
                 """INSERT INTO ops.dashboard_panels (panel_id, panel_name, contract_json, status) VALUES ($1, $2, $3::jsonb, 'active') ON CONFLICT DO NOTHING""",
                 p["panel_id"], p["panel_name"], p["contract_json"],
-        )
-        await conn.execute(
+            )
+            await conn.execute(
                 """INSERT INTO ops.panel_query_versions (panel_id, version_no, sql_text, generated_by, is_active) VALUES ($1, 1, $2, 'human', true) ON CONFLICT DO NOTHING""",
                 p["panel_id"], p["sql"],
-        )
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/db/seed.py` around lines 255 - 266, The two INSERTs executed under async with pool.acquire() for each panel must be atomic so a panel row in ops.dashboard_panels cannot be created without a matching active row in ops.panel_query_versions; wrap the pair of statements for each p in a transaction (use conn.transaction() or an explicit BEGIN/COMMIT via the connection) so that both INSERTs (into ops.dashboard_panels with panel_id/panel_name/contract_json and into ops.panel_query_versions with version_no/sql_text/generated_by/is_active) succeed or both roll back on error.
backend/app/validation/sqlglot_gate.py-63-65 (1)
63-65: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Return plain projected column names too.
`extract_columns()` only walks `exp.Alias`, so non-aliased outputs are dropped. For example, `SELECT count(*) AS total, host_id ...` comes back as `["total"]`, which will make any contract check built on this helper misread the actual result shape.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/validation/sqlglot_gate.py` around lines 63 - 65, extract_columns currently only collects names from exp.Alias nodes so plain projected columns are dropped; update extract_columns to iterate the SELECT projection expressions (use parsed.find_all(exp.Select) or parsed.find(exp.Select).expressions) and for each projection: if it's an exp.Alias use expr.alias, else if it's an exp.Column or exp.Identifier/exp.Named this use the column name (e.g., expr.name or expr.this.name), and as a fallback derive a sensible plain expression string; adjust references to parsed.find_all(exp.Alias) to this new loop so both aliased and non-aliased outputs are returned.
backend/app/agent/healer.py-130-160 (1)
130-160: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Fail the heal if the shadow run fails or the result shape drifts.
This block never checks `shadow_result.success`, and it still does not validate the returned columns against the panel contract before broadcasting `"complete"` and moving on to promotion. A query that passes `EXPLAIN` but fails at execution, or returns the wrong columns, can still be marked `"healed"` and pushed to the dashboard.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/agent/healer.py` around lines 130 - 160, The shadow run logic after calling execute_readonly(…) must verify execution succeeded and that the returned columns match the panel contract before broadcasting completion or promoting the healed SQL; update the block using shadow_result.success to detect execution failures (broadcast "validating_fix" status "failed" with error and return an error) and compare shadow_result.columns (or the executed result schema) against the panel's expected columns/contract (e.g., the panel contract object used elsewhere for panel_id); if the shape drifts broadcast a failure with details and return, otherwise continue and include row_count/plan_cost when broadcasting "complete".
backend/app/agent/rca.py-32-39 (1)
32-39: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Scope the event context to the incident being analyzed.
This query pulls the last 30 minutes of warning/critical events for the whole platform. That means a delayed RCA can miss the incident’s own events, while concurrent incidents can inject unrelated failures into the prompt. Use the incident’s start time and, if the schema supports it, the incident’s tenant/resources to bound this context.
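A sketch of what an incident-bounded query could look like. The table and timestamp column (`ops.events_norm`, `event_ts`) come from the review text; the `severity` column, the `resolved_at` key, and the `event_window` helper are assumptions made for illustration, and tenant/resource filters are left out because the schema is not shown here.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumed shape of the query; $1..$3 are asyncpg-style positional parameters.
EVENTS_SQL = """
SELECT * FROM ops.events_norm
WHERE severity IN ('warning', 'critical')
  AND event_ts >= $1
  AND event_ts <= $2
ORDER BY event_ts DESC
LIMIT $3
"""


def event_window(incident: dict, now: Optional[datetime] = None, limit: int = 50):
    """Return positional args for EVENTS_SQL bounded by the incident's lifetime."""
    now = now or datetime.now(timezone.utc)
    start = incident["started_at"]
    # An unresolved incident has no upper bound yet, so fall back to "now".
    end = incident.get("resolved_at") or now
    return (start, end, limit)
```

With an asyncpg connection this would be called roughly as `await conn.fetch(EVENTS_SQL, *event_window(incident))`, so a delayed RCA still sees the incident's own events instead of whatever happened in the last 30 minutes.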
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/agent/rca.py` around lines 32 - 39, The current conn.fetch SQL pulls the last 30 minutes of platform-wide events; change it to scope events to the incident being analyzed by replacing the fixed time window with the incident start time (use the incident.start_ts or incident.start_time parameter) and, if available in ops.events_norm, add WHERE clauses to filter by tenant_id and/or resource_id(s) (e.g., incident.tenant_id or incident.resource_ids) using parameterized placeholders passed into conn.fetch; keep ordering and limit but ensure the query binds the incident parameters so only events for that incident’s time range and tenant/resources are returned.
backend/app/agent/llm.py-34-95 (1)
34-95: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Add retry logic with failover to different deployments for transient OpenAI errors.
`_next_deployment()` only distributes requests across deployments; it provides no failover for failures. When `client.chat.completions.create()` encounters a 429, 5xx, or timeout, the request fails immediately without retrying on another deployment. Create a wrapper around the completion calls that catches transient OpenAI errors, retries remaining deployments, and only raises after all have been exhausted.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/agent/llm.py` around lines 34 - 95, The current calls to client.chat.completions.create in generate_sql, generate_narrative, and generate_json have no failover; add a retry wrapper function (e.g., call_with_deployment_failover) that: obtains the client via _get_client(), iterates through deployments returned by repeatedly calling _next_deployment() (or a snapshot list of deployments), calls client.chat.completions.create with the original kwargs, catches transient errors (HTTP 429, any 5xx, and network/timeout exceptions), logs or record the failure, then retries the same request on the next deployment until all deployments are exhausted; only re-raise the last caught error if all attempts fail. Use this wrapper inside generate_sql/generate_narrative/generate_json instead of calling client.chat.completions.create directly so failures automatically failover across deployments.
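The failover behaviour requested for `backend/app/agent/llm.py` can be sketched independently of the OpenAI SDK. Everything named here is hypothetical: `TransientLLMError` stands in for whichever 429 / 5xx / timeout exceptions the real client raises, and `call` stands in for a closure around `client.chat.completions.create(...)` that takes the deployment name.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class TransientLLMError(Exception):
    """Hypothetical stand-in for 429 / 5xx / timeout errors from the SDK."""


def call_with_failover(deployments: Sequence[str], call: Callable[[str], T]) -> T:
    """Try each deployment once, in order, and return the first success."""
    if not deployments:
        # Fail loudly on misconfiguration instead of raising None later.
        raise RuntimeError("no deployments configured")
    last_error = None
    for name in deployments:
        try:
            return call(name)
        except TransientLLMError as exc:
            last_error = exc  # remember the failure, then try the next deployment
    raise RuntimeError("all deployments failed") from last_error
```

Non-transient errors (for example, an invalid request) are deliberately not caught, so they surface immediately rather than being retried against every deployment.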
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 89f8202e-aa0f-4a45-9114-a103bc10f6f7
⛔ Files ignored due to path filters (9)
- backend/uv.lock is excluded by !**/*.lock
- frontend/package-lock.json is excluded by !**/package-lock.json
- frontend/public/favicon.svg is excluded by !**/*.svg
- frontend/public/fonts/HPE Graphik-Bold.otf is excluded by !**/*.otf
- frontend/public/fonts/HPE Graphik-Light.otf is excluded by !**/*.otf
- frontend/public/fonts/HPE Graphik-Medium.otf is excluded by !**/*.otf
- frontend/public/fonts/HPE Graphik-Regular.otf is excluded by !**/*.otf
- frontend/public/fonts/HPE Graphik-Semibold.otf is excluded by !**/*.otf
- frontend/public/icons.svg is excluded by !**/*.svg
📒 Files selected for processing (57)
- .gitignore
- README.md
- backend/.env.example
- backend/.python-version
- backend/app/__init__.py
- backend/app/agent/__init__.py
- backend/app/agent/healer.py
- backend/app/agent/llm.py
- backend/app/agent/prompts.py
- backend/app/agent/rca.py
- backend/app/agent/text_to_sql.py
- backend/app/config.py
- backend/app/db/__init__.py
- backend/app/db/engine.py
- backend/app/db/schema.sql
- backend/app/db/seed.py
- backend/app/demo/__init__.py
- backend/app/demo/prescripted.py
- backend/app/demo/router.py
- backend/app/demo/scenarios.py
- backend/app/incidents/__init__.py
- backend/app/incidents/router.py
- backend/app/ingestion/__init__.py
- backend/app/ingestion/normalizer.py
- backend/app/ingestion/router.py
- backend/app/main.py
- backend/app/panels/__init__.py
- backend/app/panels/router.py
- backend/app/topology/__init__.py
- backend/app/topology/router.py
- backend/app/validation/__init__.py
- backend/app/validation/executor.py
- backend/app/validation/sqlglot_gate.py
- backend/app/ws/__init__.py
- backend/app/ws/manager.py
- backend/pyproject.toml
- frontend/.gitignore
- frontend/README.md
- frontend/eslint.config.js
- frontend/index.html
- frontend/package.json
- frontend/src/App.jsx
- frontend/src/components/AIReasoningChain.jsx
- frontend/src/components/DemoControl.jsx
- frontend/src/components/EvidencePanel.jsx
- frontend/src/components/Header.jsx
- frontend/src/components/IncidentBanner.jsx
- frontend/src/components/MetricCards.jsx
- frontend/src/components/PanelHealth.jsx
- frontend/src/components/RCANarrative.jsx
- frontend/src/components/TimelineChart.jsx
- frontend/src/components/TopologyGraph.jsx
- frontend/src/hooks/useDashboardState.js
- frontend/src/hooks/useWebSocket.js
- frontend/src/index.css
- frontend/src/main.jsx
- frontend/vite.config.js
|
Tip: For best results, initiate chat on the files or code changes.
🧩 Analysis chain
🏁 Script executed:
Length of output: 6682
The review is still actively processing. Here's what's happening:
The |
Frontend:
- useWebSocket: env-driven WS URL with protocol-aware fallback
- IncidentBanner: reset elapsed on incident switch, stable dependency
- TimelineChart: default metricsTimeline=[] to prevent crash
- PanelHealth: default panels=[], preserve error-only healing entries
- TopologyGraph: widen useEffect deps to track content changes
- index.css: rename keyframes to kebab-case (slide-in, fade-in)
Backend:
- main.py: fix CORS wildcard+credentials violation, move disconnect to finally
- panels/router.py: HTTPException 404, catch asyncpg.PostgresError, atomic version switch
- incidents/router.py: HTTPException 404/422 for errors
- ws/manager.py: iterate snapshot, catch only expected exceptions
- demo/router.py: await task cancel, transaction reset, asyncio.Lock startup, HTTP 409
- demo/scenarios.py: atomic transaction for panel break
- config.py: remove hardcoded credentials, require env var
- __init__.py: deduplicate Settings (canonical in config.py)
- .env.example: replace real credentials with placeholders
- ingestion/normalizer.py: guard float parsing, normalize naive timestamps to UTC
- ingestion/router.py: add webhook API key authentication
- validation/executor.py: add statement_timeout + row cap guardrails
- db/seed.py: atomic seeding with advisory lock, transaction for panel inserts
- validation/sqlglot_gate.py: reject non-ops schema tables, fix extract_columns
- agent/healer.py: atomic promotion with row lock, check shadow result + contract
- agent/rca.py: scope events to incident start time
- agent/llm.py: retry with failover across deployments on transient errors
|
@CodeRabbit i made changes to pr check |
|
Tip: For best results, initiate chat on the files or code changes.
🧩 Analysis chain
🏁 Script executed:
Length of output: 65155
The CodeRabbit review has now completed and is marked ✅ SUCCESS. The review surfaced some important items to address before merging: 🔴 Critical (inline comments):
🟠 Major (32 comments), highlights include:
The merge state remains |
Actionable comments posted: 7
🧹 Nitpick comments (1)
backend/app/agent/llm.py (1)
74-77: ⚡ Quick win
Rename ambiguous variable `l` to `line`.
The single-letter `l` is easily confused with `1` or `I`. Ruff flagged this as E741.
Proposed fix
 if content.startswith("```"):
     lines = content.split("\n")
-    lines = [l for l in lines if not l.strip().startswith("```")]
+    lines = [line for line in lines if not line.strip().startswith("```")]
     content = "\n".join(lines).strip()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/agent/llm.py` around lines 74 - 77, The list comprehension using the single-letter variable `l` is ambiguous (Ruff E741); update the comprehension in the block that strips triple-backtick code fences so that `l` is renamed to `line` (i.e., change `lines = [l for l in lines if not l.strip().startswith("```")]` to use `line`), leaving the surrounding logic (`if content.startswith("```")`, `lines = content.split("\n")`, and `content = "\n".join(lines).strip()`) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/app/agent/healer.py`:
- Around line 31-34: The code currently reads active_query (via conn.fetchrow ->
active_query) before the long LLM/validation flow and then later runs the
promotion transaction which deactivates the prior active version and promotes
the repaired SQL; re-check the active version inside that promotion transaction
to avoid promoting a stale fix: within the same transactional block (the code
that deactivates and promotes), re-query ops.panel_query_versions for the
current active row (checking version_no) and compare it to the original
active_query.version_no, abort or raise/skip the promotion if they differ, or
base the deactivation/promotion on the freshly fetched row; update any use sites
referencing active_query.version_no in the transaction to use the re-fetched
value so you never deactivate/promote against an out-of-date base.
In `@backend/app/agent/llm.py`:
- Around line 39-64: The _call_with_failover function can raise a TypeError when
settings.azure_openai_deployments is empty because last_error stays None; add an
explicit guard at the start of _call_with_failover (after obtaining
settings/num_deployments) to detect num_deployments == 0 and raise a clear
exception (e.g., RuntimeError or ValueError) with a descriptive message about no
deployments configured; ensure the error mentions
settings.azure_openai_deployments so callers and logs can diagnose the
misconfiguration instead of hitting raise last_error later.
In `@backend/app/agent/rca.py`:
- Around line 32-39: The events query in conn.fetch inside rca.py only filters
with incident["started_at"], so later unrelated critical/warning events can leak
in; update the SQL in the conn.fetch call (the code that assigns events) to also
constrain event_ts <= incident["ended_at"] (or a safe fallback if ended_at is
null), and add incident["ended_at"] as the additional query parameter so the RCA
time window is bounded by both incident["started_at"] and incident["ended_at"]
when fetching events.
In `@backend/app/incidents/router.py`:
- Around line 65-67: The handler currently does "question = body.get('question',
'')" and only checks falsiness, allowing non-strings or whitespace-only strings
through; update the validation to first ensure the value is a string
(isinstance(question, str)), then trim it (question = question.strip()) and
raise HTTPException(status_code=422, detail="question is required") if the
trimmed value is empty or not a string before calling the investigation/dispatch
logic (i.e., around the variable question in this router function).
In `@backend/app/ingestion/router.py`:
- Line 29: The webhook handlers for /alerts, /metrics, and /events call
Request.json() directly and don't handle json.JSONDecodeError; add a helper
async function named _read_json(request: Request) that wraps await
request.json() in a try/except catching json.JSONDecodeError and raises
HTTPException(status_code=400, detail="Malformed JSON body"), then replace each
direct call payload = await request.json() in the alert/metric/event handler
functions with payload = await _read_json(request) so malformed JSON returns a
400 instead of an internal 500.
In `@backend/app/validation/executor.py`:
- Around line 21-33: The call site in healer.py that passes the fetched
broken_sql directly into explain_query() must first run the sql through the
existing SQL validation (sqlglot_gate / validate_sql); update the healer.py flow
where broken_sql is used (the call near the explain_query invocation) to call
validate_sql(broken_sql) or sqlglot_gate(broken_sql) and, if validation fails,
short-circuit and return or propagate an ExecutionResult-style failure (include
the validation error message) instead of calling explain_query; only call
explain_query(broken_sql) when validation passes.
In `@frontend/src/index.css`:
- Around line 394-395: Replace the deprecated CSS declaration `word-break:
break-word` (found alongside `white-space: pre-wrap`) with the
standards-compliant `overflow-wrap: anywhere`; remove `word-break: break-word`
and add `overflow-wrap: anywhere` in the same rule so long, unbreakable tokens
wrap correctly in preformatted blocks.
---
Nitpick comments:
In `@backend/app/agent/llm.py`:
- Around line 74-77: The list comprehension using the single-letter variable `l`
is ambiguous (Ruff E741); update the comprehension in the block that strips
triple-backtick code fences so that `l` is renamed to `line` (i.e., change
`lines = [l for l in lines if not l.strip().startswith("```")]` to use `line`),
leaving the surrounding logic (`if content.startswith("```")`, `lines =
content.split("\n")`, and `content = "\n".join(lines).strip()`) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 8681230d-84e6-43e9-92b5-2f19c02555c1
📒 Files selected for processing (23)
- backend/.env.example
- backend/app/__init__.py
- backend/app/agent/healer.py
- backend/app/agent/llm.py
- backend/app/agent/rca.py
- backend/app/config.py
- backend/app/db/seed.py
- backend/app/demo/router.py
- backend/app/demo/scenarios.py
- backend/app/incidents/router.py
- backend/app/ingestion/normalizer.py
- backend/app/ingestion/router.py
- backend/app/main.py
- backend/app/panels/router.py
- backend/app/validation/executor.py
- backend/app/validation/sqlglot_gate.py
- backend/app/ws/manager.py
- frontend/src/components/IncidentBanner.jsx
- frontend/src/components/PanelHealth.jsx
- frontend/src/components/TimelineChart.jsx
- frontend/src/components/TopologyGraph.jsx
- frontend/src/hooks/useWebSocket.js
- frontend/src/index.css
✅ Files skipped from review due to trivial changes (1)
- backend/app/ws/manager.py
🚧 Files skipped from review as they are similar to previous changes (10)
- frontend/src/hooks/useWebSocket.js
- backend/app/config.py
- frontend/src/components/IncidentBanner.jsx
- frontend/src/components/PanelHealth.jsx
- frontend/src/components/TimelineChart.jsx
- frontend/src/components/TopologyGraph.jsx
- backend/app/demo/router.py
- backend/app/main.py
- backend/app/ingestion/normalizer.py
- backend/app/panels/router.py
- healer.py: re-check active version inside promotion transaction to prevent stale fix; validate broken_sql through sqlglot_gate before explain_query
- llm.py: guard against empty deployments list; rename l -> line (Ruff E741)
- rca.py: bound event query with resolved_at upper limit
- incidents/router.py: type-check and strip question input
- ingestion/router.py: add _read_json helper for 400 on malformed JSON
- index.css: replace deprecated word-break: break-word with overflow-wrap: anywhere
🧹 Nitpick comments (4)
backend/app/ingestion/router.py (3)
5-5: 💤 Low value
Unused import: `hashlib`.
The `hashlib` module is imported but never used. `hmac.compare_digest` doesn't require it.
🧹 Proposed fix
     import json
     import hmac
    -import hashlib
     from fastapi import APIRouter, Request, HTTPException
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/ingestion/router.py` at line 5, Remove the unused import of hashlib from the top of the file (it is currently imported but never used); leave the existing usage of hmac.compare_digest intact and only delete the line "import hashlib" so there are no unused imports in router.py.
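To illustrate why `hashlib` is not needed here: `hmac.compare_digest` performs a constant-time comparison on its own. The sketch below is a minimal example under assumptions; the helper name `secret_matches` and the idea that the router compares a shared secret are hypothetical, since the actual call site is not shown in this thread.

```python
import hmac


def secret_matches(provided: str, expected: str) -> bool:
    """Hypothetical helper: constant-time comparison of two secrets."""
    # compare_digest accepts two bytes-like objects or two ASCII-only str
    # values; encoding to bytes avoids a TypeError on non-ASCII input.
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

Only `import hmac` is required; `hashlib` would come into play only if a digest were being computed as well.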
26-31: ⚡ Quick win
Use `raise ... from None` to suppress exception chaining.
Per Ruff B904, raising a new exception inside an `except` block should use `from` to clarify the exception chain. Here, suppressing the original `JSONDecodeError` is appropriate since you're converting it to an HTTP response.
🧹 Proposed fix
     async def _read_json(request: Request) -> dict:
         """Parse JSON body, returning 400 for malformed payloads."""
         try:
             return await request.json()
         except json.JSONDecodeError:
    -        raise HTTPException(status_code=400, detail="Malformed JSON body")
    +        raise HTTPException(status_code=400, detail="Malformed JSON body") from None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/ingestion/router.py` around lines 26 - 31, In _read_json, when catching json.JSONDecodeError and converting it to an HTTPException you should suppress exception chaining; replace the bare raise with raising HTTPException from None (i.e., raise HTTPException(status_code=400, detail="Malformed JSON body") from None) so the original JSONDecodeError is not chained into the HTTP response; update the except block in function _read_json accordingly.
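A minimal, self-contained sketch of what `from None` changes at runtime. The `HTTPException` class here is a stand-in so the snippet runs without FastAPI, and `parse_body` is a hypothetical synchronous analogue of `_read_json`; both are assumptions for illustration only.

```python
import json


class HTTPException(Exception):
    """Stand-in for fastapi.HTTPException so the sketch runs without FastAPI."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def parse_body(raw: str) -> dict:
    """Hypothetical sync analogue of _read_json: 400 on malformed JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # `from None` sets __suppress_context__, so tracebacks omit the
        # "During handling of the above exception..." section.
        raise HTTPException(status_code=400, detail="Malformed JSON body") from None


def try_parse(raw: str):
    """Return the raised HTTPException (or None) so it can be inspected."""
    try:
        parse_body(raw)
    except HTTPException as exc:
        return exc
    return None
```

Inspecting `try_parse("{not json")` shows `__cause__` is `None` and `__suppress_context__` is `True`, which is the behaviour Ruff B904 asks the author to make explicit.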
41-42: 💤 Low value
Simplify the await pattern.
The two-line pattern is awkward and appears to be a refactoring artifact. Combine into a single line for clarity.
🧹 Proposed fix
    -    result = normalize_alert(pool, payload)
    -    result = await result
    +    result = await normalize_alert(pool, payload)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/ingestion/router.py` around lines 41 - 42, The code uses an unnecessary two-step await: first assigning result = normalize_alert(pool, payload) then awaiting it on the next line; replace both lines with a single awaited call by assigning result = await normalize_alert(pool, payload) so the coroutine is awaited immediately (update the occurrence where normalize_alert is called and the variable result is set).
backend/app/agent/healer.py (1)
9-9: 💤 Low value
Unused import: `extract_columns`.
The `extract_columns` function is imported but never used in this module.
🧹 Proposed fix
    -from app.validation.sqlglot_gate import validate_sql, extract_columns
    +from app.validation.sqlglot_gate import validate_sql
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/agent/healer.py` at line 9, The import line in healer.py brings in extract_columns from app.validation.sqlglot_gate but that symbol is unused; remove extract_columns from the import so only validate_sql is imported (i.e., change the import statement referencing validate_sql, removing extract_columns) to eliminate the unused import warning and keep imports minimal.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@backend/app/agent/healer.py`:
- Line 9: The import line in healer.py brings in extract_columns from
app.validation.sqlglot_gate but that symbol is unused; remove extract_columns
from the import so only validate_sql is imported (i.e., change the import
statement referencing validate_sql, removing extract_columns) to eliminate the
unused import warning and keep imports minimal.
In `@backend/app/ingestion/router.py`:
- Line 5: Remove the unused import of hashlib from the top of the file (it is
currently imported but never used); leave the existing usage of
hmac.compare_digest intact and only delete the line "import hashlib" so there
are no unused imports in router.py.
- Around line 26-31: In _read_json, when catching json.JSONDecodeError and
converting it to an HTTPException you should suppress exception chaining;
replace the bare raise with raising HTTPException from None (i.e., raise
HTTPException(status_code=400, detail="Malformed JSON body") from None) so the
original JSONDecodeError is not chained into the HTTP response; update the
except block in function _read_json accordingly.
- Around line 41-42: The code uses an unnecessary two-step await: first
assigning result = normalize_alert(pool, payload) then awaiting it on the next
line; replace both lines with a single awaited call by assigning result = await
normalize_alert(pool, payload) so the coroutine is awaited immediately (update
the occurrence where normalize_alert is called and the variable result is set).
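The await simplification in the last item can be checked with a small sketch. The `normalize_alert` below is a hypothetical stand-in that just echoes its payload (the real normalizer is not shown in this thread); it only demonstrates that the two-step and one-step forms behave the same.

```python
import asyncio


async def normalize_alert(pool, payload: dict) -> dict:
    """Hypothetical stand-in for the real normalizer; echoes the payload."""
    await asyncio.sleep(0)  # simulate an async DB round trip
    return {"normalized": True, **payload}


async def two_step(pool, payload: dict) -> dict:
    # Original pattern: `result` briefly holds a coroutine object, not a dict.
    result = normalize_alert(pool, payload)
    result = await result
    return result


async def one_step(pool, payload: dict) -> dict:
    # Proposed pattern: await the call directly.
    return await normalize_alert(pool, payload)
```

Both forms return the same value; the single-line version simply avoids rebinding `result` from a coroutine to its awaited value.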
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: dda5c78f-5321-4ee4-b11e-daa1b6120eb3
📒 Files selected for processing (6)
- backend/app/agent/healer.py
- backend/app/agent/llm.py
- backend/app/agent/rca.py
- backend/app/incidents/router.py
- backend/app/ingestion/router.py
- frontend/src/index.css
🚧 Files skipped from review as they are similar to previous changes (3)
- backend/app/incidents/router.py
- backend/app/agent/llm.py
- frontend/src/index.css
@CodeRabbit I have made changes in the PR, please check now.
Your two fix commits (
These are low-priority and non-blocking. The PR is ready for a human reviewer to approve and merge! 🐇
#1 🔴 DemoControl: store poll interval in useRef, clear on unmount
#2 🔴 App: wrap triggerGlow in useCallback, add to all useEffect deps
#3 🟡 DemoControl: fetch /api/demo/status on mount to sync completed
#4 🟡 App: track lastGlowedRef to scroll to most-recent section
#5 🔵 router.py: split long HTTPException raise across lines
#6 🔵 DemoControl: restore talking point display in demo dock
#7 🔵 index.css: remove redundant scroll-behavior: smooth
Review fixes:
#1 metricsTimeline already in initialState (no change needed; reviewer error)
#2 retryCount now uses useState for reactive updates (not stale ref)
#3 EventLog adds 5s periodic tick to keep relative timestamps fresh
#4 /metrics-baseline query capped with LIMIT 5000
#5 Incident 4 → INC-004 (clean numbering: INC-001, INC-002, INC-004)
#6 KPI card key includes status → React remounts on transition, re-triggering animation
#7 Removed redundant '|| panels.length === 0' from PanelHealth render
#8 Impact line hidden when $0, shows 'calculating…' during first seconds
Creating PR to merge feat into main
Summary by CodeRabbit
New Features
Documentation
Chores