Conversation
2 issues found across 24 files
Confidence score: 3/5
- Async alert persistence can be lost in src/health/health_log.c when the SQL fallback is skipped for NULL stmts, so alerts may silently disappear if the queue is full or shutting down.
- worker_is_busy(opcode) in src/health/health_event_loop_uv.c uses mismatched opcodes, so utilization stats and job labeling are likely inaccurate or conflicting.
- Pay close attention to src/health/health_log.c and src/health/health_event_loop_uv.c: alert persistence gaps and opcode mismatches need verification.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/health/health_event_loop_uv.c">
<violation number="1" location="src/health/health_event_loop_uv.c:575">
P2: worker_is_busy(opcode) uses opcode values that don't correspond to the registered WORKER_HEALTH_JOB_* IDs, so worker utilization/job labeling will be incorrect and may conflict with per-handler worker_is_busy() calls.</violation>
</file>
<file name="src/health/health_log.c">
<violation number="1" location="src/health/health_log.c:49">
P2: Async alert saves can be dropped when the queue is full or shutting down because the fallback SQL save is skipped if stmts is NULL. rrdcalc callers pass NULL, so a failed enqueue now silently loses the alert transition. Consider allowing the fallback even when stmts is NULL (the SQL layer already handles NULL by preparing ad‑hoc statements).</violation>
</file>
Architecture diagram
sequenceDiagram
participant Sys as System / Init
participant Loop as NEW: Health UV Loop
participant Worker as NEW: Parallel Worker
participant Pool as NEW: Stmt Pool
participant Logic as Health Logic
participant DB as SQLite DB
Note over Sys,DB: Initialization Phase
Sys->>Loop: health_event_loop_init()
Loop->>Pool: Allocate Stmt Sets (max concurrent workers)
Loop->>Loop: Start UV Timer (1s interval)
Note over Sys,DB: Runtime Event Loop
loop Every Timer Tick
Loop->>Loop: Check hosts ready for processing
Loop->>Worker: NEW: uv_queue_work(host)
activate Worker
Note right of Worker: Executed in thread pool
Worker->>Pool: health_stmt_set_acquire()
Pool-->>Worker: Exclusive Stmt Set
Worker->>Logic: health_event_loop_for_host(host, stmts)
activate Logic
Logic->>Logic: Evaluate RRDCalc expressions
alt Alert State Change
Logic->>DB: CHANGED: Insert/Update (using passed Stmt Set)
Note right of DB: Pre-prepared statements avoid contention
opt Notification Required
Logic->>DB: Check last executed event
Logic->>Logic: Execute Notification Script
end
Logic->>DB: CHANGED: Update Log/Queue (using Stmt Set)
end
Logic-->>Worker: Return next_run time
deactivate Logic
Worker->>Pool: health_stmt_set_release()
deactivate Worker
Worker-->>Loop: On Work Complete
end
Note over Loop,DB: Async Operations (Main Thread)
opt Pending Alert Transitions / Deletions
Loop->>Loop: Process Command Queue
Loop->>DB: Save/Delete (using main_loop_stmts)
end
Note over Sys,DB: Shutdown Phase
Sys->>Loop: NEW: health_event_loop_shutdown()
Loop->>Loop: Stop Timer
Loop->>Worker: Wait for active workers
Loop->>Pool: Finalize all Stmts
Loop-->>Sys: Shutdown Complete
Pull request overview
This PR modernizes the health subsystem by replacing the legacy static HEALTH thread with a libuv-based event loop that supports parallel host processing. The new architecture uses configurable concurrent workers (default: min(4, CPUs)), per-worker prepared statement pools to avoid contention, and asynchronous queues for alert saves and deletions.
Changes:
- Added UV-based health event loop (health_event_loop_uv.c/h) with timer-driven scheduling and per-host parallel processing
- Refactored health processing to be callable from worker threads with explicit statement set parameters
- Replaced metadata queue alert handling with health-specific async queues for saves and deletions
- Updated all health and SQLite health functions to accept statement set parameters for thread-safe database operations
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| src/health/health_event_loop_uv.c/h | New UV event loop implementation with worker pool, statement sets, and async alert processing queues |
| src/health/health_event_loop.c | Refactored per-host processing to be called from UV workers; removed legacy thread main function |
| src/health/health_log.c | Updated to use health-specific async queues; added direct free function for alert entries |
| src/health/health_notifications.c | Added statement set parameters to notification functions |
| src/health/health_prototypes.c | Replaced is_health_thread/SERVICE_HEALTH checks with health_should_stop() |
| src/health/rrdcalc.c | Added NULL statement set parameters to alert log calls during linking/unlinking |
| src/health/health.c | Added concurrent workers configuration (default min(4, CPUs)) |
| src/health/health_internals.h | Added statement set forward declaration and updated function signatures |
| src/health/health-alert-log.h | Added per-host scheduling fields (next_run, processing) |
| src/database/sqlite/sqlite_health.c | Refactored to use statement sets instead of thread-local storage for prepared statements |
| src/database/sqlite/sqlite_aclk_alert.c | Updated alert queue processing to accept statement sets |
| src/database/sqlite/sqlite_metadata.c | Removed old alert transition queue handling (moved to health event loop) |
| src/database/sqlite/sqlite_functions.c | Commented out cross-thread statement finalization to prevent crashes |
| src/database/rrd.c | Added health_event_loop_init() call during RRD initialization |
| src/daemon/daemon-shutdown.c | Added health_event_loop_shutdown() before streaming thread cancellation |
| src/daemon/static_threads.c | Removed HEALTH static thread entry |
  health_log_alert(host, ae);
- health_alarm_log_add_entry(host, ae, true);
+ health_alarm_log_add_entry(host, ae, true, NULL);
The function signature change adds a struct health_stmt_set *stmts parameter but passes NULL when calling this function. According to the implementation in health_log.c, passing NULL for stmts causes the code to fall back to ad-hoc statement preparation in sql_health_alarm_log_save(). However, this fallback is only safe when called outside the worker threads.
Since this code is executed during rrdcalc unlinking operations (which can happen on various threads), verify that these calls are never made from worker threads where prepared statements from the pool should be used instead of NULL.
health_alarm_log_add_entry(host, ae, true, NULL);
Pvoid_t *Pvalue = JudyLIns(&config->pending_alerts->JudyL,
                           ++config->pending_alerts->count, PJE0);
if (unlikely(Pvalue == PJERR))
    fatal("HEALTH: Failed to insert host into pending_alerts Judy array");
*Pvalue = (void *)host;

Pvalue = JudyLIns(&config->pending_alerts->JudyL,
                  ++config->pending_alerts->count, PJE0);
if (unlikely(Pvalue == PJERR))
    fatal("HEALTH: Failed to insert ae into pending_alerts Judy array");
*Pvalue = (void *)ae;
The counter pending_alerts->count is used as the Judy array index when inserting host and ae entries. Since two entries are inserted per alert (host at count+1, ae at count+2), and the loop processes pairs by incrementing Index and fetching the next value, there's a risk of index misalignment if any insertion fails or if the count is odd.
Consider using a more explicit pairing mechanism or adding validation to ensure the Judy array always contains an even number of entries and pairs are properly aligned.
Suggested change:
- Pvoid_t *Pvalue = JudyLIns(&config->pending_alerts->JudyL,
-                            ++config->pending_alerts->count, PJE0);
- if (unlikely(Pvalue == PJERR))
-     fatal("HEALTH: Failed to insert host into pending_alerts Judy array");
- *Pvalue = (void *)host;
- Pvalue = JudyLIns(&config->pending_alerts->JudyL,
-                   ++config->pending_alerts->count, PJE0);
- if (unlikely(Pvalue == PJERR))
-     fatal("HEALTH: Failed to insert ae into pending_alerts Judy array");
- *Pvalue = (void *)ae;
+ // Insert host/ae as an explicit pair to keep indices aligned.
+ size_t base_index = config->pending_alerts->count + 1;
+ Pvoid_t *Pvalue = JudyLIns(&config->pending_alerts->JudyL, base_index, PJE0);
+ if (unlikely(Pvalue == PJERR))
+     fatal("HEALTH: Failed to insert host into pending_alerts Judy array");
+ *Pvalue = (void *)host;
+ Pvalue = JudyLIns(&config->pending_alerts->JudyL, base_index + 1, PJE0);
+ if (unlikely(Pvalue == PJERR))
+     fatal("HEALTH: Failed to insert ae into pending_alerts Judy array");
+ *Pvalue = (void *)ae;
+ // Only advance the count after both insertions succeed to preserve pairing.
+ config->pending_alerts->count += 2;
// finalize_self_prepared_sql_statements() in their cleanup handlers.
// Forcefully finalizing statements from other threads can cause crashes
// if those threads are still using them during shutdown.
// SQLite will clean up any remaining statements when the database closes.
The comment claims SQLite will clean up any remaining statements when the database closes. However, this is not always guaranteed to be safe, especially if statements are still in use by other threads at the time of database closure.
SQLite's sqlite3_close() will fail with SQLITE_BUSY if there are unfinalized prepared statements, and sqlite3_close_v2() will delay the close until all statements are finalized, but this can still lead to crashes if threads are accessing those statements concurrently.
The safer approach would be to ensure each thread properly finalizes its own statements before shutdown completes, rather than relying on SQLite's cleanup. Consider adding explicit synchronization to ensure all worker threads have completed and finalized their statements before closing the database.
Suggested change:
- // SQLite will clean up any remaining statements when the database closes.
+ // Do NOT rely on SQLite to safely clean up statements that are still in use
+ // by other threads when the database is closed. All worker threads must
+ // have completed and finalized their own statements before this function
+ // is called. finalize_all_prepared_sql_statements() is intentionally not
+ // called here to avoid touching statements owned by other threads.
// Finalize all prepared statements in the pool
health_finalize_all_statements(config);
The shutdown sequence waits for active workers to complete with a 60-second timeout. However, if workers are still active after the timeout (line 695-698), the code continues to finalize statements (line 706) even though those workers may still be using them.
This could lead to crashes if worker threads are still executing database operations when their statement sets are finalized. Consider either:
- Forcing cancellation of long-running workers after a reasonable timeout
- Adding a hard requirement that all workers must complete before proceeding
- Documenting why it's safe to proceed (e.g., workers check shutdown_requested and will fail gracefully)
Suggested change:
- // Finalize all prepared statements in the pool
- health_finalize_all_statements(config);
+ // Finalize all prepared statements in the pool only if no workers are active.
+ // This avoids finalizing statements that might still be in use by worker threads.
+ if (__atomic_load_n(&config->active_workers, __ATOMIC_RELAXED) == 0) {
+     health_finalize_all_statements(config);
+ }
+ else {
+     nd_log(NDLS_DAEMON, NDLP_WARNING,
+            "HEALTH: Skipping statement finalization because %zu workers are still active",
+            __atomic_load_n(&config->active_workers, __ATOMIC_RELAXED));
+ }
// Each thread is responsible for finalizing its own statements via
// finalize_self_prepared_sql_statements() in their cleanup handlers.
// Forcefully finalizing statements from other threads can cause crashes
// if those threads are still using them during shutdown.
// SQLite will clean up any remaining statements when the database closes.
// finalize_all_prepared_sql_statements();
The comment removed from the old code mentioned crossing thread boundaries when finalizing SQL statements. The new approach uses a pool of prepared statements that are allocated and finalized by the health event loop thread.
However, there's a potential issue: if a worker thread is still using a statement when shutdown occurs and health_finalize_all_statements() is called (even after the 60-second timeout), the statement could be finalized while the worker is still accessing it in a different thread (the libuv thread pool worker), leading to crashes.
The mitigation relies on workers checking shutdown_requested and exiting early. Verify that all database operations in health_event_loop_for_host() and related functions properly check health_should_stop() before and during database operations.
// Fallback to synchronous save if queue failed (full or shutting down).
// sql_health_alarm_log_save() handles stmts==NULL by preparing ad-hoc statements.
if (!queued && !health_should_stop())
The comment states "sql_health_alarm_log_save() handles stmts==NULL by preparing ad-hoc statements." However, when stmts is NULL, the code in sqlite_health.c prepares statements without thread-local storage protection. This means:
- If multiple threads call with stmts==NULL concurrently, they could race on statement preparation
- The fallback to NULL should only be used from non-health threads where prepared statement pools aren't available
The current logic is: if queue is full or shutting down, fall back to synchronous save with NULL stmts. But this could be called from a health worker thread. Consider passing the worker's stmts to the synchronous save instead of NULL, or document why this fallback path will never be reached from a worker thread.
Suggested change:
- // Fallback to synchronous save if queue failed (full or shutting down).
- // sql_health_alarm_log_save() handles stmts==NULL by preparing ad-hoc statements.
- if (!queued && !health_should_stop())
+ // Fallback to synchronous save if queue failed (full or shutting down),
+ // but only when a prepared statement set is available for this thread.
+ // This avoids invoking sql_health_alarm_log_save() with stmts == NULL
+ // from multiple threads, which would prepare ad-hoc statements without
+ // thread-local storage protection.
+ if (!queued && !health_should_stop() && stmts)
while(ae) {
    // Check if entry is old enough to be deleted
    if(ae->when < now - retention &&
// Use ACQUIRE for pending_save_count to synchronize with RELEASE in health_process_pending_alerts()
The memory order for the atomic load at line 353 uses __ATOMIC_ACQUIRE to synchronize with the RELEASE in health_process_pending_alerts(). However, the comment at line 350-351 suggests this is to ensure save operations are complete, but the actual synchronization point is the decrement of pending_save_count with RELEASE in health_process_pending_alerts().
The ACQUIRE/RELEASE pairing ensures visibility, but there's still a window between the load and the free where new save operations could be queued. The design relies on shutdown_requested preventing new saves from being queued. Verify that all paths that queue saves check shutdown_requested, or document this assumption clearly.
Suggested change:
- // Use ACQUIRE for pending_save_count to synchronize with RELEASE in health_process_pending_alerts()
+ // The ACQUIRE load on pending_save_count pairs with the RELEASE decrement in health_process_pending_alerts(),
+ // ensuring that all prior updates performed before the decrement are visible here when the count reaches zero.
+ // Correctness also relies on higher-level logic (e.g. shutdown_requested) preventing new saves from being queued
+ // for this entry once shutdown is in progress; any code that queues saves must respect that contract.
src/health/health_event_loop_uv.c
Outdated
nd_log(NDLS_DAEMON, NDLP_DEBUG,
       "HEALTH: Stored %zu alert transitions in %.2f ms",
       entries / 2, (double)(ended - started) / USEC_PER_MS);
The division calculation entries / 2 is used to report the number of alert transitions processed. However, if pending->count is odd (which shouldn't happen but could due to the index pairing issue mentioned earlier), this will produce an incorrect count.
Consider adding a validation that entries (which equals pending->count) is even, or use a more robust counting mechanism that tracks actual pairs processed rather than relying on division by 2.
src/health/health_event_loop_uv.c
Outdated
void health_event_loop_init(void) {
    memset(&health_config, 0, sizeof(health_config));
    completion_init(&health_config.start_stop_complete);

    health_config.thread = nd_thread_create(HEALTH_EVENT_LOOP_NAME, NETDATA_THREAD_OPTION_DEFAULT,
                                            health_event_loop, &health_config);
    fatal_assert(NULL != health_config.thread);

    // Wait for initialization to complete
    completion_wait_for(&health_config.start_stop_complete);
    completion_reset(&health_config.start_stop_complete);

    nd_log(NDLS_DAEMON, NDLP_INFO, "HEALTH: event loop initialized");
The health_config global is zero-initialized via memset at line 722, and then a completion object is initialized. However, if health_event_loop_init() is called multiple times (e.g., during tests or if there's an error path that retries initialization), this could lead to issues:
- The memset would destroy the existing thread pointer and completion object
- A second thread could be created while the first is still running
Consider adding a guard to prevent multiple initializations, similar to how health_plugin_init() uses a spinlock and initialization.done flag to ensure it only runs once.
  health_log_alert(host, ae);
- health_alarm_log_add_entry(host, ae, true);
+ health_alarm_log_add_entry(host, ae, true, NULL);
The function signature change adds a struct health_stmt_set *stmts parameter but passes NULL when calling this function. According to the implementation in health_log.c, passing NULL for stmts causes the code to fall back to ad-hoc statement preparation in sql_health_alarm_log_save(). However, this fallback is only safe when called outside the worker threads.
Since this code is executed during rrdcalc linking/unlinking operations (which can happen on various threads), verify that these calls are never made from worker threads where prepared statements from the pool should be used instead of NULL.
Force-pushed from 546297e to 6a86cd5.
Summary by cubic
Switches the health subsystem to a libuv-based event loop with configurable parallel workers, and adds async alert persistence with lifecycle protection and safer shutdown, improving reliability and notification flow.
New Features
Refactors
Written for commit 6a86cd5. Summary will update on new commits.