Conversation
2 issues found across 24 files
Confidence score: 3/5
- Async alert persistence can be lost in src/health/health_log.c when the SQL fallback is skipped for NULL stmts, so alerts may silently disappear if the queue is full or shutting down.
- worker_is_busy(opcode) in src/health/health_event_loop_uv.c uses mismatched opcodes, so utilization stats and job labeling are likely inaccurate or conflicting.
- Pay close attention to src/health/health_log.c and src/health/health_event_loop_uv.c: alert persistence gaps and opcode mismatches need verification.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/health/health_event_loop_uv.c">
<violation number="1" location="src/health/health_event_loop_uv.c:575">
P2: worker_is_busy(opcode) uses opcode values that don't correspond to the registered WORKER_HEALTH_JOB_* IDs, so worker utilization/job labeling will be incorrect and may conflict with per-handler worker_is_busy() calls.</violation>
</file>
<file name="src/health/health_log.c">
<violation number="1" location="src/health/health_log.c:49">
P2: Async alert saves can be dropped when the queue is full or shutting down because the fallback SQL save is skipped if stmts is NULL. rrdcalc callers pass NULL, so a failed enqueue now silently loses the alert transition. Consider allowing the fallback even when stmts is NULL (the SQL layer already handles NULL by preparing ad‑hoc statements).</violation>
</file>
Architecture diagram
sequenceDiagram
participant Sys as System / Init
participant Loop as NEW: Health UV Loop
participant Worker as NEW: Parallel Worker
participant Pool as NEW: Stmt Pool
participant Logic as Health Logic
participant DB as SQLite DB
Note over Sys,DB: Initialization Phase
Sys->>Loop: health_event_loop_init()
Loop->>Pool: Allocate Stmt Sets (max concurrent workers)
Loop->>Loop: Start UV Timer (1s interval)
Note over Sys,DB: Runtime Event Loop
loop Every Timer Tick
Loop->>Loop: Check hosts ready for processing
Loop->>Worker: NEW: uv_queue_work(host)
activate Worker
Note right of Worker: Executed in thread pool
Worker->>Pool: health_stmt_set_acquire()
Pool-->>Worker: Exclusive Stmt Set
Worker->>Logic: health_event_loop_for_host(host, stmts)
activate Logic
Logic->>Logic: Evaluate RRDCalc expressions
alt Alert State Change
Logic->>DB: CHANGED: Insert/Update (using passed Stmt Set)
Note right of DB: Pre-prepared statements avoid contention
opt Notification Required
Logic->>DB: Check last executed event
Logic->>Logic: Execute Notification Script
end
Logic->>DB: CHANGED: Update Log/Queue (using Stmt Set)
end
Logic-->>Worker: Return next_run time
deactivate Logic
Worker->>Pool: health_stmt_set_release()
deactivate Worker
Worker-->>Loop: On Work Complete
end
Note over Loop,DB: Async Operations (Main Thread)
opt Pending Alert Transitions / Deletions
Loop->>Loop: Process Command Queue
Loop->>DB: Save/Delete (using main_loop_stmts)
end
Note over Sys,DB: Shutdown Phase
Sys->>Loop: NEW: health_event_loop_shutdown()
Loop->>Loop: Stop Timer
Loop->>Worker: Wait for active workers
Loop->>Pool: Finalize all Stmts
Loop-->>Sys: Shutdown Complete
Pull request overview
This PR modernizes the health subsystem by replacing the legacy static HEALTH thread with a libuv-based event loop that supports parallel host processing. The new architecture uses configurable concurrent workers (default: min(4, CPUs)), per-worker prepared statement pools to avoid contention, and asynchronous queues for alert saves and deletions.
Changes:
- Added UV-based health event loop (health_event_loop_uv.c/h) with timer-driven scheduling and per-host parallel processing
- Refactored health processing to be callable from worker threads with explicit statement set parameters
- Replaced metadata queue alert handling with health-specific async queues for saves and deletions
- Updated all health and SQLite health functions to accept statement set parameters for thread-safe database operations
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| src/health/health_event_loop_uv.c/h | New UV event loop implementation with worker pool, statement sets, and async alert processing queues |
| src/health/health_event_loop.c | Refactored per-host processing to be called from UV workers; removed legacy thread main function |
| src/health/health_log.c | Updated to use health-specific async queues; added direct free function for alert entries |
| src/health/health_notifications.c | Added statement set parameters to notification functions |
| src/health/health_prototypes.c | Replaced is_health_thread/SERVICE_HEALTH checks with health_should_stop() |
| src/health/rrdcalc.c | Added NULL statement set parameters to alert log calls during linking/unlinking |
| src/health/health.c | Added concurrent workers configuration (default min(4, CPUs)) |
| src/health/health_internals.h | Added statement set forward declaration and updated function signatures |
| src/health/health-alert-log.h | Added per-host scheduling fields (next_run, processing) |
| src/database/sqlite/sqlite_health.c | Refactored to use statement sets instead of thread-local storage for prepared statements |
| src/database/sqlite/sqlite_aclk_alert.c | Updated alert queue processing to accept statement sets |
| src/database/sqlite/sqlite_metadata.c | Removed old alert transition queue handling (moved to health event loop) |
| src/database/sqlite/sqlite_functions.c | Commented out cross-thread statement finalization to prevent crashes |
| src/database/rrd.c | Added health_event_loop_init() call during RRD initialization |
| src/daemon/daemon-shutdown.c | Added health_event_loop_shutdown() before streaming thread cancellation |
| src/daemon/static_threads.c | Removed HEALTH static thread entry |
  health_log_alert(host, ae);
- health_alarm_log_add_entry(host, ae, true);
+ health_alarm_log_add_entry(host, ae, true, NULL);
The function signature change adds a struct health_stmt_set *stmts parameter but passes NULL when calling this function. According to the implementation in health_log.c, passing NULL for stmts causes the code to fall back to ad-hoc statement preparation in sql_health_alarm_log_save(). However, this fallback is only safe when called outside the worker threads.
Since this code is executed during rrdcalc unlinking operations (which can happen on various threads), verify that these calls are never made from worker threads where prepared statements from the pool should be used instead of NULL.
health_alarm_log_add_entry(host, ae, true, NULL);
Pvoid_t *Pvalue = JudyLIns(&config->pending_alerts->JudyL,
                           ++config->pending_alerts->count, PJE0);
if (unlikely(Pvalue == PJERR))
    fatal("HEALTH: Failed to insert host into pending_alerts Judy array");
*Pvalue = (void *)host;

Pvalue = JudyLIns(&config->pending_alerts->JudyL,
                  ++config->pending_alerts->count, PJE0);
if (unlikely(Pvalue == PJERR))
    fatal("HEALTH: Failed to insert ae into pending_alerts Judy array");
*Pvalue = (void *)ae;
The counter pending_alerts->count is used as the Judy array index when inserting host and ae entries. Since two entries are inserted per alert (host at count+1, ae at count+2), and the loop processes pairs by incrementing Index and fetching the next value, there's a risk of index misalignment if any insertion fails or if the count is odd.
Consider using a more explicit pairing mechanism or adding validation to ensure the Judy array always contains an even number of entries and pairs are properly aligned.
Suggested change:
- Pvoid_t *Pvalue = JudyLIns(&config->pending_alerts->JudyL,
-                            ++config->pending_alerts->count, PJE0);
- if (unlikely(Pvalue == PJERR))
-     fatal("HEALTH: Failed to insert host into pending_alerts Judy array");
- *Pvalue = (void *)host;
- Pvalue = JudyLIns(&config->pending_alerts->JudyL,
-                   ++config->pending_alerts->count, PJE0);
- if (unlikely(Pvalue == PJERR))
-     fatal("HEALTH: Failed to insert ae into pending_alerts Judy array");
- *Pvalue = (void *)ae;
+ // Insert host/ae as an explicit pair to keep indices aligned.
+ size_t base_index = config->pending_alerts->count + 1;
+ Pvoid_t *Pvalue = JudyLIns(&config->pending_alerts->JudyL, base_index, PJE0);
+ if (unlikely(Pvalue == PJERR))
+     fatal("HEALTH: Failed to insert host into pending_alerts Judy array");
+ *Pvalue = (void *)host;
+ Pvalue = JudyLIns(&config->pending_alerts->JudyL, base_index + 1, PJE0);
+ if (unlikely(Pvalue == PJERR))
+     fatal("HEALTH: Failed to insert ae into pending_alerts Judy array");
+ *Pvalue = (void *)ae;
+ // Only advance the count after both insertions succeed to preserve pairing.
+ config->pending_alerts->count += 2;
// finalize_self_prepared_sql_statements() in their cleanup handlers.
// Forcefully finalizing statements from other threads can cause crashes
// if those threads are still using them during shutdown.
// SQLite will clean up any remaining statements when the database closes.
The comment claims SQLite will clean up any remaining statements when the database closes. However, this is not always guaranteed to be safe, especially if statements are still in use by other threads at the time of database closure.
SQLite's sqlite3_close() will fail with SQLITE_BUSY if there are unfinalized prepared statements, and sqlite3_close_v2() will delay the close until all statements are finalized, but this can still lead to crashes if threads are accessing those statements concurrently.
The safer approach would be to ensure each thread properly finalizes its own statements before shutdown completes, rather than relying on SQLite's cleanup. Consider adding explicit synchronization to ensure all worker threads have completed and finalized their statements before closing the database.
Suggested change:
- // SQLite will clean up any remaining statements when the database closes.
+ // Do NOT rely on SQLite to safely clean up statements that are still in use
+ // by other threads when the database is closed. All worker threads must
+ // have completed and finalized their own statements before this function
+ // is called. finalize_all_prepared_sql_statements() is intentionally not
+ // called here to avoid touching statements owned by other threads.
// Finalize all prepared statements in the pool
health_finalize_all_statements(config);
The shutdown sequence waits for active workers to complete with a 60-second timeout. However, if workers are still active after the timeout (line 695-698), the code continues to finalize statements (line 706) even though those workers may still be using them.
This could lead to crashes if worker threads are still executing database operations when their statement sets are finalized. Consider either:
- Forcing cancellation of long-running workers after a reasonable timeout
- Adding a hard requirement that all workers must complete before proceeding
- Documenting why it's safe to proceed (e.g., workers check shutdown_requested and will fail gracefully)
Suggested change:
- // Finalize all prepared statements in the pool
- health_finalize_all_statements(config);
+ // Finalize all prepared statements in the pool only if no workers are active.
+ // This avoids finalizing statements that might still be in use by worker threads.
+ if (__atomic_load_n(&config->active_workers, __ATOMIC_RELAXED) == 0) {
+     health_finalize_all_statements(config);
+ }
+ else {
+     nd_log(NDLS_DAEMON, NDLP_WARNING,
+            "HEALTH: Skipping statement finalization because %zu workers are still active",
+            __atomic_load_n(&config->active_workers, __ATOMIC_RELAXED));
+ }
// Each thread is responsible for finalizing its own statements via
// finalize_self_prepared_sql_statements() in their cleanup handlers.
// Forcefully finalizing statements from other threads can cause crashes
// if those threads are still using them during shutdown.
// SQLite will clean up any remaining statements when the database closes.
// finalize_all_prepared_sql_statements();
The comment removed from the old code mentioned crossing thread boundaries when finalizing SQL statements. The new approach uses a pool of prepared statements that are allocated and finalized by the health event loop thread.
However, there's a potential issue: if a worker thread is still using a statement when shutdown occurs and health_finalize_all_statements() is called (even after the 60-second timeout), the statement could be finalized while the worker is still accessing it in a different thread (the libuv thread pool worker), leading to crashes.
The mitigation relies on workers checking shutdown_requested and exiting early. Verify that all database operations in health_event_loop_for_host() and related functions properly check health_should_stop() before and during database operations.
// Fallback to synchronous save if queue failed (full or shutting down).
// sql_health_alarm_log_save() handles stmts==NULL by preparing ad-hoc statements.
if (!queued && !health_should_stop())
The comment states "sql_health_alarm_log_save() handles stmts==NULL by preparing ad-hoc statements." However, when stmts is NULL, the code in sqlite_health.c prepares statements without thread-local storage protection. This means:
- If multiple threads call with stmts==NULL concurrently, they could race on statement preparation
- The fallback to NULL should only be used from non-health threads where prepared statement pools aren't available
The current logic is: if queue is full or shutting down, fall back to synchronous save with NULL stmts. But this could be called from a health worker thread. Consider passing the worker's stmts to the synchronous save instead of NULL, or document why this fallback path will never be reached from a worker thread.
Suggested change:
- // Fallback to synchronous save if queue failed (full or shutting down).
- // sql_health_alarm_log_save() handles stmts==NULL by preparing ad-hoc statements.
- if (!queued && !health_should_stop())
+ // Fallback to synchronous save if queue failed (full or shutting down),
+ // but only when a prepared statement set is available for this thread.
+ // This avoids invoking sql_health_alarm_log_save() with stmts == NULL
+ // from multiple threads, which would prepare ad-hoc statements without
+ // thread-local storage protection.
+ if (!queued && !health_should_stop() && stmts)
while(ae) {
    // Check if entry is old enough to be deleted
    if(ae->when < now - retention &&
// Use ACQUIRE for pending_save_count to synchronize with RELEASE in health_process_pending_alerts()
The memory order for the atomic load at line 353 uses __ATOMIC_ACQUIRE to synchronize with the RELEASE in health_process_pending_alerts(). However, the comment at line 350-351 suggests this is to ensure save operations are complete, but the actual synchronization point is the decrement of pending_save_count with RELEASE in health_process_pending_alerts().
The ACQUIRE/RELEASE pairing ensures visibility, but there's still a window between the load and the free where new save operations could be queued. The design relies on shutdown_requested preventing new saves from being queued. Verify that all paths that queue saves check shutdown_requested, or document this assumption clearly.
Suggested change:
- // Use ACQUIRE for pending_save_count to synchronize with RELEASE in health_process_pending_alerts()
+ // The ACQUIRE load on pending_save_count pairs with the RELEASE decrement in health_process_pending_alerts(),
+ // ensuring that all prior updates performed before the decrement are visible here when the count reaches zero.
+ // Correctness also relies on higher-level logic (e.g. shutdown_requested) preventing new saves from being queued
+ // for this entry once shutdown is in progress; any code that queues saves must respect that contract.
src/health/health_event_loop_uv.c
Outdated
nd_log(NDLS_DAEMON, NDLP_DEBUG,
       "HEALTH: Stored %zu alert transitions in %.2f ms",
       entries / 2, (double)(ended - started) / USEC_PER_MS);
The division calculation entries / 2 is used to report the number of alert transitions processed. However, if pending->count is odd (which shouldn't happen but could due to the index pairing issue mentioned earlier), this will produce an incorrect count.
Consider adding a validation that entries (which equals pending->count) is even, or use a more robust counting mechanism that tracks actual pairs processed rather than relying on division by 2.
src/health/health_event_loop_uv.c
Outdated
void health_event_loop_init(void) {
    memset(&health_config, 0, sizeof(health_config));
    completion_init(&health_config.start_stop_complete);

    health_config.thread = nd_thread_create(HEALTH_EVENT_LOOP_NAME, NETDATA_THREAD_OPTION_DEFAULT,
                                            health_event_loop, &health_config);
    fatal_assert(NULL != health_config.thread);

    // Wait for initialization to complete
    completion_wait_for(&health_config.start_stop_complete);
    completion_reset(&health_config.start_stop_complete);

    nd_log(NDLS_DAEMON, NDLP_INFO, "HEALTH: event loop initialized");
The health_config global is zero-initialized via memset at line 722, and then a completion object is initialized. However, if health_event_loop_init() is called multiple times (e.g., during tests or if there's an error path that retries initialization), this could lead to issues:
- The memset would destroy the existing thread pointer and completion object
- A second thread could be created while the first is still running
Consider adding a guard to prevent multiple initializations, similar to how health_plugin_init() uses a spinlock and initialization.done flag to ensure it only runs once.
  health_log_alert(host, ae);
- health_alarm_log_add_entry(host, ae, true);
+ health_alarm_log_add_entry(host, ae, true, NULL);
The function signature change adds a struct health_stmt_set *stmts parameter but passes NULL when calling this function. According to the implementation in health_log.c, passing NULL for stmts causes the code to fall back to ad-hoc statement preparation in sql_health_alarm_log_save(). However, this fallback is only safe when called outside the worker threads.
Since this code is executed during rrdcalc linking/unlinking operations (which can happen on various threads), verify that these calls are never made from worker threads where prepared statements from the pool should be used instead of NULL.
Force-pushed from 546297e to 6a86cd5.
Summary by cubic
Switches the health subsystem to a libuv-based event loop with configurable parallel workers, and adds async alert persistence with lifecycle protection and safer shutdown, improving reliability and notification flow.
New Features
Refactors
Written for commit 6a86cd5. Summary will update on new commits.