
Prevent race condition during pluginsd array operations #21628

Open
stelfrag wants to merge 19 commits into netdata:master from stelfrag:improve_parser_dimension_lifecycle

Conversation

@stelfrag
Collaborator

@stelfrag stelfrag commented Jan 24, 2026

Summary
  • Improve parser dimension caching

Summary by cubic

Prevents races and use-after-free in pluginsd dimension caching with a refcounted PRD_ARRAY, explicit collector ownership, and lock-free collector reads. Tightens lifecycle handling with spinlock-serialized growth and teardown, fixes chart-slot memory accounting, and adds a lifecycle stress test (-W prd-array-stress).

  • Bug Fixes

    • Refcounted PRD_ARRAY with create/acquire_locked/acquire/get_unsafe/replace/release; tracks header bytes in memory accounting.
    • Collector path: set collector_tid before access, preserve on same-chart re-scope, lock-free get/replace, atomic pos; never clear another thread’s ownership.
    • Array growth: pre-allocate, serialize under st->pluginsd.spinlock, copy entries and null old pointers to transfer ownership; atomic replace; defer free when refcount > 1.
    • Unslot: if a different collector is active, only clear the host slot mapping; otherwise detach under spinlock and release outside; guard against unexpected refcounts; detailed logs for refcount underflow and lifecycle violations.
    • Cleanup: require collector fully stopped; atomically swap array with NULL; reset pos; free entries only with exclusive ownership or defer with logs.
    • Chart slots and errors: fix memory accounting to use sizeof(RRDSET*); pre-clear last_slot to avoid recursive locking; clearer errors and null checks when refreshing slot cache and validating dimension IDs.
  • New Features

    • PRD_ARRAY lifecycle stress test (-W prd-array-stress) verifying non-overlapping writer/cleanup phases and refcount correctness.

Written for commit 3fca1dc. Summary will update on new commits.
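
The create/acquire/replace/release lifecycle listed above can be sketched as follows. This is a minimal illustration, not the PR's code: `DEMO_ARRAY` and the `demo_array_*` helpers are hypothetical stand-ins for `PRD_ARRAY`, and the entry type is simplified to `int`. It assumes GCC/Clang `__atomic` builtins, which the quoted review snippets also use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical stand-in for PRD_ARRAY: a refcounted header followed by entries.
typedef struct demo_array {
    int32_t refcount;   // starts at 1: the reference held by the owning pointer
    size_t size;
    int entries[];      // flexible array member
} DEMO_ARRAY;

static DEMO_ARRAY *demo_array_create(size_t size) {
    DEMO_ARRAY *arr = calloc(1, sizeof(*arr) + size * sizeof(arr->entries[0]));
    if (!arr) return NULL;
    arr->refcount = 1;
    arr->size = size;
    return arr;
}

// The caller must already guarantee that arr is alive (lock or own reference).
static DEMO_ARRAY *demo_array_acquire(DEMO_ARRAY *arr) {
    if (arr) __atomic_add_fetch(&arr->refcount, 1, __ATOMIC_ACQ_REL);
    return arr;
}

// Publish new_arr and hand the previous array, with its reference, to the caller.
static DEMO_ARRAY *demo_array_replace(DEMO_ARRAY **slot, DEMO_ARRAY *new_arr) {
    return __atomic_exchange_n(slot, new_arr, __ATOMIC_ACQ_REL);
}

// Returns 1 when this call freed the array, 0 when other holders remain.
static int demo_array_release(DEMO_ARRAY *arr) {
    if (!arr) return 0;
    int32_t left = __atomic_sub_fetch(&arr->refcount, 1, __ATOMIC_ACQ_REL);
    assert(left >= 0);  // a negative count would be a refcount underflow
    if (left == 0) {
        free(arr);
        return 1;
    }
    return 0;
}
```

With this shape, a replaced array is only freed by whichever holder drops the last reference, which is the "defer free when refcount > 1" behaviour the summary describes.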

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

No issues found across 3 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant Reader as "Parser Thread"
    participant State as "RRDSET (Shared Memory)"
    participant Writer as "Cleanup Thread"

    Note over Reader,Writer: Critical Section: Handling Array Lifecycle

    par Concurrent Read Operation
        Reader->>State: CHANGED: Read array pointer & size to LOCAL variables
        Note right of Reader: "Snapshot" state immediately.<br/>Avoids double-dereference race.

        alt NEW: Local Pointer is NULL or Size is 0
            Reader->>Reader: Log error / Return NULL
        else Local Pointer is Valid
            Reader->>Reader: Bounds check using LOCAL size
            Reader->>State: Access memory via LOCAL pointer
            Note right of Reader: Iterates safely using snapshot
        end

    and Concurrent Cleanup Operation
        Writer->>State: Lock pluginsd.spinlock
        Writer->>State: NEW: Copy address to 'old_array' local var

        Note right of Writer: 1. Invalidate Global State
        Writer->>State: NEW: Set pluginsd.prd_array = NULL
        Writer->>State: NEW: Set pluginsd.size = 0

        Writer->>State: Unlock pluginsd.spinlock

        Note right of Writer: 2. Release Memory (Safe)
        Writer->>Writer: NEW: freez(old_array)
        Note right of Writer: Frees memory only AFTER<br/>global pointer is NULL
    end
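
The two paths in the diagram can be sketched in C as below. This is an illustration under stated assumptions: `DEMO_STATE`, `demo_get_slot`, and `demo_cleanup` are hypothetical names, the array holds plain `int` values, and a minimal test-and-set flag stands in for `pluginsd.spinlock`. Note that snapshotting alone does not stop a reader that loaded the pointer just before cleanup freed it, which is the gap later review comments in this thread point at.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct demo_state {
    bool lock;     // minimal test-and-set spinlock (stand-in for pluginsd.spinlock)
    int *array;    // stand-in for pluginsd.prd_array
    size_t size;   // stand-in for pluginsd.size
} DEMO_STATE;

static void demo_lock(DEMO_STATE *s) {
    while (__atomic_test_and_set(&s->lock, __ATOMIC_ACQUIRE)) { /* spin */ }
}

static void demo_unlock(DEMO_STATE *s) {
    __atomic_clear(&s->lock, __ATOMIC_RELEASE);
}

// Reader: snapshot pointer and size once, then use only the local copies.
static int *demo_get_slot(DEMO_STATE *s, size_t index) {
    assert(s != NULL);
    int *local_array = __atomic_load_n(&s->array, __ATOMIC_ACQUIRE);
    size_t local_size = __atomic_load_n(&s->size, __ATOMIC_ACQUIRE);

    if (!local_array || local_size == 0) return NULL;  // already cleaned up
    if (index >= local_size) return NULL;              // bounds check on the snapshot
    return &local_array[index];
}

// Cleanup: invalidate the shared state under the lock, free after unlocking.
static void demo_cleanup(DEMO_STATE *s) {
    demo_lock(s);
    int *old_array = s->array;
    __atomic_store_n(&s->size, (size_t)0, __ATOMIC_RELEASE);
    __atomic_store_n(&s->array, (int *)NULL, __ATOMIC_RELEASE);
    demo_unlock(s);

    free(old_array);  // the shared pointer is already NULL at this point
}
```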

@stelfrag stelfrag marked this pull request as ready for review January 28, 2026 17:05
thiagoftsm
thiagoftsm previously approved these changes Jan 29, 2026
Contributor

@thiagoftsm thiagoftsm left a comment

Plugins are running as expected, LGTM!

Contributor

Copilot AI left a comment

Pull request overview

This pull request aims to fix race conditions during pluginsd array operations by implementing a lock-free reader pattern where cleanup nullifies pointers before freeing, and readers snapshot pointers/sizes before use.

Changes:

  • Modified cleanup to save old pointer/size, NULL them under spinlock, then free after unlocking
  • Updated readers to snapshot prd_array and size into local variables before accessing
  • Changed realloc logic to use local pointer variable before updating the shared pointer

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/database/rrdset-slots.c | Implements safe cleanup by NULLing prd_array and zeroing size under spinlock before freeing memory |
| src/plugins.d/pluginsd_internals.h | Updates dimension acquisition and slot management to snapshot prd_array/size locally and check for NULL |
| src/plugins.d/pluginsd_parser.c | Modifies chart cleanup in pluginsd_end_v2 to snapshot prd_array/size before iteration |

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

src/plugins.d/pluginsd_internals.h:132

  • Data race on st->pluginsd.dims_with_slots: This field is written non-atomically here but is read without holding the spinlock in multiple reader functions (e.g., pluginsd_end_v2 line 1100, pluginsd_acquire_dimension line 191). If the realloc path is taken, this could create a race where readers see an inconsistent state. Either protect writes to dims_with_slots with the spinlock and require readers to acquire it, or use atomic operations for this field as well.
        st->pluginsd.dims_with_slots = true;
        wanted_size = slot;

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

src/database/rrdset-slots.c:67

  • The function rrdset_pluginsd_receive_unslot modifies array elements (lines 52-54) without holding the spinlock, yet it can be called from multiple contexts including collector threads. This creates a race condition where:
  1. A collector thread could be in pluginsd_clear_scope_chart() or pluginsd_rrdset_cache_put_to_slot() calling this function
  2. Meanwhile, the cleanup thread in rrdset_pluginsd_receive_unslot_and_cleanup() could have checked collector_tid, found it to be 0, and proceeded to free the array

Even though the cleanup code now uses atomic operations to NULL the array pointer, there's still a window where a collector thread could have loaded the array pointer before it was NULLed, then the cleanup thread frees it, and then the collector thread tries to modify the freed memory.

The atomics for loading prd_array and prd_size protect against use-after-free when reading, but not when modifying the array contents. This function should either:

  1. Be called only while holding the spinlock, OR
  2. The modifications to array elements should be made atomic/synchronized, OR
  3. The collector_tid should remain set during the entire time this function is executing
void rrdset_pluginsd_receive_unslot(RRDSET *st) {
    // Use atomic loads with ACQUIRE semantics to synchronize with cleanup code
    // that uses RELEASE semantics when freeing the array
    struct pluginsd_rrddim *prd_array = __atomic_load_n(&st->pluginsd.prd_array, __ATOMIC_ACQUIRE);
    size_t prd_size = __atomic_load_n(&st->pluginsd.size, __ATOMIC_ACQUIRE);

    for(size_t i = 0; i < prd_size && prd_array; i++) {
        rrddim_acquired_release(prd_array[i].rda); // can be NULL
        prd_array[i].rda = NULL;
        prd_array[i].rd = NULL;
        prd_array[i].id = NULL;
    }

    RRDHOST *host = st->rrdhost;

    if(st->pluginsd.last_slot >= 0 &&
        (uint32_t)st->pluginsd.last_slot < host->stream.rcv.pluginsd_chart_slots.size &&
        host->stream.rcv.pluginsd_chart_slots.array[st->pluginsd.last_slot] == st) {
        host->stream.rcv.pluginsd_chart_slots.array[st->pluginsd.last_slot] = NULL;
    }

    st->pluginsd.last_slot = -1;
    st->pluginsd.dims_with_slots = false;
}
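
Option 3 from the comment above (keeping collector ownership set for the whole operation) depends on threads never clearing each other's `collector_tid`. A minimal sketch of such an ownership claim follows; `DEMO_OWNER` and the `demo_*` helpers are hypothetical, and using 0 to mean "no collector" is an assumption taken from the comment's wording.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct demo_owner {
    int32_t collector_tid;  // 0 means no collector owns the chart
} DEMO_OWNER;

// Claim ownership; succeeds if the chart is unowned or already owned by tid.
static bool demo_claim(DEMO_OWNER *o, int32_t tid) {
    assert(tid > 0);
    int32_t expected = 0;
    if (__atomic_compare_exchange_n(&o->collector_tid, &expected, tid,
                                    false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
        return true;
    return expected == tid;  // on failure, expected holds the current owner
}

// Release ownership only if tid is the current owner; never clear another thread's claim.
static bool demo_release_owner(DEMO_OWNER *o, int32_t tid) {
    assert(tid > 0);
    int32_t expected = tid;
    return __atomic_compare_exchange_n(&o->collector_tid, &expected, 0,
                                       false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}
```

A cleanup path that only checks for 0 and then frees would still have a window between the check and the free, so in practice the check has to be combined with the spinlock or a reference count.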

src/plugins.d/pluginsd_internals.h:140

  • The non-atomic writes to st->pluginsd.dims_with_slots (lines 136, 140) create a data race with the cleanup code that reads this field (line 1100 in pluginsd_parser.c and elsewhere). Since dims_with_slots is accessed without synchronization by both the collector and cleanup threads, this is a data race under the C memory model.

While the field itself may not cause crashes since it's a boolean, it should be accessed atomically to avoid undefined behavior. The cleanup code sets it to false (line 103 in rrdset-slots.c) without atomics, while readers check it without atomics (line 1100 in pluginsd_parser.c, line 161 in this file, line 196 in this file).

Consider using atomic operations for this field or ensuring it's only modified/read while holding the spinlock.

        st->pluginsd.dims_with_slots = true;
        wanted_size = slot;
    }
    else {
        st->pluginsd.dims_with_slots = false;

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

src/plugins.d/pluginsd_internals.h:140

  • After atomically loading prd_array and prd_size, the code still directly modifies st->pluginsd.dims_with_slots without atomic operations or lock protection. This creates inconsistency - a reader might load the old array but see the new dims_with_slots value (or vice versa), leading to undefined behavior.

The dims_with_slots flag should either be set atomically with appropriate memory ordering, or the entire pluginsd_rrddim_put_to_slot operation should be protected by st->pluginsd.spinlock.

    if(slot >= 1) {
        st->pluginsd.dims_with_slots = true;
        wanted_size = slot;
    }
    else {
        st->pluginsd.dims_with_slots = false;

@thiagoftsm
Contributor

@stelfrag, the collectors are running as expected, but we still have some additional suggestions for your PR. Please take a look at them.

@stelfrag stelfrag marked this pull request as draft January 30, 2026 15:05
@stelfrag stelfrag force-pushed the improve_parser_dimension_lifecycle branch from 2688f77 to 804575c Compare January 30, 2026 15:05
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 2 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/database/rrdset-slots.c">

<violation number="1" location="src/database/rrdset-slots.c:107">
P2: The new `netdata_log_error` call removes the prior rate limiting. If cleanup is retried while the collector is still active, this will log an error every time and can flood logs. Prefer keeping the previous `nd_log_limit` throttling here.</violation>
</file>

<file name="src/database/rrdset-pluginsd-array.h">

<violation number="1" location="src/database/rrdset-pluginsd-array.h:50">
P1: Potential use-after-free race condition in `prd_array_acquire`. Between loading the array pointer and accessing `arr->refcount`, another thread could replace the array pointer and release the old array (freeing it if refcount was 1). The CAS loop cannot protect against this because the pointer itself may become stale before we read the refcount.

Consider using hazard pointers, RCU-style deferred reclamation, or ensuring callers always hold an external reference during replacement operations.</violation>
</file>


stelfrag added 3 commits March 2, 2026 11:23
- Add detailed logging for refcount underflow and unexpected lifecycle violations.
- Ensure proper cleanup of PRD_ARRAY and detached references with spinlock protection.
- Refactor slot clearing logic to prevent races and double-releases during unslotting.
- Track memory being freed only when exclusive ownership is confirmed.
- Refine comments for better clarity on lifecycle handling and concurrency safeguards.
…a` references.

- Reacquire dimensions during slot cache growth for independent lifecycle management.
- Add detailed logging for slot cleanup and delayed free scenarios.
- Track memory only when exclusive ownership of old arrays is confirmed.
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.


Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.



thiagoftsm
thiagoftsm previously approved these changes Mar 2, 2026
Contributor

@thiagoftsm thiagoftsm left a comment

Netdata ran for hours without any collector issues. LGTM!

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.



… PRD_ARRAY and chart slots. Fix thread safety issues, optimize slot growth logic, and enhance error logging.
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.



}

static inline void pluginsd_clear_scope_chart(PARSER *parser, const char *keyword) {
static inline void pluginsd_clear_scope_chart(PARSER *parser, const char *keyword, RRDSET *preserve_collector_tid) {
Copilot AI Mar 2, 2026

The parameter name preserve_collector_tid is misleading because it is an RRDSET * (chart pointer), not a TID. Renaming it to something like preserve_st / preserve_chart would better reflect its meaning and reduce confusion in call sites.

Comment on lines +183 to +215
    spinlock_lock(&st->pluginsd.spinlock);

    current_arr = prd_array_get_unsafe(&st->pluginsd.prd_array);
    current_size = current_arr ? current_arr->size : 0;

    // Re-check under lock in case another path changed the array.
    if(wanted_size > current_size) {
        // Copy existing entries from old array (if any) and transfer ownership
        // to the new array by nulling old pointers.
        if(current_arr) {
            memcpy(new_arr->entries, current_arr->entries, current_size * sizeof(struct pluginsd_rrddim));
            for(size_t i = 0; i < current_size; i++) {
                current_arr->entries[i].rda = NULL;
                current_arr->entries[i].rd = NULL;
                current_arr->entries[i].id = NULL;
            }
        }

        // Initialize the new slots (callocz already zeroed them, but be explicit)
        for(size_t i = current_size; i < wanted_size; i++) {
            new_arr->entries[i].rda = NULL;
            new_arr->entries[i].rd = NULL;
            new_arr->entries[i].id = NULL;
        }

        // initialize the empty slots
        for(ssize_t i = (ssize_t) wanted_size - 1; i >= (ssize_t) st->pluginsd.size; i--) {
            st->pluginsd.prd_array[i].rda = NULL;
            st->pluginsd.prd_array[i].rd = NULL;
            st->pluginsd.prd_array[i].id = NULL;
        // Atomically replace the old array with the new one
        PRD_ARRAY *old_arr = prd_array_replace(&st->pluginsd.prd_array, new_arr);

        // Release the old array if there was one.
        if(old_arr) {
            // Release the old array - it will be freed when refcount reaches 0
            prd_array_release(old_arr);
        }
Copilot AI Mar 2, 2026

This releases old_arr while holding st->pluginsd.spinlock. Since prd_array_release() can free memory and update memory accounting, doing it in the critical section can unnecessarily extend lock hold time and risks lock inversion if the allocator/accounting paths take locks. Prefer to store old_arr and release it after spinlock_unlock(&st->pluginsd.spinlock).
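
The suggested ordering (swap under the lock, release after unlocking) can be sketched as below. Everything here is illustrative: `DEMO_ARR`, `DEMO_CHART`, and the `demo_*` helpers are hypothetical, entries are simplified to `void *`, and a test-and-set flag stands in for `st->pluginsd.spinlock`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct demo_arr {
    int32_t refcount;
    size_t size;
    void *entries[];
} DEMO_ARR;

typedef struct demo_chart {
    bool lock;       // stand-in for st->pluginsd.spinlock
    DEMO_ARR *arr;   // stand-in for st->pluginsd.prd_array
} DEMO_CHART;

static void demo_lock(DEMO_CHART *c) {
    while (__atomic_test_and_set(&c->lock, __ATOMIC_ACQUIRE)) { /* spin */ }
}

static void demo_unlock(DEMO_CHART *c) {
    __atomic_clear(&c->lock, __ATOMIC_RELEASE);
}

static DEMO_ARR *demo_arr_new(size_t size) {
    DEMO_ARR *a = calloc(1, sizeof(*a) + size * sizeof(a->entries[0]));
    if (a) { a->refcount = 1; a->size = size; }
    return a;
}

static void demo_arr_release(DEMO_ARR *a) {
    if (!a) return;
    int32_t left = __atomic_sub_fetch(&a->refcount, 1, __ATOMIC_ACQ_REL);
    assert(left >= 0);
    if (left == 0) free(a);
}

// Grow the chart's array to at least wanted_size; returns false on allocation failure.
static bool demo_grow(DEMO_CHART *c, size_t wanted_size) {
    DEMO_ARR *new_arr = demo_arr_new(wanted_size);  // pre-allocate outside the lock
    if (!new_arr) return false;
    DEMO_ARR *to_release = new_arr;                 // dropped if growth turns out unnecessary

    demo_lock(c);
    DEMO_ARR *cur = c->arr;
    size_t cur_size = cur ? cur->size : 0;
    if (wanted_size > cur_size) {                   // re-check under the lock
        if (cur) {
            memcpy(new_arr->entries, cur->entries, cur_size * sizeof(cur->entries[0]));
            for (size_t i = 0; i < cur_size; i++)
                cur->entries[i] = NULL;             // ownership moved to new_arr
        }
        to_release = __atomic_exchange_n(&c->arr, new_arr, __ATOMIC_ACQ_REL);
    }
    demo_unlock(c);

    demo_arr_release(to_release);                   // free happens outside the critical section
    return true;
}
```

Keeping a single `to_release` pointer covers both outcomes: the old array when the swap happened, or the unused pre-allocated array when another path had already grown it.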

Comment on lines +63 to +69
static inline PRD_ARRAY *prd_array_acquire_locked(PRD_ARRAY **array_ptr) {
    PRD_ARRAY *arr = *array_ptr;
    if (arr) {
        __atomic_fetch_add(&arr->refcount, 1, __ATOMIC_ACQ_REL);
    }
    return arr;
}
Copilot AI Mar 2, 2026

The array pointer is updated elsewhere via __atomic_exchange_n() / __atomic_load_n(), but here it’s read via a plain load (*array_ptr). To keep the access model consistent and avoid accidental data races if lifecycle assumptions change, prefer loading the pointer with an atomic load (e.g., __atomic_load_n(array_ptr, __ATOMIC_ACQUIRE)) even when the caller holds a spinlock.
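
The change this comment proposes amounts to swapping the plain load for an atomic one, roughly as sketched here. `DEMO_REF` and `demo_acquire_locked` are hypothetical names used for illustration; the caller is still assumed to hold the lock that serializes replacement of the pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct demo_ref {
    int32_t refcount;
} DEMO_REF;

// Caller is assumed to hold the lock that serializes replacement of *slot.
static DEMO_REF *demo_acquire_locked(DEMO_REF **slot) {
    assert(slot != NULL);
    // Atomic load keeps this access consistent with the atomic exchange used by writers.
    DEMO_REF *arr = __atomic_load_n(slot, __ATOMIC_ACQUIRE);
    if (arr)
        __atomic_fetch_add(&arr->refcount, 1, __ATOMIC_ACQ_REL);
    return arr;
}
```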

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.



3 participants