@arin-mirza arin-mirza commented Jan 20, 2026

Why I'm doing:

There are currently no backend metrics reporting for memory pools.

I previously tried to add them by extending the workgroup metrics, but this turned out to be an incorrect approach:

What I'm doing:

This PR implements metric reporting for MemTrackerManager and adds the following new metrics:

  • mem_pool_mem_limit_bytes
  • mem_pool_mem_usage_bytes
  • mem_pool_mem_usage_ratio
  • mem_pool_workgroup_count

The implementation follows the same locking structure that is present in WorkGroupManager.

  • It was necessary to add a new mutex for MemTrackerManager because the update_metrics callback hook passed to MetricRegistry needs to be a closure which captures a write lock.
  • The unlocked gap inside the add_metrics method is unavoidable: it is needed to prevent an AB-BA deadlock scenario with the metrics collector.
  • Metrics entries are never deleted as it would complicate the thread synchronization even further. This is also the case for the existing implementation in WorkGroupManager.
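
To illustrate the unlocked gap described above, here is a minimal sketch and not the code from this PR. ToyRegistry, ToyManager, register_hook, collect and run_demo are hypothetical names invented for the example. It assumes the collector takes the registry lock first and the manager lock second, so the manager must not hold its own lock while calling into the registry.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the metric registry. It calls every hook while
// holding its own lock, the way a collector thread would.
class ToyRegistry {
public:
    void register_hook(std::function<void()> hook) {
        std::lock_guard<std::mutex> guard(_lock);
        _hooks.push_back(std::move(hook));
    }

    void collect() {
        std::lock_guard<std::mutex> guard(_lock);
        for (auto& hook : _hooks) hook();
    }

private:
    std::mutex _lock;
    std::vector<std::function<void()>> _hooks;
};

// Hypothetical stand-in for the manager that owns the per-pool state.
class ToyManager {
public:
    explicit ToyManager(ToyRegistry* registry) : _registry(registry) {}

    void add_pool(const std::string& name) {
        std::unique_lock<std::mutex> lock(_mutex);
        if (_pools.count(name) > 0) return;  // already known, nothing to do
        _pools.emplace(name, Pool{});

        // Unlocked gap: the collector locks registry -> manager. Calling the
        // registry while still holding the manager lock would lock
        // manager -> registry, which is the AB-BA ordering.
        lock.unlock();
        _registry->register_hook([this, name] {
            std::lock_guard<std::mutex> guard(_mutex);
            Pool& pool = _pools[name];
            pool.reported = pool.usage;  // refresh the reported value
        });
        lock.lock();
        _pools[name].registered = true;
    }

    void set_usage(const std::string& name, int64_t bytes) {
        std::lock_guard<std::mutex> guard(_mutex);
        _pools[name].usage = bytes;
    }

    int64_t reported(const std::string& name) {
        std::lock_guard<std::mutex> guard(_mutex);
        auto it = _pools.find(name);
        return it == _pools.end() ? -1 : it->second.reported;
    }

private:
    struct Pool {
        int64_t usage = 0;
        int64_t reported = 0;
        bool registered = false;
    };

    ToyRegistry* _registry;
    std::mutex _mutex;
    std::unordered_map<std::string, Pool> _pools;
};

// Single-threaded walk-through: register a pool, change its usage, collect.
void run_demo() {
    ToyRegistry registry;
    ToyManager manager(&registry);
    manager.add_pool("pool_a");
    manager.set_usage("pool_a", 42);
    registry.collect();
    assert(manager.reported("pool_a") == 42);
}
```

The cost of the gap is that other threads can observe the pool between the unlock and the re-lock, which is why the state touched after re-locking has to tolerate being seen half-initialized.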

Minor: Changed list_mem_trackers() method to not return the default memory pool name.

Tests and Docs

  • I did not add any test cases as there were not any for workgroup metrics either. Let me know if this is necessary.
  • I did not verify that the new metrics are being reported correctly by building and running the starrocks fe/be, as I am currently unable to build the engine locally.
  • I updated the user documentation.
    • I am not a Chinese or Japanese speaker so I used AI for the translation. I would appreciate it if a native speaker could review my additions to ensure the tone is correct. :)

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
    • This pr needs auto generate documentation
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5
    • 3.4

@arin-mirza arin-mirza requested a review from a team as a code owner January 20, 2026 11:06
@github-actions github-actions bot added the behavior_changed and documentation labels Jan 20, 2026
@StarRocks-Reviewer

@cursor review

@StarRocks-Reviewer

@cursor review

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

metrics->mem_limit->set_value(0);
metrics->mem_usage_bytes->set_value(0);
metrics->mem_usage_ratio->set_value(0);
metrics->workgroup_count->set_value(0);

Race condition causes null pointer dereference in metrics update

High Severity

A race condition exists where _update_metrics_unlocked can dereference nullptr metric pointers. When two threads concurrently call _add_metrics_unlocked for the same mem_pool, both pass the contains check at line 87 before either adds an entry. If the thread that failed to register metrics (registry returns false) acquires the lock first at line 115, it creates a MemTrackerMetrics entry with all nullptr members but doesn't move any metrics in. If the metrics collector runs before the successful thread moves its metrics, it will crash dereferencing nullptr at lines 155-164.

Contributor Author

Good catch, I thought the update method was protected against this scenario but it's not.

Since any thread which successfully registered at least one metric must run to completion (otherwise some metrics might never get initialized) we cannot solve this inside _add_metrics_unlocked.

I added if guards to _update_metrics_unlocked, which prevents any nullptr dereference. Since all metrics are registered and initialized by some thread, we eventually have a complete metrics object.

Fixed in f7a13d9

The same issue exists in the update method of WorkGroupManager so it should be fixed there too.
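
For readers following along, the guard approach described in this reply could look roughly like the sketch below. It is an illustration under assumptions, not the code from f7a13d9: ToyGauge, ToyPoolMetrics, update_metrics_guarded and run_guard_demo are hypothetical names, and the real metric types may differ. The idea is that each pointer is checked on its own, so a partially filled entry is updated as far as it goes instead of crashing.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical gauge type standing in for the real metric class.
struct ToyGauge {
    int64_t value = 0;
    void set_value(int64_t v) { value = v; }
};

// Hypothetical per-pool entry. Any member may still be null if the thread
// that registered the metric has not moved it in yet.
struct ToyPoolMetrics {
    std::unique_ptr<ToyGauge> mem_limit;
    std::unique_ptr<ToyGauge> mem_usage_bytes;
    std::unique_ptr<ToyGauge> workgroup_count;
};

// Updates every metric that is already present and skips the rest.
// Returns how many metrics were updated.
int update_metrics_guarded(ToyPoolMetrics& metrics, int64_t limit,
                           int64_t usage, int64_t workgroups) {
    int updated = 0;
    if (metrics.mem_limit != nullptr) {
        metrics.mem_limit->set_value(limit);
        ++updated;
    }
    if (metrics.mem_usage_bytes != nullptr) {
        metrics.mem_usage_bytes->set_value(usage);
        ++updated;
    }
    if (metrics.workgroup_count != nullptr) {
        metrics.workgroup_count->set_value(workgroups);
        ++updated;
    }
    return updated;
}

// Walk-through: an empty entry is skipped, a partial entry is updated.
void run_guard_demo() {
    ToyPoolMetrics metrics;
    assert(update_metrics_guarded(metrics, 100, 40, 2) == 0);

    metrics.mem_usage_bytes = std::make_unique<ToyGauge>();
    assert(update_metrics_guarded(metrics, 100, 40, 2) == 1);
    assert(metrics.mem_usage_bytes->value == 40);
}
```

The guards only prevent the dereference. They assume the caller already holds whatever lock protects the entry, so a pointer cannot change between the check and the call.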

@github-actions

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

[BE Incremental Coverage Report]

pass : 57 / 68 (83.82%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 be/src/exec/workgroup/mem_tracker_manager.cpp 55 66 83.33% [120, 150, 154, 155, 156, 158, 159, 161, 162, 164, 165]
🔵 be/src/exec/workgroup/mem_tracker_manager.h 2 2 100.00% []
