Log error to host #114944
Pull Request Overview
This PR adds enhanced error reporting for GC initialization failures by logging detailed error messages to the host. It also exposes additional region configuration values to the runtimeconfig and removes private testing code while updating internal GC logging and promotion/demotion logic.
- Updated the signature and usage of functions managing GC region promotion/demotion, including new parameters in decide_on_demotion_pin_surv.
- Revised configuration variables for GCRegionRange and GCRegionSize to expose them via runtimeconfig.
- Replaced many direct LogErrorToHost calls with a new log_init_error_to_host function for improved diagnostic output.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/coreclr/gc/gcpriv.h | Updated function signatures (decide_on_demotion_pin_surv and overloads) for region logging. |
| src/coreclr/gc/gcconfig.h | Changed configuration strings for region settings to expose them to runtimeconfig. |
| src/coreclr/gc/gccommon.cpp | Added the log_init_error_to_host implementation and updated error handling in log file setup. |
| src/coreclr/gc/gc.h | Declared the new log_init_error_to_host function. |
| src/coreclr/gc/gc.cpp | Replaced direct logging calls with log_init_error_to_host and updated promotion/demotion logic. |
Comments suppressed due to low confidence (4)
src/coreclr/gc/gcpriv.h:32294
- Please add documentation or inline comments to clearly describe the purpose and expected values for the new bool parameters 'promote_gen1_pins_p' and 'large_pins_p', to aid future maintainers.
void decide_on_demotion_pin_surv (heap_segment* region, int* no_pinned_surv_region_count, bool promote_gen1_pins_p, bool large_pins_p)
src/coreclr/gc/gcconfig.h:104
- With the configuration strings for GCRegionRange and GCRegionSize now exposed via runtimeconfig, please ensure that the associated documentation is updated to reflect these new settings.
INT_CONFIG (GCRegionRange, "GCRegionRange", "System.GC.RegionRange", 0, ...
src/coreclr/gc/gc.cpp:14550
- The new error logging calls replacing GCToEEInterface::LogErrorToHost improve diagnostic output. Please consider adding a brief inline comment to explain the threshold logic and usage of the gib() helper for clarity.
log_init_error_to_host ("Reserving %zd bytes (%zd GiB) for the regions range failed, do you have a virtual memory limit set on this process?", reserve_size, gib (reserve_size));
src/coreclr/gc/gccommon.cpp:277
- Since log_init_error_to_host uses a static buffer for formatting the error message, consider potential thread-safety issues if this function can be called concurrently. It might be beneficial to either allocate a buffer on the stack or protect access with a lock.
char error_buf[256];
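The last suppressed comment concerns how the message buffer is shared. As a rough illustration only, here is a minimal sketch of a printf-style helper that formats into a 256-byte stack buffer and forwards the result to a host callback. The names `log_init_error_sketch`, `log_error_to_host_stub`, and `g_last_host_error` are hypothetical stand-ins, not the PR's code; the real callback is `GCToEEInterface::LogErrorToHost`, and the actual `log_init_error_to_host` implementation may differ.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the host callback; it records the last message
// so the behavior can be inspected. Not part of the runtime.
static std::string g_last_host_error;

static void log_error_to_host_stub(const char* message)
{
    g_last_host_error = message;
    std::fprintf(stderr, "%s\n", message);
}

// Sketch of a printf-style init-error helper (assumed shape, not the PR's code).
// The buffer lives on the stack, so each caller formats into its own storage.
static void log_init_error_sketch(const char* format, ...)
{
    char error_buf[256];

    va_list args;
    va_start(args, format);
    // vsnprintf always null-terminates and truncates to at most 255 characters.
    std::vsnprintf(error_buf, sizeof(error_buf), format, args);
    va_end(args);

    log_error_to_host_stub(error_buf);
}
```

With a stack buffer like this, concurrent callers do not share formatting storage; a `static` buffer would need a lock instead. Messages longer than 255 characters are silently truncated in this sketch.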
Tagging subscribers to this area: @dotnet/gc
…region size related configs in runtimeconfig
f598621 to 962a3b1
I will be making the doc change for the Region* configs. I think the message above should mention the RegionRange config (when the doc change happens).
@@ -49584,12 +49610,10 @@ HRESULT GCHeap::Initialize()
uint8_t* numa_mem = (uint8_t*)GCToOSInterface::VirtualReserve (hb_info_size_per_node, 0, 0, (uint16_t)numa_node_index);
if (!numa_mem)
{
GCToEEInterface::LogErrorToHost("Reservation of numa_mem failed");
These don't need to be logged?
That's under #ifdef HEAP_BALANCE_INSTRUMENTATION, which is only for a specific local analysis and is normally not defined.
@@ -49691,7 +49715,6 @@ HRESULT GCHeap::Initialize()
if (seg_mem == nullptr)
{
GCToEEInterface::LogErrorToHost("STRESS_REGIONS couldn't allocate ro segment");
Does it make sense to remove logs that we already had? I know this failure would be extremely rare, but I was trying to cover cases where the coreclr initialization could return E_FAIL and so users would have no idea where it came from and having the log at these places costs nothing.
I see why you made the choices you made. Why not log places that return other HRESULTs like E_OUTOFMEMORY? Those are actually more interesting, because just telling someone there's an OOM most likely doesn't help.
My choices are for places that are likely to be hit by normal users and are currently hard to debug. While it costs nothing, there are many, many places that return a non-S_OK HRESULT and I really didn't want to log in that many places, so I just picked the ones I thought would be helpful. STRESS_REGIONS is not even defined normally; it's only used in private testing, and in those cases it wouldn't be a problem for the person to look at where it fails.
LGTM, thank you!
Lately I've seen a couple of customer reports where GC initialization failed to reserve the default 256GB of virtual memory for the regions range. To make this easier to diagnose, I've added this as an error communicated to the host, so the host shows a message saying that reserving the regions range failed and asking whether a virtual memory limit is set on the process.
I also added logging in a few other places that we might hit and got rid of some that are only for private testing. I'm not adding this for every single case where initialization could fail, as most are really unlikely to be hit.
Since the solution is to adjust some region configs, I also exposed them to the runtimeconfig.
I also fixed a typo I had in a config for private testing.
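To give a feel for what the reservation-failure text looks like for the default range, here is a small sketch that fills in the message wording quoted in the review for a 256 GiB reservation. It assumes "256GB" means 256 GiB; `gib_sketch` and `format_reserve_failure` are hypothetical names, the PR's own `gib()` helper may round differently, and the sketch uses `%llu` with `unsigned long long` rather than the `%zd` specifiers in the PR.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: whole GiB contained in a byte count (truncating).
static unsigned long long gib_sketch(unsigned long long bytes)
{
    return bytes / (1024ULL * 1024ULL * 1024ULL);
}

// Builds the message text using the wording quoted in the review comment.
// The formatting details here are an assumption, not the runtime's code.
static std::string format_reserve_failure(unsigned long long reserve_size)
{
    char buf[256];
    std::snprintf(buf, sizeof(buf),
                  "Reserving %llu bytes (%llu GiB) for the regions range failed, "
                  "do you have a virtual memory limit set on this process?",
                  reserve_size, gib_sketch(reserve_size));
    return buf;
}
```

For example, `format_reserve_failure(256ULL << 30)` yields a message reporting 274877906944 bytes (256 GiB), which points the reader at a process virtual memory limit as the likely cause.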