datas tuning fix #98743
Conversation
Tagging subscribers to this area: @dotnet/gc
Issue Details: will add description soon.
}

float mean (float* arr, int size)
{
Worth checking size > 0 as a precondition?
I've added an assert in slope which makes more sense.
Kind of similar to log_with_base: is the assertion/condition intended for mean or for callers of mean? If it's a precondition for mean, then I would expect the precondition check to be in mean (or in both mean and the callers).

Or, if mean is supposed to support some callers with a negative size, then the final return probably needs to be something like return (size > 0) ? (sum / size) : 0.
size_t gc_heap::get_num_completed_gcs ()

float log_with_base (float x, float base)
{
Is it worth asserting that x > 0 and base > 0?
It's actually meant to have x > base, and that should be enforced. But I can still add an assert.
log_b(x) is fine for x <= base (e.g., log_2(2) = 1, log_4(2) = 1/2).

I think you're saying (by "should be enforced") that current call site(s) expect x > base. log_with_base is a very reasonable helper function that could get used elsewhere without such a restriction. Or maybe you want to rename it to show that it is intended as a helper for a specific context rather than a general log helper?
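For reference, a sketch of what a general-purpose helper might assert instead (illustrative only; it checks the mathematical domain rather than x > base):

#include <cassert>
#include <cmath>

// Sketch only - a general log helper: any positive x is valid, including x <= base.
float log_with_base (float x, float base)
{
    assert (x > 0.0f);
    assert ((base > 0.0f) && (base != 1.0f));
    return (float)(std::log (x) / std::log (base));
}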
uint64_t elapsed_between_gcs; // time between gcs in microseconds (this should really be between_pauses)
uint64_t gc_pause_time; // pause time for this GC
uint64_t msl_wait_time;
size_t gc_survived_size;
Suggested change:
size_t gc_survived_size; // total survived size across all relevant generations for this GC
i.e., it's -not- gen0 to be consistent in what is being recorded
//
// We need to observe the history of tcp's so record them in a small buffer.
//
float recorded_tcp_rearranged[recorded_tcp_array_size];
You've mentioned this before, but this is doable without copying the data (though I think the real concern would be avoiding the additional concept of "rearranged" data rather than the copy of a small amount of data, which could easily be negligible in cost).

Encapsulating the data in a circular buffer with an iterator would probably accomplish this - it probably makes sense to do this as a follow-up PR, which I can do.
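A hypothetical shape for that follow-up (type and field names invented here, not from the PR): wrap the buffer so consumers walk it oldest-to-newest in place instead of through a rearranged copy.

// Sketch only - not the PR's code.
struct tcp_ring
{
    static const int capacity = 64;   // stand-in for recorded_tcp_array_size
    float data[capacity];
    int next_index;                   // slot the next sample will go into
    int total_count;                  // may exceed capacity

    int size () const
    {
        return (total_count < capacity) ? total_count : capacity;
    }

    // i = 0 is the oldest retained sample, i = size() - 1 the newest.
    float at (int i) const
    {
        int start = (total_count < capacity) ? 0 : next_index;
        return data[(start + i) % capacity];
    }
};

Callers that today index recorded_tcp_rearranged[i] would call at (i) on the ring instead.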
float recorded_tcp_rearranged[recorded_tcp_array_size];
float recorded_tcp[recorded_tcp_array_size];
int recorded_tcp_index;
int total_recorded_tcp;
Suggested change:
int total_recorded_tcp; // can exceed the array size
recorded_tcp_index++;
if (recorded_tcp_index == recorded_tcp_array_size)
{
    recorded_tcp_index = 0;
}
Suggested change:
recorded_tcp_index = (recorded_tcp_index + 1) % recorded_tcp_array_size;
if (total_recorded_tcp >= recorded_tcp_array_size)
{
    int earlier_entry_size = recorded_tcp_array_size - recorded_tcp_index;
    memcpy (recorded_tcp_rearranged, (recorded_tcp + recorded_tcp_index), (earlier_entry_size * sizeof (float)));
Can we use std::copy in this project to avoid the manual byte size computation?
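Possibly something like this sketch (assuming <algorithm> is acceptable in this file; element ranges instead of byte counts):

#include <algorithm>

// Same copy as the memcpy above, expressed over element ranges.
std::copy (recorded_tcp + recorded_tcp_index,
           recorded_tcp + recorded_tcp_array_size,
           recorded_tcp_rearranged);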
    return copied_count;
}

int highest_avg_recorded_tcp (int count, float avg, float* highest_avg)
This name is a bit confusing to me. It looks like it returns the average and count of the elements above a limit (which happens to be the average, given the name of the parameter, but it isn't relevant to this function that it's the average).
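Purely to illustrate the naming point, a hypothetical shape (names invented here, not a requested change), where the limit is passed in and just happens to be the average at the call site:

// Hypothetical rename - reports on the elements above a caller-supplied limit.
int count_and_avg_above_limit (float* values, int count, float limit, float* avg_above_limit)
{
    float sum_above = 0.0f;
    int count_above = 0;
    for (int i = 0; i < count; i++)
    {
        if (values[i] > limit)
        {
            sum_above += values[i];
            count_above++;
        }
    }
    *avg_above_limit = (count_above > 0) ? (sum_above / count_above) : 0.0f;
    return count_above;
}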
float highest_sum = 0.0;
int highest_count = 0;

for (int i = 0; i < count; i++)
I think this is using the count oldest elements in the buffer - should it be the newest?
Note - count is the entire buffer (as returned by the rearrange method and passed back in here), so there isn't a correctness issue here.
float recorded_tcp_rearranged[recorded_tcp_array_size];
float recorded_tcp[recorded_tcp_array_size];
int recorded_tcp_index;
int total_recorded_tcp;
recorded_tcp_count to be consistent with other naming?
// each time our calculation tells us to shrink.
int dec_failure_count;
int dec_failure_recheck_threshold;
For later - I think it would be interesting to share the increment/decrement cases to avoid some duplication. It would have to be parameterized in some way so that the behavior could be customized. Anyways, there's no requested change here right now.
float below_target_accumulation;
float below_target_threshold;

// Currently only used for dprintf.
#ifdef this?
// Recording the gen2 GC indices so we know how far apart they are. Currently unused
// but we should consider how much value there is if they are very far apart.
size_t gc_index;
// This is (gc_elapsed_time / time inbetween this and the last gen2 GC)
nit - "in between" or even just "between"
// at the beginning of a BGC and the PM triggered full GCs
// fall into this case.
PER_HEAP_ISOLATED_FIELD_DIAG_ONLY uint64_t suspended_start_time;
// Right now this is diag only but may be used functionally later.
I don't think this comment really adds anything
dynamic_heap_count_data.sample_index = (dynamic_heap_count_data.sample_index + 1) % dynamic_heap_count_data_t::sample_size;
(dynamic_heap_count_data.current_samples_count)++;
It bugs me a bit that the sample and recorded tcp handling are different (one inline here, the other in helper methods), but I think that's for another day.
    }
}

float avg_x = (float)sum_x / n;
Suggested change:
float avg_x = ((float)sum_x) / n;

Or use the static_cast<float>(sum_x) / n format, which requires parentheses.
also below, though I don't think those explicit casts are needed since avg_x is a float. fine to be careful though of course.
also this is just (n+1) / 2.0f, though the loop is still needed for dprintf
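For reference, the closed form being referred to (assuming the x values the loop walks are 1..n, which is what makes the average data-independent): sum_x = n * (n + 1) / 2, so

// Equivalent closed form for the average of x = 1..n.
float avg_x = (n + 1) / 2.0f;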
// Change it to a desired number if you want to print.
int max_times_to_print_tcp = 0;

// Return the slope, and the average values in the avg arg.
Is there a name for the slope that is being calculated here? I see that it's a weighted sum based on distance from the middle, but I'm not familiar with that. For example, I don't think this is the slope of a typical regression line? (which is fine, though I guess I'm a bit curious about the mathematical properties of this)
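For comparison only (not claiming this is what the PR computes): the textbook least-squares slope over equally spaced x values also reduces to a weighted sum of the y values, with weights proportional to the signed distance from the middle, so the two may agree up to a constant factor.

// Reference sketch: ordinary least-squares slope of y over x_i = i, i = 0..n-1.
float ols_slope (const float* y, int n)
{
    float mean_x = (n - 1) / 2.0f;
    float num = 0.0f;
    float den = 0.0f;
    for (int i = 0; i < n; i++)
    {
        // The y mean drops out because the (i - mean_x) weights sum to zero,
        // so this is a weighted sum of the y values.
        num += (i - mean_x) * y[i];
        den += (i - mean_x) * (i - mean_x);
    }
    return (den > 0.0f) ? (num / den) : 0.0f;
}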
}

float median_throughput_cost_percent = median_of_3 (throughput_cost_percents[0], throughput_cost_percents[1], throughput_cost_percents[2]);
float avg_throughput_cost_percent = (float)((throughput_cost_percents[0] + throughput_cost_percents[1] + throughput_cost_percents[2]) / 3.0);
nit - might be able to drop the (float) if you used 3.0f
if (dynamic_heap_count_data.dec_failure_count)
{
    (dynamic_heap_count_data.dec_failure_count)++;
}
else
{
    dynamic_heap_count_data.dec_failure_count = 1;
}
I don't think this if is necessary.
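A minimal sketch of the simplification (assuming dec_failure_count is reset to 0 elsewhere when the decrease streak ends):

// 0 -> 1 on the first failure, then keeps incrementing - same net effect as the if/else above.
(dynamic_heap_count_data.dec_failure_count)++;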
if (shrink_p && step_down_int && (new_n_heaps > step_down_int))
{
    // TODO - if we see that it wants to shrink by 1 heap too many times, we do want to shrink.
Also, if n_heaps is small, then 1 is significant (well, significant to the heap count; if the GC heap is a small fraction of overall memory, which it might be if the heap count is small, then the memory savings could still be insignificant).
My review is very late for the preview release. These aren't necessary right now and can be addressed in a future PR.
/cc @MichalStrehovsky @eerhardt Just for visibility, as people started asking about this: I believe this introduced a slight RPS regression in the native AOT benchmarks, on both Windows and Linux. And we can see an improvement in max working set. NB: the unstable results are unrelated and were tracked in #98021.
trending up/down (and if so how fast is that trend) and make a decision if we want to grow/shrink according to our calculation
There are a few issues with these that will be addressed in future checkins -