Pauseless Garbage Collector (Question) #96213
Tagging subscribers to this area: @dotnet/gc

Issue Details: Is there any reason why something like the Pauseless Garbage Collector which exists for Java from Azul was never implemented for .NET? https://www.azul.com/products/components/pgc/
I'll add the generic response that the .NET GC supports functionality that Java's GCs do not, and this can complicate or invalidate optimizations that other GCs can take advantage of. For example, Java's GC doesn't support interior pointers, while the .NET GC does.
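As an illustration (a minimal sketch, not from the original comment): managed `ref` locals in C# can point into the middle of an object, and the GC must be able to find the containing object from such an interior pointer, and update the pointer if the object moves.

```csharp
using System;

class InteriorPointerDemo
{
    static void Main()
    {
        int[] numbers = { 10, 20, 30 };

        // 'slot' is an interior pointer: a managed reference into the
        // middle of the array object, not to the object header itself.
        // The GC must keep 'numbers' alive while 'slot' is live, and
        // must fix up 'slot' if the array is relocated by compaction.
        ref int slot = ref numbers[1];
        slot = 99;

        Console.WriteLine(numbers[1]); // prints 99
    }
}
```

Supporting this efficiently constrains the GC design; collectors that assume all references point at object headers can use tricks that are unavailable here.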
I'd like to see the GC team write some insight about the benefits and challenges/drawbacks of these options. What's theoretically possible but just complex/low priority to implement? Which features/goals are in fundamental conflict?
These types of GCs typically trade throughput for shorter pause times. For example, they often use GC read barriers that make accessing object reference fields significantly slower. If you would like to understand the problem space, read _The Garbage Collection Handbook_; it has a full chapter dedicated to real-time garbage collectors. There is nothing fundamental preventing these types of garbage collectors from being built for .NET. It is just a lot of work to build a production-quality garbage collector. We do not see significant demand for these types of garbage collectors in .NET, and building alternative garbage collectors with very different performance tradeoffs has not been at the top of the core .NET team's priority list. I would love to see the .NET community experimenting with alternative garbage collectors with very different performance tradeoffs. That is how Azul came to be - the Azul garbage collector that you linked to is not built by the core Java team.
I guess that having more and more official, simultaneously supported GCs would also increase the amount of work needed to maintain and improve them all, putting even more burden on the GC team, which would mean the existing GCs would be improved more slowly.
Yeah, but a pauseless collector seems to me to open up a whole new area where .NET could be used. Even if it may be slower, being deterministic, without pauses every now and then, could be a huge benefit for some applications.
Right. If it were to follow the Azul model, it would not impact the core GC team much. I believe that the core Java GC team does not spend any cycles on the Azul GC. The Azul GC is maintained by Azul, which is a company with a closed-source business model.
It comes down to numbers and opportunity costs. For example, how many new developers can a pauseless GC bring to .NET? It is hard to make the numbers work.
For what it's worth, beware that buying the book as an eBook from the official publisher only gives access through the VitalSource service. There is no way to download the book except through the DRM-encumbered software, and they managed to block my account before I was even able to read a single page (no explanation given; the service just responds with a 401 error and logs me out). If you want the book, get it as a physical book or through Amazon Kindle and save yourself the trouble.
I am rather surprised to hear that! I'd love to see an experimental GC, so I'm certainly quite biased. But I'd imagine predictability of latency is a very significant concern for a number of large user bases. Game development (Unity) comes to mind, of course, as do many areas in finance and algorithmic trading. Sorry, that's it for my sales pitch; in short, I'd definitely love to see experimentation in this area.
Latency of a GC can be a concern in many of the same ways that latency of RAII can be a concern. Having a GC, including a GC that can "stop the world", is not itself strictly a blocker, and it may be of interest to note that many of the broader/well-known game engines do themselves use GCs (many, but not all, of which are incremental rather than pauseless).

Most people's experience with .NET and a GC in environments like game dev, up until this point, has been with either the legacy Mono GC or the Unity GC, neither of which can really be compared with the performance, throughput, latency, or various other metrics of the precise GC that ships with RyuJIT.

Having some form of incremental GC is likely still interesting, especially if it can be coordinated to run more in places where the CPU isn't doing "important" work (such as when you're awaiting a dispatched GPU task to finish executing), but it's hardly a requirement with an advanced modern GC, especially if you're appropriately taking memory management into consideration by utilizing pools, spans/views, and other similar techniques (just as you'd have to in C++ to limit RAII or free overhead).
In essence, using a pool is no different from manually allocating memory. It does not reduce the mental burden of the manual management required to allocate and reclaim memory. Of course, it is also necessary to explore safe programming methods similar to Rust's. This is a popular article about GC.
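To make that trade-off concrete (a hedged sketch using the standard `System.Buffers.ArrayPool<T>` API): renting from a pool avoids generating garbage, but the caller takes on the burden of returning the buffer, much like a manual free.

```csharp
using System;
using System.Buffers;

class PoolingDemo
{
    static void Main()
    {
        // Rent a buffer instead of allocating a fresh array. This avoids
        // creating garbage, but the buffer MUST be returned manually -
        // the "mental burden" of manual memory management mentioned above.
        int[] buffer = ArrayPool<int>.Shared.Rent(minimumLength: 1024);
        try
        {
            for (int i = 0; i < 1024; i++)
                buffer[i] = i;
            Console.WriteLine(buffer[100]); // prints 100
        }
        finally
        {
            // Forgetting this is the pooling equivalent of a memory leak;
            // returning it twice is the equivalent of a double free.
            ArrayPool<int>.Shared.Return(buffer);
        }
    }
}
```

Note that `Rent` may return a buffer larger than requested, another detail the caller has to keep in mind, just as with manual allocators.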
It would be nice to see how this changes in newer versions of .NET, and also how Java compares (with both its default collector and the pauseless collector mentioned here).
Maybe an option like the incremental GC being adopted by Unity is feasible here, where a "full GC" is broken up into a sequence of several "partial GCs" (i.e., doing GC incrementally), so that although the total pause time doesn't change, each individual pause can be minimized to a nearly pauseless one. cc: @Maoni0
I'm a game/engine dev on osu!. We've used C# throughout, from .NET 3.5 to .NET 8, and have fully rewritten the game over the years, which has brought new challenges in balancing features that wouldn't have been possible before against what works best with the .NET GC. By far our greatest fight has been with the GC - it is definitely a felt presence and at the forefront of everything we do. I've personally gone pretty deep into minimising pauses with issues such as #48937, #12717, and #76290, but as a team we've always been very conscious about allocations because our main loop is running at potentially 1000Hz, or historically even more than that. What we've found works best for us is turning on
Where it breaks down, however, is areas that require allocs such as menus. This GC mode will cause terrible stutters when doing anything remotely intensive, meaning that we have to very carefully switch GC modes at opportune moments to get the best of both worlds, and sometimes those worlds are intertwined.
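A sketch of the kind of mode switching described above (hypothetical placement; the real game's logic is of course far more involved), using the standard `GCSettings.LatencyMode` API:

```csharp
using System;
using System.Runtime;

class LatencyModeDemo
{
    static void Main()
    {
        // During latency-critical gameplay: ask the GC to avoid
        // blocking full collections where possible.
        GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
        Console.WriteLine(GCSettings.LatencyMode); // SustainedLowLatency

        // At an opportune moment (e.g. entering an allocation-heavy
        // menu): switch back and let the GC catch up on deferred work,
        // paying the cost now while a stutter is acceptable.
        GCSettings.LatencyMode = GCLatencyMode.Interactive;
        GC.Collect();
    }
}
```

The hard part, as the comment above notes, is that "opportune moments" may not exist when the latency-sensitive and allocation-heavy worlds are intertwined.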
@smoogipoo it seems to me that you should be working directly with MS folks on this. Your expertise in gamedev would help so many people. Stuttering in Unity, for example, has been a blemish on C# for a very long time. It gives people the impression that C# is just a bad language, which is absolutely disastrous for the community as a whole, as more people move away from these tools and end up using other languages.
Industrial automation and real-time data fusion are sectors that are growing rapidly and would require more deterministic behavior/latency.
I hope C# can enter the field of industrial robot control, but this field requires hard real-time performance, and long interruption times may lead to safety incidents. It would be great if users could create real-time threads that are not affected by the GC.
Linus has merged the final real-time code (PREEMPT_RT) into the Linux mainline, and for programming languages, supporting hard real-time would be one of the most outstanding features a language can offer!
This is an interesting discussion. I actually wonder - is there really demand for a low-pause garbage collector on .NET? Are there large enough classes of apps that are like "If only we could have GC pauses under NOTE:
With Unity slowly inching toward completing their Sisyphean effort to move onto CoreCLR, that could be one of the major consumers of low-pause collectors. It's a bit arbitrary, but Minecraft as a game represents a very good example where a low-pause collector (Shenandoah) has really high value, as it manages to offer a relatively stutter-free experience despite multi-gigabyte allocation rates.

Another example is highly loaded but latency-sensitive services. Consider a hypothetical scenario of authoring a dynamic NGINX module with NativeAOT and a thin C shim. A 20% loss in performance around allocatey parts may be acceptable, especially as guaranteeing a completely allocation-free implementation may prove challenging, but incurring a sudden multi-millisecond pause may prohibit the use of .NET in such an (arguably exotic) scenario completely.

Lastly, it may be acceptable to have some application threads run into significant pauses but not others. To my knowledge, .NET does not have the ability to scavenge a thread/core-local allocation context for dead objects when an allocation attempt exceeds the budget, before incurring an application-wide pause, so having that as a stopgap would have been nice.

I think the move of the Java ecosystem to ZGC and the design choices in Go point to an appetite for this type of GC design despite its downsides.
To be honest, Java has always had multiple GCs. I think right now there are 5 or so officially supported Java GCs that can be selected via a command-line switch. I assume it is relatively easy to add another one. Such a possibility must require a well-abstracted API where a "substantially different" GC can plug in. By "substantially different" I mean a GC that would have a different write barrier, a different heap layout, perhaps a different approach to generations and large objects. An alternative GC might not have anything like segments or per-core heaps either.

The current CLR does not have such an interface. There are some EE-to-GC APIs, but they seem to be mostly designed for scenarios like "the same GC, but with a few tweaks or fixes". If you want a "substantially different" GC, you may quickly find yourself changing things on both sides of these interfaces.
I expect we would be happy to consider proposals for GC-EE interface generalizations to enable substantially different GCs.
It is no different in Java once you look at the details of what it took to add substantially different GCs there. It required modifications throughout the JVM to make the substantially different GCs work. It is not as if you add a new directory with a substantially different GC and it just works without changing anything else.
There is an established pattern for this one: a GC-EE interface API to communicate the shape of the write barrier that the GC wants to use. We have been adding more write barrier shapes over time, even for the current GC.
Rather than a different write barrier, a low-latency GC would probably need a read barrier, for which the runtime has no support (afaict) at the moment. In any case, if a brand new GC should be born, I believe it makes more sense to fork the runtime and then adapt the API if it gains traction, rather than the other way around.
Just to add one data point to the conversation: a few years ago I was working for Criteo, which serves targeted ads on the web. We had high-throughput, low-latency .NET services, and every GC pause meant that pending requests would time out (which translates into lost revenue). Those services would definitely have benefitted from a low-latency GC. But Criteo is probably the exception rather than the norm.
This sounds like a trade-off between latency and throughput. If we had more flexibility to configure these, would it be possible to also support a reference-counting GC? (For PoC and very specialized scenarios.)
Right, it is best to run experiments. Create a fork, try to prototype different ideas, and worry about doing it properly only once you find something with promising results.
Not necessarily. I had an old pet project around (https://github.com/VSadov/Satori). It is a small GC that can do low pauses. Satori GC has classic Dijkstra-style write barriers, so throughput is comparable to the default CoreCLR GC. Everything allocation-sensitive eventually gets gated by how fast you can zero out memory, so this is unsurprising.

Satori GC has a generational, incremental, concurrent design. All major GC phases that are proportional to the heap size run concurrently with the application threads - all except compaction, which would require a read barrier to run concurrently. But compaction is just an optional thing that a GC can do, not something it has to. Native allocators do not compact and do just fine. Some other GCs (e.g. the Go GC) do not compact either. Generally, though, there is some degree of space-for-latency trade-off involved. Unlike Go, Satori can flip compaction on/off dynamically.
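For readers unfamiliar with the term, a Dijkstra-style insertion barrier can be sketched roughly like this (hypothetical pseudocode in C# syntax; `IsMarking`, `IsMarked`, and `MarkGrey` are invented illustrative helpers, not Satori's actual internals):

```csharp
// Sketch only: a Dijkstra-style "insertion" write barrier. On every
// reference store during a concurrent mark phase, the *new* target is
// shaded so the collector cannot lose track of it. Reads stay cheap
// (no read barrier), which is why throughput can stay close to that
// of a non-concurrent GC.
static void WriteBarrier(ref object field, object newValue)
{
    if (IsMarking && newValue != null && !IsMarked(newValue))
        MarkGrey(newValue); // queue the object for concurrent tracing

    field = newValue; // the actual store
}
```

Concurrent compaction is harder precisely because it would require intercepting reads as well, so a mutator never observes a stale address of a moved object.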
As I saw earlier in this discussion, there was a reference to the Unity "hiccupometer" sample: https://gist.github.com/jechter/2730225240163a806fcc15c44c5ac2d6

I tried that with Satori. It is interesting.

```csharp
static void Main(string[] args)
{
    GCSettings.LatencyMode = GCLatencyMode.LowLatency;
    for (; ; )
    {
        float maxMs = 0;
        UpdateLinkedLists(kNumLinkedLists);
        Stopwatch totalStopWatch = new Stopwatch();
        Stopwatch frameStopWatch = new Stopwatch();
        totalStopWatch.Start();
        for (int i = 0; i < kNumFrames; i++)
        {
            frameStopWatch.Start();
            UpdateLinkedLists(kNumLinkedListsToChangeEachFrame);
            frameStopWatch.Stop();
            if (frameStopWatch.ElapsedMilliseconds > maxMs)
                maxMs = frameStopWatch.ElapsedMilliseconds;
            frameStopWatch.Reset();
        }
        totalStopWatch.Stop();
        Console.WriteLine($"Max Frame: {maxMs}, Avg Frame: {(float)totalStopWatch.ElapsedMilliseconds / kNumFrames}");
    }
}
```

When I run this locally (MacBook Pro with Apple M1, arm64), I see:
It looks like the benchmark simulates a game that draws frames one after another and measures the max/avg time per frame. However, the benchmark does not really draw anything; each frame just allocates a bunch of garbage data structures. What we see here is that the "game" runs at ~14,000 frames per second! The GC seems able to keep up with the garbage, and the longest frames are consistently at 2 msec.
For the standard GC I see on the same machine:
Looks like the default GC OOMs on this benchmark.
Tried the Unity sample on a bigger machine which is more like a server (32 logical cores, Ryzen 7950X, win-x64).
For comparison, default GC with
There are half-second pauses, but the throughput is a bit better than with Satori.
Right. Tuning for this benchmark is possible, but it may come at a cost to more general apps. For example, to improve throughput at some cost to heap size, one can dial up the aggressiveness of collections
Also, Satori Gen0 collects only thread-local objects, but in this benchmark every single object escapes its allocating thread, so Gen0 is pointless.
In theory, I could make it turn off Gen0 automatically when it sees a workload that does not benefit from it. Not sure how common that would be. Just one of many NYI ideas... So - it may not be a good idea for a real app, but for this benchmark, turning the above knobs results in throughput comparable to the default GC with
Commit size while running this benchmark:
Yes. That is the only way really. If I search for
BTW: if someone is interested in trying Satori, here are some things to know:
There could be other ways, but this is what I do. |
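For context (a hedged sketch, not instructions from this thread): .NET supports loading a standalone GC implementation at startup, so a custom GC built as its own `clrgc` library can typically be swapped in via configuration, without rebuilding the whole runtime. Exact file names and paths below are illustrative.

```shell
# Assumption: you have built or downloaded a standalone GC binary
# (e.g. a custom clrgc.dll / libclrgc.so) matching your runtime version.

# Option 1: environment variable, GC file placed next to the runtime.
export DOTNET_GCName=libclrgc.so
dotnet run -c Release

# Option 2: equivalently, in the app's runtimeconfig.json:
#   "configProperties": { "System.GC.Name": "libclrgc.so" }
```

This is the same loading mechanism the runtime uses for its own alternative GC builds, which makes out-of-tree GC experiments practical to distribute.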
@VSadov watch out for notification spam... you might want to just edit your last post sometimes instead of adding a new one to avoid notifying everyone again (which my post ironically does as well, but you know...).
Very glad to see this thread come up in my emails once more; I have been following for the last couple of days. I tried the Satori GC and can corroborate the low pause time statistic with our game.
Not to pile on, but I am also pretty excited to test this out. My domain is financial transaction processing, and this seems promising for lowering our P99 latency.
I have to say the Satori GC is incredible (credits to @VSadov). This is the benchmark result on my machine:
All tests were run on .NET 8 on a Windows x64 machine with 24 CPU cores (i7-13700K) and 64 GB of RAM. Server GC:
DATAS:
Workstation GC:
Workstation GC + LowLatency:
Satori GC:
Satori GC + LowLatency:
Satori GC managed to achieve better throughput than Server GC, and the latency is really low (even 10x~20x better than DATAS), while the memory footprint is even smaller than DATAS! I can't wait to see Satori productized as the default GC of .NET. cc: @Maoni0
@VSadov fantastic work on Satori!
We've already seen a number of great examples of applications that'd benefit from a GC like that. I'd like to offer one more, just for the sake of the debate - as, of course, many individually niche scenarios could ultimately add up to significant overall demand/use!

I work a lot on backends serving UIs, but everything is fully streaming based. Much of the backend boils down to joining and transforming streaming data sets. By the nature of things, that's pretty memory hungry - e.g. streaming joins mean you need to keep lots of data readily accessible (i.e. often: in memory). It's very common for apps to run north of 10GB of memory, and at times multiples of that. If you're unlucky, you're dealing with fast-moving streams too, so allocation rates can be significant as well.

Overall, C#/.NET is an absolute workhorse here - in spite of the demanding problem, median response times in the tens to low 100ms range are often quite achievable even for rather complex queries spanning/joining dozens of systems. P99s are a much bigger challenge, though. I've seen a bunch of cases where responses took >5 seconds, just because one of the involved systems was blocked by a massive GC (and one can't really fault the GC either; compacting a gen2 that totals tens of GBs full of mostly surviving objects is not a lightweight task). And in some ways we're even more unlucky than that, because if our query spans 10 different services, the overall response is often only as fast as the weakest link - so you really add up the probabilities of every involved service having a p99-style response.

A zero or near-zero pause GC could be very interesting to try here. Hard to say without being able to try, of course, but I'd expect I'd very much be able to tolerate lower throughput in favor of avoiding huge pauses. Lower overall throughput could likely often be mitigated by just throwing a few more pods at it, which is cheap.
And even increasing p50 response times is perfectly OK - adding a few ms to the p50 will be hardly noticeable, but having fewer or no cases where some UI takes multiple seconds to load would be very noticeable!
It should be noted that a single benchmark isn't the type of thing that would drive that decision. Much like with locks, collection types, and other scenarios, there are a lot of considerations in picking what the default memory management library (in this case the GC) should be. This is why Java has so many different GCs and why .NET has as many knobs and configuration options as it does. That is, a given GC being ideal for low-latency scenarios like games won't necessarily be ideal for other types of applications. The default needs to strike a balance for all applications. The current GC has years of investment into being such an ideal default for most applications.

It's also worth noting that the benchmark as given isn't exactly representative or realistic. While it does help showcase some worst possible potential of a GC, it isn't necessarily going to cleanly map to how a GC will perform in a real game. Even a raw C++ game using OGL, Vulkan, or D3D12 and doing nothing but clearing the render target (with proper frame buffering, etc.) will be limited to around 10k fps on a modern machine, just due to the general overhead of the message pump, dispatching work to the GPU, etc. A real game will be running at a fraction of that speed, because it will be doing real logic, data management, rendering of thousands to millions of vertices, etc. While naive logic might do the worst possible thing and create tons of new allocations every frame with some throwaway, a real game and game engine is more likely to be using pooling and arenas, and to be generally mindful of wasteful garbage. It won't do this throughout the entire app, but it will do it in the most crucial sections, and those will be highlighted by basic profiling and hot-spot analysis. This will reduce the impact of the worst-case scenario and often keep things at manageable levels.
These are the same kinds of considerations you have when writing a game and/or game engine in C/C++ because

That isn't to say that a low-latency, incremental, and/or pauseless GC wouldn't still benefit such scenarios; just that it may not actually pan out to the type of savings that the worst-case scenario is highlighting.
On my machine, Satori easily wins this particular benchmark. It is amazing how it can maintain a lower memory footprint than Server GC while having faster throughput and much faster collections. It would be good to have it benchmarked with different kinds of applications, especially web servers, where short-lived few-frame allocations mix with long-lasting ones that span network calls. I believe such behavior can cause gen0 fragmentation, and it's important to have a GC that can handle such a scenario. I wonder if someone could make a repo with different kinds of GC-intensive workloads that showcase more differences between these GCs.

Also, I'm wondering why workstation GC performs really poorly (7-10 times slower than Satori) while maintaining a much lower memory footprint. Given the single-threaded nature of this benchmark, I would have assumed workstation would be a fine competitor.

Continuing @tannergooding's response a little bit, I think people should understand that a GC is "just" a memory allocator. It is not magic. It has its own pros and cons, its own success and failure patterns, and while different GCs can be suitable for different kinds of tasks, that doesn't mean one will magically solve your problems. It's much better to start caring about your code - in this particular case, memory management - rather than hope for a miracle that might never come.
@tannergooding perhaps you missed a comment from @smoogipoo: #96213 (comment) As long as @smoogipoo meant osu! (as mentioned earlier, #96213 (comment)), osu! sounds like one of the best gamedev representatives, and its developers are willing to experiment. (I'm just a random gamedev guy who is concerned about the GC's unpredictable impact on somewhat weak devices like the Nintendo Switch.)
@tannergooding It seems that sunk costs have affected your judgment. Rational decision-making should not be based only on what has been in use for a long time.
It's not a question of "sunk costs". It's a question of what a good default for the majority of the ecosystem is. Having multiple different GCs can be goodness; it is why Java has many. It is why .NET has the Workstation and the Server version, why it has had other splits in the past, why there are various configuration knobs for such GCs, etc. Adding in another GC that is one or more of
-- Notably, that also isn't to say that such a GC cannot be a good default. Just that it would require significantly more investigation to prove such a point. Rather simply, other ecosystems have almost universally decided that such GCs are not a good default for them. I would expect that .NET would come to the same conclusion for much the same reasons. Domains such as games and certain types of services might benefit from such a GC and would then need to opt in. If it was somehow found to be the right choice for the .NET default, then the other GCs would likely need to continue existing so that apps where it isn't the right choice can opt into one of the other GCs that is correct for them.
Java has gone through multiple GC implementations throughout its history and is in the process of migrating to a new default, ZGC, which is a low-pause design. The Satori GC results are really impressive; hopefully I'll get to test them with a couple of our workloads later, because much lower pause latency at no (or minor) throughput cost sounds like a no-brainer. Even if the runtime team decides not to entertain this further, there are enough people in this discussion to possibly maintain an out-of-tree implementation. Thank you @VSadov for publishing it; I really hope this will help push whatever GC choices .NET offers into a more competitive position. We didn't know this was actually feasible, but now that we do, everyone is hungry for more 😄
Very thankful to see movement on this thread. The results look promising (and as already mentioned by @smoogipoo, we are considering deploying to production with what we're seeing). I'd be very interested to see benchmarks crafted to perform worse on the new GC implementation, i.e. showing its limitations or drawbacks, if they exist. That would help me assess, as a product owner, what could potentially go wrong, and could also help drive the ongoing discussion of whether it could be considered a "replacement" or exist as an "alternative" for specific use cases like games.
Is there any reason why something like the Pauseless Garbage Collector which exists for Java from Azul was never implemented for .NET?
https://www.azul.com/products/components/pgc/
https://www.artima.com/articles/azuls-pauseless-garbage-collector