[mono-runtime] runtime build on mainline is taking very long time on s390x #114389
Comments
Hi all, please take a look at this issue. Thanks.
Attaching the console log text file, which has the complete execution logs. Please have a look. Thanks.
I tried building the runtime on the mainline branch using the cross-build SDK (i.e. using the Stage1 workspace). Even then, execution stalled for 3.5 hours at the "Determining projects to restore..." step of dotnet restore.
<<<<<<<< Execution was halted at this point for 3.5hrs, before continuing. >>>>>>>>>>>>>>>>>>>>>>>
Restored /home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Diagnostics.TraceSource/src/System.Diagnostics.TraceSource.csproj (in 1.2 sec).
I tried the dotnet runtime build again on mainline, and the behaviour changed this time. Execution still hangs in the restore step for around 3.5 hours, but soon afterwards the build now fails with the error below. I am attaching the build console log for your reference.
Restored /home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Net.Sockets/src/System.Net.Sockets.csproj (in 29 ms).
Use /p:RestoreEnablePackagePruning=false to get rid of the pruning-related errors.
I ran the runtime build with the flag set to false, i.e. "/p:RestoreEnablePackagePruning=false", as suggested by Giri. After doing that, I am running into new issues, as shown below:
/home/redhat/.nuget/packages/microsoft.dotnet.arcade.sdk/10.0.0-beta.25216.2/tools/TargetFrameworkFilters.BeforeCommonTargets.targets(86,5): error : - System.Threading.Tasks [/home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/System.Text.RegularExpressions.Tests.csproj::TargetFramework=net10.0]
/home/redhat/.nuget/packages/microsoft.dotnet.arcade.sdk/10.0.0-beta.25216.2/tools/TargetFrameworkFilters.BeforeCommonTargets.targets(86,5): error : Consult the project.assets.json files to find the parent dependencies. [/home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/System.Text.RegularExpressions.Tests.csproj::TargetFramework=net10.0]
@ViktorHofer do you have a suggestion on how we can get some visibility into what goes on during this long delay? Note: we're seeing the issue on s390x and ppc64le, both of which use Mono.
We now produce a repo static graph restore binlog as part of the build. You should find it under |
@ViktorHofer I've captured a log. This is on ppc64le where the operation takes 30+ minutes on the machine I used (for .NET 9 builds it takes a minute or two). I didn't see anything that stood out in the log. Can you take a look and see if it tells you something? log.tar.gz |
These numbers are waaaaaay off from the common path. Super weird. I don't understand how collecting these items can take that long with static graph. Given that all these are msbuild tasks I wonder if msbuild is somehow slow on s390x. cc @jeffkl for static graph restore and @rainersigwald for msbuild |
That's also how I read the log: nothing is slow in particular, everything is much slower than it is supposed to be. Since this used to be fast, I want to see if I can find what made it slow. From looking at the CI results, I think it's something that landed in the vmr on Feb 26th or some days after that. I'm not 100% sure about this date yet.
Is it just the restore step or is the build significantly slower as well? |
I think the build itself is not or far less affected than what we see during restore. I haven't measured this. The pause after |
This contains a sizable msbuild change: https://github.com/dotnet/msbuild/compare/405618191cdc903ccf2bbf23e239b7dc369bdb3e..63aefc3dc0984823dee39864b6d825681fd33801. @rainersigwald do you have some thoughts on which of these changes may be causing this issue? |
I've now had a look at the backtraces captured in the original dotnet-mainline-runtime-execution-info-on-s390x.txt file. First of all, this looks completely different than the In the backtrace, the only thread that is actually doing anything is thread 10 of process 145253 (which is running Some observations:
@Vishwanatha-HD or @giritrivedi can you confirm (e.g. by running the build under Can someone familiar with the Mono GC logic comment on whether this routine is known to be inefficient? |
Yes, the This method is called for gathering metrics. Perhaps we can (as a workaround) avoid calling it on Mono based runtimes. @rainersigwald wdyt?
This could mean the slowdown becomes more significant when the system has more RAM. Mono has some environment variables that control the heap size; we can see if setting those has an effect.
@ViktorHofer do you have some suggestion who this might be? Perhaps we can also get some inputs on how this is implemented on CoreCLR and whether that may be an option for Mono as well. |
@akoeplinger would you know the right contact on the Mono runtime team? |
@JanKrivanek I wonder if it wouldn't make more sense to call GC.GetTotalAllocatedBytes instead of GC.GetTotalMemory? And, that one would presumably be much faster as I assume it counts as allocations are made instead of having to determine the current heap size on request. |
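To make the difference between the two calls concrete, here is a minimal C# sketch. It is not the MSBuild telemetry code: the class name and the allocation loop are made up for illustration, and the timings it prints will vary by runtime and machine. GC.GetTotalMemory reports an estimate of the bytes currently allocated on the managed heap, while GC.GetTotalAllocatedBytes reports a running total of bytes allocated over the lifetime of the process, so the two values measure different things and are not interchangeable.

```csharp
using System;
using System.Diagnostics;

// Hypothetical demo class; not part of MSBuild or the runtime.
static class GcMetricSketch
{
    static void Main()
    {
        // Create some short-lived garbage so both counters have something to report.
        for (int i = 0; i < 100_000; i++)
        {
            _ = new byte[128];
        }

        var sw = Stopwatch.StartNew();
        // Estimate of the bytes currently allocated on the managed heap.
        // Passing false means no full collection is forced first.
        long heapBytes = GC.GetTotalMemory(forceFullCollection: false);
        sw.Stop();
        Console.WriteLine($"GC.GetTotalMemory:         {heapBytes} bytes ({sw.Elapsed.TotalMilliseconds} ms)");

        sw.Restart();
        // Running total of bytes allocated since the process started.
        // precise: false asks for the cheaper, approximate value.
        long allocatedBytes = GC.GetTotalAllocatedBytes(precise: false);
        sw.Stop();
        Console.WriteLine($"GC.GetTotalAllocatedBytes: {allocatedBytes} bytes ({sw.Elapsed.TotalMilliseconds} ms)");
    }
}
```

Running this on both a CoreCLR-based and a Mono-based runtime would be one way to check whether the first call really is the expensive one on Mono, as suspected above.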
I tried running with |
I think those are the expected semantics for this metric so I made a PR for it: dotnet/msbuild#11788. And as a consequence of using |
@tmds you can disable the telemetry in your CI see hints in |
@uweigand I have run perf on the restore. The majority of the time is spent in the "RequestBuilder" command which uses Attaching the screenshot of the perf report.
Once the msbuild change makes its way to the vmr, the slowdown will be fixed.
Description
The runtime builds successfully on s390x, but it is taking a very long time (~10 to 12 hours). This is happening only on s390x; other architectures are fine.
As soon as the build.sh command is triggered, it gets stuck at the "Determining projects to restore..." stage for almost 4 hours before the next line is printed to the console.
Determining projects to restore...
Restored /home/redhat/.nuget/packages/microsoft.dotnet.arcade.sdk/10.0.0-beta.25203.1/tools/Tools.proj (in 1.05 sec).
Determining projects to restore...
<<<<< console will be active here, but no prints come out for almost 4hrs to 4.5hrs at this stage >>>>>>>>>>>>>>>>>>>
Attaching a text file which shows the processes and threads and their state while execution is stalled for around 4 hours at the "Determining projects to restore..." stage.
dotnet-mainline-runtime-execution-info-on-s390x.txt
After this stage, it takes another 6 to 8 hours to complete the runtime build and run the tests.
Reproduction Steps
This is easily reproduced: the same behaviour is seen every time the runtime is built on s390x.
Command:
Expected behavior
The runtime should build within 2 hours, as it does on other architectures.
Actual behavior
The time taken to build the runtime is extremely long (~10 to 12 hours).
Regression?
No response
Known Workarounds
No response
Configuration
No response
Other information
Attaching the runtime build time trend, which clearly shows an increase in the time taken after Feb 15th 2025. The time taken is as below:
Attaching screenshots of the build time trend on both s390x and x86 machines.
When building manually on the Red Hat Beaker machines, it now takes ~12 hours. The same behaviour is seen on different machines as well.
s390x job trends: (Post Feb 17th 2025)

s390x job trends: (Before Feb 15th 2025)

On x86-64, the Mono-based runtime build takes less than 2 hours to complete.
x86-64 (mono) job trend:
