
[mono-runtime] runtime build on mainline is taking very long time on s390x #114389


Closed
Vishwanatha-HD opened this issue Apr 8, 2025 · 23 comments

@Vishwanatha-HD

Vishwanatha-HD commented Apr 8, 2025

Description

The runtime builds successfully on s390x, but it takes a very long time (~10-12 hours). This is happening only on s390x; other architectures are fine.

As soon as the build.sh command is triggered, it gets stuck at the "Determining projects to restore..." stage for almost 4 hours before the next output appears on the console.

  • ./build.sh /p:UsingToolMicrosoftNetCompilers=false /p:NoPgoOptimize=true --portablebuild false /p:DotNetBuildFromSource=true /p:DotNetBuildSourceOnly=true /p:DotNetBuildTests=true --cmakeargs -DCLR_CMAKE_USE_SYSTEM_BROTLI=true --cmakeargs -DCLR_CMAKE_USE_SYSTEM_ZLIB=true --runtimeconfiguration Release --librariesConfiguration Debug /p:PrimaryRuntimeFlavor=Mono --warnAsError false --subset clr+mono+libs+host+packs+libs.tests
    Determining projects to restore...
    Restored /home/redhat/.nuget/packages/microsoft.dotnet.arcade.sdk/10.0.0-beta.25203.1/tools/Tools.proj (in 1.05 sec).
    Determining projects to restore...

<<<<< The console stays active here, but no output appears for about 4 to 4.5 hours at this stage >>>>>

Attaching a text file that shows the processes and threads, and their states, while execution is stalled for around 4 hours at the "Determining projects to restore..." stage.

dotnet-mainline-runtime-execution-info-on-s390x.txt

After this stage, it takes another 6 to 8 hours to complete the runtime build and run the tests.

Reproduction Steps

This is easily reproduced; the same behaviour is seen every time the runtime is built on s390x.

Command:

  • ./build.sh /p:UsingToolMicrosoftNetCompilers=false /p:NoPgoOptimize=true --portablebuild false /p:DotNetBuildFromSource=true /p:DotNetBuildSourceOnly=true /p:DotNetBuildTests=true --cmakeargs -DCLR_CMAKE_USE_SYSTEM_BROTLI=true --cmakeargs -DCLR_CMAKE_USE_SYSTEM_ZLIB=true --runtimeconfiguration Release --librariesConfiguration Debug /p:PrimaryRuntimeFlavor=Mono --warnAsError false --subset clr+mono+libs+host+packs+libs.tests

Expected behavior

The runtime should build within 2 hours, as it does on other architectures.

Actual behavior

The time taken to build the runtime is extremely long (~10-12 hours).

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

Attaching the runtime build-time trend, which clearly shows an increase after Feb 15th 2025. The times are as follows:

  1. Until Feb 15th: ~1 hr 40 mins
  2. Feb 17th: increased to 4 hrs 50 mins
  3. Mar 1st onwards: ~10 hrs

Attaching screenshots of the time-taken trend on both s390x and x86 machines.

When building manually on the Red Hat Beaker machines, it now takes ~12 hours. The same behaviour is seen on different machines as well.

s390x job trends: (Post Feb 17th 2025)
Image

s390x job trends: (Before Feb 15th 2025)
Image

On x86-64, the Mono-based runtime build takes less than 2 hours to complete.

x86-64 (mono) job trend:
Image

dotnet-issue-labeler bot added the needs-area-label label Apr 8, 2025
@Vishwanatha-HD
Author

Vishwanatha-HD commented Apr 8, 2025

Hi all, please take a look at this issue. Thanks.
@uweigand @omajid @giritrivedi @saitama951 @medhatiwari @tmds @iii-i

dotnet-policy-service bot added the untriaged label Apr 8, 2025
@Vishwanatha-HD
Author

Attaching the console log text file, which has the complete execution log. Please have a look. Thanks.

console_log.txt

@Vishwanatha-HD
Author

I tried building the runtime on the mainline branch using the cross-build SDK (i.e. using the Stage1 workspace). Even with that, execution stalled for 3.5 hours at the "Determining projects to restore..." step.

  • ./build.sh /p:UsingToolMicrosoftNetCompilers=false /p:NoPgoOptimize=true --portablebuild false /p:DotNetBuildFromSource=true /p:DotNetBuildSourceOnly=true /p:DotNetBuildTests=true --cmakeargs -DCLR_CMAKE_USE_SYSTEM_BROTLI=true --cmakeargs -DCLR_CMAKE_USE_SYSTEM_ZLIB=true --runtimeconfiguration Release --librariesConfiguration Debug /p:PrimaryRuntimeFlavor=Mono --warnAsError false --subset clr+mono+libs+host+packs+libs.tests
    Determining projects to restore...
    Restored /home/redhat/.nuget/packages/microsoft.dotnet.arcade.sdk/10.0.0-beta.25206.1/tools/Tools.proj (in 1.18 sec).
    Determining projects to restore...

<<<<<<<< Execution was halted at this point for 3.5hrs, before continuing. >>>>>>>>>>>>>>>>>>>>>>>

Restored /home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Diagnostics.TraceSource/src/System.Diagnostics.TraceSource.csproj (in 1.2 sec).
Restored /home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Diagnostics.TraceSource/ref/System.Diagnostics.TraceSource.csproj (in 1.19 sec).
Restored /home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Diagnostics.TextWriterTraceListener/src/System.Diagnostics.TextWriterTraceListener.csproj (in 4 ms).
Restored /home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Diagnostics.TextWriterTraceListener/ref/System.Diagnostics.TextWriterTraceListener.csproj (in 3 ms).

am11 removed the needs-area-label label Apr 15, 2025
@Vishwanatha-HD
Author

I tried the dotnet runtime build again on mainline. There is some change in the behaviour this time.

Execution still hangs at the restore stage for around 3.5 hours, but soon after that the build now fails with the error below. I am attaching the build console log for your reference.

Restored /home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Net.Sockets/src/System.Net.Sockets.csproj (in 29 ms).
Restored /home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Net.Sockets/ref/System.Net.Sockets.csproj (in 1 ms).
/home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Net.ServerSentEvents/tests/System.Net.ServerSentEvents.Tests.csproj : error NU1511: Warning As Error: A ProjectReference cannot be pruned, System.Net.ServerSentEvents. [/home/redhat/jenkins-scripts/ci/runtime/Build.proj]
Restored /home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Net.Sockets/tests/FunctionalTests/System.Net.Sockets.Tests.csproj (in 8 ms).
Restored /home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Net.ServerSentEvents/src/System.Net.ServerSentEvents.csproj (in 2 ms).
Restored /home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Net.ServerSentEvents/ref/System.Net.ServerSentEvents.csproj (in 2 ms).
Restored /home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Net.Security/tests/UnitTests/System.Net.Security.Unit.Tests.csproj (in 33 ms).
Failed to restore /home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Net.ServerSentEvents/tests/System.Net.ServerSentEvents.Tests.csproj (in 7 ms).

dotnet_runtime_mainline_build_failure_Apr_21_2025.txt

@giritrivedi
Contributor

Use /p:RestoreEnablePackagePruning=false to get rid of the pruning-related errors.

@Vishwanatha-HD
Author

I ran the runtime build with the flag set to false, i.e. "/p:RestoreEnablePackagePruning=false", as suggested by Giri. After doing that, I am running into new issues, as shown below.

/home/redhat/.nuget/packages/microsoft.dotnet.arcade.sdk/10.0.0-beta.25216.2/tools/TargetFrameworkFilters.BeforeCommonTargets.targets(86,5): error : - System.Threading.Tasks [/home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/System.Text.RegularExpressions.Tests.csproj::TargetFramework=net10.0]​

/home/redhat/.nuget/packages/microsoft.dotnet.arcade.sdk/10.0.0-beta.25216.2/tools/TargetFrameworkFilters.BeforeCommonTargets.targets(86,5): error : Consult the project.assets.json files to find the parent dependencies. [/home/redhat/jenkins-scripts/ci/runtime/src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/System.Text.RegularExpressions.Tests.csproj::TargetFramework=net10.0]​

@tmds
Member

tmds commented Apr 24, 2025

  Determining projects to restore...
  Restored /home/tester/.nuget/packages/microsoft.dotnet.arcade.sdk/10.0.0-beta.25217.1/tools/Tools.proj (in 2.06 sec).
  Determining projects to restore...
<-- HOURS GO BY -->
  Restored Xxx
  Restored Yyy

@ViktorHofer do you have a suggestion on how we can get some visibility on what goes on during this long delay?

Note: we're seeing the issue on s390x and ppc64le; both use Mono.

@ViktorHofer
Member

We now produce a repo-level static graph restore binlog as part of the build. You should find it under artifacts\log\<config>\Restore-....binlog. Do you have access to that? This requires passing the -bl switch, of course.

@tmds
Member

tmds commented Apr 25, 2025

@ViktorHofer I've captured a log. This is on ppc64le where the operation takes 30+ minutes on the machine I used (for .NET 9 builds it takes a minute or two). I didn't see anything that stood out in the log. Can you take a look and see if it tells you something? log.tar.gz

@ViktorHofer
Member

Image

These numbers are way off from the common path. Super weird. I don't understand how collecting these items can take that long with static graph. Given that all of these are MSBuild tasks, I wonder if MSBuild is somehow slow on s390x.

cc @jeffkl for static graph restore and @rainersigwald for msbuild

@tmds
Member

tmds commented Apr 25, 2025

That's also how I read the log: nothing in particular is slow; everything is much slower than it is supposed to be.

Since this used to be fast, I want to see if I can find what made it slow.

From looking at the CI results, I think it's something that landed in the VMR on Feb 26th or a few days after that. I'm not 100% sure about this date yet.

@ViktorHofer
Member

Is it just the restore step or is the build significantly slower as well?

@tmds
Member

tmds commented Apr 25, 2025

I think the build itself is not affected, or is far less affected, than what we see during restore. I haven't measured this. The pause after Determining projects to restore... is very clear, and once the Restoring ... messages start, the build seems happily on its way.

@tmds
Member

tmds commented Apr 30, 2025

git bisect tells me the regression occurred in dotnet/dotnet@24b7d62.

This contains a sizable msbuild change: https://github.com/dotnet/msbuild/compare/405618191cdc903ccf2bbf23e239b7dc369bdb3e..63aefc3dc0984823dee39864b6d825681fd33801.

@rainersigwald do you have some thoughts on which of these changes may be causing this issue?

@uweigand
Contributor

uweigand commented May 2, 2025

I've now had a look at the backtraces captured in the original dotnet-mainline-runtime-execution-info-on-s390x.txt file. First of all, this looks completely different from the mono_loader_lock deadlocks we were seeing in #93686. In fact, it doesn't look like a deadlock at all.

In the backtrace, the only thread that is actually doing anything is thread 10 of process 145253 (which is running sdk/10.0.100-preview.4.25207.1/NuGet.Build.Tasks.Console.dll). This thread is inside major_iterate_objects in src/mono/mono/sgen/sgen-marksweep.c:941, which is called from a C# System.GC:GetTotalMemory routine. The logic seems to iterate over every allocated block.

Some observations:

  • The GetTotalMemory call is issued from Stats:ExecutionStopped in Microsoft.Build.Execution.TaskRegistry:RegisteredTaskRecord, which in turn is called from Microsoft.Build.BackEnd.TaskBuilder:ExecuteBucket (a hedged sketch of this call pattern follows this list)
  • That call was in fact inserted into MSBuild via one of the commits in the range identified above, which explains why this wasn't seen earlier
  • I have no idea how long this GetTotalMemory call is supposed to take in Mono. I'm also not sure how frequently this call is now issued from this new MSBuild code. Maybe the call is simply implemented much more efficiently in CoreCLR?
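
For illustration only, here is a minimal sketch of that call pattern; the type and member names are invented and do not correspond to the real MSBuild sources. The point is that sampling GC.GetTotalMemory(false) every time a task finishes means the sgen heap walk is repeated on each sample.

    // Hypothetical sketch of the call pattern described above; names are
    // invented and do not match the actual MSBuild code.
    using System;

    sealed class TaskExecutionStats
    {
        public long HeapBytesAtStop { get; private set; }

        // Invoked whenever a task finishes executing. On Mono, GC.GetTotalMemory(false)
        // reaches sgen's major_iterate_objects, which walks every allocated block,
        // so the full heap walk is repeated on every call.
        public void ExecutionStopped()
        {
            HeapBytesAtStop = GC.GetTotalMemory(forceFullCollection: false);
        }
    }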

@Vishwanatha-HD or @giritrivedi can you confirm (e.g. by running the build under perf) that the majority of time is spent in the major_iterate_objects routine? Maybe you can also find out (either in the debugger or via instrumentation) how frequently it is called and what typical iteration counts within the routine are (number of blocks).

Can someone familiar with the Mono GC logic comment on whether this routine is known to be inefficient?

@tmds
Member

tmds commented May 5, 2025

Yes, the GetTotalMemory that was added in dotnet/msbuild#11359 is the source of the slow-down.

This method is called to gather metrics. Perhaps, as a workaround, we can avoid calling it on Mono-based runtimes. @rainersigwald wdyt?

The logic seems to iterate over every allocated block.

This could mean the slow-down becomes more significant when the system has more RAM.
(With more RAM, there may be fewer GCs, leading to more blocks to iterate over.)

Mono has some environment variables that control the heap size; we can see if setting those has an effect.

Can someone familiar with the Mono GC logic comment on whether this routine is known to be inefficient?

@ViktorHofer do you have a suggestion for who this might be? Perhaps we can also get some input on how this is implemented in CoreCLR and whether that may be an option for Mono as well.

@ViktorHofer
Member

do you have a suggestion for who this might be? Perhaps we can also get some input on how this is implemented in CoreCLR and whether that may be an option for Mono as well.

@akoeplinger would you know the right contact on the Mono runtime team?

@tmds
Member

tmds commented May 5, 2025

Yes, the GetTotalMemory that was added in dotnet/msbuild#11359 is the source of the slow-down.

@JanKrivanek I wonder if it wouldn't make more sense to call GC.GetTotalAllocatedBytes instead of GC.GetTotalMemory?

And that one would presumably be much faster, as I assume it counts bytes as allocations are made instead of having to determine the current heap size on request.
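
For a concrete comparison, here is a minimal, self-contained sketch (not taken from MSBuild) that times both calls after building up a live heap; the object counts and timings are illustrative only. Per the backtraces above, on Mono GC.GetTotalMemory walks the allocated blocks, while GC.GetTotalAllocatedBytes reads counters maintained as allocations happen.

    // Minimal sketch using only the public System.GC API; numbers are illustrative.
    using System;
    using System.Diagnostics;

    class GcMetricCost
    {
        static void Main()
        {
            // Build up a sizable live heap so a heap walk has real work to do.
            var keepAlive = new byte[200_000][];
            for (int i = 0; i < keepAlive.Length; i++)
                keepAlive[i] = new byte[256];

            var sw = Stopwatch.StartNew();
            long total = GC.GetTotalMemory(forceFullCollection: false); // heap walk on Mono (sgen)
            sw.Stop();
            Console.WriteLine($"GetTotalMemory:         {total,14:N0} bytes, {sw.Elapsed.TotalMilliseconds:F3} ms");

            sw.Restart();
            long allocated = GC.GetTotalAllocatedBytes(); // running allocation counter
            sw.Stop();
            Console.WriteLine($"GetTotalAllocatedBytes: {allocated,14:N0} bytes, {sw.Elapsed.TotalMilliseconds:F3} ms");

            GC.KeepAlive(keepAlive);
        }
    }

Note that the two calls also report different things: GetTotalMemory approximates the bytes currently thought to be allocated, while GetTotalAllocatedBytes is a cumulative count of bytes allocated over the process lifetime, which is arguably the right semantics for an allocation metric.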

@tmds
Member

tmds commented May 5, 2025

Mono has some environment variables that control the heap size; we can see if setting those has an effect.

I tried running with export MONO_GC_PARAMS=max-heap-size=4g but there was no difference.

@tmds
Member

tmds commented May 5, 2025

I wonder if it wouldn't make more sense to call GC.GetTotalAllocatedBytes instead of GC.GetTotalMemory?

I think those are the expected semantics for this metric so I made a PR for it: dotnet/msbuild#11788.

And as a consequence of using GetTotalAllocatedBytes, the slow-down should also be fixed.

@pavelsavara
Member

@tmds you can disable the telemetry in your CI; see the hints in

dotnet/msbuild#11337 (comment)

cc @JanProvaznik

@giritrivedi
Contributor

@uweigand I have run perf on the restore. The majority of the time is spent in the "RequestBuilder" command, which uses major_get_used_size, which in turn invokes major_iterate_objects.

Attaching a screenshot of the perf report.

Image

@tmds
Member

tmds commented May 6, 2025

Once the MSBuild change makes its way to the VMR, the slow-down will be fixed.

tmds closed this as completed May 6, 2025
dotnet-policy-service bot removed the untriaged label May 6, 2025