Mono: ManualResetEventSlim.Wait() unexpectedly returns without blocking #115178

Open
Liangjia0411 opened this issue Apr 30, 2025 · 6 comments
Labels
area-System.Threading untriaged New issue has not been triaged by the area owner

Comments

@Liangjia0411

Issue Description

When using ManualResetEventSlim.Wait() to synchronize multi-threaded task execution, I've discovered a critical issue in the Mono environment: under high CPU load conditions, the Wait() method sometimes returns immediately without blocking, even though the associated event has not been set. This causes thread synchronization mechanisms to fail and leads to task execution errors.

Test Code

using System;
using System.Collections.Concurrent;
using System.Threading;

namespace UnrealEngine.Game;

/// <summary>
/// Task scheduler that manages a pool of worker threads for parallel execution of tasks
/// </summary>
internal class TestTaskScheduler : IDisposable
{
	// Counter used to generate unique thread names
	private static uint _ThreadNameCounter;
	// Calculate thread count as half of processor count, minimum 2
	private static readonly int _ThreadCount = Math.Max(Environment.ProcessorCount >> 1, 2);

	// Thread-safe queue for pending tasks
	private readonly ConcurrentQueue<Action> _TaskQueue = new();

	// Events to signal worker threads to start processing
	private readonly ManualResetEventSlim[] _MainResetEvent = new ManualResetEventSlim[_ThreadCount];
	// Event to signal the main thread when all tasks are complete
	private readonly ManualResetEventSlim _MainThreadResetEvent = new(false);
	// Stores the first exception encountered during task execution
	private Exception _Exceptions;

	// Flag to control worker thread lifetime
	private bool _IsActivate = true;
	// Counter for pending tasks - synchronized access via Interlocked
	private volatile int _TaskCount;
	// State flag: 0 = tasks in progress, 1 = all tasks complete
	private volatile int _SetCount = 1;
	// Flag to ensure we only capture the first exception
	private volatile int _ExceptionCount;

	/// <summary>
	/// Initializes the task scheduler and creates worker threads
	/// </summary>
	public TestTaskScheduler()
	{
		var ThreadNameCounter = Interlocked.Increment(ref _ThreadNameCounter);
		for (var I = 0; I < _ThreadCount; ++I)
		{
			_MainResetEvent[I] = new ManualResetEventSlim(false);
			var CoreThread = new Thread(_ExecFunc)
			{
				Name = $"EntityForeachThread_{ThreadNameCounter}_{I}",
				IsBackground = true, // Background threads won't prevent application exit
			};
			CoreThread.Start(I);
		}
	}

	/// <summary>
	/// Worker thread function that processes tasks from the queue
	/// </summary>
	private void _ExecFunc(object Param)
	{
		var Index = (int)Param;
		var ResetEvent = _MainResetEvent[Index];
		while (_IsActivate)
		{
			if (_TaskQueue.TryDequeue(out var Task))
			{
				try
				{
					Task.Invoke();
				}
				catch (Exception Exp)
				{
					// Only capture the first exception that occurs
					if (Interlocked.CompareExchange(ref _ExceptionCount, 1, 0) == 0)
						_Exceptions = Exp;
				}

				// Decrement task counter and signal completion if this was the last task
				if (Interlocked.Decrement(ref _TaskCount) == 0)
				{
					Interlocked.Increment(ref _SetCount); // Mark all tasks as complete
					_MainThreadResetEvent.Set(); // Signal the main thread
				}
				else
					continue; // More tasks remain, keep working
			}

			// No more tasks in queue, wait for new batch
			ResetEvent.Reset();
			ResetEvent.Wait();
		}

		ResetEvent.Dispose();
	}

	/// <summary>
	/// Adds a task to the queue and increments task counter
	/// </summary>
	protected void QueueTask(Action Exec)
	{
		Interlocked.Increment(ref _TaskCount);
		_TaskQueue.Enqueue(Exec);
	}

	/// <summary>
	/// Cleans up resources and signals worker threads to terminate
	/// </summary>
	public void Dispose()
	{
		_IsActivate = false;
		_TaskQueue.Clear();
		Interlocked.Exchange(ref _TaskCount, 0);
		for (var Index = 0; Index < _MainResetEvent.Length; Index++)
		{
			ref var ResetEvent = ref _MainResetEvent[Index];
			ResetEvent.Set(); // Wake up threads so they can exit
			ResetEvent = null;
		}

		_MainThreadResetEvent.Dispose();
	}

	/// <summary>
	/// Executes a batch of vector calculation tasks in parallel
	/// </summary>
	public void Exec()
	{
		lock (_TaskQueue)
		{
			// Initialize task count to 1 (sentinel value)
			// This prevents premature completion signal before all tasks are enqueued
			Interlocked.Exchange(ref _TaskCount, 1);

			var Rand = new Random();
			var TaskCount = Rand.Next(10, 100);
			for (var j = 0; j < TaskCount; ++j)
			{
				QueueTask(() =>
				{
					var Count = Rand.Next(1000, 5000);
					for (var k = 0; k < Count; ++k)
					{
						var V0 = new CustomFVector(Rand.NextDouble() * 1000.0, Rand.NextDouble() * 1000.0, Rand.NextDouble() * 1000.0);
						_ = V0.Length2D();
					}
				});
			}

			// Mark execution state as active (0 = tasks in progress)
			// This ensures the main thread waits until all tasks complete
			Interlocked.Decrement(ref _SetCount);
			_MainThreadResetEvent.Reset();

			// Wake all worker threads to start processing
			foreach (var ThreadEvent in _MainResetEvent)
				ThreadEvent.Set();

			// Main thread also processes tasks
			while (_TaskQueue.TryDequeue(out var Task))
			{
				Task.Invoke();
				Interlocked.Decrement(ref _TaskCount);
			}

			// Remove the sentinel task count
			// If this was the last task, signal completion
			if (Interlocked.Decrement(ref _TaskCount) == 0)
			{
				Interlocked.Increment(ref _SetCount);
				_MainThreadResetEvent.Set();
			}

			// Wait for all tasks to complete
			_MainThreadResetEvent.Wait();

			// Validation: ensure all tasks were processed correctly
			var Tc = Interlocked.CompareExchange(ref _TaskCount, 0, 0);
			var Sc = Interlocked.CompareExchange(ref _SetCount, 0, 0);
			if (Tc != 0)
				throw new Exception($"thread wait event error: TaskCount:{Tc}, SetCount:{Sc}");

			// Re-throw any exception that occurred during task execution
			if (_Exceptions == null)
				return;

			Interlocked.Exchange(ref _ExceptionCount, 0);
			var Exp = _Exceptions;
			_Exceptions = null;
			throw new Exception("Foreach Task Error", Exp);
		}
	}

	/// <summary>
	/// Custom vector implementation for computation tasks
	/// </summary>
	public struct CustomFVector
	{
		public double X;
		public double Y;
		public double Z;

		public CustomFVector(double X, double Y, double Z)
		{
			this.X = X;
			this.Y = Y;
			this.Z = Z;
		}

		public double Length2D()
		{
			return Math.Sqrt(X * X + Y * Y);
		}
	}
}

Steps to Reproduce

  1. Use the TestTaskScheduler class and call its Exec() method every frame
  2. Simultaneously run CPU-intensive work (such as compiling a project) to raise CPU load
  3. Observe the exception: when _TaskCount > 0, _MainThreadResetEvent.Wait() returns prematurely

Error Symptoms

When the issue occurs, the main thread returns from Wait() without blocking, even though tasks on other threads have not completed, triggering the following exception:

//Tick 1
System.Exception: thread wait event error: TaskCount:10, SetCount:0
//Tick 2
System.Exception: thread wait event error: TaskCount:6, SetCount:0
//Tick 3
System.Exception: thread wait event error: TaskCount:8, SetCount:0
//...

Analysis

Through debugging, I've found:

  1. At the critical code point _MainThreadResetEvent.Wait(), the method returns even though the event has not been set (confirmed by a state check)
  2. Subsequent code then executes while _TaskCount is not yet 0, because not all tasks have completed
  3. Under high CPU load, the Wait() call returns false internally instead of blocking as expected, but the void overload exposes no return value for the caller to check
  4. Repeatedly waiting after such an early return eventually succeeds, at which point _TaskCount and _SetCount recover to their expected values
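Given observation 4, one defensive workaround is to wait with the timeout overload in a loop and only proceed once the event is observably set. This is a sketch, not a fix for the underlying runtime behavior, and `WaitUntilSet` is a hypothetical helper name:

```csharp
using System.Threading;

internal static class WaitWorkaround
{
	// Hypothetical helper: retries the wait until the event is actually
	// set, tolerating early returns like the ones described above.
	public static void WaitUntilSet(ManualResetEventSlim Event)
	{
		// The timeout overload returns a bool, so an early/spurious
		// return can be detected and the wait simply retried.
		while (!Event.Wait(100))
		{
			// Timed out (or returned early) without the event being
			// set; loop and wait again.
		}
	}
}
```

Calling `WaitWorkaround.WaitUntilSet(_MainThreadResetEvent)` in place of `_MainThreadResetEvent.Wait()` in `Exec()` would mask the early return, at the cost of up to 100 ms of extra latency per spurious wakeup.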

Environment Information

  • Runtime: .NET 9.0.3
  • Platform: Windows
  • Comparison environment: CoreCLR (works correctly)

Expected Behavior

According to documentation and behavior in CoreCLR, ManualResetEventSlim.Wait() should block the calling thread until the event is set, unless using an overload with timeout parameters.

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 30, 2025

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Apr 30, 2025
@Liangjia0411 Liangjia0411 reopened this Apr 30, 2025
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 30, 2025
@srxqds
Contributor

srxqds commented May 6, 2025

@lateralusX Can you help analyze this problem?

@lateralusX
Member

Thanks for the detailed repro and investigation. I will take a look (I need to finish some other things first). Wait can return false: on Mono we support alertable waits that can be interrupted and return false. When the void Wait() overload is called, it doesn't handle that case and assumes the underlying wait will never return false, but the current implementation uses the same underlying API, which does handle interrupts/alerts and returns false in that situation.

Would it be possible to run the repro with a custom-built runtime with LOCK_DEBUG set? That will emit a bunch of debug logging around monitor locking, and if we hit a case where the wait returns false, we will see which scenario we hit.
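The mismatch described here can be modeled in plain C#. `InterruptibleWait` below is a stand-in for Mono's underlying alertable wait, not a real runtime API, and the two wrapper methods are illustrative shapes rather than the actual runtime code:

```csharp
using System.Threading;

internal static class WaitModel
{
	// Stand-in for the underlying alertable wait: may return false if
	// interrupted (modeled here with the bool-returning timeout overload).
	private static bool InterruptibleWait(ManualResetEventSlim Event, int Ms)
		=> Event.Wait(Ms);

	// Shape of the bug: the void overload assumes the underlying wait
	// never returns false, so an interrupted wait falls straight through.
	public static void WaitAssumingSuccess(ManualResetEventSlim Event)
		=> InterruptibleWait(Event, 100);

	// Robust shape: loop until the event is actually observed as set.
	public static void WaitHandlingInterrupts(ManualResetEventSlim Event)
	{
		while (!InterruptibleWait(Event, 100))
		{
			// Interrupted or timed out without the event being set: retry.
		}
	}
}
```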

@srxqds
Contributor

srxqds commented May 6, 2025

I have dumped the LOCK_DEBUG output, but I can't tell why it returns false.

The full log is attached below:

thread.log

@lateralusX
Member

lateralusX commented May 6, 2025

Thanks, I would expect some more logging from mono_monitor_wait, and the logging in the file doesn't seem to fully correspond to the checked-in source code. It should be of the form:

LOCK_DEBUG (g_message ("%s: (%d) Trying to wait for %p with timeout %dms", __func__, id, obj, ms));

and then, unless the lock is not owned (which shouldn't be the case), it should at least log:

LOCK_DEBUG (g_message ("%s: (%d) queuing handle %p", __func__, id, event));
LOCK_DEBUG (g_message ("%s: (%d) Unlocked %p lock %p", __func__, id, obj, mon));
LOCK_DEBUG (g_message ("%s: (%d) Regained %p lock %p", __func__, id, obj, mon));
LOCK_DEBUG (g_message ("%s: (%d) Success", __func__, id));

but none of that logging is present in the file.

I will run this later this week and see if I can get a local repro. If you can figure out why we don't get the additional logging, or add asserts inside mono_monitor_wait so those failure scenarios become hard failures, that would be great.

@srxqds
Contributor

srxqds commented May 7, 2025


Yes, it's a bit strange; I'll continue to look at it today.

[2025.05.07-10.20.59:226][487]MonoVM: Display: MonoLog [][message]mono_monitor_wait: (64) managedid (1) Trying to wait for 00000160B5218C88 with timeout -1ms
[2025.05.07-10.20.59:228][487]MonoVM: Display: MonoLog [][message]mono_monitor_try_enter_internal: (64) Trying to lock object 00000160BA678910 (0 ms)
[2025.05.07-10.20.59:228][487]MonoVM: Display: MonoLog [][message]mono_monitor_try_enter_internal: (64) Trying to lock object 00000160BA677850 (0 ms)
[2025.05.07-10.20.59:229][487]MonoVM: Display: MonoLog [][message]mono_monitor_try_enter_internal: (64) Trying to lock object 00000160BA6871F0 (0 ms)
[2025.05.07-10.20.59:328][487]MonoVM: Display: MonoLog [][message]mono_monitor_try_enter_internal: (76) Trying to lock object 00000160B5214760 (0 ms)
[2025.05.07-10.20.59:328][487]MonoVM: Display: MonoLog [][message]mono_monitor_try_enter_internal: (73) Trying to lock object 00000160B5211760 (0 ms)
[2025.05.07-10.20.59:358][487]MonoVM: Display: MonoLog [][message]mono_monitor_wait: (76) managedid (16) Trying to wait for 00000160B5214760 with timeout -1ms
[2025.05.07-10.20.59:358][487]MonoVM: Display: MonoLog [][message]mono_monitor_try_enter_internal: (78) Trying to lock object 00000160B5216760 (0 ms)
[2025.05.07-10.20.59:358][487]MonoVM: Display: MonoLog [][message]mono_monitor_try_enter_internal: (69) Trying to lock object 00000160B520D830 (0 ms)
[2025.05.07-10.20.59:359][487]MonoVM: Display: MonoLog [][message]mono_monitor_wait: (78) managedid (18) Trying to wait for 00000160B5216760 with timeout -1ms
[2025.05.07-10.20.59:359][487]MonoVM: Display: MonoLog [][message]mono_monitor_try_enter_internal: (64) Trying to lock object 00000160BA678910 (0 ms)
[2025.05.07-10.20.59:359][487]MonoVM: Display: MonoLog [][message]mono_monitor_wait: (69) managedid (10) Trying to wait for 00000160B520D830 with timeout -1ms
[2025.05.07-10.20.59:359][487]MonoVM: Display: MonoLog [][message]mono_monitor_try_enter_internal: (79) Trying to lock object 00000160B5217760 (0 ms)
[2025.05.07-10.20.59:359][487]MonoVM: Display: MonoLog [][message]mono_monitor_wait: (79) managedid (19) Trying to wait for 00000160B5217760 with timeout -1ms
[2025.05.07-10.20.59:372][487]MonoVM: Display: MonoLog [][message]mono_monitor_try_enter_internal: (64) Trying to lock object 00000160BA677850 (0 ms)
[2025.05.07-10.20.59:452][487]MonoVM: Display: MonoLog [][message]mono_monitor_try_enter_internal: (64) Trying to lock object 00000160BA6871F0 (0 ms)
[2025.05.07-10.20.59:452][487]MonoVM: Display: MonoLog [][message]mono_monitor_try_enter_internal: (64) Trying to lock object 00000160BA678910 (0 ms)
[2025.05.07-10.20.59:452][487]MonoVM: Display: MonoLog [][message]mono_monitor_try_enter_internal: (64) Trying to lock object 00000160BA677850 (0 ms)
[2025.05.07-10.20.59:452][487]MonoVM: Display: MonoLog [][message]mono_monitor_try_enter_internal: (64) Trying to lock object 00000160BA6871F0 (0 ms)
[2025.05.07-10.20.59:452][487]MonoVM: Display: MonoLog [][message]mono_monitor_wait: (73) managedid (13) Trying to wait for 00000160B5211760 with timeout -1ms
[2025.05.07-10.20.59:452][487]MonoVM: Error: UnhandledException: System.Exception thread wait event error: TaskCount:3, SetCount:0 thread id 1
   at UnrealEngine.Game.TestTaskScheduler.Exec() in D:\UnrealMono\ue5\UEDemo\Content\Script\Game\Tests\Runtime\TestTaskScheduler.cs:line 179
   at UnrealEngine.Game.UDemoGameInstance.OnTick(Single deltaTime) in D:\UnrealMono\ue5\UEDemo\Content\Script\Game\DemoGameInstance.cs:line 378
