@lambdageek (Member) commented Apr 6, 2018

Summary

This is a new suspend mode where threads that are running managed or native runtime code (what we call "GC Unsafe") are suspended cooperatively using safepoints, but threads running either embedder code or P/Invoke code or blocking syscalls (what we call "GC Safe") are suspended preemptively using signals.

The motivation is to make an embedder-friendly coop suspend mechanism where the runtime will do GC Safe -> GC Unsafe transitions on calls to MONO_API functions, but the embedders themselves don't have to have any special knowledge of coop safepoints or state transitions.

To try this out, set both MONO_ENABLE_COOP and MONO_ENABLE_HYBRID_COOP environment variables.

Implementation

Here's a picture of the new state machine. There are two new states, Blocking_Suspend_Requested and Blocking_Async_Suspended, and one updated transition, req_s (request suspension).

[Image: coop-state-machine]

State machine states and transitions

The hybrid cooperative suspend mechanism works like this:

  • when a thread is executing in GC Unsafe mode, we use cooperative suspend and expect
    the thread to periodically checkpoint its execution (as in the ordinary full
    coop mode).
  • when a thread is executing in GC Safe mode, we use a preemptive signal-based
    suspend. This is new - previously in full coop we would allow BLOCKING
    threads to continue executing and only suspend them when they wanted to go
    from GC Safe to GC Unsafe (via the BLOCKING_SELF_SUSPENDED state).

There are two new states and one updated transition. The updated transition
services both running and blocking threads. The idea is that the suspend mechanism
is determined by the suspend initiator and is not embodied in the state
machine. The state machine just has to return distinctive enough values from
the transitions for the initiator to select a policy.

The resume mechanism is determined by the state that a thread finds itself in when
it is resumed, and by the original suspend policy.

The primary differences between the suspend mechanisms are: preemptive
suspend is two-phase, and its second phase, the finish_async_suspension transition,
does not apply to cooperative suspend; the cooperative policy instead uses the poll
transition when the thread is running.

We renamed mono_threads_transition_request_async_suspension to mono_threads_transition_request_suspension.

  • It must be called by a suspend initiator on a victim thread that is not
    itself.

  • There can only be one suspend initiator for a victim at a time - if the
    suspend initiator initiates a suspension it must follow through with the
    whole protocol.

  • If a victim is RUNNING we transition it to ASYNC_SUSPEND_REQUESTED and return
    ReqSuspendInitSuspendRunning, which signals that the caller must initiate
    suspension for a running thread. (Same as the old AsyncSuspendInitSuspend.)

  • If a victim is BLOCKING we transition it to BLOCKING_SUSPEND_REQUESTED.
    Note that this is different from the old request_async_suspension which just
    incremented the suspend count and returned AsyncSuspendBlocking. Now we
    return ReqSuspendInitSuspendBlocking. The return of
    ReqSuspendInitSuspendBlocking means the initiator may signal the victim and
    must wait to be notified of the suspension (in preemptive suspend it will be
    notified by the signal handler, in cooperative when the blocking thread
    attempts to exit from blocking mode).

  • In BLOCKING_SUSPEND_REQUESTED we
    may increment the suspend count. This is slightly too lax if we're going to
    be using preemptive suspend on blocking threads: in that case we're in the
    middle of a two-phase suspend and since there is only one suspend initiator,
    we wouldn't expect another suspend to be initiated until we have a chance to
    finish suspending. We leave it to the suspend initiator to rule this case out,
    by returning ReqSuspendAlreadySuspendedBlocking from this state.

  • In all the non-executing states (SELF_SUSPENDED, ASYNC_SUSPENDED,
    BLOCKING_SELF_SUSPENDED, BLOCKING_ASYNC_SUSPENDED) we just
    increment the suspend count.

  • All other transitions are illegal.

  • Note that BLOCKING must always have a suspend count of 0 and
    BLOCKING_SUSPEND_REQUESTED must have suspend_count > 0.
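
The rules above can be sketched as a small single-threaded model. This is only an illustration under stated assumptions: `ToyThread`, `ToyState` and `toy_request_suspension` are hypothetical names, the real runtime packs the state and suspend count into one atomically updated word, and incrementing the count on the RUNNING case is an assumption of the toy model rather than something the description states.

```c
#include <assert.h>

/* Toy model of the thread states named in the description. */
typedef enum {
    STATE_RUNNING,
    STATE_ASYNC_SUSPENDED,
    STATE_SELF_SUSPENDED,
    STATE_ASYNC_SUSPEND_REQUESTED,
    STATE_BLOCKING,
    STATE_BLOCKING_ASYNC_SUSPENDED,
    STATE_BLOCKING_SELF_SUSPENDED,
    STATE_BLOCKING_SUSPEND_REQUESTED
} ToyState;

typedef enum {
    ReqSuspendAlreadySuspended,
    ReqSuspendAlreadySuspendedBlocking,
    ReqSuspendInitSuspendRunning,
    ReqSuspendInitSuspendBlocking
} ToyRequestSuspendResult;

typedef struct {
    ToyState state;
    int suspend_count;
} ToyThread;

/* Hypothetical, non-atomic version of the request_suspension transition. */
static ToyRequestSuspendResult
toy_request_suspension (ToyThread *victim)
{
    switch (victim->state) {
    case STATE_RUNNING:
        victim->state = STATE_ASYNC_SUSPEND_REQUESTED;
        victim->suspend_count++; /* assumption of this toy model */
        return ReqSuspendInitSuspendRunning;
    case STATE_BLOCKING:
        /* Invariant: BLOCKING always has a suspend count of 0. */
        assert (victim->suspend_count == 0);
        victim->state = STATE_BLOCKING_SUSPEND_REQUESTED;
        victim->suspend_count = 1;
        return ReqSuspendInitSuspendBlocking;
    case STATE_BLOCKING_SUSPEND_REQUESTED:
        /* Invariant: BLOCKING_SUSPEND_REQUESTED has suspend_count > 0. */
        assert (victim->suspend_count > 0);
        victim->suspend_count++;
        return ReqSuspendAlreadySuspendedBlocking;
    case STATE_SELF_SUSPENDED:
    case STATE_ASYNC_SUSPENDED:
    case STATE_BLOCKING_SELF_SUSPENDED:
    case STATE_BLOCKING_ASYNC_SUSPENDED:
        /* Not executing: just bump the count. */
        victim->suspend_count++;
        return ReqSuspendAlreadySuspended;
    default:
        assert (0 && "illegal request_suspension transition");
        return ReqSuspendAlreadySuspended;
    }
}
```

A second request against a thread that is already in BLOCKING_SUSPEND_REQUESTED only increments the count, which is the case the description calls slightly too lax for preemptive suspend.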

We add two new states:

  • STATE_BLOCKING_SUSPEND_REQUESTED - this is a new state that indicates that a
    suspend initiator wants to suspend this victim thread.
    • The thread is still executing in this state.
    • We only transition into this state with
      mono_threads_transition_request_suspension.
    • The only legal transitions out from this state are:
      • resume - decrements the suspend_count. If the suspend_count becomes 0 we
        go back to BLOCKING, otherwise we stay in BLOCKING_SUSPEND_REQUESTED.
      • finish_async_suspension (executed by the suspend signal handler to
        finish the two phase preemptive suspend request and notify the suspend initiator),
      • done_blocking and abort_blocking. These mean that a
        suspend initiator requested a suspend but we reached done (or abort)
        blocking before the signal was delivered. We return NotifyAndWait: the
        victim thread goes to STATE_BLOCKING_SELF_SUSPENDED, and the caller is
        supposed to send a notification to the suspend initiator and wait for a
        resume (with suspend count == 0), which will return it to RUNNING.
  • STATE_BLOCKING_ASYNC_SUSPENDED
    • The thread is not executing in this state (it is waiting for a resume).
    • A thread transitions into this state from STATE_BLOCKING_SUSPEND_REQUESTED
      on a finish_async_suspension transition (see above).
    • On a resume transition we decrement the suspend count. If it's still
      positive we stay in this state. If it's 0 the thread transitions out of
      this state. Unlike other resume transitions, in this case the thread goes
      back to BLOCKING and resumes executing in GC Safe mode. The resume caller
      must notify the resume initiator (i.e. this is a ResumeInitAsyncResume).
    • On a request_suspension we stay in this state and just increment the
      suspend count.
    • All other transitions from this state are illegal.
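
As a rough illustration of the two new states, the sketch below models only the finish_async_suspension and resume transitions for a blocking thread. All names prefixed with `Toy` or `toy_` are hypothetical, `ToyResumeOk` is an invented placeholder for "nothing more to do", and the model ignores atomicity.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    STATE_BLOCKING,
    STATE_BLOCKING_SUSPEND_REQUESTED,
    STATE_BLOCKING_ASYNC_SUSPENDED
} ToyState;

typedef enum {
    ToyResumeOk,              /* placeholder: nothing more for the caller to do */
    ToyResumeInitAsyncResume  /* caller must wake the preemptively suspended thread */
} ToyResumeResult;

typedef struct {
    ToyState state;
    int suspend_count;
} ToyThread;

/* Second phase of the preemptive suspend, as run from the signal handler.
 * Returns true if the thread is now suspended. */
static bool
toy_finish_async_suspension (ToyThread *t)
{
    if (t->state != STATE_BLOCKING_SUSPEND_REQUESTED)
        return false;
    assert (t->suspend_count > 0);
    t->state = STATE_BLOCKING_ASYNC_SUSPENDED;
    return true;
}

/* Resume from either of the two new states. */
static ToyResumeResult
toy_resume (ToyThread *t)
{
    assert (t->suspend_count > 0);
    t->suspend_count--;
    if (t->suspend_count > 0)
        return ToyResumeOk; /* still suspended or requested: stay put */

    ToyState previous = t->state;
    t->state = STATE_BLOCKING; /* back to GC Safe mode, count == 0 */
    return previous == STATE_BLOCKING_ASYNC_SUSPENDED
        ? ToyResumeInitAsyncResume
        : ToyResumeOk;
}
```

The point of the sketch is that a thread suspended from BLOCKING goes back to BLOCKING on its last resume, not to RUNNING.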

We add two new return values for abort_blocking and done_blocking:

  • DoneBlockingNotifyAndWait and AbortBlockingNotifyAndWait. Similar to
    SelfSuspendNotifyAndWait - there was a race between the second phase of a
    preemptive suspend and another operation (in that case a poll, in this case a
    done or abort blocking) and the other operation won, so its caller should
    notify the suspend initiator. Since the thread is now suspended, the caller
    should wait for the resume signal.
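
The race described here can be sketched for done_blocking as follows. This is a simplified, non-atomic model: `ToyThread` and `toy_done_blocking` are hypothetical, and `DoneBlockingOk` is used here simply as the name for the case where no suspend is pending.

```c
#include <assert.h>

typedef enum {
    STATE_RUNNING,
    STATE_BLOCKING,
    STATE_BLOCKING_SUSPEND_REQUESTED,
    STATE_BLOCKING_SELF_SUSPENDED
} ToyState;

typedef enum {
    DoneBlockingOk,            /* no suspend pending: continue in GC Unsafe mode */
    DoneBlockingNotifyAndWait  /* notify the initiator, then wait for a resume */
} ToyDoneBlockingResult;

typedef struct {
    ToyState state;
    int suspend_count;
} ToyThread;

static ToyDoneBlockingResult
toy_done_blocking (ToyThread *t)
{
    switch (t->state) {
    case STATE_BLOCKING:
        assert (t->suspend_count == 0);
        t->state = STATE_RUNNING;
        return DoneBlockingOk;
    case STATE_BLOCKING_SUSPEND_REQUESTED:
        /* The suspend signal has not been handled yet and we got here
         * first, so the thread suspends itself instead. */
        assert (t->suspend_count > 0);
        t->state = STATE_BLOCKING_SELF_SUSPENDED;
        return DoneBlockingNotifyAndWait;
    default:
        assert (0 && "illegal done_blocking transition");
        return DoneBlockingOk;
    }
}
```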

We add new results for request_suspension, of type MonoRequestSuspendResult
(formerly MonoRequestAsyncSuspendResult):

  • ReqSuspendAlreadySuspended means the thread is not executing and we just
    incremented its suspend count and there's nothing else for the suspend
    initiator to do. (The old AsyncSuspendAlreadySuspended)
  • ReqSuspendAlreadySuspendedBlocking means the thread is in
    BLOCKING_SUSPEND_REQUESTED and we initiated another suspend request. This
    should only happen with full cooperative suspend - with hybrid suspend we
    will use a two phase preemptive suspend on a blocking thread and since only
    a single suspend initiator is active, this state should not be visible to
    another suspend initiator.
  • ReqSuspendInitSuspendRunning - Thread is executing in GC Unsafe mode and the
    caller should suspend it. This is the old AsyncSuspendInitSuspend. In
    full preemptive suspend this is the only suspend initiation action and the caller
    should begin the preemptive suspend procedure. In full coop we expect the
    victim thread to reach a safepoint and poll to finish suspending.
  • ReqSuspendInitSuspendBlocking - Thread is executing in GC Safe mode and
    the caller should initiate a suspend. In full coop the caller has nothing
    to do - a thread executing in blocking mode is assumed to be suspended. In
    hybrid suspend, the caller has to initiate a preemptive suspend.

Suspend initiator, signal handler, self suspender

Basically, we call mono_threads_transition_request_suspension instead of mono_threads_transition_request_async_suspension.
Then, if it says to suspend a running thread, we call begin_suspend_for_running_thread; if it says to initiate a suspension of a blocking thread, we call begin_suspend_for_blocking_thread; and if it says to do nothing, we do nothing.
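
A minimal sketch of that dispatch, assuming the result names from the list above. `ToyInitiatorAction` and `toy_initiator_action` are hypothetical; the action values only stand in for calls to begin_suspend_for_running_thread and begin_suspend_for_blocking_thread, and treating ReqSuspendAlreadySuspendedBlocking as "do nothing" is an assumption based on the full-coop case.

```c
#include <assert.h>

typedef enum {
    ReqSuspendAlreadySuspended,
    ReqSuspendAlreadySuspendedBlocking,
    ReqSuspendInitSuspendRunning,
    ReqSuspendInitSuspendBlocking
} ToyRequestSuspendResult;

typedef enum {
    ActionNone,
    ActionSuspendRunning,  /* stands in for begin_suspend_for_running_thread */
    ActionSuspendBlocking  /* stands in for begin_suspend_for_blocking_thread */
} ToyInitiatorAction;

/* Map the transition result to what the suspend initiator does next. */
static ToyInitiatorAction
toy_initiator_action (ToyRequestSuspendResult result)
{
    switch (result) {
    case ReqSuspendInitSuspendRunning:
        return ActionSuspendRunning;
    case ReqSuspendInitSuspendBlocking:
        return ActionSuspendBlocking;
    case ReqSuspendAlreadySuspended:
    case ReqSuspendAlreadySuspendedBlocking:
        /* The count was already incremented; nothing left to initiate. */
        return ActionNone;
    }
    assert (0 && "unknown request_suspension result");
    return ActionNone;
}
```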

Questions

  1. We have an assumption in safe_interrupt_thread that we can return from a self interrupt if we're using coop suspend. I'm not sure if that makes sense for hybrid suspend, and it's not clear what to do about it.
  2. In mono_thread_info_safe_suspend_and_run we expect that the callback won't return KeepSuspended if coop is enabled. I'm not sure what this means for hybrid.
  3. A bunch of tests like monitor-abort and appdomain-threadpool-unload are failing if I enable hybrid suspend. Don't know why yet.

@lambdageek (Member Author)

Attn @kumpera @luhenry

@luhenry (Contributor) left a comment

Overall, I don't see why we would need different state transitions for the hybrid suspend and the preemptive/cooperative suspend cases. The thread state machine should not have any knowledge of how the threads are going to be suspended, and it should work the same regardless. Another sign there is no need for different async suspend transitions is that the resume transitions are the same.

Contributor

We should add a --enable-hybrid-suspend, and it will eventually become the default.

Contributor

We should just renumber STATE_BLOCKING in the enum. These values are neither part of the API nor stored anywhere other than in non-persistent memory.

Contributor

We should rename that to STATE_BLOCKING_SELF_SUSPENDED so it's clear that there is a parallel between STATE_RUNNING and STATE_BLOCKING. It should be:

		"RUNNING",
		"DETACHED",
		"ASYNC_SUSPENDED",
		"SELF_SUSPENDED",
		"ASYNC_SUSPEND_REQUESTED",
		"BLOCKING",
		"BLOCKING_ASYNC_SUSPENDED",
		"BLOCKING_SELF_SUSPENDED",
		"BLOCKING_ASYNC_SUSPEND_REQUESTED",

The corresponding enum at https://github.com/mono/mono/pull/8068/files#diff-6fa4afbec40d6b93ebd74cf63b06fe93R127 should also be modified.

Contributor

Why does this need to be a separate code path? The fact that we use cooperative or preemptive suspend doesn't depend on the state machine, as we want to keep the possibility to use both cooperative and preemptive suspend whatever the state of the thread is. The cases we want to support are:

  • preemptive = preemptive for running + preemptive for blocking
  • hybrid = cooperative for running + preemptive for blocking
  • cooperative = cooperative for running + cooperative for blocking

The only necessary thing should be either to pass a parameter to begin_async_suspend for blocking or running, or to replace the calls to begin_async_suspend by begin_safepoint_suspend/begin_preemptive_suspend.
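
The three cases above amount to a small lookup from suspend policy and victim mode to a mechanism. A sketch, with all type and function names hypothetical:

```c
#include <assert.h>

typedef enum {
    SUSPEND_PREEMPTIVE,
    SUSPEND_HYBRID,
    SUSPEND_COOPERATIVE
} ToySuspendPolicy;

typedef enum { VICTIM_RUNNING, VICTIM_BLOCKING } ToyVictimMode;

typedef enum { MECH_COOPERATIVE, MECH_PREEMPTIVE } ToyMechanism;

/* Pick the suspend mechanism from the policy and the victim's mode. */
static ToyMechanism
toy_select_mechanism (ToySuspendPolicy policy, ToyVictimMode mode)
{
    switch (policy) {
    case SUSPEND_PREEMPTIVE:
        return MECH_PREEMPTIVE;
    case SUSPEND_COOPERATIVE:
        return MECH_COOPERATIVE;
    case SUSPEND_HYBRID:
        /* cooperative for running + preemptive for blocking */
        return mode == VICTIM_RUNNING ? MECH_COOPERATIVE : MECH_PREEMPTIVE;
    }
    assert (0 && "unknown suspend policy");
    return MECH_PREEMPTIVE;
}
```

Only the hybrid policy looks at the victim's mode; the other two ignore it.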

@lambdageek (Member Author), Apr 6, 2018

Let me see if I'm getting this: have a single "begin suspending" transition that subsumes both request_async_suspension and request_hybrid_suspension; give it enough return codes so that we can switch on the results and in all cases do the correct begin_safepoint_suspend/begin_preemptive_suspend?

That makes sense but there are a couple problems:

  1. (minor) right now ASYNC_SUSPENDED has a request_async_suspension self-loop, but no legal request_hybrid_suspension transition (if we decided to suspend that thread via safepointing, it couldn't have been preemptively suspended, and we shouldn't try to preemptively suspend it for suspend_count > 1). This one is probably harmless to add. The symmetric situation holds for BLOCKING_ASYNC_SUSPENDED: it has a request_hybrid_suspension self-loop but no legal request_async_suspension transition, because if we decided to suspend it preemptively we shouldn't try to suspend it again via checkpointing. I think it's better when more things are illegal.
  2. (serious) in BLOCKING state, request_async_suspension loops back to BLOCKING, but request_hybrid_suspension goes to BLOCKING_SUSPEND_REQUESTED - this decision is driven purely by external considerations, not by the state machine state. So we need two distinct transitions for this case.

In general, the suspend transitions represent policy - different suspend techniques starting from the same states want different decisions. On the resume side it's different - the resume policy is encoded in the state that the thread entered when we suspended it.

Alternately, maybe we could split the blocking state into, let's say, BLOCKING_NONPREEMPTABLE and BLOCKING_PREEMPTABLE. BLOCKING_NONPREEMPTABLE would be kind of like the original meaning of BLOCKING: the thread is in a syscall or other code where we know for a fact it can never touch anything managed, so it will suspend at safepoints on the done blocking transition (and maybe we even disallow abort blocking for it). BLOCKING_PREEMPTABLE would be for unknown foreign code, which might turn out to be embedders who can do abort blocking or hold on to managed pointers, and which is preempted. In that case we push all the complexity into runtime code (we would now need, let's say, BEGIN_SYSCALL/END_SYSCALL and BEGIN_FOREIGN/END_FOREIGN). On the bright side, we won't have to restart syscalls.

Member Author

In the morning things look better. The solution is to do a two-phase suspend in blocking always, the same way that we do it in the running case (ie: request_suspension should move from Blocking to Blocking_Suspend_Requested, always, which communicates enough information to the suspend initiator to then decide on a suspend policy). The previous request_async_suspension transition from Blocking to Blocking was just an artifact of not having an explicit Blocking_Suspend_Requested state.

And there's a benefit for the state machine, too: we can establish the invariant that in Blocking the suspend count is always equal to 0 and in Blocking_Suspend_Requested it's always strictly positive.

I'm going to update the PR to make it work this way.

Thanks @luhenry !

Contributor

IMO begin_cooperative_suspend better conveys that it is the "other way" from begin_preemptive_suspend.

Contributor

The commentary should be the same as for STATE_ASYNC_SUSPENDED and STATE_SELF_SUSPENDED.

Contributor

The commentary should be the same as for STATE_ASYNC_SUSPENDED and STATE_SELF_SUSPENDED.

Contributor

We should probably just have it in a different case statement, to make the code easier to understand and keep it consistent with the rest.

@lambdageek lambdageek force-pushed the coop-hybrid-suspend branch 7 times, most recently from 50888b8 to fd0c32d Compare April 9, 2018 20:19
@marek-safar marek-safar mentioned this pull request Apr 9, 2018
23 tasks
@lambdageek lambdageek requested a review from marek-safar as a code owner April 9, 2018 21:31
@lambdageek (Member Author)

@luhenry updated:

  1. Use a single unified mono_threads_transition_request_initiate_suspension
  2. Add --with-hybrid-suspend configure option. Defaults to 'no'. Enabling it requires '--with-cooperative-gc'.

@lambdageek (Member Author)

The monitor-abort and appdomain-threadpool-unload tests are still failing if hybrid coop is enabled. They work with full preemptive suspend and full cooperative suspend.

@lambdageek lambdageek force-pushed the coop-hybrid-suspend branch 2 times, most recently from 9c15fc3 to 332114a Compare April 9, 2018 21:52
Contributor

We want it for both USE_COOP_GC and USE_HYBRID_COOP.

Contributor

Also, Hybrid Coop doesn't make much sense; Hybrid Suspend would be more meaningful.

Contributor

We should probably just duplicate the code for AbortBlockingWait and AbortBlockingNotifyAndWait to keep it consistent across the board.

Contributor

We should merge this case with the previous one.

Contributor

We should duplicate the code for ReqSuspendAlreadySuspendedBlocking and ReqSuspendInitSuspendBlocking

Contributor

mono_threads_transition_request_suspension should be enough and provides a good counter-part to mono_threads_transition_request_resume.

@luhenry (Contributor) commented Apr 10, 2018

The individual comments are nitpicks and do not block the PR in general; only the requested change at #8068 (comment) does.

@lambdageek lambdageek force-pushed the coop-hybrid-suspend branch 2 times, most recently from 56ec20c to bfc4545 Compare April 10, 2018 17:21
When the environment variable is set, Mono will use cooperative safepoint
suspend for threads running in GC Unsafe mode (running managed code or native
runtime code), and preemptive suspend for threads running in GC Safe
mode (running native embedder or P/Invoke code, or blocking system calls).

The environment variable is visible through
mono_threads_is_hybrid_suspension_enabled ().

This commit just adds the env var; it doesn't turn on hybrid coop suspend.

If a thread in blocking mode needs to be suspended, suspend it using preemptive suspend.

The `--with-hybrid-suspend` configure option defaults to 'no'.

Using `--with-hybrid-suspend=yes` requires `--with-cooperative-gc=yes`.
@lambdageek lambdageek force-pushed the coop-hybrid-suspend branch from bfc4545 to f214256 Compare April 10, 2018 17:31
Change begin_cooperative_suspend and begin_preemptive_suspend to return one of
three results: suspend fail, suspend succeeded using cooperative suspend,
suspend succeeded using preemptive suspend.

This is used by check_async_suspend to decide whether to check the
MonoThreadInfo for the result of a suspend.  (If a thread is suspended
cooperatively, it doesn't make sense to check).
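
A sketch of the three-way result and how a caller such as check_async_suspend might use it. The enum and function names below are hypothetical and are not taken from the commit; returning false for a failed suspend is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three outcomes described above. */
typedef enum {
    ToyBeginSuspendFail,
    ToyBeginSuspendOkCooperative,
    ToyBeginSuspendOkPreemptive
} ToyBeginSuspendResult;

/* Only a preemptive suspend leaves a result in the thread info worth
 * checking; a cooperatively suspended thread finishes at a safepoint. */
static bool
toy_should_check_thread_info (ToyBeginSuspendResult result)
{
    switch (result) {
    case ToyBeginSuspendOkPreemptive:
        return true;
    case ToyBeginSuspendOkCooperative:
    case ToyBeginSuspendFail:
        return false;
    }
    assert (0 && "unknown begin-suspend result");
    return false;
}
```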

Fixes sporadic failures in mono/tests/monitor-abort.exe

Move it from mono-threads-state-machine.c to mono-threads.c and make it
check how blocking threads are suspended before returning the saved state.
@lambdageek (Member Author)

The "[threads] Return suspend sort from begin_suspend_for_running_thread" commit fixed monitor-abort.exe and appdomain-threadpool-unload.exe under hybrid suspend.

Now the only one that's broken is thread-suspend-suspended.exe.


The mono_thread_info_get_suspend_state change fixed the massive CI breakage of coop: STATE_BLOCKING_SUSPEND_REQUESTED is only a suspend state for full coop.

@lambdageek (Member Author) commented Apr 10, 2018

@luhenry I pushed a commit that wraps all the mono_os_event_* calls in threads.c with MONO_ENTER_GC_SAFE/MONO_EXIT_GC_SAFE - it makes appdomain-threadpool-unload.exe work under hybrid suspend. I don't really have a good sense whether this was the right fix.
(MonoOSEvent on non-Windows is backed by a mutex, so it seems right to do transitions - but it's in the suspend machinery itself, so...)


I still see the occasional appdomain-threadpool-unload.exe failure, but I also see that we disabled that test on CI PR builds - so it seems like it's generally flaky. Hybrid suspend seems to pass all the runtime tests now.

@luhenry (Contributor) commented Apr 11, 2018

@lambdageek the longer term fix would be to have mono_coop_event_* which would wrap the mono_os_event_* functions with MONO_{ENTER,EXIT}_GC_SAFE and use these instead, but that can be done in a follow-up PR.
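
A rough sketch of what such a wrapper could look like. Everything here is a stand-in so that the example compiles on its own: `ToyOSEvent`, `toy_os_event_set` and the `TOY_*_GC_SAFE` macros only imitate the shape of MonoOSEvent, mono_os_event_set and MONO_ENTER_GC_SAFE/MONO_EXIT_GC_SAFE, and `toy_coop_event_set` is a hypothetical name for the proposed mono_coop_event_* family.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the OS event type. */
typedef struct {
    bool signalled;
} ToyOSEvent;

/* Stand-in for the thread's GC Safe nesting depth. */
static int toy_gc_safe_depth = 0;

/* Stand-ins for the enter/exit macros: they open and close one block,
 * so they must always be used as a pair in the same function. */
#define TOY_ENTER_GC_SAFE do { toy_gc_safe_depth++
#define TOY_EXIT_GC_SAFE  toy_gc_safe_depth--; } while (0)

/* Stand-in for the low-level, possibly blocking primitive. */
static void
toy_os_event_set (ToyOSEvent *event)
{
    /* In this sketch the primitive is only reached in GC Safe mode. */
    assert (toy_gc_safe_depth > 0);
    event->signalled = true;
}

/* The wrapper: do the GC Safe transition around the OS-level call. */
static void
toy_coop_event_set (ToyOSEvent *event)
{
    TOY_ENTER_GC_SAFE;
    toy_os_event_set (event);
    TOY_EXIT_GC_SAFE;
}
```

Putting the transition in one wrapper means call sites cannot forget it, which is the benefit suggested for a follow-up PR.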

@luhenry (Contributor) commented Apr 11, 2018

And maybe the test is flaky just because we have a lower chance of hitting what you observe with hybrid suspend.

@luhenry (Contributor) commented Apr 11, 2018

@monojenkins build Linux ARMv5

@lambdageek (Member Author)

@monojenkins merge

@monojenkins monojenkins merged commit 529e486 into mono:master Apr 11, 2018
@lambdageek lambdageek deleted the coop-hybrid-suspend branch April 18, 2018 21:03
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
[coop] Hybrid suspend

## Summary

This is a new suspend mode where threads that are running managed or native runtime code (what we call "GC Unsafe") are suspended cooperatively using safepoints, but threads running either embedder code or P/Invoke code or blocking syscalls (what we call "GC Safe") are suspended preemptively using signals.

The motivation is to make an embedder-friendly coop suspend mechanism where the runtime will do GC Safe -> GC Unsafe transitions on calls to `MONO_API` functions, but the embedders themselves don't have to have any special knowledge of coop safepoints or state transitions.

To try this out, set **both** `MONO_ENABLE_COOP` and `MONO_ENABLE_HYBRID_COOP` environment variables.

## Implementation

Here's a picture of the new state machine.  There are two new states, `Blocking_Suspend_Requested` and `Blocking_Async_Suspended`, and one updated transition, `req_s` (request suspension).

![coop-state-machine](https://user-images.githubusercontent.com/480437/38572948-9fe879e2-3cc2-11e8-8f27-9493f05f03bf.png)

### State machine states and transitions

The hybrid cooperative suspend mechanism works like this:
 - when a thread is executing in GC Unsafe mode, we use cooperative suspend and expect
   the thread to periodically checkpoint its execution (as in the ordinary full
   coop mode).
 - when a thread is executing in GC Safe mode, we use a preemptive signal-based
   suspend.  This is new - previously in full coop we would allow `BLOCKING`
   threads to continue executing and only suspend them when they wanted to go
   from GC Safe to GC Unsafe (via the `BLOCKING_SELF_SUSPENDED` state).

There are two new states and one updated transition. The updated transition
services both running and blocking threads. The idea is that the suspend mechanism
is determined by the suspend initiator and is not embodied in the state
machine.  The state machine just has to return distinctive enough values from
the transitions for the initiator to select a policy.

The resume mechanism is determined by the state that a thread finds itself in when
it is resumed, and by the original suspend policy.

The primary differences between the suspend mechanisms are: preemptive
suspend is two-phase, and its second phase, the `finish_async_suspension` transition,
does not apply to cooperative suspend; the cooperative policy instead uses the `poll`
transition when the thread is running.

We renamed `mono_threads_transition_request_async_suspension` to `mono_threads_transition_request_suspension`.

 - It must be called by a suspend initiator on a victim thread that is not
   itself.

 - There can only be one suspend initiator for a victim at a time - if the
   suspend initiator initiates a suspension it must follow through with the
   whole protocol.

 - If a victim is `RUNNING` we transition it to `ASYNC_SUSPEND_REQUESTED` and return
   `ReqSuspendInitSuspendRunning` which signals that the caller must initiate
   suspension for a running thread.  (Same as the old `AsyncSuspendInitSuspend`).

 - If a victim is `BLOCKING` we transition it to `BLOCKING_SUSPEND_REQUESTED`.
   Note that this is different from the old `request_async_suspension` which just
   incremented the suspend count and returned `AsyncSuspendBlocking`.  Now we
   return `ReqSuspendInitSuspendBlocking`.  The return of
   `ReqSuspendInitSuspendBlocking` means the initiator may signal the victim and
   must wait to be notified of the suspension (in preemptive suspend it will be
   notified by the signal handler, in cooperative when the blocking thread
   attempts to exit from blocking mode).

 - In `BLOCKING_SUSPEND_REQUESTED` we
   may increment the suspend count.  This is slightly too lax if we're going to
   be using preemptive suspend on blocking threads: in that case we're in the
   middle of a two-phase suspend and since there is only one suspend initiator,
   we wouldn't expect another suspend to be initiated until we have a chance to
   finish suspending.  We leave it to the suspend initiator to rule this case out:
   the transition reports it by returning `ReqSuspendAlreadySuspendedBlocking` from this state.

 - In all the non-executing states (`SELF_SUSPENDED`, `ASYNC_SUSPENDED`,
   `BLOCKING_SELF_SUSPENDED`, `BLOCKING_ASYNC_SUSPENDED`) we just
   increment the suspend count.

 - All other transitions are illegal.

 - Note that `BLOCKING` must always have a suspend count of 0 and
   `BLOCKING_SUSPEND_REQUESTED` must have `suspend_count` > 0.
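
To make the rules above concrete, here is a minimal sketch of the `request_suspension` transition written as a plain C switch. It is not the runtime's code: the `Sketch*` type names and `sketch_request_suspension` are made up for illustration, the state and suspend count live in an ordinary struct with no atomics, and setting the count to 1 on the `RUNNING` transition is an assumption (the text only pins down the count for the blocking states).

```c
#include <assert.h>

/* Simplified model of a thread's suspend state (no atomics, no locking). */
typedef enum {
    STATE_RUNNING,
    STATE_ASYNC_SUSPENDED,
    STATE_SELF_SUSPENDED,
    STATE_ASYNC_SUSPEND_REQUESTED,
    STATE_BLOCKING,
    STATE_BLOCKING_SUSPEND_REQUESTED,
    STATE_BLOCKING_SELF_SUSPENDED,
    STATE_BLOCKING_ASYNC_SUSPENDED
} SketchState;

typedef enum {
    ReqSuspendAlreadySuspended,
    ReqSuspendAlreadySuspendedBlocking,
    ReqSuspendInitSuspendRunning,
    ReqSuspendInitSuspendBlocking
} SketchReqSuspendResult;

typedef struct {
    SketchState state;
    int suspend_count;
} SketchThread;

/* Called by the suspend initiator on a victim thread other than itself. */
static SketchReqSuspendResult
sketch_request_suspension (SketchThread *victim)
{
    switch (victim->state) {
    case STATE_RUNNING:
        victim->state = STATE_ASYNC_SUSPEND_REQUESTED;
        victim->suspend_count = 1; /* assumption: first request sets the count to 1 */
        return ReqSuspendInitSuspendRunning;
    case STATE_BLOCKING:
        assert (victim->suspend_count == 0); /* BLOCKING always has count 0 */
        victim->state = STATE_BLOCKING_SUSPEND_REQUESTED;
        victim->suspend_count = 1;
        return ReqSuspendInitSuspendBlocking;
    case STATE_BLOCKING_SUSPEND_REQUESTED:
        assert (victim->suspend_count > 0);
        victim->suspend_count++;
        return ReqSuspendAlreadySuspendedBlocking;
    case STATE_SELF_SUSPENDED:
    case STATE_ASYNC_SUSPENDED:
    case STATE_BLOCKING_SELF_SUSPENDED:
    case STATE_BLOCKING_ASYNC_SUSPENDED:
        /* Thread is not executing: just bump the count. */
        victim->suspend_count++;
        return ReqSuspendAlreadySuspended;
    default:
        assert (!"illegal transition");
        return ReqSuspendAlreadySuspended;
    }
}
```

Requesting suspension of a `BLOCKING` victim twice in this model yields `ReqSuspendInitSuspendBlocking` followed by `ReqSuspendAlreadySuspendedBlocking`, with the count going from 0 to 1 to 2.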

We add two new states:

 - `STATE_BLOCKING_SUSPEND_REQUESTED` - this is a new state that indicates that a
   suspend initiator wants to suspend this victim thread.
   - The thread is still executing in this state.
   - We only transition into this state with
     `mono_threads_transition_request_suspension`.
   - The only legal transitions out from this state are:
     - `resume` - decrements the `suspend_count`. If the `suspend_count` becomes 0 we
       go back to `BLOCKING`, otherwise we stay in `BLOCKING_SUSPEND_REQUESTED`.
     - `finish_async_suspension` (executed by the suspend signal handler to
       finish the two phase preemptive suspend request and notify the suspend initiator),
     - `done_blocking` and `abort_blocking`.  This means that a
       suspend initiator requested a suspend but we got to done (or abort)
       blocking before the signal was delivered. We return `NotifyAndWait` and the
       caller is supposed to send a notification to the suspend initiator and
       wait for a resume (with suspend count == 0) which will return to
       `RUNNING` (in the meantime the victim thread goes to `STATE_BLOCKING_SELF_SUSPENDED`).
 - `STATE_BLOCKING_ASYNC_SUSPENDED`
   - The thread is not executing in this state (it is waiting for a `resume`).
   - A thread transitions into this state from `STATE_BLOCKING_SUSPEND_REQUESTED`
     on a `finish_async_suspension` transition (see above).
   - On a `resume` transition we decrement the suspend count. If it's still
     positive we stay in this state. If it's 0 the thread transitions out of
     this state.  Unlike other resume transitions, in this case the thread goes
     back to `BLOCKING` and resumes executing in GC Safe mode.  The resume caller
     must notify the resume initiator (i.e. this is a `ResumeInitAsyncResume`).
   - On a `request_suspension` we stay in this state and just increment the
     suspend count.
   - All other transitions from this state are illegal.
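
As a rough illustration of how the two new states interact with `finish_async_suspension` and `resume`, here is a small sketch. The `Sketch*` names and both functions are hypothetical, only the three blocking states involved are modelled, and returning `ResumeOk` when no wake-up is needed is an assumption; the text above only names `ResumeInitAsyncResume`.

```c
#include <assert.h>

typedef enum {
    STATE_BLOCKING,
    STATE_BLOCKING_SUSPEND_REQUESTED,
    STATE_BLOCKING_ASYNC_SUSPENDED
} SketchState;

typedef enum {
    ResumeOk,              /* assumption: nothing further for the caller to do */
    ResumeInitAsyncResume  /* caller must wake the preemptively suspended thread */
} SketchResumeResult;

typedef struct {
    SketchState state;
    int suspend_count;
} SketchThread;

/* Second phase of the preemptive suspend, run from the suspend signal handler. */
static void
sketch_finish_async_suspension (SketchThread *self)
{
    assert (self->state == STATE_BLOCKING_SUSPEND_REQUESTED);
    self->state = STATE_BLOCKING_ASYNC_SUSPENDED;
}

static SketchResumeResult
sketch_resume (SketchThread *t)
{
    assert (t->suspend_count > 0);
    switch (t->state) {
    case STATE_BLOCKING_SUSPEND_REQUESTED:
        /* Thread never stopped executing; drop the request. */
        if (--t->suspend_count == 0)
            t->state = STATE_BLOCKING;
        return ResumeOk;
    case STATE_BLOCKING_ASYNC_SUSPENDED:
        if (--t->suspend_count > 0)
            return ResumeOk; /* still suspended by another request */
        /* Back to BLOCKING: the thread continues in GC Safe mode. */
        t->state = STATE_BLOCKING;
        return ResumeInitAsyncResume;
    default:
        assert (!"state not modelled in this sketch");
        return ResumeOk;
    }
}
```

The point the sketch tries to show is that a resume from `STATE_BLOCKING_ASYNC_SUSPENDED` lands back in `BLOCKING` rather than `RUNNING`, and only the last resume (count reaching 0) asks the caller to do any work.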

We add two new return values for `abort_blocking` and `done_blocking`:
  - `DoneBlockingNotifyAndWait` and `AbortBlockingNotifyAndWait`. Similar to
    `SelfSuspendNotifyAndWait` - there was a race between the second phase of a
    preemptive suspend and another operation (in that case a poll, in this case a
    done or abort blocking) and the other operation won, so its caller should
    notify the suspend initiator. Since the thread is now suspended, the caller
    should wait for the resume signal.
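
The race described above can be sketched as a `done_blocking` transition. This is only an illustration: `sketch_done_blocking` and the `Sketch*` names are hypothetical, `SketchDoneBlockingOk` stands in for whatever the no-suspend-pending result is called, and the `BLOCKING` to `RUNNING` case is the pre-existing behaviour as I understand it rather than something stated in this description.

```c
#include <assert.h>

typedef enum {
    STATE_RUNNING,
    STATE_BLOCKING,
    STATE_BLOCKING_SUSPEND_REQUESTED,
    STATE_BLOCKING_SELF_SUSPENDED
} SketchState;

typedef enum {
    SketchDoneBlockingOk,      /* hypothetical name: no suspend was pending */
    DoneBlockingNotifyAndWait  /* notify the initiator, then wait for a resume */
} SketchDoneBlockingResult;

typedef struct {
    SketchState state;
    int suspend_count;
} SketchThread;

/* Called by a thread on itself when it wants to leave GC Safe mode. */
static SketchDoneBlockingResult
sketch_done_blocking (SketchThread *self)
{
    switch (self->state) {
    case STATE_BLOCKING:
        /* Nobody asked us to suspend: go straight to GC Unsafe mode. */
        assert (self->suspend_count == 0);
        self->state = STATE_RUNNING;
        return SketchDoneBlockingOk;
    case STATE_BLOCKING_SUSPEND_REQUESTED:
        /* We beat the suspend signal: suspend ourselves instead. */
        assert (self->suspend_count > 0);
        self->state = STATE_BLOCKING_SELF_SUSPENDED;
        return DoneBlockingNotifyAndWait;
    default:
        assert (!"state not modelled in this sketch");
        return SketchDoneBlockingOk;
    }
}
```

On `DoneBlockingNotifyAndWait` the caller would notify the suspend initiator and then block until resumed; the suspend count is left untouched because the pending request is still outstanding.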

We add new results for `request_initiate_suspension`, returned as a `MonoRequestSuspendResult`
(formerly `MonoRequestAsyncSuspendResult`):
  - `ReqSuspendAlreadySuspended` means the thread is not executing and we just
    incremented its suspend count and there's nothing else for the suspend
    initiator to do. (The old `AsyncSuspendAlreadySuspended`)
  - `ReqSuspendAlreadySuspendedBlocking` means the thread is in
    `BLOCKING_SUSPEND_REQUESTED` and we initiated another suspend request.  This
    should only happen with full cooperative suspend - with hybrid suspend we
    will use a two phase preemptive suspend on a blocking thread and since only
    a single suspend initiator is active, this state should not be visible to
    another suspend initiator.
  - `ReqSuspendInitSuspendRunning` - Thread is executing in GC Unsafe mode and the
    caller should suspend it.  This is the old `AsyncSuspendInitSuspend`.  In
    full preemptive suspend this is the only suspend initiation action and the caller
    should begin the preemptive suspend procedure.  In full coop we expect the
    victim thread to reach a safepoint and poll to finish suspending.
  - `ReqSuspendInitSuspendBlocking` - Thread is executing in GC Safe mode and
    the caller should initiate a suspend.  In full coop the caller has nothing
    to do - a thread executing in blocking mode is assumed to be suspended.  In
    hybrid suspend, the caller has to initiate a preemptive suspend.

### Suspend initiator, signal handler, self suspender

Basically we call `mono_threads_transition_request_suspension` instead of `mono_threads_transition_request_async_suspension`.
Then, if it says to suspend a running thread, we call `begin_suspend_for_running_thread`; if it says to initiate a suspension of a blocking thread, we call `begin_suspend_for_blocking_thread`; and if it says to do nothing, we do nothing.
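
Since the suspend mechanism is chosen by the initiator rather than by the state machine, the decision can be pictured as a small table over (suspend mode, transition result). The sketch below is an assumption-laden illustration, not the actual initiator code: `SketchSuspendMode`, `SketchInitiatorAction` and `sketch_pick_action` are hypothetical names, and the real logic presumably lives inside `begin_suspend_for_running_thread` and `begin_suspend_for_blocking_thread`.

```c
#include <assert.h>

typedef enum {
    SKETCH_MODE_PREEMPTIVE,
    SKETCH_MODE_COOP,
    SKETCH_MODE_HYBRID
} SketchSuspendMode;

typedef enum {
    ReqSuspendAlreadySuspended,
    ReqSuspendAlreadySuspendedBlocking,
    ReqSuspendInitSuspendRunning,
    ReqSuspendInitSuspendBlocking
} SketchReqSuspendResult;

typedef enum {
    SKETCH_DO_NOTHING,      /* victim already counts as suspended */
    SKETCH_WAIT_FOR_POLL,   /* cooperative: victim suspends at its next safepoint */
    SKETCH_SIGNAL_AND_WAIT  /* preemptive: signal the victim, wait for the handler */
} SketchInitiatorAction;

static SketchInitiatorAction
sketch_pick_action (SketchSuspendMode mode, SketchReqSuspendResult result)
{
    switch (result) {
    case ReqSuspendAlreadySuspended:
        return SKETCH_DO_NOTHING;
    case ReqSuspendAlreadySuspendedBlocking:
        /* Expected only under full coop, where a blocking thread counts as suspended. */
        assert (mode == SKETCH_MODE_COOP);
        return SKETCH_DO_NOTHING;
    case ReqSuspendInitSuspendRunning:
        /* GC Unsafe thread: preemptive only in full preemptive mode. */
        return mode == SKETCH_MODE_PREEMPTIVE ? SKETCH_SIGNAL_AND_WAIT
                                              : SKETCH_WAIT_FOR_POLL;
    case ReqSuspendInitSuspendBlocking:
        /* Not described for full preemptive mode, so treated as unreachable here. */
        assert (mode != SKETCH_MODE_PREEMPTIVE);
        return mode == SKETCH_MODE_HYBRID ? SKETCH_SIGNAL_AND_WAIT
                                          : SKETCH_DO_NOTHING;
    }
    assert (!"unknown result");
    return SKETCH_DO_NOTHING;
}
```

The only row that differs between full coop and hybrid is `ReqSuspendInitSuspendBlocking`, which is the whole point of the new mode: blocking threads get signalled instead of being assumed suspended.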

**Questions**
1. We have an assumption in `safe_interrupt_thread` that we can return from a self interrupt if we're using coop suspend.  I'm not sure if that makes sense for hybrid suspend, and it's not clear what to do about it.
2. In `mono_thread_info_safe_suspend_and_run` we expect that the callback won't return `KeepSuspended` if coop is enabled.  I'm not sure what this means for hybrid.
3. A bunch of tests like `monitor-abort` and `appdomain-threadpool-unload` are failing if I enable hybrid suspend.  Don't know why yet.


Commit migrated from mono/mono@529e486