Support spinning / spinlocks / busy-loops #1792

@sporksmith

This is a tracking issue for supporting "spin" waiting with sched_yield. Today it doesn't work at all: executing a sched_yield is effectively a no-op inside of a Shadow simulation. If it's used in a loop that's waiting for another thread on the same host to do something (like release a spin lock), that other thread never gets to run, so the loop never exits.

Typically such spin loops are waiting for a condition that will be satisfied by another thread running on the same host, e.g. releasing a spin lock. It's also possible to end up in a spin loop that's doing non-blocking IO calls, but that's usually a bug (such as smol-rs/async-io#73).

Spinlock fallback types

Spinlocks often have some type of fallback to a less aggressive waiting mechanism if the condition they're waiting for isn't met for a while.

  • No fallback: A spinlock that never falls back to something less aggressive currently deadlocks under Shadow, since other threads never get a chance to run.

  • Wallclock fallback: A spinlock that spins for some maximum wallclock duration also currently deadlocks, since sched_yield neither moves time forward nor gives other threads a chance to run.

  • Iteration fallback: A spinlock that spins some maximum number of iterations will eventually work. Arguably a "good" spinlock implementation will have such a limit.
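
To make the three cases concrete, here is a toy model in Python. It is only a sketch of the behavior described above, not Shadow code: `spin_until`, `frozen_clock`, `noop_yield`, and the `safety_cap` guard are all hypothetical names made up for illustration. The frozen clock and no-op yield stand in for how sched_yield currently behaves inside a simulation, and the condition never becomes true because the other thread never runs.

```python
def spin_until(cond, yield_fn, clock, max_iters=None, max_seconds=None,
               safety_cap=100_000):
    """Spin until cond() is true or a fallback limit is hit.

    Returns "acquired", "fallback", or "stuck". "stuck" stands in for a
    deadlock: the safety cap only exists so this demo terminates.
    """
    start = clock()
    for i in range(safety_cap):
        if cond():
            return "acquired"
        if max_iters is not None and i >= max_iters:
            return "fallback"  # iteration fallback
        if max_seconds is not None and clock() - start >= max_seconds:
            return "fallback"  # wallclock fallback
        yield_fn()
    return "stuck"


# Model of the current behavior: time does not advance, yielding does
# nothing, and the thread that would satisfy the condition never runs.
def frozen_clock():
    return 0.0


def noop_yield():
    return None


def never_ready():
    return False


results = {
    "no fallback": spin_until(never_ready, noop_yield, frozen_clock),
    "wallclock fallback": spin_until(never_ready, noop_yield, frozen_clock,
                                     max_seconds=0.27),
    "iteration fallback": spin_until(never_ready, noop_yield, frozen_clock,
                                     max_iters=1000),
}
print(results)
```

In this model only the iteration-limited loop reaches its fallback; the other two run until the demo's safety cap, which is the stand-in for spinning forever.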

Potential solutions

  • Reschedule thread: Implement sched_yield to allow other ready threads to run without moving time forward. This would minimize the change to "simulation-visible" behavior, and could allow Shadow to handle spin loops where another thread is ready to satisfy the condition (e.g. do some processing and release a spinlock). This is a little tricky though: it'd require adding some way of scheduling a "run this thread again" event at the same sim-time, but after other already-scheduled events.

  • Nanosleep: Implement sched_yield as effectively a nanosleep of n ns. With a very small value (e.g. 1 ns) this would solve at least as many spin-loops as "reschedule thread", but would be much easier to implement. It'd slightly change simulation results, but since even the cheapest syscalls take ~20 ns, n up to that amount would only make the simulation more accurate (well, until that minimum syscall latency gets smaller on future systems). This could also fix some spin loops that are stuck waiting for a wallclock fallback, but with a very small n could take many iterations to exit the loop.

    Larger values of n would reduce the number of such iterations, but at some point could cause the thread in the simulation to be unrealistically slow. e.g. waiting for a full 1 s would let Shadow exit most such spin loops in a single iteration, but inside the simulation would make the thread seem "slower" than it would be on a real system.
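
The difference between the two options can be sketched with a toy event queue. This is not how Shadow's scheduler is implemented; `EventQueue` and `simulate` are hypothetical, and the sketch assumes events are ordered by (time, insertion sequence), so that a "run this thread again" event pushed at the same sim-time lands after events that were already scheduled. A yield delay of 0 models "reschedule thread"; a delay of n ns models the nanosleep option.

```python
import heapq
import itertools


class EventQueue:
    """Min-heap of events ordered by (time_ns, insertion order)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def push(self, time_ns, label):
        # The sequence number breaks ties, so a later push at the same
        # time runs after events that were already scheduled.
        heapq.heappush(self._heap, (time_ns, next(self._seq), label))

    def pop(self):
        time_ns, _, label = heapq.heappop(self._heap)
        return time_ns, label

    def __bool__(self):
        return bool(self._heap)


def simulate(yield_delay_ns):
    """A spinner and a lock holder are both runnable at t=100 ns."""
    queue = EventQueue()
    queue.push(100, "spinner")  # runs first and finds the lock held
    queue.push(100, "holder")   # would release the lock if it got to run
    lock_held = True
    trace = []
    while queue:
        now, who = queue.pop()
        trace.append((now, who))
        if who == "holder":
            lock_held = False
        elif lock_held:
            # sched_yield: requeue the spinner instead of spinning in place.
            queue.push(now + yield_delay_ns, "spinner")
    return trace


print(simulate(0))  # "reschedule thread": no sim-time passes
print(simulate(1))  # "nanosleep" with n = 1 ns
```

In both runs the holder gets to run before the spinner retries. With a delay of 0 the spinner acquires the lock at the same sim-time (100 ns); with a delay of 1 it acquires it at 101 ns, which is the small change in simulation results mentioned above.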

Real world examples

libopenblas compiled with thread support makes extensive use of spin loops: #1788. They fall back out of the loop based on "wallclock time", which on x86 ends up being ~270M CPU cycles (270 ms on a 1 GHz CPU). As a result, implementing a "minimal" wait of ~20 ns would cause such loops to be simulated extremely slowly, to the point of being mistaken for deadlock anyway. Implementing a sleep on the order of 1 ms would make these not horribly slow, but we'd probably want to carefully evaluate whether this skews simulation results.
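
A rough back-of-the-envelope calculation shows why the sleep length matters here. The sketch below assumes the ~270 ms timeout from above (270M cycles on a 1 GHz CPU) and that simulated time only advances by the sched_yield sleep on each iteration; `iterations_to_expire` is a hypothetical helper, not part of Shadow or OpenBLAS.

```python
def iterations_to_expire(timeout_ns: int, yield_sleep_ns: int) -> int:
    """Number of sched_yield calls before a wallclock timeout expires,
    assuming each call advances simulated time by yield_sleep_ns."""
    if yield_sleep_ns <= 0:
        raise ValueError("with no time advance the timeout never expires")
    return -(-timeout_ns // yield_sleep_ns)  # ceiling division


TIMEOUT_NS = 270_000_000  # ~270M cycles at an assumed 1 GHz

for sleep_ns in (1, 20, 1_000_000):
    count = iterations_to_expire(TIMEOUT_NS, sleep_ns)
    print(f"sleep {sleep_ns:>9} ns -> {count:>11,} iterations")
```

Under these assumptions a 1 ns sleep needs 270,000,000 iterations per loop, a 20 ns sleep needs 13,500,000, and a 1 ms sleep needs 270. Each iteration is a syscall that Shadow has to emulate, which is where the overhead comes from.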

Conclusion

In the only current real-world example (libopenblas), we would need sched_yield to sleep for a fairly long time to avoid substantial overhead in Shadow emulating many iterations of the loop. Since there are other workarounds (don't use libopenblas with thread support compiled in, or set OPENBLAS_NUM_THREADS=1), it doesn't seem worth potentially skewing other simulation results.

For now, let's keep this issue open and collect other real-world examples as they come up.

Labels: Type: Bug (error or flaw producing unexpected results)