
Conversation

p12tic
Contributor

@p12tic p12tic commented Oct 28, 2022

Currently if there is a long-running task that does not itself produce additional tasks there will be one thread doing busy waiting because it won't be able to park. This is because if the thread is the last one without tasks (_num_thieves == 1), then parking can only happen if _num_actives == 0.

This check is to prevent missed wakeups that are initiated in _exploit_task(). The same can be achieved by enhancing the condition to specifically check for transition of _num_actives from 0 to nonzero. This allows the thread to park and avoids busy waiting.

The change removes the guarantee that when _num_actives changes from 0 to 1 while the last thief thread is in the process of parking, the parking will be cancelled. It's possible that between _num_actives.load() and if(prev_num_actives == 0 && _num_actives), the value of _num_actives goes from 1 to 0 and back to 1 due to different threads exiting and entering _exploit_task().

This is a rare occurrence and rather harmless, because even if the thread mistakenly parked itself, it verified that the queues are empty just before doing so. Any new items in the queues will wake up at least one thread anyway.

To give some hard numbers, I'm working with an application that has long periods of time when only a single task is active. When testing on a Samsung S10 phone, I saw around a 30% performance improvement with this fix and a reduction of CPU usage from 200% to 100% during the periods when only a single task is active. The performance improvement comes from higher boost clocks when only a single thread is active and from being able to sustain high frequencies for longer without throttling.

This PR needs testing on a wider set of test scenarios. Could you point me to test suites that you usually run to verify that there are no performance regressions?

@tsung-wei-huang
Member

tsung-wei-huang commented Oct 28, 2022

@p12tic This is a great proposal! I will be happy to understand more details (yes, I am aware of the issue you mentioned) by looking into the pull closely. Normally, I use Taskflow Benchmarks to analyze the performance difference between two versions. I have not yet had a good idea (or the time) to automate this process like the usual regression testing. Any suggestions from you are very welcome :)

Also, have you passed the unit tests? They should also give you some runtime data from a comprehensive set of tests.

@p12tic
Contributor Author

p12tic commented Oct 29, 2022

@tsung-wei-huang Thanks for the suggestions.

As a preliminary test run, I ran ctest on the current master branch and on the PR branch. The timings are as follows (on a workstation with an AMD 2990WX with turbo boost enabled):

New (average 38.3s per run):

Total Test time (real) =  39.85 sec
Total Test time (real) =  37.95 sec
Total Test time (real) =  38.34 sec
Total Test time (real) =  38.06 sec
Total Test time (real) =  36.91 sec
Total Test time (real) =  37.08 sec
Total Test time (real) =  39.76 sec
Total Test time (real) =  37.30 sec
Total Test time (real) =  37.86 sec
Total Test time (real) =  40.05 sec

Old (average 40.1s):

Total Test time (real) =  39.00 sec
Total Test time (real) =  40.37 sec
Total Test time (real) =  39.88 sec
Total Test time (real) =  39.36 sec
Total Test time (real) =  39.45 sec
Total Test time (real) =  41.30 sec
Total Test time (real) =  39.73 sec
Total Test time (real) =  41.38 sec
Total Test time (real) =  40.13 sec
Total Test time (real) =  40.74 sec

So there is around 4% performance improvement visible in unit tests already. I will run the proper benchmarks a bit later.

Regarding regression testing: to compare whether software version A is faster or slower than version B, running the tests in ABABABABABAB order works even on unstable hardware whose performance depends on environmental factors such as temperature. I've used this strategy successfully in the past to catch statistically significant performance regressions in mobile apps. It should be possible to run this on something like GitHub Actions even though we don't control the actual hardware the code runs on. This will not help with tracking performance differences over time, because performance varies widely even on identical hardware and we don't control which runner GitHub assigns to the test.

@tsung-wei-huang
Member

tsung-wei-huang commented Oct 29, 2022

@p12tic thank you for the feedback. What if the following situation happens:

// many parallel tasks spawned after a long linear chain (p1, p2, p3, ...)

// linear chain
a.precede(b);
b.precede(c);
c.precede(d);
d.precede(e);

// parallel task after the linear chain
e.precede(p1, p2, p3, p4, p5);

This is because the worker running the linear chain is counting on the fact that at least one other worker is making a steal attempt, so the line here does not notify any worker.

In this case, your algorithm will force p1 to p5 to be run by just one worker, right?

@p12tic
Contributor Author

p12tic commented Nov 3, 2022

@tsung-wei-huang Thanks a lot for the question.

(This is a long comment, but I think you could find it interesting, as there's a small likelihood I've found a genuine improvement to the algorithm.)

Turns out that your previous comment was spot on and I essentially broke the scheduling algorithm. The fact that there's at least one busy stealing worker is not a bug but an invariant of the design. I hadn't previously read your latest paper about Taskflow in enough detail; now that I have, I see this as one of its core principles.

I have turned your example into a test, and it verifies that the current PR breaks things. The test is included in the new iteration of the PR.

Given that you know the library very well, I assume the chance that I can improve it is pretty small, but I will still explore alternatives a bit, as one busy stealing worker is very suboptimal in my use case.

The objective is to park the currently busy stealing worker when there are no new tasks. To do this, we need to somehow wake at least one worker when there's a new task and there's no busy stealing worker.

Suppose we allow the last stealing worker to park once it doesn't find any tasks to steal. At that point the following holds:

  • There are zero stealing workers.

  • All workers have exactly zero tasks in their queues.

We can pick this situation as a synchronization point. When a worker schedules a new task, if it notices that its queue was empty just before adding the new task, it can check whether there are zero stealing workers and wake at least one up. More formally:

// last stealing worker just before parking
num_thieves = 0
if all queues empty:
    park()

// worker that schedules a new task to its empty queue
queue_was_empty = queue.empty()
queue.push(task)
if num_thieves == 0 && queue_was_empty:
    notify_worker()

It's evident that num_thieves = 0 in the parking worker must happen before queue.push(task) of any scheduling worker in order for the thread to park. As a consequence, the scheduling worker is guaranteed to observe num_thieves == 0 in case the parking thread parked, and will be able to notify it.

This PR now implements the approach outlined above. The last commit optimizes it a bit by extracting whether num_thieves == 0 into a separate variable. This makes the num_thieves == 0 checks essentially free, as the value can live in the private caches of the processor cores most of the time; this is not possible with num_thieves itself, because its value changes often.

I have run benchmarks from the benchmarks directory and did not observe any change in performance. One note: I ran the comparisons against the "Return whether TaskQueue was empty before push()" commit rather than the master branch, because at least GCC 10.2 optimizes the function slightly differently (verified with a disassembler), which introduces around a 3% performance regression depending on the benchmark. I will resolve this in a separate PR, as it's a pre-existing missed optimization.

@p12tic
Contributor Author

p12tic commented Nov 3, 2022

Just copying a comment I made on #442.

I've set up some counters, and the frequency of _notifier.notify() calls from _schedule() is minuscule as implemented in #440: on the order of less than ten per second.

@p12tic
Contributor Author

p12tic commented Nov 3, 2022

@tsung-wei-huang The thing that makes such a small number of _notifier.notify(...) calls from _schedule(...) possible is commit 9273952. Basically, we ensure that _notifier.notify(...) is unlikely to be called more than once in cases where it's needed.

@tsung-wei-huang
Member

tsung-wei-huang commented Nov 7, 2022

Hi @p12tic,

Sorry for the late response; I have been thinking about a better way to put everything together. Thank you for the pull request. It inspired me to rethink the work-stealing algorithm to avoid busy stealing when no new tasks are coming. I summarize the solution below (it might be a bit long to read):

Recap: The busy-waiting problem stems from the design of Taskflow's scheduling algorithm. The algorithm maintains the invariant: when there is an active worker running tasks, we keep at least one worker busy stealing unless all workers are active. This prevents starvation and keeps the number of over-utilized threads upper-bounded by 1. Of course, in the extreme case where you have a long linear chain of tasks, the busy worker making stealing attempts becomes totally unnecessary.

Motivation: To solve the problem, I have designed a new invariant based on our discussion: when a task queue of a worker is non-empty, we keep at least one worker busy stealing unless all workers are busy running tasks. This new invariant differs from the original one in that no workers will make busy stealing attempts when all queues are empty, in particular when a worker is running a long task and its queue is empty. So, if we can keep this new invariant in our algorithm, we are good :)

Solution: I have pushed my solution to both the master and the dev branches. We can realize this new invariant by implementing the following logic:

  1. When a task queue of a worker becomes non-empty (i.e., goes from 0 to 1 tasks) AND there are no thieves making stealing attempts, we should notify. See code here.
  2. When a worker drains its local task queue, it becomes a thief (++_num_thieves) and starts making stealing attempts. There are two situations here: if the steal is successful and it is the last thief, it notifies another thief to avoid starvation. Otherwise, it moves on to the 2-phase commit guard, decrements the thief count (--_num_thieves), and checks whether all queues are empty before committing to sleep. See code here.

Proof Sketch: Consider the following two situations:

  1. A worker inserts a task into its empty queue: here, if _num_thieves == 0, we have the easy case and simply notify; otherwise, we know at least one thief will check all queues again before committing to sleep (notice that we decrement the thief count before this check, resembling Dekker's algorithm).
  2. A worker inserts a task into its non-empty queue: here, we bypass the expensive accesses to _num_thieves and the notify call, but can we guarantee no starvation happens? Yes: by induction, the first case guarantees at least one thief will eventually steal from this worker's queue and check whether it is empty. Additionally, if that thief is the last one, it will wake up another one to steal tasks.

Benefit: The new invariant largely simplifies the scheduling algorithm. We can now remove _num_actives, which was required only by the original invariant, and _wait_for_tasks is simplified as well without any access to _num_actives. The cost of accessing _num_thieves is small for a running worker because it is paid only when that worker inserts a task into an empty queue. I am also seeing a performance boost with this new invariant:

# original invariant (with busy stealing)
Total Test time (real) =  20.13 sec
Total Test time (real) =  19.94 sec
Total Test time (real) =  19.90 sec
Total Test time (real) =  19.48 sec
Total Test time (real) =  19.58 sec
Total Test time (real) =  19.86 sec
Total Test time (real) =  19.82 sec
Total Test time (real) =  20.10 sec
Total Test time (real) =  19.82 sec
Total Test time (real) =  19.83 sec

# new invariant (without busy stealing)
Total Test time (real) =  19.11 sec
Total Test time (real) =  18.75 sec
Total Test time (real) =  19.19 sec
Total Test time (real) =  18.83 sec
Total Test time (real) =  19.02 sec
Total Test time (real) =  18.97 sec
Total Test time (real) =  19.09 sec
Total Test time (real) =  18.96 sec
Total Test time (real) =  19.04 sec
Total Test time (real) =  19.05 sec

Also, I ran the following program and compared its CPU utilization between the two invariants:

tf::Executor executor;
tf::Taskflow taskflow;  // note: "tf::Taskflow taskflow();" would declare a function
taskflow.emplace([]() { while(1); });
executor.run(taskflow).wait();

Below is the CPU utilization reported by the top command (199.7% vs 100.3%):

# original invariant 
2993445 twhuang   20   0  596240   3428   3140 S 199.7   0.0   0:16.55 simple 

# new invariant
2159625 twhuang   20   0  596240   3376   3092 S 100.3   0.0   0:19.82 simple

Action for you: If possible, could you please test the new version (main branch) on your workload again? The implementation does not need any additional atomic variables like the _last_thief_needs_wake_up you created in this pull. Also, I have verified the implementation is correct in terms of avoiding starvation and over-subscription of threads. Additionally, I would appreciate it if you could update the pull by migrating your unit test to the new implementation.

Thank you! This is a big contribution from you :)

Probably we can consider submitting our result to JSSPP.

@p12tic
Contributor Author

p12tic commented Nov 7, 2022

@tsung-wei-huang The performance boost in this particular case comes from workers now being woken up much more frequently. _num_thieves will stay zero until one of the workers wakes up and enters the stealing loop. During this time, the queue may go from empty to non-empty for more than one worker, more than once. Thus more than one worker may be woken up.

I'm not saying this is bad, just that I saw a similar performance improvement when waking workers without checking for _num_thieves here.

Also, in e654075#diff-4a1fcfe46f56827ffcf457447e964af5bdbc7e4d2673d9dd6a5f2b8db3af577aR1281, we should implement the check like in this PR, because most likely num_nodes workers will be woken up.

@p12tic
Contributor Author

p12tic commented Nov 7, 2022

I will try to set up a benchmark that would exacerbate the issues I have mentioned.

@tsung-wei-huang
Member

Perfect! Thank you for the help!

@p12tic
Contributor Author

p12tic commented Nov 8, 2022

@tsung-wei-huang I have updated the PR with the test ported to the new interface. I needed a way to get the number of sleeping workers once num_actives was removed.

@tsung-wei-huang
Member

The current implementation has revised the algorithm and included ideas in this pull request.
