[RFC] Fix unnecessary busy waiting when no new tasks are incoming #440
Conversation
@p12tic This is a great pull request proposal! I will be happy to understand more details (yes, I am aware of the issue you mentioned) by looking into the pull request closely. Normally, I use Taskflow Benchmarks to analyze the performance difference between two versions. I have not yet had a good idea or the time to automate this process like the usual regression testing. Any suggestion from you is very welcome :) Also, have you passed the unittest? The unittest should also tell you some runtime data from a comprehensive set of tests.
@tsung-wei-huang Thanks for the suggestions. As a preliminary test run I ran the unit tests. New (average 38.3s per run):
Old (average 40.1s):
So there is around a 4% performance improvement visible in the unit tests already. I will run the proper benchmarks a bit later. Regarding regression testing, to compare whether software version A is faster or slower than version B, a strategy of running the tests in ABABABABABAB order works even on unstable hardware whose performance depends on environmental factors such as temperature. I've used this strategy successfully in the past to catch statistically significant performance regressions in mobile apps. It should be possible to run this on something like GitHub Actions even when we don't control the actual hardware the code runs on. This will not help with tracking performance differences over time, because there is large variation of performance parameters even on identical hardware and we don't control which runner the performance test lands on, on the GitHub side.
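For illustration, a minimal sketch of the ABAB interleaving idea; run_version_a and run_version_b are hypothetical placeholders for invoking the two builds under test, not part of Taskflow:

```cpp
#include <chrono>
#include <cstdio>
#include <functional>

// Time a single run of a workload in seconds.
double time_once(const std::function<void()>& workload) {
  auto start = std::chrono::steady_clock::now();
  workload();
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}

int main() {
  auto run_version_a = [] { /* run the benchmark built from version A */ };
  auto run_version_b = [] { /* run the benchmark built from version B */ };

  double total_a = 0.0, total_b = 0.0;
  const int rounds = 6;
  // Interleave A and B so slow environmental drift (e.g. thermal throttling)
  // affects both versions roughly equally.
  for(int i = 0; i < rounds; ++i) {
    total_a += time_once(run_version_a);
    total_b += time_once(run_version_b);
  }
  std::printf("A: %.3fs avg, B: %.3fs avg\n", total_a / rounds, total_b / rounds);
}
```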
@p12tic thank you for the feedback. What if the following situation happens:
This is because the worker running the linear chain is counting on the fact that at least one other worker is making a steal attempt, so the line here does not invoke any worker. In this case, your algorithm will force
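For concreteness, a long linear chain like the one described could be built as follows; this is an illustrative sketch, not code from this PR:

```cpp
#include <taskflow/taskflow.hpp>

int main() {
  tf::Executor executor;
  tf::Taskflow taskflow;

  // Build a chain where each task enables exactly one successor, so only one
  // worker ever has work while the others are reduced to stealing attempts.
  tf::Task prev = taskflow.emplace([](){ /* small unit of work */ });
  for(int i = 1; i < 10000; ++i) {
    tf::Task next = taskflow.emplace([](){ /* small unit of work */ });
    prev.precede(next);
    prev = next;
  }

  executor.run(taskflow).wait();
}
```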
@tsung-wei-huang Thanks a lot for the question. (This is a long comment, but I think you may find it interesting, as there's a small chance I've found a genuine improvement to the algorithm.) Turns out that your previous comment was spot on and I had essentially broken the scheduling algorithm. The fact that there's at least one busy stealing worker is not a bug but an invariant of the design. I hadn't previously read your latest paper about Taskflow in enough detail; now that I have, I see this as one of the core principles. I have turned your example into a test and it verifies that the current PR breaks things. The test is included in the new iteration of the PR. Given that you know the library very well, I assume the chance that I can improve it is pretty small, but I will still explore alternatives a bit, as one busy stealing worker is very suboptimal in my use case. The objective is to park the currently busy stealing worker when there are no new tasks. In order to do this, we need to somehow wake at least one worker when there is a new task and there is no busy stealing worker. Suppose we allow the last stealing worker to park once it doesn't find any tasks to steal. At that point the following holds:
We can pick this situation as a synchronization point. When a worker schedules a new task, if it notices that its queue was empty just before adding a new task to it, it could check if there are zero stealing workers and wake at least one up. More formally:
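A minimal sketch of this wakeup check, using a num_thieves counter plus hypothetical helpers (push(), wake_one_worker()); it is not the exact Taskflow implementation:

```cpp
#include <atomic>
#include <cstddef>
#include <utility>

std::atomic<std::size_t> num_thieves{0};  // workers currently making steal attempts

// Hypothetical scheduling hook: called by a worker pushing `task` into its own queue.
template <typename Queue, typename Task>
void schedule(Queue& my_queue, Task&& task, void (*wake_one_worker)()) {
  bool was_empty = my_queue.empty();        // observed just before the push
  my_queue.push(std::forward<Task>(task));
  // If the queue went from empty to non-empty and nobody is currently stealing,
  // no worker may ever look at this queue again, so wake at least one up.
  if(was_empty && num_thieves.load(std::memory_order_acquire) == 0) {
    wake_one_worker();
  }
}
```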
This PR now implements the approach outlined above. The last commit optimizes it a bit by extracting the information of whether the queue was empty before the push. I have run benchmarks from the Taskflow benchmark suite.
@tsung-wei-huang The thing that makes it possible to have a very small number of
Hi @p12tic, sorry for the late response; I have been thinking of a better way to put everything together. Thank you for the pull request. It inspires me to rethink a new work-stealing algorithm that avoids busy stealing when no new tasks are coming. I summarize the solution below (it might be a bit long to read):

Recap: The busy-waiting problem comes from the design of Taskflow's scheduling algorithm. The algorithm maintains the following invariant: when there is an active worker running tasks, we keep at least one worker busy stealing unless all workers are active. This prevents starvation and keeps the number of over-utilized threads upper-bounded by 1. Of course, in the extreme case where you have a long linear chain of tasks, the busy worker making stealing attempts becomes totally unnecessary.

Motivation: To solve the problem, I have designed a new invariant based on our discussion: when the task queue of a worker is non-empty, we keep at least one worker busy stealing unless all workers are busy running tasks. This new invariant differs from the original one in that no workers will be busy making stealing attempts if all queues are empty, in particular when a worker is running a long task and its queue is empty. So, if we can keep this new invariant in our algorithm, we are good :)

Solution: I have pushed my solution to both the master and the dev branches. We can realize this new invariant by implementing the following logic:
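For illustration, a sketch of how the last-thief decision could respect this invariant; the helper names (all_queues_empty(), keep_stealing(), park()) are hypothetical and the actual implementation lives in the master and dev branches:

```cpp
#include <atomic>
#include <cstddef>

std::atomic<std::size_t> num_thieves{0};  // workers currently making steal attempts

// Hypothetical hook: called when a worker failed to steal anything this round.
template <typename Scheduler>
void on_steal_failed(Scheduler& s) {
  if(num_thieves.fetch_sub(1, std::memory_order_acq_rel) == 1) {
    // We were the last thief. The new invariant only allows parking when every
    // worker queue is empty; otherwise some queue still holds tasks and at
    // least one thief must keep scanning.
    if(!s.all_queues_empty()) {
      num_thieves.fetch_add(1, std::memory_order_acq_rel);
      s.keep_stealing();
      return;
    }
  }
  s.park();  // safe: a later push into an empty queue will wake a worker up
}
```

The key point is that the last thief re-checks all queues right before sleeping, so a non-empty queue always keeps at least one thief alive.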
Proof Sketch: Consider the following two situations:
Benefit: The new invariant largely simplifies the scheduling algorithm. We now remove _num_actives. Below are ten unittest runs under each invariant:

# original invariant (with busy stealing)
Total Test time (real) = 20.13 sec
Total Test time (real) = 19.94 sec
Total Test time (real) = 19.90 sec
Total Test time (real) = 19.48 sec
Total Test time (real) = 19.58 sec
Total Test time (real) = 19.86 sec
Total Test time (real) = 19.82 sec
Total Test time (real) = 20.10 sec
Total Test time (real) = 19.82 sec
Total Test time (real) = 19.83 sec
# new invariant (without busy stealing)
Total Test time (real) = 19.11 sec
Total Test time (real) = 18.75 sec
Total Test time (real) = 19.19 sec
Total Test time (real) = 18.83 sec
Total Test time (real) = 19.02 sec
Total Test time (real) = 18.97 sec
Total Test time (real) = 19.09 sec
Total Test time (real) = 18.96 sec
Total Test time (real) = 19.04 sec
Total Test time (real) = 19.05 sec

Also, I ran the following program and compared its CPU utilization between the two invariants:

#include <taskflow/taskflow.hpp>

int main() {
  tf::Executor executor;
  tf::Taskflow taskflow;
  taskflow.emplace([]() { while(1); });  // one long-running task that never spawns new tasks
  executor.run(taskflow).wait();
}

Below is the CPU utilization using top:

# original invariant
2993445 twhuang 20 0 596240 3428 3140 S 199.7 0.0 0:16.55 simple
# new invariant
2159625 twhuang 20 0 596240 3376 3092 S 100.3 0.0 0:19.82 simple

Action for you: If possible, could you please test the new version (main branch) on your workload again? The implementation does not have any additional atomic variables. Thank you! This is a big contribution from you :) Probably we can consider submitting our result to JSSPP.
@tsung-wei-huang The performance boost in this particular case is because right now workers wake tasks much more frequently. I'm not saying this is bad, just that I saw a similar performance improvement when waking workers without checking for. Also, in e654075#diff-4a1fcfe46f56827ffcf457447e964af5bdbc7e4d2673d9dd6a5f2b8db3af577aR1281, we should implement the check like in this PR, because most likely
I will try to set up a benchmark that would exacerbate the issues that I have mentioned.
Perfect! Thank you for the help!
@tsung-wei-huang I have updated the PR with the test ported to the new interface. I needed a way to get the number of sleeping workers once num_actives was removed.
The current implementation has revised the algorithm and included the ideas from this pull request.
Currently, if there is a long-running task that does not itself produce additional tasks, there will be one thread busy waiting because it won't be able to park. This is because, if the thread is the last one without tasks (_num_thieves == 1), then parking can only happen if _num_actives == 0.
This check is there to prevent missed wakeups that are initiated in _exploit_task(). The same can be achieved by enhancing the condition to specifically check for a transition of _num_actives from 0 to nonzero. This allows the thread to park and avoids busy waiting.
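A sketch of the enhanced condition, using the names from this PR (_num_thieves, _num_actives, _exploit_task()) but with hypothetical surrounding scaffolding; it is not the verbatim wait-for-task code:

```cpp
#include <atomic>
#include <cstddef>

std::atomic<std::size_t> _num_thieves{0};
std::atomic<std::size_t> _num_actives{0};

// Decide whether the last thief may park, given the _num_actives value it
// loaded earlier in the same wait loop.
bool may_park(std::size_t prev_num_actives) {
  // Original rule: the last thief (_num_thieves == 1) may park only while
  // _num_actives == 0, which forces busy waiting whenever one long task runs.
  //
  // Enhanced rule: only a 0 -> nonzero transition of _num_actives (which could
  // coincide with a wakeup issued from _exploit_task()) cancels parking.
  if(prev_num_actives == 0 && _num_actives.load(std::memory_order_acquire) != 0) {
    return false;  // a wakeup may have been issued while we were deciding
  }
  return true;
}
```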
The change removes the guarantee that, when _num_actives changes from 0 to 1 while the last thief thread is in the process of parking, this parking will be cancelled. It's possible that between _num_actives.load() and if(prev_num_actives == 0 && _num_actives) the value of _num_actives goes from 1 to 0 and back to 1 due to different threads exiting and entering _exploit_task(). This is a rare occurrence and rather harmless, because even if the thread mistakenly parked itself, it verified that the queues are empty just before doing so. Any new items in the queues will wake up at least one thread anyway.
To give some hard numbers, I'm working with an application that has long periods of time when only a single task is active. When testing on a Samsung S10 phone I saw around a 30% performance improvement with this fix and a reduction of CPU usage from 200% to 100% during the periods when only a single task is active. The performance improvement comes from higher boost clocks when only a single thread is active and from being able to keep high frequencies for longer without throttling.
This PR needs testing on a wider set of scenarios. Could you point me to the test suites that you usually run to verify that there are no performance regressions?