Description
Code of Conduct
- I agree to follow this project's Code of Conduct
Search before asking
- I have searched in the issues and found no similar issues.
Describe the feature
This feature introduces a mechanism that starts a shutdown watchdog when the Spark engine decides to shut down. If the timeout is reached, the watchdog prints the stack traces of all currently alive threads and then forcibly terminates the process.
Motivation
Currently, there are scenarios where the engine should exit but fails to do so for various reasons, and these scenarios cannot be exhaustively enumerated. For example, see this discussion: #6992 (reply in thread), and these issues: #4280, #7019.
We encountered this issue in production as well. For example, in the following log, after the SparkContext stopped, the process should have run its shutdown hooks and exited. However, a stuck Ranger thread blocked the shutdown, and the process hung for over ten days until it exhausted the ECS resources and was finally discovered.
Describe the solution
I want to add a daemon watchdog thread that is started with a timeout when the stop() method is called. If the process shuts down normally, the daemon thread is interrupted and the process exits gracefully. If the timeout is reached and the process is still alive, some threads must be blocking the shutdown; the watchdog then prints all live threads in the process and forcibly exits.
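The idea above can be sketched with standard JDK APIs. This is a minimal illustration, not the actual Kyuubi implementation: the class name `ShutdownWatchdog` and its methods are hypothetical, but `Thread.getAllStackTraces()` and `Runtime.halt()` are real JDK calls that dump live threads and bypass blocked shutdown hooks, respectively.

```java
import java.util.Map;

public class ShutdownWatchdog {
    private final long timeoutMs;
    private Thread watchdog;

    public ShutdownWatchdog(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Start the watchdog; call this when stop() begins. */
    public synchronized void start() {
        watchdog = new Thread(() -> {
            try {
                Thread.sleep(timeoutMs);
            } catch (InterruptedException e) {
                return; // shutdown completed normally in time
            }
            // Timeout reached: dump stack traces of all live threads.
            for (Map.Entry<Thread, StackTraceElement[]> e
                    : Thread.getAllStackTraces().entrySet()) {
                System.err.println("Thread: " + e.getKey().getName());
                for (StackTraceElement frame : e.getValue()) {
                    System.err.println("    at " + frame);
                }
            }
            // halt() skips shutdown hooks, so a stuck hook cannot block it.
            Runtime.getRuntime().halt(1);
        }, "shutdown-watchdog");
        watchdog.setDaemon(true); // must not itself keep the JVM alive
        watchdog.start();
    }

    /** Cancel the watchdog when shutdown completes normally. */
    public synchronized void cancel() {
        if (watchdog != null) {
            watchdog.interrupt();
        }
    }

    /** True while the watchdog thread is still running. */
    public synchronized boolean isRunning() {
        return watchdog != null && watchdog.isAlive();
    }
}
```

Marking the thread as a daemon is important: the watchdog itself must never be the thread that keeps the JVM from exiting.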
Additional context
No response
Are you willing to submit PR?
- Yes. I would be willing to submit a PR with guidance from the Kyuubi community.
- No. I cannot submit a PR at this time.