[FEATURE] Add shutdown watchdog to forcefully terminate the spark engine and prevent resource leaks. #7149

@wangzhigang1999

Description

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the feature

This feature introduces a shutdown watchdog that starts as soon as the Spark engine decides to shut down. If the engine has not exited when the timeout is reached, the watchdog prints the stack traces of all threads that are still alive and then forcibly terminates the process.

Motivation

Currently, there are scenarios where the engine should exit but fails to do so due to various reasons, and these scenarios cannot be exhaustively enumerated. For example, see this discussion: #6992 (reply in thread), and these issues: #4280, #7019.

Similarly, we encountered this issue in production. For example, in the following log, after SparkContext stopped, the entire process should have executed the shutdown hook and exited. However, due to an abnormal Ranger thread, the process was blocked for over ten days until it eventually exhausted the ECS resources and was finally discovered.

[Image: screenshot of the engine log]

Describe the solution

I want to add a daemon watchdog thread that is started with a timeout when the stop() method is called. If the process shuts down normally, the watchdog is interrupted and, being a daemon thread, does not keep the JVM alive, so the process exits gracefully. If the timeout is reached and the process is still alive, some threads are blocking the shutdown; the watchdog then prints the stack traces of all live threads and forcibly terminates the process.
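
To make the proposal concrete, here is a minimal Java sketch of such a watchdog. It is only an illustration under stated assumptions, not the actual Kyuubi implementation: the class name `ShutdownWatchdog`, its methods, and the exit code `1` are hypothetical. The exit action is injectable so the logic can be exercised without killing the JVM. The `halting` factory uses `Runtime.halt`, which is one possible choice for the forced exit because it does not run shutdown hooks and so cannot be blocked by a stuck one.

```java
import java.util.Map;
import java.util.function.IntConsumer;

/** Hypothetical sketch of a shutdown watchdog; names are illustrative only. */
final class ShutdownWatchdog {
    private final long timeoutMillis;
    private final IntConsumer exitAction; // injectable so tests need not kill the JVM
    private Thread watchdog;

    ShutdownWatchdog(long timeoutMillis, IntConsumer exitAction) {
        this.timeoutMillis = timeoutMillis;
        this.exitAction = exitAction;
    }

    /** Assumed production variant: halt() skips shutdown hooks, so a stuck hook cannot block it. */
    static ShutdownWatchdog halting(long timeoutMillis) {
        return new ShutdownWatchdog(timeoutMillis, code -> Runtime.getRuntime().halt(code));
    }

    /** Call at the beginning of stop(); starting twice has no effect. */
    synchronized void start() {
        if (watchdog != null) {
            return;
        }
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(timeoutMillis);
            } catch (InterruptedException e) {
                return; // cancelled: shutdown finished within the timeout
            }
            // Timeout reached: report what is still running, then force the exit.
            System.err.println(dumpAllThreads());
            exitAction.accept(1);
        }, "shutdown-watchdog");
        t.setDaemon(true); // a daemon thread never keeps the JVM alive by itself
        t.start();
        watchdog = t;
    }

    /** Call once shutdown has completed normally. */
    synchronized void cancel() {
        if (watchdog != null) {
            watchdog.interrupt();
        }
    }

    /** Formats the name, daemon flag, state, and stack of every live thread. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append('"')
              .append(" daemon=").append(t.isDaemon())
              .append(" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
```

In this sketch, `stop()` would call `start()` first and `cancel()` after the last cleanup step. The thread dump is what makes a case like the blocked Ranger thread diagnosable after the forced exit.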

Additional context

No response

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
  • No. I cannot submit a PR at this time.
