SPARK-1039: Set the upper bound for retry times of in-cluster drivers #8
The main change is to add a configurable parameter, spark.driver.maxretrynum, which bounds how many times an in-cluster driver program is retried when it fails to start.
In the current implementation, an in-cluster driver that fails to start falls into an endless retry loop unless the user kills it manually;
in DriverRunner.scala the loop condition is keepTrying = supervise && exitCode != 0 && !killed.
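The sketch below illustrates the kind of bound this PR adds to the worker-side loop; everything except the quoted keepTrying expression (runOnce, isKilled, maxRetryNum, retryCount) is an illustrative stand-in, not the actual DriverRunner code.

```scala
// Hypothetical, simplified version of the worker-side retry loop.
object BoundedRetrySketch {
  def runWithRetry(supervise: Boolean,
                   maxRetryNum: Int,              // spark.driver.maxretrynum
                   runOnce: () => Int,            // launches the driver, returns its exit code
                   isKilled: () => Boolean): Int = {
    var retryCount = 0
    var exitCode = runOnce()
    // Before the PR: keepTrying = supervise && exitCode != 0 && !killed
    // With the PR, the loop also stops once retryCount reaches the bound,
    // so with the default maxRetryNum = 0 the driver is tried only once.
    while (supervise && exitCode != 0 && !isKilled() && retryCount < maxRetryNum) {
      retryCount += 1
      exitCode = runOnce()
    }
    exitCode
  }
}
```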
With this PR:
In the worker, if the driver has been retried spark.driver.maxretrynum times and still fails to run, the master is notified via DriverStateChanged.
If the driver state is ERROR or FAILED and the driver still has a chance to run on another worker (i.e. its retry count has not exceeded the number of workers or the configured maxretrynum), it is relaunched on another worker (see the sketch after these notes).
In this implementation, the driver's updated retry count is not written out by the persistenceEngine (to avoid excessive IO); the structure recording which workers the driver has already been assigned to is likewise not persisted.
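A minimal sketch of the master-side relaunch decision, assuming hypothetical names (DriverSketch, RelaunchPolicy, shouldRelaunch, onDriverStateChanged); only the DriverState values and the retriedcountOnMaster counter correspond to things named in this PR.

```scala
// Illustrative only: how the master might react to a DriverStateChanged message.
object DriverState extends Enumeration {
  val RUNNING, FINISHED, ERROR, FAILED = Value
}

// Stand-in for DriverInfo; retriedcountOnMaster mirrors the transient field in the PR.
final case class DriverSketch(id: String, var retriedcountOnMaster: Int = 0)

class RelaunchPolicy(effectiveRetryLimit: Int) {
  /** True when a failed driver still has a chance to run on another worker. */
  def shouldRelaunch(state: DriverState.Value, driver: DriverSketch): Boolean =
    (state == DriverState.ERROR || state == DriverState.FAILED) &&
      driver.retriedcountOnMaster < effectiveRetryLimit

  def onDriverStateChanged(state: DriverState.Value, driver: DriverSketch): String =
    if (shouldRelaunch(state, driver)) {
      driver.retriedcountOnMaster += 1   // transient: deliberately not persisted
      s"relaunch driver ${driver.id} on another worker"
    } else {
      s"finalize driver ${driver.id} in state $state"
    }
}
```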
Description of the implementation:
A two-level upper limit is set on the retry count, on the master and on the worker, both specified by spark.driver.maxRetry. The most obvious benefit of such a limit is that a failed driver is killed earlier, freeing memory and CPU cores.
The master-side counter is driverInfo.retriedcountOnMaster, a transient variable; the worker-side counter is a local variable in DriverRunner.
On the master side, the default value of spark.driver.maxRetry is 0, in which case the driver is relaunched at most workers.size times; otherwise it is relaunched at most MIN(maxRetry, workers.size) times. On the worker side, the default value of spark.driver.maxRetry is also 0, in which case the driver is tried only once (the design was changed to this because moving the driver to a new worker has a better chance of success, e.g. a misconfigured environment variable may be correct on another machine, or more memory may be available there); otherwise the driver is restarted up to maxRetry times (a rough sketch of these defaults follows this list).
DriverInfo is not serialized to the worker, i.e. the relevant DeployMessages revert to their pre-PR form.
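As a rough illustration of the default-value semantics described above (the object and method names here are made up for this sketch; 0 means "use the default"):

```scala
// Illustrative only: effective retry bounds derived from spark.driver.maxRetry.
object RetryLimitsSketch {
  /** Master side: default (0) allows up to workers.size relaunches,
   *  otherwise min(maxRetry, workers.size). */
  def masterLimit(maxRetry: Int, numWorkers: Int): Int =
    if (maxRetry <= 0) numWorkers else math.min(maxRetry, numWorkers)

  /** Worker side: default (0) means a single attempt with no local retry,
   *  otherwise up to maxRetry restarts on the same worker. */
  def workerLimit(maxRetry: Int): Int =
    if (maxRetry <= 0) 0 else maxRetry

  def main(args: Array[String]): Unit = {
    println(masterLimit(maxRetry = 0, numWorkers = 5))  // 5
    println(masterLimit(maxRetry = 3, numWorkers = 5))  // 3
    println(workerLimit(maxRetry = 0))                  // 0: one attempt, no retries
    println(workerLimit(maxRetry = 2))                  // 2
  }
}
```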