
Conversation

@CodingCat
Contributor

The main change is to add a configurable parameter, spark.driver.maxretrynum, which limits the number of retries when the in-cluster driver program fails to start.

In the current implementation, a supervised in-cluster driver that fails to start falls into an endless restart loop unless the user kills it manually;

in DriverRunner.scala: keepTrying = supervise && exitCode != 0 && !killed
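
A minimal sketch (not the actual DriverRunner code) of how a worker-side cap would bound this loop; only supervise, exitCode and killed come from the condition above, while maxRetryNum, launchDriver and retries are hypothetical names used for illustration:

// A minimal sketch, not the actual DriverRunner code: only `supervise`,
// `exitCode` and `killed` come from the condition above; `maxRetryNum`,
// `launchDriver` and `retries` are hypothetical names for illustration.
object BoundedSuperviseSketch {
  @volatile var killed = false      // set when the user kills the driver

  def launchDriver(): Int = ???     // stand-in for running the driver process once

  def runWithRetries(supervise: Boolean, maxRetryNum: Int): Int = {
    var retries  = 0
    var exitCode = launchDriver()
    var keepTrying = supervise && exitCode != 0 && !killed
    // The added `retries < maxRetryNum` check is what turns the endless
    // restart loop into a bounded one.
    while (keepTrying && retries < maxRetryNum) {
      retries += 1
      exitCode = launchDriver()
      keepTrying = supervise && exitCode != 0 && !killed
    }
    exitCode
  }
}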

With this PR

In the worker, once the driver has been retried spark.driver.maxretrynum times and still fails to run, the master is notified via DriverStateChanged.

If the driver state is ERROR or FAILED and the driver still has a chance to run on another worker (i.e. its retry count has not exceeded the number of workers or the configured maxretrynum), the driver is relaunched on another worker, as sketched below.

In this implementation, the driver's updated retry count is not written out by the persistenceEngine (to avoid excessive I/O), and the structure recording which workers the driver has already been assigned to is likewise not persisted.
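
For concreteness, here is a minimal sketch of how the master could act on the notification; the simplified case classes and the relaunchDriverOn/removeDriver placeholders are hypothetical, and this is not the actual Master.scala code:

// A minimal sketch, not the actual Master code: DriverStateChanged, ERROR and
// FAILED are named in the description; everything else is illustrative.
object MasterRelaunchSketch {
  object DriverState extends Enumeration { val RUNNING, FINISHED, ERROR, FAILED = Value }

  case class WorkerInfo(id: String)

  // retriedCountOnMaster is transient and triedWorkers is not persisted,
  // mirroring the note above about the persistenceEngine.
  case class DriverInfo(id: String,
                        var retriedCountOnMaster: Int = 0,
                        var triedWorkers: Set[String] = Set.empty)

  // Simplified stand-in for the DriverStateChanged notification.
  case class DriverStateChanged(driverId: String, state: DriverState.Value)

  def relaunchDriverOn(d: DriverInfo, w: WorkerInfo): Unit = ???  // placeholder
  def removeDriver(d: DriverInfo): Unit = ???                     // placeholder

  // retryLimit would be the effective master-side bound, e.g.
  // MIN(maxRetry, workers.size); see the sketch in the implementation notes below.
  def handleDriverStateChanged(msg: DriverStateChanged,
                               driver: DriverInfo,
                               workers: Seq[WorkerInfo],
                               retryLimit: Int): Unit = {
    val failed = msg.state == DriverState.ERROR || msg.state == DriverState.FAILED
    if (failed && driver.retriedCountOnMaster < retryLimit) {
      // Prefer a worker the driver has not been assigned to yet.
      workers.find(w => !driver.triedWorkers.contains(w.id)).foreach { w =>
        driver.retriedCountOnMaster += 1
        driver.triedWorkers += w.id
        relaunchDriverOn(driver, w)
      }
    } else if (failed) {
      removeDriver(driver)   // out of retries: give up instead of looping forever
    }
  }
}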

Description of the implementation:

A two-level upper limit is set on the retry count, one on the master and one on the worker, both specified by spark.driver.maxRetry. The most obvious benefit of such a limit is that a failed driver is killed earlier, freeing memory and CPU cores.

The master-side count is kept in driverInfo.retriedcountOnMaster, a transient variable; the worker-side count is a local variable in DriverRunner.

On the master side, the default value of spark.driver.maxRetry is 0, in which case the driver is restarted at most workers.size times; otherwise it is restarted at most MIN(maxRetry, workers.size) times. On the worker side, the default value of spark.driver.maxRetry is 0, in which case the driver is tried ONLY ONCE locally (I changed the design to this because moving a driver to a new worker probably gives it a better chance of running successfully, e.g. a misconfigured environment variable may be correct on another machine, or memory may be larger there); otherwise the driver is restarted maxRetry times. A sketch of this computation follows.
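
A minimal sketch of this default handling; the function names are hypothetical, and only spark.driver.maxRetry and the MIN(maxRetry, workers.size) rule come from the description above:

object RetryLimitSketch {
  // Master side: the default (0) means the driver may be relaunched on at most
  // workers.size workers; otherwise the bound is MIN(maxRetry, workers.size).
  def masterRestartLimit(maxRetry: Int, numWorkers: Int): Int =
    if (maxRetry <= 0) numWorkers else math.min(maxRetry, numWorkers)

  // Worker side: the default (0) means the driver is launched only once locally
  // (zero local restarts); otherwise it is restarted maxRetry times.
  def workerRestartLimit(maxRetry: Int): Int =
    if (maxRetry <= 0) 0 else maxRetry
}

For example, masterRestartLimit(0, 5) == 5 and workerRestartLimit(0) == 0, matching the defaults described above.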

DriverInfo is no longer serialized to the worker, i.e. the relevant DeployMessages revert to their pre-PR form.

@AmplabJenkins

Can one of the admins verify this patch?

@mateiz
Contributor

mateiz commented Apr 5, 2014

Jenkins, test this please

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13804/

@CodingCat
Contributor Author

Closed; as Patrick said in https://issues.apache.org/jira/browse/SPARK-1039, this won't be fixed.

@CodingCat CodingCat closed this Apr 5, 2014
@CodingCat CodingCat deleted the SPARK-1039 branch April 5, 2014 22:53