SPARK-1039: Set the upper bound for retry times of in-cluster drivers #8
The main change is to add a configurable parameter, spark.driver.maxretrynum, which bounds how many times an in-cluster driver program is retried when it fails to start.
In the current implementation, an in-cluster driver that fails to start falls into an endless retry loop unless the user kills it manually;
in DriverRunner.scala the loop condition is keepTrying = supervise && exitCode != 0 && !killed.
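The sketch below illustrates the kind of bound this PR adds to the worker-side loop; everything except the quoted keepTrying expression (runOnce, isKilled, maxRetryNum, retryCount) is an illustrative stand-in, not the actual DriverRunner code.

```scala
// Hypothetical, simplified version of the worker-side retry loop.
object BoundedRetrySketch {
  def runWithRetry(supervise: Boolean,
                   maxRetryNum: Int,              // spark.driver.maxretrynum
                   runOnce: () => Int,            // launches the driver, returns its exit code
                   isKilled: () => Boolean): Int = {
    var retryCount = 0
    var exitCode = runOnce()
    // Before the PR: keepTrying = supervise && exitCode != 0 && !killed
    // With the PR, the loop also stops once retryCount reaches the bound,
    // so with the default maxRetryNum = 0 the driver is tried only once.
    while (supervise && exitCode != 0 && !isKilled() && retryCount < maxRetryNum) {
      retryCount += 1
      exitCode = runOnce()
    }
    exitCode
  }
}
```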
With this PR:
In the worker, if the driver has been retried spark.driver.maxretrynum times and still fails to run, the master is notified via DriverStateChanged.
If the driver state is ERROR or FAILED and the driver still has a chance to run on another worker (i.e. its retry count has not exceeded the number of workers or the configured maxretrynum), it is relaunched on another worker (see the sketch after these notes).
In this implementation, the driver's updated retry count is not written out by the persistenceEngine (to avoid excessive IO); the structure recording which workers the driver has already been assigned to is likewise not persisted.
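A minimal sketch of the master-side relaunch decision, assuming hypothetical names (DriverSketch, RelaunchPolicy, shouldRelaunch, onDriverStateChanged); only the DriverState values and the retriedcountOnMaster counter correspond to things named in this PR.

```scala
// Illustrative only: how the master might react to a DriverStateChanged message.
object DriverState extends Enumeration {
  val RUNNING, FINISHED, ERROR, FAILED = Value
}

// Stand-in for DriverInfo; retriedcountOnMaster mirrors the transient field in the PR.
final case class DriverSketch(id: String, var retriedcountOnMaster: Int = 0)

class RelaunchPolicy(effectiveRetryLimit: Int) {
  /** True when a failed driver still has a chance to run on another worker. */
  def shouldRelaunch(state: DriverState.Value, driver: DriverSketch): Boolean =
    (state == DriverState.ERROR || state == DriverState.FAILED) &&
      driver.retriedcountOnMaster < effectiveRetryLimit

  def onDriverStateChanged(state: DriverState.Value, driver: DriverSketch): String =
    if (shouldRelaunch(state, driver)) {
      driver.retriedcountOnMaster += 1   // transient: deliberately not persisted
      s"relaunch driver ${driver.id} on another worker"
    } else {
      s"finalize driver ${driver.id} in state $state"
    }
}
```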
Description of the implementation:
A two-level upper limit is set on the retry count, on the master and on the worker, both specified by spark.driver.maxRetry. The most obvious benefit of such a limit is that a failed driver is killed earlier, freeing memory and CPU cores.
The master-side counter is driverInfo.retriedcountOnMaster, a transient variable; the worker-side counter is a local variable in DriverRunner.
On the master side, the default value of spark.driver.maxRetry is 0, in which case the driver is relaunched at most workers.size times; otherwise it is relaunched at most MIN(maxRetry, workers.size) times. On the worker side, the default value of spark.driver.maxRetry is also 0, in which case the driver is tried only once (the design was changed to this because moving the driver to a new worker has a better chance of success, e.g. a misconfigured environment variable may be correct on another machine, or more memory may be available there); otherwise the driver is restarted up to maxRetry times (a rough sketch of these defaults follows this list).
DriverInfo is not serialized to the worker, i.e. the relevant DeployMessages revert to their pre-PR form.
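As a rough illustration of the default-value semantics described above (the object and method names here are made up for this sketch; 0 means "use the default"):

```scala
// Illustrative only: effective retry bounds derived from spark.driver.maxRetry.
object RetryLimitsSketch {
  /** Master side: default (0) allows up to workers.size relaunches,
   *  otherwise min(maxRetry, workers.size). */
  def masterLimit(maxRetry: Int, numWorkers: Int): Int =
    if (maxRetry <= 0) numWorkers else math.min(maxRetry, numWorkers)

  /** Worker side: default (0) means a single attempt with no local retry,
   *  otherwise up to maxRetry restarts on the same worker. */
  def workerLimit(maxRetry: Int): Int =
    if (maxRetry <= 0) 0 else maxRetry

  def main(args: Array[String]): Unit = {
    println(masterLimit(maxRetry = 0, numWorkers = 5))  // 5
    println(masterLimit(maxRetry = 3, numWorkers = 5))  // 3
    println(workerLimit(maxRetry = 0))                  // 0: one attempt, no retries
    println(workerLimit(maxRetry = 2))                  // 2
  }
}
```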