generated from NASA-PDS/template-repo-java
Description
Motivation
I'd like to be able to set the backoffLimit property of KDP nodes to something other than the default of 6 when I deploy the operator.
Additional Details
Nodes in KDP fail for several reasons, from running on EC2 spot instances that were evicted to the host node running out of memory (OOM). In most cases, failing Pods shouldn't bring the entire pipeline to a halt.
Acceptance Criteria
- When I deploy the operator, I can specify what the backoffLimit of Jobs should be
- Processing doesn't come to a standstill when any Node has its Pod die 6 times
Engineering Details
- Update https://github.com/NASA-PDS/kdp/blob/main/operator/operator/manifests/node/job.yaml with a variable for backoffLimit
- Update https://github.com/NASA-PDS/kdp/blob/main/operator/operator/operator.py to pass that variable in from the input manifest (not exactly sure how this works, yet)
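
A rough sketch of how the two changes above might fit together, in case it helps the discussion. It is not based on the actual contents of job.yaml or operator.py: the template text, the `render_job` helper, the `$backoff_limit` placeholder, and the idea that the input manifest exposes a `backoffLimit` key are all assumptions made for illustration, and the real operator may render manifests differently.

```python
from string import Template

# Hypothetical, trimmed-down stand-in for manifests/node/job.yaml.
# The real file will contain more fields; only backoffLimit matters here.
JOB_TEMPLATE = Template("""\
apiVersion: batch/v1
kind: Job
metadata:
  name: $name
spec:
  backoffLimit: $backoff_limit
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: node
          image: $image
""")

# Kubernetes uses 6 when a Job does not set backoffLimit.
DEFAULT_BACKOFF_LIMIT = 6


def render_job(name: str, image: str, node_spec: dict) -> str:
    """Render the Job manifest, taking backoffLimit from the input manifest.

    `node_spec` is assumed to be the already-parsed spec of the input
    manifest; the `backoffLimit` key is a hypothetical name.
    """
    backoff_limit = node_spec.get("backoffLimit", DEFAULT_BACKOFF_LIMIT)
    # Reject bools explicitly, since bool is a subclass of int in Python.
    if (
        isinstance(backoff_limit, bool)
        or not isinstance(backoff_limit, int)
        or backoff_limit < 0
    ):
        raise ValueError(
            f"backoffLimit must be a non-negative integer, got {backoff_limit!r}"
        )
    return JOB_TEMPLATE.substitute(
        name=name, image=image, backoff_limit=backoff_limit
    )
```

With something like this, `render_job("ingest", "example/image:latest", {"backoffLimit": 20})` would produce a Job allowing 20 retries, and omitting the key would keep today's behavior of 6.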