Description
Before proceeding
- I didn't find a similar issue
Problem
I am facing a specific scenario where a new EKS node starts and multiple pods try to pull large images simultaneously. This trips the AWS ECR QPS limit (rate limiting).
The behavior I observe:
- Without `HopeUntilEndOfDeployProcess`, nelm fails almost immediately (within 2-5 minutes) because it sees `ImagePullBackOff`.
- However, if I just wait a few more seconds or minutes, the images are actually pulled successfully as Kubernetes retries.

To handle this, I tried using `werf.io/fail-mode: HopeUntilEndOfDeployProcess`.
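For reference, here is roughly how the annotation is applied in the chart; the Deployment name, labels, and image are illustrative placeholders:

```yaml
# Illustrative excerpt; everything except the werf.io/fail-mode
# annotation is a placeholder.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    werf.io/fail-mode: HopeUntilEndOfDeployProcess
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          # Heavy image pulled from ECR; placeholder registry path.
          image: <account>.dkr.ecr.<region>.amazonaws.com/my-app:latest
```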
The Problem: When "Hope" mode is active, nelm goes silent. It waits for the timeout (e.g., 20m), but it doesn't stream Pod logs or Kubernetes events (such as the pull errors) during this period. If it eventually times out, I get a generic `context timed out` error without any logs or details of what was happening during those 20 minutes.
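During this silent window, the only place I can see the pull errors is out-of-band via plain kubectl (namespace and pod name below are illustrative):

```shell
# Watch Kubernetes events (including ImagePullBackOff reasons)
# while nelm waits silently.
kubectl get events -n my-namespace --watch

# Inspect the image pull status of a specific pod.
kubectl describe pod my-app-5d4c7b9f6-abcde -n my-namespace
```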
I understand that I need to fix the rate limits on the infrastructure side, but I want to understand how to handle this correctly in nelm.
Questions:
1. Is there a way to increase the grace period before nelm decides to fail fast, without switching to the silent "Hope" mode? (See the sketch after this list.)
2. How can I force nelm to keep showing logs and events while it is in `HopeUntilEndOfDeployProcess` mode?
3. What is the recommended way to handle resources that are "flaky" at start-up but definitely healthy after a short retry period?
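To illustrate what I mean in question 1: a minimal sketch, assuming the `werf.io/failures-allowed-per-replica` annotation from werf also applies here (I have not verified that it covers the image-pull window):

```yaml
# Sketch only: assumes werf.io/failures-allowed-per-replica is honored
# in this scenario; the value is illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    # Tolerate more transient failures per replica before the whole
    # deploy process is failed.
    werf.io/failures-allowed-per-replica: "5"
```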
Desired behavior: I want nelm to be "patient" (wait longer than 2 minutes for a pull), but stay "talkative" (show me that the pull is currently failing and why).
Solution (if you have one)
No response
Additional information
No response