Description
Before proceeding
- I didn't find a similar issue
Problem
I am facing a specific scenario where a new EKS node starts and multiple pods try to pull large images simultaneously. This trips the AWS ECR QPS limit (rate limiting).
The behavior I observe:
- Without `HopeUntilEndOfDeployProcess`, nelm fails almost immediately (within 2-5 minutes) because it sees `ImagePullBackOff`.
- However, if I just wait a few more seconds or minutes, the images are actually pulled successfully as Kubernetes retries.

To handle this, I tried using `werf.io/fail-mode: HopeUntilEndOfDeployProcess`.
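For reference, here is roughly how the annotation is applied in the chart; the Deployment name, labels, and image are illustrative placeholders:

```yaml
# Illustrative excerpt; everything except the werf.io/fail-mode
# annotation is a placeholder.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    werf.io/fail-mode: HopeUntilEndOfDeployProcess
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          # Heavy image pulled from ECR; placeholder registry path.
          image: <account>.dkr.ecr.<region>.amazonaws.com/my-app:latest
```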
The Problem: When "Hope" mode is active, nelm goes silent. It waits for the timeout (e.g., 20m), but it doesn't stream Pod logs or Kubernetes events (such as the pull errors) during this period. If it eventually times out, I get a generic `context timed out` error without any logs or details of what was happening during those 20 minutes.
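During this silent window, the only place I can see the pull errors is out-of-band via plain kubectl (namespace and pod name below are illustrative):

```shell
# Watch Kubernetes events (including ImagePullBackOff reasons)
# while nelm waits silently.
kubectl get events -n my-namespace --watch

# Inspect the image pull status of a specific pod.
kubectl describe pod my-app-5d4c7b9f6-abcde -n my-namespace
```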
I understand that I need to fix the rate limits on the infrastructure side, but I want to understand how to handle this correctly in nelm.
Questions:
1. Is there a way to increase the grace period before nelm decides to fail fast, without switching to the silent "Hope" mode? (See the sketch after this list.)
2. How can I force nelm to keep showing logs and events while it is in `HopeUntilEndOfDeployProcess` mode?
3. What is the recommended way to handle resources that are "flaky" at start-up but definitely healthy after a short retry period?
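To illustrate what I mean in question 1: a minimal sketch, assuming the `werf.io/failures-allowed-per-replica` annotation from werf also applies here (I have not verified that it covers the image-pull window):

```yaml
# Sketch only: assumes werf.io/failures-allowed-per-replica is honored
# in this scenario; the value is illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    # Tolerate more transient failures per replica before the whole
    # deploy process is failed.
    werf.io/failures-allowed-per-replica: "5"
```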
Desired behavior: I want nelm to be "patient" (wait longer than 2 minutes for a pull), but stay "talkative" (show me that the pull is currently failing and why).
Solution (if you have one)
No response
Additional information
No response