Observability gap when using fail-mode: HopeUntilEndOfDeployProcess during temporary infra flakes (ECR limits) #517

@ashuraits

Description

Before proceeding

  • I didn't find a similar issue

Problem

When a new EKS node starts, multiple pods try to pull heavy images from it simultaneously. This triggers the AWS ECR QPS rate limit.

The behavior I observe:

  • Without HopeUntilEndOfDeployProcess, nelm fails early (within 2-5 minutes) because it sees ImagePullBackOff.
  • If I wait a little longer (seconds to a few minutes), the images are pulled successfully as Kubernetes retries.

To handle this, I tried using werf.io/fail-mode: HopeUntilEndOfDeployProcess.
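
For illustration, the annotation sits in the resource's metadata roughly as in the minimal sketch below. Only the `werf.io/fail-mode` key and its value come from this issue; the Deployment name, labels, and image URI are hypothetical placeholders, not the actual manifest.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app  # hypothetical name
  annotations:
    # Do not fail the deploy on the first tracked error;
    # keep waiting until the deploy process ends or times out.
    werf.io/fail-mode: HopeUntilEndOfDeployProcess
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          # Placeholder ECR image URI; a large image like this is what hits the pull rate limit.
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:1.0.0
```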

The problem: when "Hope" mode is active, nelm becomes silent. It waits for the timeout (e.g., 20m) but doesn't stream Pod logs or Kubernetes events (such as the pull errors) during that period. If it eventually times out, I get a generic "context timed out" error with no logs or details about what happened during those 20 minutes.

I understand that I need to fix the rate limits on the infrastructure side, but I want to understand how to handle this correctly in nelm.

Questions:

  1. Is there a way to increase the "grace period" before nelm decides to fail fast, without switching to the silent "Hope" mode?
  2. How can I force nelm to keep showing logs and events while it is in HopeUntilEndOfDeployProcess mode?
  3. What is the recommended way to handle resources that are "flaky" at the start but definitely healthy after a short retry period?

Desired behavior: I want nelm to be "patient" (wait longer than 2 minutes for a pull) but stay "talkative" (show me that it is currently failing and why).

Solution (if you have one)

No response

Additional information

No response
