
Conversation

Redexing
Contributor

@Redexing Redexing commented May 28, 2025

When a probe's target endpoints change, the scheduler stops the probes for obsolete endpoints by calling cancelFunc on their lifecycle context. Each probe run's context is chained from that same lifecycle context, so in-progress runs are cancelled as well (and counted as failures).

To fix this, use a different context for the probe run, still with the intended timeout.

Alternative solution: make sure that the target refresh schedule and the probe run schedule are well separated from each other. This becomes tricky, though, as probe frequency and the number of targets increase.

@manugarg
Contributor

manugarg commented Sep 8, 2025

Hi Bin,

I finally got a chance to review this in more detail. I think the overall idea makes sense, but I'd like to change the implementation a bit so that we at least use the probe context instead of the background context.

  1. Rename probeCtx here to probeTargetCtx:

    probeCtx, cancelF := context.WithCancel(ctx)

  2. Pass both the probe-level context (ctx) and the target-level context (probeTargetCtx) to startProbeForTarget. Inside startProbeForTarget, name the target-level context simply ctx and the probe-level context probeCtx.

  3. At line 157 (current change), use either ctx or probeCtx based on a Scheduler option. Scheduler can have an additional field called UseProbeLevelContext; add a comment on the field explaining how the two behaviors differ.

What do you think?

@manugarg
Contributor

I think it still makes sense to do this, but #1147 might be a sufficient solution for the issues we are seeing. So it's not that urgent anymore.

@Redexing
Contributor Author

Redexing commented Sep 18, 2025

From an observability perspective, not reporting cancelled probes as failed probes seems to be the right behavior (I also added a comment in #1147 on whether we want to explicitly test for cancellation in such cases). Overall I like this idea.

For this PR: in some sense, it's orthogonal to #1147, as the decision we are trying to crystallize here is: when the targets map gets updated, what should happen to outstanding probes on removed targets? If we were starting from scratch, we could probably offer either semantics, i.e. 1) let them finish, or 2) cancel them immediately. Given that 2) has been the behavior, and provided the observability issue is solved (i.e. no "false positive" failed probes in such cases), I lean towards keeping the current behavior and closing this PR.

Another benefit of 2): the owner who updated the targets map may have inside knowledge that the removed target will no longer work. If we chose 1) in such a case, letting the probe finish would generate another form of "false positive" failed probe.
