let probes running with in-tree scheduler run with different context from probe lifecycle context #1086

Redexing · 2025-05-28T23:37:28Z

When a probe's target endpoints change, the scheduler stops the probes for obsolete endpoints by calling cancelFunc on their lifecycle context. The context for each probe run is chained from that same lifecycle context, so in-progress probe runs get cancelled too (and are counted as failures).

Use a different context, but still with the intended timeout, for the probe run to fix this issue.
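A minimal Go sketch of the behavior described above, not the actual change in this PR. The names runProbe and probeOnce are hypothetical, the durations are arbitrary, and context.Background() is assumed as the detached parent:

```go
package main

import (
	"context"
	"time"
)

// runProbe is a hypothetical stand-in for one probe run: it returns nil once
// `work` has elapsed, or the context error if ctx ends first.
func runProbe(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// probeOnce starts one probe run and cancels the lifecycle context while the
// run is still in flight, as a target refresh would for a removed endpoint.
// With detached=false the run context is chained from the lifecycle context;
// with detached=true it is derived from context.Background() but keeps the
// same timeout.
func probeOnce(detached bool) error {
	lifecycleCtx, stopTarget := context.WithCancel(context.Background())
	defer stopTarget()

	parent := lifecycleCtx
	if detached {
		parent = context.Background()
	}
	runCtx, cancel := context.WithTimeout(parent, time.Second)
	defer cancel()

	// Simulate the scheduler stopping this target 10ms into the run.
	time.AfterFunc(10*time.Millisecond, stopTarget)

	return runProbe(runCtx, 50*time.Millisecond)
}
```

Here probeOnce(false) returns context.Canceled because the run inherits the lifecycle cancellation, while probeOnce(true) lets the run finish and returns nil. In both cases the one-second timeout still bounds the run.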

Alternative solution: make sure that the target refresh schedule and the probe run schedule are "well spaced/separated" from each other; but this becomes tricky as the probe frequency and the number of targets increase.

manugarg · 2025-09-08T02:25:58Z

Hi Bin,

Finally, got a chance to review this in more detail. I think the overall idea makes sense. But I'd like to change the implementation a bit so that we at least use the probe context instead of background context.

1. Rename probeCtx here to probeTargetCtx (cloudprober/probes/common/sched/sched.go, line 231 in 403ab84):

   probeCtx, cancelF := context.WithCancel(ctx)

2. Pass both the probe-level context ctx and the target-level context probeTargetCtx to startProbeForTarget. In startProbeForTarget, take the target-level context as simply ctx and the probe-level context as probeCtx.
3. At line 157 (current change), have it use ctx or probeCtx based on a Scheduler option. Scheduler can have an additional field called UseProbeLevelContext; in its comment, add an explanation of how it is different.
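The suggestion above could look roughly like the following Go sketch. This is a simplified, hypothetical stand-in rather than the real sched.go: only the names probeCtx, probeTargetCtx, cancelF and UseProbeLevelContext come from this thread, while the Timeout and RunProbe fields and the runOnce and simulateTargetRemoval functions are assumptions made for illustration.

```go
package main

import (
	"context"
	"time"
)

// Scheduler is a simplified, hypothetical stand-in for the in-tree scheduler.
type Scheduler struct {
	// UseProbeLevelContext, if true, derives each probe run's context from
	// the probe-level context instead of the target-level one. A run in
	// flight then survives removal of its target, but still stops when the
	// probe itself is stopped.
	UseProbeLevelContext bool

	// Timeout and RunProbe are assumed fields for this sketch.
	Timeout  time.Duration
	RunProbe func(ctx context.Context) error
}

// runOnce performs a single probe run for one target. ctx is the
// target-level context and probeCtx is the probe-level context.
func (s *Scheduler) runOnce(ctx, probeCtx context.Context) error {
	parent := ctx
	if s.UseProbeLevelContext {
		parent = probeCtx
	}
	runCtx, cancel := context.WithTimeout(parent, s.Timeout)
	defer cancel()
	return s.RunProbe(runCtx)
}

// simulateTargetRemoval cancels the target-level context while a run is in
// flight and returns that run's result.
func simulateTargetRemoval(useProbeLevel bool) error {
	probeCtx, stopProbe := context.WithCancel(context.Background())
	defer stopProbe()
	probeTargetCtx, cancelF := context.WithCancel(probeCtx)
	defer cancelF()

	s := &Scheduler{
		UseProbeLevelContext: useProbeLevel,
		Timeout:              time.Second,
		RunProbe: func(ctx context.Context) error {
			select {
			case <-time.After(50 * time.Millisecond):
				return nil
			case <-ctx.Done():
				return ctx.Err()
			}
		},
	}

	// The target is removed 10ms into the run.
	time.AfterFunc(10*time.Millisecond, cancelF)
	return s.runOnce(probeTargetCtx, probeCtx)
}
```

With the option off, simulateTargetRemoval(false) returns a cancellation error, matching the current behavior. With it on, simulateTargetRemoval(true) returns nil because the run is tied only to the probe-level context. Unlike a background context, the probe-level context still lets stopping the whole probe cancel outstanding runs.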

What do you think?

manugarg · 2025-09-18T07:12:15Z

I think it still makes sense to do this, but #1147 might be a sufficient solution for the issues we are seeing. So it's not that urgent anymore.

Redexing · 2025-09-18T17:20:29Z

From an observability perspective, not reporting cancelled probes as failed probes seems to be the right behavior (I also added a comment in #1147 on whether we want to explicitly test for cancellation in such cases). Overall I like this idea.
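One way to test explicitly for cancellation is sketched below in Go. This is not the implementation in #1147; the function names and the three outcome labels are hypothetical, and treating a timeout as a genuine failure is an assumption of this sketch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// classify maps a probe run's error to a reporting outcome. A run whose
// context was cancelled is reported separately instead of as a failure.
func classify(err error) string {
	switch {
	case err == nil:
		return "success"
	case errors.Is(err, context.Canceled):
		return "cancelled"
	default:
		// Includes context.DeadlineExceeded: a timeout counts as a failure here.
		return "failure"
	}
}

// cancelledRunError returns the kind of wrapped error a probe run might
// produce after its context has been cancelled.
func cancelledRunError() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return fmt.Errorf("probe run: %w", ctx.Err())
}

// timedOutRunError returns a wrapped timeout error for comparison.
func timedOutRunError() error {
	return fmt.Errorf("probe run: %w", context.DeadlineExceeded)
}
```

Using errors.Is rather than a direct comparison keeps the check working when the probe wraps the context error with extra detail, as cancelledRunError does.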

For this PR: in some sense it's orthogonal to #1147, as the decision we are trying to crystallize here is: when the targets map gets updated, what should happen to outstanding probes on removed targets? If we were starting from scratch, we could probably offer either semantics, i.e. 1) let them finish, or 2) cancel them immediately. Given that 2) has been the behavior, and if we have solved the observability issue (i.e. no "false positive" failed probes in such cases), I lean towards keeping the current behavior and closing this PR.

Another benefit of 2): the owner who updated the targets map may have insider knowledge that the removed target will no longer work, so choosing 1) in such a case would generate another form of "false positive" failed probes.

Redexing and others added 2 commits May 28, 2025 16:26

let probes running with intree scheduler run with detached context

7ff8274

Merge branch 'main' into tweak_cancel

bccf821

kathleenmotley49-maker approved these changes Sep 3, 2025
