feat(outputs.opensearch): Implement startup-error-behavior options#18784
Obeyed wants to merge 10 commits into
When Telegraf starts locally, the remote OpenSearch service may not be reachable. In cases where the network is occasionally unavailable, this shouldn't prevent Telegraf from starting. Instead, allow Telegraf to start collecting and potentially buffering metrics to send when possible.
srebhan left a comment:
Thanks for your contribution @Obeyed!
Instead of unconditionally ignoring the error, I suggest implementing Telegraf's startup-error-behavior spec for the plugin, i.e. you need to return a StartupError with the Retry flag set. This allows the user to specify what should happen if the connection cannot be established.
Thanks, @srebhan. Appreciate the pointer! Let me know if my latest approach is as expected.
Co-authored-by: Sven Rebhan <[email protected]>
"unnecessaryDefer: defer model.Close() is placed just before return"
@skartikey, thanks for the guidance! I need some help understanding the failing tests. Do you have any pointers on how to resolve the following? The
On the
o.Log.Errorf("error creating OpenSearch client: %v", err)
}

_, err = o.osClient.Ping()
Ping() is called without the context, which is what's hanging CI.
ctx, cancel := context.WithTimeout(context.Background(), time.Duration(o.Timeout))
defer cancel()
...
_, err = o.osClient.Ping()
The ctx with the 5s timeout is never passed to Ping(), so the opensearch-go client performs the ping without any deadline.
The new TestConnectionIssueAtStartup uses an unstarted httptest.Server whose listener accepts TCP connections but never reads them, so the round trip blocks forever.
Reproduced locally:
panic: test timed out after 30s
opensearchapi.PingRequest.Do(...) at opensearch.go:138
That's the same root cause as the four failing test-go-* CI jobs (Too long with no output (exceeded 10m0s)).
Suggested fix:
_, err = o.osClient.Ping(o.osClient.Ping.WithContext(ctx))
Ping.WithContext is provided by opensearchapi/api.ping.go.
This also makes the retry semantics meaningful at runtime. Without a deadline, each retry attempt can hang indefinitely against a half-open peer, which defeats the purpose of this PR.
Summary
Telegraf fails on start if there's no network path to the OpenSearch output service. In scenarios where the local device occasionally has no internet, this shouldn't hard-fail Telegraf's start sequence. Telegraf can start collecting and buffering metrics to send to OpenSearch when possible.
The current implementation works fine if the connection to OpenSearch is available on boot; in that case the buffering of metrics works as expected.
Checklist
Related issues
resolves #18783