feat: support backoff/retry in OTLP #3126
Conversation
Codecov Report

@@            Coverage Diff            @@
##             main    #3126     +/-  ##
=========================================
+ Coverage    79.6%    80.7%    +1.1%
=========================================
  Files         124      128       +4
  Lines       23174    22850     -324
=========================================
+ Hits        18456    18460       +4
+ Misses       4718     4390     -328
@@ -35,6 +35,7 @@ tracing = {workspace = true, optional = true}
 prost = { workspace = true, optional = true }
 tonic = { workspace = true, optional = true }
+tonic-types = { workspace = true, optional = true }
Needed for gRPC error type
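For context, a hedged sketch of what tonic-types buys here (the `StatusExt` trait and `RetryInfo` type are from the tonic-types crate; the surrounding helper function is hypothetical, not the PR's code): it exposes the rich gRPC error details, including the server's `RetryInfo` throttling hint.

```rust
use std::time::Duration;
use tonic::Status;
use tonic_types::StatusExt; // extension trait from the tonic-types crate

// Hypothetical helper: pull the server-provided retry delay out of a gRPC
// error, if the server attached a RetryInfo detail to the Status.
fn throttle_hint(status: &Status) -> Option<Duration> {
    status
        .get_details_retry_info() // provided by tonic_types::StatusExt
        .and_then(|info| info.retry_delay)
}
```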
    .map_err(|e| OTelSdkError::InternalFailure(e.message))
}

#[cfg(not(feature = "http-retry"))]
This is a massive duplication of code.
If we decide that http export always has retry behaviour, which means we include the unstable runtime feature, then we can remove all of this.
Alternatively we can provide export once, with all the extra pomp and fanfare to support retry, and then just use the stub "don't actually retry" impl. There would be some slight runtime overhead to this, but the codebase would be much simpler.
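A minimal sketch of that stub idea (the function name and feature flag follow the surrounding discussion; the body is an assumption, not the PR's code): the no-retry build keeps the same signature as the real helper, so callers compile unchanged either way.

```rust
// Hypothetical no-retry stand-in: same shape as the real retry helper.
#[cfg(not(feature = "http-retry"))]
async fn retry_with_backoff<F, Fut, T, E>(mut operation: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    // No async runtime to sleep on, so attempt exactly once and return.
    operation().await
}
```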
Comment above applies to the other HTTP exporters also.
Wouldn't it be possible to use the same code for the logic (broken out into a method) and construct the HttpError instances, but just have different variants of retry_with_backoff and the HttpError classification depending on whether we have the unstable runtime feature enabled?
@bantonsson something like this commit? --> ba1f9e2

I've factored the retry behaviour out into otlp::exporter::http, and then switched out the implementation based on the http-retry flag. This removes huge volumes of code duplication and means that we could get this in without stabilising the async runtime feature.

> and the HttpError classification depending on whether we have the unstable runtime feature enabled?

I'm not sure the classification needs to change independently of the "do we use the retry variant or not" bit - but I might be missing something?
I've also not touched this for the tonic exporters yet, but once we have the HTTP ones in shape I can continue the pattern over there.
> I'm not sure the classification needs to change independently of the "do we use the retry variant or not" bit - but I might be missing something?

You're not missing anything. That refactoring is cleaner than what I was thinking about.
Because of the way the tonic exporters are structured, it's harder to cleanly pull it apart. I've struck what I think is a sensible compromise over there, adding a wrapper for the retry function that subs in retry/no-op based on feature flags. I think this is better, in terms of clarity, than going too wild with traits/generics/lambdas to squeeze out the last little bit of code re-use.
Pull Request Overview
This PR implements retry logic with exponential backoff and jitter for OTLP exporters to handle transient failures gracefully, addressing issue #3081. The implementation supports both HTTP and gRPC protocols with protocol-specific error classification and server-provided throttling hints.
- Adds a new retry module to opentelemetry-sdk with configurable retry policies and exponential backoff
- Implements protocol-specific error classification in opentelemetry-otlp for HTTP and gRPC responses
- Integrates retry functionality into all OTLP exporters (traces, metrics, logs) for both HTTP and gRPC transports
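The exponential backoff with jitter mentioned above, as a minimal sketch (an illustration of the general technique, not the PR's actual implementation; all names here are hypothetical):

```rust
use rand::Rng;
use std::time::Duration;

// Exponential backoff with jitter: the base delay doubles per attempt up to
// a cap, then a random factor spreads retries out so clients don't stampede.
fn backoff_delay(attempt: u32, base_ms: u64, max_ms: u64) -> Duration {
    let exp = base_ms
        .saturating_mul(1u64 << attempt.min(16)) // 2^attempt, clamped
        .min(max_ms);
    // Pick uniformly in [exp/2, exp] so concurrent clients desynchronise.
    let jittered = rand::thread_rng().gen_range(exp / 2..=exp);
    Duration::from_millis(jittered)
}
```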
Reviewed Changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| opentelemetry-sdk/src/retry.rs | Core retry module with exponential backoff, jitter, and error classification |
| opentelemetry-otlp/src/retry_classification.rs | Protocol-specific error classification for HTTP and gRPC responses |
| opentelemetry-otlp/src/exporter/tonic/*.rs | gRPC exporter integration with retry functionality |
| opentelemetry-otlp/src/exporter/http/*.rs | HTTP exporter integration with retry functionality |
| opentelemetry-otlp/Cargo.toml | Feature flags and dependencies for retry support |
fn parse_http_date_to_delay(date_str: &str) -> Result<u64, ()> {
    // For now, return error - would need proper HTTP date parsing
    // This could be implemented with chrono or similar
    let _ = date_str;
    Err(())
}
The HTTP date parsing function is a stub that always returns an error. This means Retry-After headers with HTTP date format will be ignored, falling back to retryable behavior instead of respecting server-specified delays.
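One way the stub could be completed (a sketch, not the PR's code: it assumes pulling in the httpdate crate, which parses RFC 7231 HTTP-date strings):

```rust
use std::time::SystemTime;

// Sketch: parse an HTTP-date Retry-After value and return the number of
// seconds to wait. Assumes `httpdate` is added as a dependency.
fn parse_http_date_to_delay(date_str: &str) -> Result<u64, ()> {
    let when = httpdate::parse_http_date(date_str).map_err(|_| ())?;
    when.duration_since(SystemTime::now())
        .map(|d| d.as_secs())
        .map_err(|_| ()) // the date is already in the past: nothing to wait for
}
```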
I think the HTTP exporters look good now. Love all those red lines.
feature = "experimental-grpc-retry", | ||
any(feature = "trace", feature = "metrics", feature = "logs") | ||
))] | ||
async fn tonic_retry_with_backoff<R, F, Fut, T>( |
Because of the way the tonic path works, it is much harder to cleanly extract as much from the signal-specific exporters as we managed with HTTP.
I've done this as a kind of middle ground: this way the feature flagging we use to sub between retry-available / no-retry is all in one place, as a thin wrapper around the global retry function retry_with_backoff.
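The shape of that wrapper, sketched with assumed bodies (only the tonic_retry_with_backoff name and feature flag come from the diff above; the simplified generics and the call into the SDK helper are illustrations, not the PR's literal code):

```rust
// Feature on: delegate to the SDK's retry helper (shown schematically).
#[cfg(feature = "experimental-grpc-retry")]
async fn tonic_retry_with_backoff<F, Fut, T, E>(operation: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    // retry_with_backoff is the opentelemetry-sdk helper described in this
    // PR; its real argument list also carries a policy and a classifier.
    retry_with_backoff(operation).await
}

// Feature off: same signature, single attempt, no runtime dependency.
#[cfg(not(feature = "experimental-grpc-retry"))]
async fn tonic_retry_with_backoff<F, Fut, T, E>(mut operation: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    operation().await
}
```

The signal-specific exporters then always call tonic_retry_with_backoff, and the feature flag decides which body they get.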
I keep trying to slice and dice this in some nice way, but the generated clients do not implement a common interface, so it's complicated to do common code without changing them.
Yeah, I know what you mean. I tried to save a few lines by forcing some intermediate abstractions, but it didn't reduce the line count and certainly didn't improve clarity.
I feel like the way it is strikes a reasonable balance even if it isn't particularly satisfying.
👍🏼 for the HTTP code. I can't see a clear way to reuse more of the Tonic code.
Sorry for the delay. I would like to review during this week - assigning to myself.
Fixes #3081, building on the work started by @AaronRM 🤝
Changes
A new retry module added to opentelemetry-sdk

Models the sorts of retry an operation may request (retry / can't retry / throttle), and provides a retry_with_backoff helper that can be used to wrap a retryable operation and retry it. The helper relies on experimental_async_runtime for its runtime abstraction, which provides the actual pausing. It also takes a lambda to classify the error, so the caller can tell the retry mechanism whether a retry is required.

A new retry_classification module added to opentelemetry-otlp

This takes the actual error responses we get back over OTLP and maps them onto the retry model. Because this is OTLP-specific it belongs here rather than alongside the retry code.
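That model, as a hedged sketch (the three decision kinds come from the description above; the enum and function names are assumptions, and the retryable status codes follow the OTLP/HTTP spec's guidance):

```rust
use std::time::Duration;

// The three outcomes a classifier can report for a failed export attempt.
enum RetryDecision {
    Retryable,           // transient: back off and try again
    NotRetryable,        // permanent: give up immediately
    Throttled(Duration), // server-specified wait, e.g. from Retry-After
}

// HTTP-side classification: the OTLP spec lists these statuses as retryable.
fn classify_http_status(status: u16, retry_after: Option<Duration>) -> RetryDecision {
    match status {
        429 | 502 | 503 | 504 => match retry_after {
            Some(delay) => RetryDecision::Throttled(delay), // honour the hint
            None => RetryDecision::Retryable,
        },
        _ => RetryDecision::NotRetryable,
    }
}
```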
Retry binding
... happens in each one of the concrete exporters to tie it all together.
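Tying it together, a toy version of that binding (reusing the hypothetical RetryDecision and backoff_delay from the sketches above; tokio::time::sleep stands in for the experimental_async_runtime abstraction, and all names are assumptions rather than the PR's code):

```rust
// Generic retry loop: run one attempt, ask the classifier what to do on
// failure, and sleep before the next attempt.
async fn retry_with_backoff<T, E, F, Fut, C>(
    max_attempts: u32,
    classify: C,
    mut operation: F,
) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
    C: Fn(&E) -> RetryDecision,
{
    let mut attempt = 0;
    loop {
        match operation().await {
            Ok(value) => return Ok(value),
            Err(err) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(err); // out of attempts
                }
                let delay = match classify(&err) {
                    RetryDecision::Throttled(d) => d, // server's hint wins
                    RetryDecision::Retryable => backoff_delay(attempt, 100, 5_000),
                    RetryDecision::NotRetryable => return Err(err),
                };
                tokio::time::sleep(delay).await; // runtime-abstraction stand-in
            }
        }
    }
}
```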
Also ...
Open Questions
Merge requirement checklist
- CHANGELOG.md files updated for non-trivial, user-facing changes