Conversation

@scottgerring (Member) commented Aug 12, 2025

Fixes #3081, building on the work started by @AaronRM 🤝

Changes

A new retry module added to opentelemetry-sdk

Models the kinds of outcome a failed operation may request (retry / can't retry / throttle), and provides a retry_with_backoff helper that wraps a retryable operation and drives the retries. The helper relies on experimental_async_runtime for its runtime abstraction, which provides the actual pausing between attempts. It also takes a closure to classify the error, so the caller can tell the retry mechanism whether a retry is warranted.
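A rough sketch of the shape this describes, with simplified synchronous signatures (the real helper is async, generic over the runtime, and also applies jitter; the RetryDecision type and parameter names here are illustrative, not the SDK's actual API):

```rust
use std::time::Duration;

// Illustrative retry / can't-retry / throttle model; the SDK's actual type may differ.
pub enum RetryDecision {
    Retry,              // transient failure: back off and try again
    NoRetry,            // permanent failure: give up immediately
    Throttle(Duration), // server asked us to wait at least this long
}

// Simplified, blocking stand-in for the SDK's async retry_with_backoff helper.
pub fn retry_with_backoff<T, E>(
    max_retries: usize,
    initial_delay: Duration,
    mut operation: impl FnMut() -> Result<T, E>,
    classify: impl Fn(&E) -> RetryDecision,
) -> Result<T, E> {
    let mut delay = initial_delay;
    let mut retries_left = max_retries;
    loop {
        match operation() {
            Ok(value) => return Ok(value),
            Err(err) if retries_left == 0 => return Err(err),
            Err(err) => match classify(&err) {
                RetryDecision::NoRetry => return Err(err),
                RetryDecision::Retry => std::thread::sleep(delay),
                // Honour the server's hint, but never sleep less than our own backoff.
                RetryDecision::Throttle(server_delay) => std::thread::sleep(delay.max(server_delay)),
            },
        }
        retries_left -= 1;
        delay = delay.saturating_mul(2); // exponential backoff; jitter omitted here
    }
}
```

In the real module the sleeps go through the experimental_async_runtime abstraction rather than blocking the thread.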

A new retry_classification module added to opentelemetry-otlp

This takes the actual error responses we get back over OTLP and maps them onto the retry model. Because this is OTLP-specific, it belongs here rather than alongside the generic retry code.
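For the HTTP path, a sketch of what such a mapping looks like, assuming the retryable status set from the OTLP/HTTP spec (429, 502, 503, 504) and a Retry-After value already parsed into a Duration; names are illustrative rather than the module's real API:

```rust
use std::time::Duration;

// Same illustrative enum as in the previous sketch, repeated so this stands alone.
pub enum RetryDecision {
    Retry,
    NoRetry,
    Throttle(Duration),
}

// Maps an OTLP/HTTP response status, plus an optional parsed Retry-After delay,
// onto the retry model. The real module also covers the gRPC side.
pub fn classify_http_export_error(status: u16, retry_after: Option<Duration>) -> RetryDecision {
    match status {
        // The server is pushing back; honour an explicit Retry-After if it sent one.
        429 | 503 => retry_after
            .map(RetryDecision::Throttle)
            .unwrap_or(RetryDecision::Retry),
        // Transient gateway/timeout failures: plain retry with backoff.
        502 | 504 => RetryDecision::Retry,
        // Everything else (e.g. 400, 401, 404) is treated as non-retryable.
        _ => RetryDecision::NoRetry,
    }
}
```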

Retry binding

... happens in each one of the concrete exporters to tie it all together.

Also ...

  • Extended the exporter builders to allow the user to customise the default retry policy
  • Added new feature flags experimental-http-retry and experimental-grpc-retry, which pull in the experimental-async-runtime dependency and wire everything up. This way we can get going with this now without having to stabilise the experimental-async-runtime feature. (A rough usage sketch follows below.)
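Roughly what opting in could look like. The feature-flag names come from this PR; the builder call pattern assumes the current opentelemetry-otlp API, and the retry-policy hook is shown only as a hypothetical, commented-out call since its final name and shape are part of the API review:

```rust
// Cargo.toml (sketch):
//   opentelemetry-otlp = { version = "*", features = ["http-proto", "experimental-http-retry"] }

use opentelemetry_otlp::SpanExporter;

fn build_exporter() -> Result<SpanExporter, Box<dyn std::error::Error>> {
    let exporter = SpanExporter::builder()
        .with_http()
        // Hypothetical builder hook for overriding the default retry policy:
        // .with_retry_policy(RetryPolicy { max_retries: 5, ..Default::default() })
        .build()?;
    Ok(exporter)
}
```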

Open Questions

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

codecov bot commented Aug 12, 2025

Codecov Report

❌ Patch coverage is 70.36753% with 774 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.7%. Comparing base (ad88615) to head (27d303a).
⚠️ Report is 205 commits behind head on main.

Files with missing lines | Patch % | Lines
opentelemetry-otlp/src/exporter/http/mod.rs | 67.0% | 161 Missing ⚠️
opentelemetry-sdk/src/metrics/data/mod.rs | 13.4% | 154 Missing ⚠️
opentelemetry-proto/src/transform/metrics.rs | 11.1% | 64 Missing ⚠️
...-sdk/src/metrics/internal/exponential_histogram.rs | 65.1% | 52 Missing ⚠️
opentelemetry-otlp/src/exporter/tonic/metrics.rs | 0.0% | 50 Missing ⚠️
opentelemetry-otlp/src/exporter/tonic/trace.rs | 0.0% | 48 Missing ⚠️
opentelemetry-otlp/src/exporter/tonic/logs.rs | 0.0% | 46 Missing ⚠️
opentelemetry-otlp/src/exporter/tonic/mod.rs | 73.4% | 42 Missing ⚠️
opentelemetry-sdk/src/metrics/instrument.rs | 88.9% | 29 Missing ⚠️
opentelemetry-sdk/src/logs/logger_provider.rs | 92.0% | 12 Missing ⚠️
... and 29 more
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #3126     +/-   ##
=======================================
+ Coverage   79.6%   80.7%   +1.1%     
=======================================
  Files        124     128      +4     
  Lines      23174   22850    -324     
=======================================
+ Hits       18456   18460      +4     
+ Misses      4718    4390    -328     


@scottgerring scottgerring force-pushed the feat/retry-logic branch 4 times, most recently from 3847b26 to fb141db Compare August 12, 2025 10:22
@@ -35,6 +35,7 @@ tracing = {workspace = true, optional = true}

prost = { workspace = true, optional = true }
tonic = { workspace = true, optional = true }
tonic-types = { workspace = true, optional = true }
Member Author (scottgerring):
Needed for gRPC error type
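For context, the gRPC half of the classification maps tonic status codes, plus any server-supplied retry hint pulled out of the status details (which is where tonic-types comes in), onto the same retry model. A sketch assuming the retryable code set from the OTLP/gRPC spec; the names and the exact set in this PR may differ:

```rust
use std::time::Duration;
use tonic::Code;

// Illustrative retry model, matching the sketch in the PR description.
pub enum RetryDecision {
    Retry,
    NoRetry,
    Throttle(Duration),
}

// Maps a gRPC status code, plus an optional server-provided RetryInfo delay,
// onto the retry model.
pub fn classify_tonic_status(code: Code, retry_hint: Option<Duration>) -> RetryDecision {
    match code {
        // Retryable only when the server supplies a RetryInfo hint;
        // otherwise treat the exhaustion as permanent.
        Code::ResourceExhausted => match retry_hint {
            Some(delay) => RetryDecision::Throttle(delay),
            None => RetryDecision::NoRetry,
        },
        // Transient conditions the OTLP spec lists as retryable.
        Code::Cancelled
        | Code::DeadlineExceeded
        | Code::Aborted
        | Code::OutOfRange
        | Code::Unavailable
        | Code::DataLoss => RetryDecision::Retry,
        // Everything else (InvalidArgument, Unauthenticated, ...) is permanent.
        _ => RetryDecision::NoRetry,
    }
}
```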

.map_err(|e| OTelSdkError::InternalFailure(e.message))
}

#[cfg(not(feature = "http-retry"))]
Member Author (scottgerring):
This is a massive duplication of code.

If we decide that HTTP export always has retry behaviour (which means always including the unstable runtime feature), then we can remove all of this.

Alternatively, we can provide export once, with all the extra pomp and fanfare to support retry, and just use the stub "don't actually retry" impl (sketched below). There would be some slight runtime overhead to this, but the codebase would be much simpler.
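A minimal sketch of that stub, with a simplified synchronous signature so it stands alone (the real helper is async and generic over the runtime):

```rust
// With the retry feature off, a stand-in that matches the shape of retry_with_backoff
// simply runs the operation once, so the calling exporter code is identical either way.
#[cfg(not(feature = "experimental-http-retry"))]
pub fn retry_with_backoff<T, E, D>(
    _max_retries: usize,
    _initial_delay: std::time::Duration,
    mut operation: impl FnMut() -> Result<T, E>,
    _classify: impl Fn(&E) -> D,
) -> Result<T, E> {
    // No backoff, no classification: one attempt, pass the result straight through.
    operation()
}
```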

Member Author (scottgerring):
Comment above applies to the other HTTP exporters also.

Contributor:
Wouldn't it be possible to use the same code for the logic (broken out into a method) and for constructing the HttpError instances, but just have different variants of retry_with_backoff and the HttpError classification depending on whether the unstable runtime feature is enabled?

Member Author (scottgerring) · Sep 2, 2025:
@bantonsson something like this commit? --> ba1f9e2

I've factored the retry behaviour out into otlp::exporter::http, and I switch out the implementation based on the retry feature flag. This removes a huge volume of duplicated code and means that we could get this in without stabilising the async runtime feature.

> and the HttpError classification depending on whether the unstable runtime feature is enabled?

I'm not sure the classification needs to change independently of the "do we use the retry variant or not" bit - but I might be missing something?

Member Author (scottgerring):
I've also not touched this for the tonic exporters yet, but once we have the HTTP ones in shape I can continue the pattern over there.

Contributor:
> I'm not sure the classification needs to change independently of the "do we use the retry variant or not" bit - but I might be missing something?

You're not missing anything. That refactoring is cleaner than what I was thinking about.

Member Author (scottgerring):
Because of the way the tonic exporters are structured, it's harder to cleanly pull them apart. I've struck what I think is a sensible compromise over there, adding a wrapper for the retry function that subs in retry/no-op based on feature flags. In terms of clarity, I think this is better than going too wild with traits/generics/lambdas to squeeze out the last little bit of code re-use.

@scottgerring scottgerring changed the title [not ready!] feat: support backoff/retry feat: support backoff/retry in OTLP Aug 12, 2025
@scottgerring scottgerring marked this pull request as ready for review August 19, 2025 14:32
@scottgerring scottgerring requested a review from a team as a code owner August 19, 2025 14:32
@lalitb lalitb requested a review from Copilot September 1, 2025 18:50
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements retry logic with exponential backoff and jitter for OTLP exporters to handle transient failures gracefully, addressing issue #3081. The implementation supports both HTTP and gRPC protocols with protocol-specific error classification and server-provided throttling hints.

  • Adds a new retry module to opentelemetry-sdk with configurable retry policies and exponential backoff
  • Implements protocol-specific error classification in opentelemetry-otlp for HTTP and gRPC responses
  • Integrates retry functionality into all OTLP exporters (traces, metrics, logs) for both HTTP and gRPC transports

Reviewed Changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

File | Description
opentelemetry-sdk/src/retry.rs | Core retry module with exponential backoff, jitter, and error classification
opentelemetry-otlp/src/retry_classification.rs | Protocol-specific error classification for HTTP and gRPC responses
opentelemetry-otlp/src/exporter/tonic/*.rs | gRPC exporter integration with retry functionality
opentelemetry-otlp/src/exporter/http/*.rs | HTTP exporter integration with retry functionality
opentelemetry-otlp/Cargo.toml | Feature flags and dependencies for retry support


Comment on lines +82 to +87
fn parse_http_date_to_delay(date_str: &str) -> Result<u64, ()> {
// For now, return error - would need proper HTTP date parsing
// This could be implemented with chrono or similar
let _ = date_str;
Err(())
}
Copilot AI · Sep 1, 2025:
The HTTP date parsing function is a stub that always returns an error. This means Retry-After headers with HTTP date format will be ignored, falling back to retryable behavior instead of respecting server-specified delays.

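One way to fill in that stub, assuming a dependency on the httpdate crate would be acceptable (a sketch of the idea, not what this PR ships):

```rust
use std::time::SystemTime;

// Parses an HTTP-date Retry-After value (e.g. "Wed, 21 Oct 2015 07:28:00 GMT")
// and converts it into a delay relative to now, in seconds.
fn parse_http_date_to_delay(date_str: &str) -> Result<u64, ()> {
    let retry_at = httpdate::parse_http_date(date_str).map_err(|_| ())?;
    match retry_at.duration_since(SystemTime::now()) {
        Ok(delay) => Ok(delay.as_secs()),
        // A date in the past means the client may retry immediately.
        Err(_) => Ok(0),
    }
}
```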

@scottgerring scottgerring force-pushed the feat/retry-logic branch 2 times, most recently from af933a2 to f1636a0 Compare September 2, 2025 09:11
@bantonsson (Contributor) left a comment:
I think the HTTP exporters look good now. Love all those red lines.

feature = "experimental-grpc-retry",
any(feature = "trace", feature = "metrics", feature = "logs")
))]
async fn tonic_retry_with_backoff<R, F, Fut, T>(
Member Author (scottgerring):
Because of the way the tonic path works, it is much harder to cleanly extract as much from the signal-specific exporters as we managed with HTTP.

I've done this as a kind of middle ground: this way the feature flagging we use to switch between retry-available and no-retry is all in one place, as a thin wrapper around the global retry function retry_with_backoff.

Contributor:
I keep trying to slice and dice this in some nice way, but the generated clients do not implement a common interface, so it's hard to write common code without changing them.

Member Author (scottgerring):
Yeah, I know what you mean. I came up with something that forced in some intermediate abstractions to save a few lines, but it didn't really reduce the line count and certainly didn't improve clarity.

I feel like the way it is strikes a reasonable balance even if it isn't particularly satisfying.

@bantonsson (Contributor) left a comment:
👍🏼 for the HTTP code. I can't see a clear way to reuse more of the Tonic code.

@lalitb lalitb self-assigned this Sep 16, 2025
@lalitb (Member) commented Sep 16, 2025

Sorry for the delay. I would like to review during this week - assigning to myself.

Development

Successfully merging this pull request may close these issues.

OTLP Stabilization: Throttling & Retry
4 participants