Conversation


@flamingbear flamingbear commented Oct 6, 2025

Jira Issue ID

DAS-2427

Description

NSIDC testers were getting 502 and 504 errors during some long-lasting requests. This caused some confusion: they would see an error in their notebook, but then look at the workflow-ui and find that the Harmony service had succeeded (or was still READY or RUNNING).

This PR adds retries for those transient errors.

Here are some questions, since I had some Gemini help because I'm not an expert in this area.

  1. Should all of the error codes in that list actually be retried? I have [429, 500, 502, 503, 504].

    • These were suggested by the LLM.
    • I see conflicting information about 429 (Too Many Requests). I would guess we don't want to retry too soon, because that just adds more requests, but a retry would eventually get the request handled.
    • I see claims that 502 shouldn't be retried, but I think we're getting it for temporary errors, and I can find other suggestions that it should be retried.
  2. Is backoff_factor=1 a good value?

    • It seems long: about 7 seconds (1 + 2 + 4) would elapse across 3 retries. I could make it shorter, or make it shorter and add a few more tries? I just don't really know.
    • I added backoff_jitter because it's described as best practice. (A sketch of how these options fit together follows this list.)
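For concreteness, here is a minimal sketch of how retry settings like these are typically mounted on a requests session via urllib3's Retry. It is illustrative only, not the exact harmony-py code; names like retry_strategy and session are placeholders, and the sleep values in the comments assume urllib3 2.x behaviour.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry_strategy = Retry(
    total=3,
    backoff_factor=1,       # sleeps of roughly 1, 2, 4 seconds between retries
    backoff_jitter=0.5,     # adds up to 0.5 s of random jitter to each sleep
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET"],
    raise_on_status=False,  # return the final response instead of raising
)

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry_strategy))
session.mount("http://", HTTPAdapter(max_retries=retry_strategy))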

Local Test Steps

Build the branch and run the tests.

make install
make lint
make test

Next you can build a distribution and install it to an environment.

make build

That creates a .whl and a .gz distribution. I installed it into my regression test environment and ran the tests against localhost and UAT, just as a sanity check.
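The install step might look something like this (the dist/ path is an assumption about where make build puts its artifacts, not something confirmed in this PR):

pip install dist/*.whl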

Let me know if there are other things to test or update.

PR Acceptance Checklist

  • Acceptance criteria met
  • Tests added/updated (if needed) and passing
  • Documentation updated (if needed)

Comment on lines -1485 to -1490
responses.add(
    responses.GET,
    expected_status_url(job_id),
    status=500,
    json=error
)
Member Author

@flamingbear flamingbear Oct 6, 2025

If you don't use an ordered registry for your responses, the last registered response just keeps being used: it matches over and over again. This confused me when I thought I had to add a separate response for each retry.
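For example, a minimal sketch of a test using an ordered registry, where each registered response is consumed exactly once, in order (expected_status_url and job_id are borrowed from the test code above; the test name and payload are illustrative):

import responses
from responses import registries

@responses.activate(registry=registries.OrderedRegistry)
def test_transient_error_then_success():
    # With OrderedRegistry, the first GET sees a 502 and the retried GET
    # sees a 200, instead of one response matching every call.
    responses.add(responses.GET, expected_status_url(job_id), status=502)
    responses.add(responses.GET, expected_status_url(job_id), status=200, json={'status': 'successful'})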

@flamingbear flamingbear marked this pull request as ready for review October 7, 2025 15:20
@flamingbear
Member Author

I marked this as ready for review, but it's really ready for discussion more than anything.

Comment on lines +171 to +177
total=3,
backoff_factor=1, # Wait 1, 2, 4 seconds between retries
backoff_jitter=0.5,
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["GET"],
raise_on_status=False,

Member Author

I'd love some advice/encouragement on these values.

Member

@owenlittlejohns owenlittlejohns Oct 8, 2025

The choice of error codes makes sense to me. As does the allowed method only being GET.

I love the inclusion of jitter. Not sure I have strong opinions on the values for backoff_factor and backoff_jitter, though. As discussed in Slack: an additional 7 seconds total for the retries seems reasonable in comparison to a timeout (tens of seconds per attempt) or general service performance (again, usually at least tens of seconds).
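For reference, a rough sketch of where that ~7 seconds comes from, assuming urllib3 2.x's backoff of backoff_factor * 2 ** (retry - 1) seconds plus up to backoff_jitter seconds of random jitter before each retry:

import random

backoff_factor, backoff_jitter = 1, 0.5
sleeps = [backoff_factor * 2 ** (n - 1) + random.uniform(0, backoff_jitter) for n in (1, 2, 3)]
print(sum(sleeps))  # roughly 7 to 8.5 seconds of added delay across 3 retries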

Contributor

I think your error code list makes sense as well - I have seen intermittent 500, 502, and 503 failures. I'm also on board with retrying a handful of times for up to roughly 10 seconds: the delay isn't long when the problem is not intermittent, and when it is intermittent the user doesn't get an error.

Member

@owenlittlejohns owenlittlejohns left a comment

This makes sense to me. It seems like better behaviour than getting a 502 error from Client.wait_for_processing while the job is actually fine and it's just the /jobs endpoint that is temporarily inaccessible.

The tests all passed for me locally:

215 passed, 3 warnings in 10.76s

Other information that might be related:


assert len(responses.calls) == 9 # (1 + 4 + 4)

@responses.activate
def test_handle_non_transient_error_no_retry():
Member

Thanks for adding this unit test as discussed!
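For context, a minimal sketch of the idea behind such a test; the helper names (expected_status_url, job_id, error) come from the hunks above, while the client call and exception type are assumptions rather than the merged test verbatim:

import pytest
import responses
from harmony import Client

@responses.activate
def test_handle_non_transient_error_no_retry():
    # 400 is not in status_forcelist, so the client should make exactly
    # one request and surface the error instead of retrying.
    responses.add(responses.GET, expected_status_url(job_id), status=400, json=error)

    with pytest.raises(Exception):
        Client(should_validate_auth=False).wait_for_processing(job_id)

    # One original call, zero retries.
    assert len(responses.calls) == 1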

Contributor

@chris-durbin chris-durbin left a comment

I'm good with the design change. I haven't had a chance to test anything - I'd like someone else to at least verify there are no regressions when running a notebook before merging.


Contributor

@indiejames indiejames left a comment

Looks good to me

@flamingbear flamingbear merged commit 4461bd6 into nasa:main Oct 15, 2025
11 checks passed
@flamingbear flamingbear deleted the mhs/DAS-2427/improve-retries-for-transient-errors branch October 15, 2025 15:05