Conversation


@flamingbear flamingbear commented Oct 6, 2025

Jira Issue ID

DAS-2427

Description

NSIDC testers were getting 502 and 504 errors during some long-lasting requests. This caused some confusion: they would see an error in their notebook, but then look at the workflow-ui and find that the Harmony service had succeeded (or was still READY or RUNNING).

This PR adds retries for those transient errors.

Here are some questions, since I had some Gemini help because I'm not an expert in this area.

  1. Should all of the error codes in that list actually be retried? I have [429, 500, 502, 503, 504].

    • These were suggested by the LLM.
    • I see conflicting information about 429 (Too Many Requests). I would guess we don't want to retry too soon, because that just adds more requests, but a retry would eventually get the request handled.
    • I see claims that 502 shouldn't be retried, but I think we're getting it for temporary errors, and I can find other suggestions that it should be retried.
  2. Is backoff_factor=1 a good value?

    • It seems long: about 7 seconds (1 + 2 + 4) would elapse across 3 retries. I could make it shorter, or make it shorter and add a few more tries? I just don't really know.
    • I added backoff_jitter because it's described as best practice. (A sketch of how these options fit together follows this list.)
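For concreteness, here is a minimal sketch of how retry settings like these are typically mounted on a requests session via urllib3's Retry. It is illustrative only, not the exact harmony-py code; names like retry_strategy and session are placeholders, and the sleep values in the comments assume urllib3 2.x behaviour.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry_strategy = Retry(
    total=3,
    backoff_factor=1,       # sleeps of roughly 1, 2, 4 seconds between retries
    backoff_jitter=0.5,     # adds up to 0.5 s of random jitter to each sleep
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET"],
    raise_on_status=False,  # return the final response instead of raising
)

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry_strategy))
session.mount("http://", HTTPAdapter(max_retries=retry_strategy))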

Local Test Steps

Build the branch and run the tests.

make install
make lint
make test

Next you can build a distribution and install it to an environment.

make build

That creates a .whl and a .gz distribution. I installed it into my regression test environment and ran the tests against localhost and UAT, just as a sanity check.
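The install step might look something like this (the dist/ path is an assumption about where make build puts its artifacts, not something confirmed in this PR):

pip install dist/*.whl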

Let me know if there are other things to test or update.

PR Acceptance Checklist

  • Acceptance criteria met
  • Tests added/updated (if needed) and passing
  • Documentation updated (if needed)

Comment on lines -1485 to -1490
responses.add(
    responses.GET,
    expected_status_url(job_id),
    status=500,
    json=error
)
Member Author

@flamingbear flamingbear Oct 6, 2025

If you don't use an ordered registry for your responses, the last registered response just keeps being used: it matches over and over again. This confused me when I thought I had to add a separate response for each retry.
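For example, a minimal sketch of a test using an ordered registry, where each registered response is consumed exactly once, in order (expected_status_url and job_id are borrowed from the test code above; the test name and payload are illustrative):

import responses
from responses import registries

@responses.activate(registry=registries.OrderedRegistry)
def test_transient_error_then_success():
    # With OrderedRegistry, the first GET sees a 502 and the retried GET
    # sees a 200, instead of one response matching every call.
    responses.add(responses.GET, expected_status_url(job_id), status=502)
    responses.add(responses.GET, expected_status_url(job_id), status=200, json={'status': 'successful'})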

@flamingbear flamingbear marked this pull request as ready for review October 7, 2025 15:20
@flamingbear
Member Author

I marked this as ready for review, but it's really ready for discussion more than anything.

Comment on lines +171 to +177
total=3,
backoff_factor=1, # Wait 1, 2, 4 seconds between retries
backoff_jitter=0.5,
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["GET"],
raise_on_status=False,

Member Author

I'd love some advice/encouragement on these values.

Member

@owenlittlejohns owenlittlejohns Oct 8, 2025

The choice of error codes makes sense to me. As does the allowed method only being GET.

I love the inclusion of jitter. Not sure I have strong opinions on the values for backoff_factor and backoff_jitter, though. As discussed in Slack: an additional 7 seconds total for the retries seems reasonable in comparison to a timeout (tens of seconds per attempt) or general service performance (again, usually at least tens of seconds).
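For reference, a rough sketch of where that ~7 seconds comes from, assuming urllib3 2.x's backoff of backoff_factor * 2 ** (retry - 1) seconds plus up to backoff_jitter seconds of random jitter before each retry:

import random

backoff_factor, backoff_jitter = 1, 0.5
sleeps = [backoff_factor * 2 ** (n - 1) + random.uniform(0, backoff_jitter) for n in (1, 2, 3)]
print(sum(sleeps))  # roughly 7 to 8.5 seconds of added delay across 3 retries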

Contributor

I think your error code list makes sense as well - I have seen intermittent 500, 502, and 503 failures. I'm also on board with retrying a handful of times for up to roughly 10 seconds: the delay isn't long when the problem is not intermittent, and when it is intermittent the user doesn't get an error.

Member

@owenlittlejohns owenlittlejohns left a comment

This makes sense to me. It seems like better behaviour than getting a 502 error from Client.wait_for_processing while the job is actually fine and it's just the /jobs endpoint that is temporarily inaccessible.

The tests all passed for me locally:

215 passed, 3 warnings in 10.76s

Other information that might be related:


assert len(responses.calls) == 9 # (1 + 4 + 4)

@responses.activate
def test_handle_non_transient_error_no_retry():
Member

Thanks for adding this unit test as discussed!
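For context, a minimal sketch of the idea behind such a test; the helper names (expected_status_url, job_id, error) come from the hunks above, while the client call and exception type are assumptions rather than the merged test verbatim:

import pytest
import responses
from harmony import Client

@responses.activate
def test_handle_non_transient_error_no_retry():
    # 400 is not in status_forcelist, so the client should make exactly
    # one request and surface the error instead of retrying.
    responses.add(responses.GET, expected_status_url(job_id), status=400, json=error)

    with pytest.raises(Exception):
        Client(should_validate_auth=False).wait_for_processing(job_id)

    # One original call, zero retries.
    assert len(responses.calls) == 1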

Contributor

@chris-durbin chris-durbin left a comment

I'm good with the design change. I haven't had a chance to test anything - I'd like someone else to at least verify there are no regressions when running a notebook before merging.


Contributor

@indiejames indiejames left a comment

Looks good to me

@flamingbear flamingbear merged commit 4461bd6 into nasa:main Oct 15, 2025
11 checks passed
@flamingbear flamingbear deleted the mhs/DAS-2427/improve-retries-for-transient-errors branch October 15, 2025 15:05