DAS-2427: Adds retry logic to harmony requests Session #119
Conversation
    responses.add(
        responses.GET,
        expected_status_url(job_id),
        status=500,
        json=error
    )
If you don't use an ordered registry for your responses, the last registered response just keeps being used: it will match over and over again. This confused me when I thought I had to add separate responses for the retries.
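For anyone else who hits this: the responses library provides an OrderedRegistry, where each registered response is consumed once, in order. A minimal sketch of that pattern, with a hypothetical URL constant standing in for the expected_status_url(job_id) helper used in these tests:

```python
import requests
import responses
from responses.registries import OrderedRegistry

# Hypothetical stand-in for the expected_status_url(job_id) helper in the real tests.
STATUS_URL = 'https://harmony.earthdata.nasa.gov/jobs/1234'


@responses.activate(registry=OrderedRegistry)
def test_ordered_registry_sketch():
    # With OrderedRegistry, each registered response matches exactly once and
    # in order, so a retried request sees the 502 first and then the 200.
    responses.add(responses.GET, STATUS_URL, status=502)
    responses.add(responses.GET, STATUS_URL, json={'status': 'successful'}, status=200)

    assert requests.get(STATUS_URL).status_code == 502
    assert requests.get(STATUS_URL).status_code == 200
```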
I added this as ready for review, but it's ready for discussion more than anything.
    total=3,
    backoff_factor=1,  # Wait 1, 2, 4 seconds between retries
    backoff_jitter=0.5,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET"],
    raise_on_status=False,
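For context on where values like these live: a urllib3 Retry configured this way is typically attached to a requests.Session through an HTTPAdapter. This is an illustrative sketch of that wiring, not the exact code from the PR (note that backoff_jitter requires urllib3 2.0 or later):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retries = Retry(
    total=3,
    backoff_factor=1,  # Wait 1, 2, 4 seconds between retries
    backoff_jitter=0.5,  # Requires urllib3 >= 2.0
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET"],
    raise_on_status=False,
)

session = requests.Session()
adapter = HTTPAdapter(max_retries=retries)
session.mount('https://', adapter)
session.mount('http://', adapter)
```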
I'd love some advice/encouragement on these values.
The choice of error codes makes sense to me, as does allowing only the GET method.
I love the inclusion of jitter. Not sure I have strong opinions on the values for backoff_factor and backoff_jitter, though. As discussed in Slack: an additional 7 seconds total for the retries seems reasonable in comparison to a timeout (tens of seconds per attempt) or general service performance (again, usually at least tens of seconds).
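As a back-of-envelope check on that 7-second figure (this ignores jitter and any urllib3 version differences in exactly when the first sleep is applied), the nominal waits implied by the in-code comment are:

```python
total = 3
backoff_factor = 1

# Nominal exponential backoff: 1, 2, 4 seconds across the three retries.
waits = [backoff_factor * 2 ** attempt for attempt in range(total)]
print(waits)       # [1, 2, 4]
print(sum(waits))  # 7 seconds of extra delay if all retries are exhausted
```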
I think your error code list makes sense as well; I have seen intermittent 500, 502, and 503 failures. I'm also on board with retrying a handful of times for up to 10 seconds: it isn't a long delay if the problem is not intermittent, and if it is intermittent the user doesn't get an error.
owenlittlejohns left a comment
I think this makes sense to me. It seems better behaviour than getting a 502 error from Client.wait_for_processing while the job is actually fine and it's just the /jobs endpoint that is temporarily inaccessible.
The tests all passed for me locally:
215 passed, 3 warnings in 10.76s
Other information that might be related:
- Example failed request (502): https://harmony.earthdata.nasa.gov/jobs/8cc0f789-92bc-4a65-9270-17492e9c88ba?linktype=https
- Potentially corresponds with a redeployment of the production Harmony instance (eyeballing things in Bamboo).
    assert len(responses.calls) == 9  # (1 + 4 + 4)

    @responses.activate
    def test_handle_non_transient_error_no_retry():
Thanks for adding this unit test as discussed!
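For readers skimming the diff, the shape of that test is roughly the following sketch. The URL constant and session helper here are hypothetical stand-ins for the real harmony-py test fixtures; the point is that a status code outside status_forcelist should produce exactly one recorded call:

```python
import requests
import responses
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Hypothetical stand-in for the expected_status_url(job_id) helper in the real tests.
STATUS_URL = 'https://harmony.earthdata.nasa.gov/jobs/1234'


def session_with_retries() -> requests.Session:
    # Hypothetical helper mirroring the retry configuration under discussion.
    retries = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
        raise_on_status=False,
    )
    session = requests.Session()
    session.mount('https://', HTTPAdapter(max_retries=retries))
    return session


@responses.activate
def test_non_transient_error_is_not_retried():
    # 400 is not in status_forcelist, so the request should not be retried.
    responses.add(responses.GET, STATUS_URL, status=400, json={'code': 'BadRequest'})

    response = session_with_retries().get(STATUS_URL)

    assert response.status_code == 400
    assert len(responses.calls) == 1  # exactly one call, no retries
```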
chris-durbin left a comment
I'm good with the design change. I haven't had a chance to test anything, so I would like someone else to at least verify there are no regressions when running a notebook before merging.
indiejames left a comment
Looks good to me
Jira Issue ID
DAS-2427
Description
NSIDC testers were getting 502 and 504 errors during some long-running requests. This caused some confusion: they would get an error in their notebook, but then look at the workflow-ui and see that the harmony service had succeeded (or was still READY or RUNNING).
This PR adds retries for those transient errors.
Here are some questions, since I had some help from Gemini because I'm not an expert:
Should all of the listed errors be retried? I currently have [429, 500, 502, 503, 504].
Is a backoff_factor of 1 a good choice?
Local Test Steps
Build the branch and run the tests.
Next you can build a distribution and install it into an environment.
That creates a .whl and a .gz distribution. I installed it into my regression test environment
and ran the tests against localhost and UAT, just as a sanity check.
Let me know if there are other things to test or update.
PR Acceptance Checklist
- Acceptance criteria met
- Documentation updated (if needed)