download: add retry and timeout kwargs, controlled via env vars #1073
MehmedGIT
left a comment
Covering client-side 4XX and server-side 5XX response codes should be enough.
kba
left a comment
LGTM, also I think environment variables are the right tool here. We need to document those, though. Perhaps we should add a section for that to the main ocrd --help output? And to the README.
ocrd/ocrd/resolver.py
Outdated
429, # Too Many Requests
440, # Login Timeout
500, # Internal Server Error
503, # Service Unavailable
Perhaps also 502 Bad Gateway in case there is a connection issue between outward-facing and internal web service?
@MehmedGIT What HTTP status is returned when retrieval of @subugoe images fails?
I think everything from 4XX and 5XX should be included, except the 404.
@kba It's already covered there - 500.
I don't think it's correct to include all error cases. For a retry mechanism we should focus on transient failures, not on permanent ones.
404 and 410 are likely permanent, so should not be retried. But I would include all server side errors.
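The positions in this thread can be read as a predicate that separates transient failures from permanent ones. The sketch below is only an illustration of that reading, not the code in this PR: the name `is_transient` and both sets are assumptions. It retries the two 4XX codes from the reviewed snippet (429, 440) plus every 5XX status, and never retries 404 or 410.

```python
# Hypothetical helper (not part of ocrd): decide whether an HTTP status
# is worth retrying. The sets below are assumptions drawn from the thread.
PERMANENT = frozenset({404, 410})         # Not Found, Gone: do not retry
RETRYABLE_CLIENT = frozenset({429, 440})  # Too Many Requests, Login Timeout


def is_transient(status: int) -> bool:
    """Return True if a request that ended with this status may be retried."""
    if status in PERMANENT:
        return False
    if status in RETRYABLE_CLIENT:
        return True
    # All server-side errors (5XX), which includes 502 Bad Gateway.
    return 500 <= status <= 599
```

With this opt-in shape, other 4XX codes such as 403 fail immediately, while 502 is retried without having to be listed explicitly.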
There's an issue with mocking
I have already repaired the tests and uploaded the fixes to my feature branch, but somehow the PR does not update...
duh, sorry: turns out I had pushed to upstream instead of bertsky. I'll remove the stale branch from upstream now.
I still needed to adapt the OAI request test:
-Subproject commit 506b33936d89080a683fa8a26837f2a23b23e5e2
+Subproject commit 1194310c18d90b280c380bdc3cb04adb6a41120f-dirty
diff --git a/tests/test_resolver_oai.py b/tests/test_resolver_oai.py
index ca5c590d8..c0ecf64aa 100644
--- a/tests/test_resolver_oai.py
+++ b/tests/test_resolver_oai.py
@@ -72,7 +72,7 @@ def test_handle_common_oai_response(mock_get, response_dir, oai_response_content
     result = resolver.download_to_directory(response_dir, url)
     # assert
-    mock_get.assert_called_once_with(url, timeout=None)
+    mock_get.assert_called_once_with(url)
     assert result == 'oai'
@@ -100,7 +100,7 @@ def test_handle_response_for_invalid_content(mock_get, response_dir):
     resolver.download_to_directory(response_dir, url)
     # assert
-    mock_get.assert_called_once_with(url, timeout=None)
+    mock_get.assert_called_once_with(url)
     log_output = capt.getvalue()
     assert 'WARNING ocrd_models.utils.handle_oai_response' in log_output
Then tests pass.
AFAICT these are necessary; see CI failures above and current success status.
Done:
Fixes #973 (retry facility) and also implements a timeout facility, and provides those to any download use-case besides CLI ocrd workspace find --download (e.g. processor's typical Workspace.download_file(input_file), or WorkspaceValidator(... download=True ...)).

Since there are so many places which could ultimately trigger Resolver.download_to_directory, and we don't want to drag along the new kwargs retries and timeout everywhere, I chose to control them via environment variables:

- OCRD_DOWNLOAD_RETRIES: number of attempts
- OCRD_DOWNLOAD_TIMEOUT: a single float or a comma-separated tuple of floats (connection timeout, read timeout)

The retry facility itself only triggers on certain HTTP statuses (i.e. opt-in), which is debatable (e.g. I have seen proposals opting out of everything but 200 and 404).
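To make the two variable formats concrete, here is a minimal sketch of how they could be parsed. It is not the PR's implementation: the function name `download_settings` and the defaults (0 retries, no timeout when a variable is unset) are assumptions made for illustration.

```python
import os


def download_settings(environ=os.environ):
    """Parse (retries, timeout) from an environment-like mapping.

    Hypothetical helper; the defaults for unset variables are assumptions.
    """
    retries = int(environ.get("OCRD_DOWNLOAD_RETRIES", "") or 0)

    raw = environ.get("OCRD_DOWNLOAD_TIMEOUT", "").strip()
    if not raw:
        return retries, None  # unset: no timeout

    values = tuple(float(part) for part in raw.split(","))
    if len(values) == 1:
        timeout = values[0]   # one float applies to the whole request
    elif len(values) == 2:
        timeout = values      # (connection timeout, read timeout)
    else:
        raise ValueError(f"expected 1 or 2 floats, got {raw!r}")
    return retries, timeout
```

For example, `OCRD_DOWNLOAD_TIMEOUT=3.5,30` would give a 3.5 second connection timeout and a 30 second read timeout. A float or a two-element tuple is also the shape that the `timeout` argument of `requests.get` accepts, so the parsed value could be passed through unchanged.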