Hi!
I've run into an issue with DynamoDB retries a few times and would like to know whether you're open to changing or tweaking the default behaviour. I know I can otherwise override it with my own #aws_config.ddb_retry callback, and I'd be available to contribute.
The gist of what happens is:
- There's either a temporary network issue or DynamoDB provisioned capacity is exceeded
- Thousands of concurrent requests are now each attempted up to 10 times (that is, up to 9 retries):
  erlcloud/src/erlcloud_ddb_impl.erl, line 118 in f16290b:

      -define(NUM_ATTEMPTS, 10).

  erlcloud/src/erlcloud_ddb_impl.erl, lines 149 to 155 in f16290b:

      retry(#ddb2_error{attempt = Attempt} = Error) when Attempt >= ?NUM_ATTEMPTS ->
          {error, Error#ddb2_error.reason};
      retry(#ddb2_error{should_retry = false} = Error) ->
          {error, Error#ddb2_error.reason};
      retry(#ddb2_error{attempt = Attempt}) ->
          backoff(Attempt),
          {attempt, Attempt + 1}.

- For each request, the backoff period between attempts increases exponentially. The sleep is drawn at random up to (1 bsl (Attempt - 1)) * 100 milliseconds, so by the 9th attempt the upper bound is (1 bsl (9 - 1)) * 100 = 25600 milliseconds and the average is about half of that, or ~12.8 seconds:
  erlcloud/src/erlcloud_ddb_impl.erl, line 124 in f16290b:

      timer:sleep(erlcloud_util:rand_uniform((1 bsl (Attempt - 1)) * 100)).

- This makes the average cumulative backoff per request hover at around 25s (0 + 0.1 + ... + 6.4 + 12.8s)
- This means thousands of processes are blocked for long periods while more processes are created and end up in the same situation (even with rate limiting in place, the system simply backs up too quickly)
- Message queues keep growing and the system becomes unresponsive
- Everything's on fire 🔥
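To make the arithmetic above easier to check, here is a minimal sketch that recomputes the figures. It does not call erlcloud. The module and function names are made up for illustration, and it assumes that the first retry has no backoff (the "0" term in the sum) and that a random draw up to N averages roughly N/2.

```erlang
%% Sketch only: reproduces the backoff arithmetic, it does not call erlcloud.
%% Assumptions: no sleep after the first failed attempt, and a uniform draw
%% up to N averages about N/2.
-module(backoff_math).
-export([max_sleep_ms/1, mean_sleep_ms/1, total_mean_ms/1]).

%% Upper bound of the sleep that follows failed attempt number Attempt.
max_sleep_ms(1) -> 0;
max_sleep_ms(Attempt) when Attempt > 1 -> (1 bsl (Attempt - 1)) * 100.

%% Approximate mean sleep: half of the upper bound.
mean_sleep_ms(Attempt) -> max_sleep_ms(Attempt) div 2.

%% Sum of the mean sleeps for a request that fails NumAttempts times.
%% The last failed attempt returns the error without sleeping.
total_mean_ms(NumAttempts) ->
    lists:sum([mean_sleep_ms(A) || A <- lists:seq(1, NumAttempts - 1)]).
```

Under those assumptions, `backoff_math:mean_sleep_ms(9)` gives 12800 and `backoff_math:total_mean_ms(10)` gives 25500, i.e. roughly 25 seconds of sleeping per request that exhausts all attempts.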
Assuming this is not the way it's intended to work, my proposal would be to:
- Lower the maximum number of request attempts and/or the backoff periods (perhaps even make them configurable)
- Not retry when provisioned capacity is exceeded - it always makes the underlying situation worse, as by the time the few requests that finally succeed return, their callers are no longer there (and this is made even worse by DynamoDB itself delaying responses in those situations, blocking callers for even longer)
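As a rough illustration of both points, a retry function along these lines could look like the sketch below. It is not erlcloud code: the record is a local stand-in containing only the fields visible in the snippets above, `MAX_ATTEMPTS` is a hypothetical lower cap, and the shape of the throttling reason (`{ExceptionName, Message}` with a binary exception name) is an assumption that would need checking against the real error values.

```erlang
%% Sketch only. The record is a local stand-in with just the fields seen in
%% the snippets above; real code would include erlcloud's own definition.
-module(ddb_retry_sketch).
-export([retry/1, is_throttled/1]).

-record(ddb2_error, {attempt = 1, should_retry = true, reason}).

-define(MAX_ATTEMPTS, 4).  %% hypothetical cap, lower than the default of 10
-define(BASE_MS, 100).     %% same base unit as the default backoff

%% ASSUMPTION: throttling surfaces as {ExceptionName, Message}.
is_throttled({<<"ProvisionedThroughputExceededException">>, _Msg}) -> true;
is_throttled(_Other) -> false.

%% Give up once the cap is reached or the error is marked non-retryable.
retry(#ddb2_error{attempt = Attempt, reason = Reason})
  when Attempt >= ?MAX_ATTEMPTS ->
    {error, Reason};
retry(#ddb2_error{should_retry = false, reason = Reason}) ->
    {error, Reason};
retry(#ddb2_error{attempt = Attempt, reason = Reason}) ->
    case is_throttled(Reason) of
        true ->
            %% Fail fast instead of piling more load onto a throttled table.
            {error, Reason};
        false ->
            backoff(Attempt),
            {attempt, Attempt + 1}
    end.

%% Same exponential shape as the default, with no sleep on the first retry.
backoff(1) -> ok;
backoff(Attempt) ->
    timer:sleep(rand:uniform((1 bsl (Attempt - 1)) * ?BASE_MS)).
```

With a cap of 4 attempts the worst-case sleep per request is bounded by 200 + 400 + 800 = 1400 milliseconds, and throttled requests return to their callers immediately. Something like this would presumably be supplied through the #aws_config.ddb_retry field mentioned earlier.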
What do you think?
(And then again, I'll understand if you tell me the current behaviour is intentional - I can always resort to using a custom retry callback.)