Snowballing DynamoDB retries

Hi!

I've run into an issue with DynamoDB retries a few times and would like to know whether you're open to changing or tweaking the default behaviour (I know I can otherwise override it with my own `#aws_config.ddb_retry` callback). I'd be available to contribute.

The gist of what happens is:
1. There's either a temporary network issue or DynamoDB provisioned capacity is exceeded
2. Thousands of concurrent requests now start being tried up to 9 times:
https://github.com/erlcloud/erlcloud/blob/f16290b9f856d6ad708b3fe87049abbc8ec93bb4/src/erlcloud_ddb_impl.erl#L118
https://github.com/erlcloud/erlcloud/blob/f16290b9f856d6ad708b3fe87049abbc8ec93bb4/src/erlcloud_ddb_impl.erl#L149-L155
3. For each request, the backoff period between attempts increases exponentially, going up to an average of half of `(1 bsl (9 - 1)) * 100` milliseconds, or ~12.8 seconds, by the 9th attempt:
https://github.com/erlcloud/erlcloud/blob/f16290b9f856d6ad708b3fe87049abbc8ec93bb4/src/erlcloud_ddb_impl.erl#L124
4. This makes the average cumulative backoff hover at around 25.5s (0 + 0.1 + 0.2 + ... + 6.4 + 12.8s)
5. This means thousands of processes are blocked for long periods while more processes are created and end up in the same situation (even with rate limiting in place, the system simply gets backed up too quickly)
6. Message queues grow and the system becomes unresponsive
7. Everything's on fire 🔥 
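
To make the arithmetic in steps 3 and 4 easy to check, here is a small sketch that recomputes the figures. It assumes the sleep before each attempt is drawn uniformly between 0 and `(1 bsl (Attempt - 1)) * 100` ms and that the first attempt has no backoff; the module and function names are made up for illustration and are not part of erlcloud.

```erlang
-module(backoff_math_sketch).
-export([max_backoff_ms/1, avg_backoff_ms/1, cumulative_avg_ms/1]).

%% Upper bound (in ms) of the sleep associated with a given attempt.
%% Assumption: the first attempt is not delayed at all.
max_backoff_ms(1) -> 0;
max_backoff_ms(Attempt) when Attempt > 1 ->
    (1 bsl (Attempt - 1)) * 100.

%% Average sleep, assuming it is uniformly distributed in [0, Max].
avg_backoff_ms(Attempt) ->
    max_backoff_ms(Attempt) div 2.

%% Sum of the average sleeps for attempts 1..Attempts.
cumulative_avg_ms(Attempts) ->
    lists:sum([avg_backoff_ms(A) || A <- lists:seq(1, Attempts)]).
```

Under those assumptions, `backoff_math_sketch:avg_backoff_ms(9)` evaluates to 12800 and `backoff_math_sketch:cumulative_avg_ms(9)` to 25500, i.e. roughly 13 seconds for the last backoff and about 25 seconds blocked in total per request.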

---

Assuming this is not the way it's intended to work, my proposal would be to:
1) Lower the maximum number of request attempts and/or backoff periods (perhaps even make it configurable)
2) **Not** retry when provisioned capacity is exceeded - it always makes the underlying situation worse, as by the time the few requests that finally succeed return, their callers are no longer there (and this is made even worse by DynamoDB itself delaying responses in those situations, blocking callers for even longer)
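
As a rough sketch of what both points could look like together (not a finished patch), here is a retry function with fewer attempts, a capped backoff, and no retry on exceeded capacity. To keep it self-contained it uses a simplified stand-in for the `#ddb2_error{}` record; `?MAX_ATTEMPTS`, `?MAX_BACKOFF_MS` and the module name are hypothetical, and it assumes the error reason for exceeded capacity is a `{Type, Message}` tuple whose type is `<<"ProvisionedThroughputExceededException">>`.

```erlang
-module(ddb_retry_sketch).
-export([retry/1]).

%% Simplified stand-in for erlcloud's #ddb2_error{} record (assumption:
%% only these three fields matter here). Real code would include
%% erlcloud's own header instead of redefining the record.
-record(ddb2_error, {attempt = 1 :: pos_integer(),
                     should_retry = true :: boolean(),
                     reason :: term()}).

-define(MAX_ATTEMPTS, 4).       %% hypothetical lower attempt limit
-define(MAX_BACKOFF_MS, 1000).  %% hypothetical cap on a single sleep

%% Give up once the attempt limit is reached.
retry(#ddb2_error{attempt = Attempt, reason = Reason})
  when Attempt >= ?MAX_ATTEMPTS ->
    {error, Reason};
%% Give up on errors already flagged as non-retryable.
retry(#ddb2_error{should_retry = false, reason = Reason}) ->
    {error, Reason};
%% Never retry when provisioned capacity is exceeded (assumed reason shape).
retry(#ddb2_error{reason = {<<"ProvisionedThroughputExceededException">>, _} = Reason}) ->
    {error, Reason};
%% Otherwise sleep for a random, capped period and try again.
retry(#ddb2_error{attempt = Attempt}) ->
    timer:sleep(rand:uniform(backoff_cap_ms(Attempt))),
    {attempt, Attempt + 1}.

%% Exponential upper bound for the sleep, capped at ?MAX_BACKOFF_MS.
backoff_cap_ms(Attempt) ->
    min(?MAX_BACKOFF_MS, (1 bsl (Attempt - 1)) * 100).
```

With limits like these, the worst case per request is three sleeps of at most 100, 200 and 400 ms, so callers are blocked for well under a second of backoff instead of tens of seconds.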

What do you think? 

(And then again, I'll understand if you tell me the current behaviour is intentional - I can always resort to using a custom retry callback.)

	retry(#ddb2_error{attempt = Attempt} = Error) when Attempt >= ?NUM_ATTEMPTS ->
	    {error, Error#ddb2_error.reason};
	retry(#ddb2_error{should_retry = false} = Error) ->
	    {error, Error#ddb2_error.reason};
	retry(#ddb2_error{attempt = Attempt}) ->
	    backoff(Attempt),
	    {attempt, Attempt + 1}.

Snowballing DynamoDB retries #709
