Snowballing DynamoDB retries #709

@g-andrade

Hi!

I've run into an issue with DynamoDB retries a few times and would like to know whether you're open to changing or tweaking the default behaviour (I know I can otherwise override it with my own #aws_config.ddb_retry callback). I'd be available to contribute.

The gist of what happens is:

  1. There's either a temporary network issue or DynamoDB provisioned capacity is exceeded
  2. Thousands of concurrent requests now start being retried up to 9 times (10 attempts in total):
    -define(NUM_ATTEMPTS, 10).

    retry(#ddb2_error{attempt = Attempt} = Error) when Attempt >= ?NUM_ATTEMPTS ->
        {error, Error#ddb2_error.reason};
    retry(#ddb2_error{should_retry = false} = Error) ->
        {error, Error#ddb2_error.reason};
    retry(#ddb2_error{attempt = Attempt}) ->
        backoff(Attempt),
        {attempt, Attempt + 1}.
  3. For each request, the backoff period between attempts increases exponentially. By the 9th attempt it is drawn uniformly from a range of up to (1 bsl (9 - 1)) * 100 = 25,600 milliseconds, which averages ~12.8 seconds:
    timer:sleep(erlcloud_util:rand_uniform((1 bsl (Attempt - 1)) * 100)).
  4. This makes the average cumulative backoff hover at around 25s (0 + 0.1 + ... + 6.4 + 12.8s)
  5. This leaves thousands of processes blocked for long periods while more processes are created and end up in the same situation (even with rate limiting in place, the system backs up too quickly)
  6. Message queues grow and the system becomes unresponsive
  7. Everything's on fire 🔥
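
The arithmetic in steps 3 and 4 can be checked with a small sketch. The module and function names below are hypothetical and not part of erlcloud; the only thing taken from the library is the expression (1 bsl (Attempt - 1)) * 100 quoted above. It also assumes the first retry is immediate, which is what the leading 0 in the sum of step 4 suggests.

```erlang
-module(backoff_math).
-export([max_sleep_ms/1, mean_sleep_ms/1, cumulative_mean_ms/1]).

%% Upper bound (in ms) of the random sleep that follows failed attempt
%% number Attempt, using the expression quoted in step 3.
%% ASSUMPTION: the first retry happens without any sleep.
max_sleep_ms(1) -> 0;
max_sleep_ms(Attempt) when Attempt > 1 -> (1 bsl (Attempt - 1)) * 100.

%% Mean of a uniform draw over that range: roughly half the upper bound.
mean_sleep_ms(Attempt) -> max_sleep_ms(Attempt) / 2.

%% Expected total sleep over the backoffs that follow attempts 1..LastAttempt.
cumulative_mean_ms(LastAttempt) ->
    lists:sum([mean_sleep_ms(A) || A <- lists:seq(1, LastAttempt)]).
```

With these definitions, backoff_math:max_sleep_ms(9) is 25600, backoff_math:mean_sleep_ms(9) is 12800.0 and backoff_math:cumulative_mean_ms(9) is 25500.0, i.e. roughly 25 seconds of expected sleeping per request that exhausts its retries.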

Assuming this is not the way it's intended to work, my proposal would be to:

  1. Lower the maximum number of request attempts and/or the backoff periods (perhaps even make them configurable)
  2. Not retry when provisioned capacity is exceeded - it always makes the underlying situation worse, as by the time the few requests that finally succeed return, their callers are no longer there (and this is made even worse by DynamoDB itself delaying responses in those situations, blocking callers for even longer)
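
As a rough illustration of both points, here is a minimal sketch of what a more conservative retry callback could look like. It is not the library's implementation. The module name, the cap of 4 attempts and the local record definition are all hypothetical: the record is a stand-in that lists only the fields visible in the snippet above, and the shape of the throttling reason (a tuple whose first element is the exception name as a binary) is an assumption that should be checked against what erlcloud actually returns.

```erlang
-module(conservative_ddb_retry).
-export([retry/1]).

%% Local stand-in for erlcloud's #ddb2_error{} record so that this sketch
%% compiles on its own. Real code should use the library's own record
%% definition instead, since the real record has more fields.
-record(ddb2_error, {attempt, should_retry, reason}).

-define(MAX_ATTEMPTS, 4).      %% hypothetical lower cap (the default is 10)
-define(BASE_DELAY_MS, 100).   %% same base delay as the quoted backoff

%% Give up once the attempt cap is reached.
retry(#ddb2_error{attempt = Attempt, reason = Reason})
  when Attempt >= ?MAX_ATTEMPTS ->
    {error, Reason};
%% Give up on errors already flagged as not retryable.
retry(#ddb2_error{should_retry = false, reason = Reason}) ->
    {error, Reason};
%% ASSUMPTION: throttling errors look like {ExceptionName, Message}.
%% Do not retry them, so that callers fail fast under exceeded capacity.
retry(#ddb2_error{reason = {<<"ProvisionedThroughputExceededException">>, _} = Reason}) ->
    {error, Reason};
%% Otherwise sleep for a random, exponentially growing period and retry.
retry(#ddb2_error{attempt = Attempt}) ->
    timer:sleep(rand:uniform((1 bsl (Attempt - 1)) * ?BASE_DELAY_MS)),
    {attempt, Attempt + 1}.
```

Such a function would be plugged in through the ddb_retry field of #aws_config mentioned above, for example as fun conservative_ddb_retry:retry/1. With a cap of 4 attempts the expected cumulative sleep stays well under a second, at the cost of giving up sooner on genuinely transient failures.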

What do you think?

(And then again, I'll understand if you tell me the current behaviour is intentional - I can always resort to using a custom retry callback.)
