
fix: messages are sent out of order in SNS fifo topics -> SQS fifo queues #7418


Closed

Conversation

@amitassaraf (Contributor) commented Jan 3, 2023

This PR fixes an ordering issue with FIFO SNS topics that publish messages to FIFO SQS queues.

Issue description
When publishing to a FIFO SNS topic, sending messages to its subscribers was an async operation, so if multiple messages were published in short succession, SQS could receive them out of order.
In addition, SQS used the time a message was received as the ordering criterion. For messages that arrive via SNS, we can instead use the SentTimestamp rather than the receive time to better ensure correct ordering in FIFO queues.
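
To illustrate the idea, here is a minimal sketch only; the message/attribute shape below is an assumption for illustration, not LocalStack's internal message model:

```python
# Sketch: order pending FIFO deliveries by the SentTimestamp attribute
# (milliseconds since epoch, stored as a string) rather than by receive time.
def order_for_fifo_delivery(messages: list[dict]) -> list[dict]:
    return sorted(messages, key=lambda m: int(m["Attributes"]["SentTimestamp"]))
```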

The downside to this PR:
LocalStack SNS publishing seems to be fairly slow; with multiple subscribers on a FIFO SNS topic, a single publish may take a couple of seconds, which slows FIFO SNS topics down.

Better solution might be:
Keep a priority queue or a mutex per FIFO SNS topic and deliver messages from a background thread according to that queue/mutex, releasing the client to move on. Messages would still arrive in the SQS queue with some delay, but at least the publish call would not block the client; on the other hand, this pushes the delay to the client's side, which may then need to handle it (see the sketch below).
See this alternate solution using mutex locks here - amitassaraf@3cebf50
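
A minimal sketch of that idea, assuming a per-topic single-worker executor instead of a raw mutex (names such as `deliver_to_subscribers` are placeholders, not LocalStack internals): the publish call returns immediately, while deliveries for a given topic run one at a time in submission order.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_topic_executors: dict[str, ThreadPoolExecutor] = {}
_executors_guard = threading.Lock()

def _executor_for(topic_arn: str) -> ThreadPoolExecutor:
    # Lazily create one single-worker executor per FIFO topic.
    with _executors_guard:
        if topic_arn not in _topic_executors:
            _topic_executors[topic_arn] = ThreadPoolExecutor(max_workers=1)
        return _topic_executors[topic_arn]

def publish_fifo(topic_arn: str, message: dict, deliver_to_subscribers) -> None:
    # The publish call returns immediately; ordering is preserved per topic
    # because a single worker drains that topic's tasks in submission order.
    _executor_for(topic_arn).submit(deliver_to_subscribers, topic_arn, message)
```

A single-worker executor per topic avoids the subtle race of a shared pool plus a lock, where a later task can grab the lock before an earlier one, while still letting different topics deliver in parallel.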

Summary
At Landa, the company I work at, we currently swallow this delay in order to guarantee correct message ordering. I leave it to your consideration whether LocalStack needs this change or not.

In addition, I saw that there is an open SNS refactor PR that might conflict with this fix: #7267

@localstack-bot (Contributor) left a comment


Welcome to LocalStack! Thanks for raising your first Pull Request and landing your contributions. Our team will reach out with any reviews or feedback shortly. We recommend joining our Slack Community and sharing your PR on the #community channel to share your contributions with us. Please make sure you are following our contributing guidelines and our Code of Conduct.

@localstack-bot (Contributor) commented Jan 3, 2023

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

@amitassaraf changed the title from "fix: messages are sent out of order in fifo topics -> fifo queues" to "fix: messages are sent out of order in SNS fifo topics -> SQS fifo queues" on Jan 3, 2023
@amitassaraf (Contributor, Author)

I have read the CLA Document and I hereby sign the CLA

@amitassaraf (Contributor, Author)

recheck

@amitassaraf (Contributor, Author)

Note: After using this for a while, the delay is unbearable. @bentsku, do you have insight into where this delay arises from? We are talking about the SNS-to-SQS publish taking about 4 seconds per message.

@bentsku (Contributor) commented Jan 3, 2023

Hi @amitassaraf and thanks for your contribution!

I've had a quick look and it seems the delay comes from the fact that you removed the async part. We now sequentially send every single SQS message to every subscriber manually in the publish call (and create a client every time), and this would not work in reality.

The PR you linked is very close to being merged, so I propose that we wait until then. If the issue is still present, we will look into the semaphore solution inside the worker threads (so that a worker thread does not send the 2nd message before the 1st), but messages should be delivered sequentially in the order they were queued into the executor.

In what case do you get unordered messages in the FIFO queues? How many messages do you publish?

@amitassaraf (Contributor, Author)

@bentsku I completely understand that the delay is caused by the removal of async, but I'm not sure why calling publish takes so long per message. We have to remove the async code / use a mutex in order to ensure that SNS publishes the messages in order; otherwise it's up to the async socket.select.

If you publish 5 SNS messages to a FIFO topic that has 2 subscribers, you are guaranteed to get out-of-order messages because of the async part of the code. We should explore the mutex approach; ping me once the PR is merged and I'll check whether this still happens (most likely it will, as I peeked at the PR).
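
For reference, a minimal reproduction sketch along those lines (assumptions: LocalStack at http://localhost:4566, dummy credentials, and a FIFO topic already subscribed to a FIFO queue with raw message delivery enabled; the ARN and queue URL below are placeholders):

```python
import boto3

endpoint = "http://localhost:4566"
sns = boto3.client("sns", endpoint_url=endpoint, region_name="us-east-1",
                   aws_access_key_id="test", aws_secret_access_key="test")
sqs = boto3.client("sqs", endpoint_url=endpoint, region_name="us-east-1",
                   aws_access_key_id="test", aws_secret_access_key="test")

topic_arn = "arn:aws:sns:us-east-1:000000000000:orders.fifo"  # placeholder
queue_url = f"{endpoint}/000000000000/orders.fifo"            # placeholder

# Publish 5 messages in quick succession within the same message group.
for i in range(5):
    sns.publish(
        TopicArn=topic_arn,
        Message=str(i),
        MessageGroupId="group-1",
        MessageDeduplicationId=f"dedup-{i}",
    )

# Drain the queue and record the order in which the messages arrive.
received = []
for _ in range(20):
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10,
                               WaitTimeSeconds=2)
    for msg in resp.get("Messages", []):
        received.append(msg["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    if len(received) >= 5:
        break

print("received order:", received)  # expected: ['0', '1', '2', '3', '4']
```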

@thrau (Member) commented Jan 3, 2023

hi @amitassaraf, thank you for all the input on the topic!

as expected, the changes in this PR now conflict with the SNS rework we merged today, so we'll have to re-think the solution.

regarding the issue itself: we haven't been able to reproduce the issue so far, and even conceptually it seems to be a very unlikely scenario. we're happy to look into it more though once we have a way to reliably reproduce the problem. to that end, it would be great if we could start from a new bug report that outlines the circumstances in which the race condition can occur, and provides some instructions on how to reproduce them.

given all this, i closed the PR for now. once we have a clearer understanding of the issue, and we have an agreement of what exactly needs to be fixed, we'd be happy to shepherd a PR!

thank you for helping to tackle this issue!

@thrau closed this Jan 3, 2023
@amitassaraf (Contributor, Author)

@thrau once the PR is merged I'll try to reproduce the issue, thanks.
