fix: messages are sent out of order in SNS fifo topics -> SQS fifo queues #7418
Welcome to LocalStack! Thanks for raising your first Pull Request and landing in your contributions. Our team will reach out with any reviews or feedbacks that we have shortly. We recommend joining our Slack Community and share your PR on the #community channel to share your contributions with us. Please make sure you are following our contributing guidelines and our Code of Conduct.
Note: After using this for a while, the delay is unbearable. @bentsku do you have any insight into where this delay arises from? We are talking about each SNS-to-SQS publish request taking about 4 seconds per message.
Hi @amitassaraf and thanks for your contribution! I've had a quick look and it seems the delay comes from the fact that you removed the async part. We now sequentially send every single SQS message to every subscriber manually in the

The PR you linked is very close to being merged, so I propose that we wait until then. If the issue is still present, we will look into the semaphore solution inside the worker threads (so that a worker thread does not send the 2nd message before the 1st), but messages should be delivered sequentially in the order they were queued into the executor. In what case do you get unordered messages in the FIFO queues? How many messages do you publish?
@bentsku I completely understand that the delay is caused by the removal of async, but I'm not sure why calling publish takes so long per message. We have to remove the async or use a mutex to ensure that SNS publishes the messages in order; otherwise the order is up to the async code's socket.select. If you publish 5 SNS messages to a FIFO topic that has 2 subscribers, you are guaranteed to get out-of-order messages. Since the async part of the code is the cause, we should explore a mutex. Ping me once the PR is merged and I'll see if this still happens (most likely it will, as I peeked at the PR).
hi @amitassaraf, thank you for all the input on the topic! as expected, the changes in this PR now conflict with the SNS rework we merged today, so we'll have to re-think the solution. regarding the issue itself: we haven't been able to reproduce the issue so far, and even conceptually it seems to be a very unlikely scenario. we're happy to look into it more though once we have a way to reliably reproduce the problem. to that end, it would be great if we could start from a new bug report that outlines the circumstances in which the race condition can occur, and provides some instructions on how to reproduce them. given all this, i closed the PR for now. once we have a clearer understanding of the issue, and we have an agreement of what exactly needs to be fixed, we'd be happy to shepherd a PR! thank you for helping to tackle this issue!
@thrau once the PR is merged I'll try to reproduce the issue, thanks.
This PR fixes an issue when setting up FIFO SNS topics that publish messages to FIFO SQS queues.
Issue description
The issue is that, when publishing to FIFO SNS topics, sending messages to its subscribers was an async operation. If multiple messages were sent in short succession, SQS would receive these messages out of order.
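To illustrate the kind of reordering described above, here is a minimal sketch that does not use LocalStack's code at all. It assumes deliveries are handed to a thread pool and that each delivery has its own latency; `deliver_all` and the artificial `time.sleep` delays are hypothetical stand-ins for the real SNS-to-SQS calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def deliver_all(messages, max_workers):
    """Submit each (message, delay) pair to a pool; return the completion order."""
    delivered = []

    def deliver(msg, delay):
        time.sleep(delay)      # stand-in for the latency of the real delivery call
        delivered.append(msg)  # record the order in which deliveries complete

    # Leaving the `with` block waits for all submitted deliveries to finish.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for msg, delay in messages:
            pool.submit(deliver, msg, delay)
    return delivered


# Hypothetical publish order m1, m2, m3, with m1 being the slowest delivery.
messages = [("m1", 0.3), ("m2", 0.0), ("m3", 0.1)]

print(deliver_all(messages, max_workers=3))  # concurrent workers
print(deliver_all(messages, max_workers=1))  # a single worker
```

With several workers, a slow first delivery lets later messages finish first, so the completion order no longer matches the publish order. With a single worker, tasks run in submission order, which is the behaviour one would expect from an executor that drains its queue sequentially.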
In addition, SQS treated the time the messages were received as the ordering method for the messages. When a message was received from SNS, we can use the SentTimestamp as the ordering method instead of the time received, to better ensure correct ordering in FIFO queues.
The downside to this PR:
LocalStack SNS publishing seems to be pretty slow: when the FIFO SNS topic has multiple subscribers, publishing may take a couple of seconds, slowing down FIFO SNS topics.
Better solution might be:
Keep a priority queue or a mutex per FIFO SNS topic and send messages from a thread in the order given by the queue/mutex, while releasing the client to move on. There will still be a delay before the messages arrive in the SQS queue, but at least it will not block the client. It might, however, push the issue to the client's end, requiring it to handle the delay (?)
See this alternate solution using mutex locks here - amitassaraf@3cebf50
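As a rough sketch of the queue-based idea (not the code in the linked commit and not LocalStack's implementation), the following assumes one FIFO queue and one worker thread per topic. `OrderedTopicDispatcher`, `publish`, and `wait_until_delivered` are hypothetical names, and subscribers are modelled as plain callables. Real FIFO topics order messages per message group, which this sketch ignores.

```python
import queue
import threading


class OrderedTopicDispatcher:
    """Hypothetical helper: one FIFO queue and one worker thread per topic."""

    def __init__(self, subscribers):
        self._subscribers = list(subscribers)  # callables taking one message
        self._queue = queue.Queue()            # preserves the order of put() calls
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def publish(self, message):
        # Returns immediately, so the publishing client is not blocked on delivery.
        self._queue.put(message)

    def _run(self):
        while True:
            message = self._queue.get()
            try:
                # Deliver to every subscriber before taking the next message.
                for deliver in self._subscribers:
                    try:
                        deliver(message)
                    except Exception:
                        # A real implementation would log or retry; the sketch moves on.
                        pass
            finally:
                self._queue.task_done()

    def wait_until_delivered(self):
        # Blocks until every queued message has been processed by the worker.
        self._queue.join()


received_a, received_b = [], []
dispatcher = OrderedTopicDispatcher([received_a.append, received_b.append])
for i in range(5):
    dispatcher.publish(f"msg-{i}")
dispatcher.wait_until_delivered()
print(received_a)
print(received_b)
```

Because a single worker drains the queue, each subscriber sees messages in the order `publish` was called, while `publish` itself stays cheap. The trade-off is the one noted above: delivery is still delayed, only the client is no longer waiting for it.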
Summary
At Landa, the company I work at, we currently swallow this delay in order to ensure the correct ordering of the messages. It is up to you to decide whether LocalStack needs this change or not.
In addition, I saw that there is an open SNS refactor PR that might conflict with this fix: #7267