[Messenger] Reduce lock time when using MySQL for transport #60207
Introduce an algorithm in `Connection` for MySQL platforms to minimize exclusive locking
Hey! To help keep things organized, we don't allow "Draft" pull requests. Could you please click the "ready for review" button or close this PR and open a new one when you are done? Note that a pull request does not have to be "perfect" or "ready for merge" when you first open it. We just want it to be ready for a first review. Cheers! Carsonbot
| $this->deleteDeliveredMessageForMySQLPlatform(); | ||
| } | ||
| try { | ||
| $this->driverConnection->delete($this->configuration['table_name'], ['delivered_at' => '9999-12-31 23:59:59']); |
This took an exclusive lock on more than row/record level, which is problematic because it blocks all other processes.
| private function getMessageForMySQLPlatform(): ?array | ||
| { | ||
| $possibleIdsToClaim = $this->createAvailableMessagesQueryBuilder() |
Fetch only the ids until a message is claimed, so that we do not load all the payloads unnecessarily (they can be huge).
| return null; | ||
| } | ||
| $messageData = $this->createQueryBuilder() |
Load the data only for the claimed message.
| $claimed = $this->driverConnection->createQueryBuilder() | ||
| ->update($this->configuration['table_name']) | ||
| ->set('delivered_at', ':now') |
This takes an exclusive lock at row/record level to ensure we won't have race conditions with multiple workers handling the same message.
Either the update succeeds, which means no other worker has claimed the message yet (its delivered_at had not been updated),
or we find no row to update and move on to the next message id we can try.
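A minimal sketch of this claim-by-update loop, written against plain PDO with an in-memory SQLite database so it runs standalone. `claimFirstAvailable()` and the reduced table are hypothetical and not the PR's code, which goes through Doctrine DBAL's query builder. The `delivered_at IS NULL` guard is also a simplification: the transport's own SELECT additionally treats rows whose `delivered_at` is older than a redelivery limit as available.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: try each candidate id in turn and return the first
 * one this worker manages to claim, or null if every candidate was taken.
 */
function claimFirstAvailable(PDO $pdo, array $candidateIds, string $now): ?int
{
    // The WHERE clause makes the claim atomic: only one worker can flip
    // delivered_at from NULL to a timestamp for a given row.
    $stmt = $pdo->prepare(
        'UPDATE messenger_messages SET delivered_at = :now
         WHERE id = :id AND delivered_at IS NULL'
    );

    foreach ($candidateIds as $id) {
        $stmt->execute(['now' => $now, 'id' => $id]);
        if (1 === $stmt->rowCount()) {
            return (int) $id; // exactly one row updated: the claim is ours
        }
        // zero rows updated: another worker got there first, try the next id
    }

    return null;
}

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE messenger_messages (id INTEGER PRIMARY KEY, body TEXT, delivered_at TEXT)');
$pdo->exec("INSERT INTO messenger_messages (id, body, delivered_at) VALUES
    (1, 'a', '2025-01-01 00:00:00'),
    (2, 'b', NULL),
    (3, 'c', NULL)");

$first = claimFirstAvailable($pdo, [1, 2, 3], '2025-01-02 00:00:00');  // 2 (id 1 was already claimed)
$second = claimFirstAvailable($pdo, [1, 2, 3], '2025-01-02 00:00:01'); // 3
$third = claimFirstAvailable($pdo, [1, 2, 3], '2025-01-02 00:00:02');  // null, nothing left
```

Only the single-row UPDATE needs a lock here; the candidate ids can come from a plain, non-locking SELECT, and the payload is loaded afterwards for the claimed id only.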
| $ids = $this->selectMessageIdsToDelete(); | ||
| $this->driverConnection->createQueryBuilder() | ||
| ->delete($this->configuration['table_name']) | ||
| ->where('id IN (:ids)') |
Row/record-level exclusive lock, so the delete does not interfere with the selecting part.
Would it work to replace this with a subquery instead of doing a round trip to get the ids?
I'd rather not use subqueries. They are very often a cause of performance problems.
Maybe someone competent enough in other platforms like Oracle or PostgreSQL can decide whether it should be adapted there as well 🤷♂️
What do you think? Shall we move toward merging this? Then the tests must be fixed... Should this first be controlled by a flag, as an opt-in, so that it cannot break anything?
| } | ||
| } | ||
| if (!isset($claimedId)) { |
The variable should be declared beforehand.
| if (!isset($claimedId)) { | |
| if (null === $claimedId) { |
| Types::STRING, | ||
| Types::STRING, | ||
| ]) | ||
| ->setMaxResults(5_000) |
What if there are more?
No problem. Each time a message is claimed, up to 5k are deleted.
| $connection->get(); | ||
| } | ||
| public static function providePlatformSql(): iterable |
It'd be nice to reduce the diff on this file. Doable?
To be honest, no. I changed the algorithm and the queries. I tried to, but this was the best I could do for the unit test.
…ection.php Co-authored-by: Nicolas Grekas <[email protected]>
…ection.php Co-authored-by: Nicolas Grekas <[email protected]>
…ection.php Co-authored-by: Nicolas Grekas <[email protected]>
introduce a variable instead of checking isset.
| } | ||
| } | ||
| if (!null === $claimedId) { |
| if (!null === $claimedId) { | |
| if (null === $claimedId) { |
Typo, I guess.
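For context on why the stray `!` matters, here is a small standalone illustration (the `$buggy` and `$fixed` names are only for this example). In PHP, `!` binds tighter than `===`, so `!null === $claimedId` is evaluated as `(!null) === $claimedId`, that is `true === $claimedId`, which is false both for `null` and for an integer id.

```php
<?php
declare(strict_types=1);

$claimedId = null; // the state when no candidate message could be claimed

// Parsed as (!null) === $claimedId, i.e. true === null: never true here.
$buggy = (!null === $claimedId);

// The intended check: true exactly when nothing was claimed.
$fixed = (null === $claimedId);

var_dump($buggy); // bool(false)
var_dump($fixed); // bool(true)
```

With the buggy form the "nothing claimed" branch would never run, so the suggested `null === $claimedId` is the check that matches the intent.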
| $claimedId = null; | ||
| foreach ($possibleIdsToClaim as $id) { | ||
| if (null === $claimedId = $this->claimMessage($id)) { |
| if (null === $claimedId = $this->claimMessage($id)) { | |
| if (null !== $claimedId = $this->claimMessage($id)) { |
Hi, I'm sadly closing this PR. As you might have seen, I can't find the spare time to work on this. Some private duties take all my time.
@JanPaulBeumer I had the same deadlocking issues, but I was able to drill down to their root cause and the correct way of solving them, and verified it by running my solution in production on a fairly high-traffic queue (over 3 million per day): #61963
…at causes deadlocks (psihius)

This PR was merged into the 6.4 branch.

Discussion
----------

[Doctrine][Messenger] Remove old MySQL special handling that causes deadlocks

| Q             | A
| ------------- | ---
| Branch?       | 6.4
| Bug fix?      | yes
| New feature?  | no
| Deprecations? | no
| Issues        | [#47633](#47366), [#47366](#47366), [#57906](#57906) (and many others since closed), abandoned PR #60207 and so on
| License       | MIT

We run over 3 million queue items a day. We ran into major issues with the current implementation deadlocking regularly, and no amount of adjusting the purge threads and other settings fixed the root cause: the messenger_messages table does not have a proper covering index for the SELECT FOR UPDATE query.

Because the MySQL implementation has been special-cased to batch deletes by `delivered_at` having a special value, at least in MySQL 8.0.* and up (we ran 8.0.42 and now run 8.4.6) this results in row range locks that basically lock the whole table: the delivered_at index has extremely low cardinality, so all rows whose delivered_at is NULL get locked. Then UPDATE queries try to update delivered_at while the delete runs on a delivered_at condition, resulting in an eventual deadlock. At our scale this led to deadlocks completely overwhelming the server within an hour and hard-locking it to the point where we had to `kill -9 <mysql pid>`; even running very aggressive deadlock timeouts doesn't help.

Our database machine has plenty of resources and RAM free, so it was never a CPU, RAM or I/O issue: the server barely uses over 15% of the CPU, and the InnoDB buffer is only 40% full, so everything fits into memory. I/O never rose above 3%, mostly sitting below 1% (we have InnoDB io capacity set at 6000 baseline and 12000 peak, which is only a fraction of what the storage layer is capable of).

Adding a covering index on `delivered_at, id` does help to alleviate the onset of the issue, but it still resulted in hard deadlocks; it just took about 14-16 hours under our workloads. I was unable to find the original reasons why delete batching was added, but I suspect it is some MySQL 4/5 era shenanigans that are outdated and no longer true.

So this PR is what I deployed 6 days ago to our production environment, and it has been running trouble-free since then without a single deadlock recorded against the messenger component table. Collecting statistics also shows that this is the correct way to solve this; the performance schema queries further down show before and after.

I removed all batched handling and let MySQL run the same way all other databases do, which works like a charm if we also add a proper index on `queue_name + available_at + delivered_at + id`. This allows MySQL to lock only the specifically required row by its primary id, removing all lock contention issues (the id field in the index is needed; that's what gives the index the cardinality to do the job right).

Before (notice the average lock ms column, it is bad):

```
mysql> SELECT DIGEST_TEXT,
    -> COUNT_STAR,
    -> ROUND(SUM_TIMER_WAIT/1e12,3) AS total_sec,
    -> ROUND(SUM_LOCK_TIME/1e12,3) AS lock_sec,
    -> ROUND((SUM_LOCK_TIME/1e12)/NULLIF(COUNT_STAR,0)*1000,3) AS avg_lock_ms
    -> FROM performance_schema.events_statements_summary_by_digest
    -> WHERE DIGEST_TEXT LIKE '%MESSENGER_MESSAGES%'
    -> ORDER BY SUM_TIMER_WAIT DESC
    -> LIMIT 10;
| DIGEST_TEXT | COUNT_STAR | total_sec | lock_sec | avg_lock_ms |
| DELETE FROM `messenger_messages` WHERE `delivered_at` = ? | 2699821 | 126790.694 | 120946.017 | 44.798 |
| UPDATE `messenger_messages` SET `delivered_at` = ? WHERE `id` = ? | 3098328 | 43760.777 | 25541.015 | 8.243 |
| SELECT `m` . * FROM `messenger_messages` `m` WHERE ( `m` . `queue_name` = ? ) AND ( `m` . `delivered_at` IS NULL OR `m` . `delivered_at` < ? ) AND ( `m` . `available_at` <= ? ) ORDER BY `available_at` ASC LIMIT ? FOR UPDATE SKIP LOCKED | 2696084 | 4204.948 | 2.202 | 0.001 |
| INSERT INTO `messenger_messages` ( `body` , `headers` , `queue_name` , `created_at` , `available_at` ) VALUES (...) | 1552710 | 2470.059 | 1069.126 | 0.689 |
```

After:

```
mysql> SELECT DIGEST_TEXT,
    -> COUNT_STAR,
    -> ROUND(SUM_TIMER_WAIT/1e12,3) AS total_sec,
    -> ROUND(SUM_LOCK_TIME/1e12,3) AS lock_sec,
    -> ROUND((SUM_LOCK_TIME/1e12)/NULLIF(COUNT_STAR,0)*1000,3) AS avg_lock_ms
    -> FROM performance_schema.events_statements_summary_by_digest
    -> WHERE DIGEST_TEXT LIKE '%MESSENGER_MESSAGES%'
    -> ORDER BY SUM_TIMER_WAIT DESC
    -> LIMIT 10;
| DIGEST_TEXT | COUNT_STAR | total_sec | lock_sec | avg_lock_ms |
| SELECT `m` . * FROM `messenger_messages` `m` WHERE ( `m` . `queue_name` = ? ) AND ( `m` . `delivered_at` IS NULL OR `m` . `delivered_at` < ? ) AND ( `m` . `available_at` <= ? ) ORDER BY `available_at` ASC LIMIT ? FOR UPDATE SKIP LOCKED | 19002450 | 29151.318 | 22.938 | 0.001 |
| DELETE FROM `messenger_messages` WHERE `id` = ? | 12677551 | 12511.529 | 66.584 | 0.005 |
| INSERT INTO `messenger_messages` ( `body` , `headers` , `queue_name` , `created_at` , `available_at` ) VALUES (...) | 12786292 | 2260.588 | 18.044 | 0.001 |
| UPDATE `messenger_messages` SET `delivered_at` = ? WHERE `id` = ? | 12865570 | 1689.881 | 7.368 | 0.001 |
```

I imagine that the same covering index for the select query should have similar results for other databases, as this comes down to the basics of indexing columns for database performance, but obviously some help with validating would be appreciated.

I also believe this should be backported all the way down to the 6.4 branch, as this is an issue I have seen a lot of people running into, with the common advice being "just use RabbitMQ instead" while the root cause isn't investigated properly. I had the environment and authority to dig into the root cause, and this is the result of that investigation.

Commits
-------

81b9d93 [Messenger][Doctrine] Remove batched message delete for MySQL and add a covering index for a select query
I experienced a lot of lock wait time and some deadlocks on my database when working with multiple workers handling the same queue, so I fixed it in my application with a decorator. I want to contribute that back to Symfony.
Introduce an algorithm in `Connection` for MySQL platforms to minimize exclusive locking
[TODO list]