Summary
Commit 0278980 (PR #13959, "CQ shared store: Delete from index on remove or roll over") introduced a regression in the classic queue message store GC that causes unbounded disk growth under sustained publish load when a slow-consumer queue shares the same vhost as high-throughput queues.
The regression is present in the current main branch. Reverting 0278980 restores stable disk behavior.
Root Cause
PR #13959 changed scan_and_vacuum_message_file in delete_file to an eager index cleanup mechanism. As a side effect, messages removed from non-current files now produce not_found index lookups during scan_and_vacuum_message_file instead of previously_valid ones. This was noted in the PR review by @gomoripeti:
if I see it correctly compaction might become slower with this change (as during scan_file_for_valid_messages before this change there were ref-count=0 index entries which resulted in a previously_valid status while now not_found entries result in invalid status, and causing a scan_next_byte scanning mode)
Under high throughput with many queues, the GC compaction rate drops far enough that it cannot keep pace with the publish rate. Files accumulate faster than they are reclaimed, and disk usage can grow without bound. The GC stall also causes consumer latency spikes and broker unresponsiveness on established TCP connections.
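The scanning cost described in the review comment can be illustrated with a toy model. This is a sketch only, not the rabbit_msg_store implementation: the entry layout (8-byte size prefix, 16-byte id, body, 0xFF end marker), the function names, and the dict-based index are all assumptions made for illustration. It shows why a scanner that can skip a whole entry when the id is still in the index does far less work than one that must fall back to advancing a byte at a time when the lookup misses.

```python
import struct

HEADER = struct.Struct(">Q")   # toy format: 8-byte big-endian size prefix
ID_LEN = 16                    # toy format: fixed-width message id
END_MARKER = 0xFF              # toy format: one trailing marker byte

def encode(msg_id: bytes, body: bytes) -> bytes:
    """Serialize one entry as size | id | body | end marker."""
    assert len(msg_id) == ID_LEN
    payload = msg_id + body
    return HEADER.pack(len(payload)) + payload + bytes([END_MARKER])

def scan(data: bytes, index: dict):
    """Return (live ids, number of positions examined).

    index maps msg_id -> ref count. An id present with ref count 0 stands in
    for a "previously valid" entry; an absent id stands in for "not found".
    """
    pos, steps, live = 0, 0, []
    while pos + HEADER.size + ID_LEN + 1 <= len(data):
        steps += 1
        (size,) = HEADER.unpack_from(data, pos)
        end = pos + HEADER.size + size
        well_formed = (size >= ID_LEN and end < len(data)
                       and data[end] == END_MARKER)
        if well_formed:
            msg_id = data[pos + HEADER.size: pos + HEADER.size + ID_LEN]
            if msg_id in index:
                if index[msg_id] > 0:
                    live.append(msg_id)
                pos = end + 1      # known entry: skip it in one step
                continue
        pos += 1                   # unknown entry: rescan from the next byte
    return live, steps

# Ten 1000-byte messages; only the last one is still referenced.
ids = [i.to_bytes(ID_LEN, "big") for i in range(10)]
data = b"".join(encode(m, bytes(1000)) for m in ids)

keep_zero_refs = {m: 0 for m in ids[:9]}   # removed ids kept with ref count 0
keep_zero_refs[ids[9]] = 1
eager_delete = {ids[9]: 1}                 # removed ids deleted from the index

for name, index in (("ref-count 0 kept", keep_zero_refs),
                    ("eagerly deleted", eager_delete)):
    live, steps = scan(data, index)
    print(f"{name}: {len(live)} live message(s), {steps} positions examined")
```

In this model the first index lets the scanner take one step per entry, while the second forces roughly one step per byte of dead data. The real cost ratio in the broker depends on details this sketch does not capture.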
Reproduction
Three concurrent workloads on a single RabbitMQ node (m7g.large, 196 GB EBS):
- main-workload: 100 classic queues, 100 producers + 100 consumers, 120 KB messages, 5 msg/s per producer (500 msg/s aggregate), consumers acking immediately
- slow-ack-publisher: 1 producer, 3 msg/s to slow-ack-queue, 120 KB messages
- slow-ack-consumer: consumer on slow-ack-queue holding acks for 1-30 minutes (up to 1000 messages in flight simultaneously)
All queues in the same vhost with queue-version: 2 policy.
Reproduction scripts: https://github.com/lukebakken/rmq-gc-lag
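For scale, the workload figures above translate into the following data rates. This is back-of-the-envelope arithmetic only; it assumes decimal units (1 KB = 1000 bytes), which the report does not specify, and the variable names are illustrative.

```python
KB = 1000  # assumption: decimal kilobytes; the report does not say KB vs KiB

def bytes_per_second(msgs_per_s: float, msg_kb: float) -> float:
    """Payload bytes written per second for a given message rate and size."""
    return msgs_per_s * msg_kb * KB

main_peak = bytes_per_second(100 * 5, 120)  # 100 producers x 5 msg/s each
slow_ack = bytes_per_second(3, 120)         # slow-ack-publisher
pinned = 1000 * 120 * KB                    # up to 1000 unacked messages

print(f"main-workload peak ingest: {main_peak / 1e6:.2f} MB/s")
print(f"slow-ack-publisher ingest: {slow_ack / 1e6:.2f} MB/s")
print(f"slow-ack payload held unacked: up to {pinned / 1e6:.0f} MB")
```

The slow-ack queue accounts for well under 1% of the bytes published and holds at most about 120 MB of payload unacked at any time.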
Observed Behavior
With 0278980 present (main):
Free disk space declined ~3.1 GB in 6 minutes during the baseline phase (200 msg/s), then briefly recovered when the spike phase began (500 msg/s), then resumed declining. Over ~100 minutes of monitoring, free disk space fell from 185.4 GB to ~169 GB - a loss of ~16 GB. Ready messages grew from 0 to 3500-4200 as the broker fell behind on delivery, and unacked messages accumulated to 3500+ across multiple reconnect cycles.
The GC stall also causes consumer latency spikes. The broker stopped sending data on an established TCP connection long enough to trigger a client-side socket read timeout:
[AMQP Connection 10.0.1.90:5672] ERROR - An unexpected connection driver error occurred
java.net.SocketTimeoutException: Read timed out
Consumer latency at time of socket read timeout:
min/median/75th/95th/99th/max consumer latency:
64886 / 1,511,958 / 6,254,378 / 46,742,920 / 54,532,926 / 568,205,100 µs
(median 1.5s, 95th 46s, 99th 54s, max 568s)
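The disk figures reported for main can be put side by side with the queued message volume. The sketch below is arithmetic on the numbers quoted in this report, assuming decimal units and taking the upper ends of the reported ranges; it is not an additional measurement.

```python
GB = 1e9  # assumption: decimal gigabytes

def rate_mb_per_s(gb: float, minutes: float) -> float:
    """Average rate in MB/s for a change of `gb` gigabytes over `minutes`."""
    return gb * GB / (minutes * 60) / 1e6

baseline_rate = rate_mb_per_s(3.1, 6)             # ~3.1 GB in 6 minutes
overall_rate = rate_mb_per_s(185.4 - 169.0, 100)  # 185.4 GB -> ~169 GB

# Upper-end estimate of payload legitimately queued: ready + unacked messages.
queued_gb = (4200 + 3500) * 120e3 / GB

print(f"baseline-phase loss rate: {baseline_rate:.1f} MB/s")
print(f"average loss rate over the run: {overall_rate:.1f} MB/s")
print(f"payload in ready + unacked messages: about {queued_gb:.2f} GB")
```

Under these assumptions, the ready and unacked messages account for less than 1 GB of the ~16 GB lost, which is consistent with store files not being reclaimed rather than with the backlog alone.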

With 0278980 reverted (branch lukebakken/cq-gc):
Free disk space stable in a 0.5 GB oscillation band (184.96-185.47 GB) throughout three consecutive 20-minute monitoring windows (60 minutes total) under the same workload at 500 msg/s with ~1000 unacked messages. Ready messages held at 0 throughout. No latency spikes, no broker unresponsiveness.

With v3.13.7 (pre-regression, pre-refactor):
Disk stable throughout a 23-minute run under the same workload. Note: v3.13.7 predates the major rabbit_msg_store refactor that introduced the shared store architecture, so this data point establishes a pre-regression baseline but is not directly comparable to main.

Fix
Revert 0278980. Three independent improvements from that commit can be retained safely:
- compact_file/2 early-exit guard (file already deleted)
- prioritise_cast/3 in rabbit_msg_store_gc (delete requests before compaction)
- index_update_fields assertion relaxed (true= to _=)
See branch lukebakken/cq-gc on https://github.com/lukebakken/rmq-rabbitmq-server for the revert with retained improvements.
Workaround
Move queues with long consumer timeouts to a dedicated vhost. This gives them a separate message store instance whose unacked messages do not pin files in the shared store. Confirmed effective: disk stable throughout a 40-minute run with the same workload after vhost isolation.
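On the client side, the workaround presumably means declaring the slow queue in the new vhost and pointing its publisher and consumer at that vhost, since a client selects the vhost when it connects. A minimal sketch of building the two connection URIs follows. The host name, the vhost name "slow-ack", and the guest credentials are placeholders, not values from the reproduction setup.

```python
from urllib.parse import quote

def amqp_uri(host: str, vhost: str, user: str = "guest",
             password: str = "guest", port: int = 5672) -> str:
    """Build an AMQP 0-9-1 URI; the vhost is one percent-encoded path segment."""
    return (f"amqp://{quote(user, safe='')}:{quote(password, safe='')}"
            f"@{host}:{port}/{quote(vhost, safe='')}")

# High-throughput clients stay on the default vhost "/" (encoded as %2F).
main_uri = amqp_uri("broker.example", "/")

# Hypothetical dedicated vhost for the slow-ack publisher and consumer.
slow_uri = amqp_uri("broker.example", "slow-ack")

print(main_uri)
print(slow_uri)
```

The dedicated vhost has to exist and the user needs permissions on it before clients can connect with the second URI.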