Classic queue message store GC cannot keep pace under high throughput after 0278980ba0 #16141

## Summary

Commit 0278980ba0 (PR #13959, "CQ shared store: Delete from index on remove or roll over") introduced a regression in the classic queue message store GC that causes unbounded disk growth under sustained publish load when a slow-consumer queue shares the same vhost as high-throughput queues.

The regression is present in the current `main` branch. Reverting 0278980ba0 restores stable disk behavior.

## Root Cause

PR #13959 changed the index cleanup in `scan_and_vacuum_message_file` in `delete_file` to an eager mechanism: entries are now deleted from the index on remove or roll over. As a side effect, index lookups for messages removed from non-current files now return `not_found` during `scan_and_vacuum_message_file` instead of `previously_valid`. This was noted in the PR review by @gomoripeti:

> if I see it correctly compaction might become slower with this change (as during `scan_file_for_valid_messages` before this change there were ref-count=0 index entries which resulted in a `previously_valid` status while now `not_found` entries result in `invalid` status, and causing a scan_next_byte scanning mode)

Under high throughput with many queues, the GC compaction rate drops far enough that it cannot keep pace with the publish rate. Files accumulate faster than they are reclaimed, and disk usage can grow without bound. The GC stall also causes consumer latency spikes and broker unresponsiveness on established TCP connections.
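
The difference described in the review comment can be illustrated with a toy cost model. This is only a sketch: `scan_steps`, `HEADER_BYTES`, and the entry layout are hypothetical and are not taken from `rabbit_msg_store`. It assumes that an entry with an index hit (even at ref-count 0) is skipped in one step, while an entry with no index entry is crossed one byte at a time.

```python
# Toy cost model: NOT the real rabbit_msg_store scanner or file format.
HEADER_BYTES = 8  # hypothetical per-entry header size


def scan_steps(entries, index):
    """Count scan steps over one store file.

    entries: list of (msg_id, body_bytes) in file order.
    index:   dict of msg_id -> ref count.

    An entry found in the index, even with ref count 0, lets the scanner
    trust the header and jump past the body in one step. An entry missing
    from the index is treated as invalid, so the scanner advances one
    byte at a time across it.
    """
    steps = 0
    for msg_id, body_bytes in entries:
        if msg_id in index:
            steps += 1
        else:
            steps += HEADER_BYTES + body_bytes
    return steps


# 100 removed messages of 120,000 bytes each in a non-current file.
entries = [(i, 120_000) for i in range(100)]

lazy_index = {i: 0 for i in range(100)}  # ref-count=0 entries still present
eager_index = {}                         # entries already deleted from the index

lazy_steps = scan_steps(entries, lazy_index)
eager_steps = scan_steps(entries, eager_index)
print(lazy_steps, eager_steps)
```

Step counts here are not timings. They only show how the scan cost in this model scales with message size once removed entries no longer resolve in the index.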

## Reproduction

Three concurrent workloads on a single RabbitMQ node (m7g.large, 196 GB EBS):

- **main-workload**: 100 classic queues, 100 producers + 100 consumers, 120 KB messages, 5 msg/s per producer (500 msg/s aggregate), consumers acking immediately
- **slow-ack-publisher**: 1 producer, 3 msg/s to `slow-ack-queue`, 120 KB messages
- **slow-ack-consumer**: consumer on `slow-ack-queue` holding acks for 1-30 minutes (up to 1000 messages in flight simultaneously)

All queues in the same vhost with `queue-version: 2` policy.
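
For a sense of scale, the payload volumes implied by the workload above can be worked out directly. This is a rough sketch that assumes 1 KB = 1000 bytes (the report does not say KB or KiB) and counts message payload only, ignoring protocol and store overhead. The variable names are illustrative.

```python
KB = 1000  # assumption: decimal kilobytes
MSG_BYTES = 120 * KB

# main-workload: 100 producers at 5 msg/s each
main_msgs_per_sec = 100 * 5
main_bytes_per_sec = main_msgs_per_sec * MSG_BYTES

# slow-ack-publisher: 1 producer at 3 msg/s
slow_bytes_per_sec = 3 * MSG_BYTES

# slow-ack-consumer: up to 1000 messages held unacked at once
in_flight_bytes = 1000 * MSG_BYTES

# Time for the slow-ack publisher alone to produce 1000 messages
seconds_to_1000_in_flight = 1000 / 3

print(main_msgs_per_sec, "msg/s")
print(main_bytes_per_sec / 1_000_000, "MB/s main payload")
print(slow_bytes_per_sec / 1000, "KB/s slow-ack payload")
print(in_flight_bytes / 1_000_000, "MB payload held unacked")
```

Under these assumptions the slow-ack queue holds a small amount of payload relative to the main workload's publish rate.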

Reproduction scripts: https://github.com/lukebakken/rmq-gc-lag

## Observed Behavior

**With 0278980ba0 present (`main`):**

Free disk space declined ~3.1 GB in 6 minutes during the baseline phase (200 msg/s), briefly recovered when the spike phase began (500 msg/s), then resumed declining. Over ~100 minutes of monitoring, free disk space fell from 185.4 GB to ~169 GB, a loss of ~16 GB. Ready messages grew from 0 to 3500-4200 as the broker fell behind on delivery, and unacked messages accumulated to 3500+ across multiple reconnect cycles.
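
The figures above translate into average net loss rates as follows. This is a back-of-the-envelope sketch that assumes decimal gigabytes and treats the reported disk figures as free space. The helper name is illustrative.

```python
GB = 1_000_000_000  # assumption: decimal gigabytes
MB = 1_000_000


def decline_rate_mb_per_s(gb_lost, minutes):
    """Average net loss of free disk space in MB/s over the given window."""
    return gb_lost * GB / (minutes * 60) / MB


# Baseline phase: ~3.1 GB in 6 minutes
baseline_rate = decline_rate_mb_per_s(3.1, 6)

# Whole monitoring window: 185.4 GB down to ~169 GB over ~100 minutes
overall_rate = decline_rate_mb_per_s(185.4 - 169.0, 100)

print(round(baseline_rate, 2), "MB/s during baseline")
print(round(overall_rate, 2), "MB/s over the full window")
```

These are averages over the stated windows, so they smooth over the brief recovery at the start of the spike phase.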

The GC stall also causes consumer latency spikes. The broker stopped sending data on an established TCP connection long enough to trigger a client-side socket read timeout:

```
[AMQP Connection 10.0.1.90:5672] ERROR - An unexpected connection driver error occurred
java.net.SocketTimeoutException: Read timed out
```

Consumer latency at time of socket read timeout:
```
min/median/75th/95th/99th/max consumer latency:
64886 / 1,511,958 / 6,254,378 / 46,742,920 / 54,532,926 / 568,205,100 µs
```

(median 1.5 s, 95th 46.7 s, 99th 54.5 s, max 568 s)
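
The conversion from the reported microsecond values to seconds can be checked with a few lines. The dictionary keys are just labels for the columns in the output above.

```python
# Values copied from the latency line above, in microseconds.
latencies_us = {
    "min": 64_886,
    "median": 1_511_958,
    "75th": 6_254_378,
    "95th": 46_742_920,
    "99th": 54_532_926,
    "max": 568_205_100,
}

# Convert each value to seconds.
latencies_s = {name: us / 1_000_000 for name, us in latencies_us.items()}

for name, seconds in latencies_s.items():
    print(f"{name}: {seconds:.3f} s")
```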

![Grafana dashboard showing continuous disk decline on unpatched main](https://raw.githubusercontent.com/lukebakken/rmq-gc-lag/main/screenshots/rabbitmq-main.png)

**With 0278980ba0 reverted (branch `lukebakken/cq-gc`):**

Free disk space stayed within a 0.5 GB oscillation band (184.96-185.47 GB) throughout three consecutive 20-minute monitoring windows (60 minutes total) under the same workload at 500 msg/s with ~1000 unacked messages. Ready messages held at 0 throughout. No latency spikes, no broker unresponsiveness.

![Grafana dashboard showing stable disk on patched lukebakken/cq-gc branch](https://raw.githubusercontent.com/lukebakken/rmq-gc-lag/main/screenshots/rabbitmq-lukebakken-cq-gc.png)

**With v3.13.7 (pre-regression, pre-refactor):**

Free disk space was stable throughout a 23-minute run under the same workload. Note: v3.13.7 predates the major `rabbit_msg_store` refactor that introduced the shared store architecture, so this data point establishes a pre-regression baseline but is not directly comparable to `main`.

![Grafana dashboard showing stable disk on v3.13.7](https://raw.githubusercontent.com/lukebakken/rmq-gc-lag/main/screenshots/rabbitmq-v3.13.7.png)

## Fix

Revert 0278980ba0. Three independent improvements from that commit can be retained safely:

- `compact_file/2` early-exit guard (file already deleted)
- `prioritise_cast/3` in `rabbit_msg_store_gc` (delete requests before compaction)
- `index_update_fields` assertion relaxed (`true=` to `_=`)

See branch `lukebakken/cq-gc` on https://github.com/lukebakken/rmq-rabbitmq-server for the revert with retained improvements.

## Workaround

Move queues with long consumer timeouts to a dedicated vhost. This gives them a separate message store instance whose unacked messages do not pin files in the shared store. Confirmed effective: disk stable throughout a 40-minute run with the same workload after vhost isolation.

