ames, gall: fix "nacked-leave" logic #6954
Merged
The gall-ames desync caused a %leave and a %cork to be sent one after the other to %ames, both to be delivered to %ames on the remote server. The change in #6759 made it so that %gall only sends a %cork to %ames once the %leave gets acked, and then removes the %leave from its queue. If a %leave gets %nacked, we start a timer that goes over every outstanding %leave and resends it (if there's only one message in the outstanding %gall queue, and it's a %leave). This is a problem because some of those %leaves could belong to dead flows and have never been acked, so resending them creates a space leak in %ames' unsent-message queue.
This was made worse by the bug introduced in the %flub logic when the specific message we deliver to the vane is a %leave (see #6953), since now more %leaves are going to be %nacked. This was discovered on ~norsyr-torryn, where multiple "...on closing bone, ignoring" messages were seen. Investigating further revealed that these were %leave %pleas, and also that ~halbex-palheb was %nacking the same %leave over and over. Looking into why this happened revealed the situation described in #6759.
Here we just "flag" %nacked %leaves by adding a %missing request, which is checked when the nacked-leaves timer fires so that %leaves belonging to dead flows are skipped.
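
To make the flagging idea concrete, here is a rough Python model of it. This is not the Hoon in this PR: the `Outstanding` class, its `queues` map, and the `on_nack`/`on_timer` names are all hypothetical, and the sketch assumes the flag is a `"missing"` entry queued right behind the lone `"leave"`.

```python
from dataclasses import dataclass, field


@dataclass
class Outstanding:
    """Toy stand-in for gall's outstanding-request queues (hypothetical)."""

    # flow id -> list of request tags, oldest first
    queues: dict = field(default_factory=dict)

    def on_nack(self, flow):
        # Flag a nacked %leave, but only when it is the sole outstanding request.
        if self.queues.get(flow) == ["leave"]:
            self.queues[flow].append("missing")

    def on_timer(self):
        # Resend only flagged leaves. Unflagged leaves were never nacked
        # (e.g. they sit on dead flows), so they are skipped.
        return [
            flow
            for flow, queue in self.queues.items()
            if queue == ["leave", "missing"]
        ]
```

In this model, a flow whose %leave was never nacked keeps its queue as `["leave"]` and is never picked up by the timer, so nothing new is pushed into the unsent-message queue for it.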
I've also added trace-logging (under the %odd flag) in %ames for sending %pleas on closing bones. This is more contentious: this PR should fix what has caused the "closing bone" message, so the ~& shouldn't be hidden behind a logging flag, and the specific commit should probably be reverted.
The situation for the "...on closing bone" messages seems different from the nacked-leave scenario. I suspect that these are flows that came via the gall-ames desync and were marked as closing (%cork and %leave were sent at the same time to %ames), but the %leave got dropped on the server, so the client %ames just resends it, while %gall still keeps the outstanding %leave in its queue. The %nacking of a %leave, triggered by the bug in #6953, made it so that all outstanding %leave requests are sent to %ames, even for flows that are closing.