@rickard-green rickard-green commented Jan 17, 2022

By default, global does not take any action to restore a fully connected network when connections are lost due to network issues. This is problematic for all applications that expect a fully connected network, such as mnesia, but also for global itself. A network of overlapping partitions might cause the internal state of global to become inconsistent, and such an inconsistency can remain even after the partitions have been brought together to form a fully connected network again. The effect on other applications that expect a fully connected network to be maintained may vary, but they might misbehave in subtle, hard-to-detect ways during such a partitioning.

To prevent such issues, we have introduced a prevent overlapping partitions fix, which can be enabled using the prevent_overlapping_partitions kernel(6) parameter. When this fix is enabled, global actively disconnects from nodes that report that they have lost connections to other nodes. This causes fully connected partitions to form instead of leaving the network in a state with overlapping partitions. Note that the fix has to be enabled on all nodes in the network in order to work properly. Since it changes the behavior quite substantially, the fix is currently disabled by default. However, since you might get hard-to-detect issues without it, you are strongly advised to enable it in order to avoid issues such as the ones described above. As of OTP 25 this fix will be enabled by default.
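As an illustration of the paragraph above, a minimal configuration sketch that turns the fix on through the kernel parameter. The file name sys.config is an assumption; any file passed with `erl -config` should work the same way, and the same setting has to be present on every node.

```erlang
%% sys.config (assumed file name, loaded with `erl -config sys`).
%% The same setting must be used on all nodes in the network.
[
 {kernel,
  [
   %% Enable the fix explicitly; it is disabled by default before OTP 25.
   {prevent_overlapping_partitions, true}
  ]}
].
```

The equivalent command-line form for a kernel parameter would be `erl -kernel prevent_overlapping_partitions true`.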

One example of a global internal inconsistency is described on the erlang-questions mailing list: https://erlang.org/pipermail/erlang-questions/2020-October/100034.html

#5687 contains this PR merged to the latest master (as of 2022-02-07) plus enabling of the kernel parameter prevent_overlapping_partitions by default.

@rickard-green rickard-green force-pushed the rickard/prevent-overlapping-partitions/24.2/ERIERL-732/OTP-17843 branch 2 times, most recently from f2ef4fd to 90e6456 Compare January 22, 2022 05:54

github-actions bot commented Jan 22, 2022

CT Test Results

No tests were run for this PR. This is either because the build failed, or the PR is based on a branch without GH actions tests configured.

Results for commit 20c9e34

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

max-au commented Feb 4, 2022

I attempted to include this PR in more tests of existing software. It fails a large number of tests where distributed applications are involved, because of the change in the default behaviour and the fact that global is enabled by default.

I wonder if there is a chance of keeping the compatibility behaviour by default, and allowing those who need this issue fixed to explicitly enable prevent_overlapping_partitions?

@rickard-green
Contributor Author

We will release this (with some modifications) as a patch on OTP 24, and in that patch it will be disabled by default due to the change in behaviour. The intent is to have it enabled by default in OTP 25, since it will be present in the initial version, where users expect larger changes to be made. To me it feels really weird to have a bugfix disabled by default, especially in the first version of a new release. I would rather disable global as such by default before disabling the bugfix by default (note that I'm not saying that we will do that).

Regardless, even if we were to have it disabled by default, all test cases in OTP have to be able to handle this fix being enabled. Otherwise, we would not be able to run tests with it enabled. Test cases that depend neither on global nor on any other application that depends on a fully connected network may of course disable global using -connect_all false.

max-au commented Feb 4, 2022

Having global disabled by default sounds even better to me. Although it is also a significant behaviour change.

What I am really concerned about is global disconnecting nodes based on local information about the cluster state. This may be subject to various race conditions, similar to all kinds of those I had with pg.
My worst nightmare would be cascading disconnects, e.g. when a node in the cluster loses connection to another, and then all other nodes of the cluster also decide to disconnect the node that had a small hiccup.

I wonder if there is any other solution that does not involve disconnection? Apart from https://erlang.org/pipermail/erlang-questions/2020-October/100034.html are there any other manifestations of this bug? I fear that the bugfix itself will cause more churn than leaving the bug there (or solving it differently, if you're open for discussions on that).

okeuday commented Feb 6, 2022

@rickard-green I would have expected something similar to a kernel app configuration parameter like dist_auto_disconnect (like dist_auto_connect). That would make it easier to disable for obtaining the desired behavior among specific source code. That should also make misconfiguration less likely. People would be making the decision based on their source code's use of global which can be separate from how the source code is executed.

…2/OTP-17843' into rickard/prevent-overlapping-partitions/23.3.4/ERIERL-732/OTP-17843

* rickard/prevent-overlapping-partitions/22.3.4/ERIERL-732/OTP-17843:
  global: Preventing overlapping partitions fix
  global: Propagate and save version between all nodes
  kernel: Fix a race condition in Global
…2/OTP-17843' into rickard/prevent-overlapping-partitions/24.2/ERIERL-732/OTP-17843

* rickard/prevent-overlapping-partitions/23.3.4/ERIERL-732/OTP-17843:
  global: Preventing overlapping partitions fix
  global: Propagate and save version between all nodes
@rickard-green
Contributor Author

Copies of conversations in #5417 related to this PR

@rickard-green wrote:

@max-au wrote:

I verified that both workarounds keep tests passing, and can make any desired change. But to be honest I'm not sure whether global can make such decisions locally and disconnect nodes at will.

One of global's main tasks is to maintain a fully connected network. Other applications such as mnesia rely on this. Currently global fails at providing this service in the presence of network failures. The only way *) to provide this service when connections are lost is to remove other connections so that you end up with fully connected partitions.

If global is disabled (-connect_all false), it won't provide a fully connected network and will also not take down connections like this.

Since this is quite a big change in behaviour compared to before, it will also be possible to disable this bugfix (-prevent_overlapping_partitions false), but then you might get intricate issues with applications relying on the fully connected network service being provided.

*) This is not completely true. You could also provide this service by implementing "virtual connections" routing signals over other connected nodes. We would, however, not be able to deliver such a solution in the near future. That is, that can only be a long term option. Note that I'm not saying whether or not "virtual connections" will be implemented, just that it could be an option for the future.

@rickard-green wrote:

@max-au wrote:

I found why I was not able to reproduce the failure. I merged the patch from the maint branch, but the feature (disconnecting nodes global does not like) was not enabled in that branch. Likely the version merged to master (#5611) has it enabled by default.

Yes, it is only enabled in our master tests. I'll soon make a PR available with the changes I've made in master which hopefully can be helpful.

Adding catch to all peer:stop calls makes the tests pass, but it means nearly all tests starting more than a single peer node should either disable global, or swallow peer node shutdown errors. This makes tests less useful, as it loses signal previously delivered from unexpected peer node crashes.

I don't think that is the correct way to handle this. The termination is due to nodes behaving as old style slave nodes. I think the correct way to handle this is to use connection => 0, which makes nodes behave as old style peer nodes. This is what I've done in the test cases that have needed it. These include not only tests run with prevent_overlapping_partitions enabled, but also tests which perform explicit disconnects. This is also how disconnects were handled before the CT_PEER changes were introduced.

It also prevents us from running test cases in parallel! I am right now running rpc_SUITE tests concurrently, completing it in just a few seconds. This is one of the improvements that I'd like for OTP tests (so that we can get testing signal in just a few minutes, for the entire OTP codebase).

Yes, it would have been nice to be able to run everything in parallel, but that is not always possible. prevent_overlapping_partitions is not the only thing that prevents parallel testing.

When it comes to rpc_SUITE I think that I've disabled global if I remember correctly. I don't think it is a problem to disable global if it is not involved in the testing. In that case prevent_overlapping_partitions won't prevent parallel execution.

@rickard-green
Contributor Author

@okeuday wrote:

@rickard-green I would have expected something similar to a kernel app configuration parameter like dist_auto_disconnect (like dist_auto_connect). That would make it easier to disable for obtaining the desired behavior among specific source code. That should also make misconfiguration less likely. People would be making the decision based on their source code's use of global which can be separate from how the source code is executed.

Yes, I've changed it to a kernel parameter, I will push that change soon. I've not changed -connect_all to a kernel parameter in this branch which I think we should do too, but I'll make that change in a separate branch.

rickard-green commented Feb 7, 2022

Copies of conversations in #5654 related to this PR

@rickard-green wrote:

@max-au wrote:

I reproduced pg failures with #5611 using master branch (before this PR) with prevent_overlapping_partitions set to true.

Since its introduction, pg_SUITE has tried to disable global but has failed to do so: because of a missing space, the cookie is concatenated with the -connect_all false argument:

 -spec controller(atom(), atom(), pid()) -> ok.
 controller(Name, Scope, Self) ->
     Pa = filename:dirname(code:which(?MODULE)),
     Pa2 = filename:dirname(code:which(pg)),
     Args = lists:concat(["-setcookie ", erlang:get_cookie(),
-            "-connect_all false -kernel dist_auto_connect never -noshell -pa ", Pa, " -pa ", Pa2]),
+            " -connect_all false -kernel dist_auto_connect never -noshell -pa ", Pa, " -pa ", Pa2]),
     {ok, Node} = test_server:start_node(Name, peer, [{args, Args}]),
     case rpc:call(Node, ?MODULE, control, [Scope], 5000) of
         {badrpc, nodedown} ->
             Self ! {badrpc, Node},
             ok;

When I made the above fix (to an unmodified pg_SUITE on master), pg_SUITE did not fail more with #5611 than it did before #5611 (it has never run completely clean on all platforms).
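To make the effect of the missing space concrete, here is a small sketch that is not part of pg_SUITE. The module name concat_args_demo and the function args/2 are made up for illustration. Only lists:concat/1 is a real library call.

```erlang
-module(concat_args_demo).
-export([args/2]).

%% Build a (shortened) peer-node argument string the same way the test
%% suite does. Sep is the text placed between the cookie and the
%% -connect_all switch: "" reproduces the bug, " " is the fix.
-spec args(atom(), string()) -> string().
args(Cookie, Sep) ->
    %% lists:concat/1 turns atoms into text and joins everything without
    %% adding any separators of its own.
    lists:concat(["-setcookie ", Cookie, Sep, "-connect_all false"]).
```

With a cookie of secret, args(secret, "") yields "-setcookie secret-connect_all false", so the node is started with the cookie secret-connect_all and the stray word false, and global stays enabled. args(secret, " ") yields "-setcookie secret -connect_all false" as intended.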

I think that new implementation of global and pg have contradicting assumptions: pg is designed to work with missing links in the mesh (tolerating netsplits), while global wants to disconnect all loose ends and keep only the strongly connected component.

I don't think pg and global contradict each other any more now with #5611 than they did when pg was introduced. global has always had the idea of a fully connected mesh (although it failed to provide that).

global_group_SUITE has that issue too, but it ignores slave shutdown issues, see https://github.com/erlang/otp/blob/master/lib/kernel/test/global_group_SUITE.erl#L1214

As I wrote in #5417 I think this should be solved by using connection => 0.

I'll wait for the OTP board decision on #5611, as changing the default behaviour that way may leave a significant number of tests broken, or require disabling global via -connect_all false for all peer nodes.

Using -connect_all false for all peer nodes will not be acceptable, since it would then not be possible to test anything using global. The only tests affected by #5611 should be tests making explicit disconnects. There is a huge number of tests that have no issue whatsoever with having prevent_overlapping_partitions enabled.

Copying this to #5611 since it relates to that PR

@rickard-green rickard-green force-pushed the rickard/prevent-overlapping-partitions/24.2/ERIERL-732/OTP-17843 branch from 90e6456 to 20c9e34 Compare February 7, 2022 01:08

rickard-green commented Feb 7, 2022

@max-au wrote:

Having global disabled by default sounds even better to me. Although it is also a significant behaviour change.

I'm personally not that attached to global, but I think we probably should not disable global by default. However, compared to having a broken global by default, I think it would be better.

What I am really concerned about is global disconnecting nodes based on local information about the cluster state. This may be subject to various race conditions, similar to all kinds of those I had with pg. My worst nightmare would be cascading disconnects, e.g. when a node in the cluster loses connection to another, and then all other nodes of the cluster also decide to disconnect the node that had a small hiccup.

I wonder if there is any other solution that does not involve disconnection? Apart from https://erlang.org/pipermail/erlang-questions/2020-October/100034.html are there any other manifestations of this bug? I fear that the bugfix itself will cause more churn than leaving the bug there (or solving it differently, if you're open for discussions on that).

There are unfortunately (as I wrote in the comment from #5417 above) only two alternatives if global (or anything else) is to provide a fully connected mesh: either take down connections or keep them up by other means.

When I viewed this as a global internal issue, I worked on another solution: simulating a nodedown/nodeup pair internally in global without actually taking down any connections. This should prevent global from getting internal inconsistencies. While working on that, I realized that such a solution would just make things worse for other applications, such as mnesia, that rely on global providing a fully connected mesh, so I dropped it. The reason is that global would then just hide information about these issues taking place, preventing such applications from taking appropriate actions.

@rickard-green
Contributor Author

@max-au #5687 contains this PR merged to the latest master plus enabling of the kernel parameter prevent_overlapping_partitions by default.

@rickard-green rickard-green changed the base branch from maint-24 to maint February 8, 2022 12:47
@rickard-green
Contributor Author

@max-au

We've had an OTP Technical Board meeting discussing defaults for global. The decisions made were that global will remain enabled by default and as of OTP 25 the prevent_overlapping_partitions fix will also be enabled by default.

Regarding the issues we've seen with these defaults together with peer: peer:stop() will by default stop nodes by calling erlang:disconnect_node(PeerNode). From global's perspective this is problematic, since global on PeerNode will react to this and potentially cause even more connections to be dropped. This problem is, however, not specific to global. Other (current and/or future) applications executing on PeerNode may also react to the connection loss in undesirable ways. That is, peer:stop() at least needs to try to halt the node properly before removing the connection. It can, as a last resort, disconnect from the node if it is unresponsive, though. I've made a pull request, #5705, implementing this and fixing some other issues. @bmk and I tested a simpler version of this, which seems to fix all snmp test issues. That version of the fix had other issues, though.

@max-au @okeuday

Regarding kernel parameters: in this PR there is now the prevent_overlapping_partitions kernel parameter. If the -connect_all switch is introduced as a kernel parameter as well, which I think should be named global since it enables/disables global, we get yet another kernel parameter for global. Do you think it is better to have a single global kernel parameter whose value is a list of two-tuples like [{enable, true | false}, {prevent_overlapping_partitions, true | false}], or are two separate boolean parameters, global and prevent_overlapping_partitions, better? Or do you have any other suggestions?
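For comparison, the two shapes being asked about could look roughly like this in a config file. This is only a sketch of the proposals in this comment: a kernel parameter named global is hypothetical here, and only prevent_overlapping_partitions exists in this PR.

```erlang
%% Option A (proposal only): a single `global` parameter holding an
%% option list.
[{kernel, [{global, [{enable, true},
                     {prevent_overlapping_partitions, true}]}]}].

%% Option B (proposal only): two separate boolean parameters.
%% [{kernel, [{global, true},
%%            {prevent_overlapping_partitions, true}]}].
```

Option A keeps everything related to global in one place, while Option B keeps each value a plain boolean.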

okeuday commented Feb 11, 2022

@rickard-green My concern regarding the prevent_overlapping_partitions kernel parameter was focused on the ability to disable it in a config file to ensure global doesn't attempt to disconnect nodes. I don't think I would disable global, though I avoid its use. My view is that overlapping partitions should be able to exist without getting resolved, due to use of a separate process group (e.g., https://github.com/okeuday/cpg/ handles both hidden and visible node connections, so any network topology is allowed). If global use blocked infinitely due to overlapping partitions, I would still consider that better than automatically disconnecting nodes (for my use), though I understand the prevent_overlapping_partitions feature is important for global use.

A global kernel parameter with a value of [{enable, true | false}, {prevent_overlapping_partitions, true | false}] is a good approach. An alternative I was thinking about was dist_auto_disconnect with a value of never | global (possibly other values in the future) instead of a prevent_overlapping_partitions value. The name dist_auto_disconnect makes the potential to cause node disconnects clearer (to avoid the reaction of "oh no, where did my nodes go, why is this happening, ahhhh!", the logging output may already make the circumstance clear but I would assume people would want to be aware of the potential disconnects in advance of the occurrence(s)).

A way to disable global with a kernel parameter is important and using a global parameter name for that with a list of options is a good solution.

max-au commented Feb 11, 2022

@rickard-green thank you for the comprehensive update!
As I commented in #5705, it was a design choice to have brutal_kill acting brutally. It is however questionable whether it should be the default mode or not (before #5611, I think it was a good choice as it sped up all tests, but now it can be reconsidered). If you believe that peer shutdown should be graceful by default, I won't object (but please leave the choice to have brutally fast stop for tests that want that).

As for global kernel parameter, I prefer two separate parameters, like -kernel global_enabled true and -kernel prevent_overlapping_partitions true. The reason is, if I ever need to override that via command line, it is simple to do with separate parameters. Figuring out bash/zsh/...sh escaping to pass [{enable, true | false}, {prevent_overlapping_partitions, true | false}] scares me off.

I also agree with @okeuday that a name clearly designating the actual action (disconnecting otherwise perfectly fine nodes) should have disconnect (and probably force) in it. To me "prevent" means "do not allow to connect", which isn't exactly true: those nodes would successfully connect and disconnect in rapid succession (e.g. "blink", go up/down/up/down).

@rickard-green
Contributor Author

@max-au @okeuday

We are much closer to the first release candidate than I thought, and this needs to be present in it, so it is too late to change the naming of parameters before the merge. The naming of the parameters and the format of their values can, however, still be changed up until we've released it in a non-release-candidate version. We'll have to have a discussion about it internally here at OTP as well once the release candidate is out.

@rickard-green rickard-green merged commit ed2a8c5 into erlang:maint Feb 11, 2022
ns-codereview pushed a commit to couchbase/ns_server that referenced this pull request Feb 22, 2023
Disable the new check in erlang 25 that "prevents overlapping
partitions". It is tied to the new facility of the "global" module
that gives a few features such as name registration and simple locking
across a cluster. We do not use this module, but instead have
implemented this ourselves. Perhaps this upgraded "global" module
could be used in place of our custom solution, but in the meantime
this causes problems while nodes are attempting to join together in
some situations. It has also caused some problems with remote shell
connections. Changing this setting puts this feature back to the way
it was in erlang 24 so we can decide at another time if it's worth
using the new changes from the global module.

Here are some link(s) that explains the new feature and why this is
problematic:
 - https://www.erlang.org/patches/otp-25.0
 - erlang/otp#5611
 - erlang/otp#5687
 - https://stackoverflow.com/q/73567169
 - https://stackoverflow.com/a/73578740

Change-Id: Ibd810eadf4d0716e399b4d5c5f6c2c60a6b1675e
Reviewed-on: https://review.couchbase.org/c/ns_server/+/186561
Tested-by: Build Bot
Tested-by: Bryan McCoid
Well-Formed: Build Bot
Reviewed-by: Artem Stemkovski
ns-codereview pushed a commit to couchbase/couchdbx-app that referenced this pull request Feb 22, 2023
Change-Id: I96b7bf67a6a0a41230f5b10e5c80dab1005e5be8
Reviewed-on: https://review.couchbase.org/c/couchdbx-app/+/187155
Tested-by: Bryan McCoid
Reviewed-by: Hareen Kancharla
ns-codereview pushed a commit to couchbase/ns_server that referenced this pull request Nov 20, 2023
This should have originally been tagged with MB-54582, as this was
introduced with the erlang 25 upgrade.

Change-Id: Ibd810eadf4d0716e399b4d5c5f6c2c60a6b1675e
Reviewed-on: https://review.couchbase.org/c/ns_server/+/201009
Tested-by: Build Bot
Tested-by: Ben Huddleston
Well-Formed: Restriction Checker
Well-Formed: Build Bot
Reviewed-by: Dave Finlay
Labels

fix team:VM Assigned to OTP team VM
