Prevent global inconsistency by preventing overlapping partitions #5611
f2ef4fd to
90e6456
Compare
I attempted to include this PR in more tests of existing software. It fails a large number of tests where distributed applications are involved, because of the change in the default behaviour, and the fact that by default I wonder if there is a chance of keeping the compatibility behaviour by default, and allowing to explicitly enable |
|
We will release this (with some modifications) as a patch on OTP 24, and then it will be disabled by default due to the change in behaviour. The intent is to have it enabled by default in OTP 25 since it will be present in the initial version where users expect larger changes to be made. To me it feels really weird having a bugfix disabled by default especially in the first version of a new release. I would rather disable Regardless, even if we were to have it disabled by default, all test cases in OTP have to be able to handle this fix being enabled. Otherwise, we would not be able to run tests with it enabled. Test cases not depending on |
|
Having What I am really concerned about is I wonder if there is any other solution that does not involve disconnection? Apart from https://erlang.org/pipermail/erlang-questions/2020-October/100034.html are there any other manifestations of this bug? I fear that the bugfix itself will cause more churn than leaving the bug there (or solving it differently, if you're open for discussions on that). |
|
@rickard-green I would have expected something similar to a kernel app configuration parameter like |
…2/OTP-17843' into rickard/prevent-overlapping-partitions/23.3.4/ERIERL-732/OTP-17843 * rickard/prevent-overlapping-partitions/22.3.4/ERIERL-732/OTP-17843: global: Preventing overlapping partitions fix global: Propagate and save version between all nodes kernel: Fix a race condition in Global
…2/OTP-17843' into rickard/prevent-overlapping-partitions/24.2/ERIERL-732/OTP-17843 * rickard/prevent-overlapping-partitions/23.3.4/ERIERL-732/OTP-17843: global: Preventing overlapping partitions fix global: Propagate and save version between all nodes
|
Copies of conversations in #5417 related to this PR @rickard-green wrote:
@rickard-green wrote:
|
|
@okeuday wrote:
Yes, I've changed it to a kernel parameter, I will push that change soon. I've not changed |
|
Copies of conversations in #5654 related to this PR @rickard-green wrote:
|
90e6456 to
20c9e34
Compare
|
@max-au wrote:
I'm personally not that attached to
There are unfortunately (as I wrote in the comment from #5417 above) only two alternatives if I worked on another solution, simulating a |
|
We've had an OTP Technical Board meeting discussing defaults for Regarding the issues we've seen with these defaults together with Regarding kernel parameters. In this PR there is now the |
|
@rickard-green My concern regarding the A A way to disable |
|
@rickard-green thank you for the comprehensive update! As for I also agree with @okeuday that a name clearly designating the actual action (disconnecting otherwise perfectly fine nodes) should have |
|
We are much closer to the first release candidate than I thought, and this needs to be present in it, so it is too late to change the naming of parameters before the merge. The naming of the parameters and the format of their values can, however, still be changed up until we've released it in a non-release-candidate version. We'll have to have a discussion about it internally here at OTP as well once the release candidate is out. |
Disable the new check in erlang 25 that "prevents overlapping partitions". It is tied to the new facility of the "global" module that gives a few features such as name registration and simple locking across a cluster. We do not use this module, but instead have implemented this ourselves. Perhaps this upgraded "global" module could be used in place of our custom solution, but in the meantime this causes problems while nodes are attempting to join together in some situations. It has also caused some problems with remote shell connections. Changing this setting puts this feature back to the way it was in erlang 24 so we can decide at another time if it's worth using the new changes from the global module. Here are some link(s) that explains the new feature and why this is problematic: - https://www.erlang.org/patches/otp-25.0 - erlang/otp#5611 - erlang/otp#5687 - https://stackoverflow.com/q/73567169 - https://stackoverflow.com/a/73578740 Change-Id: Ibd810eadf4d0716e399b4d5c5f6c2c60a6b1675e Reviewed-on: https://review.couchbase.org/c/ns_server/+/186561 Tested-by: Build Bot <[email protected]> Tested-by: Bryan McCoid <[email protected]> Well-Formed: Build Bot <[email protected]> Reviewed-by: Artem Stemkovski <[email protected]>
Disable the new check in erlang 25 that "prevents overlapping partitions". It is tied to the new facility of the "global" module that gives a few features such as name registration and simple locking across a cluster. We do not use this module, but instead have implemented this ourselves. Perhaps this upgraded "global" module could be used in place of our custom solution, but in the meantime this causes problems while nodes are attempting to join together in some situations. It has also caused some problems with remote shell connections. Changing this setting puts this feature back to the way it was in erlang 24 so we can decide at another time if it's worth using the new changes from the global module. Here are some link(s) that explains the new feature and why this is problematic: - https://www.erlang.org/patches/otp-25.0 - erlang/otp#5611 - erlang/otp#5687 - https://stackoverflow.com/q/73567169 - https://stackoverflow.com/a/73578740 Change-Id: I96b7bf67a6a0a41230f5b10e5c80dab1005e5be8 Reviewed-on: https://review.couchbase.org/c/couchdbx-app/+/187155 Tested-by: Bryan McCoid <[email protected]> Reviewed-by: Hareen Kancharla <[email protected]>
Disable the new check in erlang 25 that "prevents overlapping partitions". It is tied to the new facility of the "global" module that gives a few features such as name registration and simple locking across a cluster. We do not use this module, but instead have implemented this ourselves. Perhaps this upgraded "global" module could be used in place of our custom solution, but in the meantime this causes problems while nodes are attempting to join together in some situations. It has also caused some problems with remote shell connections. Changing this setting puts this feature back to the way it was in erlang 24 so we can decide at another time if it's worth using the new changes from the global module. Here are some link(s) that explains the new feature and why this is problematic: - https://www.erlang.org/patches/otp-25.0 - erlang/otp#5611 - erlang/otp#5687 - https://stackoverflow.com/q/73567169 - https://stackoverflow.com/a/73578740 This should have originally been tagged with MB-54582, as this was introduced with the erlang 25 upgrade. Change-Id: Ibd810eadf4d0716e399b4d5c5f6c2c60a6b1675e Reviewed-on: https://review.couchbase.org/c/ns_server/+/201009 Tested-by: Build Bot <[email protected]> Tested-by: Ben Huddleston <[email protected]> Well-Formed: Restriction Checker Well-Formed: Build Bot <[email protected]> Reviewed-by: Dave Finlay <[email protected]>
By default `global` does not take any actions to restore a fully connected network when connections are lost due to network issues. This is problematic for all applications expecting a fully connected network to be provided, such as `mnesia`, but also for `global` itself. A network of overlapping partitions might cause the internal state of `global` to become inconsistent. Such an inconsistency can remain even after the partitions have been brought together to form a fully connected network again. The effect on other applications that expect a fully connected network to be maintained may vary, but they might misbehave in very subtle, hard-to-detect ways during such a partitioning.

In order to prevent such issues, we have introduced a prevent overlapping partitions fix which can be enabled using the `prevent_overlapping_partitions` kernel(6) parameter. When this fix has been enabled, `global` will actively disconnect from nodes that report that they have lost connections to other nodes. This will cause fully connected partitions to form instead of leaving the network in a state with overlapping partitions. Note that this fix has to be enabled on all nodes in the network in order to work properly. Since this quite substantially changes the behavior, this fix is currently disabled by default. Since you might get hard-to-detect issues without this fix you are, however, strongly advised to enable it in order to avoid issues such as the ones described above. As of OTP 25 this fix will become enabled by default.

One example of a `global` internal inconsistency is described on the erlang-questions mailing list: https://erlang.org/pipermail/erlang-questions/2020-October/100034.html

#5687 contains this PR merged to the latest master (as of 2022-02-07) plus enabling of the kernel parameter `prevent_overlapping_partitions` by default.
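For readers who want to see what setting the kernel parameter from the description looks like, here is a minimal sketch of a configuration file. It assumes the parameter takes a boolean value, as documented in kernel(6); the `sys.config` file name is just the conventional one and is not specified in this PR.

```erlang
%% sys.config (conventional file name; an assumption, not taken from this PR)
%%
%% Enables the "prevent overlapping partitions" fix on this node.
%% The description notes that the fix has to be enabled on all nodes
%% in the network in order to work properly, so every node would need
%% the same setting.
[
 {kernel, [
   {prevent_overlapping_partitions, true}
 ]}
].
```

The same value can also be given on the command line when starting a node, for example `erl -kernel prevent_overlapping_partitions true`. Setting it to `false` instead selects the behaviour without the fix.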