HEP: Migrate from Wicked to NetworkManager [skip ci] #9039
base: master
Force-pushed from 65c461a to 81a1c2a
The HEP looks to have covered the broad areas, thanks.
> With new installs, `/oem/90_custom.yaml` will include NetworkManager connection profiles instead of `ifcfg-*` files.
>
> With upgrades, the existing `/oem/90_custom.yaml` file will still include the old `ifcfg-*` files, which will be ignored by NetworkManager.
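To make the contrast concrete, here's a hedged sketch of how the old-style configuration is typically embedded in `/oem/90_custom.yaml` today. The stage name, file name, and values are illustrative assumptions, not text from the HEP:

```yaml
stages:
  initramfs:
  - files:
    # Wicked-era style: an ifcfg file written out by the cloud-init (yip) config.
    # Per the HEP text quoted above, an entry like this is simply ignored by
    # NetworkManager after an upgrade.
    - path: /etc/sysconfig/network/ifcfg-mgmt-bo
      content: |
        STARTMODE='onboot'
        BOOTPROTO='dhcp'
        BONDING_MASTER='yes'
        BONDING_SLAVE_0='ens5'
        MTU='1500'
```

On a new install, the equivalent entry would instead write a NetworkManager connection profile (keyfile) under `/etc/NetworkManager/system-connections/`; a sketch of that form appears further down in the thread.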
harvester.config might be outdated after several upgrades and the user's potential changes; the general upgrade practice is to generate the target networkmanager.yaml from the active 90_custom.yaml.
Requesting that the user manually make changes before/after the upgrade is not acceptable in most cases.
I understand the concern here about requesting manual changes, but there are some difficulties with generating NetworkManager config based on what's in 90_custom.yaml.
What we need, to generate NetworkManager config, is essentially the data in install.management_interface from harvester.config (i.e. whether it's DHCP or static IP, the list of interfaces to be bonded, the bond options, the MTU, the VLAN ID). This is all explicitly stated in harvester.config and can be reliably read, at least since Harvester v1.1 (earlier, in v1.0 and v0.3, the format was different, so if someone's got a system that old, that would be another problem with the approach I'm suggesting here :-/).
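For illustration, here's a hedged sketch of the kind of `install.management_interface` data being referred to; the values and the particular options shown are hypothetical, and only the general field names follow the post-v1.1 schema quoted later in this thread:

```yaml
install:
  management_interface:
    interfaces:
    - name: ens5
    - name: ens6
    method: static             # or "dhcp"
    ip: 192.168.1.10
    subnet_mask: 255.255.255.0
    gateway: 192.168.1.1
    bond_options:
      mode: balance-tlb
      miimon: 100
    mtu: 1500
    vlan_id: 100
```

Everything needed to emit a NetworkManager bond + VLAN configuration is present in one place, which is the point being made above.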
The data we need is not explicit or trivial to extract from 90_custom.yaml. Rather, it's spread over a bunch of `ifcfg-*` files embedded in that YAML. So we would have to read 90_custom.yaml, look for all the `files` directives, check those to see which ones were ifcfg files, parse them all, and figure out what the settings are based on that. This strikes me as complicated and potentially error-prone, especially if we have to handle every possible thing that someone could have added to an `ifcfg-*` file. If any of this were to break, the user would be forced to fix it manually.
Note that there's not a 1:1 mapping between ifcfg-* files and NetworkManager connection profiles, and SLE doesn't include any tools to perform this migration.
In the happy case (Harvester was originally installed with v1.1 or newer, and no manual changes have been made to management interface configuration via 90_custom.yaml), it's going to be much simpler to read harvester.config than trying to parse 90_custom.yaml.
In the unhappy case (Harvester originally installed with v1.0 or older, or manual changes made to management interface configuration via 90_custom.yaml), reading harvester.config may not give the correct configuration. But OTOH, trying to parse 90_custom.yaml can't be guaranteed to give a reliable result either.
We may need to document that users should ensure harvester.config matches the existing network definition before initiating the upgrade. This would only be needed if users have made post-install networking changes by editing the harvester-installer-generated Elemental cloud-init.
We also need to handle the schema change in harvester.config which was introduced in Harvester v1.1 and has been stable since.
However, in Harvester v1.0 our networking configuration looked like this:
```yaml
install:
  mode: create
  networks:
    harvester-mgmt:
      interfaces:
      - name: ens5
        hwAddr: "B8:CA:3A:6A:64:7C"
      method: dhcp
```
compared to post-v1.1:
```yaml
install:
  mode: create
  management_interface:
    interfaces:
    - name: ens5
      hwAddr: "B8:CA:3A:6A:64:7C"
    method: dhcp
```
We need to handle this as part of the upgrade, as we may have users who have been running since Harvester v1.0 and whose harvester.config has remained untouched.
A possible option would be to provide a utility/CLI which leverages the cloud-init generation logic and can be run in advance on each node in the cluster. This would help users review the generated network configuration before the actual upgrade, and would give them the opportunity to fine-tune harvester.config so that the generated network configuration matches the current network setup.
Just as a small mention on the Docs aspect 😄, thinking:
Since:
The user may also likely have persisted that change, in …
> No special user action is required. Everything should just appear to work as it currently does.
>
> #### Upgrades Where Post-installation Configuration Changes Have Been Made
I want to better understand the failure domains and possible scenarios. Basically, what is the worst thing that can happen and how do we recover? Presumably, if the switch over from wicked to nm failed, the problem will be very visible, right? The host loses connectivity, RKE2 reports the node status as NotReady then Unknown. Upgrade is halted. Likely, the admin can still access the host (via serial console?) to diagnose and repair.
If the host loses connectivity, does its role play a role (no pun intended) here? Will role promotion happen while the upgrade is happening? If yes, will the next promoted host be promoted to the correct version? Do we need to ask user to add another management host to ensure quorum in case of partition? Should upgrade be done on management hosts first? (I assume we are already doing this, but can't remember.)
Taking it a step further, is it possible for the (networking stack) upgrade to complete successfully, but user applications to be completely broken, say because some customization translation was amiss, they failed silently etc.? If yes, then what can user do to detect the problem mid-upgrade, before it takes down the entire user application layer? Is it possible to pause the upgrade?
Complete failure would leave the host without network connectivity as you say, and the admin would need to access it via remote console to diagnose and repair.
Role promotion should only happen if a host is actually removed. AFAIK that won't happen automatically if a host just becomes uncontactable for an extended period.
Could user apps be broken by a messed up network config? Uh.. I'm not sure. All we do statically (i.e. the stuff we're changing the configuration of here) is the management interface and how it comes up on boot. Other networks that might affect workloads (extra cluster networks, VM networks, the storage network) are all created by Harvester dynamically at runtime and thus shouldn't(!) be affected by this change.
> Complete failure would leave the host without network connectivity as you say, and the admin would need to access it via remote console to diagnose and repair.
Just to add to that -- it turns out this is (or should be) really easy to do if necessary since harvester/harvester-installer#1150 went in:
- Login via remote console and become root
- Edit `/oem/harvester.config` and make sure `install: management_interface: ....` specifies the network configuration you want
- Run `harvester-installer generate-network-yaml`. By default (i.e. with no other options specified) this will create `/oem/91_networkmanager.yaml` based on `/oem/harvester.config`.
- Reboot and enjoy your now functional network
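To give a rough idea of the result, here's a hedged sketch of what a generated `/oem/91_networkmanager.yaml` might contain; the stage name, connection name, and keyfile contents are illustrative assumptions rather than the actual output of `harvester-installer generate-network-yaml`:

```yaml
name: NetworkManager configuration
stages:
  initramfs:
  - files:
    - path: /etc/NetworkManager/system-connections/mgmt-bo.nmconnection
      permissions: 384          # 0600 -- NetworkManager ignores world-readable keyfiles
      owner: 0
      group: 0
      content: |
        [connection]
        id=mgmt-bo
        type=bond
        interface-name=mgmt-bo

        [bond]
        mode=balance-tlb
        miimon=100

        [ipv4]
        method=auto
```

A real bond setup would also need one port profile per bonded NIC (type=ethernet with master=mgmt-bo and slave-type=bond), and static addressing would use method=manual with address/gateway entries; those are omitted here for brevity.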
@irishgordo yeah, instead of tweaking …
@tserong for the Harvester ISO "Net Install" variant, will there be any extra considerations needed for the NetworkManager switch?
I don't believe so @irishgordo but we should probably give it a quick test run to make sure.
BTW there's a WIP docs PR at harvester/docs#897, which between the finished bits and the bits still marked "DOCS TODO" should hopefully cover all the relevant changes.
Signed-off-by: Tim Serong <[email protected]>
Force-pushed from 81a1c2a to c0b83ad
Problem:
We need to migrate from Wicked to NetworkManager in order to be able to update our base OS to SL Micro 6.x.
Solution:
Described in this HEP
Related Issue(s):
#3418
Test plan:
Included in this HEP