
Conversation

@tserong
Contributor

@tserong tserong commented Sep 3, 2025

Problem:

We need to migrate from Wicked to NetworkManager in order to be able to update our base OS to SL Micro 6.x.

Solution:

Described in this HEP

Related Issue(s):

#3418

Test plan:

Included in this HEP

@tserong tserong requested a review from a team September 3, 2025 10:21
@tserong tserong force-pushed the wip-networkmanager-hep branch from 65c461a to 81a1c2a Compare September 3, 2025 10:24
Member

@w13915984028 w13915984028 left a comment


The HEP looks to have covered the broad areas, thanks.


With new installs, `/oem/90_custom.yaml` will include NetworkManager connection profiles instead of `ifcfg-*` files.

With upgrades, the existing `/oem/90_custom.yaml` file will still include the old `ifcfg-*` files, which will be ignored by NetworkManager.
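
For illustration only (not taken from the HEP), here is a rough sketch of what an embedded NetworkManager connection profile in `/oem/90_custom.yaml` might look like, assuming the usual elemental/yip `stages`/`files` layout; the stage name, path, and profile name are invented:

```yaml
stages:
  initramfs:
    - files:
        # Hypothetical example: a keyfile-format profile for the management
        # bridge, written out at boot. Names and paths are illustrative, not
        # the actual installer output.
        - path: /etc/NetworkManager/system-connections/mgmt-br.nmconnection
          permissions: 0600
          content: |
            [connection]
            id=mgmt-br
            type=bridge
            interface-name=mgmt-br

            [ipv4]
            method=auto
```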
Member


harvester.config might be outdated after several upgrades and the user's own changes; the general upgrade practice is to generate the target networkmanager.yaml from the active 90_custom.yaml.

Requiring users to make manual changes before/after the upgrade is not acceptable in most cases.

Contributor Author


I understand the concern here about requesting manual changes, but there are some difficulties with generating NetworkManager config based on what's in 90_custom.yaml.

What we need, to generate NetworkManager config, is essentially the data in install.management_interface from harvester.config (i.e. whether it's DHCP or static IP, the list of interfaces to be bonded, the bond options, the MTU, the VLAN ID). This is all explicitly stated in harvester.config and can be reliably read, at least since Harvester v1.1 (earlier, in v1.0 and v0.3, the format was different, so if someone's got a system that old, that would be another problem with the approach I'm suggesting here :-/).
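
As a sketch of the data being referred to here, based on my reading of the Harvester configuration docs; field names are approximate, so treat this as illustrative rather than authoritative:

```yaml
install:
  management_interface:
    interfaces:           # the list of interfaces to be bonded
      - name: ens5
      - name: ens6
    method: static        # or dhcp
    ip: 192.168.122.10    # only for static configuration
    subnet_mask: 255.255.255.0
    gateway: 192.168.122.1
    bond_options:         # bond options, illustrative values
      mode: 802.3ad
      miimon: 100
    mtu: 1500
    vlan_id: 100
```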

The data we need is not explicit or trivial to extract from 90_custom.yaml. Rather, it's spread over a bunch of ifcfg-* files embedded in that YAML. So we would have to read 90_custom.yaml, look for all the `files` directives, check those to see which ones were ifcfg files, parse them all, and figure out what the settings are based on that. This strikes me as complicated and potentially error-prone, especially if we have to handle every possible thing that someone could have added to an ifcfg-* file. If any of this were to break, the user would be forced to fix it manually.
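
Again purely for illustration (interface and file names invented), this is roughly the shape of the data we'd have to dig through: a wicked-style ifcfg file carried as an opaque text blob inside the cloud-init YAML, rather than as structured fields:

```yaml
stages:
  initramfs:
    - files:
        # Hypothetical example: the actual network settings live inside this
        # embedded ifcfg blob, which would have to be parsed separately.
        - path: /etc/sysconfig/network/ifcfg-mgmt-br
          content: |
            STARTMODE='onboot'
            BOOTPROTO='dhcp'
            BRIDGE='yes'
            BRIDGE_PORTS='mgmt-bo'
```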

Note that there's not a 1:1 mapping between ifcfg-* files and NetworkManager connection profiles, and SLE doesn't include any tools to perform this migration.

In the happy case (Harvester was originally installed with v1.1 or newer, and no manual changes have been made to management interface configuration via 90_custom.yaml), it's going to be much simpler to read harvester.config than trying to parse 90_custom.yaml.

In the unhappy case (Harvester originally installed with v1.0 or older, or manual changes made to management interface configuration via 90_custom.yaml), reading harvester.config may not give the correct configuration. But OTOH, trying to parse 90_custom.yaml can't be guaranteed to give a reliable result either.

Contributor


We may need to document that users must ensure harvester.config matches the existing network definition before they initiate the upgrade. This would only be needed in case users have made post-install networking changes by editing the harvester-installer-generated elemental cloud-init.

We also need to handle the schema change in harvester.config that was introduced in Harvester v1.1 (it has been stable since).

However, in Harvester v1.0 our networking configuration looked like:

```yaml
install:
  mode: create
  networks:
    harvester-mgmt:
      interfaces:
      - name: ens5
        hwAddr: "B8:CA:3A:6A:64:7C"
      method: dhcp
```

compared to post-v1.1:

```yaml
install:
  mode: create
  management_interface:
    interfaces:
    - name: ens5
      hwAddr: "B8:CA:3A:6A:64:7C"
    method: dhcp
```

We need to handle this as part of the upgrade, as we may have users who have been running since Harvester v1.0 and whose harvester.config has remained untouched.

A possible option would be to provide a utility/CLI which leverages the cloud-init generation logic and can be run in advance on each node in the cluster. This would help users review the generated network configuration before the actual upgrade, and give them the opportunity to fine-tune harvester.config so that the generated network configuration matches the current network setup.

Contributor Author


I've opened a couple of enhancement issues to cover the above:

#9300
#9301

@irishgordo
Contributor

Just a small mention on the docs aspect 😄, thinking:

Since:

Story 2
I'm upgrading from Harvester v1.6.x to v1.7.0. I have not made any post-installation changes to network configuration by manually editing /oem/*.yaml files. I expect networking to continue to operate correctly after the upgrade.

The user may also have persisted that change in /oem, so they likely wouldn't meet the qualifications of Story 2 -> but on fresh installs of Harvester, post-install, a user may still want to change/update DNS.


No special user action is required. Everything should just appear to work as it currently does.

#### Upgrades Where Post-installation Configuration Changes Have Been Made
Contributor

@ihcsim ihcsim Sep 10, 2025


I want to better understand the failure domains and possible scenarios. Basically, what is the worst thing that can happen and how do we recover? Presumably, if the switch over from wicked to nm failed, the problem will be very visible, right? The host loses connectivity, RKE2 reports the node status as NotReady then Unknown. Upgrade is halted. Likely, the admin can still access the host (via serial console?) to diagnose and repair.

If the host loses connectivity, does its role play a role (no pun intended) here? Will role promotion happen while the upgrade is happening? If yes, will the next promoted host be promoted to the correct version? Do we need to ask user to add another management host to ensure quorum in case of partition? Should upgrade be done on management hosts first? (I assume we are already doing this, but can't remember.)

Taking it a step further, is it possible for the (networking stack) upgrade to complete successfully, but user applications to be completely broken, say because some customization translation was amiss, they failed silently etc.? If yes, then what can user do to detect the problem mid-upgrade, before it takes down the entire user application layer? Is it possible to pause the upgrade?

Contributor Author


Complete failure would leave the host without network connectivity as you say, and the admin would need to access it via remote console to diagnose and repair.

Role promotion should only happen if a host is actually removed. AFAIK that won't happen automatically if a host just becomes uncontactable for an extended period.

Could user apps be broken by a messed-up network config? Uh... I'm not sure. All we configure statically (i.e. the stuff we're changing the configuration of here) is the management interface and how it comes up on boot. Other networks that might affect workloads (extra cluster networks, VM networks, storage network) are all created by Harvester dynamically at runtime and thus shouldn't(!) be affected by this change.

Contributor Author

@tserong tserong Oct 1, 2025


Complete failure would leave the host without network connectivity as you say, and the admin would need to access it via remote console to diagnose and repair.

Just to add to that -- it turns out this is (or should be) really easy to do if necessary since harvester/harvester-installer#1150 went in:

  • Login via remote console and become root
  • Edit /oem/harvester.config and make sure `install: management_interface: ...` specifies the network configuration you want
  • Run `harvester-installer generate-network-yaml`. By default (i.e. with no other options specified) this will create /oem/91_networkmanager.yaml based on /oem/harvester.config.
  • Reboot and enjoy your now-functional network

@tserong
Contributor Author

tserong commented Sep 16, 2025

@irishgordo yeah, instead of tweaking /etc/sysconfig/network/config, it'll be something like running `nmcli con modify bridge-mgmt ipv4.dns 8.8.8.8,1.1.1.1 && nmcli device reapply mgmt-br` to apply the change live. To persist it, you'll be looking for that `nmcli con modify` command in /oem/90_custom.yaml, rather than the `sed -i 's/^NETCONFIG_DNS_STATIC_SERVERS` line.
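
A rough sketch of what the persisted form might look like, assuming the usual elemental/yip `commands` stanza in /oem/90_custom.yaml; the stage name is an assumption, not the actual generated content:

```yaml
stages:
  network:
    - commands:
        # Hypothetical example: re-apply the DNS servers on the persisted
        # connection profile at boot; stage layout is illustrative.
        - nmcli con modify bridge-mgmt ipv4.dns 8.8.8.8,1.1.1.1
```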

@irishgordo
Contributor

@tserong for the Harvester ISO "Net Install" variant, will there be any extra considerations needed from the NetworkManager switch?

@tserong
Contributor Author

tserong commented Oct 10, 2025

for the Harvester ISO "Net Install" variant, will there be any extra considerations needed from the NetworkManager switch?

I don't believe so @irishgordo but we should probably give it a quick test run to make sure.

@tserong
Contributor Author

tserong commented Oct 10, 2025

BTW there's a WIP docs PR at harvester/docs#897, which, between the finished bits and the bits still marked "DOCS TODO", should hopefully cover all the relevant changes.

@tserong tserong force-pushed the wip-networkmanager-hep branch from 81a1c2a to c0b83ad Compare October 13, 2025 08:18