-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
4.17
-
Moderate
-
None
-
False
-
Description of problem:
I am not sure if this issue belongs to MCO or not. Please, feel free to reassign it to the right Component if MCO is not the right one.

The CI job for vsphere-ipi-ovn-dualstack-privmaryv6 upgrading from 4.15 -> 4.17 is failing because some nodes cannot join the cluster after reboot.

https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-4.17-upgrade-from-stable-4.15-vsphere-ipi-ovn-dualstack-privmaryv6-f28/1843678815767760896

We can see the following information in the failed node (in a rehearse job created to extract the debug information that the main job cannot provide):

[core@ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 ~]$ systemctl --failed
  UNIT LOAD ACTIVE SUB DESCRIPTION
● systemd-network-generator.service loaded failed failed Generate network units from Kernel command line

$ journalctl -u systemd-network-generator.service
Oct 11 07:59:22 88-110-38-10.in-addr.arpa systemd[1]: Finished Generate network units from Kernel command line.
Oct 11 08:01:03 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Deactivated successfully.
Oct 11 08:01:03 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: Stopped Generate network units from Kernel command line.
-- Boot 7f937ffe8dd741a29d753994ae03b187 --
Oct 11 08:01:16 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd-network-generator[799]: Failed to parse kernel command line: Invalid argument
Oct 11 08:01:16 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Main process exited, code=exited, status=1/FAILURE
Oct 11 08:01:16 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Failed with result 'exit-code'.
Oct 11 08:01:16 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: Failed to start Generate network units from Kernel command line.
-- Boot c253b3e5b5f1454392a1f8c663305f12 --
Oct 11 10:49:15 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd-network-generator[808]: Failed to parse kernel command line: Invalid argument
Oct 11 10:49:15 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Main process exited, code=exited, status=1/FAILURE
Oct 11 10:49:15 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Failed with result 'exit-code'.
Oct 11 10:49:15 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: Failed to start Generate network units from Kernel command line.
-- Boot 4ef77d9b467942ef892d2145f8b8fa44 --
Oct 11 11:25:03 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd-network-generator[805]: Failed to parse kernel command line: Invalid argument
Oct 11 11:25:03 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Main process exited, code=exited, status=1/FAILURE
Oct 11 11:25:03 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Failed with result 'exit-code'.
Oct 11 11:25:03 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: Failed to start Generate network units from Kernel command line.
-- Boot 23e84d36ee4846708b91213de129c437 --
Oct 11 11:58:53 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd-network-generator[813]: Failed to parse kernel command line: Invalid argument
Oct 11 11:58:53 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Main process exited, code=exited, status=1/FAILURE
Oct 11 11:58:53 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Failed with result 'exit-code'.

$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-6462af9eefed4c54bc8ce5f08af127a24f7bc0b5c816874344fb5ad6b934b54d/vmlinuz-5.14.0-427.40.1.el9_4.x86_64 ostree=/ostree/boot.1/rhcos/6462af9eefed4c54bc8ce5f08af127a24f7bc0b5c816874344fb5ad6b934b54d/0 ignition.platform.id=vmware console=ttyS0,115200n8 console=tty0 root=UUID=45ad3301-cf30-45b9-9917-5fcd00e9944b rw rootflags=prjquota boot=UUID=5dd5450f-5543-4865-bd25-94fecb2a18b2 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 ip=dhcp,dhcp6
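
The report does not say which kernel argument systemd-network-generator rejects. As a starting point for triage, the sketch below pulls the ip= arguments out of the command line quoted above, since ip= is the only network-related argument present. The names `CMDLINE` and `ip_args` are hypothetical helpers, and treating ip= as the suspect is an assumption, not something the logs confirm.

```python
import shlex

# Kernel command line copied from the failed node (see `cat /proc/cmdline` above).
CMDLINE = (
    "BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-6462af9eefed4c54bc8ce5f08af127a24f7bc0b5c816874344fb5ad6b934b54d/"
    "vmlinuz-5.14.0-427.40.1.el9_4.x86_64 "
    "ostree=/ostree/boot.1/rhcos/6462af9eefed4c54bc8ce5f08af127a24f7bc0b5c816874344fb5ad6b934b54d/0 "
    "ignition.platform.id=vmware console=ttyS0,115200n8 console=tty0 "
    "root=UUID=45ad3301-cf30-45b9-9917-5fcd00e9944b rw rootflags=prjquota "
    "boot=UUID=5dd5450f-5543-4865-bd25-94fecb2a18b2 "
    "systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 ip=dhcp,dhcp6"
)


def ip_args(cmdline: str) -> list[str]:
    """Return the value of every ip= argument, in the order it appears."""
    # shlex.split splits on whitespace and honours quoted values.
    return [tok.split("=", 1)[1] for tok in shlex.split(cmdline) if tok.startswith("ip=")]


if __name__ == "__main__":
    for value in ip_args(CMDLINE):
        # A comma-separated value is flagged only as something to check (assumption).
        note = " (comma-separated value)" if "," in value else ""
        print(f"ip={value}{note}")
```

On a live node the same function could be fed the contents of /proc/cmdline instead of the hard-coded string.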
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-10-10-193220   True        False         169m    Error while reconciling 4.16.0-0.nightly-2024-10-10-193220: the cluster operator machine-config is degraded
How reproducible:
In the original prow job it happens consistently, although the failing step varies: sometimes it is the vsphere-ipi-ovn-dualstack-privmaryv6-f28-cucushift-chainupgrade-toimage step, other times the vsphere-ipi-ovn-dualstack-privmaryv6-f28-openshift-extended-upgrade-pre-custom-cli step or the vsphere-ipi-ovn-dualstack-privmaryv6-f28-cucushift-upgrade-prehealthcheck step. In a rehearse job the problem reproduces only intermittently, but it does eventually occur.
Steps to Reproduce:
1. Run CI job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-4.17-upgrade-from-stable-4.15-vsphere-ipi-ovn-dualstack-privmaryv6-f28
Actual results:
A node cannot join the cluster after reboot. When we log in to the failed node, systemd-network-generator.service is in the failed state.
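
To check how many reboots are affected, the journal for the unit can be scanned per boot. The sketch below is a minimal helper, assuming the journal text has one entry per line and uses the `-- Boot <id> --` separators seen in the `journalctl` output in the description. The names `failing_boots` and `SAMPLE` are hypothetical, and `SAMPLE` is a shortened excerpt of that output.

```python
import re

BOOT_MARKER = re.compile(r"^-- Boot ([0-9a-f]+) --$")
PARSE_ERROR = "Failed to parse kernel command line"

# Shortened excerpt of the journal shown in the description.
SAMPLE = """\
Oct 11 07:59:22 88-110-38-10.in-addr.arpa systemd[1]: Finished Generate network units from Kernel command line.
-- Boot 7f937ffe8dd741a29d753994ae03b187 --
Oct 11 08:01:16 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd-network-generator[799]: Failed to parse kernel command line: Invalid argument
-- Boot c253b3e5b5f1454392a1f8c663305f12 --
Oct 11 10:49:15 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd-network-generator[808]: Failed to parse kernel command line: Invalid argument
"""


def failing_boots(journal: str) -> list[str]:
    """Return the boot IDs whose section contains the parse error, in order."""
    current = None  # entries before the first marker have no boot ID in the text
    failed: list[str] = []
    for line in journal.splitlines():
        marker = BOOT_MARKER.match(line.strip())
        if marker:
            current = marker.group(1)
        elif PARSE_ERROR in line and current is not None and current not in failed:
            failed.append(current)
    return failed


if __name__ == "__main__":
    for boot_id in failing_boots(SAMPLE):
        print(boot_id)
```

Entries that appear before the first boot marker are ignored because the text carries no boot ID for them. In the excerpt above that is the one boot where the unit finished successfully.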
Expected results:
All nodes should be able to join the cluster after they are rebooted.
Additional info:
In the first comment we posted the links to the must-gather file, the journal log and the sosreport.