  1. OpenShift Bugs
  2. OCPBUGS-43098

Upgrade from 4.15 -> 4.17 is failing in vsphere-ipi-ovn-dualstack-privmaryv6


    • Bug
    • Resolution: Unresolved
    • Normal
    • None
    • 4.17
    • RHCOS
    • Moderate
    • None
    • False

      Description of problem:

      I am not sure whether this issue belongs to MCO. Please feel free to reassign it to the right component if MCO is not the right one.
      
      The CI job for vsphere-ipi-ovn-dualstack-privmaryv6 upgrading from 4.15 -> 4.17 is failing because some nodes cannot join the cluster after reboot.
      
      https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-4.17-upgrade-from-stable-4.15-vsphere-ipi-ovn-dualstack-privmaryv6-f28/1843678815767760896
      
      
      
      We can see the following information on the failed node (collected in a rehearse job created to extract the debug information that the main job cannot provide):
      
      [core@ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 ~]$ systemctl --failed
        UNIT                              LOAD   ACTIVE SUB    DESCRIPTION                                    
      ● systemd-network-generator.service loaded failed failed Generate network units from Kernel command line
      
      $ journalctl -u systemd-network-generator.service
      Oct 11 07:59:22 88-110-38-10.in-addr.arpa systemd[1]: Finished Generate network units from Kernel command line.
      Oct 11 08:01:03 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Deactivated successfully.
      Oct 11 08:01:03 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: Stopped Generate network units from Kernel command line.
      -- Boot 7f937ffe8dd741a29d753994ae03b187 --
      Oct 11 08:01:16 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd-network-generator[799]: Failed to parse kernel command line: Invalid argument
      Oct 11 08:01:16 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Main process exited, code=exited, status=1/FAILURE
      Oct 11 08:01:16 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Failed with result 'exit-code'.
      Oct 11 08:01:16 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: Failed to start Generate network units from Kernel command line.
      -- Boot c253b3e5b5f1454392a1f8c663305f12 --
      Oct 11 10:49:15 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd-network-generator[808]: Failed to parse kernel command line: Invalid argument
      Oct 11 10:49:15 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Main process exited, code=exited, status=1/FAILURE
      Oct 11 10:49:15 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Failed with result 'exit-code'.
      Oct 11 10:49:15 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: Failed to start Generate network units from Kernel command line.
      -- Boot 4ef77d9b467942ef892d2145f8b8fa44 --
      Oct 11 11:25:03 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd-network-generator[805]: Failed to parse kernel command line: Invalid argument
      Oct 11 11:25:03 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Main process exited, code=exited, status=1/FAILURE
      Oct 11 11:25:03 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Failed with result 'exit-code'.
      Oct 11 11:25:03 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: Failed to start Generate network units from Kernel command line.
      -- Boot 23e84d36ee4846708b91213de129c437 --
      Oct 11 11:58:53 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd-network-generator[813]: Failed to parse kernel command line: Invalid argument
      Oct 11 11:58:53 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Main process exited, code=exited, status=1/FAILURE
      Oct 11 11:58:53 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd[1]: systemd-network-generator.service: Failed with result 'exit-code'.
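      The journal above shows the same parse error on every boot after the first one. A minimal sketch for tallying that error per boot from saved `journalctl` output follows. The function name `failures_per_boot` and the embedded sample (a shortened excerpt of the log above) are illustrative assumptions, not part of any tool used in the job.

```python
import re

# Boot separators as printed by journalctl, e.g. "-- Boot <32 hex chars> --".
BOOT_RE = re.compile(r"^-- Boot ([0-9a-f]{32}) --$")
ERROR = "Failed to parse kernel command line"


def failures_per_boot(journal_text: str) -> dict[str, int]:
    """Count parse-error lines under each boot marker (hypothetical helper)."""
    counts: dict[str, int] = {}
    boot = "(before first boot marker)"
    for raw in journal_text.splitlines():
        line = raw.strip()
        match = BOOT_RE.match(line)
        if match:
            # A new boot starts; record it even if no error follows.
            boot = match.group(1)
            counts.setdefault(boot, 0)
        elif ERROR in line:
            counts[boot] = counts.get(boot, 0) + 1
    return counts


# Shortened excerpt of the journal quoted in this report.
sample = """\
Oct 11 07:59:22 88-110-38-10.in-addr.arpa systemd[1]: Finished Generate network units from Kernel command line.
-- Boot 7f937ffe8dd741a29d753994ae03b187 --
Oct 11 08:01:16 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd-network-generator[799]: Failed to parse kernel command line: Invalid argument
-- Boot c253b3e5b5f1454392a1f8c663305f12 --
Oct 11 10:49:15 ci-op-mzcwnlzz-661ff-997hj-worker-0-vmvq4 systemd-network-generator[808]: Failed to parse kernel command line: Invalid argument
"""

for boot_id, count in failures_per_boot(sample).items():
    print(boot_id, count)
```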
      
      
      $ cat /proc/cmdline 
      BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-6462af9eefed4c54bc8ce5f08af127a24f7bc0b5c816874344fb5ad6b934b54d/vmlinuz-5.14.0-427.40.1.el9_4.x86_64 ostree=/ostree/boot.1/rhcos/6462af9eefed4c54bc8ce5f08af127a24f7bc0b5c816874344fb5ad6b934b54d/0 ignition.platform.id=vmware console=ttyS0,115200n8 console=tty0 root=UUID=45ad3301-cf30-45b9-9917-5fcd00e9944b rw rootflags=prjquota boot=UUID=5dd5450f-5543-4865-bd25-94fecb2a18b2 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 ip=dhcp,dhcp6
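      The report does not identify which kernel argument systemd-network-generator rejects. As an aid for inspecting the network-related arguments, the sketch below lists every value passed for a given key on a kernel command line. The helper name `cmdline_args` is hypothetical, the sample string is only the tail of the command line shown above, and quoting inside arguments is deliberately ignored.

```python
def cmdline_args(cmdline: str, key: str) -> list[str]:
    """Return every value given as `key=value` on a kernel command line."""
    prefix = key + "="
    # Kernel parameters are whitespace-separated; this sketch ignores quoting.
    return [tok[len(prefix):] for tok in cmdline.split() if tok.startswith(prefix)]


# Tail of the command line quoted in this report.
sample = (
    "rw rootflags=prjquota systemd.unified_cgroup_hierarchy=1 "
    "cgroup_no_v1=all psi=0 ip=dhcp,dhcp6"
)

print(cmdline_args(sample, "ip"))
```

      On a node, the same helper could be fed the contents of /proc/cmdline, for example `cmdline_args(open("/proc/cmdline").read(), "ip")`.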
      
      
      
          

      Version-Release number of selected component (if applicable):

      $ oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.16.0-0.nightly-2024-10-10-193220   True        False         169m    Error while reconciling 4.16.0-0.nightly-2024-10-10-193220: the cluster operator machine-config is degraded
      
          

      How reproducible:

      In the original prow job it happens consistently (sometimes in the vsphere-ipi-ovn-dualstack-privmaryv6-f28-cucushift-chainupgrade-toimage step and other times in the vsphere-ipi-ovn-dualstack-privmaryv6-f28-openshift-extended-upgrade-pre-custom-cli step or in the vsphere-ipi-ovn-dualstack-privmaryv6-f28-cucushift-upgrade-prehealthcheck step).
      
      If we use a rehearse job to reproduce it, we need a bit of luck to hit the problem, but we eventually hit it.
      
          

      Steps to Reproduce:

          1. Run CI job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-4.17-upgrade-from-stable-4.15-vsphere-ipi-ovn-dualstack-privmaryv6-f28
      
          

      Actual results:

          A node cannot join the cluster after reboot. When we log in to the failed node, systemd-network-generator.service is in a failed state.
          

      Expected results:

          All nodes should be able to join the cluster after they are rebooted.
          

      Additional info:

          In the first comment we posted the links to the must-gather file, the journal log, and the sosreport.
          

              Unassigned
              sregidor@redhat.com Sergio Regidor de la Rosa