  1. OpenShift Bugs
  2. OCPBUGS-41811

4.17 Failed workers reboot in HA topology prevents cluster deployment completion



      Component Readiness has found a potential regression in the following test:

      operator conditions network

      Probability of significant regression: 99.42%

      Sample (being evaluated) Release: 4.17
      Start Time: 2024-09-04T00:00:00Z
      End Time: 2024-09-11T23:59:59Z
      Success Rate: 60.00%
      Successes: 6
      Failures: 4
      Flakes: 0

      Base (historical) Release: 4.16
      Start Time: 2024-05-28T00:00:00Z
      End Time: 2024-06-27T23:59:59Z
      Success Rate: 100.00%
      Successes: 22
      Failures: 0
      Flakes: 0
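The regression probability above can be sanity-checked from the raw counts. The sketch below assumes the figure comes from a one-sided Fisher's exact test on the 2x2 pass/fail table (sample 6/4, base 22/0); that choice of test is an assumption, not something stated in this report, and the function name `fisher_one_sided` is hypothetical.

```python
from math import comb


def fisher_one_sided(sample_fail, sample_pass, base_fail, base_pass):
    """One-sided Fisher's exact p-value.

    Probability of seeing at least `sample_fail` failures in the sample
    if sample and base runs shared the same underlying failure rate.
    """
    n_sample = sample_fail + sample_pass
    total = n_sample + base_fail + base_pass
    total_fail = sample_fail + base_fail
    denom = comb(total, n_sample)
    p = 0.0
    # Sum the hypergeometric tail: k failures landing in the sample.
    for k in range(sample_fail, min(total_fail, n_sample) + 1):
        p += comb(total_fail, k) * comb(total - total_fail, n_sample - k) / denom
    return p


# Counts taken from the report: 4.17 sample (6 pass, 4 fail), 4.16 base (22 pass, 0 fail).
p_value = fisher_one_sided(sample_fail=4, sample_pass=6, base_fail=0, base_pass=22)
print(f"{(1 - p_value) * 100:.2f}%")  # prints 99.42%
```

Under that assumption, the result matches the 99.42% quoted in the report.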

      View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=metal&Platform=metal&Scheduler=default&SecurityMode=default&Suite=parallel&Suite=parallel&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-28%2000%3A00%3A00&capability=operator-conditions&columnGroupBy=Architecture%2CNetwork%2CPlatform&component=Networking%20%2F%20cluster-network-operator&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20metal%20parallel%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-09-11%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-09-04%2000%3A00%3A00&testId=Operator%20results%3A4b5f6af893ad5577904fbaec3254506d&testName=operator%20conditions%20network

      The bug is being opened against the component where the regression shows up. However, 4 out of 10 jobs failed installation because of missing worker nodes.

      Slack thread

      I had a look at the agent-gather and found only the worker logs (as expected, since the workers did not join the bootstrap). According to the journal, the workers were able to fetch the ignition, i.e.
      ...
      set 09 21:19:27 worker-0 assisted-installer[2823]: time="2024-09-09T19:19:27Z" level=info msg="Getting ignition from https://192.168.111.5:22623/config/worker"
      ...
      and were ready to reboot:
      ...
      set 09 21:20:31 worker-0 assisted-installer[2823]: time="2024-09-09T19:20:31Z" level=info msg="Uploading logs and reporting status before rebooting the node 2cc1856b-c4f5-4e3e-9117-128cf97e1d15 for cluster abe52fa7-9e7e-465e-a5c5-457ecb49bb70"
      ...
      But the reboot never happened, and the workers remained stuck, so they never completed the join procedure. The reason they got stuck is not yet clear.
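The same check can be repeated across all worker journals with a small script. This is a minimal sketch that assumes the journal lines have the shape quoted above; the names `LINE_RE`, `MILESTONES` and `worker_milestones` are hypothetical, and the two message fragments are copied from the excerpts in this report.

```python
import re

# Matches lines such as:
#   <date> worker-0 assisted-installer[2823]: time="..." level=info msg="..."
LINE_RE = re.compile(
    r'(?P<host>\S+) assisted-installer\[\d+\]: '
    r'time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>.*)"'
)

# Message fragments taken from the journal excerpts quoted in this report.
MILESTONES = {
    "fetched_ignition": "Getting ignition from",
    "pre_reboot_report": "before rebooting the node",
}


def worker_milestones(lines):
    """Return {host: {milestone: first timestamp seen, or None}}."""
    result = {}
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not an assisted-installer line
        seen = result.setdefault(match["host"], dict.fromkeys(MILESTONES))
        for name, fragment in MILESTONES.items():
            if fragment in match["msg"] and seen[name] is None:
                seen[name] = match["time"]
    return result


journal = [
    'set 09 21:19:27 worker-0 assisted-installer[2823]: time="2024-09-09T19:19:27Z" '
    'level=info msg="Getting ignition from https://192.168.111.5:22623/config/worker"',
    'set 09 21:20:31 worker-0 assisted-installer[2823]: time="2024-09-09T19:20:31Z" '
    'level=info msg="Uploading logs and reporting status before rebooting the node '
    '2cc1856b-c4f5-4e3e-9117-128cf97e1d15 for cluster abe52fa7-9e7e-465e-a5c5-457ecb49bb70"',
]
print(worker_milestones(journal))
```

A worker that shows both timestamps but never rejoins fits the symptom described here: it reached the pre-reboot report and then stalled.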
      
      
      

            afasano@redhat.com Andrea Fasano
            rh-ee-fbabcock Forrest Babcock
            zhenying niu zhenying niu