OCPBUGS-50011: 4.17 Failed worker reboot in HA topology prevents cluster deployment completion


      * Previously, when a worker node was trying to join a cluster, the rendezvous node rebooted before the process completed. Because the worker node could not communicate as expected with the rendezvous node, the installation did not succeed. With this release, a patch fixes the race condition that caused the rendezvous node to reboot prematurely, and the issue is resolved.
      (link:https://issues.redhat.com/browse/OCPBUGS-41811 [*OCPBUGS-41811*])
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-41811. The following is the description of the original issue:

      Component Readiness has found a potential regression in the following test:

      operator conditions network

      Probability of significant regression: 99.42%

      Sample (being evaluated) Release: 4.17
      Start Time: 2024-09-04T00:00:00Z
      End Time: 2024-09-11T23:59:59Z
      Success Rate: 60.00%
      Successes: 6
      Failures: 4
      Flakes: 0

      Base (historical) Release: 4.16
      Start Time: 2024-05-28T00:00:00Z
      End Time: 2024-06-27T23:59:59Z
      Success Rate: 100.00%
      Successes: 22
      Failures: 0
      Flakes: 0

      View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=metal&Platform=metal&Scheduler=default&SecurityMode=default&Suite=parallel&Suite=parallel&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-28%2000%3A00%3A00&capability=operator-conditions&columnGroupBy=Architecture%2CNetwork%2CPlatform&component=Networking%20%2F%20cluster-network-operator&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20metal%20parallel%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-09-11%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-09-04%2000%3A00%3A00&testId=Operator%20results%3A4b5f6af893ad5577904fbaec3254506d&testName=operator%20conditions%20network
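
      The 99.42% figure is consistent with a one-sided Fisher's exact test on the pass/fail counts above. The short sketch below reproduces it from those counts; treat the test choice as an assumption about what Component Readiness computes, not a statement of its exact implementation.

      # Sketch: reproduce the regression probability from the counts above.
      # Assumption: the Component Readiness score roughly matches a one-sided
      # Fisher's exact test on the sample vs. base pass/fail counts.
      from scipy.stats import fisher_exact

      table = [
          [6, 4],    # 4.17 sample: 6 successes, 4 failures
          [22, 0],   # 4.16 base: 22 successes, 0 failures
      ]

      # alternative="less": is the sample's success/failure odds ratio lower
      # than the base's, i.e. did the sample regress?
      _, p_value = fisher_exact(table, alternative="less")

      print(f"p-value: {p_value:.4f}")
      print(f"probability of regression: {(1 - p_value) * 100:.2f}%")
      # With these counts this prints roughly 99.42%, matching the report.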

      The bug is being opened against the component where we see the regression. However, 4 out of 10 jobs failed the install with missing worker nodes.

      Slack thread

      Had a look at the agent-gather and found only the worker ones (as expected, since they did not join the bootstrap). From the journal, the workers were able to fetch the ignition, i.e.
      ...
      set 09 21:19:27 worker-0 assisted-installer[2823]: time="2024-09-09T19:19:27Z" level=info msg="Getting ignition from https://192.168.111.5:22623/config/worker"
      ...
      and were ready to reboot:
      ...
      set 09 21:20:31 worker-0 assisted-installer[2823]: time="2024-09-09T19:20:31Z" level=info msg="Uploading logs and reporting status before rebooting the node 2cc1856b-c4f5-4e3e-9117-128cf97e1d15 for cluster abe52fa7-9e7e-465e-a5c5-457ecb49bb70"
      ...
      But the reboot never happened, and the workers remained stuck (thus never completing the join procedure). The reason they got stuck is not yet clear.
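
      For anyone re-checking a similar agent-gather, a quick sanity check is to grep the worker journal for the two markers quoted above and confirm that nothing follows the reboot announcement. The sketch below does that; the journal path is hypothetical and depends on how the gather archive is laid out.

      # Sketch: list the assisted-installer journal lines that match the two
      # markers above. The file path comes from the command line and is
      # hypothetical; point it at the extracted worker journal text.
      import re
      import sys

      MARKERS = (
          re.compile(r"Getting ignition from https://"),
          re.compile(r"Uploading logs and reporting status before rebooting"),
      )

      def scan(journal_path: str) -> None:
          """Print assisted-installer lines matching the markers of interest."""
          with open(journal_path, encoding="utf-8", errors="replace") as journal:
              for line in journal:
                  if "assisted-installer" in line and any(m.search(line) for m in MARKERS):
                      print(line.rstrip())

      if __name__ == "__main__":
          scan(sys.argv[1])  # e.g. python scan_journal.py worker-0-journal.txt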
      
      
      
