OpenShift Bugs / OCPBUGS-37665

Agent installation fails. The rendezvous host detects only 1 ready master node, even though 2 masters (and 3 workers) are ready.


      Description of problem:

        This happens intermittently in the baremetal CI.
      
      When the cluster is being installed and the other two masters and the three workers become Ready, the rendezvous host is expected to eventually reboot, install itself, and join the cluster as the third master to complete the installation.
      
      However, one of the two masters is apparently not detected as ready, even though `oc get nodes` shows both masters as joined and Ready.
      
      This leaves the rendezvous host stuck: it never reboots to finish the installation by joining the cluster.
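
      As a cross-check, the sketch below prints what the API server itself reports for the control-plane nodes (a minimal sketch, assuming the kubeconfig generated by the installer is used; the jsonpath only lists each master with its Ready condition):

      oc get nodes -l node-role.kubernetes.io/master \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
      oc get co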

      Version-Release number of selected component (if applicable):

          4.16

      How reproducible:

          Sometimes. Most of the time it works fine; when it fails, it always fails in the same way.

      Steps to Reproduce:

          1. Install an agent-based cluster. The configuration I've hit this with is (a reproduction sketch follows below):
            - DHCP
            - IPv4 only
            - BMO enabled (platform: baremetal)
            - 3 masters + 3 workers
      
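      For reference, a minimal reproduction sketch (assuming install-config.yaml and agent-config.yaml describing the DHCP, IPv4-only, platform: baremetal, 3 masters + 3 workers topology are already present; the ./assets directory name is illustrative):

      openshift-install agent create image --dir ./assets --log-level debug
      # Boot all six hosts (including the rendezvous host) from the generated agent ISO,
      # then wait for the installation to finish:
      openshift-install agent wait-for bootstrap-complete --dir ./assets --log-level debug
      openshift-install agent wait-for install-complete --dir ./assets --log-level debug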

      Actual results:

          The installation fails: the rendezvous host never reboots to join the cluster, and the install eventually times out.

      Expected results:

          The rendezvous host reboots, joins the cluster, and the installation succeeds.
      
      
      

      Additional info:

      According to `oc get nodes` and `oc get co`, everything looked good, except for the operators that were degraded because of the missing third master.
      
      Example job (from a PR, but no changes were made to the agent installation automation): https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/54483/rehearse-54483-periodic-ci-openshift-hypershift-release-4.16-periodics-mce-e2e-agent-connected-ovn-ipv4-metal3-conformance/1817908001202245632
      
      Also, in the journal, right after the timeout of the cluster installation:
      "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=etcd pod=etcd-bootstrap-member-master-00_openshift-etcd(77a8963015ba959007010dc81931348d)\"" pod="openshift-etcd/etcd-bootstrap-member-master-00" podUID="77a8963015ba959007010dc81931348d"Jul 29 14:48:21 master-00 start-cluster-installation.sh[7044]: Cluster status: errorJul 29 14:48:26 master-00 start-cluster-installation.sh[7044]: Cluster status: error  Also see attachments.
      
      
      The etcd-bootstrap-member-master-00 pod is in CrashLoopBackOff. Log excerpt (first line truncated):
      
      aft.go:77","msg":"8f4fcab0df4f7c44 switched to configuration voters=(7372168020071371606 10326695331593419844 18105834420489811888)"}
      {"level":"info","ts":"2024-07-29T14:49:41.373886Z","caller":"membership/cluster.go:537","msg":"promote member","cluster-id":"cf7ed821fb17c7fa","local-member-id":"8f4fcab0df4f7c44"}
      {"level":"warn","ts":"2024-07-29T14:49:41.374503Z","caller":"etcdserver/server.go:1149","msg":"server error","error":"the member has been permanently removed from the cluster"}
      {"level":"warn","ts":"2024-07-29T14:49:41.374538Z","caller":"etcdserver/server.go:1150","msg":"data-dir used by this member must be removed"} 
