OpenShift Bugs / OCPBUGS-38139

MNO Cluster Fails to Install: One BMH Stuck in Provisioning State

    • Moderate
    • Yes
    • False

      Description of problem:

      Environment: hub cluster running OCP 4.16.6, ACM 2.11.2, MCE 2.6.2

      Attempting to install the latest green 4.17 nightly build (4.17.0-0.nightly-2024-08-01-213905) on a 5-node bare-metal environment via ZTP/GitOps using the assisted installer. The installation consistently fails with one node (not always the same one) stuck in the Provisioning state. Errors from that node are as follows:

      $ journalctl -b -f -u release-image.service -u bootkube.service

      Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Checking if api.kni-qe-26.lab.eng.tlv2.redhat.com of type API_URL reachable
      Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Unable to reach API_URL's https endpoint at https://api.kni-qe-26.lab.eng.tlv2.redhat.com:6443/readyz
      Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Unable to validate. https://api.kni-qe-26.lab.eng.tlv2.redhat.com:6443/readyz is currently unreachable.
      Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Check if API-Int URL is reachable during bootstrap
      Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Checking if api-int.kni-qe-26.lab.eng.tlv2.redhat.com of type API_INT_URL reachable
      Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Unable to reach API_INT_URL's https endpoint at https://api-int.kni-qe-26.lab.eng.tlv2.redhat.com:6443/readyz
      Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Unable to validate. https://api-int.kni-qe-26.lab.eng.tlv2.redhat.com:6443/readyz is currently unreachable.
      Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: bootkube.service complete
      Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Deactivated successfully.
      Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Consumed 34.010s CPU time.

       

      Aug 07 19:53:24 helix26.lab.eng.tlv2.redhat.com kubelet.sh[6957]: E0807 19:53:24.209981 6957 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=etcd pod=etcd-bootstrap-member-api-int.helix26.lab.eng.tlv2.redhat.com_openshift-etcd(0a1902434a202329b9007efe5b197061)\"" pod="openshift-etcd/etcd-bootstrap-member-api-int.helix26.lab.eng.tlv2.redhat.com" podUID="0a1902434a202329b9007efe5b197061"  

       

      I have not seen this error when installing a compact 3-node cluster using the same hub cluster and three baremetal servers.
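
When triaging journals like the excerpts above, it can help to reduce them to the distinct endpoints that bootkube.sh reported as unreachable and the pods that kubelet reported in CrashLoopBackOff. The following is a minimal shell sketch and not part of the original report: the function names `unreachable_urls` and `crashloop_pods` are hypothetical, the patterns assume the log line formats shown above, and the `kubelet.service` unit name is an assumption.

```shell
# Hypothetical helpers; both read journal text on stdin.

# Print each distinct URL that bootkube.sh reported as
# "currently unreachable" (assumes the message format quoted above).
unreachable_urls() {
  grep -o 'https://[^ ]*/readyz is currently unreachable' \
    | sed 's/ is currently unreachable$//' \
    | LC_ALL=C sort -u
}

# Print each distinct namespace/pod that kubelet reported in
# CrashLoopBackOff (assumes the pod="..." field seen in the kubelet log).
crashloop_pods() {
  grep 'CrashLoopBackOff' \
    | grep -o 'pod="[^"]*"' \
    | sed 's/^pod="//; s/"$//' \
    | LC_ALL=C sort -u
}

# Possible use on the affected host (unit names are assumptions):
#   journalctl -b -u bootkube.service --no-pager | unreachable_urls
#   journalctl -b -u kubelet.service --no-pager | crashloop_pods
```

Run against the excerpts above, this would be expected to list the api and api-int `/readyz` endpoints and the etcd bootstrap member pod, which makes it easier to compare a failing node with a healthy one.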

      Version-Release number of selected component (if applicable):

           ACM 2.11.2, MCE 2.6.2

      How reproducible:

          Always

      Steps to Reproduce:

          1. Install the hub cluster with OCP 4.16.6, ACM 2.11.2, and MCE 2.6.2.
          2. Start the spoke cluster installation using the ZTP/GitOps workflow.
          3. Observe that "oc get bmh" shows 4 of 5 BMHs as "Provisioned" and one stuck in "Provisioning".
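
The check in the last step can be narrowed to only the hosts that have not finished provisioning. This is a rough sketch under stated assumptions: the helper names `list_bmh_states` and `not_provisioned` are hypothetical, and it assumes the BareMetalHost resources expose their state at `.status.provisioning.state`.

```shell
# Hypothetical helper: print namespace, name and provisioning state
# (tab-separated) for every BareMetalHost on the hub cluster.
list_bmh_states() {
  oc get bmh -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.provisioning.state}{"\n"}{end}'
}

# Hypothetical helper: read the tab-separated lines on stdin and keep
# only hosts whose state is not "provisioned" (case-insensitive).
not_provisioned() {
  awk -F '\t' 'NF && tolower($3) != "provisioned" {
    print $1 "/" $2 ": " ($3 == "" ? "<no state>" : $3)
  }'
}

# Possible use on the hub cluster:
#   list_bmh_states | not_provisioned
```

Splitting the query from the filter keeps the filter usable on saved output, for example from a must-gather.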
          

      Actual results:

          One node (not always the same node) is stuck in Provisioning, with the errors shown above.

      Expected results:

          All BMHs provisioned.

      Additional info:

          See links in comments for must-gathers.

              lgamliel liat gamliel
              josclark@redhat.com Joshua Clark
              Michael Burman Michael Burman