Red Hat Advanced Cluster Management / ACM-27989

ACM: Assisted install fails due to Agent stuck in "Waiting for bootkube"

    • Bug
    • Resolution: Unresolved
    • Undefined
    • None
    • ACM 2.15.0
    • None
    • Quality / Stability / Reliability
    • False
    • False
    • Important
    • Yes

      Description of problem:

      During a 3-node cluster deployment via ClusterInstance with hub-side templating, the bootstrap node (helix28) became stuck in the "Waiting for bootkube" stage.
      The Cluster Version Operator (CVO) remained in Pending state indefinitely, causing bootkube.service to repeatedly fail and restart.

      CVO stuck in Pending, bootkube repeatedly fails:

      Pod Status:openshift-cluster-version/cluster-version-operator  Pending
      Error: error while checking pod status: timed out waiting for the condition
      bootkube.service: Failed with result 'exit-code' 

      Agent validation shows:

      Id: api-domain-name-resolved-correctly
      Message: Domain name resolution for the api.kni-qe-51.lab.eng.tlv2.redhat.com domain was successful or not required
      Status: success

      Id: api-int-domain-name-resolved-correctly
      Message: Domain name resolution for the api-int.kni-qe-51.lab.eng.tlv2.redhat.com domain was successful or not required
      Status: success

      The installation then proceeds, but the bootstrap node logs show that the same api-int name cannot be resolved:

      Jan 05 01:10:03 bootkube.sh[6646]: Unable to resolve API_INT_URL api-int.kni-qe-51.lab.eng.tlv2.redhat.com 
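
The agent validation and the bootkube.sh error disagree about whether api-int resolves. A minimal sketch for re-checking both names through the system resolver is shown here, assuming Python 3 is available on a host that uses the same DNS servers as the bootstrap node. The names `check_names` and `_system_resolve` are illustrative only and are not part of the assisted installer.

```python
import socket
from typing import Callable, Dict, Iterable, List


def _system_resolve(name: str) -> List[str]:
    """Resolve a name through the host's configured resolver."""
    infos = socket.getaddrinfo(name, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def check_names(
    names: Iterable[str],
    resolve: Callable[[str], List[str]] = _system_resolve,
) -> Dict[str, List[str]]:
    """Map each name to its resolved addresses; an empty list means it did not resolve."""
    results: Dict[str, List[str]] = {}
    for name in names:
        try:
            results[name] = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = []
    return results


# Example use (performs real DNS lookups, so it is left commented out):
# cluster = "kni-qe-51.lab.eng.tlv2.redhat.com"
# for name, addrs in check_names([f"api.{cluster}", f"api-int.{cluster}"]).items():
#     print(name, "->", ", ".join(addrs) if addrs else "UNRESOLVED")
```

Running the check repeatedly while the install is in progress may show whether resolution of api-int changes between the validation phase and the bootkube phase.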

      Version-Release number of selected component (if applicable):

      HUB OCP VERSION 4.21.0-ec.3
      SPOKE OCP VERSION 4.21.0-rc.0
      ACM 2.15.0
      MCE 2.10.0

      How reproducible:

      often

      Steps to Reproduce:

      1. Configure a 3-node cluster deployment via ClusterInstance
      2. Deploy the cluster
      3. Observe agents pass validation and installation begins
      4. Bootstrap node gets stuck in "Waiting for bootkube" stage

      Actual results:

      The agent remains stuck until the stage times out, which causes the installation to fail:

        State: error
        State Info: Host failed to install because its installation stage Waiting for bootkube took longer than expected 1h0m0s

      Expected results:

      bootkube completes successfully.

      Additional info:

      Bootstrap node journal showing repeated failures:

      Jan 05 01:30:21 api.kni-qe-51.lab.eng.tlv2.redhat.com bootkube.sh[10132]: Error: error while checking pod status: timed out waiting for the condition
      Jan 05 01:30:22 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
      Jan 05 01:50:39 api.kni-qe-51.lab.eng.tlv2.redhat.com bootkube.sh[16374]: Error: error while checking pod status: timed out waiting for the condition
      Jan 05 01:50:39 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
      Jan 05 02:10:56 api.kni-qe-51.lab.eng.tlv2.redhat.com bootkube.sh[22133]: Error: error while checking pod status: timed out waiting for the condition
      Jan 05 02:10:57 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
      Jan 05 02:31:13 api.kni-qe-51.lab.eng.tlv2.redhat.com bootkube.sh[27849]: Error: error while checking pod status: timed out waiting for the condition
      Jan 05 02:31:14 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
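
The excerpt above suggests that bootkube.service fails at a fairly regular interval. A rough sketch for measuring that interval from a saved journal is shown here. The function names are hypothetical, and the year is an assumption: syslog-style timestamps omit it, and 2026 is used because it matches the epoch values in the NetworkManager lines of this report.

```python
import re
from datetime import datetime

# Matches the systemd line that marks one bootkube.service failure.
FAILURE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ systemd\[1\]: "
    r"bootkube\.service: Failed with result"
)

SAMPLE = [
    "Jan 05 01:30:22 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.",
    "Jan 05 01:50:39 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.",
    "Jan 05 02:10:57 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.",
    "Jan 05 02:31:14 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.",
]


def failure_times(lines, year=2026):
    """Return the timestamps of bootkube.service failures (year is assumed)."""
    times = []
    for line in lines:
        m = FAILURE.match(line.strip())
        if m:
            times.append(datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S"))
    return times


def gaps_seconds(times):
    """Return the gap in seconds between consecutive failures."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print(gaps_seconds(failure_times(SAMPLE)))  # [1217, 1218, 1217]
```

For the four failures quoted above, the gaps are just over 20 minutes each, so three such cycles fit within the 1h0m0s stage timeout reported by the agent.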

      The bootstrap node's hostname kept flipping between the correct hostname and the API VIP hostname:

      Jan 05 00:57:26 helix28.lab.eng.tlv2.redhat.com NetworkManager[1784]: <info>  [1767574646.7445] policy: set-hostname: set hostname to 'helix28.lab.eng.tlv2.redhat.com' (from address lookup)
      Jan 05 01:10:36 helix28.lab.eng.tlv2.redhat.com NetworkManager[5981]: <info>  [1767575436.1749] policy: set-hostname: set hostname to 'api.kni-qe-51.lab.eng.tlv2.redhat.com' (from address lookup)
      Jan 05 01:10:36 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd-hostnamed[10626]: Hostname set to <api.kni-qe-51.lab.eng.tlv2.redhat.com> (transient)
      Jan 05 01:17:41 api.kni-qe-51.lab.eng.tlv2.redhat.com NetworkManager[5981]: <info>  [1767575861.7222] policy: set-hostname: set hostname to 'helix28.lab.eng.tlv2.redhat.com' (from address lookup)
      Jan 05 01:17:45 helix28.lab.eng.tlv2.redhat.com NetworkManager[5981]: <info>  [1767575865.4495] policy: set-hostname: set hostname to 'api.kni-qe-51.lab.eng.tlv2.redhat.com' (from address lookup)
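
To quantify the flipping across a full journal file, the NetworkManager set-hostname lines can be extracted and compared pairwise. The sketch below is illustrative only: the function names are hypothetical, and the pattern is written against the line format quoted in this report, so other NetworkManager versions may log differently.

```python
import re
from datetime import datetime, timezone

# Captures the epoch timestamp and the hostname NetworkManager applied.
SET_HOSTNAME = re.compile(
    r"NetworkManager\[\d+\]: <info>\s+\[(?P<epoch>\d+)\.\d+\] "
    r"policy: set-hostname: set hostname to '(?P<name>[^']+)'"
)

SAMPLE = [
    "Jan 05 00:57:26 helix28.lab.eng.tlv2.redhat.com NetworkManager[1784]: <info>  [1767574646.7445] policy: set-hostname: set hostname to 'helix28.lab.eng.tlv2.redhat.com' (from address lookup)",
    "Jan 05 01:10:36 helix28.lab.eng.tlv2.redhat.com NetworkManager[5981]: <info>  [1767575436.1749] policy: set-hostname: set hostname to 'api.kni-qe-51.lab.eng.tlv2.redhat.com' (from address lookup)",
    "Jan 05 01:17:41 api.kni-qe-51.lab.eng.tlv2.redhat.com NetworkManager[5981]: <info>  [1767575861.7222] policy: set-hostname: set hostname to 'helix28.lab.eng.tlv2.redhat.com' (from address lookup)",
    "Jan 05 01:17:45 helix28.lab.eng.tlv2.redhat.com NetworkManager[5981]: <info>  [1767575865.4495] policy: set-hostname: set hostname to 'api.kni-qe-51.lab.eng.tlv2.redhat.com' (from address lookup)",
]


def hostname_changes(lines):
    """Return (UTC time, hostname) for every set-hostname event found."""
    changes = []
    for line in lines:
        m = SET_HOSTNAME.search(line)
        if m:
            when = datetime.fromtimestamp(int(m["epoch"]), tz=timezone.utc)
            changes.append((when, m["name"]))
    return changes


def count_flips(changes):
    """Count how often the applied hostname differs from the previous one."""
    names = [name for _, name in changes]
    return sum(1 for a, b in zip(names, names[1:]) if a != b)


if __name__ == "__main__":
    for when, name in hostname_changes(SAMPLE):
        print(when.isoformat(), name)
    print("flips:", count_flips(hostname_changes(SAMPLE)))
```

Each of the quoted events is annotated "(from address lookup)". One possible reading, not confirmed in this report, is that the name applied depends on which of the node's addresses is looked up at that moment.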

      • The failure can occur on any of the servers in the 3-node cluster.
      • The failure is intermittent: in the last CI iteration it occurred with spoke 4.18 and 4.21, but not with 4.20, using the same ACM/MCE versions.

        1. helix28-journal.log
          12.58 MB
          Bahaa Azem
        2. helix27-journal.log
          19.42 MB
          Bahaa Azem
        3. helix26-journal.log
          31.06 MB
          Bahaa Azem

              eerez@redhat.com Elior Erez
              rh-ee-bazem Bahaa Azem
              Vladislav Kolodny Vladislav Kolodny