CNV-80422

BM clusters (bm15a-tlv2, bm15-ibm): OCP agent install never completes - NTP and host connectivity failures


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • CNV QE DevOps
    • Quality / Stability / Reliability
    • 0.42
    • CNV QE DevOps Sprint 284

      Summary

      OpenShift deployment on bare-metal cluster bm15a-tlv2 fails during the "Deploy OCP" stage: agent-based install never reaches bootstrap-complete. Three agent hosts (cnvqe-085, cnvqe-086, cnvqe-087) remain in "insufficient" state due to NTP sync and inter-host connectivity failures. The job eventually hits the stage timeout. Reporter notes this happens only with this particular BM cluster.

      Actual result

      • "Deploy OCP" runs openshift-install agent wait-for bootstrap-complete (90m timeout) up to two times; both attempts fail with exit code 5.
      • Then it runs wait-for install-complete (90m), which also times out.
      • Final pipeline error: "Timeout has been exceeded while trying to Deploy OCP" → job FAILURE.
      • All later stages (Add CatalogSources, Deploy OCS, Deploy CNV, etc.) are skipped.
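
The retry behavior described above can be sketched as follows. This is a minimal illustration, not the job's actual wrapper, which is not shown in this report: the function name run_with_retries, its parameters, and the use of exit code 124 for a timeout are all assumptions.

```python
import subprocess
from typing import Callable, Sequence


def run_with_retries(
    cmd: Sequence[str],
    attempts: int = 2,
    timeout_s: float = 90 * 60,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Run cmd up to `attempts` times and return the last exit code (0 on success)."""
    code = 1
    for attempt in range(1, attempts + 1):
        try:
            code = runner(list(cmd), timeout=timeout_s).returncode
        except subprocess.TimeoutExpired:
            # Assumed convention (borrowed from coreutils `timeout`), not from the job.
            code = 124
        if code == 0:
            return 0
        suffix = " Retrying..." if attempt < attempts else ""
        print(f"[WARNING] Attempt {attempt}/{attempts} failed with code {code}.{suffix}")
    return code


# Hypothetical usage; the real job likely passes extra flags that are not shown here:
# run_with_retries(["openshift-install", "agent", "wait-for", "bootstrap-complete"])
```

With two attempts of 90 minutes each, a wrapper like this can hold the stage for up to three hours before the following wait-for install-complete step even starts, which matches the stage eventually hitting the pipeline timeout.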

      Root cause (from assisted-service / agent validations)

      Agent hosts cnvqe-085, cnvqe-086, cnvqe-087 are stuck in status "insufficient" with failing validations:

      • "Host could not synchronize with any NTP server"
      • "No connectivity to the majority of hosts in the cluster"

      Because these hosts never become "known" (ready), the cluster never has 3 dedicated control plane nodes ready and bootstrap never completes. After the wait failure, the installer also reports:

      • "dial tcp [2620:52:0:2ef8::215]:6443: connect: no route to host" when gathering ClusterOperator status. The API VIP is unreachable because bootstrap did not complete.
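
To tell a routing problem from an API server that is simply not up yet, a plain TCP probe against the API VIP can help. The sketch below is a hypothetical helper (can_connect is not part of any tool named in this report); it only reports reachability from the machine it runs on, so it would need to run from the same host as the installer to be comparable.

```python
import socket
from typing import Tuple


def can_connect(host: str, port: int, timeout_s: float = 5.0) -> Tuple[bool, str]:
    """Attempt a TCP connection and return (ok, detail)."""
    try:
        # create_connection accepts IPv4/IPv6 literals and host names.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, "connected"
    except OSError as exc:
        # Covers "no route to host", "connection refused", and timeouts.
        return False, str(exc)


# Example against the API VIP from the log above (no brackets around the IPv6 literal):
# ok, detail = can_connect("2620:52:0:2ef8::215", 6443)
```

A "no route to host" detail points at addressing or routing, while "connection refused" would mean the address is reachable but nothing is listening on 6443.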

      Evidence (Jenkins console)

      Relevant log excerpts:

      level=debug msg=Host cnvqe-085.lab.eng.tlv2.redhat.com: updated status from discovering to insufficient (Host cannot be installed due to following failing validation(s): Host could not synchronize with any NTP server ; No connectivity to the majority of hosts in the cluster)
      level=debug msg=Host cnvqe-086.lab.eng.tlv2.redhat.com: ... (same)
      level=debug msg=Host cnvqe-087.lab.eng.tlv2.redhat.com: ... (same)
      level=error msg=Attempted to gather ClusterOperator status after wait failure: ... dial tcp [2620:52:0:2ef8::215]:6443: connect: no route to host
      level=error msg=Bootstrap failed to complete: : bootstrap process timed out: context deadline exceeded
      [WARNING] Attempt 1/2 failed with code 5. Retrying...
      ... (same for attempt 2)
      ...
      Timeout has been exceeded while trying to Deploy OCP
      ERROR: Timeout has been exceeded while trying to Deploy OCP
      Finished: FAILURE
      

      The Ansible play also reports "Assertion failed" (fatal) for cnvqe-085, cnvqe-086, cnvqe-087 in a later step, consistent with these hosts not meeting pre/install requirements.
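
For long console logs, the failing validations per host can be pulled out mechanically. The following is a rough sketch that assumes the line format shown in the excerpts above; the regular expression and the function name failing_validations are illustrative, and other installer versions may word these lines differently.

```python
import re
from typing import Dict, List

# Matches lines such as:
#   msg=Host <name>: updated status from <old> to <new> (<detail>)
HOST_STATUS_RE = re.compile(
    r"msg=Host (?P<host>\S+): updated status from (?P<old>\S+) to (?P<new>\S+)"
    r"(?: \((?P<detail>.*)\))?"
)
VALIDATIONS_MARKER = "failing validation(s):"


def failing_validations(log_text: str) -> Dict[str, List[str]]:
    """Map each host that moved to 'insufficient' to its failing validations."""
    result: Dict[str, List[str]] = {}
    for line in log_text.splitlines():
        match = HOST_STATUS_RE.search(line)
        if not match or match.group("new") != "insufficient":
            continue
        detail = match.group("detail") or ""
        _, found, tail = detail.partition(VALIDATIONS_MARKER)
        if not found:
            continue
        # Validations are separated by ';' in the log message.
        result[match.group("host")] = [v.strip() for v in tail.split(";") if v.strip()]
    return result
```

Abbreviated lines (for example those ending in "... (same)") do not match the pattern and are skipped, so the function should be fed the full console log rather than a summarized excerpt.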

      Steps to reproduce

      1. Run job deploy-ocp-bare-metal-cluster-with-abi-cnv-4.22 with CLUSTER_NAME=bm15a-tlv2 (and parameters that trigger OCP deploy on this BM cluster).
      2. Wait for "Deploy OCP" stage: openshift-install agent wait-for bootstrap-complete runs from cnvqe-030.
      3. Observe that agent hosts cnvqe-085, cnvqe-086, cnvqe-087 stay "insufficient" (NTP and connectivity validations fail).
      4. After 90m, attempt 1 fails with code 5 and the retry (attempt 2) fails the same way. wait-for install-complete then runs and times out, and finally the pipeline timeout aborts the stage.

      Expected result

      Agent hosts sync with NTP and have connectivity to the majority of hosts, so they become "known" and the cluster can proceed to bootstrap-complete and install-complete within the job timeout.

      Environment / notes

      • Reproduced only on this particular BM cluster (bm15a-tlv2); other BM clusters in the same job do not show this behavior.
      • This suggests an environment-specific issue on bm15a-tlv2, e.g. NTP not reachable from cnvqe-085/086/087, or network segmentation or a firewall blocking required traffic between these hosts and/or to the rest of the cluster.
      • Jenkins: jenkins-csb-cnvqe-main.dno.corp.redhat.com
      • Job: deploy-ocp-bare-metal-cluster-with-abi-cnv-4.22, Build: 26
      • OCP version target: 4.22, FIPS enabled, dual stack (IP_STACK=dual)
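
One way to test the "NTP not reachable" hypothesis is a bare SNTP query over UDP port 123. The sketch below is a minimal client under stated assumptions: the function names are hypothetical, the server name in the usage comment is a placeholder (the NTP servers configured for bm15a-tlv2 are not given in this report), and a result is only meaningful when the script runs from the affected network segment.

```python
import socket
import struct

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 (NTP) and 1970-01-01 (Unix)


def build_sntp_request() -> bytes:
    """48-byte client request: LI=0, VN=3, Mode=3 in the first byte, rest zero."""
    return b"\x1b" + 47 * b"\x00"


def parse_transmit_time(packet: bytes) -> float:
    """Return the server transmit timestamp (bytes 40-47) as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("NTP packet shorter than 48 bytes")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def query_ntp(server: str, timeout_s: float = 5.0, port: int = 123) -> float:
    """Query each resolved address (IPv4 or IPv6) until one answers."""
    last_error = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
        server, port, type=socket.SOCK_DGRAM
    ):
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout_s)
                sock.sendto(build_sntp_request(), addr)
                data, _ = sock.recvfrom(512)
                return parse_transmit_time(data)
        except OSError as exc:  # includes timeouts and unreachable networks
            last_error = exc
    raise last_error or OSError("no usable address for NTP server")


# Hypothetical usage; replace the placeholder with the lab's configured NTP server:
# print(query_ntp("ntp.example.com"))
```

Because the cluster is dual stack, iterating over every address returned by getaddrinfo matters here: a server that answers over IPv4 but not over IPv6 (or the reverse) would otherwise be reported as simply unreachable.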

              ryasharz@redhat.com Rabin Yasharzadehe
              lbednar@redhat.com Lukas Bednar
              Daniel Keler