OCPBUGS-45085

OCP 4.14.34 install cannot complete because hosts get stuck in the provisioning state.



      Description of problem:

          The customer is installing an OCP 4.14 cluster with 3 master and 4 worker nodes using the ZTP approach on ACM 2.9.5. The BMH hosts boot from the ISO but are not added to the cluster, and the nodes remain stuck in the provisioning state in the ACM UI because communication with the ironic agent fails.
      
      
      
      $ omc get bmh -n ocp10
      svocp10wrk01.ocp10.pod4ocp.nbnco.lab   OK       provisioning              idrac-virtualmedia://10.0.32.183/redfish/v1/Systems/System.Embedded.1   unknown            true             21h
      svocp10wrk02.ocp10.pod4ocp.nbnco.lab   OK       provisioning              idrac-virtualmedia://10.0.32.184/redfish/v1/Systems/System.Embedded.1   unknown            true             21h
      svocp10wrk03.ocp10.pod4ocp.nbnco.lab   OK       provisioning              idrac-virtualmedia://10.0.32.185/redfish/v1/Systems/System.Embedded.1   unknown            true             21h
      svocp10wrk04.ocp10.pod4ocp.nbnco.lab   OK       provisioning              idrac-virtualmedia://10.0.32.186/redfish/v1/Systems/System.Embedded.1   unknown            true             21h
      svocp10wrk05.ocp10.pod4ocp.nbnco.lab   OK       provisioning              idrac-virtualmedia://10.0.32.187/redfish/v1/Systems/System.Embedded.1   unknown            true             21h
      svocp10wrk06.ocp10.pod4ocp.nbnco.lab   OK       provisioning              idrac-virtualmedia://10.0.32.188/redfish/v1/Systems/System.Embedded.1   unknown            true             21h
      svocp10wrk07.ocp10.pod4ocp.nbnco.lab   OK       provisioning              idrac-virtualmedia://10.0.32.189/redfish/v1/Systems/System.Embedded.1   unknown            true             21h

      $ omc get aci
      NAME    CLUSTER   STATE
      ocp10   ocp10     pending-for-input
      ..
          - lastProbeTime: "2024-11-20T03:56:07Z"
            lastTransitionTime: "2024-11-20T03:56:07Z"
            message: 'The cluster''s validations are pending for user: Clusters must have
              exactly 3 dedicated control plane nodes. Add or remove hosts, or change their
              roles configurations to meet the requirement.,Hosts have not been discovered
              yet,Hosts have not been discovered yet,Hosts have not been discovered yet,Hosts
              have not been discovered yet,At least one of the CIDRs (Machine Network, Cluster
              Network, Service Network) is undefined.,At least one of the CIDRs (Machine
              Network, Cluster Network, Service Network) is undefined.'
            reason: ValidationsUserPending
            status: "False"

      On the ACM hub cluster, the metal3-state service is missing ports 6388 and 5051:

      $ oc get services -n openshift-machine-api
      NAME                                 TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)                      AGE
      baremetal-operator-webhook-service   ClusterIP   192.168.162.113   <none>        443/TCP                      603d
      cluster-autoscaler-operator          ClusterIP   192.168.84.157    <none>        443/TCP,9192/TCP             603d
      cluster-baremetal-operator-service   ClusterIP   192.168.240.36    <none>        8443/TCP                     603d
      cluster-baremetal-webhook-service    ClusterIP   192.168.225.165   <none>        443/TCP                      603d
      control-plane-machine-set-operator   ClusterIP   192.168.185.83    <none>        9443/TCP                     603d
      machine-api-controllers              ClusterIP   192.168.100.22    <none>        8441/TCP,8442/TCP,8444/TCP   603d
      machine-api-operator                 ClusterIP   192.168.40.247    <none>        8443/TCP                     603d
      machine-api-operator-webhook         ClusterIP   192.168.6.165     <none>        443/TCP                      603d
      metal3-image-customization-service   ClusterIP   192.168.131.65    <none>        80/TCP                       603d
      metal3-state                         ClusterIP   192.168.194.215   <none>        6180/TCP,6183/TCP            603
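The port check above can be repeated against saved `oc get services` text with a short script. This is only a sketch: the helper names `service_ports` and `missing_ports` are made up for illustration, the parser assumes the default column layout shown in the listing (PORT(S) as the fifth column), and the expected ports 6388 and 5051 are taken from the statement in this report, not from any authoritative list.

```python
def service_ports(listing: str, service: str) -> set[int]:
    """Return the ports shown for `service` in `oc get services` text output.

    Assumes the default column order: NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE.
    """
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0] == service:
            ports = set()
            # PORT(S) looks like "6180/TCP,6183/TCP" (or "80:30080/TCP" for NodePort).
            for entry in fields[4].split(","):
                number = entry.split("/")[0].split(":")[0]
                ports.add(int(number))
            return ports
    return set()


def missing_ports(listing: str, service: str, expected: list[int]) -> list[int]:
    """Return the expected ports that the service does not expose, sorted."""
    return sorted(set(expected) - service_ports(listing, service))


# The metal3-state row as pasted in this report.
LISTING = """\
NAME                                 TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)                      AGE
metal3-state                         ClusterIP   192.168.194.215   <none>        6180/TCP,6183/TCP            603
"""

# Ports the report says should be present (an assumption carried over from the text).
print(missing_ports(LISTING, "metal3-state", [6388, 5051]))
```

With the row from this report, both expected ports come back as missing, which matches the observation above.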
      
      

      0200-worker-journal.log

      Nov 21 09:31:19 localhost.localdomain podman[9499]: 2024-11-21 09:31:19.538 1 ERROR ironic-python-agent     raise ConnectionError(e, request=request)
      Nov 21 09:31:19 localhost.localdomain podman[9499]: 2024-11-21 09:31:19.538 1 ERROR ironic-python-agent requests.exceptions.ConnectionError: HTTPSConnectionPool(host='10.0.9.10', port=5050): Max retries exceeded with url: /v1/continue (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f00f3799af0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED'))
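The `ECONNREFUSED` in the log means the TCP connection was actively rejected, which usually indicates that nothing is listening on that address and port, as opposed to a timeout or routing failure. A minimal sketch for reproducing the check from a machine on the same network follows; the function name `probe` and the three result labels are illustrative choices, not part of any OpenShift or Ironic tooling.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', or 'unreachable'."""
    try:
        # create_connection performs the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Same condition the agent logged: [Errno 111] ECONNREFUSED.
        return "refused"
    except OSError:
        # Timeouts, missing routes, and name resolution failures land here.
        return "unreachable"
```

For the endpoint in the log, the call would be `probe("10.0.9.10", 5050)`. A result of "refused" would be consistent with the agent error, while "unreachable" would point toward a network path problem instead.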

      Steps to Reproduce:

          1. Install a 4.14 cluster using the ACM/ZTP SiteConfig approach.
      
      

      Actual results:

          Hosts get stuck in the provisioning state.

      Expected results:

          Hosts should have been added to the cluster.

      Additional info:

          

              rpittau@redhat.com Riccardo Pittau
              rhn-support-dchong Daniel Chong
              Jad Haj Yahya Jad Haj Yahya
              Daniel Chong, Nikhil Gupta