Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-25187

RHOCP newly added worker node stuck in Provisioning state and taking lots of time to turn into Runnig state

XMLWordPrintable

    • No
    • Metal Platform 248, Metal Platform 249
    • 2
    • False
    • Hide

      None

      Show
      None

      Original: Description of problem:

      After adding a BareMetalHost, provisioning does not initiate the registering/inspecting phase for at least 15 minutes. This makes it very difficult for automation because during that time you also don't know if something is failing or not.

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          

      Additional info:

      ## Issue:
      =====================
      - After adding a BareMetalHost, provisioning does not initiate the registering/inspecting phase for at least 15 minutes.
      - This makes it very difficult for automation because during that time you also don't know if something is failing or not.
      
      
      ## ENV:
      =====================
      - Platform: BareMetal
      - Install Type: IPI
      - Current:        4.12.42
      - Desired:        4.12.42
      
      
      
      ## Configuration:
      =====================
      - We have a 'cluster-ext' network for OOB (iDRAC) and customer IP range (eg: 192.168.120.0/24).
      - We have a 'cluster-int' network which is the 'baremetal' network (198.19.0.0/24).
      - We do not have a 'provisioning' network because of HW/BOM limitations.
      - We are using 'provisioningInterface' + 'provisioningIP' for getting a reachable idrac-virtualmedia URL.```
      ~~~
      $ oc get provisionings.metal3.io provisioning-configuration -o yaml
      apiVersion: metal3.io/v1alpha1
      kind: Provisioning
      metadata:
        ...
        name: provisioning-configuration
        ...
      spec:
        preProvisioningOSDownloadURLs: {}
        provisioningIP: 192.168.120.47
        provisioningInterface: cluster-ext
        provisioningMacAddresses:
        - 0c:29:ef:8d:c3:d0
        - 0c:29:ef:8d:c3:dd
        - 0c:29:ef:8d:c3:ea
        provisioningNetwork: Disabled
        provisioningNetworkCIDR: 192.168.120.0/24
        virtualMediaViaExternalNetwork: true
      
      ~~~
      
      ## Conditions:
      =====================
      - This happens at least in the following circumstances:
      - Always: When cluster has just been installed and provisioning config has been created/updated
      - Mostly: Provisioning config has been changed
      - Sometimes: Cluster has been rebooted
      
      
      
      
      ## Investigation:
      =====================
      - The metal3 complains the provisioner is not ready:
      ~~~
      $ METAL3_POD=$(oc get pods -n openshift-machine-api -l baremetal.openshift.io/cluster-baremetal-operator=metal3-state -o name)
      
      $ oc logs -n openshift-machine-api ${METAL3_POD}
      ~~~{"level":"info","ts":1701943415.7587945,"logger":"controllers.BareMetalHost","msg":"provisioner is not ready","baremetalhost":"openshift-machine-api/worker-0","RequeueAfter:":30}
      {"level":"info","ts":1701943415.758767,"logger":"provisioner.ironic","msg":"error caught while checking endpoint","host":"openshift-machine-api~worker-2","endpoint":"https://localhost:6385/v1/","error":"Get \"https://localhost:6385/v1\": dial tcp [::1]:6385: connect: connection refused"}
      {"level":"info","ts":1701943415.7587974,"logger":"provisioner.ironic","msg":"error caught while checking endpoint","host":"openshift-machine-api~worker-3","endpoint":"https://localhost:6385/v1/","error":"Get \"https://localhost:6385/v1\": dial tcp [::1]:6385: connect: connection refused"}
      {"level":"info","ts":1701943415.7588215,"logger":"controllers.BareMetalHost","msg":"provisioner is not ready","baremetalhost":"openshift-machine-api/worker-3","RequeueAfter:":30}
      {"level":"info","ts":1701943415.7588148,"logger":"controllers.BareMetalHost","msg":"provisioner is not ready","baremetalhost":"openshift-machine-api/worker-2","RequeueAfter:":30}
      ~~~
      
      During the first few minutes the metal3-ironic-inspector is throwing this meesages:
      ~~~
      2023-12-07 10:04:14.420 1 ERROR ironic_inspector.common.ironic [-] Failed to establish a connection with ironic, reason: Unable to find a version discovery document at https://192.168.120.47:6385, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.: keystoneauth1.exceptions.discovery.DiscoveryFailure: Unable to find a version discovery document at https://192.168.120.47:6385, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
      2023-12-07 10:04:14.420 1 ERROR ironic_inspector.conductor.manager [-] The periodic ironic_inspector.conductor.manager.sync_with_ironic failed with: Traceback (most recent call last):
        File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 290, in run
          work()
        File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 64, in __call__
          return self.callback(*self.args, **self.kwargs)
        File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 178, in decorator
          return f(*args, **kwargs)
        File "/usr/lib/python3.9/site-packages/ironic_inspector/conductor/manager.py", line 226, in sync_with_ironic
          ironic = ir_utils.get_client()
        File "/usr/lib/python3.9/site-packages/ironic_inspector/common/ironic.py", line 61, in get_client
          return openstack.connection.Connection(
        File "/usr/lib/python3.9/site-packages/openstack/service_description.py", line 87, in __get__
          proxy = self._make_proxy(instance)
        File "/usr/lib/python3.9/site-packages/openstack/service_description.py", line 223, in _make_proxy
          data = proxy_obj.get_endpoint_data()
        File "/usr/lib/python3.9/site-packages/keystoneauth1/adapter.py", line 312, in get_endpoint_data
          return self.session.get_endpoint_data(auth or self.auth, **kwargs)
        File "/usr/lib/python3.9/site-packages/keystoneauth1/session.py", line 1260, in get_endpoint_data
          return auth.get_endpoint_data(self, **kwargs)
        File "/usr/lib/python3.9/site-packages/keystoneauth1/plugin.py", line 377, in get_endpoint_data
          return super(FixedEndpointPlugin, self).get_endpoint_data(
        File "/usr/lib/python3.9/site-packages/keystoneauth1/plugin.py", line 142, in get_endpoint_data
          return endpoint_data.get_versioned_data(
        File "/usr/lib/python3.9/site-packages/keystoneauth1/discover.py", line 1057, in get_versioned_data
          new_data._set_version_info(
        File "/usr/lib/python3.9/site-packages/keystoneauth1/discover.py", line 1136, in _set_version_info
          self._run_discovery(
        File "/usr/lib/python3.9/site-packages/keystoneauth1/discover.py", line 1250, in _run_discovery
          raise exceptions.DiscoveryFailure(
      keystoneauth1.exceptions.discovery.DiscoveryFailure: Unable to find a version discovery document at https://192.168.120.47:6385, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
      : None: None
      2023-12-07 10:04:14.420 1 ERROR ironic_inspector.conductor.manager NoneType: None
      2023-12-07 10:04:14.420 1 ERROR ironic_inspector.conductor.manager 
      ~~~
      
      The ironic-proxy pods are waiting for an IP which is already present:
      ~~~
      $ IRONIC_PROXY_POD=$(oc get pods -n openshift-machine-api -l baremetal.openshift.io/cluster-baremetal-operator=ironic-proxy -o name | head -1)
      $ oc logs -n openshift-machine-api ${IRONIC_PROXY_POD}
      ...
      Waiting for 198.19.0.12 to be configured on an interface
      + ip -br addr show
      + grep -q -F ' 198.19.0.12/'
      + echo 'Waiting for 198.19.0.12 to be configured on an interface'
      + sleep 1
      Waiting for 198.19.0.12 to be configured on an interface
      + ip -br addr show
      + grep -q -F ' 198.19.0.12/'
      + echo 'Waiting for 198.19.0.12 to be configured on an interface'
      + sleep 1
      ~~~
      
      
      Verify the IP is present:
      ~~~
      # oc rsh -n openshift-machine-api ${IRONIC_PROXY_POD}
      sh-5.1# ip -br addr show | grep -F ' 198.19.0.12/'
      br-ex            UNKNOWN        198.19.0.12/24 169.254.169.2/29 198.19.0.1/32 
      sh-5.1# echo $?
      0
      ~~~
      
      After a long time we see the same happening in the metal3-ironic-inspector pod waiting for the provisioningIP to be present:
      
      $ oc logs -n openshift-machine-api -c metal3-ironic-inspector ${METAL3_POD}
      ~~~
      Waiting for 192.168.120.47 to be configured on an interface
      + echo 'Waiting for 192.168.120.47 to be configured on an interface'
      + sleep 1
      + ip -br addr show
      + grep -q -F ' 192.168.120.47/'
      Waiting for 192.168.120.47 to be configured on an interface
      + echo 'Waiting for 192.168.120.47 to be configured on an interface'
      + sleep 1
      + ip -br addr show
      + grep -q -F ' 192.168.120.47/'
      ~~~
      
      
      Verify the IP is present:
      ~~~
      $ oc rsh -n openshift-machine-api -c metal3-ironic-inspector ${METAL3_POD}
      sh-5.1# ip -br addr show | grep -F ' 192.168.120.47/'
      cluster-ext@infra_10g UP             192.168.120.47/24 
      ~~~
      
      
      ## Reason:
      =====================
      - Both the metal3-ironic-inspector and ironic-proxy pod are executing the `/usr/bin/ironic-common.sh` script.
      - This script contains the following loop:
      ~~~
          until ip -br addr show | grep -q -F " ${IRONIC_IP}/"; do
            echo "Waiting for ${IRONIC_IP} to be configured on an interface???"
            sleep 1
          done
      ~~~
      
      
      - The reason is that 'grep -q' will close() the STDIN stream upon first matching, breaking the pipe.
      - This in combination with the bash 'pipefail' option will result in a non-zero exit code (ecit code 141).We can inspect this behaviour inside one of the affected containers:
      ~~~
      ### Connect to ironic-proxy POD
      $ oc rsh -n openshift-machine-api ${IRONIC_PROXY_POD}### Set pipefail option
      sh-5.1# set -o pipefail### Test without -q
      sh-5.1# ip -br addr show | grep -F ' 198.19.0.12/'
      br-ex            UNKNOWN        198.19.0.12/24 169.254.169.2/29 198.19.0.1/32
      
      sh-5.1# echo $?
      0
      
      ### Test with -q
      sh-5.1# ip -br addr show | grep -q -F ' 198.19.0.12/'
      sh-5.1# echo $?
      141
      
      ### Test proposed solution without using -q
      sh-5.1# ip -br addr show | grep -F ' 198.19.0.12/' >/dev/null 2>&1
      sh-5.1# echo $?
      0
      ~~~
      
      
      ## Proposed resolution:
      =====================
      - We can achieve the desired result without using -q by redirecting the output to /dev/null:
      ~~~
          until ip -br addr show | grep -F " ${IRONIC_IP}/" >/dev/null 2>&1; do
            echo "Waiting for ${IRONIC_IP} to be configured on an interface???"
            sleep 1
          done
      ~~~
      
      
      ## Manual workaround:  
      =====================
      - Because the script is already running in the pod, we cannot patch it, but it is calling 'grep' every time.We can work around the issue manually by:
      - Entering the affected pods/containers (ironic-proxy and metal3:metal3-ironic-inspector)
      - Create a copy of the original grep binary
      - Create a wrapper around grep, detecting this use of '-q' and calling the original grep using a workaround
      - Replacing the grep binary with the wrapper, which now wraps around the copy of the original grep
      
      Procedure:
      ~~~
      # Connect to the affected pod
      $ oc rsh -n openshift-machine-api ${IRONIC_PROXY_POD}# Create the grep wrapper
      cat > /root/grep <<EOF
      #!/bin/bashif [[ "\$1" == "-q" ]]; then
        shift; exec /usr/bin/grep.orig \$@ >/dev/null
      else
        exec /usr/bin/grep.orig \$@
      fi
      EOF
      .
      .
      # Make wrapper executable
      chmod 755 /root/grep# Make a copy of the original grep
      cp /usr/bin/grep /usr/bin/grep.orig# Replace grep with the wrapper
      cp /root/grep /usr/bin/grep
      ~~~
      
      - This unstuck the initialization and immediately started the registering and inspecting process of the node.Due to the complexity however, we cannot use this workaround in automation or production.   

              rhn-engineering-hpokorny Honza Pokorny
              rhn-support-shpawar SHUBHAM PAWAR
              Sunil Choudhary Sunil Choudhary
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Created:
                Updated: