- Bug
- Resolution: Unresolved
- Normal
- None
- 4.12
- No
- Metal Platform 248, Metal Platform 249
- 2
- False
Original: Description of problem:
After adding a BareMetalHost, provisioning does not initiate the registering/inspecting phase for at least 15 minutes. This makes it very difficult for automation because during that time you also don't know if something is failing or not.
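For automation, one way to detect that a host is stuck in this window is to poll the BareMetalHost provisioning state instead of waiting blindly. A minimal sketch (not from this report; the host name worker-0 is only an illustrative example):

~~~
# Watch all BareMetalHosts until they move into the registering/inspecting phase (sketch):
$ oc get bmh -n openshift-machine-api -w

# Or poll a single host's provisioning state, e.g. a hypothetical worker-0:
$ oc get bmh worker-0 -n openshift-machine-api -o jsonpath='{.status.provisioning.state}{"\n"}'
~~~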
Version-Release number of selected component (if applicable):
How reproducible:
Additional info:
## Issue:
=====================
- After adding a BareMetalHost, provisioning does not initiate the registering/inspecting phase for at least 15 minutes.
- This makes it very difficult for automation because during that time you also don't know if something is failing or not.

## ENV:
=====================
- Platform: BareMetal
- Install Type: IPI
- Current: 4.12.42
- Desired: 4.12.42

## Configuration:
=====================
- We have a 'cluster-ext' network for OOB (iDRAC) and the customer IP range (eg: 192.168.120.0/24).
- We have a 'cluster-int' network which is the 'baremetal' network (198.19.0.0/24).
- We do not have a 'provisioning' network because of HW/BOM limitations.
- We are using 'provisioningInterface' + 'provisioningIP' for getting a reachable idrac-virtualmedia URL.

~~~
$ oc get provisionings.metal3.io provisioning-configuration -o yaml
apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  ...
  name: provisioning-configuration
  ...
spec:
  preProvisioningOSDownloadURLs: {}
  provisioningIP: 192.168.120.47
  provisioningInterface: cluster-ext
  provisioningMacAddresses:
  - 0c:29:ef:8d:c3:d0
  - 0c:29:ef:8d:c3:dd
  - 0c:29:ef:8d:c3:ea
  provisioningNetwork: Disabled
  provisioningNetworkCIDR: 192.168.120.0/24
  virtualMediaViaExternalNetwork: true
~~~

## Conditions:
=====================
- This happens at least in the following circumstances:
  - Always: when the cluster has just been installed and the provisioning config has been created/updated
  - Mostly: when the provisioning config has been changed
  - Sometimes: when the cluster has been rebooted

## Investigation:
=====================
- The metal3 pod complains that the provisioner is not ready:

~~~
$ METAL3_POD=$(oc get pods -n openshift-machine-api -l baremetal.openshift.io/cluster-baremetal-operator=metal3-state -o name)
$ oc logs -n openshift-machine-api ${METAL3_POD}
{"level":"info","ts":1701943415.7587945,"logger":"controllers.BareMetalHost","msg":"provisioner is not ready","baremetalhost":"openshift-machine-api/worker-0","RequeueAfter:":30}
{"level":"info","ts":1701943415.758767,"logger":"provisioner.ironic","msg":"error caught while checking endpoint","host":"openshift-machine-api~worker-2","endpoint":"https://localhost:6385/v1/","error":"Get \"https://localhost:6385/v1\": dial tcp [::1]:6385: connect: connection refused"}
{"level":"info","ts":1701943415.7587974,"logger":"provisioner.ironic","msg":"error caught while checking endpoint","host":"openshift-machine-api~worker-3","endpoint":"https://localhost:6385/v1/","error":"Get \"https://localhost:6385/v1\": dial tcp [::1]:6385: connect: connection refused"}
{"level":"info","ts":1701943415.7588215,"logger":"controllers.BareMetalHost","msg":"provisioner is not ready","baremetalhost":"openshift-machine-api/worker-3","RequeueAfter:":30}
{"level":"info","ts":1701943415.7588148,"logger":"controllers.BareMetalHost","msg":"provisioner is not ready","baremetalhost":"openshift-machine-api/worker-2","RequeueAfter:":30}
~~~

- During the first few minutes the metal3-ironic-inspector container throws these messages:

~~~
2023-12-07 10:04:14.420 1 ERROR ironic_inspector.common.ironic [-] Failed to establish a connection with ironic, reason: Unable to find a version discovery document at https://192.168.120.47:6385, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.: keystoneauth1.exceptions.discovery.DiscoveryFailure: Unable to find a version discovery document at https://192.168.120.47:6385, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
2023-12-07 10:04:14.420 1 ERROR ironic_inspector.conductor.manager [-] The periodic ironic_inspector.conductor.manager.sync_with_ironic failed with: Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 290, in run
    work()
  File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 64, in __call__
    return self.callback(*self.args, **self.kwargs)
  File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 178, in decorator
    return f(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/ironic_inspector/conductor/manager.py", line 226, in sync_with_ironic
    ironic = ir_utils.get_client()
  File "/usr/lib/python3.9/site-packages/ironic_inspector/common/ironic.py", line 61, in get_client
    return openstack.connection.Connection(
  File "/usr/lib/python3.9/site-packages/openstack/service_description.py", line 87, in __get__
    proxy = self._make_proxy(instance)
  File "/usr/lib/python3.9/site-packages/openstack/service_description.py", line 223, in _make_proxy
    data = proxy_obj.get_endpoint_data()
  File "/usr/lib/python3.9/site-packages/keystoneauth1/adapter.py", line 312, in get_endpoint_data
    return self.session.get_endpoint_data(auth or self.auth, **kwargs)
  File "/usr/lib/python3.9/site-packages/keystoneauth1/session.py", line 1260, in get_endpoint_data
    return auth.get_endpoint_data(self, **kwargs)
  File "/usr/lib/python3.9/site-packages/keystoneauth1/plugin.py", line 377, in get_endpoint_data
    return super(FixedEndpointPlugin, self).get_endpoint_data(
  File "/usr/lib/python3.9/site-packages/keystoneauth1/plugin.py", line 142, in get_endpoint_data
    return endpoint_data.get_versioned_data(
  File "/usr/lib/python3.9/site-packages/keystoneauth1/discover.py", line 1057, in get_versioned_data
    new_data._set_version_info(
  File "/usr/lib/python3.9/site-packages/keystoneauth1/discover.py", line 1136, in _set_version_info
    self._run_discovery(
  File "/usr/lib/python3.9/site-packages/keystoneauth1/discover.py", line 1250, in _run_discovery
    raise exceptions.DiscoveryFailure(
keystoneauth1.exceptions.discovery.DiscoveryFailure: Unable to find a version discovery document at https://192.168.120.47:6385, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.: None: None
2023-12-07 10:04:14.420 1 ERROR ironic_inspector.conductor.manager NoneType: None
2023-12-07 10:04:14.420 1 ERROR ironic_inspector.conductor.manager
~~~

- The ironic-proxy pods are waiting for an IP which is already present:

~~~
$ IRONIC_PROXY_POD=$(oc get pods -n openshift-machine-api -l baremetal.openshift.io/cluster-baremetal-operator=ironic-proxy -o name | head -1)
$ oc logs -n openshift-machine-api ${IRONIC_PROXY_POD}
...
Waiting for 198.19.0.12 to be configured on an interface
+ ip -br addr show
+ grep -q -F ' 198.19.0.12/'
+ echo 'Waiting for 198.19.0.12 to be configured on an interface'
+ sleep 1
Waiting for 198.19.0.12 to be configured on an interface
+ ip -br addr show
+ grep -q -F ' 198.19.0.12/'
+ echo 'Waiting for 198.19.0.12 to be configured on an interface'
+ sleep 1
~~~

- Verify the IP is present:

~~~
$ oc rsh -n openshift-machine-api ${IRONIC_PROXY_POD}
sh-5.1# ip -br addr show | grep -F ' 198.19.0.12/'
br-ex            UNKNOWN        198.19.0.12/24 169.254.169.2/29 198.19.0.1/32
sh-5.1# echo $?
0
~~~

- After a long time we see the same happening in the metal3-ironic-inspector container, waiting for the provisioningIP to be present:

~~~
$ oc logs -n openshift-machine-api -c metal3-ironic-inspector ${METAL3_POD}
Waiting for 192.168.120.47 to be configured on an interface
+ echo 'Waiting for 192.168.120.47 to be configured on an interface'
+ sleep 1
+ ip -br addr show
+ grep -q -F ' 192.168.120.47/'
Waiting for 192.168.120.47 to be configured on an interface
+ echo 'Waiting for 192.168.120.47 to be configured on an interface'
+ sleep 1
+ ip -br addr show
+ grep -q -F ' 192.168.120.47/'
~~~

- Verify the IP is present:

~~~
$ oc rsh -n openshift-machine-api -c metal3-ironic-inspector ${METAL3_POD}
sh-5.1# ip -br addr show | grep -F ' 192.168.120.47/'
cluster-ext@infra_10g UP             192.168.120.47/24
~~~
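While the hosts are stuck, the DiscoveryFailure above can be cross-checked by probing the Ironic API on the provisioningIP from inside the metal3 pod. This is only a sketch: it assumes curl is available in the metal3-ironic-inspector image, which may not be the case.

~~~
# Probe the endpoint ironic-inspector is failing to discover (sketch / assumption):
$ oc exec -n openshift-machine-api ${METAL3_POD} -c metal3-ironic-inspector -- \
    curl -sk -o /dev/null -w '%{http_code}\n' https://192.168.120.47:6385/v1/
# While the 'Waiting for ...' loop is stuck this fails (000 / connection refused),
# matching the DiscoveryFailure logged above.
~~~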
## Reason:
=====================
- Both the metal3-ironic-inspector container and the ironic-proxy pod execute the `/usr/bin/ironic-common.sh` script.
- This script contains the following loop:

~~~
until ip -br addr show | grep -q -F " ${IRONIC_IP}/"; do
    echo "Waiting for ${IRONIC_IP} to be configured on an interface"
    sleep 1
done
~~~

- The reason is that 'grep -q' exits as soon as it finds a match, closing the read end of the pipe before 'ip' has finished writing, so 'ip' is killed by SIGPIPE.
- In combination with the bash 'pipefail' option this gives the whole pipeline a non-zero exit code (141, i.e. 128 + SIGPIPE), so the 'until' loop never succeeds even though the IP is present.
- We can inspect this behaviour inside one of the affected containers (a standalone reproduction outside the cluster is sketched further below):

~~~
### Connect to the ironic-proxy pod
$ oc rsh -n openshift-machine-api ${IRONIC_PROXY_POD}

### Set the pipefail option
sh-5.1# set -o pipefail

### Test without -q
sh-5.1# ip -br addr show | grep -F ' 198.19.0.12/'
br-ex            UNKNOWN        198.19.0.12/24 169.254.169.2/29 198.19.0.1/32
sh-5.1# echo $?
0

### Test with -q
sh-5.1# ip -br addr show | grep -q -F ' 198.19.0.12/'
sh-5.1# echo $?
141

### Test the proposed solution without using -q
sh-5.1# ip -br addr show | grep -F ' 198.19.0.12/' >/dev/null 2>&1
sh-5.1# echo $?
0
~~~

## Proposed resolution:
=====================
- We can achieve the desired result without using -q by redirecting the output to /dev/null; grep then reads all of its input, so 'ip' can never receive SIGPIPE:

~~~
until ip -br addr show | grep -F " ${IRONIC_IP}/" >/dev/null 2>&1; do
    echo "Waiting for ${IRONIC_IP} to be configured on an interface"
    sleep 1
done
~~~
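The grep -q / pipefail interaction described in the Reason section can be reproduced on any machine with bash, independently of the cluster. A minimal sketch (not from the report; seq is used only as a producer that writes more than a pipe buffer's worth of data):

~~~
#!/bin/bash
set -o pipefail

# grep -q exits on the first match; seq is then killed by SIGPIPE on a later write,
# and with pipefail the pipeline reports 141 (128 + SIGPIPE).
seq 1 200000 | grep -q -F '1'
echo "with -q:               $?"   # 141

# Redirecting instead of using -q makes grep read all input, so no SIGPIPE and status 0.
seq 1 200000 | grep -F '1' >/dev/null 2>&1
echo "redirect to /dev/null: $?"   # 0
~~~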
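An equivalent alternative to the redirection proposed above (my own sketch, not part of the proposed fix, and assuming the script runs under bash) is to drop the pipe entirely, so there is no writer left to receive SIGPIPE no matter what pipefail is set to:

~~~
# Sketch: capture the output first, then match against it; without a pipe, pipefail is irrelevant.
until grep -q -F " ${IRONIC_IP}/" <<< "$(ip -br addr show)"; do
    echo "Waiting for ${IRONIC_IP} to be configured on an interface"
    sleep 1
done
~~~

The redirection variant is the smaller diff against the existing script; this one simply removes the SIGPIPE possibility altogether.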
## Manual workaround:
=====================
- Because the script is already running in the pod we cannot patch it, but it calls 'grep' on every iteration. We can therefore work around the issue manually by:
  - Entering the affected pods/containers (ironic-proxy and metal3:metal3-ironic-inspector)
  - Creating a copy of the original grep binary
  - Creating a wrapper around grep that detects this use of '-q' and calls the original grep with the workaround instead
  - Replacing the grep binary with the wrapper, which now wraps the copy of the original grep

Procedure:

~~~
# Connect to the affected pod
$ oc rsh -n openshift-machine-api ${IRONIC_PROXY_POD}

# Create the grep wrapper
sh-5.1# cat > /root/grep <<EOF
#!/bin/bash
if [[ "\$1" == "-q" ]]; then
    shift
    exec /usr/bin/grep.orig "\$@" >/dev/null
else
    exec /usr/bin/grep.orig "\$@"
fi
EOF

# Make the wrapper executable
sh-5.1# chmod 755 /root/grep

# Make a copy of the original grep
sh-5.1# cp /usr/bin/grep /usr/bin/grep.orig

# Replace grep with the wrapper
sh-5.1# cp /root/grep /usr/bin/grep
~~~

- This unstuck the initialization, and the registering and inspecting of the node started immediately.
- Due to the complexity, however, we cannot use this workaround in automation or production.
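One way to confirm the wrapper took effect (a sketch under the same assumptions as the workaround, reusing the ironic-proxy IP from the examples above): the exact pipeline used by ironic-common.sh should now return 0 even with pipefail set, because the wrapped grep reads all input instead of exiting on the first match.

~~~
sh-5.1# set -o pipefail
sh-5.1# ip -br addr show | grep -q -F ' 198.19.0.12/'
sh-5.1# echo $?
0
~~~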