-
Bug
-
Resolution: Obsolete
-
Normal
-
None
-
4.12
-
Quality / Stability / Reliability
-
False
-
-
None
-
None
-
No
-
None
-
None
-
None
-
Metal Platform 248, Metal Platform 249
-
2
-
None
-
None
-
None
-
None
-
None
-
None
-
None
Original: Description of problem:
After adding a BareMetalHost, provisioning does not initiate the registering/inspecting phase for at least 15 minutes. This makes it very difficult for automation because during that time you also don't know if something is failing or not.
Version-Release number of selected component (if applicable):
How reproducible:
Additional info:
## Issue:
=====================
- After adding a BareMetalHost, provisioning does not initiate the registering/inspecting phase for at least 15 minutes.
- This makes it very difficult for automation because during that time you also don't know if something is failing or not.
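- One way to observe the delay, for example, is to watch the BareMetalHost objects and note that the STATE column does not progress to registering/inspecting during that window (the exact columns can vary by OpenShift/metal3 version):
~~~
$ oc get bmh -n openshift-machine-api -w
~~~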
## ENV:
=====================
- Platform: BareMetal
- Install Type: IPI
- Current: 4.12.42
- Desired: 4.12.42
## Configuration:
=====================
- We have a 'cluster-ext' network for OOB (iDRAC) and customer IP range (eg: 192.168.120.0/24).
- We have a 'cluster-int' network which is the 'baremetal' network (198.19.0.0/24).
- We do not have a 'provisioning' network because of HW/BOM limitations.
- We are using 'provisioningInterface' + 'provisioningIP' to get a reachable idrac-virtualmedia URL.
~~~
$ oc get provisionings.metal3.io provisioning-configuration -o yaml
apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  ...
  name: provisioning-configuration
  ...
spec:
  preProvisioningOSDownloadURLs: {}
  provisioningIP: 192.168.120.47
  provisioningInterface: cluster-ext
  provisioningMacAddresses:
  - 0c:29:ef:8d:c3:d0
  - 0c:29:ef:8d:c3:dd
  - 0c:29:ef:8d:c3:ea
  provisioningNetwork: Disabled
  provisioningNetworkCIDR: 192.168.120.0/24
  virtualMediaViaExternalNetwork: true
~~~
## Conditions:
=====================
- This happens at least in the following circumstances:
- Always: When cluster has just been installed and provisioning config has been created/updated
- Mostly: Provisioning config has been changed
- Sometimes: Cluster has been rebooted
## Investigation:
=====================
- The metal3 pod complains that the provisioner is not ready:
~~~
$ METAL3_POD=$(oc get pods -n openshift-machine-api -l baremetal.openshift.io/cluster-baremetal-operator=metal3-state -o name)
$ oc logs -n openshift-machine-api ${METAL3_POD}
~~~{"level":"info","ts":1701943415.7587945,"logger":"controllers.BareMetalHost","msg":"provisioner is not ready","baremetalhost":"openshift-machine-api/worker-0","RequeueAfter:":30}
{"level":"info","ts":1701943415.758767,"logger":"provisioner.ironic","msg":"error caught while checking endpoint","host":"openshift-machine-api~worker-2","endpoint":"https://localhost:6385/v1/","error":"Get \"https://localhost:6385/v1\": dial tcp [::1]:6385: connect: connection refused"}
{"level":"info","ts":1701943415.7587974,"logger":"provisioner.ironic","msg":"error caught while checking endpoint","host":"openshift-machine-api~worker-3","endpoint":"https://localhost:6385/v1/","error":"Get \"https://localhost:6385/v1\": dial tcp [::1]:6385: connect: connection refused"}
{"level":"info","ts":1701943415.7588215,"logger":"controllers.BareMetalHost","msg":"provisioner is not ready","baremetalhost":"openshift-machine-api/worker-3","RequeueAfter:":30}
{"level":"info","ts":1701943415.7588148,"logger":"controllers.BareMetalHost","msg":"provisioner is not ready","baremetalhost":"openshift-machine-api/worker-2","RequeueAfter:":30}
~~~
During the first few minutes, the metal3-ironic-inspector container throws these messages:
~~~
2023-12-07 10:04:14.420 1 ERROR ironic_inspector.common.ironic [-] Failed to establish a connection with ironic, reason: Unable to find a version discovery document at https://192.168.120.47:6385, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.: keystoneauth1.exceptions.discovery.DiscoveryFailure: Unable to find a version discovery document at https://192.168.120.47:6385, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
2023-12-07 10:04:14.420 1 ERROR ironic_inspector.conductor.manager [-] The periodic ironic_inspector.conductor.manager.sync_with_ironic failed with: Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 290, in run
work()
File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 64, in __call__
return self.callback(*self.args, **self.kwargs)
File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 178, in decorator
return f(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/ironic_inspector/conductor/manager.py", line 226, in sync_with_ironic
ironic = ir_utils.get_client()
File "/usr/lib/python3.9/site-packages/ironic_inspector/common/ironic.py", line 61, in get_client
return openstack.connection.Connection(
File "/usr/lib/python3.9/site-packages/openstack/service_description.py", line 87, in __get__
proxy = self._make_proxy(instance)
File "/usr/lib/python3.9/site-packages/openstack/service_description.py", line 223, in _make_proxy
data = proxy_obj.get_endpoint_data()
File "/usr/lib/python3.9/site-packages/keystoneauth1/adapter.py", line 312, in get_endpoint_data
return self.session.get_endpoint_data(auth or self.auth, **kwargs)
File "/usr/lib/python3.9/site-packages/keystoneauth1/session.py", line 1260, in get_endpoint_data
return auth.get_endpoint_data(self, **kwargs)
File "/usr/lib/python3.9/site-packages/keystoneauth1/plugin.py", line 377, in get_endpoint_data
return super(FixedEndpointPlugin, self).get_endpoint_data(
File "/usr/lib/python3.9/site-packages/keystoneauth1/plugin.py", line 142, in get_endpoint_data
return endpoint_data.get_versioned_data(
File "/usr/lib/python3.9/site-packages/keystoneauth1/discover.py", line 1057, in get_versioned_data
new_data._set_version_info(
File "/usr/lib/python3.9/site-packages/keystoneauth1/discover.py", line 1136, in _set_version_info
self._run_discovery(
File "/usr/lib/python3.9/site-packages/keystoneauth1/discover.py", line 1250, in _run_discovery
raise exceptions.DiscoveryFailure(
keystoneauth1.exceptions.discovery.DiscoveryFailure: Unable to find a version discovery document at https://192.168.120.47:6385, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
: None: None
2023-12-07 10:04:14.420 1 ERROR ironic_inspector.conductor.manager NoneType: None
2023-12-07 10:04:14.420 1 ERROR ironic_inspector.conductor.manager
~~~
The ironic-proxy pods are waiting for an IP which is already present:
~~~
$ IRONIC_PROXY_POD=$(oc get pods -n openshift-machine-api -l baremetal.openshift.io/cluster-baremetal-operator=ironic-proxy -o name | head -1)
$ oc logs -n openshift-machine-api ${IRONIC_PROXY_POD}
...
Waiting for 198.19.0.12 to be configured on an interface
+ ip -br addr show
+ grep -q -F ' 198.19.0.12/'
+ echo 'Waiting for 198.19.0.12 to be configured on an interface'
+ sleep 1
Waiting for 198.19.0.12 to be configured on an interface
+ ip -br addr show
+ grep -q -F ' 198.19.0.12/'
+ echo 'Waiting for 198.19.0.12 to be configured on an interface'
+ sleep 1
~~~
Verify the IP is present:
~~~
# oc rsh -n openshift-machine-api ${IRONIC_PROXY_POD}
sh-5.1# ip -br addr show | grep -F ' 198.19.0.12/'
br-ex UNKNOWN 198.19.0.12/24 169.254.169.2/29 198.19.0.1/32
sh-5.1# echo $?
0
~~~
After a long time, we see the same happening in the metal3-ironic-inspector container of the metal3 pod, which is waiting for the provisioningIP to be present:
~~~
$ oc logs -n openshift-machine-api -c metal3-ironic-inspector ${METAL3_POD}
Waiting for 192.168.120.47 to be configured on an interface
+ echo 'Waiting for 192.168.120.47 to be configured on an interface'
+ sleep 1
+ ip -br addr show
+ grep -q -F ' 192.168.120.47/'
Waiting for 192.168.120.47 to be configured on an interface
+ echo 'Waiting for 192.168.120.47 to be configured on an interface'
+ sleep 1
+ ip -br addr show
+ grep -q -F ' 192.168.120.47/'
~~~
Verify the IP is present:
~~~
$ oc rsh -n openshift-machine-api -c metal3-ironic-inspector ${METAL3_POD}
sh-5.1# ip -br addr show | grep -F ' 192.168.120.47/'
cluster-ext@infra_10g UP 192.168.120.47/24
~~~
## Reason:
=====================
- Both the metal3-ironic-inspector container and the ironic-proxy pods execute the `/usr/bin/ironic-common.sh` script.
- This script contains the following loop:
~~~
until ip -br addr show | grep -q -F " ${IRONIC_IP}/"; do
    echo "Waiting for ${IRONIC_IP} to be configured on an interface"
    sleep 1
done
~~~
- The reason is that 'grep -q' exits (closing its stdin) as soon as it finds a match, breaking the pipe so that 'ip' receives SIGPIPE.
- In combination with the bash 'pipefail' option, this results in a non-zero exit code for the pipeline (exit code 141). We can inspect this behaviour inside one of the affected containers:
~~~
### Connect to ironic-proxy POD
$ oc rsh -n openshift-machine-api ${IRONIC_PROXY_POD}
### Set pipefail option
sh-5.1# set -o pipefail
### Test without -q
sh-5.1# ip -br addr show | grep -F ' 198.19.0.12/'
br-ex UNKNOWN 198.19.0.12/24 169.254.169.2/29 198.19.0.1/32
sh-5.1# echo $?
0
### Test with -q
sh-5.1# ip -br addr show | grep -q -F ' 198.19.0.12/'
sh-5.1# echo $?
141
### Test proposed solution without using -q
sh-5.1# ip -br addr show | grep -F ' 198.19.0.12/' >/dev/null 2>&1
sh-5.1# echo $?
0
~~~
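- For reference, exit code 141 is 128 + 13, i.e. the pipeline was terminated by SIGPIPE (signal 13). The same behaviour can be reproduced in any bash shell outside the cluster; the seq command below is only an illustrative stand-in for a producer that keeps writing after 'grep -q' has matched and exited:
~~~
$ bash -c 'set -o pipefail; seq 1 1000000 | grep -q 1; echo $?'
141
$ kill -l 13
PIPE
~~~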
## Proposed resolution:
=====================
- We can achieve the desired result without using -q by redirecting the output to /dev/null: without -q, grep reads its entire input before exiting, so 'ip' never receives SIGPIPE and the pipeline exits 0 under 'pipefail':
~~~
until ip -br addr show | grep -F " ${IRONIC_IP}/" >/dev/null 2>&1; do
    echo "Waiting for ${IRONIC_IP} to be configured on an interface"
    sleep 1
done
~~~
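- Using the same generic reproduction as above (seq is again just an illustrative producer, not part of the affected script), the redirect-based form exits 0 under 'pipefail' because grep consumes the entire input:
~~~
$ bash -c 'set -o pipefail; seq 1 1000000 | grep 1 >/dev/null 2>&1; echo $?'
0
~~~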
## Manual workaround:
=====================
- Because the script is already running in the pod, we cannot patch it, but it calls 'grep' on every iteration. We can work around the issue manually by:
- Entering the affected pods/containers (ironic-proxy and metal3:metal3-ironic-inspector)
- Creating a copy of the original grep binary
- Creating a wrapper around grep that detects this use of '-q' and calls the original grep with the workaround applied
- Replacing the grep binary with the wrapper, which now wraps the copy of the original grep
Procedure:
~~~
# Connect to the affected pod
$ oc rsh -n openshift-machine-api ${IRONIC_PROXY_POD}

# Create the grep wrapper
cat > /root/grep <<EOF
#!/bin/bash
if [[ "\$1" == "-q" ]]; then
  shift; exec /usr/bin/grep.orig "\$@" >/dev/null
else
  exec /usr/bin/grep.orig "\$@"
fi
EOF

# Make wrapper executable
chmod 755 /root/grep

# Make a copy of the original grep
cp /usr/bin/grep /usr/bin/grep.orig

# Replace grep with the wrapper
cp /root/grep /usr/bin/grep
~~~
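- A quick way to confirm the wrapper behaves as intended (using the same address as in the earlier ironic-proxy output) is to re-run the check from the script manually inside the pod; under 'pipefail' it should now return 0:
~~~
sh-5.1# set -o pipefail
sh-5.1# ip -br addr show | grep -q -F ' 198.19.0.12/'
sh-5.1# echo $?
0
~~~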
- This unstuck the initialization, and the registering/inspecting process of the node started immediately. However, due to its complexity, we cannot use this workaround in automation or production.