Loading...

Type: Bug
Resolution: Obsolete
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.12
Component/s: Bare Metal Hardware Provisioning / cluster-api-provider
Labels:
- triaged

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Story Points:
None
Severity:
None
Regression:
No

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
Metal Platform 248, Metal Platform 249
sprint_count:
2

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Priority Data:
PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Original: Description of problem:

After adding a BareMetalHost, provisioning does not initiate the registering/inspecting phase for at least 15 minutes. This makes it very difficult for automation because during that time you also don't know if something is failing or not.

Version-Release number of selected component (if applicable):

How reproducible:

Additional info:

## Issue:
=====================
- After adding a BareMetalHost, provisioning does not initiate the registering/inspecting phase for at least 15 minutes.
- This makes it very difficult for automation because during that time you also don't know if something is failing or not.


## ENV:
=====================
- Platform: BareMetal
- Install Type: IPI
- Current:        4.12.42
- Desired:        4.12.42



## Configuration:
=====================
- We have a 'cluster-ext' network for OOB (iDRAC) and customer IP range (eg: 192.168.120.0/24).
- We have a 'cluster-int' network which is the 'baremetal' network (198.19.0.0/24).
- We do not have a 'provisioning' network because of HW/BOM limitations.
- We are using 'provisioningInterface' + 'provisioningIP' for getting a reachable idrac-virtualmedia URL.```
~~~
$ oc get provisionings.metal3.io provisioning-configuration -o yaml
apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  ...
  name: provisioning-configuration
  ...
spec:
  preProvisioningOSDownloadURLs: {}
  provisioningIP: 192.168.120.47
  provisioningInterface: cluster-ext
  provisioningMacAddresses:
  - 0c:29:ef:8d:c3:d0
  - 0c:29:ef:8d:c3:dd
  - 0c:29:ef:8d:c3:ea
  provisioningNetwork: Disabled
  provisioningNetworkCIDR: 192.168.120.0/24
  virtualMediaViaExternalNetwork: true

~~~

## Conditions:
=====================
- This happens at least in the following circumstances:
- Always: When cluster has just been installed and provisioning config has been created/updated
- Mostly: Provisioning config has been changed
- Sometimes: Cluster has been rebooted




## Investigation:
=====================
- The metal3 complains the provisioner is not ready:
~~~
$ METAL3_POD=$(oc get pods -n openshift-machine-api -l baremetal.openshift.io/cluster-baremetal-operator=metal3-state -o name)

$ oc logs -n openshift-machine-api ${METAL3_POD}
~~~{"level":"info","ts":1701943415.7587945,"logger":"controllers.BareMetalHost","msg":"provisioner is not ready","baremetalhost":"openshift-machine-api/worker-0","RequeueAfter:":30}
{"level":"info","ts":1701943415.758767,"logger":"provisioner.ironic","msg":"error caught while checking endpoint","host":"openshift-machine-api~worker-2","endpoint":"https://localhost:6385/v1/","error":"Get \"https://localhost:6385/v1\": dial tcp [::1]:6385: connect: connection refused"}
{"level":"info","ts":1701943415.7587974,"logger":"provisioner.ironic","msg":"error caught while checking endpoint","host":"openshift-machine-api~worker-3","endpoint":"https://localhost:6385/v1/","error":"Get \"https://localhost:6385/v1\": dial tcp [::1]:6385: connect: connection refused"}
{"level":"info","ts":1701943415.7588215,"logger":"controllers.BareMetalHost","msg":"provisioner is not ready","baremetalhost":"openshift-machine-api/worker-3","RequeueAfter:":30}
{"level":"info","ts":1701943415.7588148,"logger":"controllers.BareMetalHost","msg":"provisioner is not ready","baremetalhost":"openshift-machine-api/worker-2","RequeueAfter:":30}
~~~

During the first few minutes the metal3-ironic-inspector is throwing this meesages:
~~~
2023-12-07 10:04:14.420 1 ERROR ironic_inspector.common.ironic [-] Failed to establish a connection with ironic, reason: Unable to find a version discovery document at https://192.168.120.47:6385, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.: keystoneauth1.exceptions.discovery.DiscoveryFailure: Unable to find a version discovery document at https://192.168.120.47:6385, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
2023-12-07 10:04:14.420 1 ERROR ironic_inspector.conductor.manager [-] The periodic ironic_inspector.conductor.manager.sync_with_ironic failed with: Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 290, in run
    work()
  File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 64, in __call__
    return self.callback(*self.args, **self.kwargs)
  File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 178, in decorator
    return f(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/ironic_inspector/conductor/manager.py", line 226, in sync_with_ironic
    ironic = ir_utils.get_client()
  File "/usr/lib/python3.9/site-packages/ironic_inspector/common/ironic.py", line 61, in get_client
    return openstack.connection.Connection(
  File "/usr/lib/python3.9/site-packages/openstack/service_description.py", line 87, in __get__
    proxy = self._make_proxy(instance)
  File "/usr/lib/python3.9/site-packages/openstack/service_description.py", line 223, in _make_proxy
    data = proxy_obj.get_endpoint_data()
  File "/usr/lib/python3.9/site-packages/keystoneauth1/adapter.py", line 312, in get_endpoint_data
    return self.session.get_endpoint_data(auth or self.auth, **kwargs)
  File "/usr/lib/python3.9/site-packages/keystoneauth1/session.py", line 1260, in get_endpoint_data
    return auth.get_endpoint_data(self, **kwargs)
  File "/usr/lib/python3.9/site-packages/keystoneauth1/plugin.py", line 377, in get_endpoint_data
    return super(FixedEndpointPlugin, self).get_endpoint_data(
  File "/usr/lib/python3.9/site-packages/keystoneauth1/plugin.py", line 142, in get_endpoint_data
    return endpoint_data.get_versioned_data(
  File "/usr/lib/python3.9/site-packages/keystoneauth1/discover.py", line 1057, in get_versioned_data
    new_data._set_version_info(
  File "/usr/lib/python3.9/site-packages/keystoneauth1/discover.py", line 1136, in _set_version_info
    self._run_discovery(
  File "/usr/lib/python3.9/site-packages/keystoneauth1/discover.py", line 1250, in _run_discovery
    raise exceptions.DiscoveryFailure(
keystoneauth1.exceptions.discovery.DiscoveryFailure: Unable to find a version discovery document at https://192.168.120.47:6385, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
: None: None
2023-12-07 10:04:14.420 1 ERROR ironic_inspector.conductor.manager NoneType: None
2023-12-07 10:04:14.420 1 ERROR ironic_inspector.conductor.manager 
~~~

The ironic-proxy pods are waiting for an IP which is already present:
~~~
$ IRONIC_PROXY_POD=$(oc get pods -n openshift-machine-api -l baremetal.openshift.io/cluster-baremetal-operator=ironic-proxy -o name | head -1)
$ oc logs -n openshift-machine-api ${IRONIC_PROXY_POD}
...
Waiting for 198.19.0.12 to be configured on an interface
+ ip -br addr show
+ grep -q -F ' 198.19.0.12/'
+ echo 'Waiting for 198.19.0.12 to be configured on an interface'
+ sleep 1
Waiting for 198.19.0.12 to be configured on an interface
+ ip -br addr show
+ grep -q -F ' 198.19.0.12/'
+ echo 'Waiting for 198.19.0.12 to be configured on an interface'
+ sleep 1
~~~


Verify the IP is present:
~~~
# oc rsh -n openshift-machine-api ${IRONIC_PROXY_POD}
sh-5.1# ip -br addr show | grep -F ' 198.19.0.12/'
br-ex            UNKNOWN        198.19.0.12/24 169.254.169.2/29 198.19.0.1/32 
sh-5.1# echo $?
0
~~~

After a long time we see the same happening in the metal3-ironic-inspector pod waiting for the provisioningIP to be present:

$ oc logs -n openshift-machine-api -c metal3-ironic-inspector ${METAL3_POD}
~~~
Waiting for 192.168.120.47 to be configured on an interface
+ echo 'Waiting for 192.168.120.47 to be configured on an interface'
+ sleep 1
+ ip -br addr show
+ grep -q -F ' 192.168.120.47/'
Waiting for 192.168.120.47 to be configured on an interface
+ echo 'Waiting for 192.168.120.47 to be configured on an interface'
+ sleep 1
+ ip -br addr show
+ grep -q -F ' 192.168.120.47/'
~~~


Verify the IP is present:
~~~
$ oc rsh -n openshift-machine-api -c metal3-ironic-inspector ${METAL3_POD}
sh-5.1# ip -br addr show | grep -F ' 192.168.120.47/'
cluster-ext@infra_10g UP             192.168.120.47/24 
~~~


## Reason:
=====================
- Both the metal3-ironic-inspector and ironic-proxy pod are executing the `/usr/bin/ironic-common.sh` script.
- This script contains the following loop:
~~~
    until ip -br addr show | grep -q -F " ${IRONIC_IP}/"; do
      echo "Waiting for ${IRONIC_IP} to be configured on an interface???"
      sleep 1
    done
~~~


- The reason is that 'grep -q' will close() the STDIN stream upon first matching, breaking the pipe.
- This in combination with the bash 'pipefail' option will result in a non-zero exit code (ecit code 141).We can inspect this behaviour inside one of the affected containers:
~~~
### Connect to ironic-proxy POD
$ oc rsh -n openshift-machine-api ${IRONIC_PROXY_POD}### Set pipefail option
sh-5.1# set -o pipefail### Test without -q
sh-5.1# ip -br addr show | grep -F ' 198.19.0.12/'
br-ex            UNKNOWN        198.19.0.12/24 169.254.169.2/29 198.19.0.1/32

sh-5.1# echo $?
0

### Test with -q
sh-5.1# ip -br addr show | grep -q -F ' 198.19.0.12/'
sh-5.1# echo $?
141

### Test proposed solution without using -q
sh-5.1# ip -br addr show | grep -F ' 198.19.0.12/' >/dev/null 2>&1
sh-5.1# echo $?
0
~~~


## Proposed resolution:
=====================
- We can achieve the desired result without using -q by redirecting the output to /dev/null:
~~~
    until ip -br addr show | grep -F " ${IRONIC_IP}/" >/dev/null 2>&1; do
      echo "Waiting for ${IRONIC_IP} to be configured on an interface???"
      sleep 1
    done
~~~


## Manual workaround:  
=====================
- Because the script is already running in the pod, we cannot patch it, but it is calling 'grep' every time.We can work around the issue manually by:
- Entering the affected pods/containers (ironic-proxy and metal3:metal3-ironic-inspector)
- Create a copy of the original grep binary
- Create a wrapper around grep, detecting this use of '-q' and calling the original grep using a workaround
- Replacing the grep binary with the wrapper, which now wraps around the copy of the original grep

Procedure:
~~~
# Connect to the affected pod
$ oc rsh -n openshift-machine-api ${IRONIC_PROXY_POD}# Create the grep wrapper
cat > /root/grep <<EOF
#!/bin/bashif [[ "\$1" == "-q" ]]; then
  shift; exec /usr/bin/grep.orig \$@ >/dev/null
else
  exec /usr/bin/grep.orig \$@
fi
EOF
.
.
# Make wrapper executable
chmod 755 /root/grep# Make a copy of the original grep
cp /usr/bin/grep /usr/bin/grep.orig# Replace grep with the wrapper
cp /root/grep /usr/bin/grep
~~~

- This unstuck the initialization and immediately started the registering and inspecting process of the node.Due to the complexity however, we cannot use this workaround in automation or production.

Details

Description

Attachments

Easy Agile Planning Poker

Activity

People

Dates