-
Bug
-
Resolution: Duplicate
-
Undefined
-
None
-
4.13.z
-
No
-
Metal Platform 240
-
1
-
False
-
Description of problem:
We are finding that SNOs deployed via the ZTP SiteConfig plugin have problems communicating back to the metal3-ironic-inspector. The ironic-agent container, which is started in the Discovery ISO, attempts to reach the Hub cluster running RHACM over the API endpoint https://api.yukon.cars2.lab:5050. However, the podman logs on the Discovery ISO SNO server host show this:

2023-07-11 11:51:05.030 1 ERROR ironic-python-agent During handling of the above exception, another exception occurred:
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent Traceback (most recent call last):
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 699, in urlopen
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent httplib_response = self._make_request(
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 382, in _make_request
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent self._validate_conn(conn)
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent conn.connect()
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connection.py", line 353, in connect
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent conn = self._new_conn()
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connection.py", line 181, in _new_conn
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent raise NewConnectionError(
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f478b37a760>: Failed to establish a new connection: [Errno 111] ECONNREFUSED
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent During handling of the above exception, another exception occurred:
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent Traceback (most recent call last):
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent resp = conn.urlopen(
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 755, in urlopen
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent retries = retries.increment(
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent raise MaxRetryError(_pool, url, error or ResponseError(cause))
2023-07-11 11:51:05.030 1 ERROR ironic-python-agent urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='192.168.38.22', port=5050): Max retries exceeded with url: /v1/continue (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f478b37a760>: Failed to establish a new connection: [Errno 111] ECONNREFUSED'))

We see the same behavior when trying to curl the metal3-ironic-inspector through the cluster API:

# curl -k https://api.yukon.cars2.lab:5050 -4
curl: (7) Failed to connect to api.yukon.cars2.lab port 5050: Connection refused

It is worth noting that a direct curl to the host running the metal3 pod succeeds:

# curl -k https://cp2.yukon.cars2.lab:5050 -4
{"versions":[{"id":"1.18","links":[{"href":"http://cp2.yukon.cars2.lab:5050/v1","rel":"self"}],"status":"CURRENT"}]}

If we delete / restart the metal3 pod in the openshift-machine-api project on the Hub cluster, the pod moves to another node. If we then repeat the curl through the cluster API, the connection succeeds:

# curl -k https://api.yukon.cars2.lab:5050 -4
{"versions":[{"id":"1.18","links":[{"href":"http://api.yukon.cars2.lab:5050/v1","rel":"self"}],"status":"CURRENT"}]}

This behavior is sporadic, but when it happens it negatively affects our cluster installs, and we shouldn't have to "bounce" the metal3 pod to get it to respond over the router endpoint. We have not noticed this behavior in OpenShift 4.12 / RHACM 2.6 and 2.7. It is also worth noting that other VIPs that are advertised through the router function normally (console, downloads, etc).

Could we please get some help gathering logs, collecting information, and troubleshooting this?
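The manual curl comparison above could be scripted so the state of both endpoints is captured at the moment the problem occurs. The following is a minimal sketch, not part of the original report: `probe` and `compare` are hypothetical helper names, and it only checks whether a TCP connection to port 5050 is accepted or refused (it does not perform the TLS handshake or the HTTP request that curl does, and it does not force IPv4 the way `-4` does).

```python
import socket


def probe(host: str, port: int = 5050, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and classify the result.

    Returns "open" if the connection is accepted, "refused" if the peer
    actively refuses it (errno 111, as in the agent log), or
    "error: <reason>" for anything else (DNS failure, timeout, ...).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:  # includes socket.timeout and socket.gaierror
        return f"error: {exc}"


def compare(endpoints: dict, port: int = 5050, timeout: float = 3.0) -> dict:
    """Probe each named endpoint and return a {name: state} mapping."""
    return {name: probe(host, port, timeout) for name, host in endpoints.items()}
```

Calling `compare({"api VIP": "api.yukon.cars2.lab", "metal3 node": "cp2.yukon.cars2.lab"})` from a host on the same network would be expected to report "refused" for the API VIP and "open" for the node while the issue is present, matching the curl results above.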
Version-Release number of selected component (if applicable):
ACM: 2.8.0 MCH: 2.3.0 Hub OCP: 4.12.22 Live iso: rhcos-4.13.0-x86_64-live.x86_64.iso Managed cluster OCP: 4.13.4
How reproducible:
Sporadic, but has occurred on two different clusters
Steps to Reproduce:
1. Install OpenShift 4.13.4
2. Install components responsible for ZTP (RHACM, GitOps, ZTP Plugin, TALM)
3. Install a managed SNO cluster using ZTP and it will be unable to contact the metal3-ironic-inspector container running on port 5050
Actual results:
Expected results:
No connection refused messages from the ironic-agent container
Additional info:
- is related to
-
OCPBUGS-16169 [4.12] Ironic inspector service should be proxied
- Closed