OCPBUGS-16061: metal3 / ironic-python-agent buggy in 4.13 paired with RHACM 2.8


      Description of problem:

      We are finding that SNOs deployed via the ZTP SiteConfig plugin are having problems communicating back to the metal3-ironic-inspector. The ironic-agent container, which is started from the Discovery ISO, attempts to reach the Hub cluster running RHACM over the API endpoint https://api.yukon.cars2.lab:5050.
      
      However, in the podman logs on the SNO host booted from the Discovery ISO, we see this:
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent During handling of the above exception, another exception occurred:
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent 
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent Traceback (most recent call last):
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 699, in urlopen
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent     httplib_response = self._make_request(
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 382, in _make_request
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent     self._validate_conn(conn)
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent     conn.connect()
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/connection.py", line 353, in connect
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent     conn = self._new_conn()
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/connection.py", line 181, in _new_conn
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent     raise NewConnectionError(
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f478b37a760>: Failed to establish a new connection: [Errno 111] ECONNREFUSED
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent 
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent During handling of the above exception, another exception occurred:
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent 
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent Traceback (most recent call last):
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent     resp = conn.urlopen(
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 755, in urlopen
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent     retries = retries.increment(
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent     raise MaxRetryError(_pool, url, error or ResponseError(cause))
      2023-07-11 11:51:05.030 1 ERROR ironic-python-agent urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='192.168.38.22', port=5050): Max retries exceeded with url: /v1/continue (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f478b37a760>: Failed to establish a new connection: [Errno 111] ECONNREFUSED'))
      
      We see the same behavior trying to curl the metal3-ironic-inspector through the cluster API: 
      # curl -k https://api.yukon.cars2.lab:5050 -4
      curl: (7) Failed to connect to api.yukon.cars2.lab port 5050: Connection refused
      
      It is worth noting that a direct curl to the host running the metal3 pod succeeds:
      # curl -k https://cp2.yukon.cars2.lab:5050 -4
      {"versions":[{"id":"1.18","links":[{"href":"http://cp2.yukon.cars2.lab:5050/v1","rel":"self"}],"status":"CURRENT"}]}
      
      If we look in the openshift-machine-api project on the Hub cluster, we can delete/restart the metal3 pod, and it gets rescheduled onto another node. If we then repeat the curl, the connection succeeds:
      # curl -k https://api.yukon.cars2.lab:5050 -4
      {"versions":[{"id":"1.18","links":[{"href":"http://api.yukon.cars2.lab:5050/v1","rel":"self"}],"status":"CURRENT"}]}
      
      This behavior is sporadic, but when it occurs it negatively affects our cluster installs, and we should not have to "bounce" the metal3 pod to get it to respond on that endpoint. We have not seen this behavior with OpenShift 4.12 / RHACM 2.6 and 2.7. It is also worth noting that other VIPs advertised through the router (console, downloads, etc.) function normally.
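      
      For reference, the "bounce" above is just a pod delete in the openshift-machine-api namespace (the pod name is taken from the listing and is only illustrative here):
      # oc -n openshift-machine-api get pods -o wide | grep metal3
      # oc -n openshift-machine-api delete pod <metal3-pod-name>
      
      When the failure is present, it may also be worth capturing whether the metal3 pod and the API VIP (192.168.38.22) are sitting on the same control-plane node at that moment; the node name below is only an example:
      # oc debug node/cp2.yukon.cars2.lab -- chroot /host ip -4 addr show | grep 192.168.38.22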
      
      Could we please get some help with gathering logs, collecting information, and troubleshooting this? A rough list of what we can pull together is sketched below.
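      
      On the hub side, for example (the container names inside the metal3 pod vary by release, so we would list them first; names below are illustrative):
      # oc adm must-gather
      # oc -n openshift-machine-api get pod <metal3-pod-name> -o jsonpath='{.spec.containers[*].name}'
      # oc -n openshift-machine-api logs <metal3-pod-name> -c metal3-ironic-inspector
      
      And on the SNO host booted from the Discovery ISO:
      # podman ps
      # podman logs <ironic-agent-container>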

      Version-Release number of selected component (if applicable):

      ACM: 2.8.0
      MCH: 2.3.0
      Hub OCP: 4.12.22
      Live iso: rhcos-4.13.0-x86_64-live.x86_64.iso
      Managed cluster OCP: 4.13.4

      How reproducible:

      Sporadic, but has occurred on two different clusters

      Steps to Reproduce:

      1. Install OpenShift 4.13.4
      2. Install components responsible for ZTP (RHACM, GitOps, ZTP Plugin, TALM)
      3. Install a managed SNO cluster using ZTP; the ironic-agent on the host will be unable to contact the metal3-ironic-inspector listening on port 5050 (see the check sketched after this list)
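      
      As a rough confirmation from the hub while the install is stuck (namespaces depend on the SiteConfig), the BareMetalHost state and the inspector endpoint can be checked:
      # oc get bmh -A
      # curl -k https://api.yukon.cars2.lab:5050 -4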
      

      Actual results:

      The ironic-agent container on the Discovery ISO host gets connection refused when contacting the metal3-ironic-inspector through https://api.yukon.cars2.lab:5050, until the metal3 pod is deleted and rescheduled.

      Expected results:

      No connection refused messages from the ironic-agent container

      Additional info:

       
