OpenShift Bugs / OCPBUGS-61298

[ACM] ACM cannot access ironic-python-agent in a dual-stack environment when only IPv6 is unreachable



      Description of problem:

      A bare-metal host intermittently fails to be detected when all of the following conditions are met:
      - The ACM hub cluster is IPv4/IPv6 dual-stack
      - The managed cluster and its NMStateConfig are IPv4/IPv6 dual-stack
      - Managed nodes are added through a BMH resource, not the discovery ISO
      - IPv4 is reachable between ACM and the managed nodes
      - IPv6 is unreachable between ACM and the managed nodes
      
      In this case, after we create a BMH resource, the managed node is sometimes not detected.
      ACM cannot reach ironic-python-agent, agent.service doesn't start, and the Agent resource is not created.
      
      ironic-python-agent sometimes advertises an IPv6 URL to ACM:
      
      ~~~
      ironic-agent[XXX]: 2025-01-01 00:00:00.000 1 DEBUG ironic_python_agent.ironic_api_client [-] Heartbeat: announcing callback URL https://[2001:db8::1]:9999, API version is 1.68 heartbeat /usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py:186
      ~~~
      
      Because IPv6 is unreachable, metal3-ironic on the ACM hub logs the following error:
      ~~~
      2025-01-01 00:00:00.000 1 ERROR ironic.drivers.modules.agent_client [-] Failed to connect to the agent running on node XXXXXXXX to collect commands status. Error: HTTPSConnectionPool(host='2001:db8::1', port=9999): Max retries exceeded with url: /v1/commands/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0xXXXXXX>, 'Connection to 2001:db8::1 timed out. (connect timeout=60)')): requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='2001:db8::1', port=9999): Max retries exceeded with url: /v1/commands/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0xXXXXXX>, 'Connection to 2001:db8::1 timed out. (connect timeout=60)'))
      ~~~
      
      Then agent.service doesn't start on the managed node and the node is not detected by ACM.
      
      At other times ironic-python-agent advertises an IPv4 URL to ACM, as in the log below.
      In that case, everything works.
      Which address family gets advertised appears to be random.
      ~~~
      ironic-agent[XXX]: 2025-01-01 00:00:00.000 1 DEBUG ironic_python_agent.ironic_api_client [-] Heartbeat: announcing callback URL https://192.0.2.1:9999, API version is 1.68 heartbeat /usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py:186
      ~~~

      Version-Release number of selected component (if applicable):

      OCP 4.18
      ACM 2.13

      How reproducible:

      Steps to Reproduce:

      1. Prepare the ACM hub cluster and managed nodes. IPv4 is reachable between ACM and managed nodes, but IPv6 is not reachable.
      2. Create an ACM Hub Cluster with IPv4/IPv6 dual-stack
      3. Create an NMStateConfig resource with IPv4/IPv6 dual-stack settings for the managed nodes
      4. Create an InfraEnv resource
      5. Create a BMH resource
      6. The managed node starts automatically, but it is intermittently not detected because of the errors above

      Actual results:

      ACM metal3-ironic cannot connect to the managed nodes

      Expected results:

      Provisioning works over IPv4 even if IPv6 is not reachable

      Workaround:

      The advertised IP address can be specified with the advertise_host setting.
      Setting it manually on the managed node resolves the issue.
      
      ~~~
      # vi /etc/ironic-python-agent.conf
         :
      [DEFAULT]
      advertise_host = 192.0.2.1
          :
      
      # systemctl restart ironic-agent.service
      ~~~
      
      However, we need to make this config change manually on each managed node after the issue occurs.

      Additional info:

      I checked the source code.
      The advertised IP address is determined by the following code:
      
      https://github.com/openstack/ironic-python-agent/blob/master/ironic_python_agent/agent.py#L344
      https://github.com/openstack/ironic-python-agent/blob/master/ironic_python_agent/agent.py#L305-L324
      ~~~
          def _find_routable_addr(self):
              ips = set()
              for api_url in self.api_urls:
                  ironic_host = urlparse.urlparse(api_url).hostname
                  # Try resolving it in case it's not an IP address
                  try:
                      addrs = socket.getaddrinfo(ironic_host, 0)
                  except socket.gaierror:
                      LOG.debug('Could not resolve %s, maybe no DNS', ironic_host)
                      ips.add(ironic_host)
                      continue
                  ips.update(addr for _, _, _, _, (addr, *_) in addrs)
      
              for attempt in range(self.ip_lookup_attempts):
                  for ironic_host in ips:
                      found_ip = self._get_route_source(ironic_host)
                      if found_ip:
                          return found_ip
      
                  time.sleep(self.ip_lookup_sleep)
      ~~~
      
      In a dual-stack environment, api_urls contains both an IPv4 address and an IPv6 address:
      
      ~~~
      # cat /etc/ironic-python-agent.conf
      
      [DEFAULT]
      api_url = https://192.0.2.2:6385,https://[2001:db8::2]:6385
        :
      ~~~
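
To illustrate how the two URLs turn into two candidate hosts, here is a minimal sketch that mimics the hostname extraction in _find_routable_addr(). The helper name api_hosts is hypothetical, and splitting the value on commas is an assumption about how the api_url setting is turned into a list. The addresses are the documentation examples used above.

```python
import ipaddress
from urllib.parse import urlparse


def api_hosts(api_url_value):
    """Extract the hostnames from a comma-separated api_url value.

    Hypothetical helper for illustration only; it is not part of
    ironic-python-agent.
    """
    hosts = []
    for url in api_url_value.split(","):
        # urlparse strips the brackets around an IPv6 literal.
        hosts.append(urlparse(url.strip()).hostname)
    return hosts


hosts = api_hosts("https://192.0.2.2:6385,https://[2001:db8::2]:6385")
print(hosts)  # ['192.0.2.2', '2001:db8::2']
print([ipaddress.ip_address(h).version for h in hosts])  # [4, 6]
```

Both address families end up as candidates, so either one can be picked as the route lookup target.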
      
      The above code determines the advertised IP address using the "ip route" command.
      It checks only the routing table, not actual connectivity.
      That's why an unreachable IPv6 address can be selected.
      
      Additionally, the candidates are collected in "ips = set()".
      A Python set has no guaranteed iteration order, and string hashing is randomized per process by default, so the subsequent "for" loop may try either the IPv4 address or the IPv6 address first.
      If it tries IPv4 first, the IPv4 URL is advertised and it works.
      If it tries IPv6 first, the IPv6 URL is advertised and it doesn't work.
      That's why the issue occurs randomly.
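
The ordering part of the analysis can be illustrated with a small sketch that sorts the candidates before the loop, so the same address family is always tried first. This is not the upstream code or an agreed fix. The helper order_candidates and its prefer_version parameter are hypothetical. It only removes the randomness and does not check reachability.

```python
import ipaddress


def order_candidates(hosts, prefer_version=4):
    """Return hosts in a stable order, preferred IP version first.

    Hypothetical helper, not part of ironic-python-agent. Entries that are
    not IP literals (for example unresolved hostnames) are placed last.
    """
    def sort_key(host):
        try:
            version = ipaddress.ip_address(host).version
        except ValueError:
            return (2, host)
        return (0 if version == prefer_version else 1, host)

    return sorted(hosts, key=sort_key)


candidates = {"2001:db8::2", "192.0.2.2"}  # a set, as in the original code
print(order_candidates(candidates))  # ['192.0.2.2', '2001:db8::2']
```

With a fixed order the behaviour becomes repeatable, but a host whose preferred family is the unreachable one would then fail every time instead of randomly.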

      Fix ideas:

      I came up with some ideas to fix this issue.
      What do you think?
      
      A: In the _find_routable_addr() function, check actual network connectivity instead of only checking the routing table.
         If an address is not reachable, try another IP address.
         I think this is the easiest and best fix.
      
      B: Add parameters to the InfraEnv CR to specify which network is used for provisioning.
         If we can specify the reachable network address in the CR, the issue is avoided.
         But I understand that this would be an RFE.
      
      C: Add a parameter to the BMH resource for setting kernel arguments.
         ironic-python-agent can read settings from kernel arguments, and InfraEnv can currently add KernelArguments to the discovery ISO.
         But to avoid the issue, we need to set a different ipa-advertise-host value per BMH.
         InfraEnv cannot apply different settings per node.
         If BMH had a parameter for kernel arguments, we could set a different ipa-advertise-host per host and avoid the issue.
         But I understand that this is also an RFE.
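
Idea A could look roughly like the following sketch, which tries a TCP connection to each Ironic API endpoint and returns the local source address of the first one that answers. This is only an illustration under assumptions: find_reachable_source is a hypothetical helper, not existing ironic-python-agent code, and it tests the agent-to-Ironic direction on the assumption that reachability is symmetric, whereas the failure in this report is Ironic connecting back to the agent.

```python
import socket


def find_reachable_source(ironic_endpoints, timeout=5.0):
    """Return the local address used to reach the first reachable endpoint.

    ironic_endpoints is an iterable of (host, port) pairs. Hypothetical
    helper for illustration; returns None if no endpoint accepts a TCP
    connection within the timeout.
    """
    for host, port in ironic_endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                # The first element is the local IP for both IPv4 and IPv6.
                return sock.getsockname()[0]
        except OSError:
            # Timeout, refused connection, or no route: try the next one.
            continue
    return None


# Example call with the documentation addresses from this report:
# find_reachable_source([("192.0.2.2", 6385), ("2001:db8::2", 6385)])
```

An unreachable IPv6 endpoint would time out and the loop would move on to the IPv4 endpoint, at the cost of waiting up to the timeout for each unreachable candidate.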

              rpittau@redhat.com Riccardo Pittau
              rhn-support-yatanaka Yamato Tanaka
              Yamato Tanaka
              Steven Skeard
              Vladislav Kolodny Vladislav Kolodny