OCPBUGS-44026

BaremetalHosts stuck in 'provisioning' state despite deploy_step success



      Description of problem:

      During a ZTP install (by rhn-gps-rspazzol), the BMHs remain stuck in the "provisioning" state while the Agent CRs show "Pending user interaction".
      It seems possible that the cause is the metal3 Pod restarting during deployment and coming back up with a fresh ironic database.

      Version-Release number of selected component (if applicable):

      Likely affects all extant versions of the metal platform when used with the ZTP converged flow.

      How reproducible:

      Reportedly happens regularly. The reproduction steps below have not been run repeatedly, but they are likely to reproduce the issue 100% of the time.

      Steps to Reproduce:

      1. Start provisioning a cluster with ZTP
      2. While one or more BMHs are in the "provisioning" state, delete the metal3 pod
      3. Observe the BMHs stuck in provisioning forever
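      Step 2 can be scripted. Below is a minimal sketch using the kubernetes Python client; the openshift-machine-api namespace and the "metal3-" pod name prefix are assumptions based on a standard OCP baremetal deployment and may need adjusting.

      # Hypothetical reproduction helper: while at least one BMH is in the
      # "provisioning" state, delete the metal3 pod to force a restart (and,
      # with it, a fresh ironic database).
      from kubernetes import client, config

      config.load_kube_config()
      core = client.CoreV1Api()
      custom = client.CustomObjectsApi()

      # Find BareMetalHosts currently in the "provisioning" state.
      bmhs = custom.list_cluster_custom_object(
          group="metal3.io", version="v1alpha1", plural="baremetalhosts")
      provisioning = [
          b["metadata"]["name"] for b in bmhs["items"]
          if b.get("status", {}).get("provisioning", {}).get("state") == "provisioning"
      ]
      print("BMHs in provisioning:", provisioning)

      # Delete the metal3 pod while provisioning is in progress.
      if provisioning:
          for pod in core.list_namespaced_pod("openshift-machine-api").items:
              if pod.metadata.name.startswith("metal3-"):
                  core.delete_namespaced_pod(pod.metadata.name, "openshift-machine-api")
                  print("Deleted", pod.metadata.name)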

      Actual results:

      The assisted-installer successfully finishes writing CoreOS to disk and stops the agent service, ready for the host to be rebooted:

      Oct 30 14:20:37 localhost.localdomain assisted-installer[45058]: time="2024-10-30T14:20:37Z" level=info msg="Updating node installation stage: Rebooting - Ironic will reboot the node shortly" request_id=9d003e7b-b2f1-49b2-8c72-f122c7b69a68
      Oct 30 14:20:37 localhost.localdomain assisted-installer[45058]: time="2024-10-30T14:20:37Z" level=info msg="Running systemctl stop [agent.service]"
      Oct 30 14:20:37 localhost.localdomain systemd[1]: Stopping agent.service...
      

      However, by this point IPA heartbeats have already been failing for some time with the following error:

      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent [-] error sending heartbeat to ['https://10.9.49.125:6385']: ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent Traceback (most recent call last):
      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 148, in do_heartbeat
      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent     self.api.heartbeat(
      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py", line 200, in heartbeat
      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent     raise errors.HeartbeatError(error)
      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent 
      

      Checking the BMH object, it has not been disconnected from ironic, and it shows a different provisioning ID from the node ID that IPA is looking up in ironic (6f7546a2-f49e-4d8d-88f6-a462d53868b6). This suggests that ironic has been restarted with a fresh database and a new node created with a fresh ID.
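      A quick way to confirm the mismatch (hypothetical BMH name and namespace, again using the kubernetes Python client) is to compare the node ID recorded in the BMH status with the node ID from the IPA logs:

      # Compare the BMH's recorded ironic node ID with the ID IPA is trying
      # to heartbeat against; a mismatch indicates the ironic database was
      # recreated underneath the running deployment.
      from kubernetes import client, config

      config.load_kube_config()
      custom = client.CustomObjectsApi()

      bmh = custom.get_namespaced_custom_object(
          group="metal3.io", version="v1alpha1",
          namespace="example-cluster",                  # placeholder namespace
          plural="baremetalhosts", name="example-bmh")  # placeholder BMH name

      bmh_node_id = bmh.get("status", {}).get("provisioning", {}).get("ID")
      ipa_node_id = "6f7546a2-f49e-4d8d-88f6-a462d53868b6"  # from the IPA logs above

      print("BMH provisioning ID:", bmh_node_id)
      print("IPA node ID        :", ipa_node_id)
      print("IDs differ" if bmh_node_id != ipa_node_id else "IDs match")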

      Just after the installation succeeds, we also get some heartbeat errors of the form:

      Oct 30 14:20:53 localhost.localdomain podman[3035]: 2024-10-30 14:20:53.707 1 ERROR ironic_python_agent.agent [-] error sending heartbeat to ['https://10.9.49.125:6385']: ironic_python_agent.errors.HeartbeatConnectionError: Error attempting to heartbeat - Possible transitory network failure or blocking port may be present.: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
      

      This continues to happen intermittently, but mostly the logs are full of the "node could not be found" errors.

      Expected results:

      After the custom deploy method successfully completes, IPA should inform ironic of this:

      Oct 30 14:20:48 localhost.localdomain podman[3035]: 2024-10-30 14:20:47.939 1 DEBUG ironic_coreos_install [-] Assisted Agent is in state inactive (result success) _is_assisted_running /usr/lib/python3.9/site-packages/ironic_coreos_install.py:136
      Oct 30 14:20:48 localhost.localdomain podman[3035]: 2024-10-30 14:20:47.947 1 INFO ironic_coreos_install [-] Succesfully installed using the assisted agent
      Oct 30 14:20:48 localhost.localdomain podman[3035]: 2024-10-30 14:20:47.951 1 INFO root [-] Deploy step completed: {'args': {}, 'interface': 'deploy', 'priority': 80, 'step': 'start_assisted_install'}, result: None
      Oct 30 14:20:48 localhost.localdomain podman[3035]: 2024-10-30 14:20:47.951 1 INFO root [-] Asynchronous command execute_deploy_step completed: {'deploy_result': None, 'deploy_step': {'args': {}, 'interface': 'deploy', 'priority': 80, 'step': 'start_assisted_install'}}
      

      Once ironic knows that deployment is complete, it will reboot the host, which will come up as a Node and join the cluster, and the BMH provisioning state will change to provisioned.

      This would presumably work if we either restarted IPA or looked up the node ID again when heartbeating fails. (We may want to limit this behaviour to Metal³, since changes of node ID are not expected in regular OpenStack deployments.)
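      As a rough illustration of the second option (a sketch only, not IPA's actual heartbeat code; heartbeat(), lookup_node() and NodeNotFound are hypothetical stand-ins for the corresponding IPA client pieces):

      import time

      class NodeNotFound(Exception):
          """Stand-in for the 404 'node could not be found' heartbeat error."""

      def heartbeat_loop(api, node_uuid: str, interval: float = 30.0) -> None:
          """Heartbeat forever, re-looking-up the node if its UUID disappears.

          `api` is assumed to expose heartbeat(node_uuid) and lookup_node(),
          loosely mirroring (but not identical to) IPA's ironic API client.
          """
          while True:
              try:
                  api.heartbeat(node_uuid)
              except NodeNotFound:
                  # Ironic came back with a fresh database, so the node was
                  # re-created under a new UUID; look it up again and carry on.
                  node_uuid = api.lookup_node()
              time.sleep(interval)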

      Additional info:

      Note that when an ironic node is in the deploying state, Metal³ cannot even attempt to deprovision the host (which is also required for deleting the BMH); it must wait for the state to change to active (which won't happen when this bug is triggered) or deploy-wait (which doesn't seem to happen either). So this bug is a possible trigger of other reported bugs where deprovisioning never completes, such as OCPBUGS-38253.

      It seems likely that the same issue would also affect regular (non-assisted) baremetal deployments that use the install_coreos custom deploy method.
