Issue Type: Bug
Resolution: Unresolved
Priority: Undefined
Severity: Important
Affects Version: 4.16
Component: Metal Platform
Description of problem:
During a ZTP install (by rhn-gps-rspazzol), the BMHs remain stuck in the "provisioning" state while the Agent CRs show "Pending user interaction".
It seems possible that the cause is the metal3 Pod restarting during deployment and coming back up with a fresh ironic database.
Version-Release number of selected component (if applicable):
Likely affects all extant versions of the metal platform when used with the ZTP converged flow.
How reproducible:
Reportedly happens regularly; the reproduction steps below have not been run repeatedly, but reproduction is likely to be 100%.
Steps to Reproduce:
- Start provisioning a cluster with ZTP
- While one or more BMHs are in the "provisioning" state, delete the metal3 pod (a sketch of this step follows the list)
- Observe the BMHs stuck in provisioning forever
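For the pod-deletion step, a minimal sketch using the kubernetes Python client is below. On OpenShift the metal3 pod normally runs in the openshift-machine-api namespace, but both that namespace and the "metal3-" name prefix used to find the pod are assumptions here and may differ in other deployments.

from kubernetes import client, config

# Load credentials from the local kubeconfig (use load_incluster_config()
# instead when running inside the cluster).
config.load_kube_config()
core = client.CoreV1Api()

# Find the metal3 pod by name prefix; namespace and prefix are assumptions.
namespace = "openshift-machine-api"
pods = core.list_namespaced_pod(namespace=namespace)
metal3 = next((p for p in pods.items if p.metadata.name.startswith("metal3-")), None)

if metal3 is not None:
    # Deleting the pod forces a restart; the Deployment recreates it, and
    # the fresh ironic database in the new pod is what triggers this bug.
    core.delete_namespaced_pod(metal3.metadata.name, namespace)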
Actual results:
The assisted-installer completes writing CoreOS to disk successfully and stops the agent service ready to restart the host:
Oct 30 14:20:37 localhost.localdomain assisted-installer[45058]: time="2024-10-30T14:20:37Z" level=info msg="Updating node installation stage: Rebooting - Ironic will reboot the node shortly" request_id=9d003e7b-b2f1-49b2-8c72-f122c7b69a68
Oct 30 14:20:37 localhost.localdomain assisted-installer[45058]: time="2024-10-30T14:20:37Z" level=info msg="Running systemctl stop [agent.service]"
Oct 30 14:20:37 localhost.localdomain systemd[1]: Stopping agent.service...
However, by this point in time, IPA heartbeats have been failing for some time with the error:
Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent [-] error sending heartbeat to ['https://10.9.49.125:6385']: ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent Traceback (most recent call last):
Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 148, in do_heartbeat
Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent     self.api.heartbeat(
Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py", line 200, in heartbeat
Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent     raise errors.HeartbeatError(error)
Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent
Checking the BMH object shows that it has not been disconnected from ironic, and that its provisioning ID differs from the node ID that IPA is looking up in ironic (6f7546a2-f49e-4d8d-88f6-a462d53868b6). This suggests that ironic was restarted with a fresh database and a new node was created with a fresh ID.
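A hedged sketch of that check, reading the BMH's status.provisioning.ID via the kubernetes Python client and comparing it against the node UUID from the heartbeat errors; the namespace and BMH name are placeholders for the affected host:

from kubernetes import client, config

config.load_kube_config()
crds = client.CustomObjectsApi()

# Namespace and BMH name are placeholders for the affected host.
bmh = crds.get_namespaced_custom_object(
    group="metal3.io", version="v1alpha1",
    namespace="my-cluster", plural="baremetalhosts", name="my-host",
)
provisioning_id = bmh["status"]["provisioning"]["ID"]

# Node UUID that IPA is heartbeating to, taken from the journal above.
ipa_node_id = "6f7546a2-f49e-4d8d-88f6-a462d53868b6"

# A mismatch means ironic re-registered the host under a new node ID,
# i.e. it came back with a fresh database.
print("node ID mismatch" if provisioning_id != ipa_node_id else "node IDs match")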
Just after the installation succeeds, we also get some heartbeat errors of the form:
Oct 30 14:20:53 localhost.localdomain podman[3035]: 2024-10-30 14:20:53.707 1 ERROR ironic_python_agent.agent [-] error sending heartbeat to ['https://10.9.49.125:6385']: ironic_python_agent.errors.HeartbeatConnectionError: Error attempting to heartbeat - Possible transitory network failure or blocking port may be present.: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
This continues to happen intermittently, but mostly the logs are full of the "node could not be found" errors.
Expected results:
After the custom deploy method successfully completes, IPA should inform ironic of this:
Oct 30 14:20:48 localhost.localdomain podman[3035]: 2024-10-30 14:20:47.939 1 DEBUG ironic_coreos_install [-] Assisted Agent is in state inactive (result success) _is_assisted_running /usr/lib/python3.9/site-packages/ironic_coreos_install.py:136
Oct 30 14:20:48 localhost.localdomain podman[3035]: 2024-10-30 14:20:47.947 1 INFO ironic_coreos_install [-] Succesfully installed using the assisted agent
Oct 30 14:20:48 localhost.localdomain podman[3035]: 2024-10-30 14:20:47.951 1 INFO root [-] Deploy step completed: {'args': {}, 'interface': 'deploy', 'priority': 80, 'step': 'start_assisted_install'}, result: None
Oct 30 14:20:48 localhost.localdomain podman[3035]: 2024-10-30 14:20:47.951 1 INFO root [-] Asynchronous command execute_deploy_step completed: {'deploy_result': None, 'deploy_step': {'args': {}, 'interface': 'deploy', 'priority': 80, 'step': 'start_assisted_install'}}
Once ironic knows that deployment is complete, it will reboot the host, which will come up as a Node and join the cluster, and the BMH provisioning state will change to provisioned.
This would presumably work if we either restarted IPA or looked up the node ID again when heartbeating fails. (We may want to limit this behaviour to Metal³, since changes of node ID are not expected in regular OpenStack deployments.)
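A minimal sketch of the second option, purely illustrative and not the actual ironic-python-agent code; heartbeat(), lookup_node(), and the exception type are hypothetical stand-ins for the real IPA internals:

import time

class HeartbeatNotFoundError(Exception):
    """Hypothetical stand-in for a heartbeat failing with HTTP 404."""

def heartbeat_loop(api, hardware_info, node_id, interval=30):
    while True:
        try:
            # Hypothetical heartbeat call against the ironic agent API.
            api.heartbeat(node_id)
        except HeartbeatNotFoundError:
            # The node ID is no longer known to ironic, most likely because
            # ironic restarted with a fresh database. Re-run the lookup to
            # obtain the new node ID instead of failing forever. (As noted
            # above, this could be gated to Metal3 deployments.)
            node_id = api.lookup_node(hardware_info)
        time.sleep(interval)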
Additional info:
Note that when an ironic node is in the deploying state, Metal³ cannot even attempt to deprovision the host (which is also required for deleting the BMH); it must wait for the state to change to active (which won't happen when this bug is triggered) or deploy-wait (which doesn't seem to happen either). So this bug is a possible trigger of other reported bugs where deprovisioning never completes, such as OCPBUGS-38253.
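A sketch of that constraint, with state names following the ironic provisioning state machine as described above; the helper itself is hypothetical:

# States from which metal3 can start deprovisioning. While the node stays
# in "deploying" (as in this bug) neither state is ever reached, so BMH
# deletion hangs, as in OCPBUGS-38253.
DEPROVISIONABLE_STATES = {"active", "deploy-wait"}

def can_start_deprovision(provision_state: str) -> bool:
    return provision_state in DEPROVISIONABLE_STATES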
It seems likely that the same issue would also affect regular (non-assisted) baremetal deployments that use the install_coreos custom deploy method.
Relates to:
- OCPBUGS-38253: BMH is not deprovisioning and unable to re-provision hosts (New)
- OCPBUGS-37796: Some BMH resources stuck in "deprovisioning" or "deleting" state (ASSIGNED)