OCPBUGS-44026

BaremetalHosts stuck in 'provisioning' state despite deploy_step success



      Description of problem:

      During a ZTP install (by rhn-gps-rspazzol), the BMHs remain stuck in the "provisioning" state while the Agent CRs show "Pending user interaction".
      It seems possible that the cause is the metal3 Pod restarting during deployment and coming back up with a fresh ironic database.

      Version-Release number of selected component (if applicable):

      Likely affects all extant versions of the metal platform when used with the ZTP converged flow.

      How reproducible:

      Reportedly happens regularly. The reproduction steps below have not been run repeatedly, but they are likely to reproduce the issue 100% of the time.

      Steps to Reproduce:

      1. Start provisioning a cluster with ZTP
      2. While one or more BMHs are in the "provisioning" state, delete the metal3 pod
      3. Observe the BMHs stuck in provisioning forever
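      Step 2 can be scripted. Below is a minimal sketch using the kubernetes Python client; the openshift-machine-api namespace and the "metal3-" pod name prefix are assumptions based on a standard OCP baremetal deployment and may need adjusting.

      # Hypothetical reproduction helper: while at least one BMH is in the
      # "provisioning" state, delete the metal3 pod to force a restart (and,
      # with it, a fresh ironic database).
      from kubernetes import client, config

      config.load_kube_config()
      core = client.CoreV1Api()
      custom = client.CustomObjectsApi()

      # Find BareMetalHosts currently in the "provisioning" state.
      bmhs = custom.list_cluster_custom_object(
          group="metal3.io", version="v1alpha1", plural="baremetalhosts")
      provisioning = [
          b["metadata"]["name"] for b in bmhs["items"]
          if b.get("status", {}).get("provisioning", {}).get("state") == "provisioning"
      ]
      print("BMHs in provisioning:", provisioning)

      # Delete the metal3 pod while provisioning is in progress.
      if provisioning:
          for pod in core.list_namespaced_pod("openshift-machine-api").items:
              if pod.metadata.name.startswith("metal3-"):
                  core.delete_namespaced_pod(pod.metadata.name, "openshift-machine-api")
                  print("Deleted", pod.metadata.name)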

      Actual results:

      The assisted-installer successfully finishes writing CoreOS to disk and stops the agent service, ready for the host to be rebooted:

      Oct 30 14:20:37 localhost.localdomain assisted-installer[45058]: time="2024-10-30T14:20:37Z" level=info msg="Updating node installation stage: Rebooting - Ironic will reboot the node shortly" request_id=9d003e7b-b2f1-49b2-8c72-f122c7b69a68
      Oct 30 14:20:37 localhost.localdomain assisted-installer[45058]: time="2024-10-30T14:20:37Z" level=info msg="Running systemctl stop [agent.service]"
      Oct 30 14:20:37 localhost.localdomain systemd[1]: Stopping agent.service...
      

      However, by this point IPA heartbeats have already been failing for some time with the following error:

      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent [-] error sending heartbeat to ['https://10.9.49.125:6385']: ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent Traceback (most recent call last):
      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 148, in do_heartbeat
      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent     self.api.heartbeat(
      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py", line 200, in heartbeat
      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent     raise errors.HeartbeatError(error)
      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
      Oct 30 14:19:21 localhost.localdomain podman[3035]: 2024-10-30 14:19:21.355 1 ERROR ironic_python_agent.agent 
      

      Checking the BMH object, it has not been disconnected from ironic, and it shows a different provisioning ID from the node ID that IPA is looking up in ironic (6f7546a2-f49e-4d8d-88f6-a462d53868b6). This suggests that ironic has been restarted with a fresh database and a new node created with a fresh ID.
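      A quick way to confirm the mismatch (hypothetical BMH name and namespace, again using the kubernetes Python client) is to compare the node ID recorded in the BMH status with the node ID from the IPA logs:

      # Compare the BMH's recorded ironic node ID with the ID IPA is trying
      # to heartbeat against; a mismatch indicates the ironic database was
      # recreated underneath the running deployment.
      from kubernetes import client, config

      config.load_kube_config()
      custom = client.CustomObjectsApi()

      bmh = custom.get_namespaced_custom_object(
          group="metal3.io", version="v1alpha1",
          namespace="example-cluster",                  # placeholder namespace
          plural="baremetalhosts", name="example-bmh")  # placeholder BMH name

      bmh_node_id = bmh.get("status", {}).get("provisioning", {}).get("ID")
      ipa_node_id = "6f7546a2-f49e-4d8d-88f6-a462d53868b6"  # from the IPA logs above

      print("BMH provisioning ID:", bmh_node_id)
      print("IPA node ID        :", ipa_node_id)
      print("IDs differ" if bmh_node_id != ipa_node_id else "IDs match")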

      Just after the installation succeeds, we also get some heartbeat errors of the form:

      Oct 30 14:20:53 localhost.localdomain podman[3035]: 2024-10-30 14:20:53.707 1 ERROR ironic_python_agent.agent [-] error sending heartbeat to ['https://10.9.49.125:6385']: ironic_python_agent.errors.HeartbeatConnectionError: Error attempting to heartbeat - Possible transitory network failure or blocking port may be present.: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
      

      This continues to happen intermittently, but mostly the logs are full of the "node could not be found" errors.

      Expected results:

      After the custom deploy method successfully completes, IPA should inform ironic of this:

      Oct 30 14:20:48 localhost.localdomain podman[3035]: 2024-10-30 14:20:47.939 1 DEBUG ironic_coreos_install [-] Assisted Agent is in state inactive (result success) _is_assisted_running /usr/lib/python3.9/site-packages/ironic_coreos_install.py:136
      Oct 30 14:20:48 localhost.localdomain podman[3035]: 2024-10-30 14:20:47.947 1 INFO ironic_coreos_install [-] Succesfully installed using the assisted agent
      Oct 30 14:20:48 localhost.localdomain podman[3035]: 2024-10-30 14:20:47.951 1 INFO root [-] Deploy step completed: {'args': {}, 'interface': 'deploy', 'priority': 80, 'step': 'start_assisted_install'}, result: None
      Oct 30 14:20:48 localhost.localdomain podman[3035]: 2024-10-30 14:20:47.951 1 INFO root [-] Asynchronous command execute_deploy_step completed: {'deploy_result': None, 'deploy_step': {'args': {}, 'interface': 'deploy', 'priority': 80, 'step': 'start_assisted_install'}}
      

      Once ironic knows that deployment is complete, it will reboot the host, which will come up as a Node and join the cluster, and the BMH provisioning state will change to provisioned.

      This would presumably work if we either restarted IPA or looked up the node ID again when heartbeating fails. (We may want to limit this behaviour to Metal³, since changes of node ID are not expected in regular OpenStack deployments.)
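      As a rough illustration of the second option (a sketch only, not IPA's actual heartbeat code; heartbeat(), lookup_node() and NodeNotFound are hypothetical stand-ins for the corresponding IPA client pieces):

      import time

      class NodeNotFound(Exception):
          """Stand-in for the 404 'node could not be found' heartbeat error."""

      def heartbeat_loop(api, node_uuid: str, interval: float = 30.0) -> None:
          """Heartbeat forever, re-looking-up the node if its UUID disappears.

          `api` is assumed to expose heartbeat(node_uuid) and lookup_node(),
          loosely mirroring (but not identical to) IPA's ironic API client.
          """
          while True:
              try:
                  api.heartbeat(node_uuid)
              except NodeNotFound:
                  # Ironic came back with a fresh database, so the node was
                  # re-created under a new UUID; look it up again and carry on.
                  node_uuid = api.lookup_node()
              time.sleep(interval)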

      Additional info:

      Note that when an ironic node is in the deploying state, Metal³ cannot even attempt to deprovision the host (which is also required for deleting the BMH); it must wait for the state to change to active (which won't happen when this bug is triggered) or deploy-wait (which doesn't seem to happen either). So this bug is a possible trigger of other reported bugs where deprovisioning never completes, such as OCPBUGS-38253.

      It seems likely that the same issue would also affect regular (non-assisted) baremetal deployments that use the install_coreos custom deploy method.
