Bug
Resolution: Unresolved
Major
4.20
Quality / Stability / Reliability
Critical
x86_64
Dev
Rejected
Proposed
Known Issue
Description of problem:
When a BMC (iDRAC) firmware update on a Dell server is initiated through the BareMetalHost servicing state, the node incorrectly transitions to "service failed". The failure occurs because Ironic does not tolerate the temporary Redfish API downtime while the BMC reboots. When the BMC becomes unavailable, Ironic logs the primary error, a Redfish ConnectionError (ECONNREFUSED). The Redfish outage also causes secondary failures. For instance, Ironic's power-state wait loop times out because the conductor cannot poll the power status while the BMC is offline, which produces log entries such as "Timed out after 180 secs waiting for power off" followed by a LoopingCallTimeOut. These cascading timeouts further indicate that the transient unavailability is not handled correctly. Although the BMC firmware update itself ultimately succeeds, the overall servicing operation is marked as failed.
Version-Release number of selected component (if applicable):
4.20
How reproducible:
Perform a BIOS and BMC firmware update through servicing (see the steps below).
Steps to Reproduce:
1. Initial State: Ensure the BareMetalHost (BMH) is provisioned and active with a known baseline version of the BIOS and BMC firmware.
2. Upgrade: Start a servicing operation to upgrade the BIOS and BMC firmware to a newer version. Wait for the process to complete.
3. Downgrade: Once the upgrade is finished, start another servicing operation to downgrade the firmware back to the original baseline version.
4. Repeat: Repeat steps 2 and 3 a few times to confirm whether the failure occurs consistently during the upgrade/downgrade cycles.
Actual results:
The BareMetalHost (BMH) entered a service failed state and did not recover. Despite this error, the firmware update on the server was actually successful, as confirmed by checking the iDRAC interface directly. The HostFirmwareComponents status did not update to reflect the new firmware version running on the node.
Expected results:
The servicing state should anticipate and tolerate the expected, transient unavailability of the Redfish API during a BMC reboot. The operation should not fail due to connection errors but should instead wait for the BMC to come back online to confirm the final status of the firmware update.
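The expected behavior can be illustrated with a minimal sketch. This is not Ironic's actual code: `wait_for_bmc`, `BMCUnreachable`, and the `probe` callable are hypothetical names, the 600-second grace period and 10-second interval are assumed values, and Python's built-in `ConnectionError` stands in for whatever exception the Redfish client raises while the BMC is rebooting.

```python
import time


class BMCUnreachable(Exception):
    """Raised when the BMC stays unreachable past the grace period."""


def wait_for_bmc(probe, grace_period=600.0, interval=10.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it succeeds, tolerating connection errors.

    probe is a hypothetical callable that performs one Redfish request
    (for example, reading the firmware version) and raises ConnectionError
    (such as ConnectionRefusedError) while the BMC is down.
    Returns a tuple of (probe result, number of attempts made).
    """
    deadline = clock() + grace_period
    attempts = 0
    while True:
        attempts += 1
        try:
            return probe(), attempts
        except ConnectionError as exc:
            # Expected during a BMC reboot: fail only once the grace
            # period has elapsed, otherwise wait and retry.
            if clock() >= deadline:
                raise BMCUnreachable(
                    f"BMC still unreachable after {grace_period}s"
                ) from exc
            sleep(interval)
```

The point of the sketch is that a connection error inside the grace window is treated as "not ready yet" rather than as a failed step, and the final result is read only after the BMC answers again. The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.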
Additional info:
- relates to: OCPBUGS-60708 Day2 firmware update on HPE failed with NetworkAdapters error (POST)