OCPBUGS-61871

BMH servicing fails when iDRAC reboots during BMC firmware update


    • Quality / Stability / Reliability
    • False
    • 2
    • Critical
    • No
    • x86_64
    • Dev
    • None
    • None
    • Rejected
    • None
    • Proposed
    • Known Issue
      When the BMC firmware on a Dell server is updated, the Redfish API is temporarily unavailable. Ironic treats this disruption as a failure and marks the update as failed. To work around this problem, update the BMC firmware manually, outside Ironic.

      Description of problem:

      When a Dell server's BMC (iDRAC) firmware update is initiated through the BareMetalHost servicing state, the node incorrectly transitions to service failed. This failure happens because Ironic does not tolerate the temporary Redfish API downtime that occurs when the BMC reboots. As the BMC becomes unavailable, Ironic logs a primary Redfish ConnectionError (ECONNREFUSED).
      
      This Redfish outage causes secondary failures as well. For instance, Ironic's power-state wait loop times out because the conductor cannot poll the power status while the BMC is offline. This results in log entries such as "Timed out after 180 secs waiting for power off", followed by a LoopingCallTimeOut. These cascading timeouts confirm that Ironic does not correctly handle the transient unavailability. Although the BMC firmware update itself ultimately succeeds, the overall servicing operation is marked as a failure.
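
When triaging conductor logs for this issue, it can help to separate the primary connection error from the secondary timeouts it triggers. The following is a minimal sketch that matches only on the substrings quoted in this report. The function names `classify` and `summarize` are hypothetical, and real Ironic log lines carry more context (timestamps, node UUIDs) than the fragments used here.

```python
import re

# Markers taken from the description above; nothing else is assumed
# about the exact Ironic log format.
PRIMARY_MARKERS = ("ConnectionError", "ECONNREFUSED")
SECONDARY_PATTERNS = (
    re.compile(r"Timed out after \d+ secs waiting for power off"),
    re.compile(r"LoopingCallTimeOut"),
)


def classify(line):
    """Return 'primary', 'secondary', or 'other' for one log line."""
    if any(marker in line for marker in PRIMARY_MARKERS):
        return "primary"
    if any(pattern.search(line) for pattern in SECONDARY_PATTERNS):
        return "secondary"
    return "other"


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = {"primary": 0, "secondary": 0, "other": 0}
    for line in lines:
        counts[classify(line)] += 1
    return counts
```

A log excerpt in which secondary entries appear only after the first primary entry is consistent with the cascade described above.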

      Version-Release number of selected component (if applicable):

          4.20

      How reproducible:

          Perform BIOS and BMC firmware update

      Steps to Reproduce:

      1. Initial State: Ensure the BareMetalHost (BMH) is provisioned and active with a known baseline version of the BIOS and BMC firmware.
      2. Upgrade: Start a servicing operation to perform a firmware upgrade to a newer version of the BIOS and BMC. Wait for the process to complete.
      3. Downgrade: Once the upgrade is finished, start another servicing operation to downgrade the firmware back to the original baseline version.
      4. Repeat: Repeat steps 2 and 3 a few times to confirm whether the failure occurs consistently during the upgrade/downgrade cycles.

      Actual results:

      The BareMetalHost (BMH) entered a service failed state and did not recover. Despite this error, the firmware update on the server was actually successful, as confirmed by checking the iDRAC interface directly. The HostFirmwareComponents status did not update to reflect the new firmware version running on the node.  
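
The mismatch described above can be checked by comparing the version the BMC reports with the version recorded for the host. The following is a minimal sketch, assuming the BMC's Redfish Manager resource has already been fetched and decoded into a dictionary. `FirmwareVersion` is a standard property of the Redfish Manager schema; the helper names and the version strings in the usage example are hypothetical.

```python
def bmc_firmware_version(manager_resource):
    """Extract FirmwareVersion from a decoded Redfish Manager resource."""
    version = manager_resource.get("FirmwareVersion")
    return version.strip() if isinstance(version, str) else None


def status_is_stale(manager_resource, reported_version):
    """Return True when the BMC reports a version that differs from the
    version recorded for the host (for example, in HostFirmwareComponents).
    """
    actual = bmc_firmware_version(manager_resource)
    if actual is None:
        # Without a version from the BMC there is nothing to compare.
        return False
    return actual != reported_version.strip()


# Usage with made-up version strings:
manager = {"Id": "example-manager", "FirmwareVersion": "2.0.0"}
stale = status_is_stale(manager, "1.0.0")  # BMC is ahead of the recorded status
```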

      Expected results:

      The servicing state should anticipate and tolerate the expected, transient unavailability of the Redfish API during a BMC reboot. The operation should not fail due to connection errors but should instead wait for the BMC to come back online to confirm the final status of the firmware update. 
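
The expected behavior can be illustrated with a generic wait loop. This is a sketch of the idea only, not Ironic's implementation. The name `wait_for_bmc`, the `probe` callable, and the default timeout and interval are assumptions. Python's built-in `ConnectionError` stands in for the Redfish client's own connection exception.

```python
import time


def wait_for_bmc(probe, timeout=900.0, interval=10.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it succeeds or the timeout elapses.

    Connection errors (including ConnectionRefusedError) are treated as
    transient while the BMC reboots. Any other exception propagates
    immediately. The clock and sleep parameters are injectable so the
    loop can be exercised without real delays.
    """
    deadline = clock() + timeout
    while True:
        try:
            return probe()
        except ConnectionError as exc:
            if clock() >= deadline:
                raise TimeoutError(
                    f"BMC still unreachable after {timeout} s"
                ) from exc
            sleep(interval)
```

In this sketch, `probe` would be a function that queries the BMC (for example, reading its firmware version) and raises a connection error while the Redfish API is down. Once it returns, the caller can compare the reported version with the target version to decide whether the update succeeded.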

      Additional info:

        

              janders@redhat.com Jacob Anders
              tali@redhat.com Tao Liu
              Jad Haj Yahya
              Lluis Cavalle