OCPBUGS-74987

BMO cannot abort inspection when deletion is requested

    • Important
    • Metal Platform 283
    • Done
    • Bug Fix
      Before this update, cluster deletion got stuck when it was requested during the inspection phase, because the host moved to a power-off stage that cannot complete during inspection. As a consequence, the cluster was not deleted. With this release, the BareMetalHost (BMH) no longer gets stuck during deletion in a ZTP environment. As a result, cluster removal is no longer blocked during the inspection phase, which improves the efficiency of the ZTP environment. (link:https://issues.redhat.com/browse/OCPBUGS-74987[OCPBUGS-74987])

      This is a clone of issue OCPBUGS-68369. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-65571. The following is the description of the original issue:

      Description of problem:

      Downstream bug for upstream bug: https://github.com/metal3-io/baremetal-operator/issues/2478 (by rhn-engineering-dtantsur)

      When you try to remove a cluster during the inspection phase, the host moves to the power-off stage, which cannot complete while inspection is in progress:

      {"level":"info","ts":1762939414.235443,"logger":"controllers.BareMetalHost","msg":"host ready to be powered off","baremetalhost":{"name":"vsno5","namespace":"vsno5"},"provisioningState":"powering off before delete"}
      {"level":"info","ts":1762939414.2354486,"logger":"provisioner.ironic","msg":"ensuring host is powered off (mode: hard)","host":"vsno5~vsno5"}
      {"level":"info","ts":1762939414.244772,"logger":"provisioner.ironic","msg":"changing power state","host":"vsno5~vsno5"}
      {"level":"info","ts":1762939414.244786,"logger":"provisioner.ironic","msg":"host in state that does not allow power change, try again after delay","host":"vsno5~vsno5","state":"inspect wait","target state":"manageable"} 
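One way to spot this condition in a BMO log is to filter for the retry message shown above. The following is a minimal Python sketch, assuming the log is available as one JSON object per line. The function name `blocked_power_changes` is hypothetical and is not part of BMO or Ironic; the message text and field names are copied from the excerpt.

```python
import json

# Message logged by the Ironic provisioner when a power change is refused
# (copied verbatim from the log excerpt above).
BLOCKED_MSG = "host in state that does not allow power change, try again after delay"


def blocked_power_changes(lines):
    """Return (host, state, target state) for every refused power change."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue  # valid JSON, but not a log object
        if entry.get("msg") == BLOCKED_MSG:
            hits.append((entry.get("host"), entry.get("state"), entry.get("target state")))
    return hits


sample = [
    '{"level":"info","ts":1762939414.244772,"logger":"provisioner.ironic","msg":"changing power state","host":"vsno5~vsno5"}',
    '{"level":"info","ts":1762939414.244786,"logger":"provisioner.ironic","msg":"host in state that does not allow power change, try again after delay","host":"vsno5~vsno5","state":"inspect wait","target state":"manageable"}',
]
print(blocked_power_changes(sample))
```

A hit whose state is `inspect wait` while the host's provisioning state is `powering off before delete` matches the situation described in this report.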

      This can happen if you create and remove a cluster in quick succession. It can also happen when the inspection phase never finishes, for example because of a misconfiguration.

      Deletion of the BMH object stays stuck in the deleting phase until the power-off has been attempted 3 times:

      {"level":"info","ts":1762941220.183747,"logger":"provisioner.ironic","msg":"power off error","host":"vsno5~vsno5","msg":"timeout reached while inspecting the node"}
      {"level":"info","ts":1762941220.1837687,"logger":"controllers.BareMetalHost","msg":"Giving up on host power off after 3 attempts.","baremetalhost":{"name":"vsno5","namespace":"vsno5"},"provisioningState":"powering off before delete"}
      {"level":"info","ts":1762941220.183773,"logger":"controllers.BareMetalHost","msg":"changing provisioning state","baremetalhost":{"name":"vsno5","namespace":"vsno5"},"provisioningState":"powering off before delete","old":"powering off before delete","new":"deleting"}
      {"level":"info","ts":1762941220.1837842,"logger":"controllers.BareMetalHost","msg":"saving host status","baremetalhost":{"name":"vsno5","namespace":"vsno5"},"provisioningState":"powering off before delete","operational status":"error","provisioning state":"deleting"}
      {"level":"info","ts":1762941220.1923869,"logger":"controllers.BareMetalHost","msg":"publishing event","baremetalhost":{"name":"vsno5","namespace":"vsno5"},"reason":"PowerManagementError","message":"timeout reached while inspecting the node"} 
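The `ts` fields give a rough idea of how long the deletion was blocked in this run. The sketch below subtracts the timestamp of the "host ready to be powered off" entry from that of the "Giving up" entry. Both values are copied from the excerpts above; treating those two entries as the start and end of the stall is an assumption, since the full log is not included here.

```python
# Timestamps (Unix seconds) copied from the two log excerpts.
first_power_off_attempt = 1762939414.235443  # "host ready to be powered off"
gave_up = 1762941220.1837687                 # "Giving up on host power off after 3 attempts."

# Time the host spent in "powering off before delete" between those entries.
stalled_seconds = gave_up - first_power_off_attempt
print(f"stalled for about {stalled_seconds / 60:.1f} minutes")
```

Under that assumption, the host spent roughly half an hour in `powering off before delete` before moving on to `deleting`.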

      In a ZTP environment, this timeout causes other controllers (such as the siteconfig controller) to abort the deletion, so the cluster is not removed after the timeout.
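The behaviour the title asks for can be illustrated with a small decision function. This is a hypothetical sketch, not the BMO implementation, which is not shown in this report. The names `next_delete_step`, `ABORT_INSPECTION`, and `POWER_OFF` are invented for illustration. The sketch encodes only what the logs show: a power change is refused while the node is in `inspect wait`.

```python
# Hypothetical action names, not BMO identifiers.
ABORT_INSPECTION = "abort inspection"
POWER_OFF = "power off"

# Ironic state seen in the logs in which the power change was refused.
INSPECTION_IN_PROGRESS = {"inspect wait"}


def next_delete_step(ironic_state: str) -> str:
    """Pick the first step to take when deletion of a host is requested."""
    if ironic_state in INSPECTION_IN_PROGRESS:
        # A power-off would be refused in this state, so stop the inspection first.
        return ABORT_INSPECTION
    return POWER_OFF


print(next_delete_step("inspect wait"))
print(next_delete_step("manageable"))
```

Aborting the inspection first lets the deletion proceed without waiting for the inspection timeout, which is the delay that causes other controllers to give up.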

       

              Riccardo Pittau (rpittau@redhat.com)
              Jose Gato Luis (jgato@redhat.com)
              Steeve Goveas
              Lluis Cavalle