OpenShift Bugs / OCPBUGS-4625

Deleting a provisioned BMH when replacing a master node gets stuck during deprovisioning


      12/21: per Ian M, not quite a dup of OCPBUGS-3029; it's low severity, not release gating
      12/20: not gating 4.12, but needs a release note; Ian M will check if this can be closed as a duplicate of OCPBUGS-3029

      Description of problem:

      Deleting a provisioned BMH when replacing a master node gets stuck during deprovisioning.

      Version-Release number of selected component (if applicable):

      4.12.0-rc.2

      How reproducible:

      One time so far, but this has also been reported in BZ#2077059.

      Steps to Reproduce:

      1. Follow the steps for replacing an unhealthy etcd member:
      
      https://docs.openshift.com/container-platform/4.11/backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member
      
      Steps 1 and 2 (a rough sketch of the member-removal part follows below).
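      
      For reference, a rough sketch of the member-removal part of that procedure, as I understand it (the etcd pod name and member ID below are placeholders, not values from this cluster):
      
      # open a shell in a healthy etcd pod
      oc rsh -n openshift-etcd etcd-<healthy-master-node>
      # list the members and note the ID of the unhealthy one
      etcdctl member list -w table
      # remove the failed member by its ID
      etcdctl member remove <member-id>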
      
      This was the next step, to replace the bare-metal control plane node:
      
      1. Follow https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-expanding-the-cluster.html#replacing-a-bare-metal-control-plane-node_ipi-install-expanding
      2. Step 2: Remove the old BareMetalHost and Machine objects: 
      oc delete bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com
      baremetalhost.metal3.io "master-2.kni-qe-31.lab.eng.rdu2.redhat.com" deleted
      
      ^^ gets hung (see the check sketched below)
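      
      To confirm the host is actually stuck in deletion rather than just slow, the check below should show a deletionTimestamp that never clears, the metal3 finalizer still present, and the host sitting in a deprovisioning state (sketch only; fields are from the metal3 BareMetalHost API):
      
      oc get bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com \
        -o jsonpath='{.metadata.deletionTimestamp}{" "}{.metadata.finalizers}{" "}{.status.provisioning.state}{"\n"}'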
      
      There are two known bugs reported for this that still have not resolved it:
      https://bugzilla.redhat.com/show_bug.cgi?id=2077059
      https://bugzilla.redhat.com/show_bug.cgi?id=2087213
      
      The workaround is to delete the finalizer. This seems to be a very common issue and we should address it; it is especially seen in our Nokia System Test BM environment.
      
      [kni@registry.kni-qe-31 MASTER_NODE_REPLACEMENT]$ oc get clusterversion
      NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.0-rc.2   True        False         2d4h    Error while reconciling 4.12.0-rc.2: the cluster operator etcd is degrade
      
      oc delete bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com
      baremetalhost.metal3.io "master-2.kni-qe-31.lab.eng.rdu2.redhat.com" deleted    <-- gets stuck here
      
      # workaround: edit the BMH and remove the "baremetalhost.metal3.io" finalizer; the object is then removed
      oc edit bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com
      
      oc get bmh -n openshift-machine-api
      NAME                                         STATE       CONSUMER                   ONLINE   ERROR   AGE
      master-0.kni-qe-31.lab.eng.rdu2.redhat.com   unmanaged   kni-qe-31-dsbn7-master-0   true             2d5h
      master-1.kni-qe-31.lab.eng.rdu2.redhat.com   unmanaged   kni-qe-31-dsbn7-master-1   true             2d5h
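      
      For completeness, the same workaround can be applied non-interactively instead of using oc edit; this is only a sketch of the finalizer removal described above, not a fix, and it abandons Ironic's cleanup of the host:
      
      # clear the finalizers so the stuck BareMetalHost object can finally be removed
      oc patch bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com \
        --type=merge -p '{"metadata":{"finalizers":null}}'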
      
      I am not sure which logs you want, but the image-customization controller was mentioned in previous bugs. Please let me know what logs you need.
      
      
      cat metal3-image-customization-8d6986878-5gjpt.log
      {"level":"info","ts":1670423759.2775626,"logger":"setup","msg":"Go Version: go1.19.2"}
      {"level":"info","ts":1670423759.277595,"logger":"setup","msg":"Go OS/Arch: linux/amd64"}
      {"level":"info","ts":1670423759.2775996,"logger":"setup","msg":"Git commit: unknown"}
      {"level":"info","ts":1670423759.2776031,"logger":"setup","msg":"Build time: unknown"}
      {"level":"info","ts":1670423759.277605,"logger":"setup","msg":"Component: openshift/image-customization-controller was not built with version info"}
      I1207 14:36:00.328619       1 request.go:665] Waited for 1.041999052s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
      {"level":"info","ts":1670423762.1321034,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
      {"level":"info","ts":1670423762.132259,"logger":"setup","msg":"starting manager"}
      {"level":"info","ts":1670423762.132457,"logger":"controller.preprovisioningimage","msg":"Starting EventSource","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage","source":"kind source: /, Kind="}
      {"level":"info","ts":1670423762.1324975,"logger":"controller.preprovisioningimage","msg":"Starting EventSource","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage","source":"kind source: /, Kind="}
      {"level":"info","ts":1670423762.1324658,"msg":"starting metrics server","path":"/metrics"}
      {"level":"info","ts":1670423762.1325104,"logger":"controller.preprovisioningimage","msg":"Starting Controller","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage"}
      {"level":"info","ts":1670423762.2337852,"logger":"controller.preprovisioningimage","msg":"Starting workers","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage","worker count":1}
      
      

      Actual results:

      The bmh delete hangs and the BareMetalHost is not removed.

      Expected results:

      The BareMetalHost is deleted as described in the user documentation.

      Additional info:

      I was deleting again today (12/8/22) and it appears stuck again. Looking at the Ironic node with the baremetal CLI, the Provisioning State shows "clean failed".
      
      baremetal node list
      +--------------------------------------+------------------------------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
      | UUID                                 | Name                                                             | Instance UUID                        | Power State | Provisioning State | Maintenance |
      +--------------------------------------+------------------------------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
      | 6542c987-8efa-40d7-8936-c34d6b2ce437 | openshift-machine-api~master-2.kni-qe-31.lab.eng.rdu2.redhat.com | 41316c94-f551-44b1-978e-6ca1b6345950 | power off   | clean failed       | False       |
      +--------------------------------------+------------------------------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
      
      
      baremetal node show 6542c987-8efa-40d7-8936-c34d6b2ce437 | grep Timeout
      | last_error             | Timeout reached while cleaning the node. Please check if the ramdisk responsible for the cleaning is running on the node. Failed on step {}.
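      
      If it helps with triage, the full error and state can be dumped from the same node record in JSON; this assumes the baremetal CLI and jq are available where the commands above were run:
      
      baremetal node show 6542c987-8efa-40d7-8936-c34d6b2ce437 -f json \
        | jq '{provision_state, last_error, maintenance, power_state}'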
       

       
