-
Bug
-
Resolution: Duplicate
-
Normal
-
None
-
4.12
-
Low
-
None
-
False
-
-
-
Description of problem:
Deleting a provisioned bmh while replacing a master node gets stuck during deprovisioning.
Version-Release number of selected component (if applicable):
4.12.0-rc.2
How reproducible:
Once so far, but this has also been reported in BZ#2077059.
Steps to Reproduce:
1. Follow the steps for replacing an unhealthy etcd member, steps 1 and 2:
   https://docs.openshift.com/container-platform/4.11/backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member

The next step was to replace the bare metal control plane node:

1. Follow https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-expanding-the-cluster.html#replacing-a-bare-metal-control-plane-node_ipi-install-expanding
2. Step 2: Remove the old BareMetalHost and Machine objects:

   oc delete bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com
   baremetalhost.metal3.io "master-2.kni-qe-31.lab.eng.rdu2.redhat.com" deleted
   ^^ hangs here

Two known bugs have been reported for this and are still unresolved:
https://bugzilla.redhat.com/show_bug.cgi?id=2077059
https://bugzilla.redhat.com/show_bug.cgi?id=2087213

The workaround is to delete the finalizer. This seems to be a very common issue and we should address it, especially as it is seen in our Nokia System Test BM.

[kni@registry.kni-qe-31 MASTER_NODE_REPLACEMENT]$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-rc.2   True        False         2d4h    Error while reconciling 4.12.0-rc.2: the cluster operator etcd is degrade

oc delete bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com
baremetalhost.metal3.io "master-2.kni-qe-31.lab.eng.rdu2.redhat.com" deleted
** Gets stuck ^^

oc edit bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com
REMOVE finalizer "baremetalhost.metal3.io"

# we can now see it removed
oc get bmh -n openshift-machine-api
NAME                                         STATE       CONSUMER                   ONLINE   ERROR   AGE
master-0.kni-qe-31.lab.eng.rdu2.redhat.com   unmanaged   kni-qe-31-dsbn7-master-0   true             2d5h
master-1.kni-qe-31.lab.eng.rdu2.redhat.com   unmanaged   kni-qe-31-dsbn7-master-1   true             2d5h

I am not sure which logs you want, but the image-customization pod was mentioned in previous bugs.
Please let me know what logs you need.

cat metal3-image-customization-8d6986878-5gjpt.log
{"level":"info","ts":1670423759.2775626,"logger":"setup","msg":"Go Version: go1.19.2"}
{"level":"info","ts":1670423759.277595,"logger":"setup","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1670423759.2775996,"logger":"setup","msg":"Git commit: unknown"}
{"level":"info","ts":1670423759.2776031,"logger":"setup","msg":"Build time: unknown"}
{"level":"info","ts":1670423759.277605,"logger":"setup","msg":"Component: openshift/image-customization-controller was not built with version info"}
I1207 14:36:00.328619 1 request.go:665] Waited for 1.041999052s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
{"level":"info","ts":1670423762.1321034,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1670423762.132259,"logger":"setup","msg":"starting manager"}
{"level":"info","ts":1670423762.132457,"logger":"controller.preprovisioningimage","msg":"Starting EventSource","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage","source":"kind source: /, Kind="}
{"level":"info","ts":1670423762.1324975,"logger":"controller.preprovisioningimage","msg":"Starting EventSource","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage","source":"kind source: /, Kind="}
{"level":"info","ts":1670423762.1324658,"msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1670423762.1325104,"logger":"controller.preprovisioningimage","msg":"Starting Controller","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage"}
{"level":"info","ts":1670423762.2337852,"logger":"controller.preprovisioningimage","msg":"Starting workers","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage","worker count":1}
Actual results:
The oc delete command hangs and the bmh is not deleted.
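The finalizer workaround mentioned in this report was done by hand with oc edit. A non-interactive equivalent might look like the sketch below. It assumes that clearing metadata.finalizers with a merge patch has the same effect as removing the "baremetalhost.metal3.io" entry in the editor; the helper name bmh_clear_finalizers_cmd is hypothetical. Note that this clears every finalizer on the object and skips the cleanup the finalizer exists to guarantee, so the node may remain registered in ironic.

```shell
#!/usr/bin/env bash
# Sketch only: build the command that clears the finalizers on a
# BareMetalHost stuck in deletion.
# Assumption: a merge patch setting metadata.finalizers to null is
# equivalent to deleting the finalizer entry by hand in `oc edit`.

# Hypothetical helper. It only prints the command so it can be reviewed
# before being run; it does not contact the cluster.
bmh_clear_finalizers_cmd() {
  local namespace="$1" host="$2"
  printf "oc patch bmh -n %s %s --type=merge -p '%s'\n" \
    "$namespace" "$host" '{"metadata":{"finalizers":null}}'
}

# Example (prints the command only):
#   bmh_clear_finalizers_cmd openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com
```

Printing the command instead of executing it is a deliberate choice here, because removing a finalizer bypasses deprovisioning and should be a conscious step.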
Expected results:
The bmh is deleted as described in the user documentation.
Additional info:
I was deleting again today (12/8/22) and it appears to be stuck again. I ran some ironic commands and they showed Provisioning State = clean failed.

baremetal node list
+--------------------------------------+------------------------------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name                                                             | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------------------------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
| 6542c987-8efa-40d7-8936-c34d6b2ce437 | openshift-machine-api~master-2.kni-qe-31.lab.eng.rdu2.redhat.com | 41316c94-f551-44b1-978e-6ca1b6345950 | power off   | clean failed       | False       |
+--------------------------------------+------------------------------------------------------------------+--------------------------------------+-------------+--------------------+-------------+

baremetal node show 6542c987-8efa-40d7-8936-c34d6b2ce437 | grep Timeout
| last_error | Timeout reached while cleaning the node. Please check if the ramdisk responsible for the cleaning is running on the node. Failed on step {}.
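To make the ironic check above easier to repeat, here is a small sketch that filters the table printed by baremetal node list down to the nodes in "clean failed". It assumes the six-column layout shown in this report (UUID, Name, Instance UUID, Power State, Provisioning State, Maintenance); the function name clean_failed_nodes is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: read `baremetal node list` table output on stdin and print
# "UUID NAME" for each node whose Provisioning State is "clean failed".
# Assumption: the table has the six columns shown in this report, so the
# Provisioning State is the 6th '|'-separated field (field 1 is empty).
clean_failed_nodes() {
  awk -F'|' '
    NF >= 7 {
      state = $6
      gsub(/^[ \t]+|[ \t]+$/, "", state)   # trim padding around the state
      if (state == "clean failed") {
        uuid = $2; name = $3
        gsub(/[ \t]/, "", uuid)            # UUID and name contain no spaces
        gsub(/[ \t]/, "", name)
        print uuid, name
      }
    }'
}

# Example:
#   baremetal node list | clean_failed_nodes
```

Border lines of the table contain no '|' characters and the header row has a different state value, so both are skipped without special handling.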