Bug | Resolution: Duplicate | Normal | 4.12 | Quality / Stability / Reliability | Low
Description of problem:
Deleting a provisioned BMH while replacing a master node gets stuck during deprovisioning.
Version-Release number of selected component (if applicable):
4.12.0-rc.2
How reproducible:
Once so far, but the same issue has been reported in BZ#2077059.
Steps to Reproduce:
1. Follow steps 1 and 2 of "Replacing an unhealthy etcd member":
https://docs.openshift.com/container-platform/4.11/backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member
2. Next, follow the procedure for replacing a bare-metal control plane node:
https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-expanding-the-cluster.html#replacing-a-bare-metal-control-plane-node_ipi-install-expanding
3. At step 2 of that procedure, remove the old BareMetalHost and Machine objects:
oc delete bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com
baremetalhost.metal3.io "master-2.kni-qe-31.lab.eng.rdu2.redhat.com" deleted
^^ The delete hangs here; the BMH never finishes deprovisioning. A quick check is sketched below.
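A minimal way to confirm the stuck state (standard oc jsonpath usage; the field paths come from the BareMetalHost CRD, so treat the exact values as an assumption for this cluster):

oc get bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com \
  -o jsonpath='{.metadata.deletionTimestamp}{" "}{.metadata.finalizers}{" "}{.status.provisioning.state}{"\n"}'
# A set deletionTimestamp plus a remaining baremetalhost.metal3.io finalizer,
# with the provisioning state sitting in "deprovisioning", indicates the hang.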
Two known bugs have been reported for this, and neither has resolved it:
https://bugzilla.redhat.com/show_bug.cgi?id=2077059
https://bugzilla.redhat.com/show_bug.cgi?id=2087213
The workaround is to delete the finalizer. This seems to be a very common issue and we should address it,
especially since it keeps showing up in our Nokia System Test BM environment:
[kni@registry.kni-qe-31 MASTER_NODE_REPLACEMENT]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.12.0-rc.2 True False 2d4h Error while reconciling 4.12.0-rc.2: the cluster operator etcd is degraded
oc delete bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com
baremetalhost.metal3.io "master-2.kni-qe-31.lab.eng.rdu2.redhat.com" deleted
^^ gets stuck here
oc edit bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com
# remove the "baremetalhost.metal3.io" finalizer; the BMH is then removed:
oc get bmh -n openshift-machine-api
NAME STATE CONSUMER ONLINE ERROR AGE
master-0.kni-qe-31.lab.eng.rdu2.redhat.com unmanaged kni-qe-31-dsbn7-master-0 true 2d5h
master-1.kni-qe-31.lab.eng.rdu2.redhat.com unmanaged kni-qe-31-dsbn7-master-1 true 2d5h
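The same workaround can be applied non-interactively. This is a sketch using standard oc patch merge semantics rather than anything from this report, and clearing finalizers wholesale assumes baremetalhost.metal3.io is the only finalizer left on the object:

oc patch bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com \
  --type=merge -p '{"metadata":{"finalizers":null}}'
# With the finalizer gone, the pending delete completes immediately.
# Note this skips deprovisioning, so Ironic may still consider the node registered.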
I am not sure which logs you want, but the image-customization controller was mentioned in the previous bugs. Please let me know which logs you need.
cat metal3-image-customization-8d6986878-5gjpt.log
{"level":"info","ts":1670423759.2775626,"logger":"setup","msg":"Go Version: go1.19.2"}
{"level":"info","ts":1670423759.277595,"logger":"setup","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1670423759.2775996,"logger":"setup","msg":"Git commit: unknown"}
{"level":"info","ts":1670423759.2776031,"logger":"setup","msg":"Build time: unknown"}
{"level":"info","ts":1670423759.277605,"logger":"setup","msg":"Component: openshift/image-customization-controller was not built with version info"}
I1207 14:36:00.328619 1 request.go:665] Waited for 1.041999052s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
{"level":"info","ts":1670423762.1321034,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1670423762.132259,"logger":"setup","msg":"starting manager"}
{"level":"info","ts":1670423762.132457,"logger":"controller.preprovisioningimage","msg":"Starting EventSource","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage","source":"kind source: /, Kind="}
{"level":"info","ts":1670423762.1324975,"logger":"controller.preprovisioningimage","msg":"Starting EventSource","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage","source":"kind source: /, Kind="}
{"level":"info","ts":1670423762.1324658,"msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1670423762.1325104,"logger":"controller.preprovisioningimage","msg":"Starting Controller","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage"}
{"level":"info","ts":1670423762.2337852,"logger":"controller.preprovisioningimage","msg":"Starting workers","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage","worker count":1}
Actual results:
The delete command hangs and the BMH is never removed.
Expected results:
The BMH is deleted as described in the user documentation.
Additional info:
I tried the delete again today (12/8/22) and it appears stuck again. Looking at the node with the Ironic CLI shows Provisioning State = clean failed.
baremetal node list
+--------------------------------------+------------------------------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------------------------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
| 6542c987-8efa-40d7-8936-c34d6b2ce437 | openshift-machine-api~master-2.kni-qe-31.lab.eng.rdu2.redhat.com | 41316c94-f551-44b1-978e-6ca1b6345950 | power off | clean failed | False |
+--------------------------------------+------------------------------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
baremetal node show 6542c987-8efa-40d7-8936-c34d6b2ce437 | grep Timeout
| last_error | Timeout reached while cleaning the node. Please check if the ramdisk responsible for the cleaning is running on the node. Failed on step {}.
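For reference, below is a sketch of the usual Ironic recovery path for a node stuck in "clean failed". These are standard baremetal CLI verbs, not something attempted in this report, and on an OpenShift cluster the baremetal-operator normally drives these state transitions, so running them by hand is an assumption:

# only needed if Maintenance is True; in the listing above it is False
baremetal node maintenance unset 6542c987-8efa-40d7-8936-c34d6b2ce437
# "manage" moves the node from "clean failed" back to "manageable"
baremetal node manage 6542c987-8efa-40d7-8936-c34d6b2ce437
# "provide" retries cleaning and, if it succeeds, makes the node available again
baremetal node provide 6542c987-8efa-40d7-8936-c34d6b2ce437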