OpenShift Bugs / OCPBUGS-4625

Deleting a provisioned BMH when replacing a master node gets stuck during deprovisioning


      12/21: per Ian M, not quite a dup of OCPBUGS-3029; it's low severity, not release gating
      12/20: not gating 4.12, but needs a release note; Ian M will check if this can be closed as a duplicate of OCPBUGS-3029

      Description of problem:

      Deleting a provisioned BMH when replacing a master node gets stuck during deprovisioning.

      Version-Release number of selected component (if applicable):

      4.12.0-rc.2

      How reproducible:

      One time so far, but this has also been reported in BZ#2077059.

      Steps to Reproduce:

      1. Follow the steps for replacing an unhealthy etcd member:
      
      https://docs.openshift.com/container-platform/4.11/backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member
      
      Steps 1 and 2 (a rough sketch of the member-removal part follows below).
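      
      For reference, a rough sketch of the member-removal part of that procedure, as I understand it (the etcd pod name and member ID below are placeholders, not values from this cluster):
      
      # open a shell in a healthy etcd pod
      oc rsh -n openshift-etcd etcd-<healthy-master-node>
      # list the members and note the ID of the unhealthy one
      etcdctl member list -w table
      # remove the failed member by its ID
      etcdctl member remove <member-id>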
      
      This was the next step, to replace the bare-metal control plane node:
      
      1. Follow https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-expanding-the-cluster.html#replacing-a-bare-metal-control-plane-node_ipi-install-expanding
      2. Step 2: Remove the old BareMetalHost and Machine objects: 
      oc delete bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com
      baremetalhost.metal3.io "master-2.kni-qe-31.lab.eng.rdu2.redhat.com" deleted
      
      ^^ gets hung (see the check sketched below)
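      
      To confirm the host is actually stuck in deletion rather than just slow, the check below should show a deletionTimestamp that never clears, the metal3 finalizer still present, and the host sitting in a deprovisioning state (sketch only; fields are from the metal3 BareMetalHost API):
      
      oc get bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com \
        -o jsonpath='{.metadata.deletionTimestamp}{" "}{.metadata.finalizers}{" "}{.status.provisioning.state}{"\n"}'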
      
      There are two known bugs reported for this that still have not resolved it:
      https://bugzilla.redhat.com/show_bug.cgi?id=2077059
      https://bugzilla.redhat.com/show_bug.cgi?id=2087213
      
      The workaround is to delete the finalizer. This seems to be a very common issue and we should address it; it is especially seen in our Nokia System Test BM environment.
      
      [kni@registry.kni-qe-31 MASTER_NODE_REPLACEMENT]$ oc get clusterversion
      NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.0-rc.2   True        False         2d4h    Error while reconciling 4.12.0-rc.2: the cluster operator etcd is degrade
      
      oc delete bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com
      baremetalhost.metal3.io "master-2.kni-qe-31.lab.eng.rdu2.redhat.com" deleted    <-- gets stuck here
      
      # workaround: edit the BMH and remove the "baremetalhost.metal3.io" finalizer; the object is then removed
      oc edit bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com
      
      oc get bmh -n openshift-machine-api
      NAME                                         STATE       CONSUMER                   ONLINE   ERROR   AGE
      master-0.kni-qe-31.lab.eng.rdu2.redhat.com   unmanaged   kni-qe-31-dsbn7-master-0   true             2d5h
      master-1.kni-qe-31.lab.eng.rdu2.redhat.com   unmanaged   kni-qe-31-dsbn7-master-1   true             2d5h
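      
      For completeness, the same workaround can be applied non-interactively instead of using oc edit; this is only a sketch of the finalizer removal described above, not a fix, and it abandons Ironic's cleanup of the host:
      
      # clear the finalizers so the stuck BareMetalHost object can finally be removed
      oc patch bmh -n openshift-machine-api master-2.kni-qe-31.lab.eng.rdu2.redhat.com \
        --type=merge -p '{"metadata":{"finalizers":null}}'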
      
      I am not sure which logs you want, but the image-customization controller was mentioned in previous bugs. Please let me know what logs you need.
      
      
      cat metal3-image-customization-8d6986878-5gjpt.log
      {"level":"info","ts":1670423759.2775626,"logger":"setup","msg":"Go Version: go1.19.2"}
      {"level":"info","ts":1670423759.277595,"logger":"setup","msg":"Go OS/Arch: linux/amd64"}
      {"level":"info","ts":1670423759.2775996,"logger":"setup","msg":"Git commit: unknown"}
      {"level":"info","ts":1670423759.2776031,"logger":"setup","msg":"Build time: unknown"}
      {"level":"info","ts":1670423759.277605,"logger":"setup","msg":"Component: openshift/image-customization-controller was not built with version info"}
      I1207 14:36:00.328619       1 request.go:665] Waited for 1.041999052s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
      {"level":"info","ts":1670423762.1321034,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
      {"level":"info","ts":1670423762.132259,"logger":"setup","msg":"starting manager"}
      {"level":"info","ts":1670423762.132457,"logger":"controller.preprovisioningimage","msg":"Starting EventSource","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage","source":"kind source: /, Kind="}
      {"level":"info","ts":1670423762.1324975,"logger":"controller.preprovisioningimage","msg":"Starting EventSource","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage","source":"kind source: /, Kind="}
      {"level":"info","ts":1670423762.1324658,"msg":"starting metrics server","path":"/metrics"}
      {"level":"info","ts":1670423762.1325104,"logger":"controller.preprovisioningimage","msg":"Starting Controller","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage"}
      {"level":"info","ts":1670423762.2337852,"logger":"controller.preprovisioningimage","msg":"Starting workers","reconciler group":"metal3.io","reconciler kind":"PreprovisioningImage","worker count":1}
      
      

      Actual results:

      The bmh delete hangs and the BareMetalHost is not removed.

      Expected results:

      The BareMetalHost is deleted as described in the user documentation.

      Additional info:

      I was deleting again today (12/8/22) and it appears stuck again. Looking at the Ironic node with the baremetal CLI, the Provisioning State shows "clean failed".
      
      baremetal node list
      +--------------------------------------+------------------------------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
      | UUID                                 | Name                                                             | Instance UUID                        | Power State | Provisioning State | Maintenance |
      +--------------------------------------+------------------------------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
      | 6542c987-8efa-40d7-8936-c34d6b2ce437 | openshift-machine-api~master-2.kni-qe-31.lab.eng.rdu2.redhat.com | 41316c94-f551-44b1-978e-6ca1b6345950 | power off   | clean failed       | False       |
      +--------------------------------------+------------------------------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
      
      
      baremetal node show 6542c987-8efa-40d7-8936-c34d6b2ce437 | grep Timeout
      | last_error             | Timeout reached while cleaning the node. Please check if the ramdisk responsible for the cleaning is running on the node. Failed on step {}.
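      
      If it helps with triage, the full error and state can be dumped from the same node record in JSON; this assumes the baremetal CLI and jq are available where the commands above were run:
      
      baremetal node show 6542c987-8efa-40d7-8936-c34d6b2ce437 -f json \
        | jq '{provision_state, last_error, maintenance, power_state}'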
       

       
