Bug
Resolution: Done
Normal
None
Description of problem:
Use case:
In 17.1 with OSPdO, the customer is testing backup and restore for the case where the physical OCP node hosting the virtualized controller (in this case an OCP master) is lost.
They want to test restore in this scenario, but the restore failed with the following error:
Last Heartbeat Time: 2024-10-21T15:34:55Z
Last Transition Time: 2024-10-21T15:34:55Z
Message: admission webhook "vopenstackbaremetalset.kb.io" denied the request: unable to find 1 requested BaremetalHost count (0 in use, 0 available) with labels [role:totp-cpt-dpdk6] for OpenStackBaremetalSet totp-cpt-dpdk6
Reason: admission webhook "vopenstackbaremetalset.kb.io" denied the request: unable to find 1 requested BaremetalHost count (0 in use, 0 available) with labels [role:totp-cpt-dpdk6] for OpenStackBaremetalSet totp-cpt-dpdk6
Status: True
Type: Restore Error
It seems that the operator is trying to find a BaremetalHost in the available state, while the node is there but in the provisioned state, because I restored the BaremetalHost status (we don't want to lose the compute nodes; the backup is only for the controller):
NAMESPACE NAME STATE CONSUMER ONLINE ERROR AGE LABELS
openshift-machine-api dell01 provisioned totp-cpt-dpdk6 true 4h4m osp-director.openstack.org/controller=osp-baremetalset,osp-director.openstack.org/name=totp-cpt-dpdk6,osp-director.openstack.org/namespace=openstack,osp-director.openstack.org/osphostname=totp-cpt-dpdk6-0,osp-director.openstack.org/uid=c241e4fb-75cd-4096-aa58-aaa87415228c,role=totp-cpt-dpdk6,scope=openstack
openshift-machine-api totp-master0.nfv.cselt.it unmanaged ocp-totp-bjm7t-master-0 true 4d1h scope=openshift
openshift-machine-api totp-master1.nfv.cselt.it unmanaged ocp-totp-bjm7t-master-1 true 4d1h scope=openshift
openshift-machine-api totp-master2.nfv.cselt.it unmanaged ocp-totp-bjm7t-master-2 true 4d1h scope=openshift
The customer attempted to restore an OSPdO backup that included OpenStackBareMetalSet resources.
The restore process encountered an error and failed to recover the OpenStackBareMetalSet.
The error message indicated a mismatch between the expected state of the BaremetalHost (available) and the actual state (provisioned).
It seems that the only way (confirmation needed) is to restore everything: control plane and data plane together.
This may be expected behaviour, but it is strange, as the data plane carrying the workload shouldn't be impacted.
Investigation:
The customer confirmed that the BaremetalHost nodes existed in the cluster and were in the "provisioned" state.
The OpenStackBackup CR referenced the existing BaremetalHost nodes.
We suspected the issue stemmed from labels on the BaremetalHost nodes being changed after the backup.
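The label-drift suspicion can be checked mechanically. A minimal sketch (Python, illustrative only; the selector shown is assumed from the error message, and matching follows the Kubernetes convention of equality-based label selection):

```python
def selector_matches(selector, labels):
    """Equality-based label match, as Kubernetes label selectors use."""
    return all(labels.get(key) == value for key, value in selector.items())

# Selector implied by the webhook error above (assumed):
selector = {"role": "totp-cpt-dpdk6"}

# Labels currently on the BaremetalHost dell01 (from the listing above, abridged):
bmh_labels = {
    "role": "totp-cpt-dpdk6",
    "osp-director.openstack.org/osphostname": "totp-cpt-dpdk6-0",
}

# Here the role label still matches, which suggests the rejection is not a
# simple label mismatch but the host's provisioning state (see Analysis).
print(selector_matches(selector, bmh_labels))
```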
Analysis:
OSPdO backup/restore functionality prioritizes its own Custom Resources (CRs) and doesn't directly manage underlying resources like BaremetalHost.
The restore process assumes it needs to provision BaremetalHost nodes based on the OpenStackBareMetalSet spec, even if they already exist.
Discrepancy between expected and actual BaremetalHost state caused the validation webhook to reject the request.
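The rejection can be illustrated with a small sketch. This is NOT the real osp-director-operator webhook (which is written in Go); field names, state names, and the UID-based "in use" test are assumptions made for the sketch. The point it demonstrates: a host that is "provisioned" but claimed under the pre-restore set identity is counted neither as in use nor as available, so the requested count cannot be satisfied:

```python
# Illustrative reconstruction of the check behind the webhook error above.
def check_bmh_availability(hosts, role, set_uid, requested):
    """Return (allowed, message) in the style of vopenstackbaremetalset.kb.io."""
    in_use = available = 0
    for host in hosts:
        if host.get("labels", {}).get("role") != role:
            continue  # hosts without the role label are never considered
        if host.get("state") == "available":
            available += 1  # free host the operator could claim
        elif host.get("consumer_uid") == set_uid:
            in_use += 1     # already claimed by THIS set instance
        # A "provisioned" host claimed under the old (pre-restore) identity
        # falls through both branches: 0 in use, 0 available.
    if in_use + available < requested:
        return False, (
            f"unable to find {requested} requested BaremetalHost count "
            f"({in_use} in use, {available} available) with labels "
            f"[role:{role}] for OpenStackBaremetalSet {role}"
        )
    return True, ""

# The restored scenario: host still provisioned under the pre-backup identity.
hosts = [{"labels": {"role": "totp-cpt-dpdk6"}, "state": "provisioned",
          "consumer_uid": "c241e4fb-75cd-4096-aa58-aaa87415228c"}]
allowed, message = check_bmh_availability(hosts, "totp-cpt-dpdk6",
                                          "uid-after-restore", 1)
```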
It seems that the only way to make a restore work is to restore the control plane and data plane together.
Version-Release number of selected component (if applicable):
17.1.2 with OSPdO
How reproducible:
Simulate an OCP hardware failure on the node where the OSP controller is running.
Steps to Reproduce:
1. Back up following the official procedure in the documentation, then restore, combining ReaR with the OpenShift part.
Actual results:
It seems the only way to restore one controller is to restore the whole cluster.
Expected results:
Additional info: there is no complete and official reference for performing a backup and restore in this scenario.