Type: Bug
Resolution: Not a Bug
Priority: Normal
Description of problem:
Needed a procedure for disaster recovery of a master node on a baremetal cluster deployed and managed by RHACM. Ideally the procedure should be GitOps-friendly.
Version-Release number of selected component (if applicable):
OCP 4.16.x
How reproducible:
Always
Steps to Reproduce:
- On a cluster with RHACM installed and baremetal infrastructure available, deploy a managed cluster
- Once the cluster is deployed, destroy/delete/stop a master node on the managed cluster
- ...
Actual results:
Expected results:
Additional info:
Playing in a lab, I developed a procedure that I would like to verify:
- Environment
- Hub cluster: OCP 4.16.4 (SNO + RHACM v2.10)
- Deployed cluster: OCP 4.16.8 (3 baremetal hosts running on libvirt with sushy providing the Redfish interface)
- Procedure
- First, clean up objects left over from the dead master on the Deployed Cluster:
$ oc -n openshift-machine-api delete machine mabajodu-cluster1-n9mtg-master-2
$ oc -n openshift-machine-api delete bmh node3.sbr-shift.gsslab.brq2.redhat.com
- Clean up objects on the Hub Cluster:
$ oc -n mabajodu-cluster1 delete bmh node3
$ oc -n mabajodu-cluster1 delete agent b5f04079-a40e-4452-87d2-e96fdb00f6e5
- Now create the BMH in the Hub Cluster:
$ cat node4-bmh.yaml
---
apiVersion: v1
kind: Secret
type: Opaque
data:
  password: dGVzdA==
  username: dGVzdA==
metadata:
  labels:
    environment.metal3.io: baremetal
  name: bmc-node4
  namespace: mabajodu-cluster1
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  annotations:
    bmac.agent-install.openshift.io/hostname: node4.sbr-shift.gsslab.brq2.redhat.com
    inspect.metal3.io: disabled
  labels:
    infraenvs.agent-install.openshift.io: mabajodu-cluster1
  name: node4
  namespace: mabajodu-cluster1
spec:
  automatedCleaningMode: disabled
  bmc:
    address: redfish-virtualmedia+http://10.37.205.55:8000/redfish/v1/Systems/62bcada8-a93a-4f0f-a47a-6f2f124367a6
    credentialsName: bmc-node4
    disableCertificateVerification: true
  bootMACAddress: 52:54:00:71:3a:83
  customDeploy:
    method: start_assisted_install
  online: true
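The Secret above stores the BMC credentials base64-encoded. As a hedged aside (not part of the original report), the lab values here are "test"/"test"; the encoded values can be generated like this, substituting real BMC credentials:

```shell
# Encode a BMC credential for the Secret's data fields.
# "test" is the lab value used in this report.
printf '%s' 'test' | base64   # -> dGVzdA==
```

Note that `printf '%s'` avoids a trailing newline, which `echo` would include and silently change the encoded value.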
- Wait until BMH shows as available and agent object is created:
$ oc get agent
NAME                                   CLUSTER             APPROVED   ROLE          STAGE
391498e9-4d5a-4edb-a924-bc1d95d22ed5   mabajodu-cluster1   true       master        Done
62bcada8-a93a-4f0f-a47a-6f2f124367a6                       true       auto-assign
cba0f39c-7fcc-4ad3-9037-5fd7c6594855   mabajodu-cluster1   true       master        Done
- Then edit the agent object and change its spec from this:
spec:
  approved: true
  hostname: node4.sbr-shift.gsslab.brq2.redhat.com
  role: ""
to this:
spec:
  approved: true
  hostname: node4.sbr-shift.gsslab.brq2.redhat.com
  clusterDeploymentName:
    name: mabajodu-cluster1
    namespace: mabajodu-cluster1
  role: master
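Since the goal is a GitOps-friendly process, a hedged sketch (not from the report) of applying the same agent change non-interactively with a merge patch, using the lab agent name and namespace from this report:

```shell
# Merge patch carrying the same spec change as the interactive edit above.
PATCH='{"spec":{"role":"master","clusterDeploymentName":{"name":"mabajodu-cluster1","namespace":"mabajodu-cluster1"}}}'
# Sanity-check that the patch is valid JSON before sending it to the API server:
echo "$PATCH" | python3 -m json.tool > /dev/null && echo patch-ok
# Then apply it (requires cluster access, so commented out here):
# oc -n mabajodu-cluster1 patch agent 62bcada8-a93a-4f0f-a47a-6f2f124367a6 --type merge -p "$PATCH"
```

A merge patch is idempotent, so it can also live in a pipeline that reconciles the agent after re-provisioning.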
- Wait until the node is provisioned
- Wait some more time until the cluster operators install and update their configuration. All COs should be recovering at this point, until the cluster is back to normal operation
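To watch that recovery, a hedged sketch for listing cluster operators that are not yet Available. It is demonstrated here on canned sample output (the versions and ages are made up); on a live cluster, pipe in `oc get co --no-headers` instead:

```shell
# Sample "oc get co --no-headers" output: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
sample='etcd             4.16.8   True    False   False   10m
kube-apiserver   4.16.8   False   True    False   2m'
# Print the names of operators whose AVAILABLE column is not "True".
echo "$sample" | awk '$3 != "True" {print $1}'   # -> kube-apiserver
```

When this filter prints nothing, every operator reports Available and the cluster should be back to normal operation.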