Type: Bug
Resolution: Not a Bug
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: Infrastructure Operator
Labels:
- BackupAndRecovery
- Triaged

Blocked:
False
Blocked Reason:
None
Ready:
False
Regression:
None

Description of problem:

A procedure is needed for disaster recovery of a master (control-plane) node on a bare-metal cluster deployed and managed by RHACM. Ideally, the procedure should be GitOps-friendly.

Version-Release number of selected component (if applicable):

OCP 4.16.x

How reproducible:

Always

Steps to Reproduce:

On a hub cluster with RHACM installed and bare-metal infrastructure available, deploy a managed cluster.
Once the cluster is deployed, destroy, delete, or stop a master node on the managed cluster.
...

Actual results:

Expected results:

Additional info:

While experimenting in a lab, I developed a procedure that I would like to have verified:

Environment
- Hub cluster: OCP 4.16.4 (SNO + RHACM v2.10)
- Deployed cluster: OCP 4.16.8 (3 bare-metal hosts running on libvirt, with sushy providing the Redfish interface)

Procedure
- First, clean up the objects left behind by the dead master on the deployed cluster:
$ oc -n openshift-machine-api delete machine mabajodu-cluster1-n9mtg-master-2
$ oc -n openshift-machine-api delete bmh node3.sbr-shift.gsslab.brq2.redhat.com
- Clean up the corresponding objects on the hub cluster:
$ oc -n mabajodu-cluster1 delete bmh node3
$ oc -n mabajodu-cluster1 delete agent b5f04079-a40e-4452-87d2-e96fdb00f6e5
- Now create the BMH for the replacement host on the hub cluster:

$ cat node4-bmh.yaml 
---
apiVersion: v1
kind: Secret
type: Opaque
data:
  password: dGVzdA==
  username: dGVzdA==
metadata:
  labels:
    environment.metal3.io: baremetal
  name: bmc-node4
  namespace: mabajodu-cluster1
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  annotations:
    bmac.agent-install.openshift.io/hostname: node4.sbr-shift.gsslab.brq2.redhat.com
    inspect.metal3.io: disabled
  labels:
    infraenvs.agent-install.openshift.io: mabajodu-cluster1
  name: node4
  namespace: mabajodu-cluster1
spec:
  automatedCleaningMode: disabled
  bmc:
    address: redfish-virtualmedia+http://10.37.205.55:8000/redfish/v1/Systems/62bcada8-a93a-4f0f-a47a-6f2f124367a6
    credentialsName: bmc-node4
    disableCertificateVerification: true
  bootMACAddress: 52:54:00:71:3a:83
  customDeploy:
    method: start_assisted_install
  online: true

- Wait until the BMH shows as available and the agent object has been created:

$ oc get agent
NAME                                   CLUSTER             APPROVED   ROLE          STAGE
391498e9-4d5a-4edb-a924-bc1d95d22ed5   mabajodu-cluster1   true       master        Done
62bcada8-a93a-4f0f-a47a-6f2f124367a6                       true       auto-assign   
cba0f39c-7fcc-4ad3-9037-5fd7c6594855   mabajodu-cluster1   true       master        Done
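
The wait on the BMH could be scripted rather than checked by hand. The following is a minimal sketch, assuming the BMH exposes its state in `.status.provisioning.state` and that polling every few seconds is acceptable; `wait_for_bmh_state` is a hypothetical helper name, not part of `oc`, and the retry count and delay are arbitrary defaults.

```shell
# Sketch: poll a BareMetalHost until it reports the wanted provisioning state.
# wait_for_bmh_state is a hypothetical helper; defaults are assumptions.
wait_for_bmh_state() {
  ns="$1"; bmh="$2"; want="$3"
  tries="${4:-60}"; delay="${5:-10}"
  i=0
  state=""
  while [ "$i" -lt "$tries" ]; do
    # Read the current provisioning state; errors are ignored and retried.
    state="$(oc -n "$ns" get bmh "$bmh" -o jsonpath='{.status.provisioning.state}' 2>/dev/null)"
    if [ "$state" = "$want" ]; then
      echo "bmh/$bmh is $want"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for bmh/$bmh to become $want (last state: ${state:-unknown})" >&2
  return 1
}

# Example with the names used in this report (run against the hub cluster):
# wait_for_bmh_state mabajodu-cluster1 node4 available
```

Once the function returns successfully, `oc -n mabajodu-cluster1 get agent` should list the new, still unbound agent as in the output above.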

- Then edit the agent object, changing its spec from this:

spec:
  approved: true
  hostname: node4.sbr-shift.gsslab.brq2.redhat.com
  role: ""

to this:

spec:
  approved: true
  hostname: node4.sbr-shift.gsslab.brq2.redhat.com
  clusterDeploymentName:
    name: mabajodu-cluster1
    namespace: mabajodu-cluster1
  role: master
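
For a scripted or GitOps-driven run, the manual edit above could be expressed as a merge patch instead of an interactive `oc edit`. This is only a sketch: `bind_agent_as_master` is a hypothetical helper name, and it assumes the ClusterDeployment lives in the same namespace as the agent, as it does in this lab.

```shell
# Sketch: bind an unassigned agent to a cluster and give it the master role.
# bind_agent_as_master is a hypothetical helper, not part of oc.
# Assumption: the ClusterDeployment is in the same namespace as the agent.
bind_agent_as_master() {
  ns="$1"; agent="$2"; cluster="$3"
  # Build the JSON merge patch that sets clusterDeploymentName and role.
  patch="$(printf '{"spec":{"clusterDeploymentName":{"name":"%s","namespace":"%s"},"role":"master"}}' "$cluster" "$ns")"
  oc -n "$ns" patch agent "$agent" --type merge -p "$patch"
}

# Example with the values from this report (run against the hub cluster):
# bind_agent_as_master mabajodu-cluster1 62bcada8-a93a-4f0f-a47a-6f2f124367a6 mabajodu-cluster1
```

A merge patch leaves `approved` and `hostname` untouched, which matches the before and after specs shown above.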

- Wait until the node is provisioned.
- Wait some more for the cluster operators to install and update their configuration. All cluster operators (CO) should be recovering at this point, until the cluster is back to normal operation.
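
The two waits at the end of the procedure might be automated with `oc wait`, run against the deployed cluster rather than the hub. This is a sketch under assumptions: `wait_for_recovery` is a hypothetical helper name, the 30-minute default timeout is a guess, and "back to normal" is taken to mean every cluster operator is Available and not Degraded.

```shell
# Sketch: block until the replacement node is Ready and the cluster
# operators have settled. wait_for_recovery is a hypothetical helper.
# Assumption: the current kubeconfig points at the deployed cluster.
wait_for_recovery() {
  node="$1"; timeout="${2:-30m}"
  # Each step must succeed before the next one starts.
  oc wait "node/$node" --for=condition=Ready --timeout="$timeout" &&
  oc wait clusteroperator --all --for=condition=Available=True --timeout="$timeout" &&
  oc wait clusteroperator --all --for=condition=Degraded=False --timeout="$timeout"
}

# Example with the hostname used in this report:
# wait_for_recovery node4.sbr-shift.gsslab.brq2.redhat.com
```

The function returns non-zero as soon as any of the three waits times out, so it can gate later automation steps.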

Assignee: Chad Crum

Reporter: Mario Abajo Duran

QA Contact: Chad Crum

Votes: 0

Watchers: 6

Created: 2024/08/30 3:46 PM

Updated: 2024/09/27 3:00 PM

Resolved: 2024/09/27 3:00 PM
