Red Hat Advanced Cluster Management / ACM-13711

[OCP 4.16] DR Procedure for master substitution on a RHACM deployed cluster

      Description of problem:

      A procedure is needed for disaster recovery of a master node on a baremetal cluster deployed and managed by RHACM. Ideally, the procedure should be GitOps-friendly.

      Version-Release number of selected component (if applicable):

      OCP 4.16.x

      How reproducible:

      Always

      Steps to Reproduce:

      1. On a hub cluster with RHACM installed and baremetal infrastructure available, deploy a managed cluster
      2. Once the cluster is deployed, destroy/delete/stop a master node on the managed cluster
      3. ...

      Actual results:

      Expected results:

      Additional info:

      While experimenting in a lab, I developed a procedure that I would like to have verified:

      • Environment
          - Hub cluster: OCP 4.16.4 (SNO + RHACM v2.10)
          - Deployed cluster: OCP 4.16.8 (3 baremetal hosts running on libvirt with sushy for redfish interface)
      • Procedure
          - First, clean up the objects left by the dead master on the Deployed Cluster:
            $ oc -n openshift-machine-api delete machine mabajodu-cluster1-n9mtg-master-2
            $ oc -n openshift-machine-api delete bmh node3.sbr-shift.gsslab.brq2.redhat.com
          - Clean up objects on the Hub Cluster:
            $ oc -n mabajodu-cluster1 delete bmh node3
            $ oc -n mabajodu-cluster1 delete agent b5f04079-a40e-4452-87d2-e96fdb00f6e5
          - Now create the BMH in the Hub Cluster:
        $ cat node4-bmh.yaml 
        ---
        apiVersion: v1
        kind: Secret
        type: Opaque
        data:
          password: dGVzdA==
          username: dGVzdA==
        metadata:
          labels:
            environment.metal3.io: baremetal
          name: bmc-node4
          namespace: mabajodu-cluster1
        ---
        apiVersion: metal3.io/v1alpha1
        kind: BareMetalHost
        metadata:
          annotations:
            bmac.agent-install.openshift.io/hostname: node4.sbr-shift.gsslab.brq2.redhat.com
            inspect.metal3.io: disabled
          labels:
            infraenvs.agent-install.openshift.io: mabajodu-cluster1
          name: node4
          namespace: mabajodu-cluster1
        spec:
          automatedCleaningMode: disabled
          bmc:
            address: redfish-virtualmedia+http://10.37.205.55:8000/redfish/v1/Systems/62bcada8-a93a-4f0f-a47a-6f2f124367a6
            credentialsName: bmc-node4
            disableCertificateVerification: true
          bootMACAddress: 52:54:00:71:3a:83
          customDeploy:
            method: start_assisted_install
          online: true

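The listing above only displays the manifest; presumably it is then applied on the Hub Cluster. A minimal sketch of that step, assuming `oc` is logged in to the hub and the file is named node4-bmh.yaml as in the listing. The helper names `b64` and `apply_bmh` are hypothetical. The Secret's `data` values are base64-encoded (`dGVzdA==` is the encoding of `test`), so the first helper shows how such a value can be produced.

```shell
# Hypothetical helper: base64-encode a value for a Secret's data field.
# printf is used so that no trailing newline ends up in the encoded value.
b64() {
  printf '%s' "$1" | base64
}

# Assumed step (not shown in the report): apply the manifest on the hub.
# The file name defaults to the one used in the listing.
apply_bmh() {
  oc apply -f "${1:-node4-bmh.yaml}"
}
```

For example, `b64 test` prints `dGVzdA==`, and `apply_bmh` would create both the Secret and the BareMetalHost in one call.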
          - Wait until the BMH shows as available and the new Agent object is created:

        $ oc get agent
        NAME                                   CLUSTER             APPROVED   ROLE          STAGE
        391498e9-4d5a-4edb-a924-bc1d95d22ed5   mabajodu-cluster1   true       master        Done
        62bcada8-a93a-4f0f-a47a-6f2f124367a6                       true       auto-assign   
        cba0f39c-7fcc-4ad3-9037-5fd7c6594855   mabajodu-cluster1   true       master        Done
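In the listing above, the new Agent is the one with an empty CLUSTER column. The wait could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, the retry count and delay are arbitrary, and it assumes an unbound Agent can be recognized by an empty `spec.clusterDeploymentName`, as the output above suggests.

```shell
# List Agents in namespace $1 that are not yet bound to a cluster,
# i.e. whose spec.clusterDeploymentName.name is empty.
unbound_agents() {
  oc -n "$1" get agent \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.clusterDeploymentName.name}{"\n"}{end}' |
    awk -F'\t' '$1 != "" && $2 == "" { print $1 }'
}

# Poll until at least one unbound Agent exists, then print its name.
# $2 = number of attempts (default 60), $3 = seconds between attempts (default 10).
wait_for_unbound_agent() {
  ns=$1 tries=${2:-60} delay=${3:-10}
  while [ "$tries" -gt 0 ]; do
    found=$(unbound_agents "$ns")
    if [ -n "$found" ]; then
      printf '%s\n' "$found"
      return 0
    fi
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}
```

In this lab it would be called on the hub as `wait_for_unbound_agent mabajodu-cluster1`.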

          - Then edit the new Agent object, changing its spec from this:

        spec:
          approved: true
          hostname: node4.sbr-shift.gsslab.brq2.redhat.com
          role: ""

            to this:

        spec:
          approved: true
          hostname: node4.sbr-shift.gsslab.brq2.redhat.com
          clusterDeploymentName:
            name: mabajodu-cluster1
            namespace: mabajodu-cluster1
          role: master
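For a more GitOps-friendly or scripted flow, the same change could probably be made with a merge patch instead of an interactive edit. A sketch under that assumption: the function names are hypothetical, and the patch only sets the two fields that differ between the specs above (`clusterDeploymentName` and `role`), leaving `approved` and `hostname` untouched.

```shell
# Build the JSON merge patch that binds an Agent to a ClusterDeployment
# ($1 = ClusterDeployment name, $2 = its namespace) with the master role.
agent_patch() {
  printf '{"spec":{"clusterDeploymentName":{"name":"%s","namespace":"%s"},"role":"master"}}' "$1" "$2"
}

# Apply the patch on the hub.
# $1 = Agent namespace, $2 = Agent name, $3/$4 = ClusterDeployment name/namespace.
bind_agent_as_master() {
  oc -n "$1" patch agent "$2" --type merge -p "$(agent_patch "$3" "$4")"
}
```

With the values from this report, the call would be `bind_agent_as_master mabajodu-cluster1 62bcada8-a93a-4f0f-a47a-6f2f124367a6 mabajodu-cluster1 mabajodu-cluster1`.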

          - Wait until the node is provisioned
          - Wait some more time while the cluster operators roll out and update their configuration. All cluster operators (CO) should be recovering at this point, until the cluster is back to normal operation
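The two waiting steps could be expressed with `oc wait` on the Deployed Cluster. This is a rough sketch rather than a verified part of the procedure: the function names and default timeouts are assumptions, and "back to normal" is taken here to mean that every cluster operator reports Available=True, Progressing=False and Degraded=False.

```shell
# Wait until node $1 reports the Ready condition ($2 = timeout, default 30m).
wait_for_node_ready() {
  oc wait "node/$1" --for=condition=Ready --timeout="${2:-30m}"
}

# Wait until all cluster operators are Available and none is Progressing
# or Degraded. The three conditions are checked one after another, each
# with the same time limit ($1, default 45m).
wait_for_cluster_operators() {
  limit=${1:-45m}
  oc wait clusteroperators --all --for=condition=Available=True --timeout="$limit" &&
    oc wait clusteroperators --all --for=condition=Progressing=False --timeout="$limit" &&
    oc wait clusteroperators --all --for=condition=Degraded=False --timeout="$limit"
}
```

In this lab the node name would presumably be the hostname set in the BMH annotation, for example `wait_for_node_ready node4.sbr-shift.gsslab.brq2.redhat.com` followed by `wait_for_cluster_operators`.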
          

              chadcrum Chad Crum
              rhn-support-mabajodu Mario Abajo Duran
              Chad Crum Chad Crum