Hybrid Cloud Infrastructure Documentation / HCIDOCS-523

Fix "Installing CP node on unhealthy cluster" procedure

    • 5
    • HCIDOCS 2024#11, HCIDOCS 2024#12
    • 2

      Installing a primary control plane node on an unhealthy cluster

      The title, use case, and prerequisites for this procedure are confusing. The procedure is based on 411-unhealthy.md, which describes replacing a control plane node in a cluster with 2 healthy control plane nodes and 1 unhealthy control plane node. The prerequisite about having a day 2 control plane makes no sense, because this is the procedure for creating a day 2 control plane node.
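
The starting state described above (one control plane node unavailable, the others healthy) could be confirmed from the `oc get nodes` output. The following is only a sketch: the function name `unavailable_cp_nodes` is hypothetical, and it assumes the default column layout (NAME, STATUS, ROLES, ...) and a ROLES value that contains `master` or `control-plane`.

```shell
# unavailable_cp_nodes (hypothetical helper): reads `oc get nodes` output on
# stdin and prints the names of control plane nodes whose STATUS starts with
# NotReady. Assumes the default columns: NAME STATUS ROLES AGE VERSION.
unavailable_cp_nodes() {
  awk 'NR > 1 && $2 ~ /^NotReady/ && $3 ~ /(^|,)(master|control-plane)(,|$)/ { print $1 }'
}
```

Possible usage, assuming a logged-in `oc` session: `oc get nodes | unavailable_cp_nodes`. For the use case in this ticket, exactly one node name would be expected in the output.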

      Task scope:

      • New title: "Replacing a control plane node in an unhealthy cluster". Remove "primary" because it does not make sense.
      • Use case: "You can replace an unhealthy control plane node in a cluster with 3 control plane nodes."
      • Prerequisites:
        • You have installed OpenShift Container Platform 4.11 or later, with the required etcd-operator version.
        • You have added a host to the cluster by using the UI or the API.
        • You have added the annotation role:master to the host to create a new control plane node.
      • 'oc rsh' commands. Will be fixed with HCIDOCS-518.
      • "Confirm initial state of the cluster:" > "Check the node status to verify that a control plane node is not available:"
      • "Confirm the etcd-operator detects the cluster as unhealthy:" > "Check the etcd-operator log to verify that a control plane node is not available:"
      • "Confirm the etcdctl members: " > "Open a remote shell connection to etcd-worker-3"
      • Add "List the etcdctl members:" # etcdctl member list -w table
      • "Confirm that etcdctl reports an unhealthy member of the cluster: " > "Check the etcdctl endpoint health:". Fix the prompt: # etcdctl endpoint health
      • "Remove the unhealthy control plane by deleting the Machine Custom Resource:" > "Remove the unhealthy control plane node by deleting the Machine custom resource (CR): ". Move note to next step.
      • "Confirm that etcd-operator has not removed the unhealthy machine: " > "Check the etcd-operator log to verify that the machine CR was deleted:"
      • "Note: The Machine and Node Custom Resources (CRs) will not be deleted if the unhealthy cluster cannot run successfully." > "Note: The Machine and Node objects might not be deleted because they are protected by finalizers. If this occurs, you must delete the Machine CR manually."
      • "Remove the unhealthy etcdctl member manually: " > "Open a remote shell connection to ..."
      • Add "Get a list of the etcdctl members:" - #  etcdctl member list -w table
      • "Confirm that etcdctl reports an unhealthy member of the cluster: " > "Check the etcdctl endpoint health:"
      • "Remove the unhealthy cluster by deleting the etcdctl member Custom Resource: " > "Remove the unhealthy etcdctl member from the cluster:"
      • "Confirm members of etcdctl by running the following command: " > "Verify that the unhealthy etcdctl member was removed by running the following command:"
      • "Confirm ready status of the control plane node: " > "Check the node status to verify that all control plane nodes are available:"
      • "Validate the Machine, Node and BareMetalHost Custom Resources. " - HOW? This step has no commands. Is it in the wrong place?
      • "Create Machine Custom Resource linked with BareMetalHost and Node. " Is this sentence perhaps intended to be a lead-in for a sub procedure? (Add BMH, Add Machine, link BMH, Machine, and Node).
      • "Add BareMetalHost Custom Resource: " > "Create a BareMetalHost CR for the new control plane node:"
      • "Add Machine Custom Resource: " > "Create a Machine CR for the new control plane node: "
      • "Link BareMetalHost, Machine, and Node by running the link-machine-and-node.sh script: " > "Save the link-machine-and-node.sh script on your local machine:"
      • Add "Make the link-machine-and-node.sh script executable by running the following command:
        $ chmod +x link-machine-and-node.sh"
      • Add "Link the BareMetalHost CR, the Machine CR, and the control plane node by running the link-machine-and-node.sh script: " + command
      • "Remove the unhealthy etcdctl member manually: " > "Open a remote shell connection to ..."
      • Add "Get a list of the etcdctl members:" - #  etcdctl member list -w table
      • "Confirm the etcd-operator configuration applies to all nodes:" > "Monitor the etcd-operator configuration process:". The configuration can take a long time, which is why the user needs this command.
      • "Confirm health of etcdctl: "  > "Open a remote shell connection to etcd-worker-3"
      • Add "Check the etcdctl endpoint health:" # etcdctl endpoint health
      • "Confirm the health of the ClusterOperators: " > "Verify that the Operators are available:"
      • "Confirm the ClusterVersion: " > "Verify that the cluster version is correct:"
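
The steps above that list the members (`etcdctl member list -w table`) and then remove the unhealthy one need the member ID from the table. As a rough sketch of how that lookup might be scripted, the hypothetical helper `member_id_for` below reads the table output and prints the ID for a given NAME. It assumes the table has `ID` and `NAME` header columns, and it works on saved output as well as a live pipe.

```shell
# member_id_for NAME (hypothetical helper): reads the output of
# `etcdctl member list -w table` on stdin and prints the ID of the member
# whose NAME column equals NAME. Columns are located by header text,
# so the helper does not depend on a fixed column order.
member_id_for() {
  awk -F'|' -v name="$1" '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    /^[|]/ {
      if (!id_col) {
        # The first row that starts with "|" is the header row.
        for (i = 1; i <= NF; i++) {
          h = trim($i)
          if (h == "ID") id_col = i
          if (h == "NAME") name_col = i
        }
        next
      }
      if (trim($name_col) == name) print trim($id_col)
    }
  '
}
```

Possible usage: `etcdctl member list -w table | member_id_for <member_name>`, where `<member_name>` is a placeholder for the name of the unhealthy member. The printed ID could then be passed to `etcdctl member remove`.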

              rhn-support-tshwartz Talia Shwartzberg
              apinnick@redhat.com Avital Pinnick