Bug
Resolution: Unresolved
Normal
Installing a primary control plane node on an unhealthy cluster
The title, use case, and prerequisites for this procedure are confusing. The procedure is based on 411-unhealthy.md, which describes replacing a control plane node in a cluster with 2 healthy CP nodes and 1 unhealthy CP node. (The prerequisite about having a day 2 control plane node makes no sense, because this is the procedure for creating a day 2 control plane node.)
Task scope:
- New title: "Replacing a control plane node in an unhealthy cluster". Remove "primary" because it does not make sense.
- Use case: "You can replace an unhealthy control plane node in a cluster with 3 control plane nodes."
- Prerequisites:
- You have installed OpenShift Container Platform 4.11 or later, with the required etcd-operator version.
- You have added a host to the cluster by using the UI or the API.
- You have added the annotation role:master to the host to create a new control plane node.
- 'oc rsh' commands. Will be fixed with HCIDOCS-518.
- "Confirm initial state of the cluster:" -> "Check the node status to verify that a control plane node is not available:"
- "Confirm the etcd-operator detects the cluster as unhealthy:" > "Check the etcd-operator log to verify that a control plane node is not available:"
- "Confirm the etcdctl members: " > "Open a remote shell connection to etcd-worker-3"
- Add "List the etcdctl members:" # etcdctl member list -w table
- "Confirm that etcdctl reports an unhealthy member of the cluster: " > "Check the etcdctl endpoint health:". Fix the prompt: # etcdctl endpoint health
- "Remove the unhealthy control plane by deleting the Machine Custom Resource:" > "Remove the unhealthy control plane node by deleting the Machine custom resource (CR): ". Move note to next step.
- "Confirm that etcd-operator has not removed the unhealthy machine: " > "Check the etcd-operator log to verify that the machine CR was deleted:"
- "Note: The Machine and Node Custom Resources (CRs) will not be deleted if the unhealthy cluster cannot run successfully." > "Note: The Machine and Node objects might not be deleted because they are protected by finalizers. If this occurs, you must delete the Machine CR manually."
- "Remove the unhealthy etcdctl member manually: " > "Open a remote shell connection to ..."
- Add "Get a list of the etcdctl members:" - # etcdctl member list -w table
- "Confirm that etcdctl reports an unhealthy member of the cluster: " > "Check the etcdctl endpoint health:"
- "Remove the unhealthy cluster by deleting the etcdctl member Custom Resource: " > "Remove the unhealthy etcdctl member from the cluster:"
- "Confirm members of etcdctl by running the following command: " > "Verify that the unhealthy etcdctl member was removed by running the following command:"
- "Confirm ready status of the control plane node: " > "Check the node status to verify that all control plane nodes are available:"
- "Validate the Machine, Node and BareMetalHost Custom Resources. " - HOW? This step has no commands. Is it in the wrong place?
- "Create Machine Custom Resource linked with BareMetalHost and Node. " Is this sentence perhaps intended to be a lead-in for a sub procedure? (Add BMH, Add Machine, link BMH, Machine, and Node).
- "Add BareMetalHost Custom Resource: " > "Create a BareMetalHost CR for the new control plane node:"
- "Add Machine Custom Resource: " > "Create a Machine CR for the new control plane node: "
- "Link BareMetalHost, Machine, and Node by running the link-machine-and-node.sh script: " > "Save the link-machine-and-node.sh script on your local machine:"
- Add "Make the link-machine-and-node.sh script executable by running the following command:" $ chmod +x link-machine-and-node.sh
- Add "Link the BareMetalHost CR, the Machine CR, and the control plane node by running the link-machine-and-node.sh script: " + command
- "Remove the unhealthy etcdctl member manually: " > "Open a remote shell connection to ..."
- Add "Get a list of the etcdctl members:" - # etcdctl member list -w table
- "Confirm the etcd-operator configuration applies to all nodes:" -> "Monitor the etcd-operator configuration process:". The configuration can take a long time. That's why the user needs this command.
- "Confirm health of etcdctl: " > "Open a remote shell connection to etcd-worker-3"
- Add "Check the etcdctl endpoint health: # etcdctl endpoint health"
- "Confirm the health of the ClusterOperators: " > "Verify that the Operators are available:"
- "Confirm the ClusterVersion: " > "Verify that the cluster version is correct:"
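For reference while rewording, the check steps above could map to commands roughly as in the following sketch. It is not the text of the published procedure. The etcdctl commands and the etcd-worker-3 pod name come from the list above; the `openshift-etcd` namespace, the `run` helper, and the `DRY_RUN` and `ETCD_POD` variables are assumptions added for illustration. The exact 'oc rsh' invocation is still pending HCIDOCS-518.

```shell
#!/usr/bin/env bash
# Sketch only: prints (or runs) the command behind each reworded check step.
# DRY_RUN=1 (default, hypothetical variable) prints commands without touching a cluster.
set -u
DRY_RUN="${DRY_RUN:-1}"
# Pod name taken from the ticket; replace it with the etcd pod in your cluster.
ETCD_POD="${ETCD_POD:-etcd-worker-3}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# "Check the node status to verify that a control plane node is not available:"
check_nodes() { run oc get nodes; }

# "List the etcdctl members:" (assumes the pod runs in the openshift-etcd namespace)
list_etcd_members() { run oc rsh -n openshift-etcd "$ETCD_POD" etcdctl member list -w table; }

# "Check the etcdctl endpoint health:"
check_etcd_health() { run oc rsh -n openshift-etcd "$ETCD_POD" etcdctl endpoint health; }

# "Verify that the Operators are available:"
check_operators() { run oc get clusteroperators; }

# "Verify that the cluster version is correct:"
check_cluster_version() { run oc get clusterversion; }

# Walk the checks in procedure order.
check_nodes
list_etcd_members
check_etcd_health
check_operators
check_cluster_version
```

With the default DRY_RUN=1 the script only prints each command, which makes it easy to compare prompts and wording against the draft. Setting DRY_RUN=0 runs the commands against the cluster that the current `oc` login points to.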
- clones: HCIDOCS-522 Fix "Installing CP node on healthy cluster" procedure (Review)
- is related to: HCIDOCS-412 Fix node names (Review)
- relates to: HCIDOCS-555 The node names mentioned on 'Installing a primary control plane node on a healthy cluster' documentation looks incorrect. (Review)
- mentioned on