Task
Resolution: Done
ACM 2.11.0
Create an informative issue (See each section, incomplete templates/issues won't be triaged)
Using the current documentation as a model, please complete the issue template.
Note: Doc team updates the current version and the two previous versions (n-2). For earlier versions, we will address only high-priority, customer-reported issues for releases in support.
Prerequisite: Start with what we have
Always look at the current documentation to describe the change that is needed. Use the source or portal link for Step 4:
- Use the Customer Portal: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes
- Use the GitHub link to find the staged docs in the repository: https://github.com/stolostron/rhacm-docs
Describe the changes in the doc and link to your dev story
Provide info for the following steps:
1. - [x] Mandatory Add the required version to the Fix version/s field.
2. - [x] Mandatory Choose the type of documentation change.
- [x] New topic in an existing section or new section: Perhaps a new topic after this one: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.10/html/clusters/cluster_mce_overview#enable-node-auto-scaling-hosted-cluster
Alternatively, add it to an existing section that describes machine health checks or managed cluster node recovery/replacement, if one exists.
- [ ] Update to an existing topic
3. - [x] Mandatory for GA content:
- [x] Add steps and/or other important conceptual information here: (See content below)
- [x] Add Required access level for the user to complete the task here: Same as creating hosted control planes
- [x] Add verification at the end of the task, how does the user verify success (a command to run or a result to see?): (See content below)
- [x] Add link to dev story here: OCPSTRAT-1123 & MGMT-17492
—
Content
Introductory sub-section
Title: Auto-repair Bare Metal Managed Cluster Nodes
Hosted control planes that use the Agent platform can use Machine Health Checks (link: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.15/html/machine_management/deploying-machine-health-checks) to replace unhealthy managed cluster nodes.
When the Machine Health Check object considers a managed cluster node unhealthy, the node is replaced automatically.
Enabling Machine Health Checks sub-section
Machine Health Checks can be created by editing the NodePool.
Steps to enable Machine Health Checks:
- Ensure that `spec.nodeDrainTimeout` on your NodePool is greater than 0s
- To verify, run the following command:
- oc get nodepool -n <hosted_cluster_namespace> <nodepool_name> -o yaml | grep nodeDrainTimeout
- Expected output (any value greater than 0s, for example):
- nodeDrainTimeout: 30s
- If the value is not greater than 0s, run the following command, setting a duration greater than 0s:
- oc patch nodepool -n <hosted_cluster_namespace> <nodepool_name> -p '{"spec":{"nodeDrainTimeout": "30m"}}' --type=merge
- To verify, run the `oc get nodepool` command from the verification step again and confirm that the output shows a value greater than 0s
- Enable Machine Health Check by setting `spec.management.autoRepair` in the NodePool to true using the following command:
- oc patch nodepool -n <hosted_cluster_namespace> <nodepool_name> -p '{"spec": {"management": {"autoRepair":true}}}' --type=merge
- Verify by running the following command:
- oc get nodepool -n <hosted_cluster_namespace> <nodepool_name> -o yaml | grep autoRepair
- Expected output:
- autoRepair: true
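The enable steps above could be wrapped in a small script. The following is a minimal sketch, not a supported tool: the function names and the `OC` override variable are hypothetical, and the 30m default simply mirrors the example value used in the steps.

```shell
# Sketch only: helper functions that wrap the patch steps above.
# OC is a hypothetical override for the oc binary (defaults to "oc").

# Print the merge patch that sets spec.nodeDrainTimeout (for example "30m").
drain_timeout_patch() {
  printf '{"spec":{"nodeDrainTimeout":"%s"}}' "$1"
}

# Print the merge patch that sets spec.management.autoRepair (true or false).
auto_repair_patch() {
  printf '{"spec":{"management":{"autoRepair":%s}}}' "$1"
}

# Apply both patches to one NodePool; stop if the first patch fails.
# Arguments: <hosted_cluster_namespace> <nodepool_name> [drain_timeout]
enable_auto_repair() {
  "${OC:-oc}" patch nodepool -n "$1" "$2" --type=merge \
    -p "$(drain_timeout_patch "${3:-30m}")" &&
  "${OC:-oc}" patch nodepool -n "$1" "$2" --type=merge \
    -p "$(auto_repair_patch true)"
}
```

Example usage: `enable_auto_repair <hosted_cluster_namespace> <nodepool_name> 30m`. Building the JSON in one place keeps the duration quoted as a string, which the merge patch needs in order to be valid JSON.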
Additional notes:
- Ideally, additional host machines (Agents) are available and ready to be installed in case managed cluster nodes become unhealthy
- The Machine Health Check object created through this process is not configurable and has the following fixed specifications:
- Does not replace nodes until at least 2 nodes have been unhealthy for at least 8 minutes
- A node is considered unhealthy when its Node condition on the spoke cluster shows:
- Ready is False or Unknown
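To see which nodes currently match the unhealthy definition above, something like the following sketch might help. It assumes the command runs against the hosted (spoke) cluster, for example with that cluster's kubeconfig; the function names and the `OC` override variable are hypothetical. It only reports the Ready status and does not track the 8-minute duration.

```shell
# Sketch only: report nodes whose Ready condition is False or Unknown.
# OC is a hypothetical override for the oc binary (defaults to "oc").

# Return success (0) when a Ready condition status counts as unhealthy.
is_unhealthy_ready_status() {
  case "$1" in
    False|Unknown) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the names of nodes that are currently unhealthy.
# Run with credentials for the hosted (spoke) cluster.
list_unhealthy_nodes() {
  "${OC:-oc}" get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
  while read -r name status; do
    if is_unhealthy_ready_status "$status"; then
      printf '%s\n' "$name"
    fi
  done
}
```

An empty result means that no node currently reports Ready as False or Unknown.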
Disabling Machine Health Checks sub-section
Steps to disable Machine Health Checks:
- Disable Machine Health Check by setting `spec.management.autoRepair` in the NodePool to false using the following command:
- oc patch nodepool -n <hosted_cluster_namespace> <nodepool_name> -p '{"spec": {"management": {"autoRepair":false}}}' --type=merge
- Verify by running the following command:
- oc get nodepool -n <hosted_cluster_namespace> <nodepool_name> -o yaml | grep autoRepair
- Expected output:
- autoRepair: false
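The verification could also be scripted so that it fails loudly when the NodePool is not in the expected state. This is a minimal sketch with a hypothetical function name and `OC` override variable. It reads the field with a JSONPath query instead of `grep`, and it compares strictly, so an unset field (empty output) does not count as false.

```shell
# Sketch only: check spec.management.autoRepair on a NodePool.
# OC is a hypothetical override for the oc binary (defaults to "oc").

# Succeed when the field equals the expected value.
# Arguments: <hosted_cluster_namespace> <nodepool_name> <true|false>
check_auto_repair() {
  actual="$("${OC:-oc}" get nodepool -n "$1" "$2" \
    -o jsonpath='{.spec.management.autoRepair}')"
  [ "$actual" = "$3" ]
}
```

Example usage after disabling: `check_auto_repair <hosted_cluster_namespace> <nodepool_name> false && echo "autoRepair is disabled"`.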