Story
Resolution: Unresolved
SNR node remediation can take a long time when a node becomes unhealthy while the self-node-remediation-controller-manager is running on that node.
This delay occurs as of the following commit: https://github.com/medik8s/self-node-remediation/commit/748e5742284dd672d91a553a205496189288b592
The patch changes the behavior as follows:
[Before merging this patch]
The SNR daemonset pod on every node tries to add/delete the out-of-service taint when an SNR object is created.
[After merging this patch]
Only the self-node-remediation-controller-manager pod and the SNR daemonset pod on the failed node try to add/delete the out-of-service taint when an SNR object is created.
The failed node cannot be trusted, so in effect only the self-node-remediation-controller-manager can remediate it. If the manager is running on the failed node, remediation is delayed while the manager is being evicted.
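The effect of the change can be illustrated with a toy model. This is only a sketch of the behavior described in this issue, not the SNR implementation: the `Pod` type, the `taint_actors` and `trusted_actors` functions, and the node names are all hypothetical.

```python
from dataclasses import dataclass

MANAGER = "manager"      # self-node-remediation-controller-manager pod
DAEMONSET = "daemonset"  # SNR daemonset pod (one per node)


@dataclass(frozen=True)
class Pod:
    kind: str  # MANAGER or DAEMONSET
    node: str  # name of the node the pod runs on


def taint_actors(pods, failed_node, after_patch):
    """Return the pods that try to add/delete the out-of-service taint."""
    if not after_patch:
        # Before the patch: the daemonset pod on every node tries.
        return [p for p in pods if p.kind == DAEMONSET]
    # After the patch: only the manager and the failed node's daemonset pod.
    return [
        p for p in pods
        if p.kind == MANAGER or (p.kind == DAEMONSET and p.node == failed_node)
    ]


def trusted_actors(pods, failed_node, after_patch):
    """Keep only the actors that do not run on the failed (untrusted) node."""
    return [
        p for p in taint_actors(pods, failed_node, after_patch)
        if p.node != failed_node
    ]


# Hypothetical three-node cluster; the single manager runs on the node that fails.
nodes = ["node-a", "node-b", "node-c"]
pods = [Pod(DAEMONSET, n) for n in nodes] + [Pod(MANAGER, "node-a")]

before = trusted_actors(pods, "node-a", after_patch=False)
after = trusted_actors(pods, "node-a", after_patch=True)
print(len(before), len(after))
```

In this model, the pre-patch behavior still leaves the daemonset pods on the healthy nodes as trusted actors, while the post-patch behavior leaves none until the manager is running on a healthy node again.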
To avoid this remediation delay, the manager should run with 2 replicas, so that one replica can remain on a healthy node.
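A possible shape for such a configuration is sketched below as a Deployment patch. It assumes the manager is a regular Kubernetes Deployment and borrows the `topologySpreadConstraints` idea from the related FAR and NHC issues to keep the two replicas on different nodes; this issue itself only asks for 2 replicas. The `manager_ha_patch` helper and the `control-plane: controller-manager` label are hypothetical and should be checked against the labels on the real manager pods.

```python
import json


def manager_ha_patch(match_labels, replicas=2):
    """Build a Deployment patch: `replicas` pods spread across distinct nodes."""
    return {
        "spec": {
            "replicas": replicas,
            "template": {
                "spec": {
                    "topologySpreadConstraints": [
                        {
                            # Allow at most one pod of difference between nodes.
                            "maxSkew": 1,
                            # Treat each node (hostname) as its own domain.
                            "topologyKey": "kubernetes.io/hostname",
                            # Refuse to schedule rather than co-locate replicas.
                            "whenUnsatisfiable": "DoNotSchedule",
                            "labelSelector": {"matchLabels": match_labels},
                        }
                    ]
                }
            },
        }
    }


# Hypothetical label selector (assumption, not taken from the SNR manifests).
patch = manager_ha_patch({"control-plane": "controller-manager"})
print(json.dumps(patch, indent=2))
```

The printed JSON has the shape of a merge patch for a Deployment. Whether a manual patch persists depends on how the operator is installed, so the change may need to be made in the operator's own manifests instead.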
causes
- RHWA-366 Investigate update strategy with topologySpreadConstraints (New)
relates to
- RHWA-364 [FAR] Improve HA by using 'topologySpreadConstraints' to enforce strict pod distribution for fence-agents-controller-manager replicas (Review)
- RHWA-365 [NHC] Improve HA by using 'topologySpreadConstraints' to enforce strict pod distribution for node-healthcheck-controller-manager replicas (Review)