From the FAR documentation, this is the remediation workflow:
- FAR adds a NoExecute taint to the failed node => This ensures that no new workloads run on the node once it is rebooted or powered off, and any stateless pods (those that can't tolerate the FAR NoExecute taint) are evicted immediately (see the sketch after this list)
- FAR executes the configured fence agent action on the failed node => Depending on the action (reboot or off), the node is either restarted or powered off => After the action, no workloads remain on the failed node
- FAR forcefully deletes the pods on the failed node => The scheduler understands that it can schedule those pods on a different node
- After the failed node becomes healthy, NHC deletes the FenceAgentsRemediation CR, the NoExecute taint added earlier is removed, and the node becomes schedulable again
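For illustration, a minimal sketch of the kind of taint added in the first step, using client-go. The taint key `medik8s.io/fence-agents-remediation` is an assumption for illustration only and may not match the key FAR actually uses:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// addRemediationTaint adds a NoExecute taint to the given node, roughly as
// described in the first step of the documented workflow.
func addRemediationTaint(ctx context.Context, c kubernetes.Interface, nodeName string) error {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	taint := corev1.Taint{
		Key:    "medik8s.io/fence-agents-remediation", // assumption: illustrative key only
		Effect: corev1.TaintEffectNoExecute,
	}

	// Skip the update if the taint is already present.
	for _, t := range node.Spec.Taints {
		if t.Key == taint.Key && t.Effect == taint.Effect {
			return nil
		}
	}

	node.Spec.Taints = append(node.Spec.Taints, taint)
	_, err = c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := addRemediationTaint(context.Background(), client, "worker-1"); err != nil {
		panic(err)
	}
	fmt.Println("taint applied")
}
```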
The request here is to investigate whether the taint at point #1 is a good idea, given the following:
1. FAR triggers when a node is not reachable, or at least this should be the main use case.
2. A NoExecute taint triggers an immediate and normal pod eviction process (see the toleration sketch after this list). If the node is unreachable, this translates to a 6-minute wait, after which the pods are forcefully deleted together with their volume attachments (see also: https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#storage-force-detach-on-timeout).
If the FAR controller is not able to fence the node within those 6 minutes, there is a risk of volume inconsistency (especially for storage classes that do not correctly tolerate the forced deletion of volume attachments).
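To make the eviction mechanics in point #2 concrete, here is a sketch of the kind of toleration a pod would need in order not to be evicted immediately by a NoExecute taint; pods without such a toleration are evicted right away, and `tolerationSeconds` only delays the eviction. The taint key is the same illustrative one used above, and the 120-second value is arbitrary:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Pods with no toleration for the node's NoExecute taint are evicted
	// immediately. A toleration like this keeps the pod bound for
	// TolerationSeconds before the eviction kicks in.
	seconds := int64(120) // arbitrary value, for illustration only
	tol := corev1.Toleration{
		Key:               "medik8s.io/fence-agents-remediation", // assumption: illustrative key, see above
		Operator:          corev1.TolerationOpExists,
		Effect:            corev1.TaintEffectNoExecute,
		TolerationSeconds: &seconds,
	}
	fmt.Printf("example toleration: %+v\n", tol)
}
```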
For the case where the user selects the `out-of-service` taint, I suggest the following workflow:
1. FAR executes the configured fence agent action on the failed node
2. FAR applies the `out-of-service` taint (see the sketch after this list)
3. After the failed node becomes healthy, NHC deletes the FenceAgentsRemediation CR, the `out-of-service` taint applied in Step 2 is removed, and the node becomes schedulable again
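A minimal sketch of step 2 of this proposal, assuming client-go and the upstream `node.kubernetes.io/out-of-service` taint from the Kubernetes non-graceful node shutdown feature (value `nodeshutdown`, effect NoExecute). The function name and its exact placement in FAR are hypothetical:

```go
package remediation

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// applyOutOfServiceTaint marks the node as out of service only *after* the
// fence agent action has completed, so pod deletion and volume detach happen
// once the node is known to be powered off or rebooting.
func applyOutOfServiceTaint(ctx context.Context, c kubernetes.Interface, nodeName string) error {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		// Upstream taint handled by the non-graceful node shutdown feature.
		Key:    "node.kubernetes.io/out-of-service",
		Value:  "nodeshutdown",
		Effect: corev1.TaintEffectNoExecute,
	})
	_, err = c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```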
Link to the detailed discussion thread: https://redhat-internal.slack.com/archives/CR8HZL4P3/p1757788018014019