-
Story
-
Resolution: Done
-
Major
-
None
-
False
-
-
False
-
The FAR remediation taint has changed in both key and effect. The effect has changed from "NoExecute" to "NoSchedule", so that before the fence agent runs, the taint only prevents new workloads from being scheduled and does not evict existing ones.
-
Feature
-
Proposed
-
-
According to the FAR documentation, this is the remediation workflow:
- FAR adds NoExecute taint to the failed node => Ensure that any workloads are not executed after rebooting or powering off the failed node, and any stateless pods (that can’t tolerate FAR NoExecute taint) will be evicted immediately
- FAR executes the configured fence agent action on the failed node => Depending on the action (reboot or off), the node is either restarted or powered off. => After the action, there are no workloads in the failed node
- FAR forcefully deletes the pods in the failed node => The scheduler understands that it can schedule the failed pods on a different node
- After the failed node becomes healthy, NHC deletes FenceAgentsRemediation CR, the NoExecute taint in Step 2 is removed, and the node becomes schedulable again
The request here is to investigate whether the taint at point #1 is a good idea. Given the following:
1. FAR triggers when a node is not reachable, or at least this should be the main use case.
2. A NoExecute taint triggers an immediate, normal pod eviction process. If the node is unreachable, this translates to a 6-minute wait, after which the pods are forcefully deleted together with their volume attachments (see also: https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#storage-force-detach-on-timeout).
If the FAR controller is not able to fence the node within those 6 minutes, there is a risk of volume inconsistency (especially for storage classes that do not correctly tolerate the forced deletion of volume attachments).
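The timing concern above can be illustrated with a minimal sketch. The function name and the timing model are hypothetical: it assumes the 6-minute window quoted above starts when the NoExecute taint is applied, which may not match exactly how Kubernetes starts its force-detach timer.

```python
from datetime import datetime, timedelta
from typing import Optional

# Timeout quoted in this ticket; treated here as a fixed assumption.
FORCE_DETACH_TIMEOUT = timedelta(minutes=6)


def force_detach_may_precede_fencing(
    taint_applied_at: datetime,
    fencing_completed_at: Optional[datetime],
    timeout: timedelta = FORCE_DETACH_TIMEOUT,
) -> bool:
    """Return True if the node may still be unfenced when force detach occurs.

    fencing_completed_at is None when the fence agent has not finished
    (or has failed), which is always treated as risky.
    """
    if fencing_completed_at is None:
        return True
    # Risky when fencing took at least as long as the force-detach timeout.
    return fencing_completed_at - taint_applied_at >= timeout
```

In this model, a fence action that finishes 2 minutes after the taint is applied is safe, while one that finishes after 7 minutes (or never) leaves a window in which volumes can be force-detached from a node that may still be running workloads.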
For the case when the user selects the out-of-service taint, I suggest that the workflow should be the following:
1. FAR executes the configured fence agent action on the failed node
2. FAR applies the `out-of-service` taint
3. After the failed node becomes healthy, NHC deletes the FenceAgentsRemediation CR, the `out-of-service` taint applied in Step 2 is removed, and the node becomes schedulable again
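The proposed ordering can be sketched as follows. This is an illustration only, not the FAR controller's code: `fence_node` and `apply_out_of_service_taint` are hypothetical callables, and stopping the workflow when fencing fails is an assumption made for this sketch.

```python
from typing import Callable, List


def remediate_with_out_of_service(
    node: str,
    fence_node: Callable[[str], bool],
    apply_out_of_service_taint: Callable[[str], None],
) -> List[str]:
    """Run the proposed steps in order and return the steps that completed."""
    completed: List[str] = []

    # Step 1: fence first, so the node is known to be rebooted or powered off.
    if not fence_node(node):
        # Assumption: without a successful fence, the taint is never applied.
        completed.append("fence-failed")
        return completed
    completed.append("fenced")

    # Step 2: only now apply the out-of-service taint.
    apply_out_of_service_taint(node)
    completed.append("out-of-service-tainted")
    return completed
```

The point of this ordering is that the taint which allows pods and volume attachments to be released is applied only once the node is known to be fenced, so the outcome no longer depends on fencing finishing before a timeout.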
Link to the detailed discussion thread: https://redhat-internal.slack.com/archives/CR8HZL4P3/p1757788018014019