Type: Story
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: Fence Agents Remediation
Labels: None
Blocked: False
Blocked Reason: None
Ready: False

According to the FAR documentation, this is the remediation workflow:

1. FAR adds a NoExecute taint to the failed node => This ensures that no workloads are executed after the failed node is rebooted or powered off, and any stateless pods (that can’t tolerate the FAR NoExecute taint) are evicted immediately.
2. FAR executes the configured fence agent action on the failed node => Depending on the action (reboot or off), the node is either restarted or powered off. After the action, there are no workloads on the failed node.
3. FAR forcefully deletes the pods on the failed node => The scheduler understands that it can schedule the failed pods on a different node.
4. After the failed node becomes healthy, NHC deletes the FenceAgentsRemediation CR, the NoExecute taint added in Step 1 is removed, and the node becomes schedulable again.
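
The eviction in the first step follows from Kubernetes toleration matching: a pod stays on a node with a NoExecute taint only if one of its tolerations matches that taint. A minimal Python sketch of that rule, for illustration only. The key `example.com/far` is a placeholder and not FAR's actual taint key, the function names are hypothetical, and `tolerationSeconds` is ignored.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified model of Kubernetes toleration matching for a single taint."""
    # A toleration with no effect set matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if operator == "Exists":
        # An empty key combined with Exists tolerates every taint.
        return key == "" or key == taint["key"]
    # Equal: both key and value must match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def evicted_by(pod_tolerations: list, taint: dict) -> bool:
    """A NoExecute taint evicts pods that have no matching toleration."""
    return taint["effect"] == "NoExecute" and not any(
        tolerates(t, taint) for t in pod_tolerations
    )


# Placeholder taint for illustration; not FAR's real key.
far_taint = {"key": "example.com/far", "value": "", "effect": "NoExecute"}
```

A pod with no tolerations is evicted by `far_taint`, while a pod carrying a blanket `{"operator": "Exists"}` toleration is not.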

The request here is to investigate whether the taint in step 1 is a good idea, given the following:
1. FAR triggers when a node is not reachable, or at least this should be the main use case.
2. A NoExecute taint immediately triggers the normal pod eviction process. If the node is unreachable, this translates to a 6-minute wait, after which the pods are forcefully deleted along with their volume attachments (see also here: https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#storage-force-detach-on-timeout).

If the FAR controller is not able to fence the node within those 6 minutes, there is a risk of volume inconsistency (especially for storage classes that do not correctly tolerate forced deletion of volume attachments).
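
The race described above can be written as a simple timing check. This is a sketch under one assumption taken from the description: the 6-minute timer effectively starts when the NoExecute taint is applied. The function and parameter names are illustrative, not part of FAR.

```python
FORCE_DETACH_TIMEOUT_S = 6 * 60  # the 6-minute wait described above


def fencing_beats_force_detach(
    taint_applied_at_s: float,
    fenced_at_s: float,
    timeout_s: float = FORCE_DETACH_TIMEOUT_S,
) -> bool:
    """Return True if the node is confirmed fenced before the force-detach deadline.

    Assumption: the deadline is measured from the moment the taint is applied.
    """
    return fenced_at_s < taint_applied_at_s + timeout_s
```

If fencing completes 120 seconds after the taint, the check passes; at 400 seconds it fails, which is the window in which volumes could be force-detached from a node that may still be running.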

When the user selects the out-of-service taint, I suggest the following workflow:

1. FAR executes the configured fence agent action on the failed node
2. FAR applies the `out-of-service` taint
3. After the failed node becomes healthy, NHC deletes the FenceAgentsRemediation CR, the `out-of-service` taint from Step 2 is removed, and the node becomes schedulable again
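
The ordering suggested above (fence first, taint only afterwards) could be sketched as follows. This is not FAR's implementation: `remediate`, `fence`, and `apply_taint` are hypothetical names, and the two callables stand in for whatever actually runs the fence agent and patches the node. The taint key, value, and effect are the ones shown in the Kubernetes non-graceful node shutdown documentation.

```python
from typing import Callable

# Taint described in the Kubernetes node-shutdown documentation.
OUT_OF_SERVICE_TAINT = {
    "key": "node.kubernetes.io/out-of-service",
    "value": "nodeshutdown",
    "effect": "NoExecute",
}


def remediate(
    node: str,
    fence: Callable[[str], bool],
    apply_taint: Callable[[str, dict], None],
) -> bool:
    """Apply the out-of-service taint only after fencing has succeeded."""
    # Step 1: run the fence agent action (hypothetical callable).
    if not fence(node):
        # Fencing failed: do not taint, so no volumes are force-detached
        # from a node that may still be running workloads.
        return False
    # Step 2: the node is known to be down, so the taint is safe to apply.
    apply_taint(node, OUT_OF_SERVICE_TAINT)
    return True
```

The point of the guard is that the taint, and the volume detachment it triggers, never precedes a confirmed fencing result.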

Link to the detailed discussion thread: https://redhat-internal.slack.com/archives/CR8HZL4P3/p1757788018014019

Assignee: Unassigned
Reporter: Raffaele Spazzoli
Votes: 0
Watchers: 3
Created: 2025/09/15 3:18 PM
Updated: 2025/09/17 7:03 AM
