  1. Red Hat Workload Availability
  2. RHWA-311

FAR should not taint the nodes as its first step


    • Type: Story
    • Resolution: Unresolved
    • Priority: Major

      From FAR documentation this is the remediation workflow:

      1. FAR adds a NoExecute taint to the failed node => This ensures that no workloads run on the node after it is rebooted or powered off, and any stateless pods (those that can't tolerate the FAR NoExecute taint) are evicted immediately
      2. FAR executes the configured fence agent action on the failed node => Depending on the action (reboot or off), the node is either restarted or powered off. => After the action, there are no workloads running on the failed node
      3. FAR forcefully deletes the pods on the failed node => The scheduler understands that it can schedule the failed pods on a different node
      4. After the failed node becomes healthy, NHC deletes the FenceAgentsRemediation CR, the NoExecute taint added in Step 1 is removed, and the node becomes schedulable again

       

      The request here is to investigate whether applying the taint in step 1 is a good idea, given the following:
      1. FAR triggers when a node is not reachable, or at least this should be the main use case.
      2. A NoExecute taint triggers an immediate, normal pod eviction process. If the node is unreachable, this translates to a 6-minute wait, after which the pods are forcefully deleted along with their volume attachments (see also: https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#storage-force-detach-on-timeout).

      If the FAR controller is not able to fence the node within those 6 minutes, there is a risk of volume inconsistency (especially for storage classes that do not correctly tolerate forced deletion of volume attachments).
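The timing concern can be illustrated with a small sketch. It takes the 6-minute figure cited above as the force-detach timeout and treats the fencing duration as a single number. The function name and parameters are hypothetical and only model the race described in this issue; they are not FAR code.

```python
# Hypothetical model of the race described in this issue; not FAR code.
FORCE_DETACH_TIMEOUT_S = 6 * 60  # the 6-minute wait cited in this issue


def volume_at_risk(fence_duration_s: float, taint_first: bool,
                   timeout_s: float = FORCE_DETACH_TIMEOUT_S) -> bool:
    """Return True if volumes may be force-detached before the node is fenced.

    taint_first=True models the current workflow: the eviction clock starts
    when the NoExecute taint is applied, before fencing begins.
    taint_first=False models tainting only after fencing has completed.
    """
    if not taint_first:
        # The node is already fenced when eviction starts, so there is no race.
        return False
    # Fencing that outlasts the timeout leaves a window where the node may
    # still be writing to a volume that has been force-detached.
    return fence_duration_s > timeout_s


if __name__ == "__main__":
    for duration in (120, 500):
        print(duration, volume_at_risk(duration, taint_first=True))
```

Under these assumptions, only the taint-first ordering can lose the race, and only when fencing takes longer than the timeout.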

       

      For the case when the user selects the out-of-service taint, I suggest that the workflow should be the following:

      1. FAR executes the configured fence agent action on the failed node
      2. FAR applies the `out-of-service` taint
      3. After the failed node becomes healthy, NHC deletes the FenceAgentsRemediation CR, the `out-of-service` taint applied in Step 2 is removed, and the node becomes schedulable again
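The proposed ordering can be sketched as follows. This is a minimal in-memory model, not the FAR controller: `Node`, `fence`, `remediate`, and `cleanup` are hypothetical names introduced for illustration. The taint key is the upstream Kubernetes `node.kubernetes.io/out-of-service` key, with the value and effect shown in the non-graceful node shutdown documentation.

```python
from dataclasses import dataclass, field

# Upstream Kubernetes out-of-service taint (key, value, and effect as shown
# in the non-graceful node shutdown documentation).
OUT_OF_SERVICE_TAINT = {
    "key": "node.kubernetes.io/out-of-service",
    "value": "nodeshutdown",
    "effect": "NoExecute",
}


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a cluster node."""
    name: str
    fenced: bool = False
    taints: list = field(default_factory=list)


def fence(node: Node) -> bool:
    """Hypothetical stand-in for running the configured fence agent."""
    node.fenced = True
    return node.fenced


def remediate(node: Node, fence_fn=fence) -> bool:
    """Step 1: fence the node. Step 2: taint it only if fencing succeeded."""
    if not fence_fn(node):
        # Fencing failed: leave the node untainted so no eviction or
        # force-detach clock starts against a node that may still be running.
        return False
    node.taints.append(dict(OUT_OF_SERVICE_TAINT))
    return True


def cleanup(node: Node) -> None:
    """Step 3: remove the out-of-service taint once the node is healthy."""
    node.taints = [
        t for t in node.taints if t["key"] != OUT_OF_SERVICE_TAINT["key"]
    ]
```

The point of the ordering is that the taint is never present on a node that has not been fenced, so pod deletion and volume detachment cannot race with a node that is still running.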

       

       

      link to in-detail discussion threads: https://redhat-internal.slack.com/archives/CR8HZL4P3/p1757788018014019 

              Assignee: Unassigned
              Reporter: Raffaele Spazzoli (rhn-gps-rspazzol)