  1. Red Hat Workload Availability
  2. RHWA-451

Remove RHWA Known Taints on Failed Remediation


      Consider an escalating-remediation setup where FAR or SNR is the first remediator and a custom (user-provided) remediator is the second, and the following sequence occurs:

      • On node failure, NHC creates a FAR/SNR remediation.
      • FAR/SNR fails to remediate the node within the allotted time.
      • NHC times out the first remediator (FAR/SNR).
        SNR removes its finalizer but not the taints it added; FAR only stops the running fence agent.
      • NHC triggers the custom remediation.
      • The second remediator attempts to remediate the node and fails.
      • NHC times out the second remediator.
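
      The sequence above can be sketched as a small simulation. This is only an illustrative model, not NHC code: the `Remediator` class, the `escalate` function, and the 300-second timeouts are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Remediator:
    """Hypothetical model of one step in an escalation chain."""
    name: str
    timeout_s: int                      # assumed per-remediator timeout
    succeeds: bool                      # whether the node becomes healthy in time
    taints_added: list = field(default_factory=list)
    cleans_up_on_timeout: bool = False  # per the issue, FAR/SNR do not clean up


def escalate(remediators):
    """Try each remediator in order; return (healthy, leftover_taints, log)."""
    node_taints = []
    log = []
    for r in remediators:
        log.append(f"NHC creates {r.name} remediation")
        node_taints.extend(r.taints_added)
        if r.succeeds:
            # A successful remediation removes the taints it added.
            for taint in r.taints_added:
                node_taints.remove(taint)
            log.append(f"{r.name} remediated the node")
            return True, node_taints, log
        log.append(f"NHC times out {r.name} after {r.timeout_s}s")
        if r.cleans_up_on_timeout:
            for taint in r.taints_added:
                node_taints.remove(taint)
    return False, node_taints, log


chain = [
    Remediator("SNR", 300, succeeds=False,
               taints_added=["medik8s.io/remediation"]),
    Remediator("custom", 300, succeeds=False),
]
healthy, leftover, log = escalate(chain)
print(healthy, leftover)
```

      In this model the chain ends with the node still unhealthy and the SNR taint still present, which is the situation this issue describes.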

      FAR/SNR added a remediation taint (e.g., medik8s.io/fence-agents-remediation or medik8s.io/remediation) and, with the OutOfServiceTaint remediationStrategy, an out-of-service taint, but these taints were never cleaned up.
      Leaving these taints in place after all remediation attempts have failed degrades the user experience (the taints use the NoExecute effect, so the node has limited pod availability) and can make it harder to continue troubleshooting the failed node.
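
      To illustrate why leftover NoExecute taints limit pod availability, here is a minimal sketch of Kubernetes taint/toleration matching. It is a simplified model that ignores `tolerationSeconds`. The helper names `tolerates` and `is_evicted` are hypothetical, and the out-of-service key and value shown are the upstream Kubernetes convention, assumed here rather than taken from the issue.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if not key:
        # An empty key is only valid with Exists and matches every taint key.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def is_evicted(pod_tolerations, node_taints):
    """A pod is evicted if any NoExecute taint is not tolerated."""
    return any(
        taint["effect"] == "NoExecute"
        and not any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
    )


# Taints left behind after a failed SNR remediation with OutOfServiceTaint.
leftover_taints = [
    {"key": "medik8s.io/remediation", "effect": "NoExecute"},
    # Assumed upstream key/value for the out-of-service taint.
    {"key": "node.kubernetes.io/out-of-service", "value": "nodeshutdown",
     "effect": "NoExecute"},
]

print(is_evicted([], leftover_taints))                        # ordinary pod
print(is_evicted([{"operator": "Exists"}], leftover_taints))  # tolerates all
```

      Only pods that tolerate every leftover taint can keep running on the node, so ordinary workloads stay off it until someone removes the taints manually.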
      NHC could handle these known taints on failed escalation remediation at one of two points:

      1. After each remediation: if two remediators both add and remove taints, this can lead to race conditions unless NHC pauses for some time between remediations to reduce that chance.
      2. After the last remediator times out: NHC goes over the known and possible taints (and other resources, if any) and deletes them, then emits a new event ("Failed Remediation") and sets an annotation on the unhealthy node (with timestamp/attempts). The event can easily be tracked by Prometheus or by watching for events, and the annotation "signals" the earlier failed NHC attempt.
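
      Option 2 could look roughly like the sketch below, which operates on a plain dict shaped like a Node object instead of calling the Kubernetes API. Everything beyond the two medik8s taint keys is an assumption made for illustration: the function name, the annotation key `example.medik8s.io/failed-remediation`, the annotation value format, the event reason `FailedRemediation`, and the out-of-service key (the upstream Kubernetes one).

```python
from datetime import datetime, timezone

# Taint keys NHC would treat as "known"; the last one is assumed.
KNOWN_TAINT_KEYS = {
    "medik8s.io/fence-agents-remediation",
    "medik8s.io/remediation",
    "node.kubernetes.io/out-of-service",
}

# Hypothetical annotation key, not an existing NHC annotation.
FAILED_ANNOTATION = "example.medik8s.io/failed-remediation"


def cleanup_after_final_timeout(node, attempts, now=None):
    """Remove known taints, annotate the node, and return an event dict."""
    now = now or datetime.now(timezone.utc)
    taints = node["spec"].get("taints", [])
    removed = [t for t in taints if t["key"] in KNOWN_TAINT_KEYS]
    # Taints that NHC does not own (e.g. user-added ones) are left alone.
    node["spec"]["taints"] = [t for t in taints if t["key"] not in KNOWN_TAINT_KEYS]

    annotations = node["metadata"].setdefault("annotations", {})
    annotations[FAILED_ANNOTATION] = f"{now.isoformat()};attempts={attempts}"

    return {
        "type": "Warning",
        "reason": "FailedRemediation",  # assumed reason string
        "message": (
            f"all {attempts} remediators timed out on {node['metadata']['name']}; "
            f"removed {len(removed)} known taint(s)"
        ),
    }


node = {
    "metadata": {"name": "worker-1"},
    "spec": {"taints": [
        {"key": "medik8s.io/remediation", "effect": "NoExecute"},
        {"key": "node.kubernetes.io/out-of-service", "effect": "NoExecute"},
        {"key": "example.com/dedicated", "effect": "NoSchedule"},
    ]},
}
event = cleanup_after_final_timeout(node, attempts=2)
print(event["message"])
```

      Filtering by an explicit allow-list of keys keeps the cleanup from touching taints that other controllers or users added, which also limits the race-condition risk mentioned in option 1.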

       

      If the second remediator is again FAR or SNR, it seems acceptable to keep the taints as an indication that the node is still unhealthy. This comes at the cost of the taints' effect, which the user may wish to avoid after a failed NHC attempt.

       

      Related Slack discussion - https://redhat-internal.slack.com/archives/C03M5GKJNBA/p1736162948282599 

              Assignee: Unassigned
              Reporter: Or Raz (oraz@redhat.com)