
FAR: Remove taints from node when an NHC timed out remediation is deleted

      The Fence Agents Remediation (FAR) Operator CR is not removed when the node is deleted by the Machine Deletion Remediation (MDR) Operator. (RHWA-3)

      Cause: The Node Health Check (NHC) Operator does not remove orphaned custom resources (CRs) for deleted nodes, and the FAR Operator does not allow the removal of orphaned remediations. This can occur during escalating remediation: FAR starts but does not succeed within its allocated time, so the MDR Operator is triggered and remediates by deleting the node.
      Consequence: The FAR remediation CR that belongs to the deleted node is not removed.
      Fix: The NHC Operator triggers CR deletion when the node is deleted. When the FAR Operator detects a remediation that has been deleted and that the NHC Operator has marked as timed out, it allows the removal of this orphaned remediation.
      Result: The NHC Operator deletes orphaned CRs for deleted nodes, and the FAR Operator allows these CRs to be removed once it detects a deleted remediation that the NHC Operator has marked as timed out.
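
The condition described in the Fix can be sketched as a small check. This is only an illustration under stated assumptions, not the FAR Operator's actual code: the `Remediation` class is a simplified stand-in for the CR metadata, the annotation and finalizer names are hypothetical, and the idea that removal is gated by a finalizer is an assumption that the text above does not confirm.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical names, chosen for illustration only.
TIMED_OUT_ANNOTATION = "example.io/nhc-timed-out"
FAR_FINALIZER = "example.io/far-finalizer"


@dataclass
class Remediation:
    """Simplified stand-in for the metadata of a FAR remediation CR."""
    name: str
    deletion_timestamp: Optional[str] = None
    annotations: Dict[str, str] = field(default_factory=dict)
    finalizers: List[str] = field(default_factory=list)


def is_deleted_and_timed_out(cr: Remediation) -> bool:
    """True when the CR is being deleted and NHC has marked it as timed out."""
    return (
        cr.deletion_timestamp is not None
        and TIMED_OUT_ANNOTATION in cr.annotations
    )


def release_if_orphaned(cr: Remediation) -> bool:
    """Drop the (assumed) finalizer so that the CR can be removed.

    Returns True if the finalizer was released, False if the CR was left as is.
    """
    if not is_deleted_and_timed_out(cr):
        return False
    cr.finalizers = [f for f in cr.finalizers if f != FAR_FINALIZER]
    return True
```

A remediation that is neither deleted nor marked as timed out is left untouched, so a remediation that is still in progress is not released early.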
    • Bug Fix

      Similar to the SNR PR, when a timed-out remediation is deleted, the node taints need to be removed as part of the cleanup process so that the node can recover.
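
The taint cleanup described above might look roughly like the following. It is a minimal sketch, not the operator's implementation: taints are modeled as plain dictionaries shaped like entries in a node's `spec.taints`, and the taint key is a hypothetical placeholder.

```python
from typing import Dict, List

# Hypothetical taint key, for illustration only.
REMEDIATION_TAINT_KEY = "example.io/fence-agents-remediation"


def without_remediation_taint(
    taints: List[Dict[str, str]],
    key: str = REMEDIATION_TAINT_KEY,
) -> List[Dict[str, str]]:
    """Return the node's taints with the remediation taint removed.

    Taints added by other components are preserved.
    """
    return [t for t in taints if t.get("key") != key]


node_taints = [
    {"key": REMEDIATION_TAINT_KEY, "effect": "NoExecute"},
    {"key": "node.kubernetes.io/unreachable", "effect": "NoSchedule"},
]
cleaned = without_remediation_taint(node_taints)
```

Filtering by key rather than clearing the whole list keeps unrelated taints in place, so only the taint that the remediation added is lifted during cleanup.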

              mshitrit@redhat.com Michael Shitrit
