  1. Red Hat Workload Availability
  2. RHWA-277

SNR CR isn't removed when node is deleted by MDR

      Cause: Node Health Check (NHC) does not remove orphaned CRs for deleted nodes, and SNR does not allow the removal of orphaned remediations. This can occur with escalating remediation, where SNR starts but does not finish in the allocated time, and MDR is then triggered and remediates by deleting the node.
      Consequence: The SNR remediation that belongs to the deleted node is not removed.
      Fix: NHC triggers CR deletion when the node is deleted. When SNR detects a deleted remediation that NHC has marked as timed out, SNR allows the removal of this orphaned remediation.
      Result: NHC deletes orphaned CRs for deleted nodes, and SNR removes these CRs when it detects a deleted remediation that NHC has marked as timed out.

      Rewrite
      Cause: The Node Health Check (NHC) Operator does not remove orphaned custom resources (CRs) for deleted nodes, and the Self Node Remediation (SNR) Operator does not allow the removal of orphaned remediations. This can occur with escalating remediation, where the SNR Operator starts but does not finish in the allocated time, and the Machine Deletion Remediation (MDR) Operator is then triggered and remediates by deleting the node.
      Consequence: The SNR Operator remediation that belongs to the deleted node is not removed.
      Fix: The NHC Operator triggers CR deletion when the node is deleted. When the SNR Operator detects a deleted remediation that the NHC Operator has marked as timed out, the SNR Operator allows the removal of this orphaned remediation.
      Result: The NHC Operator deletes orphaned CRs for deleted nodes, and the SNR Operator removes these CRs when it detects a deleted remediation that the NHC Operator has marked as timed out.
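
The SNR side of the fix can be read as a simple rule: a remediation may be released only when its deletion has been requested and NHC has marked it as timed out. The following is a minimal sketch of that rule, not the operator's code; the `Remediation` class, the `snr_allows_removal` function, and the `TIMED_OUT_MARK` value are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical marker name for this sketch; not taken from the operators.
TIMED_OUT_MARK = "timed-out-by-nhc"


@dataclass
class Remediation:
    """Simplified stand-in for an SNR remediation CR (assumed shape)."""
    node_name: str
    deletion_requested: bool = False         # deletion of the CR was requested
    marks: set = field(default_factory=set)  # markers set by NHC


def snr_allows_removal(rem: Remediation) -> bool:
    """Return True when SNR should stop blocking removal of the remediation."""
    return rem.deletion_requested and TIMED_OUT_MARK in rem.marks


# Orphaned remediation: deletion requested and marked as timed out by NHC.
orphan = Remediation("worker-1", deletion_requested=True, marks={TIMED_OUT_MARK})
# Deletion requested, but NHC has not marked it as timed out.
in_progress = Remediation("worker-2", deletion_requested=True)

print(snr_allows_removal(orphan))       # True
print(snr_allows_removal(in_progress))  # False
```

Requiring both conditions keeps a remediation that is still within its allocated time from being released early.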
    • Bug Fix
    • Proposed
    • Important

      In a use case where escalating remediation is configured with SNR as the first remediator and MDR as the second, the following occurred:

      • On node failure, NHC creates an SNR remediation
      • SNR fails to remediate in the allocated time
      • NHC triggers MDR remediation
      • MDR remediates by deleting the machine, which removes the node and provisions a new one
      • The SNR remediation that belongs to the old node isn't removed.

       This can also apply to other remediators, such as FAR when coupled with MDR.
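
The sequence above can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not the operators' implementation: `RemediationCR`, `run_escalation`, and `nhc_request_orphan_deletion` are hypothetical names, and the replacement node name is invented for the example. It shows the SNR CR outliving its node, then the cleanup step in which NHC requests deletion of every CR whose node no longer exists.

```python
from dataclasses import dataclass


@dataclass
class RemediationCR:
    """Simplified stand-in for a remediation CR (assumed shape)."""
    kind: str                # "SNR" or "MDR"
    node_name: str
    timed_out: bool = False  # set by NHC when the remediator runs out of time
    deleting: bool = False   # deletion of the CR has been requested


def run_escalation(nodes: set, crs: list, failed: str) -> None:
    """Replay the reported sequence for one failed node."""
    snr = RemediationCR("SNR", failed)
    crs.append(snr)                            # NHC creates an SNR remediation
    snr.timed_out = True                       # SNR does not finish in time
    crs.append(RemediationCR("MDR", failed))   # NHC triggers MDR
    nodes.discard(failed)                      # MDR deletes the machine and node
    nodes.add(failed + "-replacement")         # a new node is provisioned


def nhc_request_orphan_deletion(nodes: set, crs: list) -> list:
    """Request deletion of every CR whose node no longer exists."""
    orphans = [cr for cr in crs if cr.node_name not in nodes]
    for cr in orphans:
        cr.deleting = True
    return orphans


nodes = {"worker-1", "worker-2"}
crs = []
run_escalation(nodes, crs, "worker-1")

# Before cleanup, the SNR CR still refers to the deleted node (the reported bug).
print(any(cr.kind == "SNR" and cr.node_name not in nodes for cr in crs))  # True

orphans = nhc_request_orphan_deletion(nodes, crs)
print([(cr.kind, cr.node_name) for cr in orphans])
# [('SNR', 'worker-1'), ('MDR', 'worker-1')]
```

The model treats every CR tied to the deleted node the same way, which matches the note that the problem is not specific to SNR.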

      Link to GitHub ticket.

              frmoreno Francisco Javier Moreno Moreno
              mshitrit@redhat.com Michael Shitrit
              Marc Sluiter