  1. Red Hat Workload Availability
  2. RHWA-369

Finalize Node Remediation Cleanup

      Cause: SNR fails to remove the NoExecute taints it placed on the node during remediation when the remediation timed out, and it fails to remove redundant remediations (remediations with no associated node).
      Consequence: The NoExecute taints remain, so the node never fully returns to a schedulable state, even after the CR has been removed. A stale remediation also remains for a node that no longer exists.
      Fix: When a Timeout CR is deleted and the node exists, SNR removes the "NoExecute" taints from the node in addition to removing the finalizer from the CR. An unnecessary check of the CR deletion timestamp is also removed.
      Result: The NoExecute taints placed by SNR are removed once the CR is deleted, including when the deleted CR is a timeout remediation. When SNR detects a remediation that is both deleted and marked as Timeout by NHC, it removes the finalizer.
    • Feature
    • Proposed

      Fixes two issues:

      • (#253) When a timeout remediation CR is deleted, the operator fails to remove the associated NoExecute taints that were placed on the node during remediation. These taints evict workloads and isolate the unhealthy node. If they stay in place, the node never fully returns to a schedulable state, so the node remediation is incomplete even after the CR has been removed.
      • (#249) SNR should allow removal of redundant remediations, that is, remediations of a node that no longer exists because MDR deleted the node and triggered provisioning of a new one.
        A bug in the code currently prevents SNR from reaching that logic; this PR fixes it. When SNR detects a remediation that is both deleted and marked as Timeout by NHC, it should remove the finalizer.
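
      The cleanup described in the two items above can be sketched roughly as follows. This is a minimal illustration and not the operator's actual code: the `Node`, `Taint`, and `Remediation` types are simplified stand-ins for the real Kubernetes and SNR objects, and the finalizer name, taint key, and function names are hypothetical.

```go
package main

import "fmt"

// Taint, Node and Remediation are simplified, hypothetical stand-ins
// for the real Kubernetes node and SNR remediation CR objects.
type Taint struct{ Key, Effect string }

type Node struct {
	Name   string
	Taints []Taint
}

type Remediation struct {
	NodeName   string
	Deleted    bool // the CR carries a deletion timestamp
	TimedOut   bool // the CR was marked as Timeout by NHC
	Finalizers []string
}

const (
	snrFinalizer = "example.com/snr-finalizer"   // hypothetical name
	snrTaintKey  = "example.com/snr-remediation" // hypothetical key
)

// removeSNRTaints drops only the NoExecute taints that carry the SNR key
// and leaves every other taint on the node untouched.
func removeSNRTaints(node *Node) {
	kept := make([]Taint, 0, len(node.Taints))
	for _, t := range node.Taints {
		if t.Key == snrTaintKey && t.Effect == "NoExecute" {
			continue
		}
		kept = append(kept, t)
	}
	node.Taints = kept
}

// removeFinalizer drops the SNR finalizer so the CR deletion can complete.
func removeFinalizer(r *Remediation) {
	kept := make([]string, 0, len(r.Finalizers))
	for _, f := range r.Finalizers {
		if f != snrFinalizer {
			kept = append(kept, f)
		}
	}
	r.Finalizers = kept
}

// finalizeTimedOutRemediation handles a remediation that is both deleted
// and timed out. A nil node models the redundant-remediation case, where
// the node no longer exists. It reports whether the remediation was handled.
func finalizeTimedOutRemediation(r *Remediation, node *Node) bool {
	if !r.Deleted || !r.TimedOut {
		return false
	}
	if node != nil {
		removeSNRTaints(node) // case #253: the node still exists
	}
	removeFinalizer(r) // both cases: release the CR
	return true
}

func main() {
	node := &Node{Name: "worker-0", Taints: []Taint{
		{Key: snrTaintKey, Effect: "NoExecute"},
		{Key: "example.com/other", Effect: "NoSchedule"},
	}}
	r := &Remediation{NodeName: "worker-0", Deleted: true, TimedOut: true,
		Finalizers: []string{snrFinalizer}}
	handled := finalizeTimedOutRemediation(r, node)
	fmt.Println(handled, len(node.Taints), len(r.Finalizers))
}
```

      In this sketch the finalizer is removed whether or not the node exists, so a remediation for a node that MDR already deleted does not stay stuck, while the taint removal only runs when there is a node to clean up.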

              mshitrit@redhat.com Michael Shitrit