Red Hat OpenStack Services on OpenShift / OSPRH-14880

Handling non-graceful worker node shutdowns: SNR/NHC behavior and PVC cleanup on LVMS with Galera & RabbitMQ


    • SNR/NHC: Managing PVCs on worker failures with Galera & RabbitMQ
    • rhos-ops-platform-services-pidone
    • 0% To Do, 0% In Progress, 100% Done
    • Moderate

      Goal:

      The goal of this Epic is to investigate and define how Self Node Remediation (SNR) and Node Health Check (NHC) should handle non-graceful worker node shutdowns in an environment running Galera & RabbitMQ operators.

      Specifically, we aim to determine whether SNR should perform automatic cleanup of Persistent Volume Claims (PVCs) when a node fails, and how this interacts with Logical Volume Manager Storage (LVMS).

      By addressing this, we can ensure better pod rescheduling behavior, prevent storage inconsistencies, and improve cluster resilience when handling node failures.

      Acceptance Criteria:

      • Define the current behavior of SNR/NHC when a worker node with LVMS storage fails non-gracefully.
      • Assess whether automatic PVC deletion should be handled within SNR, NHC, or another component.
      • Identify potential risks of deleting PVCs (e.g., volume attachments, KCM detach issues).
      • Evaluate if escalation remediation could be leveraged for delayed or conditional cleanup.
      • Determine if we should provide an optional user-configurable setting for PVC deletion.
      • Ensure that any new behavior aligns with Galera & RabbitMQ operator expectations regarding storage recovery and pod rescheduling.
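One way to picture the optional, user-configurable cleanup with a delay is as a small decision function. The following is a minimal sketch under stated assumptions, not an existing SNR or NHC interface: the names CleanupPolicy, PvcContext, and should_delete_pvc, the opt-in flag, and the 10-minute grace period are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CleanupPolicy:
    # Hypothetical user-facing settings; not part of any existing operator API.
    enabled: bool = False                            # opt-in, off by default
    grace_period: timedelta = timedelta(minutes=10)  # assumed delay before cleanup


@dataclass(frozen=True)
class PvcContext:
    node_ready: bool            # is the node backing the volume currently Ready?
    node_fenced: bool           # has remediation confirmed the node is down?
    unhealthy_since: datetime   # when the node was first observed unhealthy
    volume_is_node_local: bool  # True for volumes tied to one node (LVMS-style)


def should_delete_pvc(policy: CleanupPolicy, ctx: PvcContext, now: datetime) -> bool:
    """Return True only when every safety condition for cleanup holds."""
    if not policy.enabled:
        return False  # cleanup is strictly opt-in
    if ctx.node_ready:
        return False  # node recovered; the data is reachable again
    if not ctx.node_fenced:
        return False  # node may still be writing to the volume
    if not ctx.volume_is_node_local:
        return False  # a re-attachable volume does not need deletion
    # Conditional, delayed cleanup: wait out the grace period first.
    return now - ctx.unhealthy_since >= policy.grace_period


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    ctx = PvcContext(
        node_ready=False,
        node_fenced=True,
        unhealthy_since=now - timedelta(minutes=15),
        volume_is_node_local=True,
    )
    print(should_delete_pvc(CleanupPolicy(enabled=True), ctx, now))
```

Keeping the decision separate from whichever component acts on it (SNR, NHC, or an escalation step) leaves the ownership question open while making the conditions explicit and testable.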

      Open Questions:

      • How does SNR currently handle volume attachments, and does this interfere with KCM operations?
      • Should NHC be responsible for triggering storage cleanup, or should it be offloaded to an external process?
      • Is escalation remediation a viable alternative for handling cleanup, and how should it be implemented?
      • Is it expected that applying the out-of-service taint (node.kubernetes.io/out-of-service) does not trigger the detachment or deletion of PVCs?
      • Is it expected that the ResourceDeletion strategy does not handle PVC detachment or deletion?
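For reference when discussing the out-of-service question, the sketch below shows what the upstream Kubernetes taint looks like on a Node object. The key node.kubernetes.io/out-of-service, the value nodeshutdown, and the NoExecute effect follow the Kubernetes documentation on non-graceful node shutdown. The helper with_out_of_service_taint is a hypothetical name, and it operates on a plain dict rather than calling a cluster API. Upstream describes the taint in terms of pod force-deletion and volume detach, not PVC deletion; how that applies to node-local LVMS volumes is what the questions above ask.

```python
import copy

# Taint documented upstream for non-graceful node shutdown handling.
OUT_OF_SERVICE_TAINT = {
    "key": "node.kubernetes.io/out-of-service",
    "value": "nodeshutdown",
    "effect": "NoExecute",
}


def with_out_of_service_taint(node: dict) -> dict:
    """Return a copy of a Node manifest (as a dict) carrying the taint exactly once.

    Hypothetical helper for illustration; the input dict is left unmodified.
    """
    patched = copy.deepcopy(node)
    spec = patched.setdefault("spec", {})
    taints = list(spec.get("taints") or [])  # tolerate a missing or null list
    if not any(t.get("key") == OUT_OF_SERVICE_TAINT["key"] for t in taints):
        taints.append(dict(OUT_OF_SERVICE_TAINT))
    spec["taints"] = taints
    return patched


if __name__ == "__main__":
    node = {"metadata": {"name": "worker-0"}}  # placeholder node name
    print(with_out_of_service_taint(node)["spec"]["taints"])
```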

              rhn-support-aromito Antonio Romito
              rhos-dfg-pidone