  1. Red Hat Workload Availability
  2. RHWA-366

Investigate update strategy with topologySpreadConstraints


      Cause: The RHWA operators used to have inconsistent deployment configurations with regard to replicas, node affinity, and update strategy.
      Consequence: Potentially slower remediation if the operator pod was running on an unhealthy node.
      Fix: Use 2 replicas for NHC, FAR, and SNR, use topologySpreadConstraints to prevent the replicas from running on the same node, and use updateStrategy to avoid potential update locks in some edge cases.
      Result: Reduced chance of slower remediation.

      Rewrite
      ======
      Cause: The RHWA Operators had inconsistent deployment configurations with regard to replicas, node affinity, and the update strategy.
      Consequence: If the Operator pod was running on an unhealthy node, remediation could be slower.
      Fix: With this release, Node Health Check (NHC), Fence Agents Remediation (FAR), and Self Node Remediation (SNR) use two replicas, use the 'topologySpreadConstraints' parameter to prevent the replicas from running on the same node, and use the 'updateStrategy' parameter to avoid potential update locks in some edge cases.
      Result: The chance of slower remediation is reduced.
    • Enhancement
    • Proposed

      We configured NHC, FAR, and SNR to use topologySpreadConstraints to spread replicas across nodes. This might introduce an issue with updates in some corner cases; see the comment on the SNR PR: https://github.com/medik8s/self-node-remediation/pull/180#discussion_r2419792014

      Investigate if this is a real issue, and update all 3 operators if needed.
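
For orientation, the combination discussed here (two replicas, a spread constraint across nodes, and an explicit update strategy) could look roughly like the Deployment fragment below. This is only a sketch under assumptions: the names, labels, image, and every numeric value are hypothetical and are not taken from the NHC, FAR, or SNR manifests. Note also that in a Deployment the update strategy is set under `spec.strategy`; the ticket refers to it as updateStrategy.

```yaml
# Illustrative fragment only; names and values are assumptions,
# not copied from the NHC, FAR, or SNR operator manifests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-operator-controller-manager    # hypothetical name
spec:
  replicas: 2                                   # two replicas, as described in the ticket
  selector:
    matchLabels:
      app: example-operator                     # hypothetical label
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1                         # assumed value: one old pod may be removed before its replacement is ready
      maxSurge: 0                               # assumed value: no pods beyond the desired replica count during a rollout
  template:
    metadata:
      labels:
        app: example-operator
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                            # assumed value
          topologyKey: kubernetes.io/hostname   # spread per node
          whenUnsatisfiable: DoNotSchedule      # assumed; ScheduleAnyway is the softer alternative
          labelSelector:
            matchLabels:
              app: example-operator
      containers:
        - name: manager
          image: example.com/example-operator:latest   # placeholder image
```

With values like these, a rollout replaces pods in place instead of first scheduling an additional pod, which is one way a hard spread constraint and a rolling update might be kept from blocking each other. Whether this matches the corner case raised in the PR comment is exactly what this issue is meant to establish.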

              slintes Marc Sluiter
