OpenShift Request For Enhancement: RFE-8472

[RFE] Apply rebootless configuration changes in-place for NodePools with 'Replace' upgrade strategy


      1. Proposed title of this feature request

      Apply rebootless configuration changes in-place for NodePools with 'Replace' upgrade strategy

      2. What is the nature and description of the request?

      This is a request to change the behavior of HyperShift NodePools that use the Replace upgrade strategy (spec.management.upgradeType: Replace).

      Current Behavior: When any change is made to the NodePool or HostedCluster specification (e.g., adding an additionalTrustBundle), NodePools using the Replace strategy will trigger a full, rolling replacement of every worker node. This involves draining, terminating, and recreating each machine, even if the change itself does not require a reboot (e.g., updating the node's CA trust store).

      Desired Behavior: The HyperShift control plane should distinguish between changes that require a node replacement (like an OS image change or kubelet version upgrade) and "rebootless" configuration changes (like additionalTrustBundle, labels, or taints).

      If a change is identified as "rebootless," it should be applied in-place to the existing worker nodes, bypassing the Replace strategy's drain/recreate cycle. The Replace strategy should only be invoked for changes that fundamentally cannot be applied to a running node.
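
      The desired behavior amounts to a classification step that runs before the rollout strategy is chosen. A minimal sketch follows, assuming a hypothetical allow-list of rebootless fields; the names REBOOTLESS_FIELDS, RolloutAction, classify_change, and the "osImage" field label are illustrative only and are not part of the HyperShift API. The allow-list entries are taken from the examples in this request.

```python
from enum import Enum


class RolloutAction(Enum):
    IN_PLACE = "in-place"
    REPLACE = "replace"


# Hypothetical allow-list, built from the examples named in this request.
# A real implementation would need to derive this from what the node
# configuration tooling can actually apply without a reboot.
REBOOTLESS_FIELDS = frozenset({"additionalTrustBundle", "labels", "taints"})


def classify_change(changed_fields):
    """Return IN_PLACE only if every changed field is on the allow-list.

    Any field outside the allow-list (for example an OS image change)
    falls back to REPLACE, so unknown changes stay on the safe path.
    """
    changed = set(changed_fields)
    if not changed:
        raise ValueError("no changed fields to classify")
    if changed <= REBOOTLESS_FIELDS:
        return RolloutAction.IN_PLACE
    return RolloutAction.REPLACE


if __name__ == "__main__":
    # A trust bundle update alone qualifies for in-place application.
    print(classify_change(["additionalTrustBundle"]).value)
    # Mixing it with an OS image change (hypothetical field name) does not.
    print(classify_change(["additionalTrustBundle", "osImage"]).value)
```

      The sketch defaults to replacement whenever a change set contains anything unrecognized, which matches the request that Replace remain the path for changes that cannot be applied to a running node.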

      3. Why does the customer need this? (List the business requirements here)
      The current behavior creates significant operational inefficiencies and instability for large clusters.

      Reduce Operational Cost and Time: For large clusters (e.g., HCP on KubeVirt with 100+ worker nodes), replacing every node for a minor change like adding a CA certificate is extremely time-consuming and resource-intensive. A change that takes moments on a NodePool using the InPlace strategy can take hours on one using Replace.

      Improve Cluster Availability and Stability: A full node replacement cycle involves mass pod drains and rescheduling. This causes unnecessary application disruption and potential downtime for a change that should be non-disruptive.

      Enable Operational Agility: Administrators are hesitant to apply simple, necessary configuration updates (like trusting a new internal CA) because they know it will trigger a massive, disruptive cluster event. This change would allow them to perform such tasks quickly and safely.

      Provide Logical Consistency: The upgrade strategy should primarily govern upgrades. It should not force a disruptive replacement for a simple configuration update that the system is clearly capable of handling in-place (as demonstrated by the InPlace strategy).

      4. List any affected packages or components.
      Hosted Control Plane

              racedoro@redhat.com Ramon Acedo
              rhn-support-dpateriy Divyam Pateriya