Feature Request
Resolution: Unresolved
Major
openshift-4.20
Product / Portfolio Work
1. Proposed title of this feature request
Apply rebootless configuration changes in-place for NodePools with 'Replace' upgrade strategy
2. What is the nature and description of the request?
This is a request to change the behavior of HyperShift NodePools that use the Replace upgrade strategy (spec.management.upgradeType: Replace).
Current Behavior: When a configuration change that affects worker nodes is made to the NodePool or HostedCluster specification (e.g., adding an additionalTrustBundle), NodePools using the Replace strategy trigger a full rolling replacement of every worker node. Each machine is drained, terminated, and recreated, even if the change itself does not require a reboot (e.g., updating the node's CA trust store).
Desired Behavior: The HyperShift control plane should distinguish between changes that require a node replacement (like an OS image change or kubelet version upgrade) and "rebootless" configuration changes (like additionalTrustBundle, labels, or taints).
If a change is identified as "rebootless," it should be applied in-place to the existing worker nodes, bypassing the Replace strategy's drain/recreate cycle. The Replace strategy should only be invoked for changes that fundamentally cannot be applied to a running node.
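The decision described above could be sketched roughly as follows. This is only an illustration of the requested behavior, not existing HyperShift code: the Action type, the Classify function, and the field names in the allowlist are all hypothetical, and the real set of rebootless fields would need to be defined by the HyperShift team. The sketch assumes a conservative default, where any change not explicitly known to be rebootless still falls back to the Replace cycle.

```go
package main

import "fmt"

// Action describes how a NodePool change would be rolled out.
// All names here are hypothetical and exist only for illustration.
type Action string

const (
	NoChange     Action = "NoChange"     // nothing changed, nothing to roll out
	ApplyInPlace Action = "ApplyInPlace" // apply to running nodes, no drain/recreate
	ReplaceNodes Action = "ReplaceNodes" // fall back to the Replace drain/recreate cycle
)

// rebootlessFields is an assumed allowlist of change kinds that can be
// applied to a running node. The keys are illustrative labels taken from
// the examples in this request, not real API field paths.
var rebootlessFields = map[string]bool{
	"additionalTrustBundle": true,
	"nodeLabels":            true,
	"taints":                true,
}

// Classify returns ApplyInPlace only when every changed field is on the
// allowlist. Any unknown field (for example an OS image or kubelet
// version change) conservatively results in ReplaceNodes.
func Classify(changed []string) Action {
	if len(changed) == 0 {
		return NoChange
	}
	for _, field := range changed {
		if !rebootlessFields[field] {
			return ReplaceNodes
		}
	}
	return ApplyInPlace
}

func main() {
	// A trust bundle update alone would be applied in place.
	fmt.Println(Classify([]string{"additionalTrustBundle"}))
	// Mixing it with an OS image change would still replace the nodes.
	fmt.Println(Classify([]string{"additionalTrustBundle", "osImage"}))
}
```

Using an allowlist rather than a denylist means a newly introduced configuration field would keep today's Replace behavior until it is explicitly marked as safe to apply in place.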
3. Why does the customer need this? (List the business requirements here)
The current behavior creates significant operational inefficiencies and instability for large clusters.
Reduce Operational Cost and Time: For large clusters (e.g., HCP on KubeVirt with 100+ worker nodes), replacing every node for a minor change like adding a CA certificate is extremely time-consuming and resource-intensive. A change that takes moments on an InPlace cluster can take hours on a Replace cluster.
Improve Cluster Availability and Stability: A full node replacement cycle involves mass pod drains and rescheduling. This causes unnecessary application disruption and potential downtime for a change that should be non-disruptive.
Enable Operational Agility: Administrators are hesitant to apply simple, necessary configuration updates (like trusting a new internal CA) because they know it will trigger a massive, disruptive cluster event. This change would allow them to perform such tasks quickly and safely.
Provide Logical Consistency: The upgrade strategy should primarily govern upgrades. It should not force a disruptive replacement for a simple configuration update that the system is clearly capable of handling in-place (as demonstrated by the InPlace strategy).
4. List any affected packages or components.
Hosted Control Plane