Feature
Resolution: Unresolved
Priority: Major
BU Product Work
100% To Do, 0% In Progress, 0% Done
Feature Overview (aka. Goal Summary)
Rolling out new versions of the HyperShift Operator or of Hosted Control Plane components, such as HyperShift's Control Plane Operator, will no longer carry the possibility of triggering a Node rollout that can affect customer workloads running on those nodes.
Goals (aka. expected user outcomes)
The exhaustive list of causes for customer NodePool rollouts will be:
- Direct scaling of the NodePool up or down by the customer
- A customer change to the Hosted Cluster or NodePool configuration that is documented to incur a rollout
Customers will have visibility into pending rollouts so that they can trigger a rollout of their affected NodePools at their earliest convenience
Requirements (aka. Acceptance Criteria):
- Observability:
- It must be possible to account for all NodePools with pending rollouts
- It must be possible to identify all the Hosted Clusters that have NodePools with pending rollouts
- It must be possible for a customer to see that a NodePool has a pending rollout (see the condition sketch after this list)
- Kubernetes expectations on resource reconciliation must be upheld
- Queued rollouts must survive HyperShift restarts
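As one illustration of how these observability requirements could be met, the sketch below models a pending rollout as a standard Kubernetes status condition on the NodePool. The condition type "PendingRollout", its reason, and its message format are assumptions for this example, not a committed HyperShift API; a true condition of this kind would also make the first two requirements addressable by listing NodePools and filtering on it.

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical condition: the type, reason, and message format are
	// invented for illustration and are not part of the shipped HyperShift API.
	pendingRollout := metav1.Condition{
		Type:               "PendingRollout",
		Status:             metav1.ConditionTrue,
		Reason:             "ConfigHashMismatch",
		Message:            "NodePool targets config hash abc123; nodes currently run def456",
		LastTransitionTime: metav1.NewTime(time.Now()),
		ObservedGeneration: 7,
	}
	fmt.Printf("%+v\n", pendingRollout)
}
```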
Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it has been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out of scope for a given release, provide the OCPSTRAT for the future to-be-supported configuration as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Managed (ROSA and ARO) |
Classic (standalone cluster) | No |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | All supported Managed Hosted Control Plane topologies and configurations |
Connected / Restricted Network | All supported Managed Hosted Control Plane topologies and configurations |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All supported Managed Hosted Control Plane topologies and configurations |
Operator compatibility | N/A |
Backport needed (list applicable versions) | All supported Managed Hosted Control Plane releases |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Yes. Console representation of NodePools/MachinePools should indicate pending rollouts and allow them to be triggered |
Other (please specify) |
Use Cases (Optional):
- Managed OpenShift fix requires a Node rollout to be applied
Questions to Answer (Optional):
- rh-ee-brcox
- Should we explicitly call out the OCP versions in Backport needed (list applicable versions)?
- asegurap1@redhat.com: Depends on what the supported OCP versions are in Managed OpenShift by the time this feature is delivered
- Is there a missing Goal of this feature, to force a rollout at a particular date? My line of thinking is what about CVE issues on the NodePool OCP version - are we giving them a warning like "hey you have a pending rollout because of a CVE; if you don't update the nodes yourself, we will on such & such date"?
- asegurap1@redhat.com: Date-based rollouts are out of scope (see the Out of Scope section).
- jparrill@redhat.com
- What are the expectations for a regular customer NodePool upgrade? Will the change be applied directly, or queued after the last requested change?
- asegurap1@redhat.com: Combined single rollout.
- Does this only apply to NodePool changes, or would it also affect CP upgrades (thinking of OVN changes that could also affect the data plane)?
- asegurap1@redhat.com: CP upgrades that would trigger Nodepool rollouts are in scope. OVN changes should only apply if CNO or its daemonsets are going to cause reboots
- How will the customer trigger the pending rollouts? Will an alert fire in the hosted cluster console?
- asegurap1@redhat.com: I guess there are multiple options, like scaling down and up, or adding some API to NodePool
- I assume we will use a new status condition to reflect the current queue of pending rollouts; is that the case?
- asegurap1@redhat.com: That's a good guess. Hopefully we can represent everything we want with it, or we constrain ourselves to what it can express
- With "Queued rollouts must survive HyperShift restarts"... What kind of details we wanna store there (“there” should be the place to persist the changes queued), the order, the number of rollouts, the destination Hashes, more info…?
- What’s the expectations on a regular customer nodePool upgrade? The change will be directly applied or queued following the last requested change?
-
-
- asegurap1@redhat.com: I'll leave that as an open question to refine
I'll If there are more than one change pending, we asume there will be more than one reboot?
- asegurap1@redhat.com: I'll leave that as an open question to refine
-
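As a starting point for the open persistence question above, here is a minimal sketch of one possible approach: storing the pending rollout target on the NodePool object itself (shown here as an annotation) so it survives operator restarts. The annotation key and the single-target-hash model are assumptions, not a settled design.

```go
package main

import "fmt"

// pendingRolloutAnnotation is a hypothetical key; where and how the queued
// state is persisted is exactly the open question above.
const pendingRolloutAnnotation = "hypershift.example.com/pending-rollout-target"

// recordPendingTarget stores the latest pending target on the object's
// annotations, replacing any previously queued one. Because the state lives
// on the API object rather than in operator memory, it survives HyperShift
// restarts, and successive pending changes collapse into one combined rollout
// to the newest target (matching the "combined single rollout" answer above).
func recordPendingTarget(annotations map[string]string, targetHash string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[pendingRolloutAnnotation] = targetHash
	return annotations
}

func main() {
	a := recordPendingTarget(nil, "def456")
	a = recordPendingTarget(a, "ghi789") // a second pending change replaces the first
	fmt.Println(a)
}
```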
Out of Scope
- Maintenance windows
- Queuing of rollouts on user actions (as that does not meet Kubernetes reconciliation expectations and is better addressed either at the Cluster Service API level or, better yet, on the customer automation side).
- Forced rollouts of pending updates on a certain date. That is something that should be handled at the Cluster Service level if there is a desire to provide it.
Background
Past incidents in which fixes to ignition generation resulted in rollouts that were unexpected by the customer and impacted workloads
Customer Considerations
There should be an easy way to see queued updates, understand their content, and trigger them
Documentation Considerations
SOPs for the observability above (a sketch of the accounting logic follows this list)
ROSA documentation for queued updates
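To make the SOP concrete, the sketch below shows the kind of accounting it could codify: given per-NodePool pending-rollout state (however it is surfaced, e.g. via the hypothetical condition sketched under Requirements), report every NodePool and Hosted Cluster with pending rollouts. The nodePoolInfo shape is invented for this example.

```go
package main

import "fmt"

// nodePoolInfo is an invented shape for this example; in practice the
// PendingRollout flag would be derived from whatever the operator exposes.
type nodePoolInfo struct {
	HostedCluster  string
	Name           string
	PendingRollout bool
}

// pendingByCluster answers both accounting requirements at once: all NodePools
// with pending rollouts, grouped by the Hosted Cluster that owns them.
func pendingByCluster(pools []nodePoolInfo) map[string][]string {
	out := map[string][]string{}
	for _, p := range pools {
		if p.PendingRollout {
			out[p.HostedCluster] = append(out[p.HostedCluster], p.Name)
		}
	}
	return out
}

func main() {
	pools := []nodePoolInfo{
		{HostedCluster: "prod-east", Name: "workers-a", PendingRollout: true},
		{HostedCluster: "prod-east", Name: "workers-b", PendingRollout: false},
		{HostedCluster: "prod-west", Name: "workers-a", PendingRollout: true},
	}
	for hc, nps := range pendingByCluster(pools) {
		fmt.Printf("hosted cluster %s: pending rollouts in NodePools %v\n", hc, nps)
	}
}
```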
Interoperability Considerations
ROSA/HCP and ARO/HCP