Feature
Resolution: Unresolved
Product / Portfolio Work
Feature Overview (aka. Goal Summary)
Enable OpenShift control plane z-stream upgrades to complete successfully regardless of customer cluster state or user configuration, including clusters with unhealthy or missing worker nodes, failed cluster operators, or degraded data plane components. This gives managed OpenShift services a reliable, predictable upgrade path while keeping the control plane available and current on security updates.
Goals (aka. expected user outcomes)
- Primary Users: SREs managing OpenShift clusters
- Outcomes:
  - Control plane upgrades complete successfully, regardless of user-defined cluster configuration
  - Control plane upgrades complete successfully even when worker nodes are unhealthy, missing, or absent entirely (zero worker nodes)
  - Upgrade processes are isolated from data plane health status
  - Fewer upgrade failures caused by external cluster conditions
  - Improved upgrade reliability for managed OpenShift services
  - Enhanced security posture through consistent control plane patching
Requirements (aka. Acceptance Criteria):
Functional Requirements:
- Control plane z-stream upgrades must complete successfully with zero worker nodes
- Upgrades must proceed regardless of ClusterOperator health status (console, image-registry, monitoring, etc.)
- Control plane components must upgrade independently of cluster state
- Forced upgrade scenarios must be supported through appropriate annotations or APIs
- An OCP e2e test must verify that upgrades succeed in every scenario listed above
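As a rough illustration of the functional requirements above, the sketch below models an upgrade gate that looks only at control plane readiness. The `ClusterState` fields, the `can_upgrade_control_plane` function, and the scenario names are hypothetical and are not taken from HyperShift or any OpenShift API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ClusterState:
    """Hypothetical snapshot of a hosted cluster, for illustration only."""
    control_plane_ready: bool
    worker_nodes: int = 0
    # Maps ClusterOperator name -> True if the operator is healthy.
    operators: Dict[str, bool] = field(default_factory=dict)


def can_upgrade_control_plane(state: ClusterState) -> bool:
    """Gate a z-stream control plane upgrade on control plane readiness only.

    Worker count and ClusterOperator health are deliberately ignored,
    mirroring the requirement that data plane state must not block the upgrade.
    """
    return state.control_plane_ready


# Scenario matrix that an e2e test might cover (names are assumptions).
scenarios = {
    "healthy": ClusterState(True, 3, {"console": True, "monitoring": True}),
    "zero_workers": ClusterState(True, 0, {"console": False}),
    "degraded_operators": ClusterState(True, 3, {"image-registry": False}),
}
results = {name: can_upgrade_control_plane(s) for name, s in scenarios.items()}
```

In this model every scenario in the matrix is upgradeable, and only a control plane that is itself not ready blocks the upgrade.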
Non-functional Requirements:
- Reliability: 99.9% success rate for control plane upgrades in managed environments
- Performance: Control plane upgrade time not impacted by data plane state
- Security: All security patches applied to control plane regardless of cluster state
- Monitoring: Comprehensive metrics and alerting for upgrade progression
- Maintainability: Clear operational procedures for SRE teams
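To make the 99.9% reliability target concrete, the following sketch shows how an observed success rate could be checked against it and how many failed upgrades a given volume tolerates. The function names and the sample counts are illustrative assumptions, not an existing metric or tool.

```python
import math
from fractions import Fraction

# 99.9% target from the reliability requirement, kept exact to avoid
# floating-point rounding at the threshold.
TARGET = Fraction(999, 1000)


def meets_target(succeeded: int, attempted: int, target: Fraction = TARGET) -> bool:
    """Return True if succeeded/attempted is at or above the target rate."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return Fraction(succeeded, attempted) >= target


def allowed_failures(attempted: int, target: Fraction = TARGET) -> int:
    """Largest number of failed upgrades that still satisfies the target."""
    return math.floor(attempted * (1 - target))


# Example: out of 5000 control plane upgrades, at most 5 may fail.
budget = allowed_failures(5000)
```

Note that below 1000 attempts the failure budget is zero, so a single failed upgrade in a small fleet already misses the target.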
Use Cases (Optional):
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Questions to Answer (Optional):
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Out of Scope
- Self-managed cluster upgrade improvements (separate OCPSTRAT required)
- Data plane upgrade resilience (covered by separate initiatives)
- Cross-version upgrades (y-stream, major version upgrades)
- Non-HCP managed services integration
- Customer-initiated control plane upgrades (managed service only)
- Workload migration or data preservation during upgrades
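Because the scope is limited to z-stream upgrades, a small check like the one below can distinguish them from y-stream or major version upgrades. This is a minimal sketch that assumes plain `major.minor.patch` version strings; the function names and the version numbers in the example are illustrative, and pre-release or build suffixes are not handled.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse a 'major.minor.patch' string into a tuple of integers."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected major.minor.patch, got {version!r}")
    major, minor, patch = (int(p) for p in parts)
    return major, minor, patch


def is_z_stream_upgrade(current: str, target: str) -> bool:
    """True when only the patch component increases (same major and minor)."""
    cur = parse_version(current)
    tgt = parse_version(target)
    return cur[:2] == tgt[:2] and tgt[2] > cur[2]


in_scope = is_z_stream_upgrade("4.16.3", "4.16.9")      # patch bump only
out_of_scope = is_z_stream_upgrade("4.16.3", "4.17.0")  # y-stream change
```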
Background
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
This feature addresses a critical gap in OpenShift upgrade reliability for managed services. Currently, control plane upgrades can fail due to data plane issues, creating operational burden for SRE teams and delaying security patches. The HyperShift architecture provides the foundation for control plane isolation, but upgrade processes still maintain dependencies on cluster operator health.
Related Work:
- CNTRLPLANE-529: Control plane upgrades should succeed regardless of data plane state
- ARO-21612: Managed Control Plane z-stream Upgrades
- XCMSTRAT-625: ARO HCP upgrade strategy design
- SD-ADR-0212: Architecture decision record for upgrade isolation
Customer Considerations
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Documentation Considerations
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Interoperability Considerations
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.