Feature
Resolution: Unresolved
Product / Portfolio Work
Feature Overview
This feature delivers a fully supported, validated, and officially documented procedure for cluster administrators to safely and reliably replace a single failed control plane (master) node on an OpenShift Container Platform (OCP) cluster. This standardized procedure moves a critical operational task from a community/support-documented workaround to an official product capability, significantly improving cluster maintainability and operator confidence.
Goals
- Key Goal: Provide a fully supported procedure for Day-2 control plane node replacement that ensures cluster stability and maintainability.
- Primary User: The target user is the Cluster Administrator responsible for the Day-1 and Day-2 lifecycle management of the OpenShift cluster.
- Existing Functionality Improvement: The existing recommended procedure (currently a Red Hat Support Solution) is transformed into a fully supported and officially documented part of the OpenShift documentation, making it authoritative and easier to discover.
Requirements
A list of specific needs or objectives that this feature must deliver in order to be considered complete.
Functional requirements:
- The procedure must be capable of replacing a single failed control plane node while the cluster remains operational (quorum must be maintained by the remaining control plane nodes).
- The replacement procedure must be validated for clusters installed via installer-provisioned infrastructure (IPI), user-provisioned infrastructure (UPI), the Assisted Installer, or the Agent-based Installer (ABI).
- The final, validated procedure must be published in the official OpenShift documentation (e.g., in "Installing an on-premise cluster with the agent-based installer").
- The procedure must be tested against OpenShift Y-stream (minor) releases and validated on OpenShift 4.19 and later.
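The quorum condition in the first functional requirement can be made concrete with a little arithmetic. The sketch below assumes the standard majority rule used by etcd (quorum is floor(n/2) + 1 voting members). The function names are illustrative only and are not part of any OpenShift tooling.

```python
def quorum_size(members: int) -> int:
    """Majority quorum for an etcd cluster with `members` voting members."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return members // 2 + 1


def quorum_survives(members: int, failed: int = 1) -> bool:
    """True if the members that are still healthy can form a quorum."""
    return members - failed >= quorum_size(members)


if __name__ == "__main__":
    # Control plane sizes mentioned in this feature: three, four, or five nodes.
    for n in (3, 4, 5):
        print(f"{n} members: quorum={quorum_size(n)}, "
              f"survives one failure={quorum_survives(n)}")
```

Under this rule a three-node control plane keeps quorum with one failed node but not with two, which is why the feature is limited to single-node replacement.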
Non-Functional requirements:
- Usability/Clarity: The final documented procedure must be clear, step-by-step, and executable by a Cluster Administrator with intermediate OpenShift operational experience.
- Reliability: The procedure must consistently restore the control plane to a healthy, three-node state (or four/five-node, if applicable) without any residual impact on cluster services or configuration.
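One way to make the "healthy, three-node state" criterion checkable is to count control plane nodes whose `Ready` condition is `True`. The sketch below is an assumption about how such a check could look, not part of the validated procedure. It expects a node list in the standard Kubernetes `NodeList` JSON shape (for example, parsed output of `oc get nodes -l node-role.kubernetes.io/master -o json`); the function names and the sample data are hypothetical.

```python
def ready_node_names(node_list: dict) -> list:
    """Return names of nodes whose 'Ready' condition has status 'True'."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            names.append(node["metadata"]["name"])
    return names


def control_plane_restored(node_list: dict, expected: int = 3) -> bool:
    """True if exactly `expected` control plane nodes report Ready."""
    return len(ready_node_names(node_list)) == expected


def make_node(name: str, ready: bool) -> dict:
    """Build a minimal, hypothetical node object for illustration."""
    status = "True" if ready else "False"
    return {
        "metadata": {"name": name},
        "status": {"conditions": [{"type": "Ready", "status": status}]},
    }


if __name__ == "__main__":
    # Hypothetical state before replacement: one control plane node is down.
    degraded = {"items": [make_node("master-0", True),
                          make_node("master-1", True),
                          make_node("master-2", False)]}
    print(ready_node_names(degraded), control_plane_restored(degraded))
```

Node readiness alone does not cover the full requirement; a real validation would also need to confirm etcd membership and the absence of residual impact on cluster services, which this sketch does not attempt.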
Use Case
As a Cluster Administrator, I want to use an officially supported procedure for Day-2 control plane node replacement when one control plane node is down for any reason and I need to create a new one, so that I can quickly and confidently restore the control plane's high availability and operational integrity using a reliable, documented method.
Out of Scope
The following items are explicitly not included in the scope of this feature:
- Automation: The procedure remains a manual, operator-driven process. Full or partial automation (e.g., using an Operator or specialized tooling) is out of scope.
- Complex Failure Modes: The focus is on replacing a single failed node that cannot be recovered. Other, more complex failure modes (e.g., network partitions or simultaneous failures of multiple nodes) are not covered.
Links
- Request for Enhancement (RFE): https://issues.redhat.com/browse/RFE-6515 (Simplify day-2 master node replacement operation)
- Current Support Solution (to be formalized): https://access.redhat.com/solutions/7130140
- Documentation Target (to be updated): OpenShift Container Platform documentation for the Agent-based Installer, e.g., "Installing an on-premise cluster with the agent-based installer"
- Current OCP 4.19 target: https://docs.redhat.com/en/documentation/openshift_container_platform/4.19/html-single/installing_an_on-premise_cluster_with_the_agent-based_installer/index
- Is related to: AGENT-1070 Allow adding control-planes in day2 (To Do)