Feature Request
Resolution: Unresolved
1. Proposed title of this feature request
Improvements on OCP Update processes, including Red Hat provided Day2 Operators, for Telco Environments.
2. What is the nature and description of the request?
Terms used:
- Updates:
- z-stream update (e.g. 4.12.19 to 4.12.21)
- minor version or y-version update (e.g. 4.12.21 to 4.13.1)
- EUS-to-EUS update (e.g. 4.10.51 to 4.12.21)
- MNO: Any topology that has redundancy at the control plane level, regardless of whether it has dedicated nodes for that role or not (i.e. with or without schedulable masters)
- Maintenance window (MW): Typically, the MW duration is 4 hours. Any update process must complete within 2 hours of the MW. The remaining 2 hours are reserved for other purposes, such as troubleshooting and restoring the system in case of a failure.
- Outage time: the continuous time frame from the moment the CNF can no longer serve traffic until the point where the CNF is starting but is not yet in a ready state. If multiple outages occur during an update procedure, outage time should be measured from the first occurrence until the end of the last.
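The outage time definition above can be sketched in code. This is a minimal illustration, assuming outages are recorded as (start, end) timestamp pairs; the function name `outage_time` and the sample timestamps are hypothetical and not part of any OCP tooling.

```python
from datetime import datetime, timedelta


def outage_time(outages):
    """Return the outage time for a list of (start, end) datetime pairs.

    Following the definition above, multiple outages count as one span:
    from the start of the first outage to the end of the last one.
    """
    if not outages:
        return timedelta(0)
    first_start = min(start for start, _ in outages)
    last_end = max(end for _, end in outages)
    return last_end - first_start


# Hypothetical timestamps, for illustration only.
t0 = datetime(2024, 1, 1, 1, 0)
outages = [
    (t0 + timedelta(minutes=10), t0 + timedelta(minutes=14)),  # 4-minute outage
    (t0 + timedelta(minutes=20), t0 + timedelta(minutes=23)),  # 3-minute outage
]
print(outage_time(outages))  # 0:13:00
```

Note that the two outages last 7 minutes combined, but the outage time under this definition is 13 minutes, because the gap between them is included.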
This RFE requests a list of improvements to all types of update processes for all valid topologies (SNO, SNO+1, MNO) used by Telco Partners in baremetal, disconnected environments. The partner expects all requirements to be delivered by OCP 4.16, unless otherwise stated.
- MNO: A three-node cluster with schedulable or non-schedulable Control Plane (master) nodes and any number of additional Compute (worker) nodes, from 0 to 15.
The following list summarizes the improvements that should be introduced:
- Non-rolling update
- Worker nodes will be in a single pool (MCP), therefore all worker nodes should be updated in parallel.
- The duration of any update process should be no more than 80 minutes in total. A further reduction down to 60 minutes is expected in OCP 4.18.
- During the update process of the cluster, the maximum outage time allowed is 15 minutes.
- The aforementioned duration and outage time should include the update of Red Hat provided Day2 Operators.
- The update process should allow the user to provide additional configuration (e.g. MachineConfig) to be applied during the update, without causing more than one reboot per node for the entire process.
- In case of an unsuccessful update, the cluster must be in a stable operational state without capacity or service degradation at the end of the Maintenance Window.
- Canary rollout update (in-service update) improvements will be tracked in RFE-4873.
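The numeric limits in the list above can be expressed as a simple acceptance check. This is only a sketch: the `UpdateRun` record and its field names are hypothetical, and the measured duration and outage are assumed to already include the update of Red Hat provided Day2 Operators.

```python
from dataclasses import dataclass, field

# Limits taken from the requirement list above (OCP 4.16 targets).
MAX_UPDATE_MINUTES = 80     # total duration of any update process
MAX_OUTAGE_MINUTES = 15     # maximum outage time during the update
MAX_REBOOTS_PER_NODE = 1    # a single reboot per node for the entire process


@dataclass
class UpdateRun:
    """Hypothetical record of one measured update run."""
    duration_minutes: float
    outage_minutes: float
    reboots_per_node: dict = field(default_factory=dict)  # node name -> count


def violations(run):
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    if run.duration_minutes > MAX_UPDATE_MINUTES:
        problems.append(
            f"duration {run.duration_minutes} min exceeds {MAX_UPDATE_MINUTES} min"
        )
    if run.outage_minutes > MAX_OUTAGE_MINUTES:
        problems.append(
            f"outage {run.outage_minutes} min exceeds {MAX_OUTAGE_MINUTES} min"
        )
    for node, count in sorted(run.reboots_per_node.items()):
        if count > MAX_REBOOTS_PER_NODE:
            problems.append(f"node {node} rebooted {count} times")
    return problems


# Illustrative values only, not measured data.
run = UpdateRun(
    duration_minutes=75,
    outage_minutes=12,
    reboots_per_node={"worker-0": 1, "worker-1": 2},
)
for problem in violations(run):
    print(problem)  # node worker-1 rebooted 2 times
```

Lowering `MAX_UPDATE_MINUTES` to 60 would model the further reduction expected in OCP 4.18.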
3. Why does the customer need this? (List the business requirements here)
Communication Regulators and Communication Service Providers have very strict requirements in terms of availability (outage), execution time of any maintenance, and recovery plans for any unforeseen issues. Moreover, service outages must be eliminated or minimized.
4. List any affected packages or components.
Operators used by Telco Partners
- clones
RFE-4208 Improvements on OCP SNO+1 Update & Upgrade processes, including Operators, for Telco Environments.
- Approved