Story
Resolution: Done
Major
TELCOSTRAT-119 - OCP performance and scalability up to 4000 nodes
OTA 203, OTA 204, OTA 205, OTA 206, OTA 207
Market problem: As an administrator, I would like to know how to upgrade my 500-node cluster in one or more maintenance windows.
Feature summary
Red Hat end customers expect to be able to fit any maintenance operation, including upgrades between OCP versions, into a two-hour window. These windows fall during the quiet time of the day, with a further two-hour window to roll back in case the maintenance operation fails. The clusters are kept in service without needing to consume their five- or six-nines downtime budget. This is usually labeled "in-service upgrade"; see, for instance, Juniper ISSU or the Ericsson demo (few public materials are available).
Goals:
- Provide a cost-efficient update/upgrade procedure (OPEX)
- Provide an update/upgrade procedure that fits into one maintenance window, or at most "a few" maintenance windows
- Cluster scale target for this feature: 500 nodes
- Provide a deterministic maintenance window duration
- Provide a deterministic rollback procedure to ensure that the end-to-end service is always fully available at the end of each maintenance window in case of any failure (OCP, plugin, CNF, or any other software component)
Solution: We can leverage machine config pools on the worker nodes during upgrades to meet the goals.
Example:
A scenario where this might be used: an application runs on 100 worker nodes, which are divided into two machine config pools, MCP1 and MCP2, with a 10/90 split. When it is time to upgrade the cluster, MCP1 is cordoned and drained so that the application runs entirely on MCP2. MCP1 is then resumed and updated. The administrator can then run the application on MCP1 to validate functionality. Only after the application is confirmed to be functioning properly on MCP1 is MCP2 upgraded. If the application does not run properly on MCP1, the application is moved back to MCP2, and MCP1 undergoes diagnosis to figure out what the issue is and can potentially be backed out to the previous version.
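The split described above maps onto a custom machine config pool that can be paused independently of the main worker pool. A minimal sketch follows; the pool name `worker-canary` and its node label are illustrative assumptions, not names from this ticket.

```yaml
# Hypothetical custom pool holding the small (e.g. 10%) canary slice of workers.
# Nodes join the pool by carrying the matching role label.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-canary
spec:
  machineConfigSelector:
    # Render both the base worker configs and any canary-specific ones.
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, worker-canary]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-canary: ""
  # While paused, the pool does not roll out new machine configs;
  # unpausing it during the maintenance window triggers the update.
  paused: true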
- is related to: MCO-75 Document paused pool behavior (To Do)
- links to