Story
Resolution: Done
Major
TELCOSTRAT-119 - OCP performance and scalability up to 4000 nodes
OTA 203, OTA 204, OTA 205, OTA 206, OTA 207
Market problem: As an administrator, I would like to know how to upgrade my 500-node cluster in one or more maintenance windows.
Feature summary
Red Hat end customers expect to be able to fit any maintenance operation, including upgrades between OCP versions, into a two-hour window. These windows fall during the quiet time of the day, with a further two-hour window to roll back in case the maintenance operation fails. The clusters are kept in service without needing to consume their five- or six-nines downtime budget. This is usually labeled "in-service upgrade"; see, for instance, Juniper ISSU or the Ericsson demo (few public materials are available).
Goals:
- Provide a cost-efficient update/upgrade procedure (OPEX)
- Provide an update/upgrade procedure that fits into one maintenance window, or at most "a few" maintenance windows
- Cluster scale target for this feature: 500 nodes
- Provide a deterministic maintenance window duration
- Provide a deterministic rollback procedure to ensure that the end-to-end service is always fully available at the end of each maintenance window in case of any failure (OCP, plugin, CNF, or any other software component)
Solution: We can leverage machine config pools on the worker nodes during upgrades to meet the goals.
Example:
A scenario where this might be used: an application runs on 100 worker nodes, which are divided into two machine config pools, MCP1 and MCP2, with a 10/90 split. When it is time to upgrade the cluster, MCP1 is cordoned and drained so that the application runs entirely on MCP2. MCP1 is then resumed and updated. The administrator can then run the application on MCP1 to validate functionality. Only after the application is confirmed to be functioning properly on MCP1 is MCP2 upgraded. If the application does not run properly on MCP1, the application is moved back to MCP2, and MCP1 undergoes diagnosis to figure out what the issue is and can potentially be backed out to the previous version.
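The split described above maps onto a custom machine config pool that can be paused independently of the main worker pool. A minimal sketch follows; the pool name `worker-canary` and its node label are illustrative assumptions, not names from this ticket.

```yaml
# Hypothetical custom pool holding the small (e.g. 10%) canary slice of workers.
# Nodes join the pool by carrying the matching role label.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-canary
spec:
  machineConfigSelector:
    # Render both the base worker configs and any canary-specific ones.
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, worker-canary]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-canary: ""
  # While paused, the pool does not roll out new machine configs;
  # unpausing it during the maintenance window triggers the update.
  paused: true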
- is related to: MCO-75 Document paused pool behavior (To Do)
- links to