
Type: Epic
Resolution: Done
Priority: Major
Fix Version/s: openshift-4.14
Affects Version/s: None
Component/s: None
Labels:
- groomed

Epic Name:
Draft Admin Centered Upgrade Documentation Phase 3
Epic Status:
To Do
Activity Type:
Product / Portfolio Work
Parent Link:
OCPSTRAT-180 Improve upgrades - phase 1
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Blocked:
False
Blocked Reason:
None
Ready:
False
Size:
None

Target Version:

openshift-4.14
Release Blocker:
None

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

Revamp our Upgrade Documentation to include an appropriate level of detail for admins

Why is this important?

Currently, admins have no documentation that explains how upgrades actually work; as a result, when things don't go perfectly, they panic
We do not sufficiently explain, at least within the context of the upgrade docs, the differences between the Degraded and Available statuses
We do not explain the order of operations
We do not explain the protections built into the platform that guard against total cluster failure, i.e., halting when components do not return to a healthy state within exp
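The Degraded versus Available distinction can be illustrated with a minimal Python sketch. It assumes the JSON shape returned by `oc get clusteroperators -o json`, where each item carries `status.conditions` entries with `type` and `status` fields. The function names, the triage labels and the sample operator are hypothetical. In the ClusterOperator API, Available=False means at least part of the component is non-functional and needs administrator attention. Degraded=True means the component has not reached its desired state over a period of time and is running with reduced quality of service, but it may still be Available.

```python
from typing import Any, Dict


def condition_status(co: Dict[str, Any], cond_type: str) -> str:
    """Return a condition's status ("True", "False", "Unknown"); "Unknown" if absent."""
    for cond in co.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status", "Unknown")
    return "Unknown"


def triage(co: Dict[str, Any]) -> str:
    """Hypothetical triage of one ClusterOperator by its Available/Degraded conditions."""
    if condition_status(co, "Available") == "False":
        # Non-functional component: the more serious case, check it first.
        return "unavailable"
    if condition_status(co, "Degraded") == "True":
        # Still serving, but not at its desired state / quality of service.
        return "degraded"
    if condition_status(co, "Available") == "True":
        return "healthy"
    return "unknown"


# Illustrative sample, not real cluster output.
sample = {
    "metadata": {"name": "authentication"},
    "status": {"conditions": [
        {"type": "Available", "status": "True"},
        {"type": "Degraded", "status": "True"},
    ]},
}
print(sample["metadata"]["name"], triage(sample))  # authentication degraded
```

Checking Available=False before Degraded=True reflects the ordering of severity described above. An operator that is both unavailable and degraded is reported as unavailable.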

Scenarios

Move channel management out into its own chapter
Explain or link to existing documentation which addresses the differences between Degraded=True and Available=False
Explain Upgradeable=False conditions and other aspects of the upgrade preflight strategy that Operators use to indicate when it's unsafe to upgrade
Explain basics of how the upgrade is applied
1. CVO fetches release image
2. CVO updates operators in the following order
3. Each operator is expected to monitor for success
4. Provide an example ordering of manifests, and the command to extract release-specific manifests and infer the ordering
Explain how operators indicate problems and generic processes for investigating them
Explain the special role of the MCO and MCP mechanisms, such as pausing pools
Provide some basic guidance on control plane update duration, excluding worker pool rollout duration (90-120 minutes is normal)
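For the manifest-ordering item, a release's manifests can be extracted with `oc adm release extract --to=<dir> <release-image>`. Their apply order can then be inferred from the filenames.

The Python sketch below assumes the naming convention described in the cluster-version-operator developer documentation, `0000_<runlevel>_<component>_<rest>.yaml`. It also assumes that run levels are applied in ascending order, with different components within a run level handled in parallel. Treat both points as assumptions to verify against the CVO docs. The function names and the sample filenames are hypothetical and are for illustration only.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List

# Assumed convention: 0000_<two-digit runlevel>_<dash-separated component>_<rest>.yaml
MANIFEST_RE = re.compile(r"^0000_(\d{2})_([^_]+)_(.+)\.ya?ml$")


def group_by_runlevel(filenames: Iterable[str]) -> Dict[str, Dict[str, List[str]]]:
    """Group manifest filenames by run level, then by component, in lexical order."""
    levels: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for name in sorted(filenames):
        match = MANIFEST_RE.match(name)
        if match is None:
            continue  # skip files that do not follow the ordering convention
        runlevel, component, _rest = match.groups()
        levels[runlevel][component].append(name)
    # Run levels are zero-padded, so a string sort matches numeric order.
    return {lvl: dict(levels[lvl]) for lvl in sorted(levels)}


def print_plan(directory: str) -> None:
    """Print components per run level for a directory produced by `oc adm release extract`."""
    names = [p.name for p in Path(directory).iterdir() if p.is_file()]
    for lvl, components in group_by_runlevel(names).items():
        print(f"run level {lvl}: {', '.join(sorted(components))}")


# Hypothetical filenames shaped like release manifests.
sample = [
    "0000_50_cluster-monitoring-operator_00-namespace.yaml",
    "0000_03_config-operator_01_proxy.crd.yaml",
    "0000_50_cluster-ingress-operator_00-namespace.yaml",
    "image-references",
]
for lvl, comps in group_by_runlevel(sample).items():
    print(lvl, sorted(comps))
# 03 ['config-operator']
# 50 ['cluster-ingress-operator', 'cluster-monitoring-operator']
```

Grouping by run level first, and only then by component, mirrors the assumed model. Each run level acts as a barrier, while the components listed inside one level are not ordered relative to each other.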
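The duration guidance can be checked against a ClusterVersion update history entry, such as an item from `status.history` in `oc get clusterversion version -o json`. Such entries carry `state`, `version`, `startedTime` and `completionTime` fields.

This Python sketch makes an assumption that should be confirmed on a real cluster: it treats `completionTime` as approximating the end of the control plane update, not of the worker pool rollout. The helper names and the sample entry are hypothetical. The 120-minute threshold comes from the guidance in this epic.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Guidance from this epic: 90-120 minutes is normal for the control plane.
NORMAL_MAX_MINUTES = 120


def parse_ts(value: str) -> datetime:
    """Parse a UTC timestamp of the form 2023-08-14T16:43:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def control_plane_minutes(entry: Dict[str, Any]) -> Optional[float]:
    """Minutes from startedTime to completionTime, or None if not yet Completed."""
    if entry.get("state") != "Completed" or not entry.get("completionTime"):
        return None
    delta = parse_ts(entry["completionTime"]) - parse_ts(entry["startedTime"])
    return delta.total_seconds() / 60


def assess(entry: Dict[str, Any]) -> str:
    """Hypothetical summary of one history entry against the duration guidance."""
    minutes = control_plane_minutes(entry)
    if minutes is None:
        return "in progress"
    if minutes > NORMAL_MAX_MINUTES:
        return f"{minutes:.0f} min: longer than the normal range (up to {NORMAL_MAX_MINUTES} minutes)"
    return f"{minutes:.0f} min: within the normal range (up to {NORMAL_MAX_MINUTES} minutes)"


# Illustrative sample entry, not real cluster output.
entry = {"state": "Completed", "version": "4.14.0",
         "startedTime": "2023-08-14T14:00:00Z",
         "completionTime": "2023-08-14T15:45:00Z"}
print(assess(entry))  # 105 min: within the normal range (up to 120 minutes)
```

An entry still in state `Partial`, or one without a `completionTime`, is reported as in progress instead of being compared against the guidance.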

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

There was an effort to write up how to use MachineConfig Pools to partition and optimize worker rollout in https://issues.redhat.com/browse/OTA-375

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

clones

OTA-453 Draft Admin Centered Upgrade Documentation Phase 1

Closed

is cloned by

OTA-1012 Draft Admin Centered Upgrade Documentation Phase 4

OTA-922 In Docs, motivate duration of control-plane component updates

Closed

is related to

OTA-809 Draft Admin Centered Upgrade Documentation Phase 2

Closed

Assignee:: Lalatendu Mohanty

Reporter:: Scott Dodson

QA Contact:: Evgeni V

Need Info From:: None

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Created:: 2022/10/26 4:18 PM

Updated:: 2025/06/27 9:01 AM

Resolved:: 2023/08/14 4:43 PM
