XML

Word

Printable

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

Record accurate expected durations for each minor upgrade step exclusive of Worker MCP rollout
Identify areas for improvement and measure those improvements

Why is this important?

We've ignored upgrade duration for the last several releases to the point that we've had to relax existing CI tests which validated that upgrades complete within a certain time period
When customers prepare for an EUS 4.6 to a 4.10 upgrade it's important that it's as fast as possible and that we can predict for them how long the upgrade should take
We know there are portions of the upgrade which are slower than necessary including
- DaemonSets which roll out serially when a canary and more parallel manner is safe to do so
- Graceful shutdowns now releasing leases so that new controllers can immediate obtain a lease (MCO, others?)
- Fatter images?
- Higher cross operator parallelization?
- Nothing should be off the table as long as it makes upgrades faster while introducing no additional risk

CI - Need a periodic job which tracks the upgrade waterfall of a cluster of moderate size with moderate workload so that we can track this at a macro level and identify when we regress
Stories on the boards for components which require optimization
A report of common issues for which we should amend design patterns to identify and prevent introduction of slack
A report which measures our success "ie: On a cluster with 12 nodes operating with 800 pods, we've been able to reduce the upgrade time by 7 minutes in each minor version upgrade across 4.6, 4.7, 4.8, 4.9, and 4.10"

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>