- Feature
- Resolution: Done
- Major
[Sept 3. Note: this might need to be broken into 2 issues]
Feature Overview and Background
The MCO team has reported several classes of issues with the MCO that can cause support cases or block upgrades. Additionally, there are changes that the team believes will improve the maintainability of the code and make it easier to troubleshoot.
4.7 Phase - Wait for All Worker Pools on Upgrade
- Today, when an upgrade is initiated, the CVO reports that the upgrade is complete as soon as the master pool has been upgraded.
- Other pools may not have completed due to an upgrade problem or a perfectly valid condition like pausing reconciliation on one or more pools.
- In both the unintentional case (an error prevents a worker from upgrading) and the intentional case (a pool is paused), the administrator can initiate another upgrade before the previous one has been rolled out to the full cluster.
- Why this is important
- Cluster administrators can get into a state where the cluster reports that it is upgraded when, in fact, the upgrade is only partially rolled out. The cluster ends up somewhere between releases, especially on the compute side. We want to avoid minor version skew between control plane and compute nodes (z-stream skew, by contrast, is acceptable for k8s). This will lower the number of bug reports the team gets from cases where an admin started an upgrade that degraded the compute pool, did not notice, and moved on to another upgrade, leaving compute at 4.(y-2).
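As a rough illustration of the intended behavior, the check could look something like the Go sketch below. It is not the CVO or MCO implementation: the `Pool` type, its fields, and the functions `PendingPools` and `UpgradeComplete` are hypothetical names invented for this example, and treating a paused pool with outdated machines as "not complete" is an assumption about how the gating would work.

```go
package main

import "fmt"

// Pool is a hypothetical, simplified view of one machine config pool.
// It does not mirror the real MachineConfigPool API type.
type Pool struct {
	Name            string
	Paused          bool
	Degraded        bool
	MachineCount    int
	UpdatedMachines int
}

// reason returns why a pool has not finished the rollout,
// or an empty string when every machine is updated.
func (p Pool) reason() string {
	switch {
	case p.Degraded:
		return "degraded"
	case p.UpdatedMachines == p.MachineCount:
		return ""
	case p.Paused:
		// Assumption: a paused pool with outdated machines still blocks completion.
		return "paused"
	default:
		return "updating"
	}
}

// PendingPools lists, in input order, the pools that have not finished the rollout.
func PendingPools(pools []Pool) []string {
	var pending []string
	for _, p := range pools {
		if r := p.reason(); r != "" {
			pending = append(pending, p.Name+": "+r)
		}
	}
	return pending
}

// UpgradeComplete is true only when all pools, not just master, are done.
func UpgradeComplete(pools []Pool) bool {
	return len(PendingPools(pools)) == 0
}

func main() {
	pools := []Pool{
		{Name: "master", MachineCount: 3, UpdatedMachines: 3},
		{Name: "worker", Paused: true, MachineCount: 5, UpdatedMachines: 2},
	}
	fmt.Println("complete:", UpgradeComplete(pools))
	fmt.Println("pending:", PendingPools(pools))
}
```

With a check of this shape, a second upgrade could be gated on `UpgradeComplete` returning true, and the pending list gives the administrator a reason per pool instead of a misleading "upgrade complete" status.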
Future work
Fault Tolerant MCD: https://issues.redhat.com/browse/GRPA-2682
Best Effort Upgrade on Degraded MCO: https://issues.redhat.com/browse/GRPA-1641
Rework Kubeletconfig and Containerruntimeconfig Controllers: https://issues.redhat.com/browse/GRPA-2679
Validate pullsecret before writing it: https://issues.redhat.com/browse/GRPA-2699
Also related: Bootimage Updates: https://issues.redhat.com/browse/GRPA-2680