- MCO safely applies non-user-initiated changes to the nodes in a pool even when the pool has been paused by an admin. This includes:
- kubelet cert rotation changes without workload disruption
- Alert the admin and send an event when a paused pool needs to be unpaused to keep the cluster healthy. This includes:
- A kubelet cert update is available, but other MachineConfig changes are also staged to be applied.
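The intended behavior for a pool can be sketched as a small decision function. This is only an illustration of the goals listed above, not MCO code: `PoolState`, `Action`, and `decide` are hypothetical names, and the real controller tracks pool state differently.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    APPLY = "apply"  # roll the pending change out to the pool's nodes
    ALERT = "alert"  # leave nodes untouched; raise an alert and send an event
    SKIP = "skip"    # nothing to do for this pool


@dataclass
class PoolState:
    # Hypothetical fields, for illustration only.
    paused: bool
    cert_update_pending: bool    # a rotated kubelet cert is waiting to be applied
    other_changes_pending: bool  # other MachineConfig changes are staged


def decide(pool: PoolState) -> Action:
    """Return what should happen to a pool under the goals of this epic."""
    anything_pending = pool.cert_update_pending or pool.other_changes_pending
    if not pool.paused:
        # Unpaused pools follow the normal rollout path.
        return Action.APPLY if anything_pending else Action.SKIP
    if not pool.cert_update_pending:
        # Paused and no cert rotation waiting: the pause is honored as before.
        return Action.SKIP
    if pool.other_changes_pending:
        # The cert cannot be applied on its own, so the admin must unpause.
        return Action.ALERT
    # Paused, and only the cert rotation is pending: safe to apply.
    return Action.APPLY
```

The key case is the paused pool with only a cert rotation pending: it is the single situation in which a change is applied despite the pause.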
In an OCP cluster, the kubelet cert expires every 365 days. To keep the kubelet cert current, a new cert is added after 292 days and the expired cert is removed once 365 days have passed. To keep the cluster functioning, MCO applies the updated certs to nodes whenever they change.
MCO does not apply any config changes to nodes in a paused pool. This becomes a problem when an admin has paused one or more Machine Config Pools (MCPs) in the cluster and the kubelet cert expires during that time.
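To make the timing concrete, the short sketch below computes the rotation and expiry dates from the 292-day and 365-day figures quoted above. The function names and the example issue date are assumptions for illustration; they are not taken from MCO.

```python
from datetime import date, timedelta
from typing import Tuple

CERT_LIFETIME_DAYS = 365   # cert expires after this many days (from the text above)
ROTATION_AFTER_DAYS = 292  # new cert is added after this many days (from the text above)


def rotation_window(issued: date) -> Tuple[date, date]:
    """Return (date the new cert is added, date the old cert expires)."""
    rotated = issued + timedelta(days=ROTATION_AFTER_DAYS)
    expires = issued + timedelta(days=CERT_LIFETIME_DAYS)
    return rotated, expires


def days_until_expiry(issued: date, today: date) -> int:
    """Days left before the old cert expires (negative once it has expired)."""
    _, expires = rotation_window(issued)
    return (expires - today).days


if __name__ == "__main__":
    issued = date(2021, 1, 1)  # hypothetical issue date
    rotated, expires = rotation_window(issued)
    print(f"new cert added: {rotated}, old cert expires: {expires}")
    print(f"days to roll out the new cert: {(expires - rotated).days}")
```

With these figures, a pool has 365 - 292 = 73 days to pick up the new cert. A pool that stays paused through that entire window is the case this epic targets.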
This epic should make life easier for cluster admins, the support team, and our customers by:
- Keeping the cluster in a healthy state when the kubelet cert gets refreshed.
- Making the admin aware, through alerts, when the cluster may become unhealthy because the kubelet CA is expiring and MCO can't safely apply the updated cert to nodes in a paused pool.
- Giving customers who want to stay on EUS support more stability and trust.
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
- MCO applies kubelet CA updates to paused pools as well, provided there are no additional pending MachineConfig changes to be applied.
- MCO logs an event and alerts the admin when a kubelet CA update is available but there are additional pending MachineConfig changes to be applied.
- In 4.7, MCO gained the ability to skip drain and reboot for selected cases - https://issues.redhat.com/browse/GRPA-2715
- In 4.8, MCO added support for skipping drain and reboot on kubelet cert rotation - https://issues.redhat.com/browse/GRPA-3190 . This feature was later backported to 4.7 as an exception - https://bugzilla.redhat.com/show_bug.cgi?id=1939278
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>