We experienced two separate upgrade failures relating to the introduction of the SYSTEM_RESERVED_ES node sizing parameter, causing kubelet to stop running.
One cluster (clusterA) upgraded from 4.11.14 to 4.11.17. After the upgrade, a file on its master nodes contained an empty SYSTEM_RESERVED_ES value, which prevented the kubelet from starting. To restore service, this file was manually updated to set a value (1Gi), and the kubelet was restarted.
We are uncertain what conditions led to this occurring on the clusterA master nodes as part of the upgrade.
A second cluster (clusterB) upgraded from 4.11.16 to 4.11.17. Its worker nodes were impacted by a similar problem, but in this case the cause was a custom node-sizing-enabled.env MachineConfig that did not set SYSTEM_RESERVED_ES.
This caused existing worker nodes to go into a NotReady state after the upgrade. In addition, new nodes did not join the cluster because their kubelet was impacted in the same way.
For clusterB, the reason the value was empty is better understood.
However, in both clusters, if SYSTEM_RESERVED_ES ends up empty on a node, it can prevent the kubelet from starting.
We have some asks as a result:
- Can MCO be made to recover from this situation if it occurs, perhaps through application of a safe default if none exists, such that kubelet would start correctly?
- Can there possibly be alerting that could indicate and draw attention to the misconfiguration?
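
To illustrate the kind of behaviour we are asking for, here is a minimal sketch in Python. It is not MCO code and makes several assumptions: that the node sizing file is a simple KEY=VALUE env file, that 1Gi (the value we set manually on clusterA) is an acceptable fallback, and that the function names `parse_env` and `ensure_reserved_es` are hypothetical. It fills in a default when SYSTEM_RESERVED_ES is missing or empty and returns a warning that could feed an alert.

```python
from typing import Dict, List, Tuple

KEY = "SYSTEM_RESERVED_ES"
# Assumption: 1Gi is the value used for the manual fix on clusterA,
# not an official default.
SAFE_DEFAULT = "1Gi"


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def ensure_reserved_es(text: str, default: str = SAFE_DEFAULT) -> Tuple[str, List[str]]:
    """Return (env text, warnings); the text is repaired if KEY is missing or empty."""
    warnings: List[str] = []
    if parse_env(text).get(KEY):
        # A non-empty value is already set; leave the file untouched.
        return text, warnings
    warnings.append(f"{KEY} is missing or empty; applying default {default}")
    # Drop any existing (empty) assignment of KEY, then append the default.
    kept = [
        line for line in text.splitlines()
        if line.strip().partition("=")[0].strip() != KEY
    ]
    kept.append(f"{KEY}={default}")
    return "\n".join(kept) + "\n", warnings
```

With this sketch, both the clusterA case (key present but empty) and the clusterB case (key absent from a custom file) would yield a usable value plus a warning, rather than a kubelet that fails to start silently.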