OpenShift Bugs / OCPBUGS-4945

Empty/missing node-sizing SYSTEM_RESERVED_ES parameter can result in kubelet not starting


    • Moderate
    • None
    • False



      This is a clone of issue OCPBUGS-4805. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-4101. The following is the description of the original issue:

      Description of problem:

      We experienced two separate upgrade failures relating to the introduction of the SYSTEM_RESERVED_ES node sizing parameter, causing kubelet to stop running.
      One cluster (clusterA) upgraded from 4.11.14 to 4.11.17. It experienced an issue whereby /etc/node-sizing.env on its master nodes contained an empty SYSTEM_RESERVED_ES value:
      cat /etc/node-sizing.env
      This prevented the kubelet from starting. To restore service, the file was manually updated to set a value (1Gi) and the kubelet was restarted.
      We are uncertain what conditions led to this occurring on the clusterA master nodes as part of the upgrade.
      A second cluster (clusterB) upgraded from 4.11.16 to 4.11.17. Its worker nodes were impacted by a similar problem, but in this case the cause was a custom node-sizing-enabled.env MachineConfig that did not set SYSTEM_RESERVED_ES.
      This caused existing worker nodes to go into a NotReady state after the upgrade. In addition, new nodes did not join the cluster because their kubelet was affected in the same way.
      For clusterB, the reason the value is empty is better understood.
      However, for both clusters, if SYSTEM_RESERVED_ES ends up as empty on a node it can cause the kubelet to not start. 
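The failure condition described above (a required variable that is missing or empty in a node-sizing env file) can be checked mechanically. The following is a minimal sketch, not MCO code: the function name `find_empty_keys` is hypothetical, and the sample file contents (including the other variable names) are illustrative assumptions rather than output captured from either cluster.

```python
def find_empty_keys(env_text, required=("SYSTEM_RESERVED_ES",)):
    """Return the required keys that are missing or set to an empty value."""
    values = {}
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Treat KEY="" and KEY='' the same as KEY=
        values[key.strip()] = value.strip().strip('"').strip("'")
    return [key for key in required if not values.get(key)]


# Illustrative contents of an env file with the problem described above.
sample = "SYSTEM_RESERVED_MEMORY=1Gi\nSYSTEM_RESERVED_CPU=500m\nSYSTEM_RESERVED_ES=\n"
print(find_empty_keys(sample))  # prints ['SYSTEM_RESERVED_ES']
```

A check of this kind could run against the contents of /etc/node-sizing*env on a node before the kubelet is restarted.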
      We have some asks as a result:
      - Can MCO be made to recover from this situation if it occurs, perhaps through application of a safe default if none exists, such that kubelet would start correctly?
      - Can there possibly be alerting that could indicate and draw attention to the misconfiguration?
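The first ask, falling back to a safe default, could look roughly like the sketch below. This is an illustration of the idea only, not how MCO implements or would implement it. The helper name `apply_default` is hypothetical, and using 1Gi as the default is an assumption based on the value that was set manually on clusterA.

```python
DEFAULT_ES = "1Gi"  # assumption: the value used in the manual fix on clusterA


def apply_default(env_text, key="SYSTEM_RESERVED_ES", default=DEFAULT_ES):
    """Return env_text with `key` set to `default` when it is missing or empty."""
    out = []
    found = False
    for raw in env_text.splitlines():
        name, sep, value = raw.partition("=")
        if sep and name.strip() == key:
            found = True
            # Replace only an empty value; an existing non-empty value is kept.
            if not value.strip().strip('"').strip("'"):
                raw = f"{key}={default}"
        out.append(raw)
    if not found:
        # The key was never set (the clusterB case): append it.
        out.append(f"{key}={default}")
    return "\n".join(out) + "\n"


print(apply_default("SYSTEM_RESERVED_MEMORY=1Gi\nSYSTEM_RESERVED_ES=\n"), end="")
```

The sketch leaves a non-empty value untouched, so an administrator's explicit setting always wins over the fallback.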

      Version-Release number of selected component (if applicable):


      How reproducible:

      We have not been able to reproduce it on a fresh cluster upgrading from 4.11.16 to 4.11.17.

      Expected results:

      If SYSTEM_RESERVED_ES is empty in /etc/node-sizing*env, a default should be applied and/or the kubelet should be able to continue running.

      Additional info:


            harpatil@redhat.com Harshal Patil
            openshift-crt-jira-prow OpenShift Prow Bot
            Sunil Choudhary Sunil Choudhary
            Red Hat Employee