  1. OpenShift Bugs
  2. OCPBUGS-5831

Empty/missing node-sizing SYSTEM_RESERVED_ES parameter can result in kubelet not starting


    • Moderate
    • None
    • False

      This is a clone of issue OCPBUGS-4945. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-4805. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-4101. The following is the description of the original issue:

      Description of problem:

      We experienced two separate upgrade failures relating to the introduction of the SYSTEM_RESERVED_ES node sizing parameter, causing kubelet to stop running.
      
      One cluster (clusterA) upgraded from 4.11.14 to 4.11.17. It experienced an issue whereby 
         /etc/node-sizing.env 
      on its master nodes contained an empty SYSTEM_RESERVED_ES value:
      
      ---
      cat /etc/node-sizing.env 
      SYSTEM_RESERVED_MEMORY=5.36Gi
      SYSTEM_RESERVED_CPU=0.11
      SYSTEM_RESERVED_ES=
      ---
      
      This prevented the kubelet from starting. To restore service, the file was manually updated to set a value (1Gi), and kubelet was restarted.
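A rough sketch of that manual fix, written as a small helper so it can be tried on a copy of the file first. The function name `set_es_if_empty` is hypothetical, and 1Gi is just the value we used by hand, not a confirmed default:

```shell
#!/bin/sh
# Hypothetical helper: fill in an empty SYSTEM_RESERVED_ES in a node-sizing env file.
# Usage: set_es_if_empty FILE [VALUE]
set_es_if_empty() {
    file=$1
    value=${2:-1Gi}   # 1Gi is the value used in the manual fix; an assumption, not a known default
    tmp=$(mktemp) || return 1
    # Rewrite only a line that is exactly "SYSTEM_RESERVED_ES=" (empty value);
    # a line that already carries a value is left untouched.
    if ! sed "s/^SYSTEM_RESERVED_ES=\$/SYSTEM_RESERVED_ES=${value}/" "$file" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    cat "$tmp" > "$file"   # overwrite in place so ownership and mode are preserved
    rm -f "$tmp"
}

# On an affected node (as root), the steps would then be roughly:
#   set_es_if_empty /etc/node-sizing.env 1Gi
#   systemctl restart kubelet
```

Note that this only covers the case where the line is present but empty (clusterA); it does not add the line when it is missing entirely.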
      
      We are uncertain what conditions led to this occurring on the clusterA master nodes as part of the upgrade.
      
      A second cluster (clusterB) upgraded from 4.11.16 to 4.11.17. Its worker nodes were impacted by a similar problem; in this case, however, the cause was a custom node-sizing-enabled.env MachineConfig that did not set SYSTEM_RESERVED_ES.
      
      This caused existing worker nodes to go into a NotReady state after the upgrade, and new nodes did not join the cluster because their kubelet was affected in the same way.
      
      For clusterB, the reason the value was empty is better understood.
      
      However, for both clusters, if SYSTEM_RESERVED_ES ends up as empty on a node it can cause the kubelet to not start. 
      
      We have some asks as a result:
      - Can MCO be made to recover from this situation if it occurs, perhaps by applying a safe default if none exists, so that kubelet starts correctly?
      - Can there possibly be alerting that could indicate and draw attention to the misconfiguration?
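To illustrate what we mean by a safe default plus a visible warning, here is a minimal sketch. It assumes the env file is consumed by something shell-like before kubelet is launched; we have not confirmed how the value is actually read. The function name `load_node_sizing` and the 1Gi fallback are our own placeholders:

```shell
#!/bin/sh
# Hypothetical sketch: load a node-sizing env file and fall back to a default
# when SYSTEM_RESERVED_ES is unset or empty, logging a warning so the
# misconfiguration is visible. The 1Gi default is an assumption.
load_node_sizing() {
    env_file=$1
    SYSTEM_RESERVED_ES=          # start from a known state
    if [ -r "$env_file" ]; then
        . "$env_file"            # the file contains plain KEY=VALUE lines
    fi
    if [ -z "${SYSTEM_RESERVED_ES:-}" ]; then
        echo "warning: SYSTEM_RESERVED_ES empty or missing in $env_file; using 1Gi" >&2
        SYSTEM_RESERVED_ES=1Gi
    fi
}
```

With something along these lines, both the clusterA case (key present but empty) and the clusterB case (key never set by the custom MachineConfig) would end up with a usable value, and the warning on stderr could be picked up from the journal as a starting point for alerting.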

      Version-Release number of selected component (if applicable):

      4.11.17

      How reproducible:

      Have not been able to reproduce it on a fresh cluster upgrading from 4.11.16 to 4.11.17

      Expected results:

      If SYSTEM_RESERVED_ES is empty in /etc/node-sizing*env, then a default should be applied and/or kubelet should be able to continue running.

      Additional info:

       

              rphillip@redhat.com Ryan Phillips
              openshift-crt-jira-prow OpenShift Prow Bot
              Sunil Choudhary Sunil Choudhary
              Red Hat Employee
