  1. OpenShift Request For Enhancement
  2. RFE-4751

[RFE] Avoid pushing wrong kubeletconfig CR into the OCP nodes to prevent nodes going into NotReady state


    • Feature Request
    • Resolution: Unresolved
    • Major
    • openshift-4.12, openshift-4.13, openshift-4.14, openshift-4.15, openshift-4.16, openshift-4.17
    • MCO, Node

      Steps to reproduce:

      1. Create a kubeletconfig CR that configures garbage collection for containers and images, with a typo in one eviction signal name:

      # vim kubeletconfig.yaml
      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: worker-kubeconfig 
      spec:
        machineConfigPoolSelector:
          matchLabels:
            pools.operator.machineconfiguration.openshift.io/worker: "" 
        kubeletConfig:
          evictionSoft: 
            memory.available: "500Mi" 
            nodesfs.available: "10%"   # typo: should be nodefs.available
            nodefs.inodesFree: "5%"
            imagefs.available: "15%"
            imagefs.inodesFree: "10%"
          evictionSoftGracePeriod:  
            memory.available: "1m30s"
            nodefs.available: "1m30s"
            nodefs.inodesFree: "1m30s"
            imagefs.available: "1m30s"
            imagefs.inodesFree: "1m30s"
          evictionHard: 
            memory.available: "200Mi"
            nodefs.available: "5%"
            nodefs.inodesFree: "4%"
            imagefs.available: "10%"
            imagefs.inodesFree: "5%"
          evictionPressureTransitionPeriod: 0s 
          imageMinimumGCAge: 5m 
          imageGCHighThresholdPercent: 80 
          imageGCLowThresholdPercent: 75 
      

      2. Apply the kubeletconfig CR:

      # oc apply -f kubeletconfig.yaml

      3. Check the MCP progress and node state:

      $ oc get mcp worker
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      worker   rendered-worker-3aa629bea780edf9b271a37fb54dc00f   False     True       False      2              0                   0                     0                      19h
      $ oc get nodes worker01
      worker01   NotReady,SchedulingDisabled   worker                 19h   v1.27.6+1648878
      
      
      

      4. Check the kubelet logs:

      Oct 04 13:36:36 worker01 kubenswrapper[8431]: E1004 13:36:36.418493    8431 run.go:74] "command failed" err="failed to run Kubelet: failed to create kubelet: unsupported eviction signal nodesfs.available"
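
When triaging many nodes, the offending key can be pulled out of the kubelet log line shown in step 4. The following is a minimal sketch only: the regular expression is inferred from that single log line, and the function name `unsupported_signal` is hypothetical.

```python
import re
from typing import Optional

# Pattern inferred from the one log line quoted in step 4; other kubelet
# versions may word this error differently.
UNSUPPORTED_SIGNAL = re.compile(r"unsupported eviction signal ([\w.]+)")


def unsupported_signal(log_line: str) -> Optional[str]:
    """Return the rejected eviction signal name, or None if the line has none."""
    match = UNSUPPORTED_SIGNAL.search(log_line)
    return match.group(1) if match else None


LOG_LINE = (
    'Oct 04 13:36:36 worker01 kubenswrapper[8431]: E1004 13:36:36.418493    '
    '8431 run.go:74] "command failed" err="failed to run Kubelet: failed to '
    'create kubelet: unsupported eviction signal nodesfs.available"'
)

print(unsupported_signal(LOG_LINE))
```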

      The objective of this RFE:

      A dry-run validation should be performed before the changes are pushed to the nodes, to ensure that the kubeletconfig CR will not leave the kubelet in a dead state.
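
As an illustration of the kind of check requested, the sketch below validates eviction signal names on the client side before the CR is applied. It is not how the MCO works today. The `KNOWN_SIGNALS` set is an assumption based on commonly documented kubelet eviction signals and may differ between kubelet versions, and the function name `find_unsupported_signals` is hypothetical. The input is the `spec.kubeletConfig` section, already loaded into a dict.

```python
from difflib import get_close_matches

# Assumed list of eviction signals; the authoritative set depends on the
# kubelet version running on the nodes.
KNOWN_SIGNALS = {
    "memory.available",
    "allocatableMemory.available",
    "nodefs.available",
    "nodefs.inodesFree",
    "imagefs.available",
    "imagefs.inodesFree",
    "pid.available",
}

# KubeletConfiguration fields whose keys are eviction signal names.
EVICTION_FIELDS = (
    "evictionSoft",
    "evictionSoftGracePeriod",
    "evictionHard",
    "evictionMinimumReclaim",
)


def find_unsupported_signals(kubelet_config):
    """Return (field, bad_signal, suggestion) for every unknown signal key."""
    problems = []
    for field in EVICTION_FIELDS:
        for signal in kubelet_config.get(field) or {}:
            if signal not in KNOWN_SIGNALS:
                hint = get_close_matches(signal, sorted(KNOWN_SIGNALS), n=1)
                problems.append((field, signal, hint[0] if hint else None))
    return problems


# Trimmed-down version of the CR from the reproduction steps.
config = {
    "evictionSoft": {"memory.available": "500Mi", "nodesfs.available": "10%"},
    "evictionHard": {"nodefs.available": "5%"},
}

print(find_unsupported_signals(config))
# → [('evictionSoft', 'nodesfs.available', 'nodefs.available')]
```

A check of this shape only catches misspelled signal names. It does not validate threshold values or any other kubelet setting.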

      Impact:

      Any typo in the kubeletconfig CR can put nodes into a NotReady state. The impact is severe on clusters running 100+ worker nodes, where the MachineConfigPool applies changes to the nodes in batches of 5 or 10 at a time.

              rhn-support-mrussell Mark Russell
              rhn-support-dpateriy Divyam Pateriya
              Votes: 10
              Watchers: 3