OpenShift Bugs: OCPBUGS-39150

Restarting kubelet beyond systemd start-limit-hit leads to node being stuck in NotReady state and downtime for CNV VMIs


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • None
    • 4.15
    • Node / Kubelet
    • Low
    • No
    • False

      Description of problem:

      If the kubelet systemd service is restarted often enough to hit the systemd start limit (result start-limit-hit; the node runs with the default #DefaultStartLimitIntervalSec=10s), systemd stops the kubelet service and does not start it again, so the OpenShift node is stuck in the NotReady state. This impacts all VMIs running on the node:
      
      [root@cc37-h25-000-r750 ~]# oc get vmis --all-namespaces | grep  cc37-h33-000-r750
      benchmark-runner   windows-vm-a2bb6137-0     7d10h   Running   10.130.1.31    cc37-h33-000-r750   False
      benchmark-runner   windows-vm-a2bb6137-100   7d10h   Running   10.130.1.14    cc37-h33-000-r750   False
      
      Systemd settings on the RHCOS node:
      
      #DefaultRestartSec=100ms
      #DefaultStartLimitIntervalSec=10s
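
The two settings above are commented out, which in systemd's system.conf normally means the compiled-in defaults apply. To see which values systemd actually applies to the kubelet unit (including any drop-in overrides), the unit properties can be queried directly. This is a minimal sketch; `show_start_limits` is a hypothetical helper name, and the property names are the ones `systemctl show` reports on recent systemd versions.

```shell
# Print the start-limit and restart settings systemd applies to a unit.
# show_start_limits is a hypothetical helper, not part of OpenShift or RHCOS.
show_start_limits() {
  local unit
  unit="${1:-kubelet.service}"
  # StartLimitIntervalUSec / StartLimitBurst: the rate-limit window and the
  # number of starts allowed inside it.
  # Restart / RestartUSec: the automatic restart policy and its delay.
  systemctl show "$unit" \
    --property=StartLimitIntervalUSec \
    --property=StartLimitBurst \
    --property=Restart \
    --property=RestartUSec
}
```

Run on the affected node as `show_start_limits kubelet.service`.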
      
      Kubelet service logs:
      systemctl status kubelet
      × kubelet.service - Kubernetes Kubelet
           Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: disabled)
          Drop-In: /etc/systemd/system/kubelet.service.d
                   └─01-kubens.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-on-prem-wait-resolv.conf, 20-logging.conf, 20-nodenet.conf
           Active: failed (Result: start-limit-hit) since Tue 2024-08-27 01:22:44 UTC; 4min 54s ago
         Duration: 1.028s
          Process: 862624 ExecCondition=/bin/bash -c test -f /run/resolv-prepender-kni-conf-done || exit 255 (code=exited, status=0/SUCCESS)
          Process: 862625 ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests (code=exited, status=0/SUCCESS)
          Process: 862628 ExecStartPre=/usr/sbin/restorecon /usr/local/bin/kubenswrapper /usr/bin/kubensenter (code=exited, status=0/SUCCESS)
          Process: 862630 ExecStart=/usr/local/bin/kubenswrapper /usr/bin/kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=>
         Main PID: 862630 (code=exited, status=0/SUCCESS)
              CPU: 2.414s

      Aug 27 01:22:45 cc37-h35-000-r750 systemd[1]: Failed to start Kubernetes Kubelet.
      Aug 27 01:22:45 cc37-h35-000-r750 systemd[1]: kubelet.service: Start request repeated too quickly.
      Aug 27 01:22:45 cc37-h35-000-r750 systemd[1]: kubelet.service: Failed with result 'start-limit-hit'.
      Aug 27 01:22:45 cc37-h35-000-r750 systemd[1]: Failed to start Kubernetes Kubelet.
      Aug 27 01:22:45 cc37-h35-000-r750 systemd[1]: kubelet.service: Start request repeated too quickly.
      Aug 27 01:22:45 cc37-h35-000-r750 systemd[1]: kubelet.service: Failed with result 'start-limit-hit'.
      Aug 27 01:22:45 cc37-h35-000-r750 systemd[1]: Failed to start Kubernetes Kubelet.
      Aug 27 01:22:45 cc37-h35-000-r750 systemd[1]: kubelet.service: Start request repeated too quickly.
      Aug 27 01:22:45 cc37-h35-000-r750 systemd[1]: kubelet.service: Failed with result 'start-limit-hit'.
      Aug 27 01:22:45 cc37-h35-000-r750 systemd[1]: Failed to start Kubernetes Kubelet.
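
Once a unit has failed with 'start-limit-hit', systemd refuses further start requests until the rate-limit counter is cleared. A possible manual recovery, offered as a sketch and not as a fix for this bug, is to reset the failed state and start the unit again. `recover_unit` is a hypothetical helper name; it assumes root access on the node and a systemd version where `systemctl show` supports `--value`.

```shell
# Reset and restart a unit, but only if it failed with start-limit-hit.
# recover_unit is a hypothetical helper, not part of OpenShift or RHCOS.
recover_unit() {
  local unit result
  unit="${1:-kubelet.service}"
  # Result holds the outcome of the last run, e.g. "success" or "start-limit-hit".
  result="$(systemctl show "$unit" --property=Result --value)"
  if [ "$result" = "start-limit-hit" ]; then
    # reset-failed clears the failed state and the start rate-limit counter.
    systemctl reset-failed "$unit"
    systemctl start "$unit"
  else
    echo "$unit result is '$result'; nothing to reset"
  fi
}
```

Run on the affected node as `recover_unit kubelet.service`, then check the node status from a host with cluster access.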
      
      
      

      Version-Release number of selected component (if applicable):

      4.15

      How reproducible:

      Always

      Steps to Reproduce:

          1. Install an OpenShift 4.15 cluster on bare metal
          2. Restart the kubelet on one of the worker nodes multiple times within a 10-second window
          3. Observe the status of the kubelet service and of the affected node
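
The restart step above can be sketched as a small loop. This is an illustration under assumptions: `restart_until_limit` is a hypothetical helper name, and the default of 6 attempts assumes the systemd default burst of 5 starts per interval, which the kubelet unit on a given node may override. It must run as root on the worker node, and the restarts must all land inside the 10-second window for the limit to trigger.

```shell
# Restart a unit repeatedly until systemd refuses, then report its state.
# restart_until_limit is a hypothetical helper, not part of OpenShift or RHCOS.
restart_until_limit() {
  local unit attempts i
  unit="${1:-kubelet.service}"
  attempts="${2:-6}"   # assumption: one more than a start-limit burst of 5
  i=1
  while [ "$i" -le "$attempts" ]; do
    if ! systemctl restart "$unit"; then
      echo "restart $i of $unit was refused"
      break
    fi
    i=$((i + 1))
  done
  # is-active exits non-zero for units that are not active; ignore that here
  # so the helper only reports the state.
  systemctl is-active "$unit" || true
}
```

Usage: `restart_until_limit kubelet.service 6`, followed by `systemctl status kubelet` to confirm the start-limit-hit result.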
          

      Actual results:

      The kubelet fails to start, leaving the node in the NotReady state

      Expected results:

      The kubelet service is running and the node is Ready to run workloads
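
The expected state can be checked from a host with cluster access by reading the node's Ready condition. A minimal sketch, assuming `oc` is logged in to the cluster; `node_is_ready` is a hypothetical helper name.

```shell
# Succeed only if the named node reports the Ready condition as "True".
# node_is_ready is a hypothetical helper, not part of the oc CLI.
node_is_ready() {
  local node status
  node="$1"
  # The jsonpath filter selects the status field of the condition of type Ready.
  status="$(oc get node "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  [ "$status" = "True" ]
}
```

Usage: `node_is_ready cc37-h33-000-r750 && echo Ready || echo NotReady`.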

      Additional info:

      Must-gather, journal logs: https://drive.google.com/drive/folders/1A73Uh0nFyPk9raBCmt4fFgGjwjFxqAYW?usp=sharing

              rh-ee-kwilczyn Krzysztof Wilczyński
              nelluri Naga Ravi Chaitanya Elluri
              Jad Haj Yahya Jad Haj Yahya
              Votes: 0
              Watchers: 4
