OpenShift Bugs / OCPBUGS-37845

Failed to start Kubernetes Kubelet - service remains in a long restart loop until it manages to stay running


      Description of problem:

      On IPI installations, the kubelet service restarts many times before it manages to stay running.
          

      Version-Release number of selected component (if applicable):

      4.14 nightly 2024-08-01 08:25
          

      How reproducible:

      Most of the time, after a node boots
          

      Steps to Reproduce:

          1. Install OCP 4.14 in IPI baremetal
          2. Apply a MachineConfig (MC) change that triggers a reboot. A node might remain in NotReady status for a while before it eventually becomes Ready.
          3. Look at the journal logs on the affected node: they show the kubelet service restarting in a loop.
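
To make step 3 easier to repeat, here is a minimal sketch of a helper that pulls the latest systemd restart counter for `kubelet.service` out of journal text. The function name `kubelet_restart_counter` is hypothetical and not part of the original report. It assumes the journal uses the same message wording as the excerpt under "Additional info".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original report).
# Reads journal text on stdin and prints the most recent restart counter
# that systemd logged for kubelet.service. Prints nothing if no such line exists.
kubelet_restart_counter() {
  sed -n 's/.*kubelet\.service: Scheduled restart job, restart counter is at \([0-9][0-9]*\)\..*/\1/p' \
    | tail -n 1
}

# Usage on a node (assumes SSH access as the core user, as in the report):
#   sudo journalctl -b --no-pager | kubelet_restart_counter
```

Because `-b` limits `journalctl` to the current boot, the printed value should correspond to the restart loop that followed the most recent reboot only.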
          

      Actual results:

      The kubelet service fails to start and keeps retrying in a restart loop for an extended period before it finally stays running.
          

      Expected results:

      The kubelet service should start without errors and should not enter a restart loop.
          

      Additional info:

      # Some nodes needed hundreds of kubelet restarts before the service finally stayed running
      $ oc get nodes
      NAME       STATUS                        ROLES                  AGE     VERSION
      master-0   Ready                         control-plane,master   168m    v1.27.15+6147456
      master-1   NotReady,SchedulingDisabled   control-plane,master   3h43m   v1.27.15+6147456
      master-2   Ready                         control-plane,master   3h44m   v1.27.15+6147456
      worker-0   Ready                         worker                 3h1m    v1.27.15+6147456
      worker-1   NotReady,SchedulingDisabled   worker                 125m    v1.27.15+6147456
      worker-2   Ready                         worker                 3h1m    v1.27.15+6147456
      worker-3   Ready                         worker                 3h1m    v1.27.15+6147456
      
      $ ssh core@worker-1
      Warning: Permanently added 'worker-1,192.168.62.25' (ECDSA) to the list of known hosts.                            
      Red Hat Enterprise Linux CoreOS 414.92.202407300859-0                            
        Part of OpenShift 4.14, RHCOS is a Kubernetes native operating system
        managed by the Machine Config Operator (`clusteroperator/machine-config`).                                          
                                                                                       
      WARNING: Direct SSH access to machines is not recommended; instead,
      make configuration changes via `machineconfig` objects:                                                                                                                                                                                      
        https://docs.openshift.com/container-platform/4.14/architecture/architecture-rhcos.html
      
      ---
      [systemd]
      Failed Units: 3
        NetworkManager-wait-online.service
        on-prem-resolv-prepender.service
        systemd-network-generator.service
      
      [core@worker-1 ~]$ sudo journalctl -f                 
      Aug 01 15:44:05 worker-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
      Aug 01 15:44:05 worker-1 systemd[1]: Failed to start Kubernetes Kubelet.
      Aug 01 15:44:15 worker-1 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 315.       
      Aug 01 15:44:15 worker-1 systemd[1]: Stopped Kubernetes Kubelet.
      Aug 01 15:44:15 worker-1 systemd[1]: Starting Kubernetes Kubelet...                        
      Aug 01 15:44:15 worker-1 systemd[1]: kubelet.service: Condition check process exited, code=exited, status=255/EXCEPTION
      Aug 01 15:44:15 worker-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
      Aug 01 15:44:15 worker-1 systemd[1]: Failed to start Kubernetes Kubelet.
      Aug 01 15:44:19 worker-1 sudo[6948]:     core : TTY=pts/0 ; PWD=/var/home/core ; USER=root ; COMMAND=/bin/journalctl -f
      Aug 01 15:44:19 worker-1 sudo[6948]: pam_unix(sudo:session): session opened for user root(uid=0) by core(uid=1000)
      Aug 01 15:44:26 worker-1 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 316.
      Aug 01 15:44:26 worker-1 systemd[1]: Stopped Kubernetes Kubelet.
      Aug 01 15:44:26 worker-1 systemd[1]: Starting Kubernetes Kubelet...                                      
      Aug 01 15:44:26 worker-1 systemd[1]: kubelet.service: Condition check process exited, code=exited, status=255/EXCEPTION
      Aug 01 15:44:26 worker-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
      Aug 01 15:44:26 worker-1 systemd[1]: Failed to start Kubernetes Kubelet.
      ...
      
      [core@worker-1 ~]$ uptime 
       20:18:01 up  3:01,  1 user,  load average: 0.35, 0.35, 0.40
      [core@worker-1 ~]$ last
      core     pts/0        192.168.62.20    Thu Aug  1 20:17   still logged in
      reboot   system boot  5.14.0-284.77.1. Thu Aug  1 17:16   still running
      core     pts/0        192.168.62.20    Thu Aug  1 15:44 - 15:51  (00:07)
      reboot   system boot  5.14.0-284.77.1. Thu Aug  1 14:48 - 17:15  (02:26)
      reboot   system boot  5.14.0-284.73.1. Thu Aug  1 14:44 - 14:46  (00:02)
      
      wtmp begins Thu Aug  1 14:44:51 2024
      [core@worker-1 ~]$ sudo journalctl | grep -c "Failed to start Kubernetes Kubelet"
      651
      
      $ ssh core@master-0
      [core@master-0 ~]$ sudo journalctl | grep -c "Failed to start Kubernetes Kubelet"
      650
      
      
      $ oc get nodes
      NAME       STATUS   ROLES                  AGE     VERSION
      master-0   Ready    control-plane,master   4h51m   v1.27.15+6147456
      master-1   Ready    control-plane,master   5h47m   v1.27.15+6147456
      master-2   Ready    control-plane,master   5h47m   v1.27.15+6147456
      worker-0   Ready    worker                 5h4m    v1.27.15+6147456
      worker-1   Ready    worker                 4h8m    v1.27.15+6147456
      worker-2   Ready    worker                 5h4m    v1.27.15+6147456
      worker-3   Ready    worker                 5h4m    v1.27.15+6147456
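
The `grep -c` counts above can be wrapped into small helpers so the same check is easy to repeat per node or per boot. This is a sketch only: the function names are hypothetical, and the second pattern is taken from the `Condition check process exited` lines visible in the excerpt. Comparing the two counts is one way to check whether the failed starts in the rest of the journal are accompanied by the same condition check failure. Note that `journalctl` without `-b` reads every boot stored in the journal, so a total such as 651 may include failures from earlier boots.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the original report).
# Both read journal text on stdin and print a single number.

# Counts lines reporting that the kubelet unit failed to start.
count_kubelet_start_failures() {
  # grep -c prints 0 but exits non-zero when nothing matches; "|| true" hides that.
  grep -c 'Failed to start Kubernetes Kubelet' || true
}

# Counts lines where the unit's condition check process exited.
count_kubelet_condition_failures() {
  grep -c 'kubelet.service: Condition check process exited' || true
}

# Usage on a node (assumes SSH access as the core user, as in the report):
#   sudo journalctl -b --no-pager | count_kubelet_start_failures
#   sudo journalctl -b --no-pager | count_kubelet_condition_failures
```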
          

            mkowalsk@redhat.com Mat Kowalski
            rhn-gps-manrodri Manuel Rodriguez
            Sunil Choudhary Sunil Choudhary
            Manuel Rodriguez