-
Bug
-
Resolution: Done
-
Normal
-
None
-
4.14.z
-
None
-
False
-
Description of problem:
On an IPI cluster, the kubelet service restarts multiple times until it manages to stay running.
Version-Release number of selected component (if applicable):
4.14 nightly 2024-08-01 08:25
How reproducible:
Most of the time, after a node boots
Steps to Reproduce:
1. Install OCP 4.14 with IPI on bare metal.
2. Apply a MachineConfig (MC) change that triggers a reboot. The rebooted node might remain in NotReady status for a while before it eventually becomes Ready.
3. Check the journal logs on the node: the kubelet service is restarting in a loop.
Actual results:
The kubelet service fails to start and stays in a restart loop for a while (hundreds of attempts on some nodes) before it finally remains running.
Expected results:
The kubelet service should start without errors and should not enter a restart loop.
Additional info:
# Here we can see some nodes took hundreds of kubelet restarts to finally remain running
$ oc get nodes
NAME       STATUS                        ROLES                  AGE     VERSION
master-0   Ready                         control-plane,master   168m    v1.27.15+6147456
master-1   NotReady,SchedulingDisabled   control-plane,master   3h43m   v1.27.15+6147456
master-2   Ready                         control-plane,master   3h44m   v1.27.15+6147456
worker-0   Ready                         worker                 3h1m    v1.27.15+6147456
worker-1   NotReady,SchedulingDisabled   worker                 125m    v1.27.15+6147456
worker-2   Ready                         worker                 3h1m    v1.27.15+6147456
worker-3   Ready                         worker                 3h1m    v1.27.15+6147456

$ ssh core@worker-1
Warning: Permanently added 'worker-1,192.168.62.25' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 414.92.202407300859-0
  Part of OpenShift 4.14, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.14/architecture/architecture-rhcos.html

---
[systemd]
Failed Units: 3
  NetworkManager-wait-online.service
  on-prem-resolv-prepender.service
  systemd-network-generator.service

[core@worker-1 ~]$ sudo journalctl -f
Aug 01 15:44:05 worker-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Aug 01 15:44:05 worker-1 systemd[1]: Failed to start Kubernetes Kubelet.
Aug 01 15:44:15 worker-1 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 315.
Aug 01 15:44:15 worker-1 systemd[1]: Stopped Kubernetes Kubelet.
Aug 01 15:44:15 worker-1 systemd[1]: Starting Kubernetes Kubelet...
Aug 01 15:44:15 worker-1 systemd[1]: kubelet.service: Condition check process exited, code=exited, status=255/EXCEPTION
Aug 01 15:44:15 worker-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Aug 01 15:44:15 worker-1 systemd[1]: Failed to start Kubernetes Kubelet.
Aug 01 15:44:19 worker-1 sudo[6948]: core : TTY=pts/0 ; PWD=/var/home/core ; USER=root ; COMMAND=/bin/journalctl -f
Aug 01 15:44:19 worker-1 sudo[6948]: pam_unix(sudo:session): session opened for user root(uid=0) by core(uid=1000)
Aug 01 15:44:26 worker-1 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 316.
Aug 01 15:44:26 worker-1 systemd[1]: Stopped Kubernetes Kubelet.
Aug 01 15:44:26 worker-1 systemd[1]: Starting Kubernetes Kubelet...
Aug 01 15:44:26 worker-1 systemd[1]: kubelet.service: Condition check process exited, code=exited, status=255/EXCEPTION
Aug 01 15:44:26 worker-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Aug 01 15:44:26 worker-1 systemd[1]: Failed to start Kubernetes Kubelet.
...

[core@worker-1 ~]$ uptime
 20:18:01 up 3:01, 1 user, load average: 0.35, 0.35, 0.40

[core@worker-1 ~]$ last
core     pts/0        192.168.62.20    Thu Aug  1 20:17   still logged in
reboot   system boot  5.14.0-284.77.1. Thu Aug  1 17:16   still running
core     pts/0        192.168.62.20    Thu Aug  1 15:44 - 15:51  (00:07)
reboot   system boot  5.14.0-284.77.1. Thu Aug  1 14:48 - 17:15  (02:26)
reboot   system boot  5.14.0-284.73.1. Thu Aug  1 14:44 - 14:46  (00:02)

wtmp begins Thu Aug  1 14:44:51 2024

[core@worker-1 ~]$ sudo journalctl | grep -c "Failed to start Kubernetes Kubelet"
651

$ ssh core@master-0
[core@master-0 ~]$ sudo journalctl | grep -c "Failed to start Kubernetes Kubelet"
650

$ oc get nodes
NAME       STATUS   ROLES                  AGE     VERSION
master-0   Ready    control-plane,master   4h51m   v1.27.15+6147456
master-1   Ready    control-plane,master   5h47m   v1.27.15+6147456
master-2   Ready    control-plane,master   5h47m   v1.27.15+6147456
worker-0   Ready    worker                 5h4m    v1.27.15+6147456
worker-1   Ready    worker                 4h8m    v1.27.15+6147456
worker-2   Ready    worker                 5h4m    v1.27.15+6147456
worker-3   Ready    worker                 5h4m    v1.27.15+6147456
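To compare nodes more quickly than with the grep counts above, the journal can be summarized in one pass. The following is only a rough sketch: `summarize_kubelet_restarts` is a hypothetical helper name, not part of OpenShift or RHCOS tooling, and it assumes the journal lines have the same wording as the ones captured in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with OpenShift/RHCOS): reads journal text
# on stdin and prints how many kubelet start failures were logged, plus the
# highest systemd restart counter seen. Assumes the message wording matches
# the journal excerpts in this report.
summarize_kubelet_restarts() {
  awk '
    # One line per failed start attempt.
    /Failed to start Kubernetes Kubelet/ { failed++ }

    # systemd logs the running counter as the last field, e.g. "... is at 316."
    /kubelet\.service: Scheduled restart job, restart counter is at/ {
      n = $NF
      sub(/\.$/, "", n)          # drop the trailing period
      if (n + 0 > max) max = n + 0
    }

    END { printf "failed_starts=%d max_restart_counter=%d\n", failed, max }
  '
}
```

On a node, it could be fed from the current boot's journal, for example `sudo journalctl -b --no-pager | summarize_kubelet_restarts`. Restricting to the current boot is a choice made here; the counts in this report were taken over the whole journal.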
- duplicates
-
OCPBUGS-37769 MCD degraded on content mismatch for resolv-prepender script
- Closed