Bug
Resolution: Done
Normal
4.14.z
Quality / Stability / Reliability
False
Description of problem:
On baremetal IPI nodes, the kubelet service restarts multiple times until it manages to stay running.
Version-Release number of selected component (if applicable):
4.14 nightly 2024-08-01 08:25
How reproducible:
Most of the time, after a node boots
Steps to Reproduce:
1. Install OCP 4.14 using the baremetal IPI workflow.
2. Apply a MachineConfig (MC) change that triggers a reboot (a sketch of such a change is shown after this list). A node may remain in NotReady status for a while before eventually becoming Ready.
3. Look at the journal logs on the affected node; the kubelet service is restarting in a loop.
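A minimal sketch of an MC change for step 2. The name "99-worker-reboot-test" and the written file are made up for illustration; any MachineConfig change that the MCO rolls out with a node reboot should do.
$ cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-reboot-test
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/reboot-test-marker
          mode: 420
          contents:
            source: data:,reboot-test
EOF
$ oc get mcp worker -w   # watch the pool roll out; nodes are cordoned, drained and rebooted one by one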
Actual results:
The kubelet service fails to start and keeps retrying in a restart loop for a while before it finally stays running.
Expected results:
The kubelet service should start without errors and should not enter a restart loop.
Additional info:
# Here we can see that some nodes took hundreds of kubelet restarts before the service finally stayed running
$ oc get nodes
NAME       STATUS                        ROLES                  AGE     VERSION
master-0   Ready                         control-plane,master   168m    v1.27.15+6147456
master-1   NotReady,SchedulingDisabled   control-plane,master   3h43m   v1.27.15+6147456
master-2   Ready                         control-plane,master   3h44m   v1.27.15+6147456
worker-0   Ready                         worker                 3h1m    v1.27.15+6147456
worker-1   NotReady,SchedulingDisabled   worker                 125m    v1.27.15+6147456
worker-2   Ready                         worker                 3h1m    v1.27.15+6147456
worker-3   Ready                         worker                 3h1m    v1.27.15+6147456
$ ssh core@worker-1
Warning: Permanently added 'worker-1,192.168.62.25' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 414.92.202407300859-0
Part of OpenShift 4.14, RHCOS is a Kubernetes native operating system
managed by the Machine Config Operator (`clusteroperator/machine-config`).
WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
https://docs.openshift.com/container-platform/4.14/architecture/architecture-rhcos.html
---
[systemd]
Failed Units: 3
NetworkManager-wait-online.service
on-prem-resolv-prepender.service
systemd-network-generator.service
[core@worker-1 ~]$ sudo journalctl -f
Aug 01 15:44:05 worker-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Aug 01 15:44:05 worker-1 systemd[1]: Failed to start Kubernetes Kubelet.
Aug 01 15:44:15 worker-1 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 315.
Aug 01 15:44:15 worker-1 systemd[1]: Stopped Kubernetes Kubelet.
Aug 01 15:44:15 worker-1 systemd[1]: Starting Kubernetes Kubelet...
Aug 01 15:44:15 worker-1 systemd[1]: kubelet.service: Condition check process exited, code=exited, status=255/EXCEPTION
Aug 01 15:44:15 worker-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Aug 01 15:44:15 worker-1 systemd[1]: Failed to start Kubernetes Kubelet.
Aug 01 15:44:19 worker-1 sudo[6948]: core : TTY=pts/0 ; PWD=/var/home/core ; USER=root ; COMMAND=/bin/journalctl -f
Aug 01 15:44:19 worker-1 sudo[6948]: pam_unix(sudo:session): session opened for user root(uid=0) by core(uid=1000)
Aug 01 15:44:26 worker-1 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 316.
Aug 01 15:44:26 worker-1 systemd[1]: Stopped Kubernetes Kubelet.
Aug 01 15:44:26 worker-1 systemd[1]: Starting Kubernetes Kubelet...
Aug 01 15:44:26 worker-1 systemd[1]: kubelet.service: Condition check process exited, code=exited, status=255/EXCEPTION
Aug 01 15:44:26 worker-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Aug 01 15:44:26 worker-1 systemd[1]: Failed to start Kubernetes Kubelet.
...
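# Not captured in this session, but the "Condition check process exited" message points at an ExecCondition= in the kubelet unit (or one of its drop-ins). A sketch of standard systemd commands worth gathering on the affected node to see which check is failing and the state of the failed units from the login banner:
$ sudo systemctl cat kubelet.service | grep -E 'Exec|Condition'
$ sudo systemctl status kubelet.service --no-pager --full
$ sudo systemctl status NetworkManager-wait-online.service on-prem-resolv-prepender.service systemd-network-generator.service --no-pager --full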
[core@worker-1 ~]$ uptime
20:18:01 up 3:01, 1 user, load average: 0.35, 0.35, 0.40
[core@worker-1 ~]$ last
core     pts/0         192.168.62.20     Thu Aug  1 20:17   still logged in
reboot   system boot   5.14.0-284.77.1.  Thu Aug  1 17:16   still running
core     pts/0         192.168.62.20     Thu Aug  1 15:44 - 15:51  (00:07)
reboot   system boot   5.14.0-284.77.1.  Thu Aug  1 14:48 - 17:15  (02:26)
reboot   system boot   5.14.0-284.73.1.  Thu Aug  1 14:44 - 14:46  (00:02)

wtmp begins Thu Aug  1 14:44:51 2024
[core@worker-1 ~]$ sudo journalctl | grep -c "Failed to start Kubernetes Kubelet"
651
$ ssh core@master-0
[core@master-0 ~]$ sudo journalctl | grep -c "Failed to start Kubernetes Kubelet"
650
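# The per-node counts above were gathered over SSH; a sketch of the same check without SSH, using oc debug (node name taken from the cluster above; the second command prints systemd's restart counter for the unit):
$ oc debug node/worker-1 -- chroot /host journalctl --no-pager | grep -c "Failed to start Kubernetes Kubelet"
$ oc debug node/worker-1 -- chroot /host systemctl show kubelet.service -p NRestarts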
$ oc get nodes
NAME       STATUS   ROLES                  AGE     VERSION
master-0   Ready    control-plane,master   4h51m   v1.27.15+6147456
master-1   Ready    control-plane,master   5h47m   v1.27.15+6147456
master-2   Ready    control-plane,master   5h47m   v1.27.15+6147456
worker-0   Ready    worker                 5h4m    v1.27.15+6147456
worker-1   Ready    worker                 4h8m    v1.27.15+6147456
worker-2   Ready    worker                 5h4m    v1.27.15+6147456
worker-3   Ready    worker                 5h4m    v1.27.15+6147456
Duplicates: OCPBUGS-37769 - MCD degraded on content mismatch for resolv-prepender script (Closed)