Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.14.z
Component/s: Networking / On-Prem Host Networking
Labels:
- triaged

Regression: None
Blocked: False
Blocked Reason: None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

On an IPI installation, the kubelet service restarts many times before it manages to stay running.

Version-Release number of selected component (if applicable):

4.14 nightly 2024-08-01 08:25

How reproducible:

Most of the time, after a node boots

Steps to Reproduce:

    1. Install OCP 4.14 in IPI baremetal
    2. Apply a MachineConfig (MC) change that triggers a reboot. A node may remain in NotReady status for a while before it eventually becomes Ready.
    3. Look at the journal logs on the affected node: the kubelet service is restarting in a loop.
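
To spot the affected nodes in step 2, the node list can be filtered on its STATUS column. This is a minimal sketch, not part of the original report: the helper name `not_ready_nodes` is hypothetical, and it assumes input in the column layout that `oc get nodes --no-headers` prints.

```shell
# Hypothetical helper: print the names of nodes whose STATUS column
# contains "NotReady" (for example "NotReady,SchedulingDisabled").
# Reads `oc get nodes --no-headers` style output on stdin.
not_ready_nodes() {
  # $1 is the node name, $2 is the comma-separated status list.
  awk '$2 ~ /(^|,)NotReady(,|$)/ { print $1 }'
}
```

Assumed usage against a live cluster: `oc get nodes --no-headers | not_ready_nodes`.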

Actual results:

The kubelet service fails to start and keeps retrying in a restart loop for a while.

Expected results:

The kubelet service should start without errors and should not enter a restart loop.

Additional info:

# Some nodes needed hundreds of kubelet restarts before the service finally stayed running
$ oc get nodes
NAME       STATUS                        ROLES                  AGE     VERSION
master-0   Ready                         control-plane,master   168m    v1.27.15+6147456
master-1   NotReady,SchedulingDisabled   control-plane,master   3h43m   v1.27.15+6147456
master-2   Ready                         control-plane,master   3h44m   v1.27.15+6147456
worker-0   Ready                         worker                 3h1m    v1.27.15+6147456
worker-1   NotReady,SchedulingDisabled   worker                 125m    v1.27.15+6147456
worker-2   Ready                         worker                 3h1m    v1.27.15+6147456
worker-3   Ready                         worker                 3h1m    v1.27.15+6147456

$ ssh core@worker-1
Warning: Permanently added 'worker-1,192.168.62.25' (ECDSA) to the list of known hosts.                            
Red Hat Enterprise Linux CoreOS 414.92.202407300859-0                            
  Part of OpenShift 4.14, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).                                          
                                                                                 
WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:                                                                                                                                                                                      
  https://docs.openshift.com/container-platform/4.14/architecture/architecture-rhcos.html

---
[systemd]
Failed Units: 3
  NetworkManager-wait-online.service
  on-prem-resolv-prepender.service
  systemd-network-generator.service

[core@worker-1 ~]$ sudo journalctl -f                 
Aug 01 15:44:05 worker-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Aug 01 15:44:05 worker-1 systemd[1]: Failed to start Kubernetes Kubelet.
Aug 01 15:44:15 worker-1 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 315.       
Aug 01 15:44:15 worker-1 systemd[1]: Stopped Kubernetes Kubelet.
Aug 01 15:44:15 worker-1 systemd[1]: Starting Kubernetes Kubelet...                        
Aug 01 15:44:15 worker-1 systemd[1]: kubelet.service: Condition check process exited, code=exited, status=255/EXCEPTION
Aug 01 15:44:15 worker-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Aug 01 15:44:15 worker-1 systemd[1]: Failed to start Kubernetes Kubelet.
Aug 01 15:44:19 worker-1 sudo[6948]:     core : TTY=pts/0 ; PWD=/var/home/core ; USER=root ; COMMAND=/bin/journalctl -f
Aug 01 15:44:19 worker-1 sudo[6948]: pam_unix(sudo:session): session opened for user root(uid=0) by core(uid=1000)
Aug 01 15:44:26 worker-1 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 316.
Aug 01 15:44:26 worker-1 systemd[1]: Stopped Kubernetes Kubelet.
Aug 01 15:44:26 worker-1 systemd[1]: Starting Kubernetes Kubelet...                                      
Aug 01 15:44:26 worker-1 systemd[1]: kubelet.service: Condition check process exited, code=exited, status=255/EXCEPTION
Aug 01 15:44:26 worker-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Aug 01 15:44:26 worker-1 systemd[1]: Failed to start Kubernetes Kubelet.
...

[core@worker-1 ~]$ uptime 
 20:18:01 up  3:01,  1 user,  load average: 0.35, 0.35, 0.40
[core@worker-1 ~]$ last
core     pts/0        192.168.62.20    Thu Aug  1 20:17   still logged in
reboot   system boot  5.14.0-284.77.1. Thu Aug  1 17:16   still running
core     pts/0        192.168.62.20    Thu Aug  1 15:44 - 15:51  (00:07)
reboot   system boot  5.14.0-284.77.1. Thu Aug  1 14:48 - 17:15  (02:26)
reboot   system boot  5.14.0-284.73.1. Thu Aug  1 14:44 - 14:46  (00:02)

wtmp begins Thu Aug  1 14:44:51 2024
[core@worker-1 ~]$ sudo journalctl | grep -c "Failed to start Kubernetes Kubelet"
651

$ ssh core@master-0
[core@master-0 ~]$ sudo journalctl | grep -c "Failed to start Kubernetes Kubelet"
650


$ oc get nodes
NAME       STATUS   ROLES                  AGE     VERSION
master-0   Ready    control-plane,master   4h51m   v1.27.15+6147456
master-1   Ready    control-plane,master   5h47m   v1.27.15+6147456
master-2   Ready    control-plane,master   5h47m   v1.27.15+6147456
worker-0   Ready    worker                 5h4m    v1.27.15+6147456
worker-1   Ready    worker                 4h8m    v1.27.15+6147456
worker-2   Ready    worker                 5h4m    v1.27.15+6147456
worker-3   Ready    worker                 5h4m    v1.27.15+6147456
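
The counts above came from `grep -c` over the whole journal. As a rough sketch, the failed-start count and the highest systemd restart counter can be pulled out in one pass. The function name `summarize_kubelet_restarts` is hypothetical, and it assumes journal lines worded exactly like the excerpts in this report.

```shell
# Hypothetical helper: summarize kubelet restart activity from journal
# text on stdin. Matches the two message formats seen in this report.
summarize_kubelet_restarts() {
  awk '
    /Failed to start Kubernetes Kubelet/ { failed++ }
    /kubelet\.service: Scheduled restart job, restart counter is at [0-9]+/ {
      n = $NF              # last field, e.g. "315."
      sub(/\.$/, "", n)    # drop the trailing period
      if (n + 0 > max) max = n + 0
    }
    END { printf "failed_starts=%d max_restart_counter=%d\n", failed, max }
  '
}
```

Assumed usage on a node: `sudo journalctl --no-pager | summarize_kubelet_restarts`. Adding `-b` limits the input to the current boot, which may be more useful here because the restart counter resets on reboot while the grep count spans all boots.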

duplicates

OCPBUGS-37769 MCD degraded on content mismatch for resolv-prepender script

Closed

Assignee: Mat Kowalski

Reporter: Manuel Rodriguez

QA Contact: Sunil Choudhary

Need Info From: Manuel Rodriguez

Votes: 0

Watchers: 5

Created: 2024/08/01 8:37 PM

Updated: 2024/08/13 7:22 AM

Resolved: 2024/08/13 7:22 AM
