Bug
Resolution: Duplicate
4.12.z, 4.11.z
Quality / Stability / Reliability
Description of problem:
Cluster's worker nodes go "NotReady" with a Reason of:

Ready False Mon, 21 Aug 2023 10:27:20 -0400 Mon, 07 Aug 2023 10:21:57 -0400 KubeletNotReady PLEG is not healthy: pleg was last seen active 336h8m26.816166564s ago; threshold is 3m0s

From an investigation into the 4.12.16 cluster (all logs and nodes had been rolled before we could dive into the 4.11 cluster for troubleshooting), we found the following errors in the kubelet journal:

Aug 23 17:42:32 ip-10-29-175-200 kubenswrapper[2931940]: E0823 17:42:32.602204 2931940 kuberuntime_sandbox.go:297] "Failed to list pod sandboxes" err="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (16784520 vs. 16777216)"
Aug 23 17:42:32 ip-10-29-175-200 kubenswrapper[2931940]: E0823 17:42:32.602219 2931940 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (16784520 vs. 16777216)"
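The errors above point at the gRPC receive cap on the kubelet's CRI client: the ListPodSandbox reply (16784520 bytes) exceeds the 16 MiB (16777216 bytes) limit, so the call fails with ResourceExhausted and the PLEG can never relist pods. Below is a minimal Go sketch (not the kubelet's actual code) of a CRI client configured with the same receive cap; the CRI-O socket path and the use of ListPodSandbox here are assumptions for illustration.

// Minimal sketch of a CRI client with a 16 MiB gRPC receive cap.
// The socket path is an assumption; adjust for the environment.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

const maxMsgSize = 1024 * 1024 * 16 // 16777216 bytes, the ceiling reported in the error

func main() {
	conn, err := grpc.Dial(
		"unix:///var/run/crio/crio.sock", // assumed CRI-O socket path
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Any response larger than this is rejected client-side with ResourceExhausted.
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(maxMsgSize)),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// With enough sandboxes on the node, this reply exceeds 16 MiB and the call
	// fails exactly like the kubelet's "Failed to list pod sandboxes" error above.
	resp, err := client.ListPodSandbox(ctx, &runtimeapi.ListPodSandboxRequest{})
	if err != nil {
		fmt.Println("ListPodSandbox failed:", err)
		return
	}
	fmt.Println("sandboxes:", len(resp.Items))
}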
Version-Release number of selected component (if applicable):
Observed on Cluster versions: 4.11.37 4.12.16
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Additionally, we saw the following types of logs in the crio journal:
sh-4.4# journalctl -u crio
Aug 23 17:27:48 ip-10-29-175-200 crio[2931480]: time="2023-08-23 17:27:48.300063956Z" level=warning msg="Could not restore sandbox IP for e1c24c1e81fd30bd3cb4bf4985bc5d108b365ec5e2e1f9cd4ef4025bacdf7892: failed to get network status for pod sandbox k8s_com>
Aug 23 17:27:48 ip-10-29-175-200 crio[2931480]: time="2023-08-23 17:27:48.300081373Z" level=error msg="error checking loopback interface: failed to Statfs \"\": no such file or directory"
Aug 23 17:27:48 ip-10-29-175-200 crio[2931480]: time="2023-08-23 17:27:48.300089332Z" level=warning msg="Could not restore sandbox IP for d6aa36378a16ab3c4d8cab4dace6a8e4d108d043a73b7c8196aec59db9b3d4ff: failed to get network status for pod sandbox k8s_ope>
Aug 23 17:27:48 ip-10-29-175-200 crio[2931480]: time="2023-08-23 17:27:48.300102298Z" level=error msg="error checking loopback interface: failed to Statfs \"\": no such file or directory"
sh-4.4# crictl statsp
E0823 17:17:08.707058 2924077 remote_runtime.go:1148] "ListPodSandboxStats with filter from runtime service failed" err="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (16811908 vs. 16777216)" filter="&PodSandboxStatsFilter{Id:,LabelSelector:map[string]string{},}"
FATA[0000] get pod stats: display pod stats: list pod sandbox stats: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (16811908 vs. 16777216)
sh-4.4# crictl stats
CONTAINER CPU % MEM DISK INODES
071d960b1ff1a 0.04 37.55MB 8.192kB 16
073d8a84e2988 0.00 2.793MB 8.192kB 17
117baf93b2d90 0.00 14.04MB 17.19MB 27
13fd2b955ff3f 0.00 18.87MB 8.192kB 12
1956bdb9800e3 0.00 36.1MB 8.192kB 12
1bc82c13ed26d 0.01 147.8MB 8.192kB 14
26463008d46eb 0.06 322.9MB 12.29kB 28
36f2ba341f36a 0.00 21.57MB 8.192kB 13
3c657284d340e 0.00 30.8MB 8.192kB 16
45aca35334d81 0.00 19.65MB 8.192kB 20
4788e5fb8ffd2 0.00 9.241MB 12.29kB 15
5830ada43077a 0.17 259.3MB 108.5MB 212
7101f571a9fd2 0.00 22.68MB 8.192kB 12
7ca26df6caa24 0.02 140.4MB 8.192kB 12
7f99239a989cb 0.00 30.93MB 8.192kB 18
847f8e51d050e 0.11 54.76MB 36.86kB 37
87858c5d23532 0.00 1.008MB 8.192kB 12
8de85bdf38707 0.00 27.48MB 8.192kB 19
a69dee0870283 0.02 359.2MB 8.192kB 12
a9bc589cd0651 1.12 19.76MB 12.29kB 28
c5232c8a827b4 0.02 22.59MB 8.192kB 17
c85906b9495bb 0.00 23.33MB 8.192kB 14
d1c4f328c0a41 0.00 21MB 8.192kB 15
f0dd71b61d07a 0.01 21.39MB 8.192kB 13
f3fa23eaefb27 0.10 55.87MB 8.192kB 14
f5150d299cd8a 3.69 28.61MB 8.192kB 19
f94fe50756f0c 50.53 595.2MB 8.192kB 32
sh-4.4# systemctl status crio
● crio.service - Container Runtime Interface for OCI (CRI-O)
Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/crio.service.d
└─01-kubens.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf
Active: active (running) since Mon 2023-07-31 07:55:05 UTC; 3 weeks 2 days ago
Docs: https://github.com/cri-o/cri-o
Main PID: 1764 (crio)
Tasks: 26
Memory: 1.4G
CPU: 16h 20min 44.669s
CGroup: /system.slice/crio.service
└─1764 /usr/bin/crio
Aug 23 16:54:32 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:54:32.735802375Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:aaaa,RepoTags:[registry.redhat.io/ubi9:latest]>
Aug 23 16:54:45 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:54:45.862217539Z" level=warning msg="Found defunct process with PID 2895076 (timeout)"
Aug 23 16:55:15 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:55:15.861505532Z" level=warning msg="Found defunct process with PID 2900654 (crictl)"
Aug 23 16:55:43 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:55:43.826018227Z" level=warning msg="Found defunct process with PID 2905842 (timeout)"
Aug 23 16:57:03 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:57:03.323528293Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:abcd" id=a4bf40bd-e1>
Aug 23 16:57:03 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:57:03.323831685Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:237998c72f3e3da71a88f41c2f9cd61b5fa09c106512216ed6686099446e1209,RepoTags:[],RepoDigests:[quay.io/openshif>
Aug 23 17:02:03 ip-10-29-175-200 crio[1764]: time="2023-08-23 17:02:03.341590475Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:abcd" id=abcd-00>
Aug 23 17:02:03 ip-10-29-175-200 crio[1764]: time="2023-08-23 17:02:03.341879183Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:aaaa,RepoTags:[],RepoDigests:[quay.io/openshif>
Here are some other relevant details that can be shared about the cluster:
- When more than three nodes are in a NotReady state, the Machine API will no longer attempt to drain the nodes, and will just exit the reconcile loop.
- This cluster runs primarily (if not exclusively) Tekton pipelines, which apply Pod Disruption Budgets that do not allow the workloads to be disrupted.
- Worker Nodes in OSD/ROSA have a 480s timeout between when they go NotReady and when the MachineHealthCheck will attempt to replace the instance.
What I think happened - In this specific incident, ALL of the cluster's worker nodes were NotReady, and three of them were NotReady,SchedulingDisabled. I theorize that, one at a time, the nodes hit this gRPC issue and were marked NotReady. The MachineHealthCheck would eventually time out and create a replacement Machine, which spins up a new instance and attempts to migrate workloads by draining the existing node. The existing node can't confirm that it has been drained, because the kubelet cannot contact the container runtime due to the gRPC issue. This leaves the node on the cluster with a NotReady,SchedulingDisabled status indefinitely. After the first three nodes had become NotReady and the cluster had attempted to recover them, the nodes that subsequently became NotReady were NOT replaced, because the Machine API's limit of a maximum of three NotReady nodes had been reached.
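To make that remediation limit concrete, here is a minimal Go sketch (not the actual machine-api controller code) of the short-circuit behavior described above: once more than maxUnhealthy nodes are unhealthy, remediation stops entirely, so nodes that go NotReady afterwards are never replaced. The function name, the node struct, and the threshold of 3 are assumptions for illustration.

// Sketch of a MachineHealthCheck-style maxUnhealthy guard (illustrative only).
package main

import "fmt"

type node struct {
	Name  string
	Ready bool
}

// shouldRemediate returns false when more than maxUnhealthy nodes are already
// unhealthy, mimicking the controller exiting its reconcile loop without
// draining or replacing anything.
func shouldRemediate(nodes []node, maxUnhealthy int) (bool, []string) {
	var unhealthy []string
	for _, n := range nodes {
		if !n.Ready {
			unhealthy = append(unhealthy, n.Name)
		}
	}
	if len(unhealthy) > maxUnhealthy {
		// Too many nodes are down; remediating could make things worse,
		// so nothing gets replaced from this point on.
		return false, unhealthy
	}
	return true, unhealthy
}

func main() {
	workers := []node{
		{"worker-a", false}, {"worker-b", false}, {"worker-c", false},
		{"worker-d", false}, // the fourth NotReady node pushes past the limit
	}
	ok, unhealthy := shouldRemediate(workers, 3)
	fmt.Printf("remediate=%v unhealthy=%v\n", ok, unhealthy)
}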
Let me know if there's any more information that we can provide. If you look at the linked tickets, OHSS-26041 has sosreports that I was able to gather from a few of the affected nodes before replacing them. This is the second time in a few weeks that this cluster has experienced the same symptoms (nodes going NotReady), though I don't think we root-caused the first occurrence.