Bug
Resolution: Duplicate
4.12.z, 4.11.z
Quality / Stability / Reliability
Description of problem:
Cluster's worker nodes go "NotReady" with a Reason of:

Ready False Mon, 21 Aug 2023 10:27:20 -0400 Mon, 07 Aug 2023 10:21:57 -0400 KubeletNotReady PLEG is not healthy: pleg was last seen active 336h8m26.816166564s ago; threshold is 3m0s

From an investigation into the 4.12.16 cluster (all logs and nodes had been rolled before we could dive into the 4.11 cluster for troubleshooting), we found the following errors in the kubelet journal:

Aug 23 17:42:32 ip-10-29-175-200 kubenswrapper[2931940]: E0823 17:42:32.602204 2931940 kuberuntime_sandbox.go:297] "Failed to list pod sandboxes" err="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (16784520 vs. 16777216)"
Aug 23 17:42:32 ip-10-29-175-200 kubenswrapper[2931940]: E0823 17:42:32.602219 2931940 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (16784520 vs. 16777216)"
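The errors above point at the gRPC receive cap on the kubelet's CRI client: the ListPodSandbox reply (16784520 bytes) exceeds the 16 MiB (16777216 bytes) limit, so the call fails with ResourceExhausted and the PLEG can never relist pods. Below is a minimal Go sketch (not the kubelet's actual code) of a CRI client configured with the same receive cap; the CRI-O socket path and the use of ListPodSandbox here are assumptions for illustration.

// Minimal sketch of a CRI client with a 16 MiB gRPC receive cap.
// The socket path is an assumption; adjust for the environment.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

const maxMsgSize = 1024 * 1024 * 16 // 16777216 bytes, the ceiling reported in the error

func main() {
	conn, err := grpc.Dial(
		"unix:///var/run/crio/crio.sock", // assumed CRI-O socket path
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Any response larger than this is rejected client-side with ResourceExhausted.
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(maxMsgSize)),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// With enough sandboxes on the node, this reply exceeds 16 MiB and the call
	// fails exactly like the kubelet's "Failed to list pod sandboxes" error above.
	resp, err := client.ListPodSandbox(ctx, &runtimeapi.ListPodSandboxRequest{})
	if err != nil {
		fmt.Println("ListPodSandbox failed:", err)
		return
	}
	fmt.Println("sandboxes:", len(resp.Items))
}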
Version-Release number of selected component (if applicable):
Observed on Cluster versions: 4.11.37 4.12.16
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Additionally, we saw the following types of logs in the crio journal:
sh-4.4# journalctl -u crio
Aug 23 17:27:48 ip-10-29-175-200 crio[2931480]: time="2023-08-23 17:27:48.300063956Z" level=warning msg="Could not restore sandbox IP for e1c24c1e81fd30bd3cb4bf4985bc5d108b365ec5e2e1f9cd4ef4025bacdf7892: failed to get network status for pod sandbox k8s_com>
Aug 23 17:27:48 ip-10-29-175-200 crio[2931480]: time="2023-08-23 17:27:48.300081373Z" level=error msg="error checking loopback interface: failed to Statfs \"\": no such file or directory"
Aug 23 17:27:48 ip-10-29-175-200 crio[2931480]: time="2023-08-23 17:27:48.300089332Z" level=warning msg="Could not restore sandbox IP for d6aa36378a16ab3c4d8cab4dace6a8e4d108d043a73b7c8196aec59db9b3d4ff: failed to get network status for pod sandbox k8s_ope>
Aug 23 17:27:48 ip-10-29-175-200 crio[2931480]: time="2023-08-23 17:27:48.300102298Z" level=error msg="error checking loopback interface: failed to Statfs \"\": no such file or directory"
sh-4.4# crictl statsp
E0823 17:17:08.707058 2924077 remote_runtime.go:1148] "ListPodSandboxStats with filter from runtime service failed" err="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (16811908 vs. 16777216)" filter="&PodSandboxStatsFilter{Id:,LabelSelector:map[string]string{},}"
FATA[0000] get pod stats: display pod stats: list pod sandbox stats: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (16811908 vs. 16777216)
sh-4.4# crictl stats
CONTAINER CPU % MEM DISK INODES
071d960b1ff1a 0.04 37.55MB 8.192kB 16
073d8a84e2988 0.00 2.793MB 8.192kB 17
117baf93b2d90 0.00 14.04MB 17.19MB 27
13fd2b955ff3f 0.00 18.87MB 8.192kB 12
1956bdb9800e3 0.00 36.1MB 8.192kB 12
1bc82c13ed26d 0.01 147.8MB 8.192kB 14
26463008d46eb 0.06 322.9MB 12.29kB 28
36f2ba341f36a 0.00 21.57MB 8.192kB 13
3c657284d340e 0.00 30.8MB 8.192kB 16
45aca35334d81 0.00 19.65MB 8.192kB 20
4788e5fb8ffd2 0.00 9.241MB 12.29kB 15
5830ada43077a 0.17 259.3MB 108.5MB 212
7101f571a9fd2 0.00 22.68MB 8.192kB 12
7ca26df6caa24 0.02 140.4MB 8.192kB 12
7f99239a989cb 0.00 30.93MB 8.192kB 18
847f8e51d050e 0.11 54.76MB 36.86kB 37
87858c5d23532 0.00 1.008MB 8.192kB 12
8de85bdf38707 0.00 27.48MB 8.192kB 19
a69dee0870283 0.02 359.2MB 8.192kB 12
a9bc589cd0651 1.12 19.76MB 12.29kB 28
c5232c8a827b4 0.02 22.59MB 8.192kB 17
c85906b9495bb 0.00 23.33MB 8.192kB 14
d1c4f328c0a41 0.00 21MB 8.192kB 15
f0dd71b61d07a 0.01 21.39MB 8.192kB 13
f3fa23eaefb27 0.10 55.87MB 8.192kB 14
f5150d299cd8a 3.69 28.61MB 8.192kB 19
f94fe50756f0c 50.53 595.2MB 8.192kB 32
sh-4.4# systemctl status crio
● crio.service - Container Runtime Interface for OCI (CRI-O)
Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/crio.service.d
└─01-kubens.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf
Active: active (running) since Mon 2023-07-31 07:55:05 UTC; 3 weeks 2 days ago
Docs: https://github.com/cri-o/cri-o
Main PID: 1764 (crio)
Tasks: 26
Memory: 1.4G
CPU: 16h 20min 44.669s
CGroup: /system.slice/crio.service
└─1764 /usr/bin/crio
Aug 23 16:54:32 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:54:32.735802375Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:aaaa,RepoTags:[registry.redhat.io/ubi9:latest]>
Aug 23 16:54:45 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:54:45.862217539Z" level=warning msg="Found defunct process with PID 2895076 (timeout)"
Aug 23 16:55:15 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:55:15.861505532Z" level=warning msg="Found defunct process with PID 2900654 (crictl)"
Aug 23 16:55:43 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:55:43.826018227Z" level=warning msg="Found defunct process with PID 2905842 (timeout)"
Aug 23 16:57:03 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:57:03.323528293Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:abcd" id=a4bf40bd-e1>
Aug 23 16:57:03 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:57:03.323831685Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:237998c72f3e3da71a88f41c2f9cd61b5fa09c106512216ed6686099446e1209,RepoTags:[],RepoDigests:[quay.io/openshif>
Aug 23 17:02:03 ip-10-29-175-200 crio[1764]: time="2023-08-23 17:02:03.341590475Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:abcd" id=abcd-00>
Aug 23 17:02:03 ip-10-29-175-200 crio[1764]: time="2023-08-23 17:02:03.341879183Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:aaaa,RepoTags:[],RepoDigests:[quay.io/openshif>
Here are some other relevant details that can be shared about the cluster:
- When more than three nodes are in a NotReady state, the Machine API will no longer attempt to drain the nodes, and will just exit the reconcile loop.
- This cluster runs primarily (if not exclusively) Tekton pipelines, which apply Pod Disruption Budgets that do not allow the workloads to be disrupted.
- Worker Nodes in OSD/ROSA have a 480s timeout between when they go NotReady and when the MachineHealthCheck will attempt to replace the instance.
What I think happened - In this specific incident, ALL of the cluster's worker nodes were NotReady, and three of them were NotReady,SchedulingDisabled. I theorize that, one at a time, the nodes hit this gRPC issue and were marked NotReady. The MachineHealthCheck would eventually time out and create a replacement Machine, which spins up a new instance and attempts to migrate workloads by draining the existing node. The existing node can't confirm that it has been drained, because the kubelet cannot contact the container runtime due to the gRPC issue. This leaves the node on the cluster with a NotReady,SchedulingDisabled status indefinitely. After the first three nodes had become NotReady and the cluster had attempted to recover them, the nodes that subsequently became NotReady were NOT replaced, because the Machine API's limit of a maximum of three NotReady nodes had been reached.
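To make that remediation limit concrete, here is a minimal Go sketch (not the actual machine-api controller code) of the short-circuit behavior described above: once more than maxUnhealthy nodes are unhealthy, remediation stops entirely, so nodes that go NotReady afterwards are never replaced. The function name, the node struct, and the threshold of 3 are assumptions for illustration.

// Sketch of a MachineHealthCheck-style maxUnhealthy guard (illustrative only).
package main

import "fmt"

type node struct {
	Name  string
	Ready bool
}

// shouldRemediate returns false when more than maxUnhealthy nodes are already
// unhealthy, mimicking the controller exiting its reconcile loop without
// draining or replacing anything.
func shouldRemediate(nodes []node, maxUnhealthy int) (bool, []string) {
	var unhealthy []string
	for _, n := range nodes {
		if !n.Ready {
			unhealthy = append(unhealthy, n.Name)
		}
	}
	if len(unhealthy) > maxUnhealthy {
		// Too many nodes are down; remediating could make things worse,
		// so nothing gets replaced from this point on.
		return false, unhealthy
	}
	return true, unhealthy
}

func main() {
	workers := []node{
		{"worker-a", false}, {"worker-b", false}, {"worker-c", false},
		{"worker-d", false}, // the fourth NotReady node pushes past the limit
	}
	ok, unhealthy := shouldRemediate(workers, 3)
	fmt.Printf("remediate=%v unhealthy=%v\n", ok, unhealthy)
}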
Let me know if there's any more information that we can provide. If you look at the linked tickets, OHSS-26041 has sosreports that I was able to gather from a few of the affected nodes before replacing them. This is the second time in a few weeks that this cluster has experienced the same symptoms (nodes going NotReady), though I don't think we root-caused the first occurrence.