OpenShift Bugs / OCPBUGS-18256

PLEG is not Healthy causing nodes to go NotReady


    • Type: Bug
    • Resolution: Duplicate
    • Affects Versions: 4.12.z, 4.11.z
    • Component: Node / Kubelet
    • Quality / Stability / Reliability

      Description of problem:

      The cluster's worker nodes go "NotReady" with a Reason of:
      
      Ready False Mon, 21 Aug 2023 10:27:20 -0400 Mon, 07 Aug 2023 10:21:57 -0400 KubeletNotReady PLEG is not healthy: pleg was last seen active 336h8m26.816166564s ago; threshold is 3m0s
      
      From an investigation into the 4.12.16 cluster (all logs and nodes had been rolled before we could dive into the 4.11 cluster for troubleshooting), we found the following errors in the kubelet journal:
      
      Aug 23 17:42:32 ip-10-29-175-200 kubenswrapper[2931940]: E0823 17:42:32.602204 2931940 kuberuntime_sandbox.go:297] "Failed to list pod sandboxes" err="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (16784520 vs. 16777216)"
      Aug 23 17:42:32 ip-10-29-175-200 kubenswrapper[2931940]: E0823 17:42:32.602219 2931940 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (16784520 vs. 16777216)"
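
      The 16777216-byte figure in these errors is a 16 MiB gRPC receive cap on the kubelet's CRI client, so once the serialized ListPodSandbox response grows past that size, every PLEG relist fails with ResourceExhausted and the node cannot recover on its own. As a minimal sketch (not the kubelet's actual source, and assuming CRI-O is listening on its default /var/run/crio/crio.sock socket), a CRI client dialed with the same 16 MiB cap looks roughly like this:

      package main

      import (
          "context"
          "fmt"
          "time"

          "google.golang.org/grpc"
          "google.golang.org/grpc/credentials/insecure"
          runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
      )

      // 16777216 bytes: the exact threshold quoted in the kubelet errors above.
      const maxMsgSize = 16 * 1024 * 1024

      func main() {
          conn, err := grpc.Dial(
              "unix:///var/run/crio/crio.sock", // assumption: default CRI-O socket path
              grpc.WithTransportCredentials(insecure.NewCredentials()),
              grpc.WithDefaultCallOptions(
                  grpc.MaxCallRecvMsgSize(maxMsgSize),
                  grpc.MaxCallSendMsgSize(maxMsgSize),
              ),
          )
          if err != nil {
              panic(err)
          }
          defer conn.Close()

          client := runtimeapi.NewRuntimeServiceClient(conn)
          ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
          defer cancel()

          // This is the same CRI call that GenericPLEG depends on. If the runtime
          // is tracking enough sandboxes that the serialized response exceeds
          // maxMsgSize, the call fails with codes.ResourceExhausted, the relist
          // never completes, and the node is eventually marked NotReady.
          resp, err := client.ListPodSandbox(ctx, &runtimeapi.ListPodSandboxRequest{})
          if err != nil {
              fmt.Println("ListPodSandbox failed:", err)
              return
          }
          fmt.Printf("runtime reports %d pod sandboxes\n", len(resp.Items))
      }

      If a sketch like this fails with the same ResourceExhausted error, that points at the response size (for example from stale sandboxes) rather than a hung CRI-O as the reason PLEG stays unhealthy.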
      

      Version-Release number of selected component (if applicable):

      Observed on Cluster versions:
      4.11.37
      4.12.16
      

      How reproducible:

      
      

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      
      

      Expected results:

      
      

      Additional info:

      Additionally, we saw the following types of logs in the crio journal:
      
      sh-4.4# journalctl -u crio
      Aug 23 17:27:48 ip-10-29-175-200 crio[2931480]: time="2023-08-23 17:27:48.300063956Z" level=warning msg="Could not restore sandbox IP for e1c24c1e81fd30bd3cb4bf4985bc5d108b365ec5e2e1f9cd4ef4025bacdf7892: failed to get network status for pod sandbox k8s_com>
      Aug 23 17:27:48 ip-10-29-175-200 crio[2931480]: time="2023-08-23 17:27:48.300081373Z" level=error msg="error checking loopback interface: failed to Statfs \"\": no such file or directory"
      Aug 23 17:27:48 ip-10-29-175-200 crio[2931480]: time="2023-08-23 17:27:48.300089332Z" level=warning msg="Could not restore sandbox IP for d6aa36378a16ab3c4d8cab4dace6a8e4d108d043a73b7c8196aec59db9b3d4ff: failed to get network status for pod sandbox k8s_ope>
      Aug 23 17:27:48 ip-10-29-175-200 crio[2931480]: time="2023-08-23 17:27:48.300102298Z" level=error msg="error checking loopback interface: failed to Statfs \"\": no such file or directory"
      
      sh-4.4# crictl statsp
      E0823 17:17:08.707058 2924077 remote_runtime.go:1148] "ListPodSandboxStats with filter from runtime service failed" err="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (16811908 vs. 16777216)" filter="&PodSandboxStatsFilter{Id:,LabelSelector:map[string]string{},}"
      FATA[0000] get pod stats: display pod stats: list pod sandbox stats: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (16811908 vs. 16777216)
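
      The crictl failure above hits the same 16 MiB cap (it needed 16811908 bytes), and the earlier "Could not restore sandbox IP" warnings suggest CRI-O is still tracking sandboxes it could not fully restore after its last restart; a large number of stale sandboxes would inflate the ListPodSandbox/ListPodSandboxStats responses past the limit. A hypothetical diagnostic (same assumptions as the sketch above: default CRI-O socket, not an official tool) is to dial with a much larger receive cap and count sandboxes by state:

      package main

      import (
          "context"
          "fmt"
          "time"

          "google.golang.org/grpc"
          "google.golang.org/grpc/credentials/insecure"
          runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
      )

      func main() {
          // Raise the client-side receive cap well past the 16 MiB default so the
          // oversized response can actually be read for inspection.
          conn, err := grpc.Dial(
              "unix:///var/run/crio/crio.sock", // assumption: default CRI-O socket path
              grpc.WithTransportCredentials(insecure.NewCredentials()),
              grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(64*1024*1024)),
          )
          if err != nil {
              panic(err)
          }
          defer conn.Close()

          client := runtimeapi.NewRuntimeServiceClient(conn)
          ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
          defer cancel()

          resp, err := client.ListPodSandbox(ctx, &runtimeapi.ListPodSandboxRequest{})
          if err != nil {
              panic(err)
          }

          // Group sandboxes by state; many NOTREADY entries would be consistent
          // with the restore warnings above and would explain the response size.
          byState := map[string]int{}
          for _, sb := range resp.Items {
              byState[sb.State.String()]++
          }
          fmt.Printf("total sandboxes: %d, by state: %v\n", len(resp.Items), byState)
      }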
      
      sh-4.4# crictl stats
      CONTAINER           CPU %               MEM                 DISK                INODES
      071d960b1ff1a       0.04                37.55MB             8.192kB             16
      073d8a84e2988       0.00                2.793MB             8.192kB             17
      117baf93b2d90       0.00                14.04MB             17.19MB             27
      13fd2b955ff3f       0.00                18.87MB             8.192kB             12
      1956bdb9800e3       0.00                36.1MB              8.192kB             12
      1bc82c13ed26d       0.01                147.8MB             8.192kB             14
      26463008d46eb       0.06                322.9MB             12.29kB             28
      36f2ba341f36a       0.00                21.57MB             8.192kB             13
      3c657284d340e       0.00                30.8MB              8.192kB             16
      45aca35334d81       0.00                19.65MB             8.192kB             20
      4788e5fb8ffd2       0.00                9.241MB             12.29kB             15
      5830ada43077a       0.17                259.3MB             108.5MB             212
      7101f571a9fd2       0.00                22.68MB             8.192kB             12
      7ca26df6caa24       0.02                140.4MB             8.192kB             12
      7f99239a989cb       0.00                30.93MB             8.192kB             18
      847f8e51d050e       0.11                54.76MB             36.86kB             37
      87858c5d23532       0.00                1.008MB             8.192kB             12
      8de85bdf38707       0.00                27.48MB             8.192kB             19
      a69dee0870283       0.02                359.2MB             8.192kB             12
      a9bc589cd0651       1.12                19.76MB             12.29kB             28
      c5232c8a827b4       0.02                22.59MB             8.192kB             17
      c85906b9495bb       0.00                23.33MB             8.192kB             14
      d1c4f328c0a41       0.00                21MB                8.192kB             15
      f0dd71b61d07a       0.01                21.39MB             8.192kB             13
      f3fa23eaefb27       0.10                55.87MB             8.192kB             14
      f5150d299cd8a       3.69                28.61MB             8.192kB             19
      f94fe50756f0c       50.53               595.2MB             8.192kB             32
      
      sh-4.4# systemctl status crio
      ● crio.service - Container Runtime Interface for OCI (CRI-O)
         Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/crio.service.d
                 └─01-kubens.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf
         Active: active (running) since Mon 2023-07-31 07:55:05 UTC; 3 weeks 2 days ago
           Docs: https://github.com/cri-o/cri-o
       Main PID: 1764 (crio)
          Tasks: 26
         Memory: 1.4G
            CPU: 16h 20min 44.669s
         CGroup: /system.slice/crio.service
                 └─1764 /usr/bin/crio
      
      Aug 23 16:54:32 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:54:32.735802375Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:aaaa,RepoTags:[registry.redhat.io/ubi9:latest]>
      Aug 23 16:54:45 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:54:45.862217539Z" level=warning msg="Found defunct process with PID 2895076 (timeout)"
      Aug 23 16:55:15 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:55:15.861505532Z" level=warning msg="Found defunct process with PID 2900654 (crictl)"
      Aug 23 16:55:43 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:55:43.826018227Z" level=warning msg="Found defunct process with PID 2905842 (timeout)"
      Aug 23 16:57:03 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:57:03.323528293Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:abcd" id=a4bf40bd-e1>
      Aug 23 16:57:03 ip-10-29-175-200 crio[1764]: time="2023-08-23 16:57:03.323831685Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:237998c72f3e3da71a88f41c2f9cd61b5fa09c106512216ed6686099446e1209,RepoTags:[],RepoDigests:[quay.io/openshif>
      Aug 23 17:02:03 ip-10-29-175-200 crio[1764]: time="2023-08-23 17:02:03.341590475Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:abcd" id=abcd-00>
      Aug 23 17:02:03 ip-10-29-175-200 crio[1764]: time="2023-08-23 17:02:03.341879183Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:aaaa,RepoTags:[],RepoDigests:[quay.io/openshif>
      

      Here are some other relevant details that can be shared about the cluster:

      • When more than three nodes are in a NotReady state, the Machine API will no longer attempt to drain the nodes, and will just exit the reconcile loop.
    • This cluster runs primarily (if not exclusively) Tekton pipelines, which apply Pod Disruption Budgets that do not allow the workloads to be disrupted.
      • Worker Nodes in OSD/ROSA have a 480s timeout between when they go NotReady and when the MachineHealthCheck will attempt to replace the instance.

      What I think happened - In this specific incident, ALL of the cluster's worker nodes were NotReady and three were NotReady,SchedulingDisabled. I theorize that, one at a time, the nodes hit this gRPC issue and were marked NotReady. The MachineHealthCheck would eventually time out and spin up a replacement machine, which provisions a new instance and attempts to migrate workloads by draining the existing node. The existing node can't confirm that it has been drained, because the kubelet cannot contact the container runtime due to the gRPC issue. This leaves the node on the cluster with a NotReady,SchedulingDisabled status indefinitely. After the first three nodes had become NotReady and the cluster had attempted to recover them, the nodes that became NotReady later were NOT replaced, because the Machine API's limit of at most 3 NotReady nodes had already been reached.

      Let me know if there's any more information that we can provide. If you look at the linked tickets, OHSS-26041 has sosreports that I was able to gather from a few of the affected nodes before replacing them. This is the second time in a few weeks that this cluster has experienced the same symptoms (nodes going NotReady), though I don't think we root-caused the first occurrence.
