Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-41277

OCP 4.12 - Health Probes to pods on node failing intermittently on OVN-kubernetes (Node has zero performance or network load - all pods on host affected)

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Won't Do
    • Icon: Major Major
    • None
    • 4.12, 4.15
    • Node / Kubelet
    • Critical
    • None
    • False
    • Hide

      None

      Show
      None
    • Customer Facing, Customer Reported

      Description of problem:

      After some number of days nodes will begin to degrade local pod health. Any pod with a healthProbe of type TCP/GET will begin to intermittently report that the pods are unhealthy/failing their probes, which may lead to pod restarts.
      
      This behavior affects 1-2 nodes in a cluster at a time, and a reboot will solve it temporarily. Multiple clusters are affected, and it's not clear what the commonality is.
      
      We have replicated this behavior on a single node in a multi-node cluster that is currently cordoned and drained, with a deployment of several NGINX containers running on it that are not being contacted by any service or route. The pod has zero load (CPU and MEM consumption at 1%) and Networking throughput to the node is minimal (basically just us sshed into the node looking at stuff periodically/pulling logs). 
      
      We've set the deployment up with a health probe, only so that we can observe the behavior in a more targeted way; we are seeing that packets that originate from kubelet and are passed through ovn-k8s-mp0 to pod eth0 are occasionally lost in transit. (e.g. both eth0 and ovn-k8s-mp0 show all packets that are passed to them, but we observe that during failure condition timeframes, kubelet will report a timeout, and the packet is not observed passing those interfaces (e.g. syn never arrives). Or, during a timeout period, a FIN/ACK is sent back from the pod, but kubelet sends an RST packet because it never recieved the closure packet). 
      
      We observe this behavior infrequently, but often enough that it's a problem, and when multiple pods/workloads are running on the host, it can disrupt pod handling. We've left 1 node alone for debugging/testing - we will not restart it to ensure we have an active replicator available, and we can suggest local tests to this endpoint for problem identification.
      
      Need help isolating where this behavior is originating from/correcting the possible bug. This may also need to involve OVS team - see additional notes below for log samples.

      Version-Release number of selected component (if applicable):

      4.12.26 
      Bare-metal
      UPI
      OVN-kube

      How reproducible: 

      Single node replicator is currently available on multi-node cluster. Appears to impact nodes after some amount of uptime. Draining/cordoning node does not fix behavior. Restarts on node will clear the issue for several weeks before it returns, which to me indicates plausibly a memory leak in a networking component or kubelet. 
       

      Steps to Reproduce:

      1. Allow cluster nodes to perform as usual for a few weeks
      
      2. observe one node out of many begin to fail healthprobes on all pods on that node infrequently but often enough to be a problem.
      
      3. restart node to clear issue (temporarily, observe behavior start on other nodes --> seems time related, not deployment/workload linked. No correlation with usage rates, traffic flow, or other routing concerns. Packets are lost in the local network stack between kubelet --> ovn-k8s-mp0 --> pod (eth0) [and return trip]
       

       

      Actual results:

      pods degrade, cluster becomes unstable for workloads until restarted - interrupting cluster stability/client deliverables. 

      Expected results:

      node should not degrade pod health status unless pods are actually unhealthy 

      Additional info:

      Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

      Affected Platforms:

      bare-metal
      - customer issue (VZW specific BareMetal deployments) 

       

      Additional data:

      see first comment update for testing results, documentation of issue, isolation steps, and highlighted log bundles.
      
      
      //Assistance is required to identify and resolve this issue - customer is on 4.12 EUS support and there are multiple clusters impacted in the same way. Case is sev 3 to provide testing time, but is becoming more urgent that we identify/resolve soon. - SEV allocation is accurate. 

              aos-node@redhat.com Node Team Bot Account
              rhn-support-wrussell Will Russell
              Anurag Saxena Anurag Saxena
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

                Created:
                Updated:
                Resolved: