OpenShift Bugs / OCPBUGS-27746

ovnkube pod crashed when scaling up to 400 workers

    • Important
    • No
    • SDN Sprint 250
    • 1
    • Rejected
    • False
    • 05/22: Resolved with OCPBUG-28745 according to the TAM. Can be closed.

      Original: Description of problem:

      When scaling the cluster up to 400 workers with

      oc scale machineset zhsunaz2-42pvb-worker-eastus1 --replicas 200
      oc scale machineset zhsunaz2-42pvb-worker-eastus2 --replicas 100
      oc scale machineset zhsunaz2-42pvb-worker-eastus3 --replicas 100

      the ovnkube-node pods fail to become ready, with these logs:
       
      I0123 09:47:19.213970 134093 obj_retry.go:607] Update event received for *factory.egressNode zhsunaz2-42pvb-worker-eastus2-fg8j4
      I0123 09:47:19.213998 134093 obj_retry.go:555] Update event received for resource *factory.egressFwNode, old object is equal to new: true
      I0123 09:47:19.221642 134093 ovs.go:167] Exec(7410): stdout: ""
      I0123 09:47:19.221731 134093 ovs.go:168] Exec(7410): stderr: ""
      I0123 09:47:19.221761 134093 default_node_network_controller.go:645] Upgrade Hack: checkOVNSBNodeLRSR for node - 10.130.136.0/23 : match match="reg7 == 0 && ip4.dst == 10.130.136.0/23" : stdout - : stderr - : err <nil>
      F0123 09:47:19.221788 134093 default_node_network_controller.go:955] Upgrade hack: Timed out waiting for the remote ovnkube-controller to be ready even after 5 minutes, err : context deadline exceeded, upgrade hack: unable to find LRSR for node zhsunaz2-42pvb-worker-eastus2-qmxgw
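For triage, the fatal entry can be picked out of a long ovnkube-node log mechanically. The following is a minimal sketch, assuming the log lines follow the usual klog header layout (severity letter, MMDD, time, PID, `file:line]`). The names `parse_klog_line` and `fatal_entries` are illustrative helpers, not part of ovn-kubernetes.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed klog header: <severity><MMDD> <HH:MM:SS.micros> <pid> <file>:<line>] <message>
KLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<pid>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


class KlogEntry(NamedTuple):
    severity: str  # one of I, W, E, F
    source: str    # "file.go:123"
    message: str


def parse_klog_line(line: str) -> Optional[KlogEntry]:
    """Return the parsed entry, or None if the line is not in klog format."""
    m = KLOG_RE.match(line.strip())
    if m is None:
        return None
    return KlogEntry(m["sev"], f'{m["file"]}:{m["line"]}', m["msg"])


def fatal_entries(lines: Iterable[str]) -> List[KlogEntry]:
    """Keep only entries logged at fatal severity (F)."""
    parsed = (parse_klog_line(line) for line in lines)
    return [e for e in parsed if e is not None and e.severity == "F"]


# Two lines copied from the log excerpt in this report.
SAMPLE = [
    'I0123 09:47:19.221642 134093 ovs.go:167] Exec(7410): stdout: ""',
    "F0123 09:47:19.221788 134093 default_node_network_controller.go:955] "
    "Upgrade hack: Timed out waiting for the remote ovnkube-controller to be "
    "ready even after 5 minutes, err : context deadline exceeded, upgrade hack: "
    "unable to find LRSR for node zhsunaz2-42pvb-worker-eastus2-qmxgw",
]

if __name__ == "__main__":
    for entry in fatal_entries(SAMPLE):
        print(entry.source, "->", entry.message)
```

Run against a full container log, this would surface only the `F` lines, such as the timeout at `default_node_network_controller.go:955` above.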
       
       
       
      ovnkube-node-xmgd2                       8/9     Running            23 (4m25s ago)   158m   10.0.128.56    zhsunaz2-42pvb-worker-eastus1-gpzs6   <none>           <none>
      ovnkube-node-xmvgj                       8/9     Running            23 (5m45s ago)   154m   10.0.129.74    zhsunaz2-42pvb-worker-eastus3-8dxzl   <none>           <none>
      ovnkube-node-z4hl8                       8/9     Running            23 (2m15s ago)   155m   10.0.129.87    zhsunaz2-42pvb-worker-eastus3-4rc8k   <none>           <none>
      ovnkube-node-z4w9t                       8/9     Running            23 (3m7s ago)    157m   10.0.128.107   zhsunaz2-42pvb-worker-eastus1-qfx4x   <none>           <none>
      ovnkube-node-z8gl6                       8/9     Running            23 (5m51s ago)   158m   10.0.128.58    zhsunaz2-42pvb-worker-eastus1-zgdrd   <none>           <none>
      ovnkube-node-z9xxv                       8/9     Running            23 (3m12s ago)   156m   10.0.128.115   zhsunaz2-42pvb-worker-eastus1-j7c9f   <none>           <none>
      ovnkube-node-zbr5q                       8/9     CrashLoopBackOff   23 (2m15s ago)   156m   10.0.128.224   zhsunaz2-42pvb-worker-eastus2-chfl4   <none>           <none>
      ovnkube-node-zcpmd                       8/9     CrashLoopBackOff   23 (73s ago)     161m   10.0.128.26    zhsunaz2-42pvb-worker-eastus1-8d247   <none>           <none>
      ovnkube-node-zg2xn                       8/9     Running            23 (4m14s ago)   156m   10.0.128.232   zhsunaz2-42pvb-worker-eastus2-xsrnt   <none>           <none>
      ovnkube-node-zjkpj                       8/9     Running            23 (2m28s ago)   155m   10.0.129.40    zhsunaz2-42pvb-worker-eastus2-mgxl7   <none>           <none>
      ovnkube-node-zmlqb                       8/9     Error              23 (6m40s ago)   160m   10.0.128.65    zhsunaz2-42pvb-worker-eastus1-vx5t7   <none>           <none>
      ovnkube-node-zrcmw                       8/9     CrashLoopBackOff   22 (49s ago)     150m   10.0.129.115   zhsunaz2-42pvb-worker-eastus3-zvzzx   <none>           <none>
      ovnkube-node-zsj67                       8/9     Running            23 (92s ago)     154m   10.0.129.77    zhsunaz2-42pvb-worker-eastus3-xznxk   <none>           <none>
      ovnkube-node-zskbl                       8/9     Running            24 (3m19s ago)   152m   10.0.129.124   zhsunaz2-42pvb-worker-eastus3-29rdd   <none>           <none>
      ovnkube-node-zwt2z                       8/9     Running            23 (3m15s ago)   157m   10.0.128.114   zhsunaz2-42pvb-worker-eastus1-ldvjb   <none>           <none>
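With around 400 nodes the pod listing is long, so a tally by status makes the extent of the failure easier to see. This is a minimal sketch, assuming whitespace-separated rows in the same column order as the listing above (name, ready count, status first). The helper name `summarize_pods` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def summarize_pods(listing: str) -> Tuple[Counter, List[str]]:
    """Count pods per STATUS and collect pods whose READY count is below total."""
    statuses: Counter = Counter()
    not_ready: List[str] = []
    for row in listing.splitlines():
        fields = row.split()
        # Skip blank lines and any header row (its READY column has no "/").
        if len(fields) < 4 or "/" not in fields[1]:
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, total = (int(part) for part in ready.split("/"))
        statuses[status] += 1
        if ready_count < total:
            not_ready.append(name)
    return statuses, not_ready


# Three rows copied from the listing in this report.
SAMPLE = """\
ovnkube-node-zbr5q   8/9   CrashLoopBackOff   23 (2m15s ago)   156m   10.0.128.224   zhsunaz2-42pvb-worker-eastus2-chfl4   <none>   <none>
ovnkube-node-zg2xn   8/9   Running            23 (4m14s ago)   156m   10.0.128.232   zhsunaz2-42pvb-worker-eastus2-xsrnt   <none>   <none>
ovnkube-node-zmlqb   8/9   Error              23 (6m40s ago)   160m   10.0.128.65    zhsunaz2-42pvb-worker-eastus1-vx5t7   <none>   <none>
"""

if __name__ == "__main__":
    counts, failing = summarize_pods(SAMPLE)
    for status, count in sorted(counts.items()):
        print(f"{status}: {count}")
    print(f"{len(failing)} pod(s) not fully ready")
```

Note that pods reported as Running still show 8/9 containers ready, so they are counted as not fully ready as well.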
      Version-Release number of selected component (if applicable):

          4.16.0-0.nightly-2024-01-21-154905

      How reproducible:

          100%

      Steps to Reproduce:

          1. Set up a cluster with OVN on Azure.
          2. Scale up to 400 workers via the machinesets.
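The scale-up step can be scripted so that the per-zone replica counts are explicit and add up to the intended total. A minimal sketch, using the machineset names and replica counts from the description; the helper `scale_commands` is hypothetical and only builds the command strings, it does not run them.

```python
import shlex
from typing import Dict, List


def scale_commands(replicas_by_machineset: Dict[str, int]) -> List[str]:
    """Build one `oc scale machineset` command per machineset."""
    return [
        f"oc scale machineset {shlex.quote(name)} --replicas {count}"
        for name, count in replicas_by_machineset.items()
    ]


# Replica counts taken from the commands in the description.
TARGET = {
    "zhsunaz2-42pvb-worker-eastus1": 200,
    "zhsunaz2-42pvb-worker-eastus2": 100,
    "zhsunaz2-42pvb-worker-eastus3": 100,
}

if __name__ == "__main__":
    print(f"total workers requested: {sum(TARGET.values())}")
    for command in scale_commands(TARGET):
        print(command)
```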
      Actual results:

      Expected results:

      Additional info:

       


              rravaiol@redhat.com Riccardo Ravaioli
              zzhao1@redhat.com Zhanqi Zhao
              Anurag Saxena Anurag Saxena