OpenShift Bugs: OCPBUGS-71211

Flows are missing in table 66 and ovn-controller on the node showed > 90% CPU usage


      Description of problem:

      A node was cordoned after multiple workloads entered the "CrashLoopBackOff / error" state.
      At the same time, ovn-controller on the node showed > 90% CPU usage with long poll intervals. This intermittently affected scheduling stability and service availability for workloads.
      On this particular node, users noticed that some pods were unable to reach the Kubernetes API service.
      We traced the OVS flows used for these calls:
      
      $> ovs-appctl ofproto/trace br-int in_port=3967,tcp,dl_src=0a:58:0a:e1:dc:16,dl_dst=0a:58:0a:e1:dc:01,nw_src=10.225.220.22,nw_dst=10.224.0.1,tp_dst=443,nw_ttl=42,dp_hash=3 | tail -30
                  resubmit(,28)
              28. ip,metadata=0x5, priority 1, cookie 0xfe2c80da
                  push:NXM_NX_REG0[]
                  push:NXM_NX_XXREG0[96..127]
                  pop:NXM_NX_REG0[]
                   -> NXM_NX_REG0[] is now 0xa3c1201
                  set_field:00:00:00:00:00:00->eth_dst
                  resubmit(,66)
                  66. No match.
                          drop
                  pop:NXM_NX_REG0[]
                   -> NXM_NX_REG0[] is now 0xa3c1201
                  resubmit(,29)
              29. metadata=0x5, priority 0, cookie 0x1de89847
                  resubmit(,30)
              30. metadata=0x5, priority 0, cookie 0xcfe11f91
                  resubmit(,31)
              31. metadata=0x5, priority 0, cookie 0xd391eba8
                  resubmit(,32)
              32. ip,reg0=0xa3c1200/0xfffffe00,reg15=0x2,metadata=0x5, priority 110, cookie 0x3dc6457a
                  set_field:0/0xf0000000->reg10
                  resubmit(,33)
              33. ip,metadata=0x5,dl_dst=00:00:00:00:00:00, priority 100, cookie 0x81f3bdfb
                  controller(userdata=00.00.00.00.00.00.00.00.00.19.00.10.80.00.06.06.ff.ff.ff.ff.ff.ff.00.00.00.1c.00.18.00.20.00.40.00.00.00.00.00.01.de.10.80.00.2c.04.00.00.00.00.00.1c.00.18.00.20.00.60.00.00.00.00.00.01.de.10.80.00.2e.04.00.00.00.00.00.19.00.10.80.00.2a.02.00.01.00.00.00.00.00.00.ff.ff.00.10.00.00.23.20.00.0e.ff.f8.28.00.00.00,pause,meter_id=41)
          pop:NXM_OF_IN_PORT[]
           -> NXM_OF_IN_PORT[] is now 3967
      
      Final flow: recirc_id=0x4c6d735,dp_hash=0x3,eth,tcp,reg0=0x300,reg11=0x3,reg12=0x2,reg13=0x66,reg14=0xfbf,reg15=0x1,metadata=0x3,in_port=3967,vlan_tci=0x0000,dl_src=0a:58:0a:e1:dc:16,dl_dst=0a:58:0a:e1:dc:01,nw_src=10.225.220.22,nw_dst=10.19.137.104,nw_tos=0,nw_ecn=0,nw_ttl=42,nw_frag=no,tp_src=0,tp_dst=6443,tcp_flags=0
      Megaflow: pkt_mark=0,recirc_id=0x4c6d735,ct_state=+new-est-rel-rpl-inv+trk,ct_mark=0x2/0xf,eth,tcp,in_port=3967,dl_src=0a:58:0a:e1:dc:16,dl_dst=0a:58:0a:e1:dc:01,nw_src=10.225.220.16/28,nw_dst=10.19.137.104,nw_ttl=42,nw_frag=no,tp_dst=6443
      Datapath actions: ct(commit,zone=102,mark=0/0x1,nat(src)),check_pkt_len(size=1414,gt(sample(sample=100.0%,actions(meter(100),userspace(pid=4294967295,controller(reason=1,dont_send=1,continuation=0,recirc_id=80139629,rule_cookie=0x57ec6443,controller_id=0,max_len=65535))))),le(set(eth(src=0a:58:64:40:00:01,dst=0a:58:64:40:00:f3)),set(ipv4(ttl=41)),check_pkt_len(size=1414,gt(sample(sample=100.0%,actions(meter(100),userspace(pid=4294967295,controller(reason=1,dont_send=1,continuation=0,recirc_id=99,rule_cookie=0x9dc0c4f3,controller_id=0,max_len=65535))))),le(set(eth(src=7c:1e:52:fb:5f:d9,dst=00:00:00:00:00:00)),set(ipv4(ttl=40))))))
      
      Here, table 66, which contains the flows for ARP, has no matching flow, so the packet is sent to ovn-controller to resolve the ARP. On the other nodes, the flow is present in table 66 and the traffic works.
      This looks like a bug in ovn-controller (similar to OCPBUGS-53151), where ovn-controller fails to install the flows for ARP.
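
To make the check above repeatable on other nodes, the trace output can be filtered for tables that report "No match.". The following is a minimal sketch, not part of the original report: `no_match_tables` is a hypothetical helper name, and it assumes the trace text has the same layout as the output shown above.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): read ofproto/trace output on
# stdin and print each OpenFlow table number that reported "No match.".
no_match_tables() {
    # Lines of interest look like "   66. No match." with varying indentation.
    awk '/^[ \t]*[0-9]+\. No match/ {
        sub(/\..*/, "", $1)   # turn "66." into "66"
        print $1
    }' | sort -n -u
}

# Example usage on a node (assumes FLOW holds the flow description that was
# passed to the trace in the report):
#   ovs-appctl ofproto/trace br-int "$FLOW" | no_match_tables
```

On the affected node this would be expected to list table 66, while a healthy node should not list it for the same flow.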
          

      Version-Release number of selected component (if applicable):

      OpenShift 4.18
      

      How reproducible:

      One occurrence so far 

      Steps to Reproduce:

      Unknown
      

      Actual results:

      Flow is missing in table 66 on a node
      

      Expected results:

      Flows are installed on every node without issue
      

      Additional info:

          

              rh-ee-arsen Arkadeep Sen (Aurko)
              rh-support-fgrosjea Franck Grosjean
              Anurag Saxena Anurag Saxena