OCPBUGS-7856

[4.13] ovnkube pod crashed after enabling OVS hardware offload in baremetal cluster

Details

    • Bug
    • Resolution: Done
    • Critical
    • 4.13.0
    • 4.13
    • Networking / SR-IOV
    • None
    • Critical
    • No
    • NHE Sprint 233
    • 1
    • Rejected
    • False
    • NA

    Description

      Description of problem:

      After deploying a dual-stack OCP cluster with baremetal worker nodes and enabling OVS hardware offload by creating a SriovNetworkPoolConfig with the YAML below, the ovnkube-node pods on the baremetal workers crashed (CrashLoopBackOff).
      Running 'ovs-vsctl show' on the affected worker nodes shows that the physical interface is no longer attached to br-ex.
      
      # cat sriov_pool.yaml 
      
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkPoolConfig
      metadata:
        name: sriovnetworkpoolconfig-offload
        namespace: openshift-sriov-network-operator
      spec:
        ovsHardwareOffloadConfig:
          name: sriov
      
      
      # oc get pods -n openshift-ovn-kubernetes -o wide
      NAME                   READY   STATUS             RESTARTS         AGE     IP               NODE                                       NOMINATED NODE   READINESS GATES
      ovnkube-master-77mlf   6/6     Running            0                3h29m   192.168.111.20   master-0.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-master-fm2lb   6/6     Running            0                3h29m   192.168.111.22   master-2.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-master-skdmr   6/6     Running            2 (3h20m ago)    3h29m   192.168.111.21   master-1.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-node-7jqmp     5/5     Running            1 (3h28m ago)    3h29m   192.168.111.21   master-1.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-node-84dqq     5/5     Running            1 (3h28m ago)    3h29m   192.168.111.20   master-0.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-node-dvkkg     5/5     Running            1 (3h28m ago)    3h29m   192.168.111.22   master-2.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-node-gc6nd     5/5     Running            1 (3h9m ago)     3h9m    192.168.111.23   worker-0.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-node-gqm9t     5/5     Running            1 (3h8m ago)     3h9m    192.168.111.24   worker-1.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-node-jpfg6     4/5     CrashLoopBackOff   26 (4m13s ago)   153m    192.168.111.40   openshift-qe-025.lab.eng.rdu2.redhat.com   <none>           <none>
      ovnkube-node-svljb     4/5     CrashLoopBackOff   24 (2m15s ago)   152m    192.168.111.47   openshift-qe-029.lab.eng.rdu2.redhat.com   <none>           <none>
      
      sh-4.4# ovs-vsctl show
      91e93171-1a86-48a4-a2a5-22f958c39ae8
          Bridge br-int
              fail_mode: secure
              datapath_type: system
              Port br-int
                  Interface br-int
                      type: internal
              Port ovn-2a6cca-0
                  Interface ovn-2a6cca-0
                      type: geneve
                      options: {csum="true", key=flow, remote_ip="192.168.111.21"}
              Port ovn-f8b96d-0
                  Interface ovn-f8b96d-0
                      type: geneve
                      options: {csum="true", key=flow, remote_ip="192.168.111.23"}
              Port ovn-19750f-0
                  Interface ovn-19750f-0
                      type: geneve
                      options: {csum="true", key=flow, remote_ip="192.168.111.47"}
              Port ovn-7cc33c-0
                  Interface ovn-7cc33c-0
                      type: geneve
                      options: {csum="true", key=flow, remote_ip="192.168.111.24"}
              Port ovn-451129-0
                  Interface ovn-451129-0
                      type: geneve
                      options: {csum="true", key=flow, remote_ip="192.168.111.22"}
              Port ovn-fe9b22-0
                  Interface ovn-fe9b22-0
                      type: geneve
                      options: {csum="true", key=flow, remote_ip="192.168.111.20"}
              Port ovn-k8s-mp0
                  Interface ovn-k8s-mp0
                      type: internal
              Port patch-br-int-to-br-ex_openshift-qe-025.lab.eng.rdu2.redhat.com
                  Interface patch-br-int-to-br-ex_openshift-qe-025.lab.eng.rdu2.redhat.com
                      type: patch
                      options: {peer=patch-br-ex_openshift-qe-025.lab.eng.rdu2.redhat.com-to-br-int}
          Bridge br-ex
              Port patch-br-ex_openshift-qe-025.lab.eng.rdu2.redhat.com-to-br-int
                  Interface patch-br-ex_openshift-qe-025.lab.eng.rdu2.redhat.com-to-br-int
                      type: patch
                      options: {peer=patch-br-int-to-br-ex_openshift-qe-025.lab.eng.rdu2.redhat.com}
          ovs_version: "2.17.6"
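The `ovs-vsctl show` output above lists br-ex with only a patch port. That check can be scripted. The following is a minimal Python sketch and is not part of the original report. The function names `ports_by_bridge` and `has_uplink` are hypothetical, and `SAMPLE` is a trimmed copy of the output above with the node name shortened to `node`. It assumes that an interface printed without a `type:` line, or with `type: system`, is a physical or kernel device.

```python
import re

# Trimmed from the output in this report; the node name is shortened to "node".
SAMPLE = """\
91e93171-1a86-48a4-a2a5-22f958c39ae8
    Bridge br-int
        fail_mode: secure
        datapath_type: system
        Port br-int
            Interface br-int
                type: internal
        Port ovn-2a6cca-0
            Interface ovn-2a6cca-0
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.111.21"}
    Bridge br-ex
        Port patch-br-ex_node-to-br-int
            Interface patch-br-ex_node-to-br-int
                type: patch
                options: {peer=patch-br-int-to-br-ex_node}
    ovs_version: "2.17.6"
"""


def ports_by_bridge(show_output: str) -> dict[str, list[tuple[str, str]]]:
    """Map each bridge name to a list of (port name, interface type) pairs."""
    bridges: dict[str, list[tuple[str, str]]] = {}
    bridge = None
    port = None
    for raw in show_output.splitlines():
        line = raw.strip()
        if m := re.match(r'Bridge "?([^"\s]+)"?$', line):
            bridge, port = m.group(1), None
            bridges[bridge] = []
        elif (m := re.match(r'Port "?([^"\s]+)"?$', line)) and bridge is not None:
            port = m.group(1)
            # The type stays empty unless a "type:" line follows for this port.
            bridges[bridge].append((port, ""))
        elif (m := re.match(r"type: (\S+)$", line)) and bridge is not None and port is not None:
            bridges[bridge][-1] = (port, m.group(1))
    return bridges


def has_uplink(bridges: dict[str, list[tuple[str, str]]], bridge: str) -> bool:
    """True if the bridge has at least one port of type system (or with no type)."""
    return any(kind in ("", "system") for _, kind in bridges.get(bridge, []))


if __name__ == "__main__":
    parsed = ports_by_bridge(SAMPLE)
    print("br-ex ports:", parsed.get("br-ex"))
    print("br-ex has a physical uplink:", has_uplink(parsed, "br-ex"))
```

On a live node the input would come from `ovs-vsctl show` itself. The check is applied to br-ex only because br-int does not necessarily carry a system-type port.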

      Version-Release number of selected component (if applicable):

      4.13

      How reproducible:

       

      Steps to Reproduce:

      1. Deploy a dual-stack cluster and add baremetal hosts as worker nodes.
      2. Install the SR-IOV Network Operator.
      3. Enable OVS hardware offload by creating the SriovNetworkPoolConfig below:
      # cat sriov_pool.yaml
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkPoolConfig
      metadata:
        name: sriovnetworkpoolconfig-offload
        namespace: openshift-sriov-network-operator
      spec:
        ovsHardwareOffloadConfig:
          name: sriov
      4. Check the ovnkube pods.
      5. Run 'ovs-vsctl show' on a worker node.
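The pod check in step 4 can be automated by filtering the `oc get pods` table for pods that are not Running or not fully ready. This is a minimal Python sketch under stated assumptions and is not part of the original report. The function name `unhealthy_pods` is hypothetical, and `SAMPLE` holds two rows copied from the output in this report with the trailing columns dropped. It assumes the first three columns are NAME, READY and STATUS, as in the default `oc get pods` layout.

```python
# Two rows taken from this report; columns after AGE are omitted for brevity.
SAMPLE = """\
NAME                   READY   STATUS             RESTARTS         AGE
ovnkube-node-gqm9t     5/5     Running            1 (3h8m ago)     3h9m
ovnkube-node-jpfg6     4/5     CrashLoopBackOff   26 (4m13s ago)   153m
"""


def unhealthy_pods(table: str) -> list[tuple[str, str, str]]:
    """Return (name, ready, status) for pods that are not Running or not fully ready."""
    rows = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or truncated lines
        name, ready, status = fields[:3]
        ready_now, _, total = ready.partition("/")
        if status != "Running" or ready_now != total:
            rows.append((name, ready, status))
    return rows


if __name__ == "__main__":
    for name, ready, status in unhealthy_pods(SAMPLE):
        print(f"{name}: {ready} {status}")
```

The input is expected to be the raw table only, without shell prompt lines.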
      
      

      Actual results:

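      The ovnkube-node pods on the baremetal workers enter CrashLoopBackOff (4/5 containers ready), and the physical interface is no longer under br-ex.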

      Expected results:

      ovnkube pods should not crash

      Additional info:

      ovnkube-node pod logs (oc logs defaulted to the ovn-controller container):
      [root@openshift-qe-026 offload]# oc get pods -n openshift-ovn-kubernetes -o wide
      NAME                   READY   STATUS             RESTARTS        AGE    IP               NODE                                       NOMINATED NODE   READINESS GATES
      ovnkube-master-77mlf   6/6     Running            0               3h2m   192.168.111.20   master-0.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-master-fm2lb   6/6     Running            0               3h2m   192.168.111.22   master-2.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-master-skdmr   6/6     Running            2 (173m ago)    3h2m   192.168.111.21   master-1.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-node-7jqmp     5/5     Running            1 (3h ago)      3h2m   192.168.111.21   master-1.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-node-84dqq     5/5     Running            1 (3h ago)      3h2m   192.168.111.20   master-0.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-node-dvkkg     5/5     Running            1 (3h ago)      3h2m   192.168.111.22   master-2.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-node-gc6nd     5/5     Running            1 (162m ago)    162m   192.168.111.23   worker-0.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-node-gqm9t     5/5     Running            1 (161m ago)    162m   192.168.111.24   worker-1.offload.openshift-qe.sdn.com      <none>           <none>
      ovnkube-node-jpfg6     4/5     CrashLoopBackOff   21 (2m8s ago)   125m   192.168.111.40   openshift-qe-025.lab.eng.rdu2.redhat.com   <none>           <none>
      ovnkube-node-svljb     4/5     CrashLoopBackOff   19 (25s ago)    124m   192.168.111.47   openshift-qe-029.lab.eng.rdu2.redhat.com   <none>           <none>
      [root@openshift-qe-026 offload]# 
      
      [root@openshift-qe-026 offload]# oc logs ovnkube-node-jpfg6 -n openshift-ovn-kubernetes 
      Defaulted container "ovn-controller" out of: ovn-controller, ovn-acl-logging, kube-rbac-proxy, kube-rbac-proxy-ovn-metrics, ovnkube-node
      2023-02-22T06:15:34+00:00 - starting ovn-controller
      2023-02-22T06:15:34Z|00001|vlog|INFO|opened log file /var/log/ovn/acl-audit-log.log
      2023-02-22T06:15:34.651Z|00002|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
      2023-02-22T06:15:34.651Z|00003|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
      2023-02-22T06:15:34.653Z|00004|main|INFO|OVN internal version is : [22.12.1-20.27.0-70.6]
      2023-02-22T06:15:34.653Z|00005|main|INFO|OVS IDL reconnected, force recompute.
      2023-02-22T06:15:34.656Z|00006|reconnect|INFO|ssl:192.168.111.21:9642: connecting...
      2023-02-22T06:15:34.656Z|00007|main|INFO|OVNSB IDL reconnected, force recompute.
      2023-02-22T06:15:34.661Z|00008|reconnect|INFO|ssl:192.168.111.21:9642: connected
      2023-02-22T06:15:34.748Z|00009|features|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
      2023-02-22T06:15:34.748Z|00010|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
      2023-02-22T06:15:34.750Z|00011|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
      2023-02-22T06:15:34.750Z|00012|features|INFO|OVS Feature: ct_zero_snat, state: supported
      2023-02-22T06:15:34.750Z|00013|main|INFO|OVS feature set changed, force recompute.
      2023-02-22T06:15:34.750Z|00014|ofctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
      2023-02-22T06:15:34.750Z|00015|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
      2023-02-22T06:15:34.751Z|00016|main|INFO|OVS feature set changed, force recompute.
      2023-02-22T06:15:34.751Z|00017|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
      2023-02-22T06:15:34.751Z|00018|binding|INFO|Releasing lport openshift-network-diagnostics_network-check-target-zl74q from this chassis (sb_readonly=0)
      2023-02-22T06:15:34.751Z|00019|if_status|WARN|Trying to release unknown interface openshift-network-diagnostics_network-check-target-zl74q
      2023-02-22T06:15:34.751Z|00020|binding|INFO|Releasing lport openshift-multus_network-metrics-daemon-jxqlb from this chassis (sb_readonly=0)
      2023-02-22T06:15:34.751Z|00021|binding|INFO|Releasing lport openshift-ingress-canary_ingress-canary-mnxvc from this chassis (sb_readonly=0)
      2023-02-22T06:15:34.751Z|00022|binding|INFO|Releasing lport openshift-cluster-csi-drivers_shared-resource-csi-driver-node-s2fdx from this chassis (sb_readonly=0)
      2023-02-22T06:15:34.751Z|00023|binding|INFO|Releasing lport openshift-dns_dns-default-rf5cq from this chassis (sb_readonly=0)
      2023-02-22T06:15:34.789Z|00001|pinctrl(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
      2023-02-22T06:15:34.789Z|00002|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
      2023-02-22T06:15:34.789Z|00003|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
      2023-02-22T06:15:44.757Z|00024|memory|INFO|26836 kB peak resident set size after 10.1 seconds
      2023-02-22T06:15:44.757Z|00025|memory|INFO|idl-cells-OVN_Southbound:39984 idl-cells-Open_vSwitch:644 lflow-cache-entries-cache-expr:563 lflow-cache-entries-cache-matches:817 lflow-cache-size-KB:1467 local_datapath_usage-KB:1 ofctrl_desired_flow_usage-KB:682 ofctrl_installed_flow_usage-KB:528 ofctrl_sb_flow_ref_usage-KB:305
      2023-02-22T06:16:07.540Z|00026|lflow_cache|INFO|Detected cache inactivity (last active 30004 ms ago): trimming cache
      2023-02-22T06:17:46.291Z|00027|lflow_cache|INFO|Detected cache inactivity (last active 30002 ms ago): trimming cache
      2023-02-22T06:20:06.185Z|00028|lflow_cache|INFO|Detected cache inactivity (last active 30002 ms ago): trimming cache
      2023-02-22T06:24:08.043Z|00029|lflow_cache|INFO|Detected cache inactivity (last active 30004 ms ago): trimming cache
      2023-02-22T06:26:26.560Z|00030|lflow_cache|INFO|Detected cache inactivity (last active 30001 ms ago): trimming cache
      2023-02-22T06:28:03.488Z|00031|lflow_cache|INFO|Detected cache inactivity (last active 30004 ms ago): trimming cache
      2023-02-22T06:31:18.754Z|00032|lflow_cache|INFO|Detected cache inactivity (last active 30004 ms ago): trimming cache
      2023-02-22T06:34:14.578Z|00033|lflow_cache|INFO|Detected cache inactivity (last active 30005 ms ago): trimming cache
      2023-02-22T06:36:54.982Z|00034|lflow_cache|INFO|Detected cache inactivity (last active 30005 ms ago): trimming cache
      2023-02-22T06:42:19.599Z|00035|lflow_cache|INFO|Detected cache inactivity (last active 30004 ms ago): trimming cache
      2023-02-22T06:44:34.134Z|00036|lflow_cache|INFO|Detected cache inactivity (last active 30004 ms ago): trimming cache
      2023-02-22T06:45:37.803Z|00037|lflow_cache|INFO|Detected cache inactivity (last active 30002 ms ago): trimming cache
      [root@openshift-qe-026 offload]# 
      [root@openshift-qe-026 offload]# 
      [root@openshift-qe-026 offload]# oc logs ovnkube-node-svljb -n openshift-ovn-kubernetes
      Defaulted container "ovn-controller" out of: ovn-controller, ovn-acl-logging, kube-rbac-proxy, kube-rbac-proxy-ovn-metrics, ovnkube-node
      2023-02-22T06:27:30+00:00 - starting ovn-controller
      2023-02-22T06:27:30Z|00001|vlog|INFO|opened log file /var/log/ovn/acl-audit-log.log
      2023-02-22T06:27:30.480Z|00002|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
      2023-02-22T06:27:30.480Z|00003|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
      2023-02-22T06:27:30.482Z|00004|main|INFO|OVN internal version is : [22.12.1-20.27.0-70.6]
      2023-02-22T06:27:30.482Z|00005|main|INFO|OVS IDL reconnected, force recompute.
      2023-02-22T06:27:30.486Z|00006|reconnect|INFO|ssl:192.168.111.22:9642: connecting...
      2023-02-22T06:27:30.486Z|00007|main|INFO|OVNSB IDL reconnected, force recompute.
      2023-02-22T06:27:30.502Z|00008|reconnect|INFO|ssl:192.168.111.22:9642: connected
      2023-02-22T06:27:30.600Z|00009|features|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
      2023-02-22T06:27:30.600Z|00010|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
      2023-02-22T06:27:30.604Z|00011|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
      2023-02-22T06:27:30.604Z|00012|features|INFO|OVS Feature: ct_zero_snat, state: supported
      2023-02-22T06:27:30.604Z|00013|main|INFO|OVS feature set changed, force recompute.
      2023-02-22T06:27:30.604Z|00014|ofctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
      2023-02-22T06:27:30.604Z|00015|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
      2023-02-22T06:27:30.605Z|00016|main|INFO|OVS feature set changed, force recompute.
      2023-02-22T06:27:30.605Z|00017|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
      2023-02-22T06:27:30.605Z|00018|binding|INFO|Releasing lport openshift-ingress-canary_ingress-canary-sdqkp from this chassis (sb_readonly=0)
      2023-02-22T06:27:30.605Z|00019|if_status|WARN|Trying to release unknown interface openshift-ingress-canary_ingress-canary-sdqkp
      2023-02-22T06:27:30.605Z|00020|binding|INFO|Releasing lport openshift-network-diagnostics_network-check-target-9mvkj from this chassis (sb_readonly=0)
      2023-02-22T06:27:30.605Z|00021|binding|INFO|Releasing lport openshift-dns_dns-default-jlq8p from this chassis (sb_readonly=0)
      2023-02-22T06:27:30.605Z|00022|binding|INFO|Releasing lport openshift-cluster-csi-drivers_shared-resource-csi-driver-node-qfj6f from this chassis (sb_readonly=0)
      2023-02-22T06:27:30.605Z|00023|binding|INFO|Releasing lport openshift-multus_network-metrics-daemon-ghnsn from this chassis (sb_readonly=0)
      2023-02-22T06:27:30.651Z|00001|pinctrl(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
      2023-02-22T06:27:30.651Z|00002|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
      2023-02-22T06:27:30.651Z|00003|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
      2023-02-22T06:27:40.509Z|00024|memory|INFO|24848 kB peak resident set size after 10.0 seconds
      2023-02-22T06:27:40.509Z|00025|memory|INFO|idl-cells-OVN_Southbound:39933 idl-cells-Open_vSwitch:644 lflow-cache-entries-cache-expr:563 lflow-cache-entries-cache-matches:817 lflow-cache-size-KB:1467 local_datapath_usage-KB:1 ofctrl_desired_flow_usage-KB:675 ofctrl_installed_flow_usage-KB:521 ofctrl_sb_flow_ref_usage-KB:303
      2023-02-22T06:28:03.486Z|00026|lflow_cache|INFO|Detected cache inactivity (last active 30003 ms ago): trimming cache
      2023-02-22T06:31:18.753Z|00027|lflow_cache|INFO|Detected cache inactivity (last active 30003 ms ago): trimming cache
      2023-02-22T06:34:14.578Z|00028|lflow_cache|INFO|Detected cache inactivity (last active 30004 ms ago): trimming cache
      2023-02-22T06:36:54.981Z|00029|lflow_cache|INFO|Detected cache inactivity (last active 30005 ms ago): trimming cache
      2023-02-22T06:42:19.598Z|00030|lflow_cache|INFO|Detected cache inactivity (last active 30004 ms ago): trimming cache
      2023-02-22T06:44:34.134Z|00031|lflow_cache|INFO|Detected cache inactivity (last active 30003 ms ago): trimming cache
      2023-02-22T06:45:37.803Z|00032|lflow_cache|INFO|Detected cache inactivity (last active 30001 ms ago): trimming cache
      [root@openshift-qe-026 offload]# 
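The logs above come from the ovn-controller container, which `oc logs` selected by default. The affected pods report 4/5 ready, so one container is failing, and the report does not say which. One way to find it is to read `status.containerStatuses` from the pod's JSON. This is a minimal Python sketch and is not part of the original report. The function name `not_ready_containers` is hypothetical, and the sample JSON is invented for illustration with placeholder container names; it is not data from this cluster.

```python
import json

# Hypothetical pod status for illustration only; not taken from this cluster.
SAMPLE_POD_JSON = """
{
  "status": {
    "containerStatuses": [
      {"name": "container-a", "ready": true, "restartCount": 0,
       "state": {"running": {}}},
      {"name": "container-b", "ready": false, "restartCount": 26,
       "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
    ]
  }
}
"""


def not_ready_containers(pod_json: str) -> list[tuple[str, int, str]]:
    """Return (name, restart count, waiting reason) for each container that is not ready."""
    pod = json.loads(pod_json)
    result = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        if not cs.get("ready", False):
            waiting = cs.get("state", {}).get("waiting") or {}
            result.append((cs["name"], cs.get("restartCount", 0), waiting.get("reason", "")))
    return result


if __name__ == "__main__":
    for name, restarts, reason in not_ready_containers(SAMPLE_POD_JSON):
        print(f"{name}: restarts={restarts} reason={reason}")
```

The JSON would come from `oc get pod <pod> -n openshift-ovn-kubernetes -o json`. Once the failing container is known, `oc logs <pod> -n openshift-ovn-kubernetes -c <container> --previous` should show the output of its last terminated instance.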
      
      
      must-gather logs: https://file.apac.redhat.com/~yingwang/must-gather.tar.gz 
      
      

            People

              wizhao@redhat.com William Zhao
              rhn-support-yingwang Ying Wang
              Ying Wang Ying Wang