  1. OpenShift Bugs
  2. OCPBUGS-17564

VF getting removed from bond when pod level bonding is used


    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • 4.10
    • Networking / SR-IOV
    • Critical
    • No
    • 2
    • NHE Sprint 247
    • 1
    • False

      Description of problem:

      A VF is removed from the bond when pod-level bonding is used.
      
      Active-Passive bond on 
      
      
      Affected pod: fms-gateway-cmdty-6ddfb6cd7d-zt48w in the vig-dev namespace

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Occurring in the customer environment.

      Steps to Reproduce:

      1. Configure an Active-Passive bond using 2 VFs.
      2. Run cat /proc/net/bonding/net3 from inside the pod.
      3. Observe that the device gets disconnected from the bond.
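
The check in step 2 can be scripted when many pods need to be inspected. The following is a minimal sketch, assuming the file follows the usual Linux bonding driver layout ("Bonding Mode:", "Currently Active Slave:", and one "Slave Interface:" section per member). The function names `parse_bonding_status` and `missing_slaves` are hypothetical helpers, not part of any OpenShift tooling.

```python
def parse_bonding_status(text):
    """Summarize the text of /proc/net/bonding/<bond>.

    Returns the bonding mode, the currently active slave, and the
    MII status of every slave section found in the text.
    """
    summary = {"mode": None, "active_slave": None, "slaves": {}}
    current = None  # name of the slave section being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        # Split on the first colon only, so MAC addresses stay intact.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            summary["mode"] = value
        elif key == "Currently Active Slave":
            summary["active_slave"] = value
        elif key == "Slave Interface":
            current = value
            summary["slaves"][current] = {"mii_status": None}
        elif key == "MII Status" and current is not None:
            # The bond-level "MII Status" appears before any slave
            # section and is skipped because current is still None.
            summary["slaves"][current]["mii_status"] = value
    return summary


def missing_slaves(summary, expected):
    """Return the expected slave names that are absent from the bond."""
    return [name for name in expected if name not in summary["slaves"]]
```

To use it, pass the output of the cat command from step 2 to `parse_bonding_status`, then call `missing_slaves` with the link names the bond is expected to contain (net1 and net2 in this report).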
      

      Actual results:

      The VF interface goes link down and gets renamed:
      [7203165.020151] mlx5_core 0000:98:07.4 net1: Link up
      [7203165.512579] mlx5_core 0000:98:1e.0 net2: Link up
      [7203166.047239] mlx5_core 0000:98:1e.0 ens2f1v111: renamed from net2
      [7203166.547277] mlx5_core 0000:98:07.4 ens2f0v58: renamed from net1
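
To correlate pod interface names with the names the VFs were renamed to, the rename messages can be extracted from the kernel log. This is a rough sketch that assumes the messages are formatted exactly like the mlx5_core lines in the excerpt above; `find_renames` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   mlx5_core 0000:98:1e.0 ens2f1v111: renamed from net2
RENAME_RE = re.compile(
    r"mlx5_core (?P<pci>\S+) (?P<new>\S+): renamed from (?P<old>\S+)"
)


def find_renames(log_text):
    """Return one dict (pci, new, old) per 'renamed from' message."""
    return [m.groupdict() for m in RENAME_RE.finditer(log_text)]
```

Applied to the excerpt above, this would report that net2 became ens2f1v111 on 0000:98:1e.0 and net1 became ens2f0v58 on 0000:98:07.4, while ignoring the "Link up" lines.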
      
      The following is visible in the events:
      KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_yield-curve-scheduler-7f85fbc7b5-dkl7x_vig-dev_e4055a4b-99f7-4bd4-9c9c-193c81aa1265_0(ce6c2ffd61f63887bf4ebcaf5e8c220dbe30a9fbb49591146b50792efbf801b0): error removing pod vig-dev_yield-curve-scheduler-7f85fbc7b5-dkl7x from CNI network \"multus-cni-network\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): delegateDel: error invoking DelegateDel - \"bond\": error in getting result from DelNetwork: Failed to retrieve link objects from configuration file (&{NetConf:{CNIVersion:0.3.1 Name:bond-net1 Type:bond Capabilities:map[] IPAM:{Type:whereabouts} DNS:{Nameservers:[] Domain: Search:[] Options:[]} RawPrevResult:map[] PrevResult:<nil>} Mode:active-backup LinksContNs:true FailOverMac:1 Miimon:100 Links:[map[name:net1] map[name:net2]] MTU:1500}), error: Failed to confirm that link (net1) exists, error: Failed to lookup link name net1, error: Link not found / delegateDel: error invoking DelegateDel - \"sriov\": error in getting result from DelNetwork: failed to get netlink device with name net2: \"Link not found\" / delegateDel: error invoking DelegateDel - \"sriov\": error in getting result from DelNetwork: failed to get netlink device with name net1: \"Link not found\""
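
When triaging many such events, it can help to pull out which links the CNI plugins could not find. The sketch below assumes the two message fragments seen in the event above ("Failed to lookup link name ..." from the bond plugin and "failed to get netlink device with name ..." from the sriov plugin); other plugin versions may word these errors differently. `missing_links` is a hypothetical helper name.

```python
import re

# Message fragments taken from the event text in this report.
PATTERNS = (
    re.compile(r"Failed to lookup link name ([\w.-]+)"),
    re.compile(r"failed to get netlink device with name ([\w.-]+)"),
)


def missing_links(event_message):
    """Return the sorted, de-duplicated link names reported as not found."""
    names = set()
    for pattern in PATTERNS:
        names.update(pattern.findall(event_message))
    return sorted(names)
```

For the event above, both bond members (net1 and net2) would be reported, which is consistent with the VFs having already been renamed before the sandbox teardown ran.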

      Expected results:

      Pod functioning normally with net3 bond working

      Additional info:

      An nmstate operator auto-update last night at ~8:40 PM Central Time seems to have triggered this issue on multiple production clusters.

              wizhao@redhat.com William Zhao
              rhn-support-adubey Akash Dubey
              Evgeny Levin
              Salvatore Daniele