OpenShift Bugs / OCPBUGS-14604

Pod deletion stuck because bond-cni reports "Link not found" wrongly

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • 4.12
    • Networking / SR-IOV
    • None
    • Moderate
    • No
    • False

    Description

      Description of problem: 

      Deleting a Pod that uses bond-cni fails, and the Pod stays stuck in Terminating status:

      Events:
        Type     Reason             Age                   From     Message
        ----     ------             ----                  ----     -------
        Normal   Killing            161m                  kubelet  Stopping container vru-cudr-dsa-mp
        Warning  FailedPreStopHook  160m                  kubelet  Exec lifecycle hook ([/home/mcm_prestop]) for Container "vru-cudr-dsa-mp" in Pod "sc-cudr-dsa-mp-0-0-1-0_uspp-ft-5(19bb5a66-773a-4fd6-9c6c-12c27826b254)" failed - error: command '/home/mcm_prestop' exited with 137: , message: "send prestop msg success.\r\n"
        Warning  FailedKillPod      160m                  kubelet  error killing pod: failed to "KillPodSandbox" for "19bb5a66-773a-4fd6-9c6c-12c27826b254" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_sc-cudr-dsa-mp-0-0-1-0_uspp-ft-5_19bb5a66-773a-4fd6-9c6c-12c27826b254_0(2883ce6bac3e986aeeb4ad167f0878230c89ce422c161d3419d11b3c7d4df59b): error removing pod uspp-ft-5_sc-cudr-dsa-mp-0-0-1-0 from CNI network \"multus-cni-network\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): delegateDel: error invoking DelegateDel - \"bond\": error in getting result from DelNetwork: Failed to retrieve link objects from configuration file (&{NetConf:{CNIVersion:0.3.1 Name:uspp-ft-5-bond-net-sig-kernel Type:bond Capabilities:map[] IPAM:{Type:whereabouts} DNS:{Nameservers:[] Domain: Search:[] Options:[]} RawPrevResult:map[] PrevResult:<nil>} Mode:active-backup LinksContNs:true FailOverMac:1 Miimon:100 Links:[map[name:svc-sigk-left0] map[name:svc-sigk-left1]] MTU:1800}), error: Failed to confirm that link (svc-sigk-left0) exists, error: Failed to lookup link name svc-sigk-left0, error: Link not found / delegateDel: error invoking DelegateDel - \"sriov\": error in getting result from DelNetwork: failed to get netlink device with name svc-sigk-left1: \"Link not found\" / delegateDel: error invoking DelegateDel - \"sriov\": error in getting result from DelNetwork: failed to get netlink device with name svc-sigk-left0: \"Link not found\""
        Warning  FailedKillPod      59s (x557 over 160m)  kubelet  error killing pod: failed to "KillPodSandbox" for "19bb5a66-773a-4fd6-9c6c-12c27826b254" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_sc-cudr-dsa-mp-0-0-1-0_uspp-ft-5_19bb5a66-773a-4fd6-9c6c-12c27826b254_0(2883ce6bac3e986aeeb4ad167f0878230c89ce422c161d3419d11b3c7d4df59b): error removing pod uspp-ft-5_sc-cudr-dsa-mp-0-0-1-0 from CNI network \"multus-cni-network\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): delegateDel: error invoking DelegateDel - \"bond\": error in getting result from DelNetwork: Failed to retrieve link objects from configuration file (&{NetConf:{CNIVersion:0.3.1 Name:uspp-ft-5-bond-net-sig-kernel Type:bond Capabilities:map[] IPAM:{Type:whereabouts} DNS:{Nameservers:[] Domain: Search:[] Options:[]} RawPrevResult:map[] PrevResult:<nil>} Mode:active-backup LinksContNs:true FailOverMac:1 Miimon:100 Links:[map[name:svc-sigk-left0] map[name:svc-sigk-left1]] MTU:1800}), error: Failed to confirm that link (svc-sigk-left0) exists, error: Failed to lookup link name svc-sigk-left0, error: Link not found"
      
      

       

      "Failed to confirm that link" comes from here: https://github.com/openshift/bond-cni/blob/release-4.12/bond/bond.go#L95-L96

                  _, ok := err.(netlink.LinkNotFoundError)
                  if !ok || !isDel || !bondConf.LinksContNs {
                      return nil, fmt.Errorf("Failed to confirm that link (%+v) exists, error: %+v", linkName, err)
                  }
              } else {

      As the error message above shows, the underlying error was "Link not found", so the type assertion
      err.(netlink.LinkNotFoundError) should set ok to true. This is a DEL command, so isDel should also be true. And in the net-attach-def, "linksInContainer" is set to true:

      spec:
        config: '{
            "type": "bond",
            "cniVersion": "0.3.1",
            "name": "uspp-ft-5-bond-net-sig-kernel",
            "mode": "active-backup",
            "failOverMac": 1,
            "linksInContainer": true,
            "miimon": "100",
            "mtu": 1800,
            "links": [
              {"name": "svc-sigk-left0"},
              {"name": "svc-sigk-left1"}
            ],
            "ipam": {
              "type": "whereabouts",
              "range": "193.21.10.0/24",
              "range_end": "193.21.10.253",
              "range_start": "193.21.10.101",
              "gateway": "193.21.10.1"
            }
          }'
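      As a quick cross-check of the config above, the following minimal sketch decodes the same JSON and reads the "linksInContainer" flag. The struct bondConfig and the helper parseBondConfig are hypothetical names made up for this illustration; only the JSON keys are taken from the net-attach-def, and this is not the plugin's own parsing code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bondConfig is a hypothetical, reduced view of the bond config.
// Only the JSON keys are taken from the net-attach-def in this report.
type bondConfig struct {
	Name        string              `json:"name"`
	Mode        string              `json:"mode"`
	LinksContNs bool                `json:"linksInContainer"`
	Links       []map[string]string `json:"links"`
}

// nadConfig is the "config" string from the net-attach-def above.
const nadConfig = `{
  "type": "bond", "cniVersion": "0.3.1",
  "name": "uspp-ft-5-bond-net-sig-kernel",
  "mode": "active-backup", "failOverMac": 1,
  "linksInContainer": true, "miimon": "100", "mtu": 1800,
  "links": [ {"name": "svc-sigk-left0"}, {"name": "svc-sigk-left1"} ],
  "ipam": { "type": "whereabouts", "range": "193.21.10.0/24",
            "range_end": "193.21.10.253", "range_start": "193.21.10.101",
            "gateway": "193.21.10.1" }
}`

// parseBondConfig decodes the JSON; keys not present in bondConfig are ignored.
func parseBondConfig(raw string) (bondConfig, error) {
	var c bondConfig
	err := json.Unmarshal([]byte(raw), &c)
	return c, err
}

func main() {
	c, err := parseBondConfig(nadConfig)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("linksInContainer=%v links=%v\n", c.LinksContNs, c.Links)
}
```

      This agrees with the LinksContNs:true value printed in the event log, so the flag does not appear to be lost during parsing.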

      Given these three values, the condition "if !ok || !isDel || !bondConf.LinksContNs" should evaluate to false and the code should not enter this branch. The event log shows that it did.

      Version-Release number of selected component (if applicable):

      4.12.15
      sriov-network-operator.v4.12.0-202305101515
      Pod using bond-cni with VF as slaves

      How reproducible:

      Intermittently, at the customer's site

          People

            bnemeth@redhat.com Balazs Nemeth
            rhn-support-cchen Chen Chen
            Zhanqi Zhao