OCPBUGS-37212
pod deletion doesn't occur fast enough, resulting in the new pod's multus interface failing IPv6 duplicate address detection


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: 4.14
    • Component: Networking / multus
    • Severity: Critical
      This bug is critically impeding the Nokia UDM redundancy tests, which are essential for ensuring system reliability and stability before going live. Until this issue is resolved, we cannot proceed with these final tests, risking potential delays in the go-live.

      Description of problem:

      On pod deletion, cleanup intermittently takes too long, resulting in the replacement pod's multus interface failing IPv6 duplicate address detection (DAD).
      
      Sample reproduction:
      - Worker 14 begins to remove the pod at 14:21:14:
      
      Jul 17 14:21:14 worker14 kubenswrapper[9796]: I0717 14:21:14.904545    9796 kubelet.go:2441] "SyncLoop DELETE" source="api" pods=[NAMESPACE/POD]
      
      - Worker 19 begins to add the pod at 14:21:14:
      
      Jul 17 14:21:14 worker19 kubenswrapper[9438]: I0717 14:21:14.952931    9438 kubelet.go:2425] "SyncLoop ADD" source="api" pods=[NAMESPACE/POD]
      
      - Worker 19 tries adding the network to the pod at Jul 17 14:21:15:
      
      Jul 17 14:21:15 worker19 crio[9376]: time="2024-07-17 14:21:15.294568336Z" level=info msg="Adding pod NAMESPACE/POD to CNI network \"multus-cni-network\" (type=multus-shim)"
      
      - But the network add hits an IPv6 DAD failure at 14:21:17:
      
      Jul 17 14:21:17 worker19 kernel: IPv6: eth1: IPv6 duplicate address <IPv6_ADDRESS> used by <MAC> detected!
      
      - Meanwhile, worker 14 does not finish tearing down the original pod and its netns until around 14:21:38:
      
      Jul 17 14:21:37 worker14 crio[9601]: time="2024-07-17 14:21:37.789184337Z" level=info msg="Got pod network &{Name:<POD> Namespace:<NAMESPACE> ID:a36d6da2c26fb668b3d9a665544ae25629377656b180bd3db2b4e199c59f9793 UID:9b7db4ae-b0bc-4987-ac57-35d3c42afdb3 NetNS:/var/run/netns/9b37d0a3-61c9-4b57-b5ea-51e1964b58c0 Networks:[{Name:multus-cni-network Ifname:eth0}] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
      Jul 17 14:21:37 worker14 crio[9601]: time="2024-07-17 14:21:37.789403797Z" level=info msg="Deleting pod <POD> from CNI network \"multus-cni-network\" (type=multus-shim)"
      Jul 17 14:21:38 worker14 kubenswrapper[9796]: I0717 14:21:38.924580    9796 kubelet.go:2441] "SyncLoop DELETE" source="api" pods=[NAMESPACE/POD]
      Jul 17 14:21:38 worker14 kubenswrapper[9796]: I0717 14:21:38.936882    9796 kubelet.go:2435] "SyncLoop REMOVE" source="api" pods=[NAMESPACE/POD]
      
      This is clearly a timing issue: the replacement pod tries to assign its IPv6 address before the original pod's network has been cleaned up.
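      A rough way to confirm the overlap during a live reproduction is sketched below; the timestamps and netns ID are taken from the sample above and are only illustrative, so adjust them to the actual pod:
      
      # On worker14 (old pod): when did CRI-O start deleting the pod from the CNI network?
      $ journalctl -u crio --since "2024-07-17 14:21:00" --until "2024-07-17 14:22:00" | grep 'Deleting pod .* from CNI network'
      
      # On worker19 (new pod): when did the kernel report the duplicate address?
      $ journalctl -k --since "2024-07-17 14:21:00" --until "2024-07-17 14:22:00" | grep -i 'duplicate address'
      
      # On worker14: is the original pod's netns still present while the new pod is attempting DAD?
      $ ip netns list | grep 9b37d0a3-61c9-4b57-b5ea-51e1964b58c0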
      
      

      Version-Release number of selected component (if applicable):

          4.14

      How reproducible:

          Intermittent, but can be reproduced reliably

      Steps to Reproduce:

      - Delete a pod.
      - Wait for the pod to be rescheduled and log on to the new worker.
      - Determine the pod's network namespace:
      - - $ for ns in $(ip netns | awk '{print $1}'); do if ip netns exec $ns ip a | grep -iq 'IP'; then echo $ns; fi; done
      - Validate that eth1 is stuck in tentative/dadfailed (a consolidated check is sketched after this list):
      - - $ ip netns exec <NS> ip a
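      A minimal node-wide variant of the same check (a sketch that assumes bash on the worker and that the pod namespaces are visible via "ip netns"):

      #!/bin/bash
      # Flag any pod network namespace that has an IPv6 address stuck in dadfailed state.
      for ns in $(ip netns list | awk '{print $1}'); do
          if ip netns exec "$ns" ip -6 addr show 2>/dev/null | grep -q 'dadfailed'; then
              echo "netns $ns has a dadfailed IPv6 address:"
              ip netns exec "$ns" ip -6 addr show | grep 'dadfailed'
          fi
      done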

      Actual results:

       6: eth1@if24: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 9000 qdisc noqueue state UNKNOWN
           link/ether 88:e9:a4:71:62:5c brd ff:ff:ff:ff:ff:ff
           inet6 IPv6_ADDRESS/64 scope global tentative dadfailed   <--- FAILED
              valid_lft forever preferred_lft forever
           inet6 fe80::88e9:a400:371:625c/64 scope link
              valid_lft forever preferred_lft forever
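       Once the stale netns on the old worker is finally torn down, DAD can be re-triggered by hand to confirm the address becomes usable again (a manual recovery check only, not a fix; <NS> and IPv6_ADDRESS are the placeholders used above):

       $ ip netns exec <NS> ip -6 addr del IPv6_ADDRESS/64 dev eth1
       $ ip netns exec <NS> ip -6 addr add IPv6_ADDRESS/64 dev eth1
       $ ip netns exec <NS> ip -6 addr show dev eth1   # the address should now come up without tentative/dadfailed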

      Expected results:

       No IPv6 DAD failure.

      Additional info:

          Note: This was not seen in the impacted cluster until it was upgraded to 4.14, so this might be a regression or a new bug.

       

            Assignee: Ben Pickard (bpickard@redhat.com)
            Reporter: Cory Oldford (rhn-support-coldford)
            QA Contact: Weibin Liang