OCPBUGS-38647: The microshift-cleanup-data script does not delete all pods when multus RPM is installed

    • Type: Bug
    • Resolution: Done
    • Priority: Normal
    • Affects Versions: 4.17.0, 4.16.z, 4.18.0
    • Component: MicroShift
    • Sprints: uShift Sprint 259, uShift Sprint 260, uShift Sprint 261, uShift Sprint 262, uShift Sprint 264, uShift Sprint 265, uShift Sprint 266, uShift Sprint 267, uShift Sprint 268

      Description of problem:

      When the microshift-multus RPM package is installed, the `microshift-cleanup-data --all` command does not delete all pods.

      Version-Release number of selected component (if applicable):

      4.16+, since that is when microshift-multus was first released

      How reproducible:

      100%

      Steps to Reproduce:

          1. sudo dnf install -y microshift-multus
          2. sudo systemctl restart microshift
          3. sudo microshift-cleanup-data --all    

      Actual results:

      The `sudo crictl pods` command still returns pods after the cleanup.

      Expected results:

      The `sudo crictl pods` command should return an empty list.
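
      A quick way to verify the expected state (a sketch, assuming `crictl` and standard coreutils are available):

          # Prints only pod sandbox IDs; a clean state yields 0.
          sudo crictl pods -q | wc -l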


            Patryk Matuszak added a comment - Verified in https://issues.redhat.com/browse/OCPBUGS-45037

            Patryk Matuszak added a comment -

            https://github.com/openshift/multus-cni/pull/258 has merged and is included in 4.19. A retest on the latest main was successful.

            We can expect fixes in the earlier branches once the following backports merge:

            https://github.com/openshift/multus-cni/pull/259
            https://github.com/openshift/multus-cni/pull/260
            https://github.com/openshift/multus-cni/pull/261

            > sudo microshift-cleanup-data --all
            DATA LOSS WARNING: Do you wish to stop and clean ALL MicroShift data AND cri-o container workloads?
            1) Yes
            2) No
            #? 1
            Stopping MicroShift services
            Disabling MicroShift services
            Removing MicroShift pods
            Removing crio image storage
            Deleting the br-int interface
            Killing conmon, pause and OVN processes
            Removing MicroShift configuration
            Removing OVN configuration
            MicroShift service was stopped
            MicroShift service was disabled
            Cleanup succeeded
            
            > sudo crictl pods
            POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME 


            Patryk Matuszak added a comment -

            Retested with the latest 4.16 and 4.17 - the problem still persists, probably because the behavior was fixed for the thick plugin (there is a backing daemon that handles all the calls in one place) but not for the thin plugin (the one MicroShift uses, where the binary on disk does the job instead of calling a daemon).

            [ec2-user@i-0e9bba74b0a96c359 ~]$ sudo microshift-cleanup-data --all
            DATA LOSS WARNING: Do you wish to stop and clean ALL MicroShift data AND cri-o container workloads?
            1) Yes
            2) No
            #? 1
            Stopping MicroShift services
            Disabling MicroShift services
            Removing MicroShift pods
            Removing crio image storage
            Deleting the br-int interface
            Killing conmon, pause and OVN processes
            Removing MicroShift configuration
            Removing OVN configuration
            MicroShift service was stopped
            MicroShift service was disabled
            Cleanup succeeded
            [ec2-user@i-0e9bba74b0a96c359 ~]$ sudo crictl pods
            POD ID              CREATED             STATE               NAME                                      NAMESPACE              ATTEMPT             RUNTIME
            338a64dfda618       2 minutes ago       Ready               dns-default-qrmxg                         openshift-dns          0                   (default)
            60d30b6aa9ff0       2 minutes ago       Ready               csi-snapshot-webhook-84d79c8cbd-s48pz     kube-system            0                   (default)
            4d37d50188470       2 minutes ago       Ready               service-ca-85cfc5b679-6szgp               openshift-service-ca   0                   (default)
            2fea66ad6307a       2 minutes ago       Ready               csi-snapshot-controller-69fdbd47b-46tjf   kube-system            0                   (default)
            270ec2f9e56ee       2 minutes ago       Ready               router-default-6864cd4d78-6s67x           openshift-ingress      0                   (default)


            Patryk Matuszak added a comment -

            Thanks pliurh and dosmith. One small caveat though: MicroShift uses the thin plugin (no shim or daemon), so we need this ported from thick to thin. (MicroShift started shipping multus in 4.16, so we don't need to backport any further than that.)


            Douglas Smith added a comment -

            I think this is a bug in Multus 4.x. I fixed it upstream in https://github.com/k8snetworkplumbingwg/multus-cni/pull/1279 - I'll follow up by finding out which versions it's backported to. It should be backported to all supported releases that have Multus 4.x.


            Peng Liu added a comment -

            dosmith, I remember that Multus uses a cache during pod deletion. Do you think we could bypass accessing the k8s apiserver here when it is not available?


            Patryk Matuszak added a comment -

            My findings:

            • The only Pods surviving the cleanup are the ones without `hostNetwork`, i.e. the ones relying on the CNI (see the sketch after this list)
            • When cleanup is executed, the script first stops the microshift service, which takes down the API server
            • Whenever Multus does anything, it calls the API server to get the Pod's annotations (the list of delegate networks) and the NADs (to know how to call the delegates)
            • If the API server is not accessible, Multus is stuck waiting for the connection
            • Even without KAS/kubelet, deleting a Pod via crictl results in CRI-O calling the CNIs to clean up networking
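
            A quick way to confirm the first point (an illustrative sketch; it assumes `jq` is installed and that the CRI status layout matches current CRI-O, where "NODE" means hostNetwork and "POD" means a CNI-managed network namespace):

            # For each surviving sandbox, print its name and network namespace mode.
            for p in $(sudo crictl pods -q); do
              sudo crictl inspectp "$p" | jq -r '[.status.metadata.name, .status.linux.namespaces.options.network] | @tsv'
            done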

             

            Workarounds:

            • Delete Pods before stopping MicroShift
              • This would really be about deleting DaemonSets/Deployments/StatefulSets/Pods
              • Requires microshift to be running during cleanup - but what if it's already stopped?
            • Temporarily relocate /etc/crio/crio.conf.d/12-microshift-multus.conf so CRI-O doesn't call Multus (sketched below)
              • That file is installed by the microshift-multus RPM, so messing with it sounds volatile
              • Not calling Multus and all the delegates can result in resources (like interfaces or the CNI plugins' cache) not being cleaned up
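
            A rough, untested sketch of the second workaround (the file path comes from the bullet above; the caveats about leftover interfaces and CNI caches still apply):

            # Move the Multus CRI-O drop-in aside so CRI-O stops invoking Multus,
            # run the cleanup, then restore the file.
            sudo mv /etc/crio/crio.conf.d/12-microshift-multus.conf /tmp/12-microshift-multus.conf
            sudo systemctl restart crio
            sudo microshift-cleanup-data --all
            sudo mv /tmp/12-microshift-multus.conf /etc/crio/crio.conf.d/12-microshift-multus.conf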

             

            Technically, if Multus doesn't get a kubeconfig and is running out of cluster, it should fall back to some kind of cache. However, removing the kubeconfig from /etc/cni/net.d/00-multus.conf doesn't help, as CRI-O seems to provide a cached version of the file instead (CNIs get their config via stdin; they don't read directly from the file in /etc/cni/net.d, and restarting crio doesn't help).
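
            For reference, the attempt described above looked roughly like this (a sketch; it assumes the generated config carries the standard top-level Multus `kubeconfig` key - as noted, it did not help because a cached copy of the config was still served):

            # Strip the kubeconfig reference so the thin plugin cannot call the API server.
            sudo jq 'del(.kubeconfig)' /etc/cni/net.d/00-multus.conf > /tmp/00-multus.conf
            sudo mv /tmp/00-multus.conf /etc/cni/net.d/00-multus.conf
            sudo systemctl restart crio    # no effect; the cached config is still used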

             

            pliurh, zshi@redhat.com - do you have any ideas?


              Patryk Matuszak (pmatusza@redhat.com)
              Gregory Giguashvili (ggiguash@redhat.com)
              John George
              Votes: 0
              Watchers: 5