OCPBUGS-38647: The microshift-cleanup-data script does not delete all pods when multus RPM is installed

    • Type: Bug
    • Resolution: Done
    • Priority: Normal
    • Affects Versions: 4.17.0, 4.16.z, 4.18.0
    • Component: MicroShift
    • Sprints: uShift Sprint 259, uShift Sprint 260, uShift Sprint 261, uShift Sprint 262, uShift Sprint 264, uShift Sprint 265, uShift Sprint 266, uShift Sprint 267, uShift Sprint 268

      Description of problem:

      When the microshift-multus RPM package is installed, the `microshift-cleanup-data --all` command does not delete all pods.

      Version-Release number of selected component (if applicable):

      4.16+, since that is when microshift-multus was first released

      How reproducible:

      100%

      Steps to Reproduce:

          1. sudo dnf install -y microshift-multus
          2. sudo systemctl restart microshift
          3. sudo microshift-cleanup-data --all    

      Actual results:

      The `sudo crictl pods` command still returns pods after the cleanup.

      Expected results:

      The `sudo crictl pods` command should return an empty list.
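
      A quick way to verify the expected state (a sketch, assuming `crictl` and standard coreutils are available):

          # Prints only pod sandbox IDs; a clean state yields 0.
          sudo crictl pods -q | wc -l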


            Patryk Matuszak added a comment - Verified in https://issues.redhat.com/browse/OCPBUGS-45037

            Patryk Matuszak added a comment -

            https://github.com/openshift/multus-cni/pull/258 has merged and is included in 4.19. A retest on the latest main was successful.

            We can expect fixes in the earlier branches once the following backports merge:

            https://github.com/openshift/multus-cni/pull/259
            https://github.com/openshift/multus-cni/pull/260
            https://github.com/openshift/multus-cni/pull/261

            > sudo microshift-cleanup-data --all
            DATA LOSS WARNING: Do you wish to stop and clean ALL MicroShift data AND cri-o container workloads?
            1) Yes
            2) No
            #? 1
            Stopping MicroShift services
            Disabling MicroShift services
            Removing MicroShift pods
            Removing crio image storage
            Deleting the br-int interface
            Killing conmon, pause and OVN processes
            Removing MicroShift configuration
            Removing OVN configuration
            MicroShift service was stopped
            MicroShift service was disabled
            Cleanup succeeded
            
            > sudo crictl pods
            POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME 


            Patryk Matuszak added a comment -

            Retested with the latest 4.16 and 4.17 - the problem still persists, probably because the behavior was fixed for the thick plugin (there is a backing daemon that handles all the calls in one place) but not for the thin plugin (the one MicroShift uses, where the binary on disk does the job instead of calling a daemon).

            [ec2-user@i-0e9bba74b0a96c359 ~]$ sudo microshift-cleanup-data --all
            DATA LOSS WARNING: Do you wish to stop and clean ALL MicroShift data AND cri-o container workloads?
            1) Yes
            2) No
            #? 1
            Stopping MicroShift services
            Disabling MicroShift services
            Removing MicroShift pods
            Removing crio image storage
            Deleting the br-int interface
            Killing conmon, pause and OVN processes
            Removing MicroShift configuration
            Removing OVN configuration
            MicroShift service was stopped
            MicroShift service was disabled
            Cleanup succeeded
            [ec2-user@i-0e9bba74b0a96c359 ~]$ sudo crictl pods
            POD ID              CREATED             STATE               NAME                                      NAMESPACE              ATTEMPT             RUNTIME
            338a64dfda618       2 minutes ago       Ready               dns-default-qrmxg                         openshift-dns          0                   (default)
            60d30b6aa9ff0       2 minutes ago       Ready               csi-snapshot-webhook-84d79c8cbd-s48pz     kube-system            0                   (default)
            4d37d50188470       2 minutes ago       Ready               service-ca-85cfc5b679-6szgp               openshift-service-ca   0                   (default)
            2fea66ad6307a       2 minutes ago       Ready               csi-snapshot-controller-69fdbd47b-46tjf   kube-system            0                   (default)
            270ec2f9e56ee       2 minutes ago       Ready               router-default-6864cd4d78-6s67x           openshift-ingress      0                   (default)


            Patryk Matuszak added a comment -

            Thanks pliurh and dosmith. One small caveat though: MicroShift uses the thin plugin (no shim or daemon), so we need this ported from thick to thin. (MicroShift started shipping multus in 4.16, so we don't need to backport any further than that.)


            Douglas Smith added a comment -

            I think this is a bug in Multus 4.x. I fixed it upstream in https://github.com/k8snetworkplumbingwg/multus-cni/pull/1279 - I'll follow up by finding out which versions it's backported to. It should be backported to all supported releases that have Multus 4.x.


            Peng Liu added a comment -

            dosmith, I remember that Multus uses a cache during pod deletion. Do you think we could bypass accessing the k8s apiserver here when it is not available?


            Patryk Matuszak added a comment -

            My findings:

            • The only Pods surviving the cleanup are the ones without `hostNetwork`, i.e. the ones relying on the CNI (see the sketch after this list)
            • When cleanup is executed, the script first stops the microshift service, which takes down the API server
            • Whenever Multus does anything, it calls the API server to get the Pod's annotations (the list of delegate networks) and the NADs (to know how to call the delegates)
            • If the API server is not accessible, Multus is stuck waiting for the connection
            • Even without KAS/kubelet, deleting a Pod via crictl results in CRI-O calling the CNIs to clean up networking
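
            A quick way to confirm the first point (an illustrative sketch; it assumes `jq` is installed and that the CRI status layout matches current CRI-O, where "NODE" means hostNetwork and "POD" means a CNI-managed network namespace):

            # For each surviving sandbox, print its name and network namespace mode.
            for p in $(sudo crictl pods -q); do
              sudo crictl inspectp "$p" | jq -r '[.status.metadata.name, .status.linux.namespaces.options.network] | @tsv'
            done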

             

            Workarounds:

            • Delete Pods before stopping MicroShift
              • This would really be about deleting DaemonSets/Deployments/StatefulSets/Pods
              • Requires microshift to be running during cleanup - but what if it's already stopped?
            • Temporarily relocate /etc/crio/crio.conf.d/12-microshift-multus.conf so CRI-O doesn't call Multus (sketched below)
              • That file is installed by the microshift-multus RPM, so messing with it sounds volatile
              • Not calling Multus and all the delegates can result in resources (like interfaces or the CNI plugins' cache) not being cleaned up
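
            A rough, untested sketch of the second workaround (the file path comes from the bullet above; the caveats about leftover interfaces and CNI caches still apply):

            # Move the Multus CRI-O drop-in aside so CRI-O stops invoking Multus,
            # run the cleanup, then restore the file.
            sudo mv /etc/crio/crio.conf.d/12-microshift-multus.conf /tmp/12-microshift-multus.conf
            sudo systemctl restart crio
            sudo microshift-cleanup-data --all
            sudo mv /tmp/12-microshift-multus.conf /etc/crio/crio.conf.d/12-microshift-multus.conf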

             

            Technically, if Multus doesn't get a kubeconfig and is running out of cluster, it should fall back to some kind of cache. However, removing the kubeconfig from /etc/cni/net.d/00-multus.conf doesn't help, as CRI-O seems to provide a cached version of the file instead (CNIs get their config via stdin; they don't read directly from the file in /etc/cni/net.d, and restarting crio doesn't help).
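
            For reference, the attempt described above looked roughly like this (a sketch; it assumes the generated config carries the standard top-level Multus `kubeconfig` key - as noted, it did not help because a cached copy of the config was still served):

            # Strip the kubeconfig reference so the thin plugin cannot call the API server.
            sudo jq 'del(.kubeconfig)' /etc/cni/net.d/00-multus.conf > /tmp/00-multus.conf
            sudo mv /tmp/00-multus.conf /etc/cni/net.d/00-multus.conf
            sudo systemctl restart crio    # no effect; the cached config is still used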

             

            pliurh, zshi@redhat.com - do you have any ideas?


              Patryk Matuszak (pmatusza@redhat.com)
              Gregory Giguashvili (ggiguash@redhat.com)
              John George
              Votes: 0
              Watchers: 5