-
Bug
-
Resolution: Done-Errata
-
Major
-
None
-
4.16
-
None
Description of problem:
sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource.
Version-Release number of selected component (if applicable):
4.16.0-rc.3 sriov-network-operator.v4.16.0-202405301906
How reproducible:
100%
Steps to Reproduce:
1. On an SNO with DU profile create the following SNNP resources:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: snnp1
      namespace: openshift-sriov-network-operator
    spec:
      deviceType: vfio-pci
      isRdma: false
      nicSelector:
        pfNames:
        - ens2f3#32-33
      nodeSelector:
        node-role.kubernetes.io/master: ""
      numVfs: 48
      resourceName: snnp1
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: snnp2
      namespace: openshift-sriov-network-operator
    spec:
      deviceType: vfio-pci
      isRdma: false
      nicSelector:
        pfNames:
        - ens2f3#34-35
      nodeSelector:
        node-role.kubernetes.io/master: ""
      numVfs: 48
      resourceName: snnp2

2. Wait for the resources to show up in the node resources:

    oc get nodes -o json | jq -r .items[0].status.allocatable
    {
      "cpu": "60",
      "ephemeral-storage": "1725943497941",
      "hugepages-1Gi": "32Gi",
      "hugepages-2Mi": "0",
      "intel.com/intel_fec_acc100": "16",
      "management.workload.openshift.io/cores": "64k",
      "memory": "96370036Ki",
      "openshift.io/du_fh": "16",
      "openshift.io/du_mh": "16",
      "openshift.io/pci_sriov_net_f1": "2",
      "openshift.io/snnp1": "2",
      "openshift.io/snnp2": "2",
      "pods": "250"
    }

3. Delete snnp2 resource:

    oc -n openshift-sriov-network-operator delete sriovnetworknodepolicy snnp2

4. Check openshift-sriov-network-operator pods:

    oc -n openshift-sriov-network-operator get pods
Actual results:
sriov-device-plugin pod gets restarted continuously:

    oc -n openshift-sriov-network-operator get pods
    NAME                                      READY   STATUS              RESTARTS   AGE
    sriov-device-plugin-2ntll                 0/1     Terminating         0          4s
    sriov-device-plugin-4kw94                 0/1     ContainerCreating   0          0s
    sriov-network-config-daemon-59k4c         1/1     Running             0          53m
    sriov-network-operator-58c996d746-nwktl   1/1     Running             0          57m

This also impacts the other sriov resources reporting 0 allocatable:

    oc get nodes -o json | jq -r .items[0].status.allocatable
    {
      "cpu": "60",
      "ephemeral-storage": "1725943497941",
      "hugepages-1Gi": "32Gi",
      "hugepages-2Mi": "0",
      "intel.com/intel_fec_acc100": "16",
      "management.workload.openshift.io/cores": "64k",
      "memory": "96370036Ki",
      "openshift.io/du_fh": "0",
      "openshift.io/du_mh": "0",
      "openshift.io/pci_sriov_net_f1": "0",
      "openshift.io/snnp1": "0",
      "openshift.io/snnp2": "0",
      "pods": "250"
    }
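To make the regression easier to spot when re-running the reproduction, the two allocatable snapshots can be compared programmatically. The following is only a minimal sketch: the helper name `zeroed_resources` is hypothetical, it assumes the SR-IOV resources of interest carry the `openshift.io/` prefix as in the outputs above, and it treats a resource missing from the second snapshot as "0". The sample data is limited to the `openshift.io/` entries reported in this bug.

```python
import json

# Assumption: SR-IOV resources are identified by this prefix,
# as seen in the allocatable output of this report.
SRIOV_PREFIX = "openshift.io/"


def zeroed_resources(before, after, prefix=SRIOV_PREFIX):
    """Return the sorted names of prefixed resources that were
    non-zero in `before` and are "0" (or absent) in `after`."""
    dropped = []
    for name, old_value in before.items():
        if not name.startswith(prefix):
            continue
        # Assumption: a resource missing from `after` counts as "0".
        new_value = after.get(name, "0")
        if old_value != "0" and new_value == "0":
            dropped.append(name)
    return sorted(dropped)


# Values copied from step 2 (before deleting snnp2).
before = json.loads("""{
  "openshift.io/du_fh": "16",
  "openshift.io/du_mh": "16",
  "openshift.io/pci_sriov_net_f1": "2",
  "openshift.io/snnp1": "2",
  "openshift.io/snnp2": "2"
}""")

# Values copied from the actual results (after deleting snnp2).
after = json.loads("""{
  "openshift.io/du_fh": "0",
  "openshift.io/du_mh": "0",
  "openshift.io/pci_sriov_net_f1": "0",
  "openshift.io/snnp1": "0",
  "openshift.io/snnp2": "0"
}""")

for name in zeroed_resources(before, after):
    print(name)
```

With the data from this report, every `openshift.io/` resource is listed, not only `openshift.io/snnp2`, which is the symptom described here. For a live cluster, the JSON printed by the `oc get nodes -o json | jq -r .items[0].status.allocatable` command could be saved to a file before and after the deletion and loaded with `json.load` instead of the inline samples.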
Expected results:
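After snnp2 is deleted, the sriov-device-plugin pod settles in the Running state and the remaining SR-IOV resources (for example openshift.io/snnp1) keep their allocatable counts.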
Additional info:
Attaching must-gather.
blocks
- OCPBUGS-36507 sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource (Closed)
is cloned by
- OCPBUGS-36507 sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource (Closed)
- OCPBUGS-42158 sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource (Closed)
links to
- RHBA-2024:7598 OpenShift Container Platform 4.16.z extras update