Bug
Resolution: Done-Errata
Major
None
4.16
None
Description of problem:
sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource.
Version-Release number of selected component (if applicable):
4.16.0-rc.3 sriov-network-operator.v4.16.0-202405301906
How reproducible:
100%
Steps to Reproduce:
1. On an SNO with DU profile create the following SNNP resources:

   apiVersion: sriovnetwork.openshift.io/v1
   kind: SriovNetworkNodePolicy
   metadata:
     name: snnp1
     namespace: openshift-sriov-network-operator
   spec:
     deviceType: vfio-pci
     isRdma: false
     nicSelector:
       pfNames:
       - ens2f3#32-33
     nodeSelector:
       node-role.kubernetes.io/master: ""
     numVfs: 48
     resourceName: snnp1
   #########################################
   apiVersion: sriovnetwork.openshift.io/v1
   kind: SriovNetworkNodePolicy
   metadata:
     name: snnp2
     namespace: openshift-sriov-network-operator
   spec:
     deviceType: vfio-pci
     isRdma: false
     nicSelector:
       pfNames:
       - ens2f3#34-35
     nodeSelector:
       node-role.kubernetes.io/master: ""
     numVfs: 48
     resourceName: snnp2

2. Wait for the resources to show up in the node resources:

   oc get nodes -o json | jq -r .items[0].status.allocatable
   {
     "cpu": "60",
     "ephemeral-storage": "1725943497941",
     "hugepages-1Gi": "32Gi",
     "hugepages-2Mi": "0",
     "intel.com/intel_fec_acc100": "16",
     "management.workload.openshift.io/cores": "64k",
     "memory": "96370036Ki",
     "openshift.io/du_fh": "16",
     "openshift.io/du_mh": "16",
     "openshift.io/pci_sriov_net_f1": "2",
     "openshift.io/snnp1": "2",
     "openshift.io/snnp2": "2",
     "pods": "250"
   }

3. Delete snnp2 resource:

   oc -n openshift-sriov-network-operator delete sriovnetworknodepolicy snnp2

4. Check openshift-sriov-network-operator pods:

   oc -n openshift-sriov-network-operator get pods
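The two policies carve VFs out of the same PF using the `<pfName>#<first>-<last>` selector form. The following is a minimal Python sketch, not operator code, that parses the two selectors from step 1 and checks that the ranges fit within numVfs and do not overlap. The names `VfSelector`, `parse_pf_selector` and `overlaps` are hypothetical helpers introduced for illustration, and only the ranged form of the selector is handled.

```python
import re
from typing import NamedTuple


class VfSelector(NamedTuple):
    """A parsed pfNames entry of the form '<pfName>#<first>-<last>'."""
    pf_name: str
    first_vf: int
    last_vf: int

    @property
    def count(self) -> int:
        # The VF range is inclusive on both ends.
        return self.last_vf - self.first_vf + 1


_SELECTOR = re.compile(r"^(?P<pf>[^#]+)#(?P<first>\d+)-(?P<last>\d+)$")


def parse_pf_selector(entry: str) -> VfSelector:
    """Parse a ranged pfNames entry; plain PF names are out of scope here."""
    match = _SELECTOR.match(entry)
    if match is None:
        raise ValueError(f"not a ranged pfNames entry: {entry!r}")
    selector = VfSelector(match["pf"], int(match["first"]), int(match["last"]))
    if selector.first_vf > selector.last_vf:
        raise ValueError(f"empty VF range in {entry!r}")
    return selector


def overlaps(a: VfSelector, b: VfSelector) -> bool:
    """True when both selectors claim at least one VF on the same PF."""
    return (
        a.pf_name == b.pf_name
        and a.first_vf <= b.last_vf
        and b.first_vf <= a.last_vf
    )


# Values copied from the two policies in step 1.
NUM_VFS = 48
snnp1 = parse_pf_selector("ens2f3#32-33")
snnp2 = parse_pf_selector("ens2f3#34-35")

for name, selector in (("snnp1", snnp1), ("snnp2", snnp2)):
    in_bounds = selector.last_vf < NUM_VFS  # VF indices start at 0
    print(f"{name}: {selector.count} VFs on {selector.pf_name}, in bounds: {in_bounds}")

print("ranges overlap:", overlaps(snnp1, snnp2))
```

Each selector covers two VFs, which matches the value of "2" reported for openshift.io/snnp1 and openshift.io/snnp2 in step 2, and the ranges are disjoint, so the policies themselves do not conflict.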
Actual results:
The sriov-device-plugin pod gets restarted continuously:

    oc -n openshift-sriov-network-operator get pods
    NAME                                      READY   STATUS              RESTARTS   AGE
    sriov-device-plugin-2ntll                 0/1     Terminating         0          4s
    sriov-device-plugin-4kw94                 0/1     ContainerCreating   0          0s
    sriov-network-config-daemon-59k4c         1/1     Running             0          53m
    sriov-network-operator-58c996d746-nwktl   1/1     Running             0          57m

This also affects the other SR-IOV resources, which now report 0 allocatable:

    oc get nodes -o json | jq -r .items[0].status.allocatable
    {
      "cpu": "60",
      "ephemeral-storage": "1725943497941",
      "hugepages-1Gi": "32Gi",
      "hugepages-2Mi": "0",
      "intel.com/intel_fec_acc100": "16",
      "management.workload.openshift.io/cores": "64k",
      "memory": "96370036Ki",
      "openshift.io/du_fh": "0",
      "openshift.io/du_mh": "0",
      "openshift.io/pci_sriov_net_f1": "0",
      "openshift.io/snnp1": "0",
      "openshift.io/snnp2": "0",
      "pods": "250"
    }
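To make the regression easy to spot, the two allocatable snapshots can be compared programmatically. The sketch below is a hypothetical helper, not part of the operator or of must-gather. It assumes the SR-IOV resources are the keys starting with `openshift.io/` and treats quantities as plain strings, which is a simplification of Kubernetes quantity parsing. The values are copied from step 2 and from the output above.

```python
# Snapshot from step 2 (before deleting snnp2); only the relevant keys are kept.
BEFORE = {
    "intel.com/intel_fec_acc100": "16",
    "openshift.io/du_fh": "16",
    "openshift.io/du_mh": "16",
    "openshift.io/pci_sriov_net_f1": "2",
    "openshift.io/snnp1": "2",
    "openshift.io/snnp2": "2",
}

# Snapshot from the actual results (after deleting snnp2).
AFTER = {
    "intel.com/intel_fec_acc100": "16",
    "openshift.io/du_fh": "0",
    "openshift.io/du_mh": "0",
    "openshift.io/pci_sriov_net_f1": "0",
    "openshift.io/snnp1": "0",
    "openshift.io/snnp2": "0",
}


def zeroed_resources(before, after, prefix="openshift.io/"):
    """Return resources under `prefix` that were non-zero before and are "0" after."""
    return sorted(
        name
        for name, quantity in after.items()
        if name.startswith(prefix)
        and quantity == "0"
        and before.get(name, "0") != "0"
    )


# The resource of the deleted policy is expected to lose its VFs;
# anything else that dropped to zero is collateral damage.
DELETED = "openshift.io/snnp2"
collateral = [name for name in zeroed_resources(BEFORE, AFTER) if name != DELETED]

for name in collateral:
    print(f"unexpectedly zeroed: {name} ({BEFORE[name]} -> {AFTER[name]})")
```

With the data from this report, four resources besides openshift.io/snnp2 are flagged, while intel.com/intel_fec_acc100 is untouched because it does not come from the SR-IOV device plugin prefix assumed here.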
Expected results:
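After snnp2 is deleted, the sriov-device-plugin pod returns to a stable Running state and the node's allocatable resources are updated correctly: the remaining SR-IOV resources keep their previous counts.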
Additional info:
Attaching must-gather.
- blocks: OCPBUGS-36507 sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource (Closed)
- is cloned by: OCPBUGS-36507 sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource (Closed)
- is cloned by: OCPBUGS-42158 sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource (Closed)
- links to: RHBA-2024:7598 OpenShift Container Platform 4.16.z extras update
Since the problem described in this issue should be resolved in a recent advisory, it has been closed.
For information on the advisory (OpenShift Container Platform 4.16.16 security and extras update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2024:7598