Bug
Resolution: Done-Errata
Major
None
4.16
None
This is a clone of issue OCPBUGS-36507. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-34934. The following is the description of the original issue:
—
Description of problem:
sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource.
Version-Release number of selected component (if applicable):
4.16.0-rc.3 sriov-network-operator.v4.16.0-202405301906
How reproducible:
100%
Steps to Reproduce:
1. On an SNO with the DU profile, create the following SriovNetworkNodePolicy (SNNP) resources:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: snnp1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  isRdma: false
  nicSelector:
    pfNames:
    - ens2f3#32-33
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numVfs: 48
  resourceName: snnp1
#########################################
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: snnp2
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  isRdma: false
  nicSelector:
    pfNames:
    - ens2f3#34-35
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numVfs: 48
  resourceName: snnp2
2. Wait for the resources to appear in the node's allocatable resources:
oc get nodes -o json | jq -r '.items[0].status.allocatable'
{
  "cpu": "60",
  "ephemeral-storage": "1725943497941",
  "hugepages-1Gi": "32Gi",
  "hugepages-2Mi": "0",
  "intel.com/intel_fec_acc100": "16",
  "management.workload.openshift.io/cores": "64k",
  "memory": "96370036Ki",
  "openshift.io/du_fh": "16",
  "openshift.io/du_mh": "16",
  "openshift.io/pci_sriov_net_f1": "2",
  "openshift.io/snnp1": "2",
  "openshift.io/snnp2": "2",
  "pods": "250"
}
3. Delete the snnp2 resource:
oc -n openshift-sriov-network-operator delete sriovnetworknodepolicy snnp2
4. Check the pods in the openshift-sriov-network-operator namespace:
oc -n openshift-sriov-network-operator get pods
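The pfNames entries in step 1 use the <pf>#<first>-<last> form to select a range of VFs on one PF, which is why each policy shows up with 2 allocatable devices in step 2. A minimal Python sketch of that arithmetic follows, assuming the range is inclusive and VF indices run from 0 to numVfs-1. The helper names are hypothetical and are not part of the operator.

```python
import re
from typing import Tuple

# Hypothetical helper: split "ens2f3#32-33" into (pf_name, first_vf, last_vf).
def parse_pf_selector(selector: str) -> Tuple[str, int, int]:
    match = re.fullmatch(r"([^#]+)#(\d+)-(\d+)", selector)
    if match is None:
        raise ValueError(f"not a ranged pfNames selector: {selector!r}")
    pf_name, first, last = match.group(1), int(match.group(2)), int(match.group(3))
    if first > last:
        raise ValueError(f"empty VF range in {selector!r}")
    return pf_name, first, last

# Number of VFs a policy should expose, assuming an inclusive range
# that has to fit inside numVfs (VF indices 0 .. numVfs-1).
def expected_vf_count(selector: str, num_vfs: int) -> int:
    _, first, last = parse_pf_selector(selector)
    if last >= num_vfs:
        raise ValueError(f"VF {last} is outside numVfs={num_vfs}")
    return last - first + 1

# Values copied from the two policies in step 1.
policies = {
    "snnp1": ("ens2f3#32-33", 48),
    "snnp2": ("ens2f3#34-35", 48),
}

# The "openshift.io/" prefix matches the allocatable output in step 2.
expected = {
    f"openshift.io/{name}": str(expected_vf_count(selector, num_vfs))
    for name, (selector, num_vfs) in policies.items()
}
print(expected)
```

Comparing this dictionary with the allocatable output of step 2 is one way to decide that the policies have been fully applied before moving on to step 3.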
Actual results:
The sriov-device-plugin pod is restarted continuously:
oc -n openshift-sriov-network-operator get pods
NAME                                      READY   STATUS              RESTARTS   AGE
sriov-device-plugin-2ntll                 0/1     Terminating         0          4s
sriov-device-plugin-4kw94                 0/1     ContainerCreating   0          0s
sriov-network-config-daemon-59k4c         1/1     Running             0          53m
sriov-network-operator-58c996d746-nwktl   1/1     Running             0          57m
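In the listing above the RESTARTS column stays at 0 while two differently named sriov-device-plugin pods are only seconds old, which suggests the pod is being recreated rather than its container restarting. A small Python sketch that flags such pods from the text output is shown here. The function names and the 60-second threshold are illustrative assumptions, and the parser only handles the simple column layout seen in this report.

```python
# Pod listing copied from the report (whitespace-separated columns).
POD_LISTING = """\
NAME READY STATUS RESTARTS AGE
sriov-device-plugin-2ntll 0/1 Terminating 0 4s
sriov-device-plugin-4kw94 0/1 ContainerCreating 0 0s
sriov-network-config-daemon-59k4c 1/1 Running 0 53m
sriov-network-operator-58c996d746-nwktl 1/1 Running 0 57m
"""

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def age_seconds(age: str) -> int:
    # Handles single-unit ages such as "4s" or "53m"; compound ages
    # like "2d3h" are not covered by this sketch.
    return int(age[:-1]) * UNIT_SECONDS[age[-1]]

def churning_pods(listing: str, max_age: int = 60) -> list:
    # Flag pods that are not Running and are younger than max_age seconds.
    # Assumes exactly five columns per row, as in the listing above.
    flagged = []
    for line in listing.splitlines()[1:]:
        name, _ready, status, _restarts, age = line.split()
        if status != "Running" and age_seconds(age) < max_age:
            flagged.append(name)
    return flagged

print(churning_pods(POD_LISTING))
```

Running the check repeatedly and seeing new pod names each time would confirm the loop, since a single snapshot only shows one replacement in progress.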
This also affects the other SR-IOV resources, which now all report 0 allocatable:
oc get nodes -o json | jq -r '.items[0].status.allocatable'
{
  "cpu": "60",
  "ephemeral-storage": "1725943497941",
  "hugepages-1Gi": "32Gi",
  "hugepages-2Mi": "0",
  "intel.com/intel_fec_acc100": "16",
  "management.workload.openshift.io/cores": "64k",
  "memory": "96370036Ki",
  "openshift.io/du_fh": "0",
  "openshift.io/du_mh": "0",
  "openshift.io/pci_sriov_net_f1": "0",
  "openshift.io/snnp1": "0",
  "openshift.io/snnp2": "0",
  "pods": "250"
}
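The regression is easiest to see by comparing the allocatable map from step 2 with the one captured after the deletion. The Python sketch below does that comparison on the two outputs from this report. The function name is hypothetical, and quantities are compared as strings because values such as "32Gi" or "64k" are not plain integers.

```python
# Allocatable values copied from the report, before and after deleting snnp2.
BEFORE = {
    "cpu": "60",
    "ephemeral-storage": "1725943497941",
    "hugepages-1Gi": "32Gi",
    "hugepages-2Mi": "0",
    "intel.com/intel_fec_acc100": "16",
    "management.workload.openshift.io/cores": "64k",
    "memory": "96370036Ki",
    "openshift.io/du_fh": "16",
    "openshift.io/du_mh": "16",
    "openshift.io/pci_sriov_net_f1": "2",
    "openshift.io/snnp1": "2",
    "openshift.io/snnp2": "2",
    "pods": "250",
}
# Only the openshift.io/* entries changed; everything else is identical.
AFTER = dict(BEFORE)
AFTER.update({
    "openshift.io/du_fh": "0",
    "openshift.io/du_mh": "0",
    "openshift.io/pci_sriov_net_f1": "0",
    "openshift.io/snnp1": "0",
    "openshift.io/snnp2": "0",
})

def zeroed_resources(before: dict, after: dict) -> list:
    # Resources that were non-zero before and report "0" afterwards.
    return sorted(
        name
        for name, old in before.items()
        if old != "0" and after.get(name) == "0"
    )

for name in zeroed_resources(BEFORE, AFTER):
    print(f"{name}: {BEFORE[name]} -> {AFTER[name]}")
```

With live data, the two dictionaries could instead be loaded with json.load from saved outputs of the oc and jq command shown above.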
Expected results:
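After snnp2 is deleted, the node's allocatable resources are updated correctly: the remaining SR-IOV resources keep their counts and the sriov-device-plugin pod stays in Running.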
Additional info:
Attaching must-gather.
- blocks: OCPBUGS-36756 sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource (Closed)
- clones: OCPBUGS-36507 sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource (Closed)
- is blocked by: OCPBUGS-36507 sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource (Closed)
- is cloned by: OCPBUGS-36756 sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource (Closed)
- links to: RHBA-2024:4473 OpenShift Container Platform 4.15.z extras update