Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: 4.16
Component/s: Networking / SR-IOV
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
Critical
Regression:
No

Target Backport Versions:

4.14.z, 4.15.z, 4.16.z
Target Version:

4.17.0
Release Blocker:
None
Sprint:
None

RH Private Keywords:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
PX Priority Data:
PX Impact Score:
PX Technical Impact:
PX Impact Range:

Release Note Status:
In Progress
Release Note Type:
Bug Fix
Release Note Text:

Previously, if you deleted a `SriovNetworkNodePolicy` resource for a virtual function with a `vfio-pci` driver type, the SR-IOV Network Operator was unable to reconcile the policy. As a consequence, the `sriov-device-plugin` pod entered a continuous restart loop. Now, the SR-IOV Network Operator avoids reconfiguring the node in such circumstances.

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

The sriov-device-plugin pod enters a restart loop after a SriovNetworkNodePolicy resource is deleted.

Version-Release number of selected component (if applicable):

4.16.0-rc.3
sriov-network-operator.v4.16.0-202405301906

How reproducible:

100%

Steps to Reproduce:

1. On an SNO cluster with the DU profile, create the following SriovNetworkNodePolicy (SNNP) resources:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: snnp1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  isRdma: false
  nicSelector:
    pfNames:
    - ens2f3#32-33
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numVfs: 48
  resourceName: snnp1
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: snnp2
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  isRdma: false
  nicSelector:
    pfNames:
    - ens2f3#34-35
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numVfs: 48
  resourceName: snnp2


2. Wait for the resources to appear in the node's allocatable resources:

oc get nodes -o json | jq -r '.items[0].status.allocatable'
{
  "cpu": "60",
  "ephemeral-storage": "1725943497941",
  "hugepages-1Gi": "32Gi",
  "hugepages-2Mi": "0",
  "intel.com/intel_fec_acc100": "16",
  "management.workload.openshift.io/cores": "64k",
  "memory": "96370036Ki",
  "openshift.io/du_fh": "16",
  "openshift.io/du_mh": "16",
  "openshift.io/pci_sriov_net_f1": "2",
  "openshift.io/snnp1": "2",
  "openshift.io/snnp2": "2",
  "pods": "250"
}

3. Delete the snnp2 resource:

oc -n openshift-sriov-network-operator delete sriovnetworknodepolicy snnp2

4. Check the openshift-sriov-network-operator pods:

oc -n openshift-sriov-network-operator get pods
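
The wait in step 2 can be scripted instead of checked by eye. The following is a minimal sketch, not part of the original report: the function name `pending_resources` is hypothetical, and the expected counts are taken from the step 2 output above. It assumes the allocatable map has already been loaded from the output of `oc get nodes -o json` (for example with `json.load`).

```python
import json


def pending_resources(allocatable, expected):
    """Return {resource: current_count} for every expected resource whose
    allocatable count does not yet match the expected value.

    `allocatable` is the node's .status.allocatable map (values are strings).
    `expected` maps resource names to the integer counts we are waiting for.
    An empty result means all expected resources are advertised.
    """
    pending = {}
    for name, want in expected.items():
        # A resource that is not advertised yet is treated as a count of 0.
        have = int(allocatable.get(name, "0"))
        if have != want:
            pending[name] = have
    return pending


# Hypothetical sample: a trimmed copy of the allocatable map from step 2.
sample = json.loads(
    '{"openshift.io/snnp1": "2", "openshift.io/snnp2": "2", "pods": "250"}'
)
expected = {"openshift.io/snnp1": 2, "openshift.io/snnp2": 2}
print(pending_resources(sample, expected))  # prints {}
```

A caller could poll until the returned dictionary is empty before moving on to step 3.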

Actual results:

The sriov-device-plugin pod is restarted continuously:

oc -n openshift-sriov-network-operator get pods
NAME                                      READY   STATUS              RESTARTS   AGE
sriov-device-plugin-2ntll                 0/1     Terminating         0          4s
sriov-device-plugin-4kw94                 0/1     ContainerCreating   0          0s
sriov-network-config-daemon-59k4c         1/1     Running             0          53m
sriov-network-operator-58c996d746-nwktl   1/1     Running             0          57m
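
To spot this state without watching the listing manually, the tabular output can be filtered for device plugin pods that are not `Running`. This is a rough sketch under stated assumptions: the helper names `parse_pod_table` and `unsettled_pods` are hypothetical, and the parsing assumes the default column layout shown above with single-token values. Parsing `oc get pods -o json` would be more robust than parsing the human-readable table.

```python
def parse_pod_table(text):
    """Parse `oc get pods` tabular output into a list of dicts keyed by
    the column headers (NAME, READY, STATUS, RESTARTS, AGE)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def unsettled_pods(text, prefix="sriov-device-plugin-"):
    """Return (name, status) pairs for pods whose name starts with `prefix`
    and whose STATUS column is anything other than Running."""
    return [
        (pod["NAME"], pod["STATUS"])
        for pod in parse_pod_table(text)
        if pod["NAME"].startswith(prefix) and pod["STATUS"] != "Running"
    ]
```

If repeated samples keep returning new pod names in `Terminating` or `ContainerCreating`, the pod is being recreated in a loop, which matches the listing above where both device plugin pods are only seconds old.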

This also affects the other SR-IOV resources, which now report 0 allocatable:

oc get nodes -o json | jq -r '.items[0].status.allocatable'
{
  "cpu": "60",
  "ephemeral-storage": "1725943497941",
  "hugepages-1Gi": "32Gi",
  "hugepages-2Mi": "0",
  "intel.com/intel_fec_acc100": "16",
  "management.workload.openshift.io/cores": "64k",
  "memory": "96370036Ki",
  "openshift.io/du_fh": "0",
  "openshift.io/du_mh": "0",
  "openshift.io/pci_sriov_net_f1": "0",
  "openshift.io/snnp1": "0",
  "openshift.io/snnp2": "0",
  "pods": "250"
}
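
The regression is easiest to see by comparing the allocatable map before and after the deletion. The sketch below is illustrative only: `zeroed_resources` and its parameters are hypothetical names, and it assumes resources under the `openshift.io/` prefix carry plain integer counts, as they do in the outputs in this report.

```python
def zeroed_resources(before, after, prefix="openshift.io/", ignore=()):
    """Return the sorted names of resources under `prefix` that had a
    nonzero allocatable count in `before` and are zero or missing in `after`.

    Names listed in `ignore` are skipped, for example the resource of the
    policy that was deleted on purpose.
    """
    dropped = []
    for name, count in before.items():
        if not name.startswith(prefix) or name in ignore:
            continue
        if int(count) > 0 and int(after.get(name, "0")) == 0:
            dropped.append(name)
    return sorted(dropped)
```

With the two outputs from this report and `ignore=("openshift.io/snnp2",)`, the function would return the four resources that should have kept their counts: `du_fh`, `du_mh`, `pci_sriov_net_f1`, and `snnp1`.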

Expected results:

Resources get updated correctly after deletion.

Additional info:

Attaching must-gather.

clones

OCPBUGS-34934 sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource

Closed

Assignee: Andrea Panattoni

Reporter: Marius Cornea

Need Info From: None

Contributors: None

QA Contact: Marius Cornea

Doc Contact: None

Votes: 0

Watchers: 2

Created: 2024/09/18 3:19 PM

Updated: 2025/07/21 5:22 PM

Resolved: 2024/09/18 3:29 PM
