OCPBUGS-34934

sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource

    • Bug
    • Resolution: Done-Errata
    • Major
    • None
    • 4.16
    • Networking / SR-IOV
    • None
    • Critical
    • No
    • CNF Network Sprint 254, CNF Network Sprint 255
    • 2
    • False
    • None
    • KNOWN ISSUE ALREADY DOCUMENTED IN THE 4.16 NOTES

      If you delete a `SriovNetworkNodePolicy` resource for a virtual function with a `vfio-pci` driver type, the SR-IOV Network Operator is unable to reconcile the policy. As a consequence, the `sriov-device-plugin` pod enters a continuous restart loop. As a workaround, delete all remaining policies affecting the physical function, then re-create them. (link:https://issues.redhat.com/browse/OCPBUGS-34934[*OCPBUGS-34934*])
    • Known Issue
    • Done
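The workaround in the release note above can be sketched as a small shell helper. This is a minimal sketch, not a procedure confirmed by the operator team: `recreate_policies` is a hypothetical helper name, it assumes the original manifests of the remaining policies on the affected physical function are still on disk, and it does not wait for the operator to reconcile between the delete phase and the re-create phase.

```shell
# Hypothetical helper: delete every listed policy first, then re-create them
# all, following the order described in the release note.
# Arguments: manifest files of the remaining policies on the affected PF.
recreate_policies() {
  # Phase 1: delete all remaining policies (missing ones are not an error).
  for manifest in "$@"; do
    oc delete -f "$manifest" --ignore-not-found || return 1
  done
  # Phase 2: re-create the policies from the same manifests.
  for manifest in "$@"; do
    oc apply -f "$manifest" || return 1
  done
}
```

For the reproducer in this report, where `snnp1` is the policy left on `ens2f3`, a call could look like `recreate_policies snnp1.yaml` (the file name is an assumption).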

      Description of problem:

      sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource.    

      Version-Release number of selected component (if applicable):

      4.16.0-rc.3
      sriov-network-operator.v4.16.0-202405301906

      How reproducible:

       100%

      Steps to Reproduce:

          1. On a single-node OpenShift (SNO) cluster with the DU profile, create the following SriovNetworkNodePolicy (SNNP) resources:
      
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: snnp1
        namespace: openshift-sriov-network-operator
      spec:
        deviceType: vfio-pci
        isRdma: false
        nicSelector:
          pfNames:
          - ens2f3#32-33
        nodeSelector:
          node-role.kubernetes.io/master: ""
        numVfs: 48
        resourceName: snnp1
      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: snnp2
        namespace: openshift-sriov-network-operator
      spec:
        deviceType: vfio-pci
        isRdma: false
        nicSelector:
          pfNames:
          - ens2f3#34-35
        nodeSelector:
          node-role.kubernetes.io/master: ""
        numVfs: 48
        resourceName: snnp2
      
      
        2. Wait for the resources to appear in the node's allocatable resources:
      
      oc get nodes -o json | jq -r '.items[0].status.allocatable'
      {
        "cpu": "60",
        "ephemeral-storage": "1725943497941",
        "hugepages-1Gi": "32Gi",
        "hugepages-2Mi": "0",
        "intel.com/intel_fec_acc100": "16",
        "management.workload.openshift.io/cores": "64k",
        "memory": "96370036Ki",
        "openshift.io/du_fh": "16",
        "openshift.io/du_mh": "16",
        "openshift.io/pci_sriov_net_f1": "2",
        "openshift.io/snnp1": "2",
        "openshift.io/snnp2": "2",
        "pods": "250"
      }
      
        3. Delete the snnp2 resource:
      
      oc -n openshift-sriov-network-operator delete sriovnetworknodepolicy snnp2
      
      4. Check the pods in the openshift-sriov-network-operator namespace:
      
       oc -n openshift-sriov-network-operator get pods     
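In step 1, each `pfNames` entry uses the `<pf>#<first>-<last>` form, which restricts the policy to a range of VF indexes on the physical function. That is why each policy exposes 2 devices in step 2 even though `numVfs` is 48. A minimal sketch of that arithmetic, using a hypothetical helper name `vf_range_count` and assuming the entry always carries an explicit range:

```shell
# Hypothetical helper: count the VFs selected by a "<pf>#<first>-<last>" entry.
vf_range_count() {
  range=${1#*\#}        # strip the PF name: "ens2f3#32-33" -> "32-33"
  first=${range%-*}     # text before the dash: "32"
  last=${range#*-}      # text after the dash: "33"
  echo $((last - first + 1))
}
```

With the policies from this report, `vf_range_count 'ens2f3#32-33'` and `vf_range_count 'ens2f3#34-35'` each print `2`, which matches the `openshift.io/snnp1` and `openshift.io/snnp2` counts in step 2.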

      Actual results:

          The sriov-device-plugin pod is restarted continuously:
      
      oc -n openshift-sriov-network-operator get pods
      NAME                                      READY   STATUS              RESTARTS   AGE
      sriov-device-plugin-2ntll                 0/1     Terminating         0          4s
      sriov-device-plugin-4kw94                 0/1     ContainerCreating   0          0s
      sriov-network-config-daemon-59k4c         1/1     Running             0          53m
      sriov-network-operator-58c996d746-nwktl   1/1     Running             0          57m
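One way to spot the loop from a listing like the one above is to filter for `sriov-device-plugin` pods that are not in the `Running` state. A rough sketch, assuming the default `oc get pods` table layout (name in column 1, status in column 3); `unsettled_device_plugin_pods` is a hypothetical helper name. A single snapshot can also catch a normal rollout, so the check is only meaningful when it keeps reporting pods over repeated runs.

```shell
# Hypothetical helper: read `oc get pods` table output on stdin and print the
# name and status of every sriov-device-plugin pod that is not Running.
unsettled_device_plugin_pods() {
  awk '$1 ~ /^sriov-device-plugin-/ && $3 != "Running" { print $1, $3 }'
}
```

It could be fed with `oc -n openshift-sriov-network-operator get pods | unsettled_device_plugin_pods`; the header line is ignored because it does not match the pod name prefix.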
      
      This also affects the other SR-IOV resources, which now report 0 allocatable:
      
      oc get nodes -o json | jq -r '.items[0].status.allocatable'
      {
        "cpu": "60",
        "ephemeral-storage": "1725943497941",
        "hugepages-1Gi": "32Gi",
        "hugepages-2Mi": "0",
        "intel.com/intel_fec_acc100": "16",
        "management.workload.openshift.io/cores": "64k",
        "memory": "96370036Ki",
        "openshift.io/du_fh": "0",
        "openshift.io/du_mh": "0",
        "openshift.io/pci_sriov_net_f1": "0",
        "openshift.io/snnp1": "0",
        "openshift.io/snnp2": "0",
        "pods": "250"
      }
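The zeroed counters in the output above can be picked out with a short filter. This is a line-oriented sketch with a hypothetical helper name, `zeroed_sriov_resources`; it assumes the one-key-per-line pretty-printed JSON that `jq` produces by default and only looks at resources under the `openshift.io/` prefix, as in this report.

```shell
# Hypothetical helper: read the pretty-printed allocatable map on stdin and
# print every openshift.io/ resource whose value is "0".
# Splitting on double quotes makes field 2 the key and field 4 the value.
zeroed_sriov_resources() {
  awk -F'"' '$2 ~ /^openshift\.io\// && $4 == "0" { print $2 }'
}
```

It could be chained onto the command used in this report, for example `oc get nodes -o json | jq -r '.items[0].status.allocatable' | zeroed_sriov_resources`. An empty result means no `openshift.io/` resource is reporting 0.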
      

      Expected results:

      The node resources are updated correctly after the policy is deleted.

      Additional info:

      Attaching must-gather.    

              apanatto@redhat.com Andrea Panattoni
              mcornea@redhat.com Marius Cornea
              Marius Cornea