OpenShift Bugs / OCPBUGS-34934

sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource


    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • 4.16
    • Networking / SR-IOV
    • None
    • Critical
    • No
    • CNF Network Sprint 254, CNF Network Sprint 255
    • 2
    • False
    • Cause: The sriov-network-operator does not reconcile previously configured vfio-pci virtual functions when the related SriovNetworkNodePolicy is deleted.
      Consequence: The sriov-device-plugin component goes into a restart loop until the policy is recreated.
      Workaround: Delete every remaining policy that affects the physical function, then recreate them.
    • Known Issue
    • Proposed
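The workaround in the release note above (delete every remaining policy that affects the physical function, then recreate them) could be scripted roughly as follows. This is a minimal sketch, not a verified procedure: it assumes the original policy manifests are still available as local files (the names `snnp1.yaml` and `snnp2.yaml` in the usage comment are hypothetical), and the function name `recreate_policies` is made up for illustration. The report does not say whether a pause is needed between deletion and recreation, so none is included.

```shell
# Sketch of the workaround: delete all listed policies, then recreate them.
# Assumption: each argument is a manifest file holding one SriovNetworkNodePolicy
# that targets the affected physical function.
recreate_policies() {
  # First pass: delete every policy so that nothing references the PF anymore.
  for manifest in "$@"; do
    oc delete -f "$manifest" || return 1
  done
  # Second pass: recreate the policies from the same manifests.
  for manifest in "$@"; do
    oc create -f "$manifest" || return 1
  done
}

# Usage (hypothetical file names):
#   recreate_policies snnp1.yaml snnp2.yaml
```

Deleting everything before recreating anything matches the wording of the workaround; interleaving delete and create per policy would leave other policies on the physical function in place during each step.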

      Description of problem:

      sriov-device-plugin pod ends in a restart loop after deleting a SriovNetworkNodePolicy resource.    

      Version-Release number of selected component (if applicable):

      4.16.0-rc.3
      sriov-network-operator.v4.16.0-202405301906

      How reproducible:

       100%

      Steps to Reproduce:

      1. On a single-node OpenShift (SNO) cluster with the DU profile, create the following SriovNetworkNodePolicy (SNNP) resources:
      
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: snnp1
        namespace: openshift-sriov-network-operator
      spec:
        deviceType: vfio-pci
        isRdma: false
        nicSelector:
          pfNames:
          - ens2f3#32-33
        nodeSelector:
          node-role.kubernetes.io/master: ""
        numVfs: 48
        resourceName: snnp1
      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: snnp2
        namespace: openshift-sriov-network-operator
      spec:
        deviceType: vfio-pci
        isRdma: false
        nicSelector:
          pfNames:
          - ens2f3#34-35
        nodeSelector:
          node-role.kubernetes.io/master: ""
        numVfs: 48
        resourceName: snnp2
      
      
      2. Wait for the resources to appear in the node's allocatable resources:
      
      oc get nodes -o json | jq -r '.items[0].status.allocatable'
      {
        "cpu": "60",
        "ephemeral-storage": "1725943497941",
        "hugepages-1Gi": "32Gi",
        "hugepages-2Mi": "0",
        "intel.com/intel_fec_acc100": "16",
        "management.workload.openshift.io/cores": "64k",
        "memory": "96370036Ki",
        "openshift.io/du_fh": "16",
        "openshift.io/du_mh": "16",
        "openshift.io/pci_sriov_net_f1": "2",
        "openshift.io/snnp1": "2",
        "openshift.io/snnp2": "2",
        "pods": "250"
      }
      
      3. Delete the snnp2 resource:
      
      oc -n openshift-sriov-network-operator delete sriovnetworknodepolicy snnp2
      
      4. Check the pods in the openshift-sriov-network-operator namespace:
      
      oc -n openshift-sriov-network-operator get pods
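Step 2 above is a manual wait. A minimal polling sketch is shown below, assuming a single-node cluster (so the first node is the only node) and using the `-o jsonpath` output of `oc` instead of `jq`. The function names `allocatable_count` and `wait_for_allocatable`, the `POLL_INTERVAL` variable, and the default of 30 attempts are illustrative choices, not part of the report.

```shell
# Print the allocatable count of one extended resource on the first node.
allocatable_count() {
  # Dots in the resource name must be escaped inside a jsonpath expression.
  escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
  oc get nodes -o jsonpath="{.items[0].status.allocatable.${escaped}}"
}

# Poll until the resource reports the expected count or the attempts run out.
wait_for_allocatable() {
  resource=$1
  expected=$2
  attempts=${3:-30}
  while [ "$attempts" -gt 0 ]; do
    [ "$(allocatable_count "$resource")" = "$expected" ] && return 0
    attempts=$((attempts - 1))
    sleep "${POLL_INTERVAL:-10}"
  done
  return 1
}

# Usage, with the values from step 2:
#   wait_for_allocatable openshift.io/snnp1 2 && wait_for_allocatable openshift.io/snnp2 2
```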

      Actual results:

      The sriov-device-plugin pod is restarted continuously:
      
      oc -n openshift-sriov-network-operator get pods
      NAME                                      READY   STATUS              RESTARTS   AGE
      sriov-device-plugin-2ntll                 0/1     Terminating         0          4s
      sriov-device-plugin-4kw94                 0/1     ContainerCreating   0          0s
      sriov-network-config-daemon-59k4c         1/1     Running             0          53m
      sriov-network-operator-58c996d746-nwktl   1/1     Running             0          57m
      
      This also affects the other SR-IOV resources, which all report 0 allocatable:
      
      oc get nodes -o json | jq -r '.items[0].status.allocatable'
      {
        "cpu": "60",
        "ephemeral-storage": "1725943497941",
        "hugepages-1Gi": "32Gi",
        "hugepages-2Mi": "0",
        "intel.com/intel_fec_acc100": "16",
        "management.workload.openshift.io/cores": "64k",
        "memory": "96370036Ki",
        "openshift.io/du_fh": "0",
        "openshift.io/du_mh": "0",
        "openshift.io/pci_sriov_net_f1": "0",
        "openshift.io/snnp1": "0",
        "openshift.io/snnp2": "0",
        "pods": "250"
      }
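In the pod listing above, the sriov-device-plugin pod is replaced by a new pod with a different name while RESTARTS stays at 0, so a restart counter alone would not reveal the loop. One way to detect the symptom is to sample the pod names twice and compare them. The sketch below assumes the pod names keep the `sriov-device-plugin-` prefix shown in the listing; the function names and the `SAMPLE_INTERVAL` variable are hypothetical.

```shell
NS=openshift-sriov-network-operator

# List the current sriov-device-plugin pod names, one per line, sorted.
device_plugin_pods() {
  oc -n "$NS" get pods -o name | grep '^pod/sriov-device-plugin-' | sort
}

# Succeed if the set of device plugin pod names changes between two samples.
device_plugin_is_churning() {
  before=$(device_plugin_pods)
  sleep "${SAMPLE_INTERVAL:-30}"
  after=$(device_plugin_pods)
  [ "$before" != "$after" ]
}

# Usage:
#   if device_plugin_is_churning; then echo "sriov-device-plugin is being recreated"; fi
```

A single changed sample can also come from a legitimate one-time rollout, so repeating the check a few times gives a more reliable signal.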
      

      Expected results:

      Node allocatable resources are updated correctly after the policy is deleted, and the sriov-device-plugin pod keeps running.

      Additional info:

      Attaching must-gather.    

            apanatto@redhat.com Andrea Panattoni
            mcornea@redhat.com Marius Cornea
            Marius Cornea