OpenShift Bugs: OCPBUGS-14560

[4.12.20] SRIOV pods issues after system reboot. Manual cleanup is required

Details

    • Important
    • No
    • False
    • Customer Escalated
      7/18: The PR for OCPBUGS-14605 has been merged upstream. Waiting for backports, after which this bug will be closed.

    Description

      Description of problem:

      
      SRIOV pods show several problems after a system reboot on an SNO cluster.
      
      SRIOV pods appear duplicated: the pods that existed before the reboot do not seem to be deleted properly.
      
      oc get pods:
      ...
      test32-deployment-6b5c896c96-6r9kf     6/6     Running                    0          19m
      test32-deployment-6b5c896c96-m5qzv     0/6     UnexpectedAdmissionError   0          32m
      ...
      
      

      Version-Release number of selected component (if applicable):

      
      First seen in 4.12.20. It was not seen in earlier versions.
      
      

      How reproducible:

      
      100% of the time after a system reboot, whether through a graceful shutdown or via Redfish.
      
      

      Steps to Reproduce:

      1. Deploy a DU application with several SRIOV pods.
      2. Reboot the system, either with a graceful shutdown (https://docs.openshift.com/container-platform/4.12/backup_and_restore/graceful-cluster-shutdown.html) or via Redfish.
      3. Check the SRIOV pods after the reboot.
      
      

      Actual results:

      
      Some SRIOV pods appear duplicated, with one copy in an error state:
      
      oc get pods:
      ...
      test32-deployment-6b5c896c96-6r9kf     6/6     Running                    0          19m
      test32-deployment-6b5c896c96-m5qzv     0/6     UnexpectedAdmissionError   0          32m
      ...
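The failed copies can be picked out of such a listing mechanically. The following is only a minimal sketch, assuming the default `oc get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE); the function name `pods_with_status` is hypothetical and not part of any OpenShift tooling.

```python
from typing import List


def pods_with_status(listing: str, status: str = "UnexpectedAdmissionError") -> List[str]:
    """Return the names of pods whose STATUS column equals `status`."""
    names = []
    for line in listing.splitlines():
        fields = line.split()
        # A data row has at least five columns: NAME READY STATUS RESTARTS AGE.
        # Skip the header row, blank lines, and elision markers such as "...".
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        if fields[2] == status:
            names.append(fields[0])
    return names


# The two rows quoted in this report.
sample = """\
test32-deployment-6b5c896c96-6r9kf     6/6     Running                    0          19m
test32-deployment-6b5c896c96-m5qzv     0/6     UnexpectedAdmissionError   0          32m
"""

print(pods_with_status(sample))
```

Run against the two rows above, this reports only the pod in `UnexpectedAdmissionError`, which is the copy that needs manual cleanup.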
      
      
      

      Expected results:

      
      oc get pods --> all SRIOV pods running normally, with no leftover duplicates
      
      

      Additional info:

      
      System impact: The DU's SRIOV pods must be cleaned up manually after any kind of reboot.
      
      
      Describe output for the old test32 pod:
      
      oc describe pod/test32-deployment-6b5c896c96-m5qzv
      
        Warning  UnexpectedAdmissionError  25m   kubelet            Allocate failed due to no healthy devices present; cannot allocate unhealthy devices openshift.io/pci_sriov_net_llscu, which is unexpected
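The manual cleanup amounts to deleting each pod left in this state. As a rough sketch only, the commands could be generated as below; the helper name `delete_commands` and the namespace `du-namespace` are placeholders (the report does not name the namespace), and the commands are printed rather than executed so they can be reviewed first.

```python
import shlex
from typing import List


def delete_commands(pod_names: List[str], namespace: str) -> List[str]:
    """Build one `oc delete pod` command line per pod, without running it."""
    return [
        shlex.join(["oc", "delete", "pod", name, "-n", namespace])
        for name in pod_names
    ]


# "du-namespace" is a placeholder; substitute the namespace of the DU application.
commands = delete_commands(["test32-deployment-6b5c896c96-m5qzv"], "du-namespace")
for cmd in commands:
    print(cmd)
```

Printing instead of executing keeps the sketch safe to run anywhere. This is a workaround for the symptom, not a fix for the underlying bug.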
      
      
      

          People

            bnemeth@redhat.com Balazs Nemeth
            rlopezma@redhat.com Rodrigo Lopez Manrique
            Zhanqi Zhao