OpenShift Bugs / OCPBUGS-14438

4.13.z: [Clone of OCPBUGS-8287] SNO 4.10: Power cycle node and MAC address of NIC not available when VDU application starts on Intel E810-C NIC

Details

    • Important
    • CNF Compute Sprint 237
    • 1
    • No
    • Rejected
    • False
    • Copy from Release Note text in OCPBUGS-2180.

      Please note the following:
      1. Pods that fail at admission time continue to exist on the node and need to be removed manually:
      kubectl delete pods --field-selector status.phase=Failed -n <Namespace>
      2. Deploying workloads as standalone (naked) pods is NOT recommended. Wrap them in Deployments, ReplicaSets, or DaemonSets instead. A standalone pod that fails at admission is not retried, whereas the deployment controller creates replacement pods after a failure so that the deployment eventually succeeds.
      3. On SNO setups with SR-IOV devices, deployments must explicitly request the devices in their spec. The SR-IOV device plugin depends on a network resource injector pod and an operator webhook. Normally, the absence of the resource injector or webhook causes pods that rely on SR-IOV devices to fail. On SNO such a failure can have a detrimental impact on the cluster, so the webhook failure policy is set to Ignore. With that policy, a pod can be admitted without the injected device request, so the deployment must request the SR-IOV device explicitly in its spec.
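Points 2 and 3 of the release note can be sketched as a Deployment that states its SR-IOV device request explicitly instead of relying on the resource injector. The following is a minimal Python sketch that only builds the manifest as a dictionary. The resource name `openshift.io/example_sriov_resource`, the network name, the image, and the helper `sriov_deployment` are hypothetical placeholders, not values taken from this bug; the real resource name comes from the cluster's SR-IOV configuration.

```python
import json

# Hypothetical placeholder: the real extended-resource name is defined by the
# cluster's SR-IOV network node policy and is not given in this bug.
SRIOV_RESOURCE = "openshift.io/example_sriov_resource"


def sriov_deployment(name, image, network, resource=SRIOV_RESOURCE, count=1):
    """Build a Deployment manifest that requests an SR-IOV device explicitly.

    The workload is wrapped in a Deployment (not a standalone pod) so that the
    deployment controller can create a replacement pod after an admission failure.
    """
    labels = {"app": name}
    # Extended resources must have equal requests and limits when both are set.
    device_request = {resource: str(count)}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {
                    "labels": labels,
                    # Secondary network attachment (hypothetical network name).
                    "annotations": {"k8s.v1.cni.cncf.io/networks": network},
                },
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "resources": {
                                "requests": dict(device_request),
                                "limits": dict(device_request),
                            },
                        }
                    ]
                },
            },
        },
    }


def to_json(manifest):
    """Serialize the manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2)
```

The output of `to_json(sriov_deployment("demo", "example.invalid/demo:latest", "demo-net"))` could be piped to `kubectl apply -f -`. Because the device request is written into the spec, the pod does not depend on the webhook being reachable at admission time.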
    • 5/15: Copy from status of https://issues.redhat.com/browse/OCPBUGS-2180
      3/13: 4.10 copy from OCPBUGSM-47835 / OCPBUGS-9898

    Description

      Description of problem:


      This bug backports the fix from BZ-2117049 to OpenShift 4.10, as required by Ericsson.


      Description of problem:
      The VDU application is deployed on SNO and the carriers on the cell are enabled.
      A cold boot or power cycle occurs, and the platform and the VDU application start up at the same time. The VDU application fails to start completely because the MAC addresses of the VFs on the Intel E810-C NIC are not available when the VDU application starts.

      During VDU application start-up, the baseband pod uses the rft_dpdk_getport utility to query the MAC address of the llscu VF. If the MAC address is not available, there is a core dump.

      This used to work before the kernel updates picked up new content from Intel in 4.9.37/4.10.17.

      Application pod state and core dump:

      eric-ran-du-baseband-bf6669bd-ksjjv 4/5 CrashLoopBackOff 18 (66s ago) 11h

      core.rft_dpdk_getpor.0.ee6850a4002649698f2770c8080b90d1.84295.1659446944000000.lz4
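
The pod state shown above (4/5 containers ready, CrashLoopBackOff, 18 restarts) can also be read programmatically from the pod's status. This is a minimal sketch, assuming the pod JSON as returned by `kubectl get pod <name> -o json`. The helper name `crashlooping_containers` is hypothetical and is not part of any tool mentioned in this bug.

```python
def crashlooping_containers(pod):
    """Return (container name, restart count) pairs for containers that are
    currently waiting with reason CrashLoopBackOff.

    `pod` is a dict parsed from `kubectl get pod <name> -o json`.
    """
    hits = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for cs in statuses:
        # state holds exactly one of: running, waiting, terminated.
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            hits.append((cs["name"], cs.get("restartCount", 0)))
    return hits


def ready_summary(pod):
    """Return the READY column as kubectl prints it, for example '4/5'."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    ready = sum(1 for cs in statuses if cs.get("ready"))
    return "{}/{}".format(ready, len(statuses))
```

Feeding the parsed pod JSON to both helpers identifies which of the five containers is restarting, which the one-line `kubectl get pods` summary does not show.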

      Version-Release number of selected component (if applicable):
      SNO clusters v4.9.37 or 4.10.24

      How reproducible:
      Reproducible within customer environment

      Actual results:
      The Baseband pod is in CrashLoopBackOff

      Expected results:
      The baseband pods should spin up without failing or causing delay

      Additional info:
      There was a related case 03089320 which was closed in February with a mitigation fix in the February 15th version of the SR-IOV operator. The real fix from Intel was not available at the time.

      03089320 – SNO: After reboot node, application pods stuck in CreateContainerConfigError state - endpoint not found openshift.io/pci_sriov_net_*


            People

              swsehgal@redhat.com Swati Sehgal
              rhn-support-igarciam Ignacio Garcia Medina
              Shereen Haj Shereen Haj