  1. OpenShift Bugs
  2. OCPBUGS-14439

4.14.z: [Clone of OCPBUGS-8287] SNO 4.10: Power cycle node and MAC address of NIC not available when VDU application starts on Intel E810-C NIC


Details

    • Important
    • No
    • CNF Compute Sprint 237
    • 1
    • Rejected
    • False
    • Hide
      Pod admission error after a node reboot in {sno} clusters

      In {sno} clusters, pods might fail with an `UnexpectedAdmissionError` error after a node reboot. This issue occurs if you do not drain the node to remove all running pods before rebooting it. In such scenarios, the pod recovery order is not predictable, so application pods can recover before the device plugin pod that they depend on.

      As a workaround, if a pod fails with an `UnexpectedAdmissionError` error after a node reboot, you must manually remove the pod by running a command such as the following:
      +
      [source,terminal]
      ----
      $ kubectl delete pods --field-selector status.phase=Failed -n pod-namespace
      ----

      The deployment controller continues to reconcile the pod creation until the device plugin pod that the application depends on fully recovers. This fix was contributed upstream to the Kubernetes project; see https://github.com/kubernetes/kubernetes/pull/116376 for further information.

      [NOTE]
      ====
      For pod reconciliation to occur, the pod must be part of a `Deployment`, `ReplicaSet`, or `DaemonSet` resource. The deployment controller does not attempt to reconcile standalone pods that fail after a node reboot.
      ====
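
The workaround command above deletes every pod in the `Failed` phase in the namespace, whatever the failure reason. The following is a minimal sketch, not part of the original workaround, of one way to narrow the selection to pods whose `STATUS` column reads `UnexpectedAdmissionError`. It assumes the default `kubectl get pods --no-headers` column layout (`NAME READY STATUS RESTARTS AGE`) for a single namespace. The helper name `failed_admission_pods` is hypothetical, and `pod-namespace` is the same placeholder used in the command above.

```shell
#!/bin/sh
# Sketch only: failed_admission_pods is a hypothetical helper, not a kubectl feature.
# Input (stdin): output of `kubectl get pods -n <namespace> --no-headers`,
# assumed to have the columns NAME READY STATUS RESTARTS AGE.
# Output: one pod name per line for pods whose STATUS is UnexpectedAdmissionError.
failed_admission_pods() {
  awk '$3 == "UnexpectedAdmissionError" { print $1 }'
}

# Example use (assumes kubectl access to the cluster; xargs -r is a GNU extension
# that skips the delete when no pod names are produced):
#   kubectl get pods -n pod-namespace --no-headers \
#     | failed_admission_pods \
#     | xargs -r kubectl delete pod -n pod-namespace
```

Filtering on the status text leaves pods that failed for unrelated reasons in place, so they remain available for inspection.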
    • Known Issue
    • In Progress
    • Hide
      5/15: Copy from status of https://issues.redhat.com/browse/OCPBUGS-2180
      3/13: 4.10 copy from OCPBUGSM-47835 / OCPBUGS-9898

    Description

      Description of problem:


      This bug tracks the backport of the fix in BZ-2117049 to OpenShift 4.10, as required by Ericsson.


      Description of problem:
      The VDU application is deployed and carriers on cell are enabled on SNO.
      A cold boot or power cycle occurs. The platform and VDU application start up at the same time. The VDU application fails to start completely because the MAC addresses of the VFs on the E810-C NIC are not available when the VDU application starts.

      During VDU application start-up, the baseband pod uses the rft_dpdk_getport utility to query the MAC address of the llscu VF. If the MAC address is not available, there is a core dump.

      This used to work before the kernel updates picked up new content from Intel in 4.9.37/4.10.17.

      Application pod state and core dump:

      eric-ran-du-baseband-bf6669bd-ksjjv 4/5 CrashLoopBackOff 18 (66s ago) 11h

      core.rft_dpdk_getpor.0.ee6850a4002649698f2770c8080b90d1.84295.1659446944000000.lz4

      Version-Release number of selected component (if applicable):
      SNO clusters v4.9.37 or 4.10.24

      How reproducible:
      Reproducible within customer environment

      Actual results:
      The Baseband pod is in CrashLoopBackOff

      Expected results:
      The baseband pods should spin up without failing or causing delay

      Additional info:
      There was a related case 03089320 which was closed in February with a mitigation fix in the February 15th version of the SR-IOV operator. The real fix from Intel was not available at the time.

      03089320 – SNO: After reboot node, application pods stuck in CreateContainerConfigError state - endpoint not found openshift.io/pci_sriov_net_*


            People

              swsehgal@redhat.com Swati Sehgal
              rhn-support-igarciam Ignacio Garcia Medina
              Shereen Haj Shereen Haj