[OCPBUGS-14439] 4.14.z: [Clone of OCPBugs-8287] SNO 4.10: Power cycle node and MAC address of NIC not available when VDU application starts on Intel E810-C Nic - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: None
Affects Version/s: 4.10
Component/s: Node / Kubelet
Labels:

Severity:
Important
Regression:
No
Sprint:
CNF Compute Sprint 237
sprint_count:
1
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None
Release Note Text:

Pod admission error after a node reboot in {sno} clusters

In {sno} clusters, pod failures might occur with an `UnexpectedAdmissionError` error after a node reboot. This issue occurs if you don't drain the node to remove all running pods before a node reboot. In such scenarios, the pod recovery order is not predictable, which can cause application pods to recover before a dependent device plugin pod.

As a workaround, if a pod fails with an `UnexpectedAdmissionError` error after a node reboot, you must manually remove the pod by running a command such as the following:
+
[source,terminal]
----
$ kubectl delete pods --field-selector status.phase=Failed -n pod-namespace
----

The deployment controller continues to reconcile the pod creation until the dependent device plugin pod fully recovers. The fix was contributed upstream to the Kubernetes project; see https://github.com/kubernetes/kubernetes/pull/116376 for further information.

[NOTE]
====
For pod reconciliation to occur, ensure that the pod is part of a `Deployment`, `ReplicaSet`, or `DaemonSet` resource. The deployment controller does not attempt to reconcile standalone pods that fail after a node reboot.
====
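
The workaround command above removes every pod in the `Failed` phase in the namespace. A narrower variant can be sketched as follows: list the pods, keep only those rejected at admission, and delete just those. This is a minimal sketch, not part of the documented workaround. It assumes the kubelet records `UnexpectedAdmissionError` in the pod's `status.reason` field, and the function names and the `dry_run` parameter are hypothetical.

```python
import json
import subprocess

# Assumption: the kubelet reports the admission rejection in status.reason.
ADMISSION_ERROR = "UnexpectedAdmissionError"


def pods_rejected_at_admission(pod_list: dict) -> list:
    """Return names of Failed pods whose status.reason is UnexpectedAdmissionError."""
    names = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Failed" and status.get("reason") == ADMISSION_ERROR:
            names.append(pod["metadata"]["name"])
    return names


def delete_rejected_pods(namespace: str, dry_run: bool = True) -> list:
    """List pods in the namespace; delete only the admission-rejected ones.

    With dry_run=True (the default) nothing is deleted; the names are only returned.
    """
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    names = pods_rejected_at_admission(json.loads(out))
    if names and not dry_run:
        subprocess.run(
            ["kubectl", "delete", "pods", "-n", namespace, *names],
            check=True,
        )
    return names


if __name__ == "__main__":
    # "pod-namespace" is the same placeholder used in the workaround command.
    print(delete_rejected_pods("pod-namespace"))
```

Running it with the default `dry_run=True` only prints the pods that would be removed, which allows the selection to be checked before anything is deleted.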

Release Note Type:
Known Issue
Release Note Status:
In Progress
Internal Whiteboard:
Latest Status Summary:

5/15: Copy from status of https://issues.redhat.com/browse/OCPBUGS-2180
3/13: 4.10 copy from OCPBUGSM-47835 / OCPBUGS-9898

Target Version:

4.14.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:
PX Priority Data:

Description of problem:

This bug tracks the backport of the fix in BZ-2117049 to OpenShift 4.10, as required by Ericsson.

Description of problem:
The VDU application is deployed and carriers on cell are enabled on SNO.
A cold boot or power cycle occurs. The platform and VDU application start up at the same time. The VDU application fails to start completely because the MAC addresses of the VFs on the E810-C NIC are not available when the VDU application starts.

During VDU application start-up, the baseband pod uses the rft_dpdk_getport utility to query the MAC address of the llscu VF. If the MAC address is not available, the utility crashes with a core dump.

This used to work before the kernel updates picked up new content from Intel in 4.9.37/4.10.17.

Application pod state and core dump:

eric-ran-du-baseband-bf6669bd-ksjjv 4/5 CrashLoopBackOff 18 (66s ago) 11h

core.rft_dpdk_getpor.0.ee6850a4002649698f2770c8080b90d1.84295.1659446944000000.lz4

Version-Release number of selected component (if applicable):
SNO clusters v4.9.37 or 4.10.24

How reproducible:
Reproducible within customer environment

Actual results:
The Baseband pod is in CrashLoopBackOff

Expected results:
The baseband pods should spin up without failing or causing delay

Additional info:
There was a related case 03089320 which was closed in February with a mitigation fix in the February 15th version of the SR-IOV operator. The real fix from Intel was not available at the time.

03089320 – SNO: After reboot node, application pods stuck in CreateContainerConfigError state - endpoint not found openshift.io/pci_sriov_net_*

clones

OCPBUGS-8287 4.10.z: SNO 4.10: Power cycle node and MAC address of NIC not available when VDU application starts on Intel E810-C Nic

Closed

is depended on by

OCPBUGS-14438 4.13.z: [Clone of OCPBugs-8287]SNO 4.10: Power cycle node and MAC address of NIC not available when VDU application starts on Intel E810-C Nic

Closed

links to

node: device-mgr: Handle recovery flow by checking if healthy devices exist- attempt 2 #116376

openshift/kubernetes#1566: OCPBUGS-8287: [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists

RHBA-2023:6837 OpenShift Container Platform 4.14.z bug fix update

Assignee: Swati Sehgal

Reporter: Ignacio Garcia Medina

QA Contact: Shereen Haj

Votes: 0

Watchers: 5

Created: 2023/06/01 3:44 PM

Updated: 2023/11/16 1:13 PM

Resolved: 2023/11/15 4:22 AM
