Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: 4.10
Component/s: Node / Kubelet
Labels:

Severity:
Important
Regression:
No
Sprint:
CNF Compute Sprint 237
sprint_count:
1
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None
Release Note Text:

Copy from Release Note text in OCPBUGS-2180.

Please note the following:
1. Pods that fail at admission time continue to exist on the node and need to be removed manually:
kubectl delete pods --field-selector status.phase=Failed -n <Namespace>
2. Deploying workloads as standalone (naked) pods is NOT recommended. Wrap them in Deployments, ReplicaSets, or DaemonSets instead. A standalone pod that fails at admission is not retried, whereas the deployment controller creates replacement pods after a failure so that the deployment eventually succeeds.
3. On SNO setups with SR-IOV devices, deployments must explicitly request the devices in their spec. The SR-IOV device plugin depends on a network resource injector pod and an operator webhook. The absence of the resource injector or webhook would typically cause pods that rely on SR-IOV devices to fail. On SNO such failures can have a detrimental impact on the cluster, so the failure policy is set to Ignore, which means the deployment must request the SR-IOV device explicitly in its spec.
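
To make point 3 concrete, here is a minimal Python sketch that builds a Deployment manifest whose container requests an SR-IOV resource explicitly, instead of relying on the resource injector to add the request. The resource name `openshift.io/example_sriov_net`, the image, and the deployment name are hypothetical placeholders, not values from this issue. The sketch also omits the network attachment annotation that a real SR-IOV workload would normally carry.

```python
import json

# Hypothetical placeholder: substitute the resource name advertised by the
# SR-IOV device plugin on your node. This value is not taken from the issue.
SRIOV_RESOURCE = "openshift.io/example_sriov_net"


def build_deployment(name, image, resource_name, count=1):
    """Return a Deployment manifest (as a dict) whose single container
    explicitly requests `count` units of the extended resource."""
    quantity = str(count)
    container = {
        "name": name,
        "image": image,
        "resources": {
            # For extended resources, requests and limits must be equal.
            "requests": {resource_name: quantity},
            "limits": {resource_name: quantity},
        },
    }
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    # Name and image are illustrative only.
    manifest = build_deployment(
        "example-vdu", "registry.example.com/vdu:latest", SRIOV_RESOURCE
    )
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests, the printed output could be piped to `kubectl apply -f -` after the placeholders are replaced with real values.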

Internal Whiteboard:
Latest Status Summary:

5/15: Copy from status of https://issues.redhat.com/browse/OCPBUGS-2180
3/13: 4.10 copy from OCPBUGSM-47835 / OCPBUGS-9898

Target Version:

4.13.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:
PX Priority Data:

Description of problem:

This bug tracks the backport of the fix in BZ-2117049 to OpenShift 4.10, as required by Ericsson.

Description of problem:
The VDU application is deployed on SNO and the carriers on the cell are enabled.
A cold boot or power cycle occurs. The platform and the VDU application start up at the same time. The VDU application fails to start completely because the MAC addresses of the VFs on the E810-C NIC are not yet available when the VDU application starts.

During VDU application start-up, the baseband pod uses the rft_dpdk_getport utility to query the MAC address of the llscu VF. If the MAC address is not available, there is a core dump.
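
The failing step above is a single MAC lookup that is attempted before the VF is ready. As an illustration of that race only, the following Python sketch polls for a usable MAC address with a timeout instead of failing on the first miss. It does not describe how rft_dpdk_getport behaves, and it is not the fix tracked by this bug, which is in the kubelet device manager. `read_mac` is a hypothetical caller-supplied function, and the all-zero address check is an assumption about what "not available" looks like.

```python
import time

# Assumption: an unassigned VF reports either no address or an all-zero one.
ZERO_MAC = "00:00:00:00:00:00"


def wait_for_mac(read_mac, timeout_s=60.0, interval_s=1.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll read_mac() until it returns a usable MAC address.

    read_mac is a hypothetical callable returning a MAC string, or None
    when no address is available yet. Raises TimeoutError if no usable
    address appears before timeout_s elapses.
    """
    deadline = clock() + timeout_s
    while True:
        mac = read_mac()
        if mac and mac.lower() != ZERO_MAC:
            return mac
        if clock() >= deadline:
            raise TimeoutError("MAC address not available before timeout")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated reader: the address becomes available on the third attempt.
    answers = iter([None, ZERO_MAC, "aa:bb:cc:dd:ee:ff"])
    print(wait_for_mac(lambda: next(answers), interval_s=0.0))
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.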

This used to work before the kernel updates picked up new content from Intel in 4.9.37/4.10.17.

Application pod state and core dump:

eric-ran-du-baseband-bf6669bd-ksjjv 4/5 CrashLoopBackOff 18 (66s ago) 11h

core.rft_dpdk_getpor.0.ee6850a4002649698f2770c8080b90d1.84295.1659446944000000.lz4

Version-Release number of selected component (if applicable):
SNO clusters v4.9.37 or 4.10.24

How reproducible:
Reproducible within customer environment

Actual results:
The baseband pod is in CrashLoopBackOff.

Expected results:
The baseband pods should spin up without failing or causing delay

Additional info:
There was a related case 03089320 which was closed in February with a mitigation fix in the February 15th version of the SR-IOV operator. The real fix from Intel was not available at the time.

03089320 – SNO: After reboot node, application pods stuck in CreateContainerConfigError state - endpoint not found openshift.io/pci_sriov_net_*


clones

OCPBUGS-8287 4.10.z: SNO 4.10: Power cycle node and MAC address of NIC not available when VDU application starts on Intel E810-C Nic

Closed

depends on

OCPBUGS-14439 4.14.z: [Clone of OCPBugs-8287] SNO 4.10: Power cycle node and MAC address of NIC not available when VDU application starts on Intel E810-C Nic

Closed

is depended on by

OCPBUGS-14437 4.12.z: [Clone of OCPBugs-8287] SNO 4.10: Power cycle node and MAC address of NIC not available when VDU application starts on Intel E810-C Nic

Closed

links to

node: device-mgr: Handle recovery flow by checking if healthy devices exist- attempt 2 #116376

openshift/kubernetes#1566: OCPBUGS-8287: [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists

RN doc update


Assignee: Swati Sehgal

Reporter: Ignacio Garcia Medina

QA Contact: Shereen Haj

Votes: 0

Watchers: 4

Created: 2023/06/01 3:42 PM

Updated: 2023/06/27 1:46 PM

Resolved: 2023/06/13 1:11 PM
