Fast Datapath Product / FDP-2568

QE verification: Investigate ways to reduce OVNKube stack usage on ppc64le

      ( ) The bug has been reproduced and verified by QE members
      ( ) Test coverage has been added to downstream CI
      ( ) For new feature, failed test plans have bugs added as children to the epic
      ( ) The bug is cloned to any relevant release that we support and/or is needed

    • rhel-net-ovs-dpdk

      This ticket is tracking the QE verification effort for the solution to the problem described below.
      Description of problem:
      Case 03597973 raised awareness that OVNKube on Power can overflow the kernel stack. The current target workaround is to ensure that RHEL 9.4 and above ship with a larger kernel stack so that the overflow no longer occurs, but it is worth investigating whether there are other ways OVNKube could reduce kernel stack usage in OpenShift 4.11 and above. The issue is present in both the 8.6 and 9.2 kernels, but the stack size fix cannot be backported to older releases, since changing the kernel stack size in the middle of a RHEL y-stream would break kABI.
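
      One way QE could observe how close OVNKube workloads get to that limit is the kernel's built-in stack tracer. Below is a minimal sketch, assuming the node kernel was built with CONFIG_STACK_TRACER and that debugfs is mounted at /sys/kernel/debug (run as root, e.g. from a debug pod with host access); the polling interval is arbitrary.

      #!/usr/bin/env python3
      # Sketch: sample the kernel stack tracer to report the deepest kernel
      # stack observed on a node. Assumes CONFIG_STACK_TRACER=y and debugfs
      # mounted at /sys/kernel/debug; must run as root.
      import time
      from pathlib import Path

      TRACING = Path("/sys/kernel/debug/tracing")
      ENABLE = Path("/proc/sys/kernel/stack_tracer_enabled")

      def main(interval=30):
          ENABLE.write_text("1")   # turn the stack tracer on
          try:
              while True:
                  depth = int(TRACING.joinpath("stack_max_size").read_text())
                  print(f"deepest kernel stack so far: {depth} bytes")
                  # stack_trace lists the call chain that produced that depth
                  print(TRACING.joinpath("stack_trace").read_text())
                  time.sleep(interval)
          finally:
              ENABLE.write_text("0")   # leave the tracer off when done

      if __name__ == "__main__":
          main()

      Comparing the reported depth against the node's configured kernel stack size would show how much headroom OVNKube leaves on a given kernel.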

       

      Version-Release number of selected component (if applicable):
      OpenShift 4.11+. A workaround that increases the kernel stack size should relieve the pressure in OpenShift 4.16 and up, but those versions would probably also benefit from any stack usage improvements that can be discovered.

       

      How reproducible:
      The multi-arch team (#forum-ocp-multiarch) hits this issue with some frequency in regular e2e jobs running in CI in OpenShift 4.14 and above. shgokul can probably provide cluster specs, and manoj5 can coordinate testing / verification.
       

      Steps to Reproduce:
      The multi-arch team has historically seen this issue when monitoring the remote-libvirt-ppc64le CI jobs:
      https://prow.ci.openshift.org/?job=*ocp-e2e-ovn-remote-libvirt-ppc64le

      Our jobs are pretty stable these days, but when we do hit the crash, a kdump is produced in artifacts/ocp-e2e-ovn-remote-libvirt-ppc64le/ipi-conf-debug-kdump-gather-logs/artifacts/.

      Here is an example of a job run that hit this issue:
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ovn-remote-libvirt-ppc64le/1745475910083022848/artifacts/ocp-e2e-ovn-remote-libvirt-ppc64le/ipi-conf-debug-kdump-gather-logs/artifacts/
       

      Actual results:
      One of the compute node VMs crashes during test execution. If kdump is enabled, a core dump is produced and the test run proceeds as normal. If kdump is not enabled, the node doesn't come back up after the crash, usually resulting in a series of unrelated test failures.
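
      When a kdump is produced, a quick first pass is to check the crashed kernel's dmesg for the powerpc stack overflow signature, so overflow crashes can be told apart from unrelated node failures. A small sketch follows; the vmcore-dmesg.txt file name and the "Kernel stack overflow" message are assumptions based on typical kdump output, so adjust them to whatever the kdump-gather artifacts actually contain.

      #!/usr/bin/env python3
      # Sketch: scan extracted kdump artifacts for kernel stack overflow
      # messages. File names and message text are assumptions; tweak as needed.
      import sys
      from pathlib import Path

      SIGNATURES = ("Kernel stack overflow", "stack overflow")

      def scan(root: Path):
          hits = []
          for dmesg in root.rglob("vmcore-dmesg.txt"):
              for lineno, line in enumerate(
                      dmesg.read_text(errors="replace").splitlines(), 1):
                  if any(sig in line for sig in SIGNATURES):
                      hits.append((dmesg, lineno, line.strip()))
          return hits

      if __name__ == "__main__":
          root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
          for path, lineno, line in scan(root):
              print(f"{path}:{lineno}: {line}")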
       

      Expected results:
      OpenShift e2e tests can run to completion without stack overflows on Power when deployed with OVNKube.
       

      Additional info:

      Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

      Affected Platforms:
      ppc64le using OVNKube running OCP 4.11+

      Issue Context:
      It is observable as an internal CI failure, but was also hit as a customer issue in case 03597973.
       

      If it is a CI failure:

      • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run

      The issue has been observed in CI only in OpenShift 4.14 and up, but was hit by a customer using OpenShift 4.12.
       
      If it is a customer / SD issue:
      There is a lot of history for this issue wrapped up in RHEL-3907. I will attempt to attach a kdump and must-gather for one of the jobs that hit the issue in CI.

      • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
      • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
      • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
