Fast Datapath Product / FDP-2568

QE verification: Investigate ways to reduce OVNKube stack usage on ppc64le

      ( ) The bug has been reproduced and verified by QE members
      ( ) Test coverage has been added to downstream CI
      ( ) For new feature, failed test plans have bugs added as children to the epic
      ( ) The bug is cloned to any relevant release that we support and/or is needed

    • rhel-net-ovs-dpdk

      This ticket is tracking the QE verification effort for the solution to the problem described below.
      Description of problem:
      Case 03597973 raised awareness that OVNKube on Power can overflow the kernel stack. The current target workaround is to ensure that RHEL 9.4 and above ship with a larger kernel stack so that the overflow no longer occurs, but it is worth investigating whether there are other ways OVNKube could reduce kernel stack usage in OpenShift 4.11 and above. The issue is present in both the 8.6 and 9.2 kernels, but the stack size fix cannot be backported to older releases, since changing the kernel stack size in the middle of a RHEL y-stream would break kABI.
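
      One way QE could observe how close OVNKube workloads get to that limit is the kernel's built-in stack tracer. Below is a minimal sketch, assuming the node kernel was built with CONFIG_STACK_TRACER and that debugfs is mounted at /sys/kernel/debug (run as root, e.g. from a debug pod with host access); the polling interval is arbitrary.

      #!/usr/bin/env python3
      # Sketch: sample the kernel stack tracer to report the deepest kernel
      # stack observed on a node. Assumes CONFIG_STACK_TRACER=y and debugfs
      # mounted at /sys/kernel/debug; must run as root.
      import time
      from pathlib import Path

      TRACING = Path("/sys/kernel/debug/tracing")
      ENABLE = Path("/proc/sys/kernel/stack_tracer_enabled")

      def main(interval=30):
          ENABLE.write_text("1")   # turn the stack tracer on
          try:
              while True:
                  depth = int(TRACING.joinpath("stack_max_size").read_text())
                  print(f"deepest kernel stack so far: {depth} bytes")
                  # stack_trace lists the call chain that produced that depth
                  print(TRACING.joinpath("stack_trace").read_text())
                  time.sleep(interval)
          finally:
              ENABLE.write_text("0")   # leave the tracer off when done

      if __name__ == "__main__":
          main()

      Comparing the reported depth against the node's configured kernel stack size would show how much headroom OVNKube leaves on a given kernel.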

       

      Version-Release number of selected component (if applicable):
      OpenShift 4.11+. A workaround that increases the kernel stack size should relieve the pressure in OpenShift 4.16 and up, but those versions would probably also benefit from any stack usage improvements that can be discovered.

       

      How reproducible:
      The multi-arch team (#forum-ocp-multiarch) hits this issue with some frequency in regular e2e jobs running in CI in OpenShift 4.14 and above. shgokul can probably provide cluster specs, and manoj5 can coordinate testing / verification.
       

      Steps to Reproduce:
      The multi-arch team has historically seen this issue when monitoring the remote-libvirt-ppc64le CI jobs:
      https://prow.ci.openshift.org/?job=*ocp-e2e-ovn-remote-libvirt-ppc64le

      Our jobs are pretty stable these days, but when we do hit the crash, a kdump is produced in artifacts/ocp-e2e-ovn-remote-libvirt-ppc64le/ipi-conf-debug-kdump-gather-logs/artifacts/.

      Here is an example of a job run that hit this issue:
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ovn-remote-libvirt-ppc64le/1745475910083022848/artifacts/ocp-e2e-ovn-remote-libvirt-ppc64le/ipi-conf-debug-kdump-gather-logs/artifacts/
       

      Actual results:
      One of the compute node VMs crashes during test execution. If kdump is enabled, a core dump is produced and the test run proceeds as normal. If kdump is not enabled, the node doesn't come back up after the crash, usually resulting in a series of unrelated test failures.
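
      When a kdump is produced, a quick first pass is to check the crashed kernel's dmesg for the powerpc stack overflow signature, so overflow crashes can be told apart from unrelated node failures. A small sketch follows; the vmcore-dmesg.txt file name and the "Kernel stack overflow" message are assumptions based on typical kdump output, so adjust them to whatever the kdump-gather artifacts actually contain.

      #!/usr/bin/env python3
      # Sketch: scan extracted kdump artifacts for kernel stack overflow
      # messages. File names and message text are assumptions; tweak as needed.
      import sys
      from pathlib import Path

      SIGNATURES = ("Kernel stack overflow", "stack overflow")

      def scan(root: Path):
          hits = []
          for dmesg in root.rglob("vmcore-dmesg.txt"):
              for lineno, line in enumerate(
                      dmesg.read_text(errors="replace").splitlines(), 1):
                  if any(sig in line for sig in SIGNATURES):
                      hits.append((dmesg, lineno, line.strip()))
          return hits

      if __name__ == "__main__":
          root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
          for path, lineno, line in scan(root):
              print(f"{path}:{lineno}: {line}")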
       

      Expected results:
      OpenShift e2e tests can run to completion without stack overflows on Power when deployed with OVNKube.
       

      Additional info:

      Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

      Affected Platforms:
      ppc64le using OVNKube running OCP 4.11+

      Issue Context:
      It is observable as an internal CI failure, but was also hit as a customer issue in case 03597973.
       

      If it is a CI failure:

      • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run

      The issue has been observed in CI only in OpenShift 4.14 and up, but was hit by a customer using OpenShift 4.12.
       
      If it is a customer / SD issue:
      There is a lot of history for this issue wrapped up in RHEL-3907. I will attempt to attach a kdump and must-gather for one of the jobs that hit the issue in CI.

      • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
      • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
      • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
