Fast Datapath Product / FDP-2560

Test Coverage: Investigate ways to reduce OVNKube stack usage on ppc64le

      ( ) The test coverage is aligned with the epic's acceptance criteria

    • rhel-net-ovs-dpdk

      This task is tracking the test case writing activities to cover the bug described below.

      Description of problem:
      Case 03597973 raised awareness that OVNKube on Power can overflow the kernel stack. The current target workaround is to ensure that RHEL 9.4 and above have bigger stacks so that this issue doesn't occur. It's still worth investigating whether there are other ways that OVNKube could reduce kernel stack usage in OpenShift 4.11 and above. This issue is present in both the 8.6 and 9.2 kernels, but the stack size fix cannot be backported to older versions, since changing the kernel stack size in the middle of a RHEL 'y-stream' would break kABI.

       

      Version-Release number of selected component (if applicable):
      OpenShift 4.11+. A workaround that increases the kernel stack size should relieve this pressure in versions 4.16 and up, but those would probably also benefit from any improvements that can be discovered.
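
      For test planning, the version ranges above can be captured in a small helper. This is only a sketch: the function name and the three labels are hypothetical, and the 4.11 and 4.16 boundaries are taken directly from this description rather than from any release tooling.

```python
from typing import Tuple

# Boundaries taken from the ticket text; not derived from release metadata.
FIRST_AFFECTED = (4, 11)           # issue reported for OpenShift 4.11+
FIRST_WITH_LARGER_STACK = (4, 16)  # stack-size workaround expected from 4.16


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '4.15' or '4.15.3'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def classify(version: str) -> str:
    """Label an OpenShift version for ppc64le stack-usage test coverage.

    The labels are hypothetical names used only in this sketch.
    """
    v = parse_major_minor(version)
    if v < FIRST_AFFECTED:
        return "out-of-scope"
    if v < FIRST_WITH_LARGER_STACK:
        return "affected"
    return "workaround-expected"


if __name__ == "__main__":
    for candidate in ("4.10", "4.12", "4.15.3", "4.16"):
        print(candidate, classify(candidate))
```

      Versions labelled "workaround-expected" would still be worth covering, since the description notes that they would probably also benefit from any reduction in stack usage.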

       

      How reproducible:
      The multi-arch team (#forum-ocp-multiarch) hits this issue with some frequency in regular e2e jobs running in CI in OpenShift 4.14 and above. shgokul can probably provide cluster specs, and manoj5 can coordinate testing / verification.
       

      Steps to Reproduce:
      The multi-arch team has historically seen this issue when monitoring the remote-libvirt-ppc64le CI jobs:
      https://prow.ci.openshift.org/?job=*ocp-e2e-ovn-remote-libvirt-ppc64le

      Our jobs are pretty stable these days, but when we do hit the crash, a kdump is produced in artifacts/ocp-e2e-ovn-remote-libvirt-ppc64le/ipi-conf-debug-kdump-gather-logs/artifacts/.

      Here is an example of a job run that hit this issue:
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ovn-remote-libvirt-ppc64le/1745475910083022848/artifacts/ocp-e2e-ovn-remote-libvirt-ppc64le/ipi-conf-debug-kdump-gather-logs/artifacts/
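
      When checking several runs for a kdump, the artifacts URL can be assembled from the job name and the run ID. The sketch below assumes that other runs follow the same path layout as the example above; the function and constant names are hypothetical.

```python
from urllib.parse import quote

# Base URL and path segments copied from the example run in this ticket.
GCSWEB_BASE = (
    "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com"
    "/gcs/test-platform-results/logs"
)
TEST_TARGET = "ocp-e2e-ovn-remote-libvirt-ppc64le"
KDUMP_STEP = "ipi-conf-debug-kdump-gather-logs"


def kdump_artifacts_url(job_name: str, build_id: str) -> str:
    """Build the kdump artifacts URL for one job run.

    Assumes every run uses the layout of the example run:
    <job>/<build id>/artifacts/<target>/<kdump step>/artifacts/
    """
    parts = [job_name, build_id, "artifacts", TEST_TARGET, KDUMP_STEP, "artifacts"]
    return GCSWEB_BASE + "/" + "/".join(quote(p, safe="") for p in parts) + "/"


if __name__ == "__main__":
    print(
        kdump_artifacts_url(
            "periodic-ci-openshift-multiarch-master-nightly-4.15-"
            "ocp-e2e-ovn-remote-libvirt-ppc64le",
            "1745475910083022848",
        )
    )
```

      Called with the job name and run ID of the example run, this reproduces the URL listed above.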
       

      Actual results:
      One of the compute node VMs crashes during test execution. If kdump is enabled, a core dump is produced and the test run proceeds as normal. If kdump is inactive, the node doesn't come back up after the crash, which usually results in a number of unrelated test failures.
       

      Expected results:
      OpenShift e2e tests can run to completion without stack overflows on Power when deployed with OVNKube.
       

      Additional info:

      Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

      Affected Platforms:
      ppc64le using OVNKube running OCP 4.11+

      Issue Context:
      It is observable as an internal CI failure, and was also hit as a customer issue in case 03597973.
       

      If it is a CI failure:

      • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run

      The issue has been observed in CI only in OpenShift 4.14 and up, but it was hit by a customer using OpenShift 4.12.
       
      If it is a customer / SD issue:
      There is a lot of history for this issue wrapped up in RHEL-3907. I will attempt to attach a kdump and must-gather for one of the jobs that hit the issue in CI.

      • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
      • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
      • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”

              ovsdpdk-triage ovsdpdk triage
              nstbot NST Bot