  1. OpenShift Bugs
  2. OCPBUGS-11392

[RHELPLAN-154083] OCP 4.13 with RHEL 9.2 RT kernel - Frequent server reboot and crash dump on SNO when ptp is enabled


Details

    • Critical
    • No
    • Rejected
    • False

    Description

      Description of problem:

      With 4.13-rc.2 (RHEL 9.2 with the RT kernel), an SNO with DU profiles reboots frequently in steady state when PTP profiles are applied.
      
      Note that we previously saw occasional server freezes. Because we could not collect a crash dump when a freeze happened, we enabled the following settings by default. As a result, instead of freezes, the new symptom is a server reboot with a crash dump generated.
      
      kernel.sysrq=1
      kernel.panic_on_rcu_stall=1
      kernel.hung_task_panic=1 
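
To confirm that the three settings above are actually in effect on a node, something like the following could be used. This is only a sketch: it assumes the standard mapping of a sysctl name to a file under /proc/sys, and the function name `check_sysctls` and its parameters are hypothetical helpers introduced here, not part of any existing tool.

```python
from pathlib import Path

# Settings listed in this report; values are compared as strings.
EXPECTED = {
    "kernel.sysrq": "1",
    "kernel.panic_on_rcu_stall": "1",
    "kernel.hung_task_panic": "1",
}


def check_sysctls(expected=EXPECTED, proc_sys="/proc/sys"):
    """Return {name: actual_value_or_None} for every setting that differs.

    A value of None means the sysctl file could not be read
    (for example, the kernel does not expose it).
    """
    mismatches = {}
    for name, want in expected.items():
        # kernel.sysrq -> /proc/sys/kernel/sysrq
        path = Path(proc_sys) / name.replace(".", "/")
        try:
            actual = path.read_text().strip()
        except OSError:
            actual = None
        if actual != want:
            mismatches[name] = actual
    return mismatches


if __name__ == "__main__":
    bad = check_sysctls()
    if bad:
        for name, actual in bad.items():
            print(f"{name}: expected {EXPECTED[name]}, found {actual}")
    else:
        print("all expected sysctls are set")
```

An empty result means all three settings match; otherwise the returned dictionary names the settings that Tuned did not apply.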

      Version-Release number of selected component (if applicable):

      4.13-rc.2

      How reproducible:

      Frequent: more than 10 reboots in 24 hours. About 5 of them were triggered by our automated tests, but most happened on their own in steady state.
      
      We have also seen multiple "still running" reboot entries (see the end timestamp column). Sometimes they are cleared by a follow-up reboot, and sometimes they keep piling up.
      
      [core@helix49 ~]$ last reboot
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 19:00   still running
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 18:07   still running
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 17:33 - 18:04  (00:30)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 17:16 - 18:04  (00:48)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 16:47 - 18:04  (01:17)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 14:57 - 18:04  (03:07)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 14:39 - 18:04  (03:25)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 14:07 - 18:04  (03:56)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 12:20 - 18:04  (05:43)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 11:41 - 18:04  (06:23)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 11:16 - 18:04  (06:48)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 10:43 - 18:04  (07:21)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 10:02 - 18:04  (08:02)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 09:25 - 18:04  (08:38)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 07:17 - 18:04  (10:47)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 06:01 - 18:04  (12:02)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 04:14 - 18:04  (13:50)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 03:45 - 04:11  (00:25)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 02:44 - 04:11  (01:26)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 01:08 - 04:11  (03:02)
      reboot   system boot  5.14.0-284.4.1.r Mon Apr  3 19:34 - 04:11  (08:36)
      reboot   system boot  5.14.0-284.4.1.r Mon Apr  3 19:09 - 19:31  (00:21)
      reboot   system boot  5.14.0-284.4.1.e Mon Apr  3 18:23 - 19:06  (00:43)
      reboot   system boot  5.14.0-284.4.1.e Mon Apr  3 18:18 - 18:20  (00:02)
      
      wtmp begins Mon Apr  3 18:18:47 2023
      [core@helix49 ~]$ ls -la /var/crash/
      total 8
      drwxr-xr-x. 20 root root 4096 Apr  4 18:56 .
      drwxr-xr-x. 24 root root 4096 Apr  3 19:34 ..
      drwxr-xr-x.  2 root root   67 Apr  4 01:05 127.0.0.1-2023-04-04-01:05:31
      drwxr-xr-x.  2 root root   67 Apr  4 02:41 127.0.0.1-2023-04-04-02:40:54
      drwxr-xr-x.  2 root root   67 Apr  4 03:43 127.0.0.1-2023-04-04-03:42:35
      drwxr-xr-x.  2 root root   67 Apr  4 05:58 127.0.0.1-2023-04-04-05:58:27
      drwxr-xr-x.  2 root root   67 Apr  4 07:14 127.0.0.1-2023-04-04-07:13:44
      drwxr-xr-x.  2 root root   67 Apr  4 09:22 127.0.0.1-2023-04-04-09:22:26
      drwxr-xr-x.  2 root root   67 Apr  4 09:59 127.0.0.1-2023-04-04-09:58:44
      drwxr-xr-x.  2 root root   67 Apr  4 10:40 127.0.0.1-2023-04-04-10:39:44
      drwxr-xr-x.  2 root root   67 Apr  4 11:13 127.0.0.1-2023-04-04-11:12:47
      drwxr-xr-x.  2 root root   67 Apr  4 11:38 127.0.0.1-2023-04-04-11:38:25
      drwxr-xr-x.  2 root root   67 Apr  4 12:18 127.0.0.1-2023-04-04-12:17:37
      drwxr-xr-x.  2 root root   67 Apr  4 14:04 127.0.0.1-2023-04-04-14:04:27
      drwxr-xr-x.  2 root root   67 Apr  4 14:36 127.0.0.1-2023-04-04-14:36:20
      drwxr-xr-x.  2 root root   67 Apr  4 14:54 127.0.0.1-2023-04-04-14:54:20
      drwxr-xr-x.  2 root root   67 Apr  4 16:44 127.0.0.1-2023-04-04-16:44:00
      drwxr-xr-x.  2 root root   67 Apr  4 17:13 127.0.0.1-2023-04-04-17:13:12
      drwxr-xr-x.  2 root root   67 Apr  4 17:30 127.0.0.1-2023-04-04-17:30:32
      drwxr-xr-x.  2 root root   67 Apr  4 18:57 127.0.0.1-2023-04-04-18:56:41
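
In the two listings above, each crash directory timestamp falls a few minutes before a boot entry, so the two outputs can be correlated to separate crash-driven reboots from other reboots (for example, those triggered by automated tests). The following is a rough sketch of that correlation. The function names, the 10-minute matching window, and passing the year explicitly (the `last` rows carry no year) are assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta

# One `last reboot` row: kernel, weekday, "Mon DD HH:MM", then the end column.
BOOT_RE = re.compile(
    r"reboot\s+system boot\s+\S+\s+"
    r"\w{3} (?P<start>\w{3}\s+\d+ \d{2}:\d{2})\s+"
    r"(?P<end>still running|- \d{2}:\d{2})"
)
# kdump directory name as shown in /var/crash: <ip>-YYYY-MM-DD-HH:MM:SS
CRASH_RE = re.compile(r"\d+\.\d+\.\d+\.\d+-(\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2})\s*$")


def parse_boots(last_output, year):
    """Return sorted (boot_time, still_running) tuples from `last reboot` text."""
    boots = []
    for line in last_output.splitlines():
        m = BOOT_RE.search(line)
        if not m:
            continue
        stamp = " ".join(m["start"].split())  # collapse "Apr  4" to "Apr 4"
        start = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M")
        boots.append((start, m["end"] == "still running"))
    return sorted(boots)


def parse_crashes(ls_output):
    """Return sorted crash timestamps from `ls -la /var/crash/` text."""
    crashes = []
    for line in ls_output.splitlines():
        m = CRASH_RE.search(line)
        if m:
            crashes.append(datetime.strptime(m[1], "%Y-%m-%d-%H:%M:%S"))
    return sorted(crashes)


def pair_crashes_with_boots(crashes, boots, window=timedelta(minutes=10)):
    """Pair each crash with the first boot inside `window` after it, or None."""
    pairs = []
    for crash in crashes:
        following = [b for b, _ in boots if crash <= b <= crash + window]
        pairs.append((crash, following[0] if following else None))
    return pairs


def boots_without_crash(crashes, boots, window=timedelta(minutes=10)):
    """Return boot times that no crash dump precedes within `window`."""
    matched = {b for _, b in pair_crashes_with_boots(crashes, boots, window) if b}
    return [b for b, _ in boots if b not in matched]
```

Feeding the saved output of both commands to `parse_boots` and `parse_crashes`, then calling `boots_without_crash`, gives the boots that were not preceded by a crash dump; counting the `still_running` flags gives the number of stale wtmp entries.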
      

      Steps to Reproduce:

      1. Install OCP 4.13-rc.2 on Single Node OpenShift with DU profiles applied. Specifically, on this server:
      - 4 PTP profiles are configured: 1 boundary clock (E810 NIC) and 3 downstream slaves (CX-6 NICs).
      - The following settings are applied via Tuned, so that crash dumps are generated automatically on an RCU stall:
      kernel.sysrq=1
      kernel.panic_on_rcu_stall=1
      kernel.hung_task_panic=1
      
      2. Let the system run overnight.
      

      Actual results:

      The system rebooted on its own, and crash dumps were generated.

      Expected results:

      system is stable

      Additional info:

       


          People

            Unassigned
            Yang Liu (rhn-support-yliu1)
            Michael Nguyen