OCPBUGS-11392

[RHELPLAN-154083] OCP 4.13 with RHEL 9.2 RT kernel - Frequent server reboot and crash dump on SNO when ptp is enabled


    • Critical
    • No
    • Rejected
    • False

      Description of problem:

      With 4.13 rc.2 (RHEL 9.2 with the RT kernel), an SNO with DU profiles reboots frequently in steady state when PTP profiles are applied.
      
      Note that we used to see the server freeze once in a while, but because we could not collect a crash dump when that happened, we enabled the following settings by default. So instead of freezes, the new symptom is a server reboot with a core dump generated.
      
      kernel.sysrq=1
      kernel.panic_on_rcu_stall=1
      kernel.hung_task_panic=1 
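
Before relying on crash dumps as the symptom, it can help to confirm that the three settings above are actually in effect on the node. The following is a minimal sketch, not part of the original report: it assumes the standard Linux mapping of a dotted sysctl name to a file under /proc/sys, and the helper names `sysctl_path` and `check_sysctls` are hypothetical.

```python
from pathlib import Path

# Settings quoted in this report; values are compared as strings.
EXPECTED = {
    "kernel.sysrq": "1",
    "kernel.panic_on_rcu_stall": "1",
    "kernel.hung_task_panic": "1",
}


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under /proc/sys (assumed layout)."""
    return Path(root) / name.replace(".", "/")


def check_sysctls(expected=EXPECTED, root: str = "/proc/sys") -> dict:
    """Return {name: actual value or None} for every setting that differs."""
    mismatches = {}
    for name, want in expected.items():
        try:
            got = sysctl_path(name, root).read_text().strip()
        except OSError:
            got = None  # knob missing or unreadable on this kernel
        if got != want:
            mismatches[name] = got
    return mismatches


if __name__ == "__main__":
    # Prints nothing when every setting matches the expected value.
    for name, got in check_sysctls().items():
        print(f"{name}: expected {EXPECTED[name]}, found {got}")
```

An empty result means all three settings match; any entry in the returned dictionary points at a setting that Tuned did not apply or that the running kernel does not expose.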

      Version-Release number of selected component (if applicable):

      4.13-rc.2

      How reproducible:

      Frequent: more than 10 reboots in 24 hours. About 5 of them were triggered by our automated tests, but most happened on their own in steady state.
      
      We have also seen multiple reboot entries marked "still running" (see the end timestamp). Sometimes they are cleared by a follow-up reboot, and sometimes they keep piling up.
      
      [core@helix49 ~]$ last reboot
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 19:00   still running
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 18:07   still running
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 17:33 - 18:04  (00:30)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 17:16 - 18:04  (00:48)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 16:47 - 18:04  (01:17)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 14:57 - 18:04  (03:07)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 14:39 - 18:04  (03:25)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 14:07 - 18:04  (03:56)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 12:20 - 18:04  (05:43)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 11:41 - 18:04  (06:23)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 11:16 - 18:04  (06:48)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 10:43 - 18:04  (07:21)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 10:02 - 18:04  (08:02)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 09:25 - 18:04  (08:38)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 07:17 - 18:04  (10:47)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 06:01 - 18:04  (12:02)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 04:14 - 18:04  (13:50)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 03:45 - 04:11  (00:25)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 02:44 - 04:11  (01:26)
      reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 01:08 - 04:11  (03:02)
      reboot   system boot  5.14.0-284.4.1.r Mon Apr  3 19:34 - 04:11  (08:36)
      reboot   system boot  5.14.0-284.4.1.r Mon Apr  3 19:09 - 19:31  (00:21)
      reboot   system boot  5.14.0-284.4.1.e Mon Apr  3 18:23 - 19:06  (00:43)
      reboot   system boot  5.14.0-284.4.1.e Mon Apr  3 18:18 - 18:20  (00:02)
      
      wtmp begins Mon Apr  3 18:18:47 2023
      [core@helix49 ~]$ ls -la /var/crash/
      total 8
      drwxr-xr-x. 20 root root 4096 Apr  4 18:56 .
      drwxr-xr-x. 24 root root 4096 Apr  3 19:34 ..
      drwxr-xr-x.  2 root root   67 Apr  4 01:05 127.0.0.1-2023-04-04-01:05:31
      drwxr-xr-x.  2 root root   67 Apr  4 02:41 127.0.0.1-2023-04-04-02:40:54
      drwxr-xr-x.  2 root root   67 Apr  4 03:43 127.0.0.1-2023-04-04-03:42:35
      drwxr-xr-x.  2 root root   67 Apr  4 05:58 127.0.0.1-2023-04-04-05:58:27
      drwxr-xr-x.  2 root root   67 Apr  4 07:14 127.0.0.1-2023-04-04-07:13:44
      drwxr-xr-x.  2 root root   67 Apr  4 09:22 127.0.0.1-2023-04-04-09:22:26
      drwxr-xr-x.  2 root root   67 Apr  4 09:59 127.0.0.1-2023-04-04-09:58:44
      drwxr-xr-x.  2 root root   67 Apr  4 10:40 127.0.0.1-2023-04-04-10:39:44
      drwxr-xr-x.  2 root root   67 Apr  4 11:13 127.0.0.1-2023-04-04-11:12:47
      drwxr-xr-x.  2 root root   67 Apr  4 11:38 127.0.0.1-2023-04-04-11:38:25
      drwxr-xr-x.  2 root root   67 Apr  4 12:18 127.0.0.1-2023-04-04-12:17:37
      drwxr-xr-x.  2 root root   67 Apr  4 14:04 127.0.0.1-2023-04-04-14:04:27
      drwxr-xr-x.  2 root root   67 Apr  4 14:36 127.0.0.1-2023-04-04-14:36:20
      drwxr-xr-x.  2 root root   67 Apr  4 14:54 127.0.0.1-2023-04-04-14:54:20
      drwxr-xr-x.  2 root root   67 Apr  4 16:44 127.0.0.1-2023-04-04-16:44:00
      drwxr-xr-x.  2 root root   67 Apr  4 17:13 127.0.0.1-2023-04-04-17:13:12
      drwxr-xr-x.  2 root root   67 Apr  4 17:30 127.0.0.1-2023-04-04-17:30:32
      drwxr-xr-x.  2 root root   67 Apr  4 18:57 127.0.0.1-2023-04-04-18:56:41
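
To quantify how often the node crashes, the crash directory names listed above can be turned into timestamps and the gaps between consecutive dumps computed. This is only a sketch under the assumption that every directory name ends in a `YYYY-MM-DD-HH:MM:SS` stamp, as in the listing; the function names are hypothetical, and the sample uses the first three entries from the listing.

```python
from datetime import datetime, timedelta

STAMP_LEN = len("2023-04-04-01:05:31")  # trailing timestamp width (19 chars)


def parse_crash_dir(name: str) -> datetime:
    """Extract the trailing timestamp from a crash directory name."""
    return datetime.strptime(name[-STAMP_LEN:], "%Y-%m-%d-%H:%M:%S")


def crash_gaps(names) -> list:
    """Return the time between consecutive crash dumps, oldest first."""
    times = sorted(parse_crash_dir(n) for n in names)
    return [later - earlier for earlier, later in zip(times, times[1:])]


def mean_gap(names) -> timedelta:
    """Average time between crash dumps; needs at least two entries."""
    gaps = crash_gaps(names)
    return sum(gaps, timedelta()) / len(gaps)


if __name__ == "__main__":
    sample = [
        "127.0.0.1-2023-04-04-01:05:31",
        "127.0.0.1-2023-04-04-02:40:54",
        "127.0.0.1-2023-04-04-03:42:35",
    ]
    for gap in crash_gaps(sample):
        print(gap)
    print("mean:", mean_gap(sample))
```

On a live node the names could be taken from `os.listdir("/var/crash")` instead of the hard-coded sample, and the resulting times compared against the boot times reported by `last reboot`.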
      

      Steps to Reproduce:

      1. Install OCP 4.13-rc.2 on Single Node OpenShift with DU profiles applied. On this server specifically:
      - 4 PTP profiles are configured: 1 boundary clock (E810 NIC) and 3 downstream slaves (CX-6 NICs).
      - The following settings are applied via Tuned so that crash dumps are generated automatically on an RCU stall:
      kernel.sysrq=1 
      kernel.panic_on_rcu_stall=1 
      kernel.hung_task_panic=1 
      
      2. Let the system run overnight.
      

      Actual results:

      The system reboots on its own, and core dumps are generated.

      Expected results:

      The system remains stable, with no unexpected reboots.

      Additional info:

       

              Unassigned Unassigned
              rhn-support-yliu1 Yang Liu
              Michael Nguyen Michael Nguyen