-
Bug
-
Resolution: Duplicate
-
Undefined
-
None
-
4.13
-
Critical
-
No
-
Rejected
-
False
-
-
Description of problem:
With 4.13-rc.2 (9.2 and RT kernel), an SNO with DU profiles reboots frequently in steady state when PTP profiles are applied. We previously saw occasional server freezes, but we could not collect a crash dump when they happened, so we enabled the following settings by default. As a result, the symptom has changed: instead of freezes, we now see server reboots with a core dump generated.
kernel.sysrq=1
kernel.panic_on_rcu_stall=1
kernel.hung_task_panic=1
Version-Release number of selected component (if applicable):
4.13-rc.2
How reproducible:
Frequent - more than 10 reboots in 24 hrs. About 5 of them were triggered by our automated tests, but most happened on their own in steady state. We have also seen multiple "still running" entries in the reboot history (see the end timestamp column); sometimes they are cleared by a follow-up reboot, and sometimes they keep piling up.

[core@helix49 ~]$ last reboot
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 19:00   still running
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 18:07   still running
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 17:33 - 18:04  (00:30)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 17:16 - 18:04  (00:48)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 16:47 - 18:04  (01:17)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 14:57 - 18:04  (03:07)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 14:39 - 18:04  (03:25)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 14:07 - 18:04  (03:56)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 12:20 - 18:04  (05:43)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 11:41 - 18:04  (06:23)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 11:16 - 18:04  (06:48)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 10:43 - 18:04  (07:21)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 10:02 - 18:04  (08:02)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 09:25 - 18:04  (08:38)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 07:17 - 18:04  (10:47)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 06:01 - 18:04  (12:02)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 04:14 - 18:04  (13:50)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 03:45 - 04:11  (00:25)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 02:44 - 04:11  (01:26)
reboot   system boot  5.14.0-284.4.1.r Tue Apr  4 01:08 - 04:11  (03:02)
reboot   system boot  5.14.0-284.4.1.r Mon Apr  3 19:34 - 04:11  (08:36)
reboot   system boot  5.14.0-284.4.1.r Mon Apr  3 19:09 - 19:31  (00:21)
reboot   system boot  5.14.0-284.4.1.e Mon Apr  3 18:23 - 19:06  (00:43)
reboot   system boot  5.14.0-284.4.1.e Mon Apr  3 18:18 - 18:20  (00:02)

wtmp begins Mon Apr  3 18:18:47 2023

[core@helix49 ~]$ ls -la /var/crash/
total 8
drwxr-xr-x. 20 root root 4096 Apr  4 18:56 .
drwxr-xr-x. 24 root root 4096 Apr  3 19:34 ..
drwxr-xr-x.  2 root root   67 Apr  4 01:05 127.0.0.1-2023-04-04-01:05:31
drwxr-xr-x.  2 root root   67 Apr  4 02:41 127.0.0.1-2023-04-04-02:40:54
drwxr-xr-x.  2 root root   67 Apr  4 03:43 127.0.0.1-2023-04-04-03:42:35
drwxr-xr-x.  2 root root   67 Apr  4 05:58 127.0.0.1-2023-04-04-05:58:27
drwxr-xr-x.  2 root root   67 Apr  4 07:14 127.0.0.1-2023-04-04-07:13:44
drwxr-xr-x.  2 root root   67 Apr  4 09:22 127.0.0.1-2023-04-04-09:22:26
drwxr-xr-x.  2 root root   67 Apr  4 09:59 127.0.0.1-2023-04-04-09:58:44
drwxr-xr-x.  2 root root   67 Apr  4 10:40 127.0.0.1-2023-04-04-10:39:44
drwxr-xr-x.  2 root root   67 Apr  4 11:13 127.0.0.1-2023-04-04-11:12:47
drwxr-xr-x.  2 root root   67 Apr  4 11:38 127.0.0.1-2023-04-04-11:38:25
drwxr-xr-x.  2 root root   67 Apr  4 12:18 127.0.0.1-2023-04-04-12:17:37
drwxr-xr-x.  2 root root   67 Apr  4 14:04 127.0.0.1-2023-04-04-14:04:27
drwxr-xr-x.  2 root root   67 Apr  4 14:36 127.0.0.1-2023-04-04-14:36:20
drwxr-xr-x.  2 root root   67 Apr  4 14:54 127.0.0.1-2023-04-04-14:54:20
drwxr-xr-x.  2 root root   67 Apr  4 16:44 127.0.0.1-2023-04-04-16:44:00
drwxr-xr-x.  2 root root   67 Apr  4 17:13 127.0.0.1-2023-04-04-17:13:12
drwxr-xr-x.  2 root root   67 Apr  4 17:30 127.0.0.1-2023-04-04-17:30:32
drwxr-xr-x.  2 root root   67 Apr  4 18:57 127.0.0.1-2023-04-04-18:56:41
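
To put a number on "frequent", the crash directory names listed under /var/crash/ can be turned into timestamps and the gaps between consecutive dumps measured. The following is only a rough sketch: the helper names (parse_crash_time, crash_intervals) are made up for illustration, and it assumes every directory name ends in a fixed-width YYYY-MM-DD-HH:MM:SS stamp, as the ones in this report do.

```python
from datetime import datetime, timedelta

# Directory names copied from the `ls -la /var/crash/` listing in this report.
CRASH_DIRS = [
    "127.0.0.1-2023-04-04-01:05:31", "127.0.0.1-2023-04-04-02:40:54",
    "127.0.0.1-2023-04-04-03:42:35", "127.0.0.1-2023-04-04-05:58:27",
    "127.0.0.1-2023-04-04-07:13:44", "127.0.0.1-2023-04-04-09:22:26",
    "127.0.0.1-2023-04-04-09:58:44", "127.0.0.1-2023-04-04-10:39:44",
    "127.0.0.1-2023-04-04-11:12:47", "127.0.0.1-2023-04-04-11:38:25",
    "127.0.0.1-2023-04-04-12:17:37", "127.0.0.1-2023-04-04-14:04:27",
    "127.0.0.1-2023-04-04-14:36:20", "127.0.0.1-2023-04-04-14:54:20",
    "127.0.0.1-2023-04-04-16:44:00", "127.0.0.1-2023-04-04-17:13:12",
    "127.0.0.1-2023-04-04-17:30:32", "127.0.0.1-2023-04-04-18:56:41",
]

STAMP_FORMAT = "%Y-%m-%d-%H:%M:%S"
STAMP_LENGTH = 19  # len("2023-04-04-01:05:31")


def parse_crash_time(dirname: str) -> datetime:
    """Parse the trailing timestamp of a crash directory name (assumed layout)."""
    return datetime.strptime(dirname[-STAMP_LENGTH:], STAMP_FORMAT)


def crash_intervals(dirnames):
    """Return the time gaps between consecutive crash dumps, oldest first."""
    times = sorted(parse_crash_time(d) for d in dirnames)
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    gaps = crash_intervals(CRASH_DIRS)
    print(f"crash dumps: {len(CRASH_DIRS)}")
    print(f"shortest gap: {min(gaps)}")
    print(f"longest gap:  {max(gaps)}")
    print(f"mean gap:     {sum(gaps, timedelta()) / len(gaps)}")
```

The same functions could be pointed at a live node by passing in the names returned by os.listdir("/var/crash") instead of the hard-coded list.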
Steps to Reproduce:
1. Install OCP 4.13-rc.2 on Single Node OpenShift with DU profiles applied. Specifically, on this server:
   - 4 PTP profiles are configured: 1 boundary clock (E810 NIC) and 3 downstream slaves (CX-6 NICs).
   - The following settings are applied via Tuned, so that a crash dump is generated automatically on an RCU stall or hung task:
     kernel.sysrq=1
     kernel.panic_on_rcu_stall=1
     kernel.hung_task_panic=1
2. Let the system run overnight.
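
Before leaving the system to run, it may be worth confirming that the three sysctl settings actually took effect on the node. The sketch below is one possible way to do that, not part of the original reproduction: it reads the values from /proc/sys, and the function name check_sysctls and its proc_sys parameter are hypothetical.

```python
from pathlib import Path

# Settings named in this report; each is expected to be set to 1.
EXPECTED = {
    "kernel.sysrq": "1",
    "kernel.panic_on_rcu_stall": "1",
    "kernel.hung_task_panic": "1",
}


def check_sysctls(expected=EXPECTED, proc_sys="/proc/sys"):
    """Return {name: (expected, actual)} for every setting that does not match.

    A sysctl name such as kernel.sysrq maps to the file kernel/sysrq under
    proc_sys. A missing or unreadable file is reported with actual=None.
    """
    mismatches = {}
    for name, want in expected.items():
        path = Path(proc_sys) / name.replace(".", "/")
        try:
            actual = path.read_text().strip()
        except OSError:
            actual = None
        if actual != want:
            mismatches[name] = (want, actual)
    return mismatches


if __name__ == "__main__":
    bad = check_sysctls()
    if not bad:
        print("all crash-dump sysctls are set as expected")
    for name, (want, actual) in bad.items():
        print(f"{name}: expected {want}, found {actual}")
```

An empty result means all three values match; anything else suggests the Tuned profile was not applied or was overridden.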
Actual results:
The system rebooted on its own, and core dumps were generated.
Expected results:
The system remains stable.
Additional info: