RHEL-40146

stalld core dumping on 64K page size kernel with 128 CPUs

    • Bug
    • Resolution: Unresolved
    • Undefined
    • rhel-9.2.0
    • stalld
    • Important
    • sst_kernel_rts
    • ssg_core_kernel
    • 5
    • False
    • aarch64

      What were you trying to do that didn't work?

      Install the OCP 4.15.15 release (which uses the RHEL 9.2 kernel) on an ARM server and use the 64K page size kernel.

      Please provide the package NVR for which the bug is seen:

      stalld-1.17.1-1.el9_1.aarch64

      How reproducible:

      Always

      Steps to reproduce

      1. Install OCP 4.15.15 on an ARM server
      2. Switch to the 64K page size kernel
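
Before looking at stalld itself, it can help to confirm that step 2 actually took effect. The following is a minimal sketch, assuming a shell on the affected node; it only uses `getconf` and `uname`, and the variable names are illustrative.

```shell
# Page size the running kernel exposes to user space, in bytes.
page_size=$(getconf PAGESIZE)
# Kernel release string; the report's 64K kernel release ends in "+64k".
kernel_release=$(uname -r)

echo "kernel release: ${kernel_release}"
echo "page size: ${page_size} bytes"

# 65536 bytes means the 64K page size kernel is running.
case "$page_size" in
    65536) echo "64K page size kernel is active" ;;
    *)     echo "64K page size kernel is not active" ;;
esac
```

On the system in this report the release string is `5.14.0-284.67.1.el9_2.aarch64+64k`, so the first branch would be expected there.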

      Expected results

      stalld functions correctly

      Actual results

      stalld core dumps with SIGSEGV shortly after every start, and systemd keeps restarting it. For example:

      May 30 12:12:35 cnfdg37 systemd[1]: Starting Stall Monitor...
      May 30 12:12:35 cnfdg37 stalld[10159]: lockdown mode is off
      May 30 12:12:35 cnfdg37 systemd[1]: Started Stall Monitor.
      May 30 12:12:35 cnfdg37 stalld[10159]: /sys/kernel/debug/sched/features exists
      May 30 12:12:35 cnfdg37 stalld[10159]: /sys/kernel/debug/sched/debug exists
      May 30 12:12:35 cnfdg37 stalld[10159]: boosted pid 0 (undef) using SCHED_DEADLINE
      May 30 12:12:35 cnfdg37 stalld[10159]: using SCHED_DEADLINE for boosting
      May 30 12:12:35 cnfdg37 stalld[10159]: initial config_buffer_size set to 14417920
      May 30 12:12:35 cnfdg37 stalld[10159]: detected new task format
      May 30 12:12:35 cnfdg37 stalld[10159]: single threaded mode
      May 30 12:12:37 cnfdg37 systemd-coredump[10267]: [🡕] Process 10159 (stalld) of user 0 dumped core.
      May 30 12:12:37 cnfdg37 systemd[1]: stalld.service: Main process exited, code=dumped, status=11/SEGV
      May 30 12:12:37 cnfdg37 systemd[1]: stalld.service: Failed with result 'core-dump'.
      May 30 12:12:37 cnfdg37 systemd[1]: stalld.service: Scheduled restart job, restart counter is at 1.
      May 30 12:12:37 cnfdg37 systemd[1]: Stopped Stall Monitor.
      May 30 12:12:37 cnfdg37 stalld[10287]: lockdown mode is off
      May 30 12:12:37 cnfdg37 systemd[1]: Starting Stall Monitor...
      May 30 12:12:37 cnfdg37 stalld[10287]: /sys/kernel/debug/sched/features exists
      May 30 12:12:37 cnfdg37 systemd[1]: Started Stall Monitor.
      May 30 12:12:37 cnfdg37 stalld[10287]: /sys/kernel/debug/sched/debug exists
      May 30 12:12:37 cnfdg37 stalld[10287]: boosted pid 0 (undef) using SCHED_DEADLINE
      May 30 12:12:37 cnfdg37 stalld[10287]: using SCHED_DEADLINE for boosting
      May 30 12:12:37 cnfdg37 stalld[10287]: initial config_buffer_size set to 14417920
      May 30 12:12:37 cnfdg37 stalld[10287]: detected new task format
      May 30 12:12:37 cnfdg37 stalld[10287]: single threaded mode
      May 30 12:12:37 cnfdg37 systemd-coredump[10289]: [🡕] Process 10287 (stalld) of user 0 dumped core.
      May 30 12:12:37 cnfdg37 systemd[1]: stalld.service: Main process exited, code=dumped, status=11/SEGV
      May 30 12:12:37 cnfdg37 systemd[1]: stalld.service: Failed with result 'core-dump'.
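
One detail in the log that may be worth noting is the `initial config_buffer_size` value. The sketch below is plain shell arithmetic on that logged number; it assumes nothing about how stalld computes the value and does not by itself identify the cause of the crash.

```shell
# Value copied from the "initial config_buffer_size set to ..." log line.
buffer_size=14417920
# 64K page size of the kernel in this report, in bytes.
page_size=65536

# Whole pages covered by the buffer, and any leftover bytes.
pages=$(( buffer_size / page_size ))
remainder=$(( buffer_size % page_size ))

echo "pages: ${pages}, remainder: ${remainder}"
```

The logged size is an exact multiple of 64 KiB (220 pages, no remainder), which is consistent with a buffer sized in units of the page size. Whether that sizing is related to the segfault would need to be confirmed against the stalld source.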

      Here is the core dump info (full core dump attached):

      [core@cnfdg37 ~]$ sudo coredumpctl info 86371
                 PID: 86371 (stalld)
                 UID: 0 (root)
                 GID: 0 (root)
              Signal: 11 (SEGV)
           Timestamp: Wed 2024-06-05 14:07:24 UTC (5min ago)
        Command Line: /usr/bin/stalld --systemd -p 1000000000 -r 20000 -d 3 -t 20 --foreground --pidfile /run/stalld.pid
          Executable: /usr/bin/stalld
       Control Group: /system.slice/stalld.service
                Unit: stalld.service
               Slice: system.slice
             Boot ID: 4fd9097c2ea54270b2d0c718e11fc3f7
          Machine ID: 2600e9ed13c34ffcbfb802a8edab6275
            Hostname: cnfdg37
             Storage: /var/lib/systemd/coredump/core.stalld.0.4fd9097c2ea54270b2d0c718e11fc3f7.86371.1717596444000000.zst (present)
        Size on Disk: 137.7K
             Message: Process 86371 (stalld) of user 0 dumped core.

                      Stack trace of thread 86371:
                      #0  0x0000aaaacbe254a8 get_cpu_busy_list (stalld + 0x54a8)
                      #1  0x0000aaaacbe237c4 main (stalld + 0x37c4)
                      #2  0x0000ffff93d4c79c __libc_start_call_main (libc.so.6 + 0x2c79c)
                      #3  0x0000ffff93d4c86c __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2c86c)
                      #4  0x0000aaaacbe23c70 _start (stalld + 0x3c70)
                      ELF object binary architecture: AARCH64
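
The stack trace gives module-relative offsets for the frames inside stalld, which can be mapped to source lines when symbols are available. The following is a rough sketch: the frames are copied from the output above, and the `addr2line` step assumes binutils and debug symbols for `/usr/bin/stalld` are present on the host, which may not be the case on an RHCOS node.

```shell
# Frames inside the stalld binary, copied from the coredumpctl output.
frames='#0  0x0000aaaacbe254a8 get_cpu_busy_list (stalld + 0x54a8)
#1  0x0000aaaacbe237c4 main (stalld + 0x37c4)
#4  0x0000aaaacbe23c70 _start (stalld + 0x3c70)'

# Keep only the offsets relative to the start of the stalld binary.
offsets=$(printf '%s\n' "$frames" | sed -n 's/.*(stalld + \(0x[0-9a-f]*\)).*/\1/p')
echo "$offsets"

# Optional: resolve the offsets to function names and source lines.
# Assumption: addr2line is installed and the binary has debug symbols;
# without them addr2line prints "??" placeholders.
if command -v addr2line >/dev/null 2>&1 && [ -r /usr/bin/stalld ]; then
    # $offsets is intentionally unquoted so each offset is a separate argument.
    addr2line -f -e /usr/bin/stalld $offsets
fi
```

On a host with gdb and the matching debuginfo installed, `coredumpctl debug 86371` opens the same core in the debugger, where the faulting access in `get_cpu_busy_list` can be inspected directly.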
      

      System information:

      [core@cnfdg37 ~]$ uname -a
      Linux cnfdg37 5.14.0-284.67.1.el9_2.aarch64+64k #1 SMP PREEMPT_DYNAMIC Mon May 13 15:24:28 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
      [core@cnfdg37 ~]$ cat /proc/cmdline
      BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-4e20df033d8f1de063c19b1faf96533e75b4703c42c8ec06b92d6db6436f2004/vmlinuz-5.14.0-284.67.1.el9_2.aarch64+64k ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/4e20df033d8f1de063c19b1faf96533e75b4703c42c8ec06b92d6db6436f2004/0 root=UUID=bc8ec181-a943-4a79-86c2-86da64ad24e8 rw rootflags=prjquota boot=UUID=2430289c-848b-4a1c-8a24-805015977c9d skew_tick=1 rcupdate.rcu_normal_after_boot=1 nohz=on rcu_nocbs=2-127 tuned.non_isolcpus=00000003 systemd.cpu_affinity=0,1 isolcpus=managed_irq,2-127 nohz_full=2-127 nosoftlockup nmi_watchdog=0 mce=off skew_tick=1 rcutree.kthread_prio=11 rcupdate.rcu_normal_after_boot=0 efi=runtime module_blacklist=irdma vfio_pci.disable_idle_d3=1 vfio_pci.enable_sriov=1 iommu.passthrough=1 default_hugepagesz=512M hugepagesz=512M hugepages=32 systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1 intel_iommu=on iommu=pt
      

            Clark Williams (williams@redhat.com)
            Bart Wensley (bwensley@redhat.com)
            Clark Williams
            Chang Yin