RHEL-17860

kdump crash logs are not getting generated on the CoreOS nodes of a cluster deployed on Power

    • Low
    • rhel-sst-arch-hw
    • ssg_platform_enablement

      Description of problem:

      Enabled kdump on the RHCOS nodes of a 4.15 cluster deployed on IBM Power.
      kdump crash log generation fails intermittently.
      
      Followed the Red Hat documentation (the link is for 4.14):
      https://docs.openshift.com/container-platform/4.14/support/troubleshooting/troubleshooting-operating-system-issues.html#enabling-kdump-day-one:~:text=Enabling-,kdump,-RHCOS%20ships%20with
      
      Triggered the crash using:
      echo c > /proc/sysrq-trigger
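
      Before writing to /proc/sysrq-trigger, it can help to confirm that a crash kernel is actually loaded at that moment. The following is a minimal sketch, not part of the original report. It relies on the kernel's `/sys/kernel/kexec_crash_loaded` flag and on a `crashkernel=` entry in `/proc/cmdline`; the function names `kdump_armed` and `has_crashkernel` are hypothetical helpers introduced for illustration.

```shell
#!/bin/sh
# Return 0 if the flag file (default: the kernel's sysfs flag) contains "1",
# meaning a crash kernel is currently loaded.
kdump_armed() {
    flag_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

# Return 0 if the given kernel command line carries a crashkernel= parameter.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read-only checks; nothing is triggered here.
if kdump_armed && has_crashkernel "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "kdump is armed; a sysrq crash should be captured"
else
    echo "kdump is NOT armed; a sysrq crash would not be captured"
fi
```

      Recording this result before each of the attempts would help separate "crash kernel was not loaded" from "crash kernel was loaded but the dump was not written".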
      
      Configuration used for kdump:
      
      KDUMP_COMMANDLINE_REMOVE="hugepages hugepagesz slub_debug quiet log_buf_len swiotlb hugetlb_cma ignition.firstboot rd.multipath=default"
      KDUMP_COMMANDLINE_APPEND="irqpoll maxcpus=1 noirqdistrib reset_devices cgroup_disable=memory numa=off udev.children-max=2 ehea.use_mcs=0 panic=10 kvm_cma_resv_ratio=0 transparent_hugepage=never novmcoredd hugetlb_cma=0 srcutree.big_cpu_lim=0"
      KEXEC_ARGS="--dt-no-old-root -s"
      KDUMP_IMG="vmlinuz"
      
      kdump verification on a cluster node:
      
      # cat /etc/sysconfig/kdump
      KDUMP_COMMANDLINE_REMOVE="hugepages hugepagesz slub_debug quiet log_buf_len swiotlb hugetlb_cma ignition.firstboot rd.multipath=default"
      KDUMP_COMMANDLINE_APPEND="irqpoll maxcpus=1 noirqdistrib reset_devices cgroup_disable=memory numa=off udev.children-max=2 ehea.use_mcs=0 panic=10 kvm_cma_resv_ratio=0 transparent_hugepage=never novmcoredd hugetlb_cma=0 srcutree.big_cpu_lim=0"
      KEXEC_ARGS="--dt-no-old-root -s"
      KDUMP_IMG="vmlinuz"
      
      
      # cat /etc/kdump.conf 
      path /var/crash 
      core_collector makedumpfile -l --message-level 7 -d 31
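
      For readers unfamiliar with the `core_collector` line: in makedumpfile, `-l` selects LZO compression and `-d` takes a bitmask of page types to leave out of the dump. The sketch below decodes that bitmask so that `-d 31` can be read at a glance. The function name `decode_dump_level` and the short labels are illustrative choices, not makedumpfile output.

```shell
#!/bin/sh
# Decode a makedumpfile dump level (-d) into the page types it excludes.
# Bit values follow the makedumpfile man page:
#   1 = zero pages, 2 = non-private cache pages, 4 = private cache pages,
#   8 = user data pages, 16 = free pages.
decode_dump_level() {
    level="$1"
    out=""
    for entry in "1:zero" "2:cache" "4:cache-private" "8:user" "16:free"; do
        bit="${entry%%:*}"
        name="${entry#*:}"
        if [ $(( level & bit )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
    done
    echo "${out:-none}"
}

decode_dump_level 31   # prints: zero cache cache-private user free
```

      With level 31 every excludable page type is dropped, which is the smallest dump makedumpfile can produce from this option.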
      
      
      OCP Version at Install Time: 4.15.0-ec.2
      RHCOS Version at Install Time: 4.15
      OCP Version after Upgrade (if applicable): NA
      RHCOS Version after Upgrade (if applicable): NA
      Architecture (x86_64, ppc64le, s390x, etc.): ppc64le

      Version-Release number of selected component (if applicable):

      4.15.0-ec.2

      How reproducible:

      6 out of 10 attempts failed

      Steps to Reproduce:

      1. Deploy an OCP 4.15 cluster on IBM Power
      2. Enable kdump
      3. Trigger the crash manually
      4. Check whether the crash logs are generated on the worker/master nodes of the cluster
      

      Actual results:

      Crash logs are not generated; no dump directory appears under /var/crash:
      
      # ls -lart /var/crash
      total 4
      drwxr-xr-x. 24 root root 4096 Nov 28 05:42 ..
      drwxr-xr-x.  3 root root   43 Nov 28 05:51 .

      Expected results:

      Crash logs should be generated whenever a crash occurs, as in this successful attempt:
      
      # ls -lart /var/crash
      total 4
      drwxr-xr-x. 24 root root 4096 Nov 28 05:42 ..
      drwxr-xr-x.  3 root root   43 Nov 28 05:51 .
      drwxr-xr-x.  2 root root   67 Nov 28 05:51 127.0.0.1-2023-11-28-05:51:50
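
      The difference between the two listings can be checked mechanically after each reboot. The sketch below lists the dump directories that contain a core file. It assumes the core is written as a file named `vmcore` inside a per-crash subdirectory of the `path` set in /etc/kdump.conf; the function name `list_vmcores` is a hypothetical helper.

```shell
#!/bin/sh
# Print every subdirectory of the crash path that contains a "vmcore" file.
# Assumption: kdump saves the core as <path>/<host>-<timestamp>/vmcore.
list_vmcores() {
    crash_dir="${1:-/var/crash}"
    for d in "$crash_dir"/*/; do
        # When the glob matches nothing it stays literal and is skipped here.
        [ -d "$d" ] || continue
        if [ -e "${d}vmcore" ]; then
            echo "${d%/}"
        fi
    done
}

list_vmcores /var/crash
```

      Running this after each triggered crash and comparing the output with the previous run gives a per-attempt pass/fail record for the reproduction rate.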

      Additional info:

      Crash logs should be generated every time a crash is triggered. On the affected node the kdump service reports that the crash kernel is loaded:
      
      
      [root@worker-2 core]# systemctl status kdump
      ● kdump.service - Crash recovery kernel arming
           Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; preset: disabled)
           Active: active (exited) since Wed 2023-11-29 09:20:53 UTC; 20h ago
         Main PID: 1254 (code=exited, status=0/SUCCESS)
              CPU: 1min 5.091s

      Nov 29 09:20:32 worker-2 dracut[1807]: Stored kernel commandline:
      Nov 29 09:20:32 worker-2 dracut[1807]: No dracut internal kernel commandline stored in the initramfs
      Nov 29 09:20:35 worker-2 dracut[1807]: *** Install squash loader ***
      Nov 29 09:20:36 worker-2 dracut[1807]: *** Squashing the files inside the initramfs ***
      Nov 29 09:20:51 worker-2 dracut[1807]: *** Squashing the files inside the initramfs done ***
      Nov 29 09:20:51 worker-2 dracut[1807]: *** Creating image file '/var/lib/kdump/initramfs-5.14.0-284.40.1.el9_2.ppc64lekdump.img' ***
      Nov 29 09:20:51 worker-2 dracut[1807]: *** Creating initramfs image file '/var/lib/kdump/initramfs-5.14.0-284.40.1.el9_2.ppc64lekdump.img' done ***
      Nov 29 09:20:53 worker-2 kdumpctl[1298]: kdump: kexec: loaded kdump kernel
      Nov 29 09:20:53 worker-2 kdumpctl[1298]: kdump: Starting kdump: [OK]
      Nov 29 09:20:53 worker-2 systemd[1]: Finished Crash recovery kernel arming.
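
      The status output shows that the crash kernel was loaded at boot, so the intermittent failures presumably happen later, in the capture kernel. One thing that may be worth recording alongside the logs (an assumption on our part, not a finding from this report) is how much memory is reserved for the crash kernel. The sketch below reads the kernel's `/sys/kernel/kexec_crash_size` value and converts it to MiB; `crash_reserved_mib` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the crash kernel reservation in MiB.
# The sysfs file reports the reserved size in bytes.
crash_reserved_mib() {
    size_file="${1:-/sys/kernel/kexec_crash_size}"
    [ -r "$size_file" ] || return 1
    bytes="$(cat "$size_file")"
    echo $(( bytes / 1024 / 1024 ))
}

if mib="$(crash_reserved_mib 2>/dev/null)"; then
    echo "crashkernel reservation: ${mib} MiB"
else
    echo "crashkernel reservation: unavailable"
fi
```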

       

      The logs are uploaded to the location below.

      Link for logs: https://drive.google.com/drive/folders/1ChpifZEv2J8gxK6AJFkDjWIfkt1bOicD?usp=sharing

              minamdar Mamatha Inamdar
              sbobade Swapnil Bobade
              IBM Confidential Group
              Mamatha Inamdar Mamatha Inamdar
              Brock Organ Brock Organ