OCPBUGS-59140 (OpenShift Bugs)

kernel dump test fails on realtime kernel with UEFI booting


      Description of problem:

      Kdump fails to complete: after a kernel panic is triggered, the system hangs and never reboots.

      On OpenShift Container Platform nodes running the RHCOS real-time kernel 5.14.0-427.72.1.el9_4.x86_64+rt, triggering a kernel panic with echo c > /proc/sysrq-trigger causes the node to hang, so no vmcore is generated. The issue occurs consistently when a PerformanceProfile that configures CPU isolation (e.g., isolcpus, nohz_full, rcu_nocbs) is applied to the node.

      Version-Release number of selected component (if applicable):

      OCP 4.16.0 nightly build
      
      kexec-tools 2.0.27

       

      How reproducible:

      Easy to reproduce

      Steps to Reproduce:

      1. Install OCP as a single-node OpenShift (SNO) cluster.
      2. Apply a PerformanceProfile with the CPU spec below and realTimeKernel enabled: true
              isolated: "2-55,58-111"  reserved: "0,1,56,57"
      3. Configure kdump with crashkernel=512M.
      4. Trigger a kernel panic (echo c > /proc/sysrq-trigger).
      5. Monitor the server console and verify that it freezes/hangs and the node does not reboot. Manual intervention is needed to power-cycle the server.
          

      Actual results:

          The system hangs/freezes and does not reboot.

      Expected results:

          The system should reboot after the kernel panic, and a kdump vmcore should be generated under the configured path /var/crash.
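To check the expected result after a reboot, a small helper along these lines could list any dump files under the crash directory. This is a minimal sketch, not part of the original report: the function name `find_vmcores` is hypothetical, and the `vmcore*` pattern is an assumption based on the usual kdump file names; adjust it if the dump target is configured differently.

```python
from pathlib import Path


def find_vmcores(crash_root="/var/crash"):
    """Return all files whose name starts with 'vmcore' below crash_root.

    Assumption: kdump stores dumps in per-crash subdirectories of crash_root,
    so the search is recursive. Returns an empty list if the directory
    does not exist.
    """
    root = Path(crash_root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("vmcore*") if p.is_file())


if __name__ == "__main__":
    dumps = find_vmcores()
    if dumps:
        for path in dumps:
            print(f"{path}  ({path.stat().st_size} bytes)")
    else:
        print("no vmcore found under /var/crash")
```

An empty result after the panic and power cycle would be consistent with the failure described here: the capture kernel never wrote a dump.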

      Additional info:

      kind: PerformanceProfile
      apiVersion: "performance.openshift.io/v2"
      metadata:
        name: sno-perf-profile
        annotations:
          kubeletconfig.experimental: |
            {"allowedUnsafeSysctls":["net.ipv4.tcp_tw_reuse"]}
      spec:
        cpu:
          isolated: "2-55,58-111"
          reserved: "0,1,56,57"
        hugepages:
          pages:
            - size: "1G"
              count: 52
              node: 0
            - size: "1G"
              count: 52
              node: 1
        numa:
          topologyPolicy: restricted
        realTimeKernel:
          enabled: true
        workloadHints:
          highPowerConsumption: false
          perPodPowerManagement: false
          realTime: true
        nodeSelector:
          node-role.kubernetes.io/master: ""
        machineConfigPoolSelector:
          pools.operator.machineconfiguration.openshift.io/master: ""
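As a sanity check on the profile's CPU partitioning, the isolated and reserved sets can be expanded and compared. The sketch below is illustrative only; `parse_cpu_list` and `check_partition` are hypothetical helper names, and the check assumes CPU ids are numbered contiguously from 0.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as "2-55,58-111" into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def check_partition(isolated, reserved):
    """Summarise how the isolated and reserved CPU lists divide the CPUs."""
    iso = parse_cpu_list(isolated)
    res = parse_cpu_list(reserved)
    all_cpus = iso | res
    return {
        "isolated_count": len(iso),
        "reserved_count": len(res),
        # CPUs listed in both sets (should be empty)
        "overlap": sorted(iso & res),
        # CPUs between 0 and the highest listed id that are in neither set
        "gaps": sorted(set(range(max(all_cpus) + 1)) - all_cpus),
    }


if __name__ == "__main__":
    print(check_partition("2-55,58-111", "0,1,56,57"))
```

For the values in this profile, the two sets do not overlap and together cover CPUs 0 through 111, leaving 4 reserved CPUs for housekeeping against 108 isolated ones.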
      
      =====================
      sh-5.1# uname -r
      5.14.0-427.72.1.el9_4.x86_64+rt
      
      =========================
      sh-5.1# cat /proc/cmdline 
      BOOT_IMAGE=(hd0,gpt3)/boot/ostree/rhcos-dd7825b0c917bfcf0ccfc9d9cd41f7ae951accb7a206a56030c6c6bb02975df3/vmlinuz-5.14.0-427.72.1.el9_4.x86_64+rt ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/dd7825b0c917bfcf0ccfc9d9cd41f7ae951accb7a206a56030c6c6bb02975df3/0 root=UUID=affe3e2a-4fa1-4603-9523-40718b718026 rw rootflags=prjquota boot=UUID=46ef05dc-841c-4df5-9e9c-3f2d51c226e3 intel_iommu=on iommu=pt skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on rcu_nocbs=2-55,58-111 tuned.non_isolcpus=03000000,00000003 systemd.cpu_affinity=0,1,56,57 intel_iommu=on iommu=pt isolcpus=managed_irq,2-55,58-111 nohz_full=2-55,58-111 tsc=reliable nosoftlockup nmi_watchdog=0 mce=off skew_tick=1 rcutree.kthread_prio=11 intel_pstate=active systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 crashkernel=1024M
      sh-5.1# 
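The command line above is long enough that a quick parse helps when reviewing it. The following sketch (helper names are hypothetical) lists parameters that appear more than once, extracts the crashkernel value, and decodes the tuned.non_isolcpus mask, assuming that mask uses the kernel's comma-separated 32-bit hex cpumask layout. The sample string is an abridged copy of the output above. Note that this command line carries crashkernel=1024M, while step 3 and the kdumpctl estimate output mention 512M, so it may be worth confirming which reservation was in effect when the panic was triggered.

```python
from collections import Counter

# Abridged copy of the /proc/cmdline output from this report.
SAMPLE_CMDLINE = (
    "intel_iommu=on iommu=pt skew_tick=1 tsc=reliable "
    "rcupdate.rcu_normal_after_boot=1 nohz=on rcu_nocbs=2-55,58-111 "
    "tuned.non_isolcpus=03000000,00000003 systemd.cpu_affinity=0,1,56,57 "
    "intel_iommu=on iommu=pt isolcpus=managed_irq,2-55,58-111 "
    "nohz_full=2-55,58-111 tsc=reliable nosoftlockup nmi_watchdog=0 mce=off "
    "skew_tick=1 rcutree.kthread_prio=11 intel_pstate=active "
    "systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 crashkernel=1024M"
)


def parse_cmdline(cmdline):
    """Split a kernel command line into (key, value) pairs; bare flags get None."""
    pairs = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        pairs.append((key, value if sep else None))
    return pairs


def repeated_params(pairs):
    """Return the sorted names of parameters that occur more than once."""
    counts = Counter(key for key, _ in pairs)
    return sorted(key for key, n in counts.items() if n > 1)


def last_value(pairs, key):
    """Return the value of the last occurrence of key, or None if absent."""
    value = None
    for k, v in pairs:
        if k == key:
            value = v
    return value


def decode_cpu_mask(mask):
    """Decode a hex CPU mask (32-bit groups, most significant first) into CPU ids."""
    value = int(mask.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


if __name__ == "__main__":
    pairs = parse_cmdline(SAMPLE_CMDLINE)
    print("repeated:", repeated_params(pairs))
    print("crashkernel:", last_value(pairs, "crashkernel"))
    print("non-isolated CPUs:", decode_cpu_mask(last_value(pairs, "tuned.non_isolcpus")))
```

Under the stated assumption about the mask layout, the decoded non-isolated CPUs are 0, 1, 56 and 57, which matches the reserved set in the PerformanceProfile.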
      
      sh-5.1# kdumpctl estimate
      Reserved crashkernel:    512M
      Recommended crashkernel: 512M
      Kernel image size:   53M
      Kernel modules size: 23M
      Initramfs size:      68M
      Runtime reservation: 64M
      Large modules:
          xfs: 2543616
          mlx5_core: 2486272
          ext4: 1191936
          ice: 1241088
          kvm: 1351680
      sh-5.1#
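The estimate output can also be compared programmatically. This is a rough sketch with hypothetical helper names; it assumes the field labels and size suffixes shown in the output above and ignores the indented per-module lines.

```python
import re

SIZE_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

# Excerpt of the kdumpctl estimate output from this report.
SAMPLE_ESTIMATE = """\
Reserved crashkernel:    512M
Recommended crashkernel: 512M
Kernel image size:   53M
Kernel modules size: 23M
Initramfs size:      68M
Runtime reservation: 64M
Large modules:
    xfs: 2543616
    mlx5_core: 2486272
"""


def parse_size(text):
    """Convert a size such as '512M' (or a plain byte count) to bytes."""
    match = re.fullmatch(r"(\d+)([KMG])?", text.strip())
    if not match:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(number) * SIZE_UNITS.get(unit, 1)


def parse_estimate(output):
    """Return {label: bytes} for the top-level 'label: size' lines."""
    fields = {}
    for line in output.splitlines():
        if line[:1].isspace():
            continue  # skip the indented per-module entries
        label, sep, value = line.partition(":")
        value = value.strip()
        if sep and re.fullmatch(r"\d+[KMG]?", value):
            fields[label.strip()] = parse_size(value)
    return fields


def reservation_covers_recommendation(fields):
    """True if the reserved crashkernel is at least the recommended size."""
    return fields["Reserved crashkernel"] >= fields["Recommended crashkernel"]


if __name__ == "__main__":
    fields = parse_estimate(SAMPLE_ESTIMATE)
    print("reservation sufficient:", reservation_covers_recommendation(fields))
```

On this output the reserved size equals the recommended size, so the estimate by itself does not point to an undersized reservation as the cause of the hang.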

       

              piliu@redhat.com Pingfan Liu
              rh-ee-rchiluve Rajesh Chiluveru
              Liquan Cui Liquan Cui