OCPBUGS-20237

[SNO] API unreachable after kdump test

Details

    • Important
    • No
    • OCPNODE Sprint 246 (Blue)
    • 1
    • False

    Description

      Description of problem:

      We ran a test to validate that a kdump can be generated. After triggering a kdump on an SNO cluster with a Telco profile, the node rebooted (as expected), but after the reboot none of the OCP pods could start.
      
      
      

      Version-Release number of selected component (if applicable):

      4.12.37

      How reproducible:

      It only happened once. We will try to reproduce it.

      Steps to Reproduce:

      1. echo 1 > /proc/sys/kernel/sysrq
      2. echo "c" > /proc/sysrq-trigger
      3. Wait for node restart
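
The steps above can be sketched as a small script. This is only an illustration: the names `run_step`, `trigger_kdump_crash` and the `DRY_RUN` variable are hypothetical and not part of the original test. It defaults to printing the commands, because running them for real (as root) crashes the node immediately.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps. DRY_RUN is a hypothetical safety switch:
# it defaults to "yes", so commands are printed but not executed.

run_step() {
    # Print the command; execute it only when DRY_RUN=no.
    echo "+ $*"
    if [ "${DRY_RUN:-yes}" = "no" ]; then
        sh -c "$*"
    fi
}

trigger_kdump_crash() {
    # Step 1: enable all SysRq functions.
    run_step 'echo 1 > /proc/sys/kernel/sysrq'
    # Step 2: force a kernel crash through SysRq.
    run_step 'echo c > /proc/sysrq-trigger'
    # Step 3: on a real run the node restarts here; nothing after step 2 executes.
}
```

Calling `trigger_kdump_crash` prints the two commands. Running it as root with `DRY_RUN=no` executes them.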
      

      Actual results:

      After the reboot, oc commands fail because the API is unavailable, and no OCP pods are present.

      Expected results:

      After the reboot, OCP recovers successfully.
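
A test could check this expectation by polling the API after the reboot. The following is a minimal sketch, not the check used in the original test. The helper name `wait_until` is hypothetical, and the timeout and interval in the usage note are placeholder values.

```shell
# wait_until TIMEOUT INTERVAL CMD...
# Retry CMD every INTERVAL seconds until it succeeds (exit 0) or
# TIMEOUT seconds have passed (exit 1).
wait_until() {
    local timeout="$1"
    local interval="$2"
    shift 2
    local deadline=$(( $(date +%s) + timeout ))
    until "$@" >/dev/null 2>&1; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}
```

For example, `wait_until 1800 10 oc get --raw /readyz` would wait up to 30 minutes for the API server to report ready, assuming `oc` is logged in to the cluster.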

      Additional info:

      System impact: It seems that a race condition during kdump generation broke the system. I did an extra reboot, but the OCP pods still cannot start.
      
      
      CRIO service status:
      
      [core@cloudransno-site1 ~]$ sudo systemctl status crio.service 
      ● crio.service - Container Runtime Interface for OCI (CRI-O)
         Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/crio.service.d
                 └─01-kubens.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf, 20-nodenet.conf, 90-container-mount-namespace.conf
         Active: active (running) since Mon 2023-10-09 08:07:58 UTC; 20min ago
           Docs: https://github.com/cri-o/cri-o
       Main PID: 4986 (crio)
          Tasks: 9
         Memory: 43.0M
            CPU: 341ms
         CGroup: /system.slice/crio.service
                 └─4986 /usr/bin/crio

      Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.060677274Z" level=info msg="Conmon does support the --sync option"
      Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.060695173Z" level=info msg="Conmon does support the --log-global-size-max option"
      Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.066464266Z" level=info msg="Conmon does support the --sync option"
      Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.066480683Z" level=info msg="Conmon does support the --log-global-size-max option"
      Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.072012525Z" level=info msg="Conmon does support the --sync option"
      Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.072028811Z" level=info msg="Conmon does support the --log-global-size-max option"
      Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.192167568Z" level=info msg="Found CNI network multus-cni-network (type=multus) at /etc/kubernetes/cni/net.d/00-multus.conf"
      Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.192188565Z" level=info msg="Updated default CNI network name to multus-cni-network"
      Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.250251124Z" level=warning msg="Error encountered when checking whether cri-o should wipe containers: open /var/run/crio/version: no such file or directory"
      Oct 09 08:07:58 cloudransno-site1 bash[4986]: time="2023-10-09 08:07:58.251933354Z" level=info msg="Serving metrics on :9537 via HTTP"
      
      
      However, the file is present on the machine:
      
      [core@cloudransno-site1 ~]$ cat /var/run/crio/version
      "1.25.4-4.rhaos4.12.gitb9319a2.el8+unknown"
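
To compare the logged warning with the current state of the file, a check along these lines could be used. It is a sketch only, and the function names `has_wipe_warning` and `check_version_file` are hypothetical. One possible reading, not confirmed in this report, is that the file did not exist yet when CRI-O performed the check and was written afterwards.

```shell
# has_wipe_warning: read log text on stdin; exit 0 if it contains the
# CRI-O warning about the wipe check seen in the status output above.
has_wipe_warning() {
    grep -q 'checking whether cri-o should wipe containers'
}

# check_version_file [PATH]: report whether the version file is readable now.
# PATH defaults to the location named in the warning.
check_version_file() {
    local path="${1:-/var/run/crio/version}"
    if [ -r "$path" ]; then
        echo "present: $(cat "$path")"
    else
        echo "missing: $path"
    fi
}
```

On the node, `sudo journalctl -u crio -b --no-pager | has_wipe_warning && check_version_file` would show whether the warning was logged in the current boot and whether the file exists at the time of the check.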
      
      

          People

            Unassigned
            Rodrigo Lopez Manrique (rlopezma@redhat.com)
            Sunil Choudhary